The situation
Acme has 60 ECS services across three clusters, all running on Fargate. The security team’s recent audit turned up four classes of problem:
- Secrets baked into images. Three services have database passwords in ENV lines in their Dockerfiles. The credentials are in the image’s layers, in the build logs, in ECR, and, worst, rotating them means rebuilding and redeploying.
- Secrets in task definitions as environment variables. About fifteen services pass API keys via the task definition’s environment array. These appear in plain text in ecs:DescribeTaskDefinition responses, in CloudFormation templates, and in any IaC state file.
- Task roles with over-broad permissions. The auditor picked three services at random; each task role had * on an action/resource pair that shouldn’t have had it. One had iam:*, inherited from a long-gone template.
- No distinction between the task role and the task execution role. Multiple services had merged the two into one role. The execution role was being used by the application code (because it had broad enough permissions), which meant the audit trail didn’t distinguish “Fargate agent pulling the image” from “my application calling S3.”
The consolidation target: AWS credentials come from the task role, never from code or environment. Non-AWS secrets (database passwords, API keys) come from Secrets Manager or Parameter Store, injected at container start, never stored in images or task definitions. Task role and execution role are two separate roles with different purposes.
What actually matters
The first question is who is the credential for? There are two distinct callers in any container runtime: the platform itself, pulling images and emitting logs and resolving secrets at task start; and the application, calling other cloud APIs at runtime. Conflating them into one identity destroys the audit trail because every API call attributes to “the task” rather than to a named caller. Two roles per task, one for the platform’s setup work, one for the application’s runtime work, is the boundary that lets an auditor distinguish “the agent pulled an image” from “my code called storage”.
The second is where do credentials live? Anything in the image is in the image’s layers, in the build log, in the registry, and in every cache that ever pulled it; rotation means a rebuild. Anything in the task definition is visible to every caller of the describe API, in version-control templates, and in IaC state files; rotation means a redeploy. The only place a credential value should live is a secret store that injects at task start and exposes the value only to the running container. The decision about where the value lives is the first-class one; everything else is mechanics.
The third is what does rotation look like? Some credentials rotate often by design (database passwords, OAuth refresh tokens, third-party API keys with policy-driven cadence); others rarely change (internal service identifiers, configuration values that are sensitive but stable). Rotation needs its own automation path, triggered, audited, and observable, and the secret store has to make rotation cheap. Where rotation doesn’t exist natively, the application either picks up the new value on next start or has to re-read on a schedule; long-running tasks need to be designed for that, not surprised by it.
The fourth is how fine-grained is the application’s identity? A role with * on actions or resources is what the auditor finds; a role scoped to specific resource ARNs and specific actions is what the auditor accepts. The default has to be one identity per service, with a policy that names exactly the resources the service touches. Anything broader is a design failure, not an oversight.
The fifth is what audit signal does the secret store emit? Every secret read should produce an entry: who, what, when. Whether that’s the cloud’s audit trail or an external secret store’s log doesn’t matter much; what matters is that “did anyone read this credential” is answerable in seconds, not weeks. A secret store with no audit log is barely better than an environment variable.
The sixth is what’s the cross-account story? Application identities are still cloud-platform IAM roles. Cross-account access to a bucket, table, or queue means the resource policy and the identity policy have to agree, and any conditions (organisation ID, source VPC) attach to the role like they would to any other principal. The credentials story has to extend cleanly to “this service, in this account, talking to that resource over there”, not just the single-account case.
What we’ll filter on
- Credentials out of code: does the application never see a literal credential?
- Credentials out of task definitions: is the task definition free of secret values?
- Rotation supported: can the secret rotate without a redeploy?
- Two-role separation: are the execution role and the task role distinct?
- Least-privilege scoping: is the role’s policy specific enough to pass audit?
- Audit trail: does CloudTrail show who accessed the secret, and when?
The ECS credentials landscape
- Hard-coded in image. Credentials in the Dockerfile, in source, or in a committed config file. Nothing good about this; every item above fails.
- Environment variables in task definition (plain). containerDefinitions[].environment. Visible in DescribeTaskDefinition, CloudFormation, state files. The rotation story is “redeploy.” Nothing else good.
- Secrets Manager via task definition secrets. containerDefinitions[].secrets with valueFrom pointing at a secret ARN. Resolved at task start by the execution role. Application sees an environment variable; the value is never in the task definition.
- SSM Parameter Store via task definition secrets. Same mechanism, different backing store. For secrets without rotation needs.
- Task role + direct SDK call. Application code calls secretsmanager:GetSecretValue at startup using the task role’s credentials. Gives the application more control (e.g. reading multiple secrets, refreshing periodically), at the cost of adding code. For services that already have startup-config logic, a good fit.
- Application-managed external secret store (Vault, etc.). Outside the AWS-native story; legitimate when the org already has HashiCorp Vault or similar. Vault’s AWS auth method (IAM type) uses the task role to obtain a Vault token.
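The “task role + direct SDK call” option mentions refreshing a secret periodically. A minimal sketch of that idea follows, assuming a time-based cache in front of whatever fetch function the application uses. The class name RefreshingSecret and the helper secrets_manager_fetcher are hypothetical names for illustration; only the boto3 call get_secret_value(SecretId=...) is a real API.

```python
import time
from typing import Callable, Optional


class RefreshingSecret:
    """Caches a secret value and re-fetches it once the TTL has elapsed."""

    def __init__(self, fetch: Callable[[], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch          # any zero-argument function returning the value
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the cache can be tested without sleeping
        self._value: Optional[str] = None
        self._fetched_at = 0.0

    def get(self) -> str:
        now = self._clock()
        if self._value is None or now - self._fetched_at >= self._ttl:
            self._value = self._fetch()
            self._fetched_at = now
        return self._value


def secrets_manager_fetcher(secret_id: str) -> Callable[[], str]:
    """Hypothetical helper: builds a fetch function backed by Secrets Manager."""
    import boto3  # imported lazily so the cache itself has no AWS dependency

    client = boto3.client("secretsmanager")  # uses the task role's credentials
    return lambda: client.get_secret_value(SecretId=secret_id)["SecretString"]
```

In an application this might be wired up as RefreshingSecret(secrets_manager_fetcher("acme/payments/db"), ttl_seconds=300), with the TTL chosen to be shorter than the window in which the old credential remains valid after rotation.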
Side by side
| Option | Out of code | Out of task def | Rotation | Two roles | Least-priv | Audit trail |
|---|---|---|---|---|---|---|
| Hard-coded in image | ✗ | n/a | ✗ | n/a | n/a | ✗ |
| env vars (plain) | ✓ | ✗ | Redeploy only | n/a | n/a | ✗ |
| secrets via Secrets Manager | ✓ | ✓ | ✓ | ✓ | ✓ | CloudTrail |
| secrets via Parameter Store | ✓ | ✓ | Manual | ✓ | ✓ | CloudTrail |
| Task role + SDK | ✓ | ✓ | ✓ (on refresh) | ✓ | ✓ | CloudTrail |
| External store (Vault) | ✓ | ✓ | ✓ | ✓ | ✓ | Vault audit log |
For Acme’s mixed fleet, the answer is a combination: task role for AWS credentials, secrets field with Secrets Manager for rotating secrets, secrets with Parameter Store for non-rotating sensitive config.
How the two roles and secrets interact at task start
The task definition in depth
The relevant fields in a task definition JSON:
{
"family": "payments-service",
"taskRoleArn": "arn:aws:iam::111122223333:role/ecs/payments-service-task",
"executionRoleArn": "arn:aws:iam::111122223333:role/ecs/payments-service-execution",
"networkMode": "awsvpc",
"requiresCompatibilities": ["FARGATE"],
"cpu": "1024",
"memory": "2048",
"containerDefinitions": [
{
"name": "app",
"image": "111122223333.dkr.ecr.eu-west-1.amazonaws.com/payments:v147",
"essential": true,
"portMappings": [ { "containerPort": 8080, "protocol": "tcp" } ],
"environment": [
{ "name": "AWS_REGION", "value": "eu-west-1" },
{ "name": "SERVICE_NAME", "value": "payments" },
{ "name": "LOG_LEVEL", "value": "info" }
],
"secrets": [
{
"name": "DB_PASSWORD",
"valueFrom": "arn:aws:secretsmanager:eu-west-1:111122223333:secret:acme/payments/db:password::"
},
{
"name": "STRIPE_API_KEY",
"valueFrom": "arn:aws:secretsmanager:eu-west-1:111122223333:secret:acme/stripe/api-key"
},
{
"name": "FEATURE_FLAG_TOKEN",
"valueFrom": "arn:aws:ssm:eu-west-1:111122223333:parameter/acme/payments/flag-token"
}
],
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/ecs/payments-service",
"awslogs-region": "eu-west-1",
"awslogs-stream-prefix": "app"
}
}
}
]
}
Three patterns matter. taskRoleArn and executionRoleArn are distinct. environment carries only non-sensitive configuration. secrets takes a name (the env var the container will see) and a valueFrom pointing at a Secrets Manager secret, a specific JSON field of a Secrets Manager secret (:password::), or an SSM parameter. The execution role needs secretsmanager:GetSecretValue on each referenced secret, ssm:GetParameters on each referenced parameter, and kms:Decrypt on the KMS key when the values are encrypted with a customer managed key. Missing one of these is a common oversight that causes cryptic task-start failures.
For the JSON-field-extraction syntax (arn:aws:secretsmanager:...:secret:acme/payments/db:password::), ECS parses the secret as JSON and extracts the password field. This is the pattern for storing multi-field secrets (a DB credential with username, password, and host) as a single Secrets Manager secret and injecting each field as a separate environment variable.
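To make the valueFrom format concrete, here is a small sketch that splits a Secrets Manager reference into its parts and applies the same field extraction to a secret string. It is an illustration of the syntax, not ECS’s actual implementation; SecretRef, parse_value_from, and resolve are hypothetical names, and the sketch assumes full ARNs rather than bare secret names.

```python
import json
from typing import NamedTuple, Optional


class SecretRef(NamedTuple):
    secret_arn: str
    json_key: Optional[str]
    version_stage: Optional[str]
    version_id: Optional[str]


def parse_value_from(value_from: str) -> SecretRef:
    """Split arn:...:secret:name[:json-key:version-stage:version-id]."""
    parts = value_from.split(":")
    if not 7 <= len(parts) <= 10 or parts[2] != "secretsmanager":
        raise ValueError(f"not a Secrets Manager reference: {value_from}")
    secret_arn = ":".join(parts[:7])           # the first seven fields form the ARN
    extras = parts[7:] + [""] * 3              # pad so the optional fields always exist
    json_key, stage, version_id = (e or None for e in extras[:3])
    return SecretRef(secret_arn, json_key, stage, version_id)


def resolve(ref: SecretRef, secret_string: str) -> str:
    """Return the value the container would see for this reference."""
    if ref.json_key is None:
        return secret_string                   # whole secret injected as-is
    return str(json.loads(secret_string)[ref.json_key])


ref = parse_value_from(
    "arn:aws:secretsmanager:eu-west-1:111122223333:secret:acme/payments/db:password::"
)
print(ref.json_key)  # password
```

The two trailing colons in the example are the empty version-stage and version-id fields, which is why the reference ends in :password:: rather than :password.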
The task role in depth
A task role’s IAM policy is what separates “a service that works” from “a service that passes audit.” The working shape:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "ReadWritePaymentsBucket",
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:DeleteObject"
],
"Resource": "arn:aws:s3:::acme-payments-prod/*"
},
{
"Sid": "ListPaymentsBucket",
"Effect": "Allow",
"Action": "s3:ListBucket",
"Resource": "arn:aws:s3:::acme-payments-prod"
},
{
"Sid": "PaymentsTableAccess",
"Effect": "Allow",
"Action": [
"dynamodb:GetItem",
"dynamodb:PutItem",
"dynamodb:UpdateItem",
"dynamodb:Query"
],
"Resource": [
"arn:aws:dynamodb:eu-west-1:111122223333:table/payments",
"arn:aws:dynamodb:eu-west-1:111122223333:table/payments/index/*"
]
},
{
"Sid": "NotificationsQueue",
"Effect": "Allow",
"Action": "sqs:SendMessage",
"Resource": "arn:aws:sqs:eu-west-1:111122223333:notifications"
}
]
}
Specific bucket, specific table and its indexes, specific queue. No * anywhere except as the trailing segment of the object-key and index ARNs. Actions listed explicitly. This is the policy the auditor wants to see.
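A check like the auditor’s can be approximated mechanically. The sketch below, under the assumption that a trailing wildcard inside a specific ARN is acceptable while a bare * resource or any wildcard action is not, lists the offending statements in a policy document. find_wildcards is a hypothetical helper, not an AWS tool.

```python
from typing import Any, Dict, List


def _as_list(value: Any) -> List[str]:
    # IAM allows Action and Resource to be a single string or a list of strings.
    return [value] if isinstance(value, str) else list(value)


def find_wildcards(policy: Dict[str, Any]) -> List[str]:
    """Return one finding per wildcard action or bare '*' resource in Allow statements."""
    findings: List[str] = []
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):
        statements = [statements]
    for index, stmt in enumerate(statements):
        if stmt.get("Effect") != "Allow":
            continue
        label = stmt.get("Sid", f"statement {index}")
        for action in _as_list(stmt.get("Action", [])):
            if "*" in action:
                findings.append(f"{label}: wildcard action {action}")
        for resource in _as_list(stmt.get("Resource", [])):
            if resource == "*":
                findings.append(f"{label}: wildcard resource")
    return findings


legacy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Sid": "Legacy", "Effect": "Allow", "Action": "iam:*", "Resource": "*"}
    ],
}
print(find_wildcards(legacy))
```

A policy shaped like the payments example produces no findings, because its only wildcards sit at the end of specific bucket and index ARNs.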
The trust policy on the role:
{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Principal": { "Service": "ecs-tasks.amazonaws.com" },
"Action": "sts:AssumeRole",
"Condition": {
"ArnLike": { "aws:SourceArn": "arn:aws:ecs:eu-west-1:111122223333:*" },
"StringEquals": { "aws:SourceAccount": "111122223333" }
}
}]
}
The aws:SourceArn / aws:SourceAccount conditions are the “confused deputy” protection that AWS has been pushing: they ensure the role can only be assumed when called from an ECS task in a specific account, not by an ECS task in a different account that someone else has configured to point at this role.
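The presence of these conditions can also be checked across a fleet of roles. The following is a rough sketch, assuming the conditions are written exactly as in the trust policy above (a StringEquals on aws:SourceAccount and an ArnLike on aws:SourceArn, each holding a single string). has_confused_deputy_guard is a hypothetical helper and does not cover every way IAM lets these conditions be expressed.

```python
from typing import Any, Dict, List

ECS_TASKS_PRINCIPAL = "ecs-tasks.amazonaws.com"


def _services(statement: Dict[str, Any]) -> List[str]:
    principal = statement.get("Principal", {})
    service = principal.get("Service", []) if isinstance(principal, dict) else []
    return [service] if isinstance(service, str) else list(service)


def has_confused_deputy_guard(trust_policy: Dict[str, Any], account_id: str) -> bool:
    """True if every ECS-tasks Allow statement is pinned to the given account."""
    ecs_statements = [
        s for s in trust_policy.get("Statement", [])
        if s.get("Effect") == "Allow" and ECS_TASKS_PRINCIPAL in _services(s)
    ]
    if not ecs_statements:
        return False  # nothing trusts ECS tasks, so there is nothing to guard
    for stmt in ecs_statements:
        condition = stmt.get("Condition", {})
        source_account = condition.get("StringEquals", {}).get("aws:SourceAccount")
        source_arn = condition.get("ArnLike", {}).get("aws:SourceArn", "")
        if source_account != account_id:
            return False
        # Assumes aws:SourceArn is a single string scoped to ECS in this account.
        if not source_arn.startswith("arn:aws:ecs:") or f":{account_id}:" not in source_arn:
            return False
    return True
```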
Rotation in depth
For a Secrets Manager secret rotated by an attached Lambda:
aws secretsmanager rotate-secret \
--secret-id acme/payments/db \
--rotation-lambda-arn arn:aws:lambda:eu-west-1:111122223333:function:RotateSecretsManagerDbCred \
--rotation-rules AutomaticallyAfterDays=30
AWS provides templates for RDS rotation Lambdas (MySQL, Postgres, Oracle, SQL Server) in the Serverless Application Repository. For third-party APIs, you write the Lambda; the four-step protocol (createSecret, setSecret, testSecret, finishSecret) is documented and consistent across rotations.
The task picks up the new secret on next task start. For a long-running service, this means a rolling restart at (or shortly after) rotation time. Two patterns:
- Rotation triggers a forced service update. An EventBridge rule on "detail-type": "AWS API Call via CloudTrail" for RotateSecret targets a Lambda that calls ecs:UpdateService with forceNewDeployment=true. Tasks cycle through new-secret versions within ~5 minutes. Because RotateSecret only starts the rotation, the Lambda should confirm the rotation has finished before forcing the deployment.
- Application re-reads secret periodically. The app uses the task role (which can itself read Secrets Manager if granted) to re-fetch the secret every N minutes. More code; avoids the restart; requires the task role to have secretsmanager:GetSecretValue on the secret, which somewhat blurs the execution-vs-task-role separation.
Most teams prefer pattern one: cleaner separation of responsibilities, no application code change.
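Pattern one can be sketched as a small Lambda handler. This is an illustration under several assumptions: the SECRET_CONSUMERS mapping, the cluster name "prod", and the handler name are hypothetical; the event is assumed to carry the secret name in detail.requestParameters.secretId; and the sketch does not wait for the rotation to finish, which a production version would likely need to do. The ECS call itself, update_service with forceNewDeployment=True, is the real boto3 API.

```python
from typing import Any, Dict, List, Optional, Tuple

# Hypothetical mapping: secret name -> (cluster, service) pairs that consume it.
SECRET_CONSUMERS: Dict[str, List[Tuple[str, str]]] = {
    "acme/payments/db": [("prod", "payments-service")],
}


def handler(event: Dict[str, Any], context: Any = None,
            ecs: Optional[Any] = None) -> List[str]:
    """Force a new deployment of every service that consumes the rotated secret."""
    detail = event.get("detail") or {}
    if detail.get("eventName") != "RotateSecret":
        return []
    secret_id = (detail.get("requestParameters") or {}).get("secretId", "")
    consumers = SECRET_CONSUMERS.get(secret_id, [])
    if not consumers:
        return []
    if ecs is None:
        import boto3  # imported lazily so the handler can be tested with a fake client

        ecs = boto3.client("ecs")
    restarted: List[str] = []
    for cluster, service in consumers:
        # Same task definition, new tasks: secrets are resolved again at task start.
        ecs.update_service(cluster=cluster, service=service, forceNewDeployment=True)
        restarted.append(f"{cluster}/{service}")
    return restarted
```

Keeping the secret-to-service mapping explicit means the Lambda’s own role can be scoped to ecs:UpdateService on just those services.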
A worked remediation
Acme’s fifteen services with plain-text environment secrets get remediated in a two-week sprint.
Week 1: inventory. Every task definition grepped for environment entries matching credential patterns (PASSWORD, API_KEY, TOKEN, SECRET). 37 secrets across 15 services identified. Secrets created in Secrets Manager (for anything needing rotation) or Parameter Store (for static config). KMS keys provisioned per environment.
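The inventory step can be scripted rather than grepped by hand. A minimal sketch, assuming task definitions are available as parsed JSON (for example from DescribeTaskDefinition output) and using the four credential patterns named above; find_plaintext_secrets is a hypothetical helper. It reports variable names only, so the scan output does not itself leak values.

```python
import re
from typing import Any, Dict, List

# The credential patterns used in the inventory, matched against variable names.
CREDENTIAL_PATTERN = re.compile(r"PASSWORD|API_KEY|TOKEN|SECRET", re.IGNORECASE)


def find_plaintext_secrets(task_definition: Dict[str, Any]) -> List[str]:
    """List environment entries whose names look like credentials."""
    family = task_definition.get("family", "?")
    hits: List[str] = []
    for container in task_definition.get("containerDefinitions", []):
        container_name = container.get("name", "?")
        for entry in container.get("environment", []):
            name = entry.get("name", "")
            if CREDENTIAL_PATTERN.search(name):
                hits.append(f"{family}/{container_name}: {name}")
    return hits


example = {
    "family": "payments-service",
    "containerDefinitions": [{
        "name": "app",
        "environment": [
            {"name": "LOG_LEVEL", "value": "info"},
            {"name": "DB_PASSWORD", "value": "not-a-real-password"},
        ],
    }],
}
print(find_plaintext_secrets(example))  # ['payments-service/app: DB_PASSWORD']
```

Name-based matching will miss credentials stored under innocuous names, so the results are a starting list for review rather than a complete inventory.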
Week 2: task definition updates. Each service’s task definition is updated: the environment entries for secret values are moved to secrets with valueFrom ARNs. Execution roles updated to include secretsmanager:GetSecretValue and kms:Decrypt on the specific resources. Services redeployed (rolling update).
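The execution-role statements can be derived from the task definition’s secrets entries instead of being written by hand. The sketch below assumes every valueFrom is a full ARN and that Secrets Manager ARNs are written without the random six-character suffix AWS appends, so it adds the -?????? wildcard to make the IAM resource match; both are assumptions to adjust for a real fleet. execution_role_secret_policy and the KMS key ARN parameter are hypothetical.

```python
from typing import Any, Dict, List


def execution_role_secret_policy(task_definition: Dict[str, Any],
                                 kms_key_arn: str) -> Dict[str, Any]:
    """Build the secret-access statements an execution role needs for this task."""
    secret_arns, parameter_arns = set(), set()
    for container in task_definition.get("containerDefinitions", []):
        for secret in container.get("secrets", []):
            parts = secret["valueFrom"].split(":")
            if len(parts) < 6 or parts[0] != "arn":
                raise ValueError(f"expected a full ARN: {secret['valueFrom']}")
            if parts[2] == "secretsmanager":
                # Drop any :json-key:version-stage:version-id fields, then add the
                # suffix wildcard (assumes the ARN was written without the suffix).
                secret_arns.add(":".join(parts[:7]) + "-??????")
            elif parts[2] == "ssm":
                parameter_arns.add(secret["valueFrom"])
    statements: List[Dict[str, Any]] = []
    if secret_arns:
        statements.append({"Sid": "ReadSecrets", "Effect": "Allow",
                           "Action": "secretsmanager:GetSecretValue",
                           "Resource": sorted(secret_arns)})
    if parameter_arns:
        statements.append({"Sid": "ReadParameters", "Effect": "Allow",
                           "Action": "ssm:GetParameters",
                           "Resource": sorted(parameter_arns)})
    if statements:
        statements.append({"Sid": "DecryptSecrets", "Effect": "Allow",
                           "Action": "kms:Decrypt", "Resource": kms_key_arn})
    return {"Version": "2012-10-17", "Statement": statements}
```

Generating the policy from the task definition keeps the two in step: a secret added to the task definition without a matching permission is the usual cause of task-start failures.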
End of week: 37 secret values removed from task definitions. The same values are now stored encrypted in Secrets Manager / Parameter Store, audit-logged via CloudTrail, and injected into containers at start. Rotation enabled on the RDS credentials; scheduled for every 30 days.
Simultaneously, task role policies are audited. For the three roles with iam:* or *:*, rewrites are produced: specific bucket / table / queue ARNs; specific actions. Dry-run deploys validate that nothing breaks.
Post-remediation, the auditor’s task-definition view shows no secret values. CloudTrail shows every secret fetch with the execution role’s identity and the task ARN. Task role policies are specific enough that a “show me what this service is allowed to touch” question has an answer in ten lines of JSON.
What’s worth remembering
- Two roles, two jobs. Task execution role for infrastructure (pull image, write logs, fetch secrets). Task role for application (S3, DynamoDB, SQS). They should never be the same role.
- Secrets go in secrets, not environment. ECS resolves them at task start via the execution role; values never appear in task definitions, ECS APIs, or IaC state.
- Secrets Manager for rotating secrets; Parameter Store SecureString for static. Cost, rotation, and audit shape the choice. Both work via the same task-definition mechanism.
- The execution role needs secretsmanager:GetSecretValue and kms:Decrypt (plus ssm:GetParameters for Parameter Store references). Common failure mode: secret referenced, execution role can’t decrypt, task fails with ResourceInitializationError.
- Task role policies should not contain *. Specific resource ARNs, specific actions. The audit wants to read the policy and understand what the service can do.
- aws:SourceArn / aws:SourceAccount in the trust policy prevents cross-account “confused deputy” abuse. AWS recommends these conditions on all service-assumed roles.
- Rotation triggers a rolling restart. EventBridge on RotateSecret -> ecs:UpdateService --force-new-deployment is the clean pattern. Tasks cycle, fetch the new secret, move on.
- Non-AWS secrets use the task role to fetch more if needed. If the application legitimately needs to read secrets beyond the secrets field (e.g. dynamic secrets fetched on a schedule), grant secretsmanager:GetSecretValue on the task role, not just the execution role.
The shape of “secrets and credentials in ECS” is simple when the two IAM roles are separate, when non-AWS secrets live in Secrets Manager or Parameter Store, and when the task definition carries only ARNs (not values). It’s messy when those separations collapse. Acme’s 15 remediated services go from failing an audit (“every task definition has plain-text API keys”) to passing it (“CloudTrail shows every secret read, every role is least-privileged, rotation happens automatically”). The application code changes not at all; the infrastructure boundary is where the change lives.