How to Set Up Blue/Green ECS Deploys With CodeDeploy

July 10, 2028 · 15 min read

DevOps Engineer Professional · DOP-C02 · part of The Exam Room

The situation

An e-commerce platform runs roughly forty microservices on ECS Fargate behind Application Load Balancers. Each service uses the default ECS rolling-update deployment controller with minimumHealthyPercent: 100 and maximumPercent: 200, which is the standard Fargate rolling setup.

Two weeks ago the checkout service deployed a task definition that renamed a field on /v2/cart from total_cents to total_minor_units. For the eight minutes that new and old tasks both sat behind the same ALB target group, roughly half of traffic hit a task speaking the new shape and half hit a task still speaking the old one. The upstream UI cached the first schema it saw per session and failed validation on the next request. The result was elevated 5xx for eight minutes. Rollback took another seven.

The post-incident list: no mixed-version traffic on the same listener, automatic rollback on alarm, pre-traffic smoke tests through the load balancer, and something that doesn’t require bespoke orchestration per service because there are forty of them.

What actually matters

Before reaching for a controller, it’s worth separating what the rolling-update model actually does wrong here from what can be papered over with tighter alarms.

The underlying failure was two revisions living behind one target group at the same time. The ALB has no notion of revisions; it balances across healthy targets. Rolling update says “spin up the new tasks, wait for health, drain the old,” and during the overlap window the load balancer genuinely has no way to prefer one over the other. A contract-incompatible change during that window is 50/50 roulette for every request. Alarms catch the consequence (5xx is up) but not the cause (the mixed pool); by the time the alarm fires, traffic has already been served by both versions. The fix we want isn’t tighter alarms; it’s a topology where mixed-version serving is structurally impossible.

Structurally impossible means each version lives behind its own target group, and the load balancer switches between target groups rather than between tasks. That’s the blue/green shape. Inside a given target group the revision is uniform; a client hitting the “new” group gets the new version for every request; a client hitting the “old” group gets the old version for every request. No half-state, regardless of schema discipline.
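To make the difference concrete, here is a toy simulation, not a model of real ALB routing. It assumes each request lands on a random task in whatever pool the client is routed to, and counts sessions that saw more than one revision. The revision numbers 46 and 47, the pool sizes, and the session counts are all illustrative.

```python
import random


def sessions_with_schema_flip(pool, sessions=200, requests_per_session=5, seed=7):
    """Count sessions that saw more than one revision across their requests.

    `pool` lists the revision of every task registered behind the target
    group the client is routed to. Each request picks a task at random.
    """
    rng = random.Random(seed)
    flipped = 0
    for _ in range(sessions):
        seen = {rng.choice(pool) for _ in range(requests_per_session)}
        if len(seen) > 1:
            flipped += 1
    return flipped


# Rolling overlap window: one target group holding both revisions.
mixed_pool = [46, 46, 46, 46, 47, 47, 47, 47]
# Blue/green: each target group holds exactly one revision.
tg_blue = [46, 46, 46, 46]
tg_green = [47, 47, 47, 47]

print("mixed pool:", sessions_with_schema_flip(mixed_pool))
print("tg-blue:   ", sessions_with_schema_flip(tg_blue))
print("tg-green:  ", sessions_with_schema_flip(tg_green))
```

A uniform pool can never produce a flip, whatever the seed. The mixed pool produces them for most sessions, which is the incident in miniature.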

Testing before customers see it is a separate lever. A canary that routes 10% of real traffic to the new version exposes, by design, 10% of customers to whatever’s wrong. A smoke test is the earlier check: invocations through the load balancer with controlled inputs, on a path only the test harness knows about, before any production weight moves. When the pre-traffic test catches the bug, zero customers saw it. When it doesn’t, the canary and the alarm gate are what’s left.

Blast radius vs cost. Blue/green runs two task sets during deployment and briefly after, roughly doubling the Fargate footprint for the deploy plus the termination wait. For services with small desired-counts, the extra cost is negligible; for services with large ones, it’s a real line item. The trade is “how much do we pay to make this class of incident structurally impossible?” The 8-minute outage cost more than a year of doubled Fargate footprint on this service, so the answer here is easy. It’s not always.
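The footprint side of that trade is simple arithmetic. The sketch below assumes the second task set runs at the full desired count for the whole deploy plus the termination wait. The function name and the example numbers are hypothetical, and pricing is deliberately left out: multiply the result by your own per-task-hour rate.

```python
def extra_task_hours(desired_count, deploy_minutes, termination_wait_minutes):
    """Additional Fargate task-hours one blue/green deploy adds over steady state.

    Assumes the second task set runs at full desired count for the whole
    deploy and the whole termination wait.
    """
    overlap_minutes = deploy_minutes + termination_wait_minutes
    return desired_count * overlap_minutes / 60


# Hypothetical small service: 4 tasks, 8-minute deploy, 5-minute wait.
small = extra_task_hours(desired_count=4, deploy_minutes=8, termination_wait_minutes=5)
# Hypothetical large service: 60 tasks, 30-minute deploy, 30-minute wait.
large = extra_task_hours(desired_count=60, deploy_minutes=30, termination_wait_minutes=30)

print(f"small service: {small:.2f} extra task-hours per deploy")
print(f"large service: {large:.2f} extra task-hours per deploy")
```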

Rollback is instantaneous when blue is still running. The old task set sits at 0% weight after a successful cutover, kept alive for a configurable wait. During the wait, reversion is a listener update; there’s nothing to provision. After termination, rollback means a new deployment. That retention window is insurance priced in minutes of Fargate.

Observability scopes to the target group, not the service. Alarms wired to “tg-green’s 5xx rate” watch only the new version’s traffic, cleanly separated from the old. The rolling model can’t do that: metrics are service-scoped and get blended during the deploy. Structural separation makes the alarm signal clean.

Finally, operational surface area at 40 services. Anything that requires bespoke orchestration per service is forty times the maintenance. We want a managed state machine we wire up once per service via infrastructure-as-code and leave alone. “Step Functions workflow per service” and “DIY listener-weight scripts” both lose on arithmetic; CodeDeploy on the ECS compute platform doesn’t.

What we’ll filter on

  1. Atomic traffic cutover. No mixed-version target group; the shift is between two uniform task sets.
  2. Alarm-driven rollback. CloudWatch alarms on the deployment itself trigger reversion without a human.
  3. Pre-traffic smoke test surface. Real HTTP through the load balancer against the new tasks before any production weight moves.
  4. Managed orchestration. AWS-native state machine wired once per service, not a bespoke workflow each.
  5. Task definition immutability. Rollback is a pointer flip, not a rebuild.

The ECS deployment landscape

ECS services support three deployment controllers. The choice is set at CreateService time and is almost irreversible: changing it requires re-creating the service.

ECS (rolling update). The default. ECS replaces tasks per minimumHealthyPercent and maximumPercent. Two safety features bolt on:

  • Deployment circuit breaker (deploymentCircuitBreaker: { enable: true, rollback: true }) watches task launches. When consecutive failures to reach RUNNING, or health-check-driven replacements, hit a threshold, it marks the deployment FAILED, with optional rollback.
  • Deployment alarms (deploymentConfiguration.alarms.alarmNames). Any named alarm transitioning to ALARM during the deployment rolls it back. A default bake under five minutes runs after tasks are healthy.

Both features are rolling-only. Neither rescues the shape of the incident: the target group holds old and new tasks simultaneously.
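For reference, the two rolling-update safety features sit together in one deploymentConfiguration block. The sketch below builds that block as a plain dictionary in the shape the ECS CreateService and UpdateService APIs accept. The helper name and the alarm names are hypothetical.

```python
import json


def rolling_deployment_configuration(alarm_names):
    """Build a deploymentConfiguration block for a rolling-update ECS service."""
    return {
        "minimumHealthyPercent": 100,
        "maximumPercent": 200,
        # Circuit breaker: fail the deployment when tasks keep failing to start.
        "deploymentCircuitBreaker": {"enable": True, "rollback": True},
        # Deployment alarms: roll back when any named alarm enters ALARM.
        "alarms": {
            "alarmNames": list(alarm_names),
            "enable": True,
            "rollback": True,
        },
    }


# Alarm names here are placeholders, not real resources.
config = rolling_deployment_configuration(["checkout-5xx", "checkout-p99-latency"])
print(json.dumps(config, indent=2))
```

Nothing in this block can express “keep revisions in separate target groups”, which is why it cannot address the incident shape.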

CODE_DEPLOY (blue/green). Hands orchestration to CodeDeploy. Two target groups on an ALB; a production listener and a test listener on a separate port. CodeDeploy creates a new task set running the new task definition revision, registers it with whichever target group isn’t currently serving production, runs optional Lambda hooks at five lifecycle events, then shifts the production listener between target groups per a deployment configuration. Alarms on the deployment group flip traffic back the moment anything goes wrong.

EXTERNAL. You orchestrate via CreateTaskSet, UpdateTaskSet, UpdateServicePrimaryTaskSet, DeleteTaskSet. Primitives only; you write the state machine. Right for the genuinely custom case; for 40 services that want the same blue/green shape, it means 40 bespoke state machines.

Task definitions are immutable revisions. Every RegisterTaskDefinition returns a new integer revision in the family; family:revision is the stable identifier. A CodeDeploy deployment is pinned to a specific revision, and rollback flips traffic back to the previous still-registered revision; nothing is rebuilt.

Side by side

| Controller | Atomic cutover | Alarm rollback | Pre-traffic test | Managed orchestration |
| --- | --- | --- | --- | --- |
| ECS rolling (circuit breaker + alarms) | ✗ (mixed pool) | ✓ | ✗ (health ping only) | ✓ |
| CODE_DEPLOY blue/green | ✓ (two TGs) | ✓ | ✓ (test listener hook) | ✓ |
| EXTERNAL | ✓ (if you build it) | you write it | you write it | ✗ |

Matching the incident shape to a controller

[Figure: rolling (default controller, mixed pool) versus blue/green (two target groups, atomic cutover), and why blue/green addresses the incident shape rather than tuning.]
Rolling asks alarms to catch a structural problem. Blue/green removes the structural problem. No amount of alarm tuning on one target group gets atomicity.

Blue/green via CodeDeploy in depth

The ECS blue/green topology has five parts: two target groups on one ALB (tg-blue, tg-green); a production listener on 443 forwarding to whichever target group is live; a test listener on a separate port (e.g. 8443) forwarding to the target group being deployed; a service with deploymentController.type = CODE_DEPLOY; and a CodeDeploy deployment triggered on each release with an AppSpec that names the new task definition revision, the container/port, and the hook Lambdas. The deployment configuration is set on the deployment group or on the deployment itself, not in the AppSpec.
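A minimal sketch of the AppSpec content for an ECS deployment, built as a Python dictionary and serialised to JSON. The Resources and Hooks layout follows the ECS AppSpec format. The helper name, the ARN, the container name and port, and the Lambda function names are all placeholders.

```python
import json

# Lifecycle events an ECS AppSpec may attach a Lambda to, in execution order.
HOOK_ORDER = [
    "BeforeInstall",
    "AfterInstall",
    "AfterAllowTestTraffic",
    "BeforeAllowTraffic",
    "AfterAllowTraffic",
]


def build_appspec(task_definition_arn, container_name, container_port, hooks):
    """Return an ECS AppSpec as a dict; `hooks` maps event name -> Lambda name."""
    unknown = set(hooks) - set(HOOK_ORDER)
    if unknown:
        raise ValueError(f"unknown lifecycle events: {sorted(unknown)}")
    return {
        "version": 0.0,
        "Resources": [
            {
                "TargetService": {
                    "Type": "AWS::ECS::Service",
                    "Properties": {
                        "TaskDefinition": task_definition_arn,
                        "LoadBalancerInfo": {
                            "ContainerName": container_name,
                            "ContainerPort": container_port,
                        },
                    },
                }
            }
        ],
        # Emit hooks in execution order regardless of dict ordering.
        "Hooks": [{name: hooks[name]} for name in HOOK_ORDER if name in hooks],
    }


appspec = build_appspec(
    # Placeholder account, region, and names.
    "arn:aws:ecs:us-east-1:111122223333:task-definition/checkout-service:47",
    container_name="checkout",
    container_port=8080,
    hooks={"AfterAllowTestTraffic": "checkout-smoke-test"},
)
print(json.dumps(appspec, indent=2))
```

Generating the AppSpec from one helper is what makes “one template, forty services” practical: only the revision and the names change per release.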

The five predefined deployment configurations for ECS:

| Name | Behaviour |
| --- | --- |
| CodeDeployDefault.ECSCanary10Percent5Minutes | Shift 10%, bake 5 min, then 100% |
| CodeDeployDefault.ECSCanary10Percent15Minutes | Shift 10%, bake 15 min, then 100% |
| CodeDeployDefault.ECSLinear10PercentEvery1Minutes | +10% every 1 min until 100% |
| CodeDeployDefault.ECSLinear10PercentEvery3Minutes | +10% every 3 min until 100% |
| CodeDeployDefault.ECSAllAtOnce | Shift 100% immediately |

Within each target group the revision is uniform: a request routed to the canary 10% gets the new version, and a request routed to the 90% gets the old one. Whether a single client stays on the same side for every request depends on target-group stickiness; weighted forwarding on its own splits per request. Either way, that’s not the half-state the incident caused. The incident was two revisions behind one target group, which is what rolling does by default and CodeDeploy blue/green never does. NLB-backed services are restricted to ECSAllAtOnce; an ALB deployment has the full set available.
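The predefined configurations reduce to a schedule of green-side weights over time. The sketch below computes that schedule, assuming minute 0 is the moment production traffic first shifts and that the first increment lands immediately. The function name and the kind labels are my own, not CodeDeploy identifiers.

```python
def green_weight_schedule(kind, percent=100, interval_minutes=0):
    """Return [(minute, green_percent), ...] for a traffic-shift pattern.

    `kind` is "canary", "linear", or "all_at_once". Minute 0 is the first
    production shift.
    """
    if kind == "all_at_once":
        return [(0, 100)]
    if percent <= 0 or percent > 100:
        raise ValueError("percent must be in 1..100")
    if kind == "canary":
        # One partial shift, one bake interval, then everything.
        return [(0, percent), (interval_minutes, 100)]
    if kind == "linear":
        steps, weight, minute = [], 0, 0
        while weight < 100:
            weight = min(100, weight + percent)
            steps.append((minute, weight))
            minute += interval_minutes
        return steps
    raise ValueError(f"unknown kind: {kind}")


# Shape of ECSCanary10Percent5Minutes
print(green_weight_schedule("canary", 10, 5))   # [(0, 10), (5, 100)]
# Shape of ECSLinear10PercentEvery3Minutes
print(green_weight_schedule("linear", 10, 3))
```

Read this way, canary spends its whole bake at a small fixed exposure, while linear keeps raising exposure on every step.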

Five ECS lifecycle hooks, in execution order. Each hook names a Lambda function in the AppSpec; that function must call PutLifecycleEventHookExecutionStatus with Succeeded or Failed within one hour.

  1. BeforeInstall: before the replacement task set is created. Pre-flight checks. Cannot trigger rollback.
  2. AfterInstall: after the replacement task set is created, before any traffic. Pre-warm, token refreshes, config validation. Can trigger rollback from here onward.
  3. AfterAllowTestTraffic: after the test listener routes to the replacement task set, before production weight shifts. Synthetic HTTP with real payloads through the load balancer. A Failed here rolls back without a single real user seeing the new code.
  4. BeforeAllowTraffic: after the production target group is associated, before production weight moves. Last chance to abort.
  5. AfterAllowTraffic: after the production listener has fully shifted. End-to-end checks.

Alarm-based rollback. The deployment group carries an alarm configuration: a list of CloudWatch alarm names and enabled: true. Any named alarm transitioning to ALARM during the deployment triggers rollback. Alarms must be non-ALARM at start or CodeDeploy refuses to begin. A sensible alarm set scopes to the green target group: HTTPCode_Target_5XX_Count and TargetResponseTime p99 on tg-green, plus RunningTaskCount < DesiredCount on the new service.

Automatic rollback configuration. autoRollbackConfiguration names the trigger events: DEPLOYMENT_FAILURE, DEPLOYMENT_STOP_ON_ALARM, DEPLOYMENT_STOP_ON_REQUEST. The default-shape answer is [DEPLOYMENT_FAILURE, DEPLOYMENT_STOP_ON_ALARM].

Terminate blue after success. terminateBlueInstancesOnDeploymentSuccess.terminationWaitTimeInMinutes controls how long the old task set sits at 0% weight after cutover. The tutorial default is five minutes; values up to 2880 (two days) are accepted. Rollback during the wait is instantaneous because blue is still running.
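The three settings above (alarms, automatic rollback, blue retention) all live on the deployment group. The sketch below assembles them as a dictionary whose keys match the CodeDeploy CreateDeploymentGroup parameters. The helper name and alarm names are hypothetical, and the other required parameters (application name, service role, ECS service, target group pair) are omitted for brevity.

```python
MAX_TERMINATION_WAIT_MINUTES = 2880  # two days


def deployment_group_settings(alarm_names, termination_wait_minutes=5):
    """Rollback-related settings for an ECS blue/green deployment group."""
    if not 0 <= termination_wait_minutes <= MAX_TERMINATION_WAIT_MINUTES:
        raise ValueError("termination wait must be between 0 and 2880 minutes")
    return {
        "deploymentStyle": {
            "deploymentType": "BLUE_GREEN",
            "deploymentOption": "WITH_TRAFFIC_CONTROL",
        },
        "alarmConfiguration": {
            "enabled": True,
            "alarms": [{"name": name} for name in alarm_names],
        },
        "autoRollbackConfiguration": {
            "enabled": True,
            "events": ["DEPLOYMENT_FAILURE", "DEPLOYMENT_STOP_ON_ALARM"],
        },
        "blueGreenDeploymentConfiguration": {
            # Keep blue at 0% weight for this long so rollback stays instant.
            "terminateBlueInstancesOnDeploymentSuccess": {
                "action": "TERMINATE",
                "terminationWaitTimeInMinutes": termination_wait_minutes,
            },
            "deploymentReadyOption": {"actionOnTimeout": "CONTINUE_DEPLOYMENT"},
        },
    }


# Placeholder alarm names scoped to the green target group.
settings = deployment_group_settings(["tg-green-5xx", "tg-green-p99"], termination_wait_minutes=15)
print(settings["blueGreenDeploymentConfiguration"])
```

With boto3 these could be passed as keyword arguments to create_deployment_group alongside the omitted parameters.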

A worked deployment trace

Checkout service. Blue is checkout-service:46. CI has published checkout-service:47 with the renamed field. Configuration: CodeDeployDefault.ECSCanary10Percent5Minutes. Alarms on ALB 5xx and p99 latency scoped to the green target group. AfterAllowTestTraffic hook runs five HTTP requests with known payloads.

T+0. CI calls CreateDeployment with the AppSpec pointing at 47. Alarm states: all OK. Precondition passes.

T+5s. BeforeInstall hook: target revision exists, feature flags set. Succeeded.

T+30s. CodeDeploy creates the green task set: four Fargate tasks at rev 47 in tg-green. Tasks reach RUNNING; target group health passes.

T+~3min. Tasks healthy. AfterInstall hook: Succeeded. Production listener still forwards 100% to tg-blue.

T+~3min. Test listener on 8443 now forwards to tg-green. Zero real users have hit the new version.

T+~3min 15s. AfterAllowTestTraffic hook hits https://checkout-service.internal:8443/v2/cart five times with synthetic payloads. Each response parsed; field checked (total_minor_units present, total_cents absent). All five pass.

Clean branch. BeforeAllowTraffic passes. Production listener shifts to 10% tg-green / 90% tg-blue. Roughly 10% of real requests now hit the green task set. At T+~8min the 5-minute bake completes with no alarm fire; listener shifts to 100% tg-green; AfterAllowTraffic passes; deployment Succeeded. After the 5-minute termination wait, blue is torn down.

Alarm branch. Revision 47 broke a code path that only shows under concurrent load. At T+~5min the tg-green 5xx alarm crosses threshold. Listener flips back to 100% tg-blue immediately. Deployment Failed - Auto Rolled Back. Blue was never torn down so rollback is instantaneous. Customer blast radius: 10% for roughly two minutes. Compare the original incident: 100% for eight.

Hook branch. A dev error published rev 47 with a missing environment variable; /v2/cart returns 500. At T+~3min 15s the hook sends five synthetic requests; all five return 500. The hook reports Failed. CodeDeploy aborts and tears down the green task set. No production weight ever shifts. Customer blast radius: zero.
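The per-response check the AfterAllowTestTraffic hook runs in this trace is small enough to sketch. It assumes the endpoint returns a JSON object and that the hook has already fetched the status code and body. The function name and the sample bodies are hypothetical; the field names come from the incident.

```python
import json


def cart_response_ok(status_code, body_text):
    """True when a /v2/cart response has the new field and not the old one."""
    if status_code != 200:
        return False
    try:
        body = json.loads(body_text)
    except json.JSONDecodeError:
        return False
    return (
        isinstance(body, dict)
        and "total_minor_units" in body
        and "total_cents" not in body
    )


# Hypothetical responses covering the three branches of the trace.
print(cart_response_ok(200, '{"total_minor_units": 1999}'))  # new shape
print(cart_response_ok(200, '{"total_cents": 1999}'))        # old shape
print(cart_response_ok(500, "Internal Server Error"))        # hook branch
```

The hook passes only if every one of its synthetic requests satisfies this check; a single mismatch is enough to report Failed.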

When rolling is still right

Rolling with circuit breaker and alarms is correct in three cases: when API contracts don’t change across deploys, because a backward-compatible schema discipline (add optional fields, never rename or remove) genuinely tolerates mixed-version pools; when per-deploy cost matters for large-desired-count services where briefly doubling the Fargate footprint is expensive; or when the topology doesn’t fit the two-target-group shape (heavy path-based routing, multiple listener rules, non-ALB listeners).

What’s worth remembering

  1. Three ECS deployment controllers: ECS (rolling), CODE_DEPLOY (blue/green), EXTERNAL (you orchestrate). Chosen at service-creation time.
  2. Rolling-update safety is the deployment circuit breaker plus deployment alarms. Both rolling-only; neither solves mixed-version target groups.
  3. Blue/green needs two target groups on one ALB, a production listener, and a test listener on a separate port.
  4. Five predefined ECS deployment configurations: two canary, two linear, and all-at-once. NLB-backed services are restricted to ECSAllAtOnce.
  5. Five ECS lifecycle hooks, in order: BeforeInstall, AfterInstall, AfterAllowTestTraffic, BeforeAllowTraffic, AfterAllowTraffic. Any except BeforeInstall can trigger rollback.
  6. AfterAllowTestTraffic is the smoke-test hook. It fires after the test listener routes to the new task set, before any production weight shifts.
  7. CodeDeploy’s alarm configuration on the deployment group is the rollback mechanism. Alarms must be non-ALARM at start; transition to ALARM during the deploy rolls back.
  8. autoRollbackConfiguration names the trigger events: DEPLOYMENT_FAILURE, DEPLOYMENT_STOP_ON_ALARM, DEPLOYMENT_STOP_ON_REQUEST.
  9. Task definitions are immutable revisions. Rollback flips the deployment pointer; nothing is rebuilt.
  10. terminateBlueInstancesOnDeploymentSuccess.terminationWaitTimeInMinutes controls old-task-set retention after success (up to 2880 minutes). Rollback during the wait is instant.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.