How to Choose a CodeDeploy Config for Lambda With Auto-Rollback

June 28, 2028 · 13 min read

DevOps Engineer Professional · DOP-C02 · part of The Exam Room

The situation

A serverless team runs a customer-facing API on Lambda behind API Gateway. They deploy several times a week. They’ve had two incidents in the past quarter where a bad deploy increased the 5xx rate; both took about ten minutes to detect and another five to roll back manually.

The team wants four things from the next iteration of their deployment pipeline: gradual traffic shift to new Lambda versions, so a bad version never sees more than a small fraction of traffic; automatic rollback on alarm; pre-traffic validation that smoke-tests the new version before real customer traffic touches it; and operational simplicity, meaning they don’t want to write the rollout state machine themselves.

What actually matters

Before reaching for a deployment configuration name, it’s worth thinking about what kind of deploy actually fails safely.

The core move is shrinking the blast radius of a bad version. The old, easy thing is to swap everything at once: cheap, fast, and catastrophic when wrong. The more thoughtful thing is to route a slice of live traffic to the new code while the old code keeps serving the majority, bake for long enough to decide, and only commit fully after the evidence comes in. Shift mechanics differ (jump then bake, or keep stepping up), but the shape is the same: small exposure, time to notice, reversible decision.

Noticing is worth thinking about separately. Humans reading dashboards at deploy time is the process that failed twice already; the rollback took five minutes after detection because a person had to realise, decide, and act. The shape we want instead is an alarm that both detects and acts. If an error-rate metric crossing a threshold can drive the traffic shift back to the previous version without a human in the loop, “five minutes to roll back” becomes seconds of alarm-propagation delay.

A subtler question is whether any real customer should see the new code before we’ve checked it. A canary exposes 10% of traffic to discover problems; by definition, some of those customers hit the bug. A pre-traffic smoke test flips the order: the new version is reachable on a path only the test harness knows, real invocations with controlled inputs run first, and only if those pass does the alias weight move. If a deploy goes wrong because of something synthetic testing would catch, zero customers are involved.

Cost shape matters less here than elsewhere (we’re not running two fleets for a long time, just briefly keeping two versions of a Lambda live), but operational ownership matters a lot. Gradual shift can be assembled by hand: Step Functions driving alias weight updates, custom alarm listeners, bespoke rollback logic. That works and gives maximum control; it also becomes a second control plane with its own on-call. The alternative is to pick up the managed orchestration AWS already ships and spend the engineering budget on the observability and the smoke tests instead of the state machine.

Finally, there’s a shape of failure the canary doesn’t catch: concurrency-shaped bugs. A function handling 10% of traffic may never hit the rate limit, the contention, or the downstream throttle that surfaces at 50%. Canary jumps from small to full in one step; linear steps through the middle. If the worry is “works at low load, fails at high,” the stepping shape pays for itself.

What we’ll filter on

  1. Traffic-shift shape: gradual across two versions, not all at once.
  2. Alarm-driven rollback: automatic reversion without human action.
  3. Pre-traffic smoke test: controlled invocations against the new version before any real traffic sees it.
  4. Managed orchestration: AWS owns the state machine, not the team.
  5. Concurrency-surfacing shift: steps through mid-traffic percentages where concurrency bugs appear.

The Lambda deployment landscape

AWS offers a few plausible ways to ship a new Lambda version to production.

Lambda alias manual weighting. Lambda aliases support a weighted split between two versions natively (e.g. 90/10 between v3 and v4). A team can write a Step Functions workflow that nudges the alias weight upward on a timer, listens to CloudWatch for alarm transitions, and rolls back on its own. Maximum flexibility, full ownership of the state machine, and a pile of code to maintain.

CodeDeploy on the Lambda compute platform. CodeDeploy drives exactly that alias-weight change per a deployment configuration, with built-in alarm gates, pre- and post-traffic hooks, and automatic rollback. Nine predefined configurations ship out of the box (four canary variants, four linear variants, and one all-at-once), and custom configurations exist for non-default schedules. The orchestration is AWS’s; the team writes an AppSpec file and a handful of hook lambdas.

CodeDeploy predefined: canary. Shift a small chunk of traffic (always 10% in the predefined set), bake for a configured duration, then jump to 100%. The bake window is where the alarm gate has time to fire on real traffic at low blast radius. Commits in one final cut.

CodeDeploy predefined: linear. Step traffic up in even 10% increments over a configured interval without a long bake at any single percentage. Surfaces concurrency issues at the percentage they appear. Errors at 30% roll back from 30% (a smaller peak than canary’s post-bake jump).

CodeDeploy predefined: all-at-once. No traffic shift; immediate cutover. Right for hotfix rollback (speed beats safety), disposable internal traffic, or changes that can’t run in parallel with the previous version.
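The three shapes can be compared with a small toy model. This is only a sketch of the schedules as described above, not CodeDeploy's implementation; the function names are made up for illustration, and the minute offsets assume the first slice shifts at minute 0.

```python
from typing import List, Set, Tuple

# (minutes since the first shift, percent of traffic on the new version)
Step = Tuple[int, int]


def canary_schedule(percent: int, bake_minutes: int) -> List[Step]:
    """Canary shape: one small slice, a bake window, then a single jump to 100%."""
    return [(0, percent), (bake_minutes, 100)]


def linear_schedule(step_percent: int, interval_minutes: int) -> List[Step]:
    """Linear shape: equal increments at a fixed interval until 100% is reached."""
    steps: List[Step] = []
    percent, minute = step_percent, 0
    while percent < 100:
        steps.append((minute, percent))
        percent += step_percent
        minute += interval_minutes
    steps.append((minute, 100))
    return steps


def all_at_once_schedule() -> List[Step]:
    """All-at-once shape: no intermediate exposure."""
    return [(0, 100)]


def percentages_visited(schedule: List[Step]) -> Set[int]:
    """Traffic levels at which the new version actually runs for some time."""
    return {percent for _, percent in schedule}


canary = canary_schedule(10, 10)
linear = linear_schedule(10, 2)
print(50 in percentages_visited(canary))  # False: canary never dwells at mid load
print(50 in percentages_visited(linear))  # True: linear steps through it
```

The last two lines are the concurrency argument in miniature: a bug that only appears around 50% load has a step to surface on in the linear schedule and none in the canary one.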

Side by side

Option                                    | Gradual shift | Alarm rollback | Hooks        | Managed orchestration | Concurrency stepping
Hand-rolled alias weighting               | ✓             | you write it   | you write it | ✗                     | ✓
CodeDeploy canary (Canary10Percent*)      | ✓             | ✓              | ✓            | ✓                     | ✗ (jump after bake)
CodeDeploy linear (Linear10PercentEvery*) | ✓             | ✓              | ✓            | ✓                     | ✓
CodeDeploy AllAtOnce                      | ✗             | ✓              | ✓            | ✓                     | ✗

Matching the situation to a configuration

Canary: small slice, long bake, then jump. Fits a standard production API: API Gateway to a Lambda alias, concurrency well-understood, and a 10-minute bake that fits the release rhythm. Need a bake on real traffic? Yes. The pre-traffic hook smoke-tests the new version, alarms watch 5xx and p99 latency, and ALARM triggers auto-rollback. Configuration: Canary10Percent10Minutes (10% for 10 minutes, then 100%), with BeforeAllowTraffic synthetic probes, alarms on Errors and 5XXError, and an AfterAllowTraffic end-to-end check.

Linear: even steps through the middle. Fits a concurrency-sensitive path: a downstream with rate limits that fails only above ~50% load, which canary's jump would miss. Need mid-traffic steps? Yes. Same hook and alarms, with the concurrency alarm scoped tight; an alarm at N% rolls back from N%. Configuration: Linear10PercentEvery2Minutes (+10% every 2 min, 20 min total), with the same hooks and alarms; a concurrency bug surfaces mid-shift and rollback starts from where it fired.

All-at-once: immediate cutover, no shift. Fits a hotfix rollback: a previous-version rollforward where speed matters more than safety and every minute of canary is damage. Need gradual? No, speed wins. Configuration: AllAtOnce (instant cutover); hooks still run and alarms still trigger rollback, there is just no gradual window.

Three deployment situations, three shapes of shift. The alarm gate and the hook lambdas are identical across all three; what differs is how fast traffic gets to the new version.

Canary10Percent10Minutes, in depth

For a customer-facing API on Lambda behind API Gateway, where concurrency is well-understood and the release cadence tolerates a ten-minute bake, the convention is Canary10Percent10Minutes. Ten percent is a low enough slice that a genuinely broken version harms a manageable fraction of traffic; ten minutes is long enough for transient errors and most latency regressions to surface.

Lambda versions and aliases. Versions are immutable snapshots of function code and configuration. Aliases are pointers to a version, and critically, an alias can point at two versions at once with a weighted split (e.g. 90% to v3, 10% to v4). API Gateway routes to the alias, never the version directly, so adjusting the alias weight shifts production traffic without redeploying. CodeDeploy on the Lambda compute platform doesn’t deploy code; it deploys a new alias weight.
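As a rough illustration of what "deploying an alias weight" means, the sketch below builds the arguments for Lambda's UpdateAlias call for a 90/10 split. The function name `orders-api` and the helper itself are hypothetical; the block only constructs the request and does not call AWS.

```python
from typing import Any, Dict


def weighted_alias_update(function_name: str, alias: str,
                          stable_version: str, new_version: str,
                          new_weight: float) -> Dict[str, Any]:
    """Build keyword arguments for Lambda's UpdateAlias call.

    The alias keeps pointing at stable_version; RoutingConfig sends the
    fraction new_weight (0.0 to 1.0) of invocations to new_version instead.
    """
    if not 0.0 <= new_weight <= 1.0:
        raise ValueError("new_weight must be between 0.0 and 1.0")
    return {
        "FunctionName": function_name,
        "Name": alias,
        "FunctionVersion": stable_version,
        "RoutingConfig": {"AdditionalVersionWeights": {new_version: new_weight}},
    }


# Hypothetical names: 90% of traffic stays on v3, 10% goes to v4.
kwargs = weighted_alias_update("orders-api", "prod", "3", "4", 0.10)
print(kwargs["RoutingConfig"])

# With boto3 installed and credentials configured, this could be applied as:
#   boto3.client("lambda").update_alias(**kwargs)
```

This is the single knob CodeDeploy turns during a shift; the hand-rolled alternative is the same call wrapped in your own timer and rollback logic.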

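The AppSpec file names the function, the alias, the current version, the target version, and optional pre- and post-traffic hook lambdas. The deployment configuration is not part of the AppSpec; it is set on the deployment group or passed when the deployment is created.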
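A minimal sketch of such an AppSpec, built as a Python dictionary and serialised to JSON (CodeDeploy accepts JSON as well as YAML). All function, alias, and hook names here are placeholders, and the deployment configuration is assumed to live on the deployment group rather than in this file.

```python
import json
from typing import Any, Dict, Optional


def lambda_appspec(function_name: str, alias: str,
                   current_version: str, target_version: str,
                   before_hook: Optional[str] = None,
                   after_hook: Optional[str] = None) -> Dict[str, Any]:
    """Assemble an AppSpec for the Lambda compute platform."""
    spec: Dict[str, Any] = {
        "version": 0.0,
        "Resources": [{
            function_name: {
                "Type": "AWS::Lambda::Function",
                "Properties": {
                    "Name": function_name,
                    "Alias": alias,
                    "CurrentVersion": current_version,
                    "TargetVersion": target_version,
                },
            }
        }],
    }
    # Hooks are optional; each entry names the Lambda function that runs the check.
    hooks = []
    if before_hook:
        hooks.append({"BeforeAllowTraffic": before_hook})
    if after_hook:
        hooks.append({"AfterAllowTraffic": after_hook})
    if hooks:
        spec["Hooks"] = hooks
    return spec


# Placeholder names, mirroring a v46 -> v47 shift on the alias "prod".
appspec = lambda_appspec("orders-api", "prod", "46", "47",
                         before_hook="orders-api-pre-traffic",
                         after_hook="orders-api-post-traffic")
print(json.dumps(appspec, indent=2))
```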

CloudWatch alarm gate. Attach a list of CloudWatch alarms to the deployment group. If any enters ALARM during the deployment, CodeDeploy instantly shifts the alias weight back to the original version, effectively all-at-once in reverse. The deployment is marked Failed - Auto Rolled Back. Alarms that matter for a serverless API are usually a small set: function error rate (AWS/Lambda Errors), function throttles, API Gateway 5xx rate (AWS/ApiGateway 5XXError), and latency p99 (AWS/Lambda Duration). Alarms must be non-ALARM when the deployment starts. CodeDeploy refuses to begin otherwise.
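One way the alarm gate might be wired up is sketched below: an API Gateway 5XXError alarm definition plus the deployment-group settings that attach it and enable auto-rollback. The alarm name, API name, and threshold are illustrative assumptions, not recommendations, and the block only builds request arguments (a real `create_deployment_group` call also needs an application name, group name, and service role).

```python
from typing import Any, Dict, List


def api_5xx_alarm(alarm_name: str, api_name: str, threshold: float) -> Dict[str, Any]:
    """Keyword arguments for CloudWatch PutMetricAlarm on API Gateway 5XXError."""
    return {
        "AlarmName": alarm_name,
        "Namespace": "AWS/ApiGateway",
        "MetricName": "5XXError",
        "Dimensions": [{"Name": "ApiName", "Value": api_name}],
        "Statistic": "Sum",
        "Period": 60,              # one-minute buckets (assumed)
        "EvaluationPeriods": 1,    # fire on the first breaching minute (assumed)
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }


def deployment_group_settings(alarm_names: List[str]) -> Dict[str, Any]:
    """The alarm and rollback parts of a CodeDeploy deployment group."""
    return {
        "deploymentConfigName": "CodeDeployDefault.LambdaCanary10Percent10Minutes",
        "deploymentStyle": {
            "deploymentType": "BLUE_GREEN",
            "deploymentOption": "WITH_TRAFFIC_CONTROL",
        },
        "alarmConfiguration": {
            "enabled": True,
            "alarms": [{"name": name} for name in alarm_names],
        },
        "autoRollbackConfiguration": {
            "enabled": True,
            "events": ["DEPLOYMENT_FAILURE", "DEPLOYMENT_STOP_ON_ALARM"],
        },
    }


alarm = api_5xx_alarm("orders-api-5xx", "orders-api", threshold=5)
group = deployment_group_settings([alarm["AlarmName"], "orders-api-lambda-errors"])
print(group["alarmConfiguration"])
```

Keeping the alarm list short, as the paragraph above suggests, also keeps the precondition check meaningful: every alarm on the list has to be out of ALARM before a deploy can start.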

Hook lambdas. BeforeAllowTraffic runs after the new version is created but before any traffic shifts to it. The hook invokes the new version directly (not via the alias) with synthetic payloads, verifies the response, and calls PutLifecycleEventHookExecutionStatus with Succeeded or Failed. A Failed result aborts the deploy before any customer sees the new code. AfterAllowTraffic runs after the alias is fully at 100%; it is the place for end-to-end checks and downstream verification.
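The core of a BeforeAllowTraffic hook could look roughly like the sketch below. To keep it runnable without AWS access, the Lambda invocation and the status report are passed in as plain callables; the probe payloads, the `statusCode` check, and all names are assumptions for illustration. The two event keys are the ones CodeDeploy passes to hook functions.

```python
from typing import Any, Callable, Dict, Iterable

Invoke = Callable[[Dict[str, Any]], Dict[str, Any]]
Report = Callable[[str, str, str], None]


def run_pre_traffic_hook(event: Dict[str, str], probes: Iterable[Dict[str, Any]],
                         invoke: Invoke, report: Report) -> str:
    """Run synthetic probes against the new version and report the outcome.

    Any non-200 response or raised exception counts as a failed smoke test.
    """
    status = "Succeeded"
    try:
        for payload in probes:
            response = invoke(payload)
            if response.get("statusCode") != 200:
                status = "Failed"
                break
    except Exception:
        status = "Failed"
    # CodeDeploy waits for this callback before deciding whether to shift traffic.
    report(event["DeploymentId"], event["LifecycleEventHookExecutionId"], status)
    return status


# Local dry run with fakes standing in for the AWS clients.
reported = []
status = run_pre_traffic_hook(
    {"DeploymentId": "d-EXAMPLE", "LifecycleEventHookExecutionId": "hook-EXAMPLE"},
    probes=[{"path": "/health"}, {"path": "/orders/1"}],
    invoke=lambda payload: {"statusCode": 200},
    report=lambda dep, hook, st: reported.append((dep, hook, st)),
)
print(status, reported)

# In a real hook, with boto3 available, the callables would wrap calls such as:
#   lambda_client.invoke(FunctionName=..., Qualifier=new_version, Payload=...)
#   codedeploy.put_lifecycle_event_hook_execution_status(
#       deploymentId=..., lifecycleEventHookExecutionId=..., status=...)
```

How the hook learns which version to probe is left open here; passing the version through an environment variable is one common approach, but that detail depends on how the pipeline is templated.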

A worked deploy

The pipeline runs CodeDeploy with Canary10Percent10Minutes, alarm gates on Lambda Errors and API Gateway 5XXError, and a BeforeAllowTraffic hook that runs five synthetic requests through the new version.

T+0. New code commits. Pipeline builds and publishes v47. CodeDeploy starts a deployment for the alias prod.

T+10s. BeforeAllowTraffic hook invokes v47 directly with five synthetic events. All five return 200 with valid response bodies. Hook reports Succeeded.

T+15s. Alias weight shifts to 90% v46 / 10% v47. Real traffic begins reaching v47.

T+15s through T+10m15s. Bake. CodeDeploy monitors the configured alarms.

Clean branch. Both alarms stay OK. At T+10m15s the alias shifts to 100% v47, AfterAllowTraffic passes, deploy marked Succeeded.

Bad branch. At T+3m the 5XXError alarm crosses threshold: v47 is throwing 500s on a particular code path. CodeDeploy detects the transition. Alias snaps back to 100% v46 instantly. The deploy is marked Failed - Auto Rolled Back. Customer traffic on v47 was 10% for three minutes; no further bad traffic. Total elapsed: three minutes from shift to back-on-stable.
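A back-of-the-envelope comparison of the bad branch with the earlier manual incidents, measured as traffic share multiplied by minutes of exposure. It assumes the earlier bad deploys were serving 100% of traffic for the full ten minutes of detection plus five of rollback, which the incident description implies but does not state.

```python
def exposure_percent_minutes(percent_on_bad_version: int, minutes: int) -> int:
    """Traffic share on the bad version multiplied by how long it stayed there."""
    return percent_on_bad_version * minutes


# Earlier incidents: assumed 100% of traffic, 10 min to detect + 5 min to roll back.
manual_incident = exposure_percent_minutes(100, 10 + 5)

# Bad branch above: 10% of traffic for the 3 minutes before the alarm fired.
canary_bad_branch = exposure_percent_minutes(10, 3)

reduction_factor = manual_incident / canary_bad_branch
print(manual_incident, canary_bad_branch, reduction_factor)  # 1500 30 50.0
```

Under those assumptions the canary with an alarm gate cuts customer exposure by a factor of about fifty, with the slice size and the time to alarm contributing roughly equally.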

What’s worth remembering

  1. Lambda aliases support weighted routing between two versions; CodeDeploy drives that weight on a schedule. The schedule is the deployment configuration.
  2. Three families, nine predefined configurations. Canary (jump after bake), linear (continuous stepping), all-at-once (immediate cutover). The percentage in the predefined names is the constant: always 10%. Only the interval varies: 5, 10, 15, or 30 minutes for canary, and 1, 2, 3, or 10 minutes for linear.
  3. CloudWatch alarm gates turn any configuration into a safe deploy. Transition to ALARM during the deployment triggers instant rollback without human action.
  4. BeforeAllowTraffic runs synthetic invocations before any real traffic shifts. Failed hook aborts the deploy with zero customer exposure.
  5. Linear is the correct shape for concurrency-sensitive functions where failures appear only above some threshold percentage; canary’s post-bake jump lands at 100% too fast to catch mid-load bugs.
  6. All-at-once is correct for hotfix rollback, disposable internal traffic, and changes incompatible with parallel old/new versions.
  7. Custom deployment configurations exist for non-default percentages or schedules – CreateDeploymentConfig with a TrafficRoutingConfig of type TimeBasedCanary or TimeBasedLinear.
  8. Alarms must be non-ALARM when the deployment starts. CodeDeploy uses the alarm set as a precondition check, not just a rollback trigger.
  9. CodeDeploy is the orchestration, not a layer on top of one. Alias, weights, rollback, and hooks run without any custom state machine.
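For point 7, a custom configuration request might be assembled as in the sketch below. The configuration names and the 5%/15-minute and 20%/5-minute values are arbitrary examples; the block builds the CreateDeploymentConfig arguments without calling AWS.

```python
from typing import Any, Dict


def _check_percent(percent: int) -> None:
    if not 0 < percent < 100:
        raise ValueError("percent must be between 1 and 99")


def custom_canary_config(name: str, percent: int, bake_minutes: int) -> Dict[str, Any]:
    """Arguments for CreateDeploymentConfig with a TimeBasedCanary shape."""
    _check_percent(percent)
    return {
        "deploymentConfigName": name,
        "computePlatform": "Lambda",
        "trafficRoutingConfig": {
            "type": "TimeBasedCanary",
            "timeBasedCanary": {
                "canaryPercentage": percent,
                "canaryInterval": bake_minutes,   # minutes before the jump to 100%
            },
        },
    }


def custom_linear_config(name: str, step_percent: int, interval_minutes: int) -> Dict[str, Any]:
    """Arguments for CreateDeploymentConfig with a TimeBasedLinear shape."""
    _check_percent(step_percent)
    return {
        "deploymentConfigName": name,
        "computePlatform": "Lambda",
        "trafficRoutingConfig": {
            "type": "TimeBasedLinear",
            "timeBasedLinear": {
                "linearPercentage": step_percent,
                "linearInterval": interval_minutes,  # minutes between increments
            },
        },
    }


# Hypothetical custom shapes outside the predefined 10% set.
canary_cfg = custom_canary_config("Canary5Percent15Minutes", 5, 15)
linear_cfg = custom_linear_config("Linear20PercentEvery5Minutes", 20, 5)
print(canary_cfg["trafficRoutingConfig"])

# With boto3 installed and credentials configured:
#   boto3.client("codedeploy").create_deployment_config(**canary_cfg)
```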

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.