Shadow, Canary, or Linear: Rolling Out a New Model

June 07, 2028 · 14 min read

ML Engineer · MLA-C01 · part of The Exam Room

The situation

A content-recommendation team is about to deploy v14 of their ranking model to replace v13 in production. The current endpoint runs 6 × ml.m6i.4xlarge instances serving 200 RPS steady state, 500 RPS peak. v14 has:

  • Passed offline evaluation (held-out click-through-rate improved by ~4% against the test set).
  • Passed shadow-traffic on staging last week (run against synthetic traffic for 24 hours, no errors, latency within 10%).
  • Been approved in Model Registry.

What remains is the production rollout. The team has three pending worries:

  • Live-traffic shadow evidence: does the new model behave sanely on real production traffic distribution, which synthetic doesn’t fully capture?
  • User-facing regression: if v14 is actually worse on some segment, can we detect it and back out before most users see degraded recommendations?
  • Fast recovery: if something does break, how long until we’re back on v13?

Three SageMaker deployment patterns address different subsets of these worries. The work is knowing which pattern to pick for which worry, and, specifically, whether to use them in sequence or to pick one.

What actually matters

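SageMaker gives us several ways to control how a new model meets production traffic. Shadow testing is configured on the endpoint config itself; deployment guardrails are a set of options on UpdateEndpoint that govern how the new endpoint config replaces the old one. The options are: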

1. Shadow variant (shadow testing). Both the old and new models run simultaneously. Production traffic is duplicated: each request goes to the primary variant (which returns the response to the user) and, in parallel, to the shadow variant (whose response is captured but not used). Metrics on the shadow variant (latency, error rate, response distribution) are collected. Users never see the shadow’s output.

Shadow testing is the answer to “does the new model behave sanely under real production load?” without risking any user-visible impact. It’s the highest-signal, lowest-risk test, paid for by the extra compute running the shadow variant for the duration of the test.

2. Blue-green deployment with canary. The new model is deployed to a new fleet of instances alongside the old. A small fraction of production traffic is routed to the new model (the canary, typically 10%). The rest continues to the old. The deployment monitors CloudWatch alarms during the canary period; if any fire, it automatically rolls back. If none fire, traffic shifts to 100% new, the old fleet is torn down.

This is the answer to “route real users to the new model gradually, and watch for regression”. It costs an extra fleet-worth of compute during the deploy window but provides fast automatic rollback and limits blast radius to the canary fraction.

3. Linear traffic shifting. Also blue-green, but the shift from old to new happens in equal-sized steps (e.g. 10% every 10 minutes, which works out to roughly an hour and a half end to end), not as a two-stage canary-then-all. Each step must pass the monitoring gate before the next one starts. Slower rollout, finer-grained detection of gradual regressions, same rollback mechanism.

4. All-at-once. Not really a guardrail: just point the endpoint at the new model config and let it swap. Zero canary, zero protection. Fine for dev environments, reckless for production.

And before or alongside any of these, there’s shadow testing as an offline-like verification step: you can shadow the new model for a day, then deploy it with a canary or a linear shift. They’re complementary.

The picks are not either/or for a production-critical model. Shadow gives you confidence the model won’t crash under real load. Canary or linear then validates it against user-observable metrics. For most mature ML deployments, both are part of the process.
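As a rough illustration of the linear option described above, the sketch below builds the DeploymentConfig dictionary we might pass to update_endpoint, plus the cumulative traffic schedule it implies. The helper names (linear_deployment_config, traffic_schedule) are made up for this post, and the 10 to 50 percent bound on the step size reflects my reading of the documented limits, so it is worth re-checking before relying on it.

```python
def linear_deployment_config(step_percent, wait_seconds, alarm_names,
                             termination_wait_seconds=600):
    """Build a DeploymentConfig dict for a linear blue/green shift.

    Hypothetical helper: it only assembles the request fragment,
    it does not call AWS.
    """
    # Assumed limit: percent-based linear steps must be 10-50% of capacity.
    if not 10 <= step_percent <= 50:
        raise ValueError("step_percent should be between 10 and 50")
    return {
        "BlueGreenUpdatePolicy": {
            "TrafficRoutingConfiguration": {
                "Type": "LINEAR",
                "LinearStepSize": {"Type": "CAPACITY_PERCENT",
                                   "Value": step_percent},
                # Bake time between steps, during which alarms are watched.
                "WaitIntervalInSeconds": wait_seconds,
            },
            "TerminationWaitInSeconds": termination_wait_seconds,
        },
        "AutoRollbackConfiguration": {
            "Alarms": [{"AlarmName": name} for name in alarm_names],
        },
    }


def traffic_schedule(step_percent):
    """Cumulative share of traffic on the new fleet after each step."""
    return list(range(step_percent, 100, step_percent)) + [100]


config = linear_deployment_config(25, 600, ["recsys-prod-5xx-rate"])
print(traffic_schedule(25))
```

With a real client the dictionary would be passed as the DeploymentConfig argument of update_endpoint. A 25% step gives four gates; a 10% step gives ten, which is where the longer deploy window comes from.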

What we’ll filter on

  1. User exposure: how many real users see output from the new model during the test?
  2. Rollback speed: if something breaks, how fast does it revert?
  3. Signal quality: how confident are we that the test reveals regressions?
  4. Extra cost during rollout: how much additional compute do we pay for?
  5. Automatic vs manual progression: does AWS advance the rollout, or do we?

The deployment-pattern landscape

1. Shadow. The new model runs alongside the old, receiving a copy of every request. Output is captured but never returned to users. Zero user exposure, full traffic-distribution signal. Pay for a parallel fleet for the shadow window (typically 1-7 days). Rollback is instant because users were never on the new model.

2. Blue-green canary. New model gets 10% of traffic (the canary) for a timed window, then 100% if alarms stay clear. Limited user exposure. Fast automatic rollback if alarms fire. Pay for both fleets during the canary window (typically 10-60 minutes).

3. Blue-green all-at-once with alarms. New model gets 100% of traffic immediately; alarms trigger rollback. Fastest deploy, full user exposure at once, still has rollback safety net. Pay for both fleets only during the brief transition.

4. Linear traffic shifting. Shift 10% at a time over N steps, each gated on alarms. Extends the exposure window but allows finer detection of regressions that ramp up with traffic. Pay for both fleets during the linear period.

5. Endpoint variants with manual weights. Deploy the new model as a second variant on the same endpoint and manually adjust DesiredWeight for each variant. Control is total but operator-driven; no automatic rollback. Useful for A/B experiments that need to persist for weeks, not for short deployment windows.

6. Two separate endpoints behind a router. Run old and new endpoints side by side; let the application route between them based on whatever logic we like (cohort, feature flag, header). Maximum flexibility, operated entirely by the application. Common for ML experiments that want per-user cohorting rather than percent-of-traffic.
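For pattern 5, a minimal sketch of what a manual weight change could look like. It assumes both variants (named v13 and v14 here purely for illustration) already exist on the endpoint, since UpdateEndpointWeightsAndCapacities adjusts existing variants rather than adding new ones. The helper names are hypothetical and the code only builds the request.

```python
def weight_update_request(endpoint_name, weights):
    """Build kwargs for update_endpoint_weights_and_capacities.

    weights maps variant name -> relative weight. Traffic is split in
    proportion to each weight over the sum of all weights.
    """
    if not weights or any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-empty and non-negative")
    if sum(weights.values()) == 0:
        raise ValueError("at least one variant needs a positive weight")
    return {
        "EndpointName": endpoint_name,
        "DesiredWeightsAndCapacities": [
            {"VariantName": name, "DesiredWeight": float(weight)}
            for name, weight in sorted(weights.items())
        ],
    }


def traffic_shares(weights):
    """Fraction of requests each variant should receive."""
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


request = weight_update_request("recsys-prod", {"v13": 80, "v14": 20})
# With real credentials this would be applied as:
#   boto3.client("sagemaker").update_endpoint_weights_and_capacities(**request)
print(traffic_shares({"v13": 80, "v14": 20}))
```

Rolling back in this pattern is the same call with the weights reversed, which is why the recovery time is "as fast as you".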

Side by side

| Pattern | User exposure | Rollback speed | Signal quality | Extra cost window | Automatic progression |
|---|---|---|---|---|---|
| Shadow | 0% | n/a (not live) | Full traffic distribution, no user impact | Days (2x fleet) | n/a |
| Blue-green canary | 10% | Seconds (automatic) | User metrics on 10% | Minutes to ~1 hour | Yes |
| Blue-green all-at-once | 100% | Seconds (automatic) | User metrics on 100% | Minutes | Yes |
| Linear traffic shifting | 0% → 100% staged | Seconds (automatic) | User metrics at each stage | Up to hours | Yes |
| Manual weights | Configurable | Manual (as fast as you) | Configurable | Indefinite | No (operator) |
| Two endpoints + router | Router’s choice | App-level | App-level | Indefinite | No (application) |

Reading by question:

  • “Does v14 break under production traffic distribution?” → Shadow. Run v14 as a shadow variant for 24-72 hours; verify latency, errors, and output distribution against v13.
  • “Will users notice if v14 regresses?” → Canary. 10% canary for 30 minutes, automatic rollback on alarm, promote if clear.
  • “Does v14 regress only under certain traffic levels?” → Linear. Shift in equal steps (e.g. 25%, 50%, 75%, 100%) with alarm gates at each.
  • “A/B test for 2 weeks to measure business impact” → Manual weights or separate endpoints. The guardrail patterns are deploy-time; A/B experiments are longer-lived.
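For the long-lived A/B case, the router usually needs a stable per-user assignment rather than a per-request coin flip. A minimal sketch, assuming the application holds a user ID and that the two endpoint names below exist (both names, the salt, and the function itself are hypothetical):

```python
import hashlib


def assign_endpoint(user_id, treatment_share=0.5, salt="recsys-v14-ab",
                    control="recsys-prod-v13", treatment="recsys-prod-v14"):
    """Deterministically map a user to the control or treatment endpoint.

    Hashing salt + user_id gives each user a stable bucket in [0, 1),
    so the same user always lands on the same model for the whole test.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return treatment if bucket < treatment_share else control


print(assign_endpoint("user-123", treatment_share=0.2))
```

Changing the salt reshuffles the cohorts, which is useful when a new experiment should not inherit the previous one’s split.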

For v14, the plan is: shadow for 48 hours first, then canary to 10%, then all-at-once to 100%. Shadow validates production-distribution behaviour; canary validates user metrics; final shift is fast.

Sequencing the patterns

Stage 1: Shadow (48 hours)

  • Traffic routing: v13 (live) → 100% of users; v14 (shadow) → copy of traffic
  • What we watch: v14 latency p50, p99; v14 error rate; v14 response distribution (compare top-K overlap vs v13)
  • Cost: 2x fleet for 48 h, ~$560 extra
  • Exit criteria: latency within 15% of v13; error rate < 0.1%; top-10 overlap > 0.75

Stage 2: Canary (30 minutes)

  • Traffic routing: v13 → 90%; v14 → 10%
  • What we watch: v14 error rate vs baseline; v14 p99 latency; user CTR on v14 slice (if available); 4xx + 5xx composite alarm
  • Cost: 2x fleet for 30 min, ~$6 extra
  • Exit criteria: no alarm fires for 30 minutes (if any fire → auto-rollback); rollback ≈ 30 seconds

Stage 3: Full (~5 minutes shift)

  • Traffic routing: v14 → 100%; v13 fleet torn down
  • What we watch: same alarms, persistent; business metrics over days; Model Monitor drift checks
  • Rollback: UpdateEndpoint → previous endpoint config (manual, seconds)

Total deploy: ~50 hours, ~$566 extra.

Shadow then canary then all-at-once. Each stage answers a different question: does it work, does it help, and then ship.
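The cost figures above hang together arithmetically. The sketch below back-derives the per-instance hourly rate implied by the quoted ~$560 for 48 hours of a second 6-instance fleet, then reuses it for the canary window. The rate is an inference from the post’s own number, not a price quote, so check current pricing before budgeting from it.

```python
INSTANCES = 6              # size of the extra fleet, from the scenario
SHADOW_HOURS = 48
CANARY_HOURS = 0.5
SHADOW_EXTRA_USD = 560.0   # figure quoted for the shadow stage

# Implied rate per instance-hour (derived, not an official price).
implied_rate = SHADOW_EXTRA_USD / (INSTANCES * SHADOW_HOURS)


def extra_fleet_cost(hours, instances=INSTANCES, rate=implied_rate):
    """Cost of running a duplicate fleet for the given number of hours."""
    return hours * instances * rate


canary_extra = extra_fleet_cost(CANARY_HOURS)
total_extra = SHADOW_EXTRA_USD + canary_extra

print(f"implied rate: ${implied_rate:.2f} per instance-hour")
print(f"canary extra: ${canary_extra:.2f}")
print(f"total extra:  ${total_extra:.2f}")
```

The shadow stage dominates: the canary window adds about one percent on top, so shortening the shadow run is the only lever that meaningfully changes the bill.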

The picks in depth

Stage 1: Shadow for 48 hours. Add v14 as a shadow variant by creating a new endpoint config that keeps v13 under ProductionVariants and lists v14 under ShadowProductionVariants, then calling UpdateEndpoint with that config. (UpdateEndpointWeightsAndCapacities only adjusts variants that already exist; it can’t add one.) AWS duplicates incoming traffic to both variants; only the primary’s response is returned. A CloudWatch dashboard shows v13 and v14 side by side for the 48 hours: p50 latency, p99, error rate, and (via a custom metric) the top-10 overlap between v14’s rankings and v13’s for the same request, a domain metric indicating they’re “behaving similarly”. If everything looks good after 48 hours, move to canary.
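The post doesn’t spell out how the top-10 overlap is computed, so the following is one plausible definition rather than the team’s actual metric: the fraction of v13’s top-k items that also appear in v14’s top-k for the same request, averaged over requests. It assumes item IDs are unique within a ranking.

```python
def top_k_overlap(ranking_a, ranking_b, k=10):
    """Share of the top-k items that two rankings have in common (0.0-1.0)."""
    top_a = set(ranking_a[:k])
    top_b = set(ranking_b[:k])
    return len(top_a & top_b) / k


def mean_top_k_overlap(ranking_pairs, k=10):
    """Average overlap across (v13_ranking, v14_ranking) pairs."""
    scores = [top_k_overlap(a, b, k) for a, b in ranking_pairs]
    return sum(scores) / len(scores)


v13_ranking = list(range(0, 10))    # items 0..9
v14_ranking = list(range(2, 12))    # items 2..11, so 8 items are shared
print(top_k_overlap(v13_ranking, v14_ranking))
```

This version ignores order within the top 10. A rank-aware measure would be stricter, but for a "behaving similarly" sanity check set overlap is usually enough.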

Stage 2: Canary at 10% for 30 minutes. Call UpdateEndpoint with a new endpoint config referencing v14 and DeploymentConfig.BlueGreenUpdatePolicy.TrafficRoutingConfiguration.Type=CANARY, CanarySize.Type=CAPACITY_PERCENT, Value=10, WaitIntervalInSeconds=1800. Also set AutoRollbackConfiguration.Alarms to our alarms for 5xx rate, p99 latency, and a custom CTR-proxy metric. SageMaker brings up a full new fleet alongside the old one, routes 10% of traffic there, and monitors the alarms for the configured wait period (30 minutes). If no alarm fires, it promotes; if any fires, it routes all traffic back to v13 and the new fleet is torn down.

Stage 3: All-at-once (from canary to 100%). The canary period auto-ends after its wait time; if no alarms fired, SageMaker shifts all traffic to the new variant and tears down the old one. Total extra-cost window: 48 hours of shadow + ~30 minutes of canary + ~5 minutes of fleet transition.

The alarms do the work. The whole pattern depends on good alarms. Our composite alarm for v14 includes:

  • 5xx count > 10 in any minute (container failure signal)
  • ModelLatency p99 > 400 ms for 2 consecutive minutes (latency regression)
  • CustomMetric:CTRProxy drop > 20% compared to baseline (quality regression proxy)

If the alarms are wrong (too sensitive and they roll back a good deploy; too loose and they miss a bad one), the guardrails don’t help. Tuning the alarms is the actual engineering work; the SageMaker features are the mechanism.
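To make the first two thresholds concrete, here is a sketch of the CloudWatch alarm definitions they might translate to. The functions only build keyword arguments for put_metric_alarm. The variant name AllTraffic is a placeholder, the helper names are invented for this post, and the CTR-proxy alarm is left out because the post doesn’t define that custom metric. One detail worth knowing: ModelLatency is reported in microseconds, so a 400 ms threshold becomes 400000.

```python
def _dimensions(endpoint, variant):
    return [
        {"Name": "EndpointName", "Value": endpoint},
        {"Name": "VariantName", "Value": variant},
    ]


def server_error_alarm(endpoint, variant, max_errors=10):
    """More than max_errors 5xx responses in any single minute."""
    return {
        "AlarmName": f"{endpoint}-5xx-rate",
        "Namespace": "AWS/SageMaker",
        "MetricName": "Invocation5XXErrors",
        "Dimensions": _dimensions(endpoint, variant),
        "Statistic": "Sum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": max_errors,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # no data means no errors
    }


def latency_alarm(endpoint, variant, threshold_ms=400):
    """p99 ModelLatency above threshold for 2 consecutive minutes."""
    return {
        "AlarmName": f"{endpoint}-latency-p99",
        "Namespace": "AWS/SageMaker",
        "MetricName": "ModelLatency",
        "Dimensions": _dimensions(endpoint, variant),
        "ExtendedStatistic": "p99",
        "Period": 60,
        "EvaluationPeriods": 2,
        "Threshold": threshold_ms * 1000,  # milliseconds -> microseconds
        "ComparisonOperator": "GreaterThanThreshold",
    }


def composite_rule(alarm_names):
    """Alarm rule that fires when any of the named alarms is in ALARM."""
    return " OR ".join(f'ALARM("{name}")' for name in alarm_names)


alarms = [server_error_alarm("recsys-prod", "AllTraffic"),
          latency_alarm("recsys-prod", "AllTraffic")]
print(composite_rule([a["AlarmName"] for a in alarms]))
# With real credentials: boto3.client("cloudwatch").put_metric_alarm(**alarms[0])
```

The evaluation periods are where most of the tuning happens: one period on the error count reacts within a minute, while two periods on latency trades a minute of delay for fewer false rollbacks.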

A worked rollout

Tuesday 14:00 UTC. Team kicks off v14 shadow:

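import boto3

sagemaker_client = boto3.client('sagemaker')

# Shadow variants are declared in the endpoint config. Model and config names are illustrative.
sagemaker_client.create_endpoint_config(
    EndpointConfigName='recsys-v13-with-v14-shadow',
    ProductionVariants=[{'VariantName': 'v13', 'ModelName': 'recsys-v13',
                         'InstanceType': 'ml.m6i.4xlarge', 'InitialInstanceCount': 6}],
    ShadowProductionVariants=[{'VariantName': 'v14-shadow', 'ModelName': 'recsys-v14',
                               'InstanceType': 'ml.m6i.4xlarge', 'InitialInstanceCount': 6}],
)
# Apply the config: v13 keeps answering users while v14 receives copies of the requests.
sagemaker_client.update_endpoint(
    EndpointName='recsys-prod', EndpointConfigName='recsys-v13-with-v14-shadow')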

Over the next 48 hours, the CloudWatch dashboard shows v14’s latency ~10% above v13 (acceptable), error rate identical (0.03%), top-10 overlap 0.82 (above 0.75 threshold). Good to proceed.
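The go/no-go decision at the end of the shadow window can be written down as a small gate so it isn’t judged by eye. This sketch encodes the three exit criteria from the stage plan (latency within 15% of v13, error rate below 0.1%, top-10 overlap above 0.75). The absolute latency numbers in the example are invented; the post only gives the ratio.

```python
from dataclasses import dataclass


@dataclass
class ShadowSummary:
    baseline_p99_ms: float    # v13 over the shadow window
    shadow_p99_ms: float      # v14 over the same window
    shadow_error_rate: float  # fraction, e.g. 0.0003 for 0.03%
    top10_overlap: float      # 0.0-1.0


def shadow_exit_failures(summary, max_latency_ratio=1.15,
                         max_error_rate=0.001, min_overlap=0.75):
    """Return the list of failed exit criteria; empty means proceed."""
    failures = []
    if summary.shadow_p99_ms > summary.baseline_p99_ms * max_latency_ratio:
        failures.append("latency more than 15% above baseline")
    if summary.shadow_error_rate >= max_error_rate:
        failures.append("error rate at or above 0.1%")
    if summary.top10_overlap <= min_overlap:
        failures.append("top-10 overlap at or below 0.75")
    return failures


# Observed values from the rollout; 200 ms baseline is a made-up figure.
observed = ShadowSummary(baseline_p99_ms=200.0, shadow_p99_ms=220.0,
                         shadow_error_rate=0.0003, top10_overlap=0.82)
print(shadow_exit_failures(observed) or "all exit criteria met")
```

Returning the list of failures rather than a boolean makes the deploy ticket self-explanatory when the answer is no.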

Thursday 14:05. Team initiates canary:

sagemaker_client.update_endpoint(
    EndpointName='recsys-prod',
    EndpointConfigName='recsys-v14-config',
    DeploymentConfig={
        'BlueGreenUpdatePolicy': {
            'TrafficRoutingConfiguration': {
                'Type': 'CANARY',
                'CanarySize': {'Type': 'CAPACITY_PERCENT', 'Value': 10},
                'WaitIntervalInSeconds': 1800,
            },
            'TerminationWaitInSeconds': 600,
        },
        'AutoRollbackConfiguration': {'Alarms': [
            {'AlarmName': 'recsys-prod-5xx-rate'},
            {'AlarmName': 'recsys-prod-latency-p99'},
            {'AlarmName': 'recsys-prod-ctr-proxy-drop'},
        ]}
    }
)

SageMaker provisions a new fleet alongside v13’s. At 14:10, 10% of traffic starts flowing to v14. The team watches the dashboard and Slack; no alarms fire. At 14:40, the wait period ends with no rollback triggered. Traffic shifts to 100% v14 over the next ~5 minutes; the v13 fleet then lingers for the TerminationWaitInSeconds (10 minutes) before it terminates. By about 14:55, v13 is gone and v14 is serving 200 RPS. The team updates the model registry status and closes the deploy ticket.

What would have happened if an alarm fired at, say, 14:22: the AutoRollbackConfiguration would have shifted all traffic back to the v13 fleet, which stays fully provisioned throughout the canary, and the v14 fleet would have been torn down. (TerminationWaitInSeconds doesn’t apply here; it governs how long the old fleet lingers after a successful shift.) Total user-observable impact: the ~12 minutes of 10% canary traffic before rollback.
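Rather than watching the console, a deploy script can poll the endpoint and report which of the two outcomes happened. This is a sketch under one assumption worth verifying: that after an automatic rollback the endpoint returns to InService still pointing at the previous endpoint config. The function name and return values are hypothetical; describe_endpoint and its EndpointStatus and EndpointConfigName fields are the real API.

```python
import time


def wait_for_deployment(client, endpoint_name, target_config,
                        poll_seconds=30, timeout_seconds=3600,
                        sleep=time.sleep):
    """Poll until the endpoint settles; return 'promoted' or 'rolled_back'."""
    waited = 0
    while True:
        desc = client.describe_endpoint(EndpointName=endpoint_name)
        status = desc["EndpointStatus"]
        if status == "InService":
            # Assumption: a rollback leaves the old config name in place.
            if desc["EndpointConfigName"] == target_config:
                return "promoted"
            return "rolled_back"
        if status not in ("Updating", "RollingBack", "SystemUpdating"):
            raise RuntimeError(f"unexpected endpoint status: {status}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"{endpoint_name} still {status} after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds


# Usage with a real client (not executed here):
#   outcome = wait_for_deployment(boto3.client("sagemaker"),
#                                 "recsys-prod", "recsys-v14-config")
```

The sleep parameter is injectable so the loop can be exercised in a unit test without actually waiting.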

What’s worth remembering

  1. Shadow, canary, and linear each answer different questions. Shadow: does it work on production traffic? Canary: does it regress for users? Linear: does it regress at higher load? Use them in sequence for high-stakes deploys, or pick the one that answers the question you have.
  2. Shadow variants have zero user exposure. Both models run; only the primary’s response is returned to the caller. Pay for a parallel fleet for the test window. Ideal for validating latency and error behaviour before exposing anyone.
  3. Blue-green canary is the production deploy default. 10% canary for 10-60 minutes, CloudWatch alarms gate the promotion, automatic rollback on any alarm. Minimal user exposure, fast recovery.
  4. Linear traffic shifting extends the canary over multiple steps. Each step is its own gate. Catches regressions that emerge gradually or under higher load. Longer deploy window, finer-grained protection.
  5. All-at-once is only safe with excellent alarms. Without guardrails, it’s reckless. With them, it’s a reasonable deploy for a trusted pipeline after shadow and canary have already cleared.
  6. The alarms are the actual engineering. SageMaker’s mechanism is only as good as the composite alarms that gate it. Thresholds, evaluation periods, and metric choice all matter more than which deployment pattern we picked.
  7. Model Registry approval is orthogonal. Approval in Model Registry says “this model is ready for production”; the deployment pattern is how it gets there. Both need to happen.
  8. Separate endpoints or manual weights are for experiments, not deploys. A/B tests that run for weeks use manual weights or router-based cohorting. The built-in guardrails are for short deploy windows, not long experiments.

Deploying a new model version is a risk-transfer operation: we’re trading effort during the deploy (shadow setup, alarm tuning, canary waiting) for smaller blast radius and faster recovery if something breaks. The three patterns are three different ways to spend that effort. Shadow buys certainty about production-distribution behaviour. Canary buys limited user exposure with automatic rollback. Linear spreads the canary across multiple traffic levels. For anything user-facing, the question isn’t which pattern to pick; it’s how many to use in sequence. The answer for this recsys team is shadow, then canary, then a full shift, and the total extra cost is under a thousand dollars for the peace of mind of a good rollout.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.