The situation
A payments-fraud ML team deploys a new version of their fraud classifier every six weeks. The endpoint serves ~800 inferences per second, p99 latency is under 40ms, and the business metric that matters (fraud recall without tanking precision) is measured weekly against settled disputes.
Recent deployment history:
- V3 → V4 (six months ago): blue/green via UpdateEndpoint. Instant cutover. V4 regressed on a specific merchant category; it took three days to notice, and the rollback cost the team a credibility point with the payments team.
- V4 → V5 (three months ago): same pattern. Went fine, but the team spent the deploy day watching dashboards manually.
- V5 → V6 (now): the team wants an automated rollout that shifts traffic gradually, monitors a metric during the shift, and rolls back on its own if something goes wrong.
SageMaker endpoints support this through production variants and deployment configurations. The interesting questions are: which shape of traffic shift fits which risk profile, what metrics the automation watches, and what “shadow testing” buys on top of it.
What actually matters
A deployment rollout is, fundamentally, a choice about which users see the new version and when, and what you do if the new version is bad.
At one extreme: blue/green, cut all users over at once. Minimum duration of exposure if the new version is bad, assuming you notice quickly; maximum blast radius when it is bad.
At the other extreme: a full A/B test with no end date. Keep both versions running, split traffic 50/50, and make the new version’s superiority part of the experiment design. Maximum visibility, but production is now two systems indefinitely.
Between these, a family of gradual-rollout patterns:
Canary: send a small percentage of traffic to the new version while the rest continues on the old. Watch metrics. If the canary looks fine after a dwell time, promote; if not, roll back.
Linear traffic shifting: move traffic in equal steps (for example 10%, 20%, 30%, and so on up to 100%) with a dwell at each step. Less risky than blue/green because you can catch problems at 10% before they hit 100%.
Shadow testing: send production traffic to both versions, but only use the old version’s responses for real users. The new version sees every request, its outputs are logged, nothing affects users. The purest form of “test in production with zero risk.” Often used before any of the above shifting patterns, as a prelude.
All of these shapes are built on the same two primitives: an endpoint configuration that can declare multiple model variants with weights, and a deployment configuration that the platform can drive on our behalf, shifting the weights on a schedule, watching alarms, rolling back automatically when they fire.
Shadow testing is the specific feature that lets a variant receive a percentage of inbound requests asynchronously, log its predictions, and never contribute to the response served to the client. Metrics captured; blast radius zero.
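To make the weight primitive concrete: each production variant receives traffic in proportion to its weight divided by the sum of all weights. A minimal sketch of that arithmetic follows; the helper name `traffic_shares` and the example weights are ours, for illustration, and not part of any SDK.

```python
def traffic_shares(weights: dict[str, float]) -> dict[str, float]:
    """Return each variant's fraction of traffic: weight / sum of weights."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one variant needs a positive weight")
    return {name: w / total for name, w in weights.items()}


# Hypothetical mid-rollout state: V5 still carries most of the traffic.
print(traffic_shares({"v5": 9.0, "v6": 1.0}))
```

Only the ratio matters, so weights of 9 and 1 describe the same split as 0.9 and 0.1.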
What we’ll filter on
Five filters that shape the deployment choice:
- Risk tolerance: what’s the cost of users seeing a bad version for ten seconds / five minutes / two hours?
- Observability: what metric tells us the new version is bad, and how fast?
- Rollback trigger: automatic (CloudWatch alarm) or manual?
- Visibility into differences: do we need to compare old and new predictions on the same input?
- Dwell time per stage: how long do we wait between traffic shifts?
The deployment landscape
1. Blue/green (status quo). UpdateEndpoint with a new EndpointConfig that replaces the model. Default behaviour: old is replaced, new is serving. All traffic cuts over at once. Fast; risky.
2. Linear traffic shifting. UpdateEndpoint with DeploymentConfig.AutoRollbackConfiguration and TrafficRoutingConfiguration.Type: LINEAR. Shifts traffic in equal steps over a configurable duration. E.g., 10% every 5 minutes over 50 minutes. Stops and rolls back if any specified CloudWatch alarm breaches during the shift.
3. Canary traffic shifting. Same DeploymentConfig mechanism, but Type: CANARY. Sends a configured percentage (e.g., 10%) to the new version for a configurable dwell period, then either promotes the remaining 90% or rolls back. One-step-then-jump, as opposed to linear.
4. All-at-once with rollback alarms. Full cutover but with alarms that trigger rollback. Fast deploy; protects against bad versions that fail catastrophically; offers less protection against subtle regressions.
5. Multi-variant with weighted split (stable). An endpoint hosts multiple variants indefinitely, with a weight on each. Useful for true A/B testing where you want to gather data on two models in parallel over weeks. Not a deployment pattern per se; a production configuration.
6. Shadow variant. An endpoint can host a shadow variant: it receives a configured percentage of inbound traffic asynchronously, logs predictions, and doesn’t affect the client response. Used pre-deployment to confirm the new version behaves sensibly on real traffic without any user-facing risk. Promote it to a real variant once satisfied.
7. Inference Recommender + load test. Orthogonal to shifting: benchmark the new model on simulated traffic before any shift. Fits in ahead of the shift to de-risk capacity sizing.
Side by side
| Pattern | Risk profile | Observability | Rollback | Duration |
|---|---|---|---|---|
| Blue/green | High (all traffic at once) | Monitor post-cutover | Manual | Seconds |
| Linear shift | Low | Monitor at each step | Auto on alarm | 10 min - hours |
| Canary shift | Low-medium | Monitor canary cohort | Auto on alarm | Dwell + promote |
| All-at-once + alarm | Medium | Alarms catch fast failures only | Auto on alarm (fast) | Seconds |
| Multi-variant stable | Controlled (split) | Compare both in prod | Manual (update weights) | Indefinite |
| Shadow variant | Zero (no user exposure) | Compare predictions offline | n/a (not serving) | Days-weeks |
| Recommender + load test | Pre-deploy only | Synthetic metrics | n/a | Hours |
Reading the table against the fraud team’s scenario:
- They want automated rollout with automatic rollback: canary or linear with AutoRollbackConfiguration.
- They want visibility into per-merchant-category regression: shadow testing is useful before the shift to catch category-specific issues with zero user exposure.
- They want a short total duration (30 min - 2 hours is reasonable), not a blue/green instant flip and not a multi-week A/B.
Combine: shadow test V6 for a week before deploy day; on deploy day, use a linear shift in 10% steps with a 5-minute dwell (about 45 minutes of shifting) with CloudWatch alarms on precision, recall, latency, and error rate.
The shift, drawn
The pick in depth
Shadow variant for the pre-deploy week. Add V6 as a shadow variant on the existing endpoint:
```python
import boto3

sm = boto3.client("sagemaker")

config = sm.create_endpoint_config(
    EndpointConfigName="fraud-ep-config-v5-plus-shadow",
    ProductionVariants=[{
        "VariantName": "v5",
        "ModelName": "fraud-v5",
        "InstanceType": "ml.c6i.2xlarge",
        "InitialInstanceCount": 4,
        "InitialVariantWeight": 1.0,
    }],
    ShadowProductionVariants=[{
        "VariantName": "v6-shadow",
        "ModelName": "fraud-v6",
        "InstanceType": "ml.c6i.2xlarge",
        "InitialInstanceCount": 4,
        "InitialVariantWeight": 1.0,  # 100% of traffic shadowed
    }],
    DataCaptureConfig={
        "EnableCapture": True,
        "InitialSamplingPercentage": 100,
        "DestinationS3Uri": "s3://bucket/fraud-capture/",
        "CaptureOptions": [{"CaptureMode": "Input"}, {"CaptureMode": "Output"}],
    },
)

sm.update_endpoint(
    EndpointName="fraud-ep",
    EndpointConfigName="fraud-ep-config-v5-plus-shadow",
)
```
V5 serves real responses. V6 receives each request asynchronously and writes its predictions alongside the request ID. At end-of-week, the team runs an offline comparison:
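```sql
-- Hypothetical Athena query on the capture + shadow output
SELECT
  v5.merchant_category,
  AVG(v5.predicted_fraud) - AVG(v6_shadow.predicted_fraud) AS score_diff,
  COUNT(*) AS n_requests
FROM production_capture v5
JOIN shadow_capture v6_shadow USING (inference_id)
GROUP BY v5.merchant_category
-- Expression repeated: a SELECT alias may not resolve inside ABS() in ORDER BY
ORDER BY ABS(AVG(v5.predicted_fraud) - AVG(v6_shadow.predicted_fraud)) DESC;
```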
Any category with a large score divergence gets further investigation. Crucially, no user has yet seen V6’s output; the worst case of a terrible V6 is “we learn about it in the shadow log” rather than “we regressed in production.”
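What counts as a “large” divergence is a team decision. As a rough sketch of the triage step, assuming the query result has been loaded as (merchant_category, score_diff, request_count) tuples, something like the following could shortlist categories for a closer look. The function name and both thresholds are placeholders, not recommendations.

```python
def flag_divergent(rows, min_abs_diff=0.05, min_requests=500):
    """Return categories whose mean score shift is large and well-sampled.

    rows: iterable of (merchant_category, score_diff, n_requests) tuples,
    i.e. the shape produced by a per-category comparison query.
    """
    flagged = [
        (category, diff, n)
        for category, diff, n in rows
        if abs(diff) >= min_abs_diff and n >= min_requests
    ]
    # Largest absolute divergence first.
    return sorted(flagged, key=lambda r: abs(r[1]), reverse=True)


# Made-up rows for illustration only.
sample = [("grocery", 0.01, 120000), ("travel", -0.09, 8000), ("gaming", 0.20, 40)]
print(flag_divergent(sample))
```

The minimum request count keeps thinly sampled categories from dominating the list; whether to review those separately is a judgment call.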
Linear shift for deploy day. Once shadow testing passes, reconfigure V6 as a real production variant with a linear traffic-shifting deployment:
```python
sm.update_endpoint(
    EndpointName="fraud-ep",
    EndpointConfigName="fraud-ep-config-v6",
    DeploymentConfig={
        "BlueGreenUpdatePolicy": {
            "TrafficRoutingConfiguration": {
                "Type": "LINEAR",
                "LinearStepSize": {"Type": "CAPACITY_PERCENT", "Value": 10},
                "WaitIntervalInSeconds": 300,  # 5 min dwell per step
            },
            # Keep the V5 fleet for 10 min after the shift completes.
            "TerminationWaitInSeconds": 600,
            # Must cover every dwell plus the termination wait.
            "MaximumExecutionTimeoutInSeconds": 3600,
        },
        "AutoRollbackConfiguration": {
            "Alarms": [
                {"AlarmName": "fraud-ep-latency-p99"},
                {"AlarmName": "fraud-ep-5xx-error-rate"},
                {"AlarmName": "fraud-ep-precision-below-threshold"},
                {"AlarmName": "fraud-ep-recall-below-threshold"},
            ],
        },
    },
)
```
SageMaker runs the shift: 10% of traffic to V6 at minute 0, alarms checked for 5 minutes, 20% at minute 5, and so on, reaching 100% at minute 45. If any alarm breaches, the shift halts and SageMaker returns 100% of traffic to V5.
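The step arithmetic is easy to get wrong when planning a deploy window, so here is a small sketch of it. It assumes the first step lands at minute 0 and ignores both the time to provision the new fleet and the termination wait afterwards; `linear_schedule` is our own helper, not a SageMaker call.

```python
def linear_schedule(step_percent: int, wait_seconds: int) -> list[tuple[float, int]]:
    """Return (minutes_since_first_step, percent_on_new_fleet) pairs."""
    if not 0 < step_percent <= 100:
        raise ValueError("step_percent must be in (0, 100]")
    schedule = []
    percent, elapsed = 0, 0
    while percent < 100:
        percent = min(100, percent + step_percent)
        schedule.append((elapsed / 60, percent))
        elapsed += wait_seconds  # dwell before the next step
    return schedule


steps = linear_schedule(step_percent=10, wait_seconds=300)
print(steps[0], steps[-1], len(steps))
```

With a 10% step and a 300-second dwell this yields ten steps, the last one at minute 45. Provisioning time before the first step and TerminationWaitInSeconds after the last both add to the total, which is worth checking against MaximumExecutionTimeoutInSeconds.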
Custom alarms on precision and recall need to be driven by the capture + label pipeline (separate from SageMaker). Model Monitor can power these metrics if configured, or the team’s own pipeline that joins predictions with labels and publishes to CloudWatch.
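A minimal sketch of the “own pipeline” option: compute precision and recall from joined (prediction, label) pairs and shape them as CloudWatch custom metrics. The namespace, metric names, and the variant name `v6` are assumptions for illustration; the join itself is taken as given, and the alarms would still need to be defined against these metrics.

```python
def precision_recall(pairs):
    """pairs: iterable of (predicted_fraud: bool, actual_fraud: bool)."""
    tp = fp = fn = 0
    for predicted, actual in pairs:
        if predicted and actual:
            tp += 1
        elif predicted:
            fp += 1
        elif actual:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    return precision, recall


def metric_data(pairs, endpoint_name, variant_name):
    """Build the MetricData list for CloudWatch put_metric_data."""
    precision, recall = precision_recall(pairs)
    dims = [
        {"Name": "EndpointName", "Value": endpoint_name},
        {"Name": "VariantName", "Value": variant_name},
    ]
    return [
        {"MetricName": name, "Dimensions": dims, "Value": value, "Unit": "None"}
        for name, value in (("Precision", precision), ("Recall", recall))
        if value is not None  # skip a metric whose denominator is empty
    ]


def publish(pairs, endpoint_name="fraud-ep", variant_name="v6"):
    import boto3  # imported lazily so the pure helpers run without AWS access

    data = metric_data(pairs, endpoint_name, variant_name)
    if data:
        boto3.client("cloudwatch").put_metric_data(
            Namespace="FraudModel/Quality",  # hypothetical namespace
            MetricData=data,
        )
```

Because labels arrive with a delay, a metric built this way describes older predictions; how the alarm should treat missing or stale data during a five-minute dwell is something the team has to decide explicitly.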
When canary is the correct choice instead. For very low-risk deploys with strong shadow confidence, canary (10% for 30 minutes, then jump to 100%) is faster than linear and nearly as safe. For high-risk deploys, linear’s gradual ramp gives more observation points per shift.
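For comparison with the linear configuration, a canary variant of the same DeploymentConfig might look like the sketch below. The helper function is ours; the timeout values are carried over from the linear example as assumptions, and the alarm names are whatever the team has already defined.

```python
def canary_deployment_config(canary_percent, dwell_seconds, alarm_names):
    """Build a DeploymentConfig dict for a canary blue/green update."""
    return {
        "BlueGreenUpdatePolicy": {
            "TrafficRoutingConfiguration": {
                "Type": "CANARY",
                "CanarySize": {"Type": "CAPACITY_PERCENT", "Value": canary_percent},
                "WaitIntervalInSeconds": dwell_seconds,  # canary dwell
            },
            "TerminationWaitInSeconds": 600,
            "MaximumExecutionTimeoutInSeconds": 3600,
        },
        "AutoRollbackConfiguration": {
            "Alarms": [{"AlarmName": name} for name in alarm_names],
        },
    }


# 10% for 30 minutes, then the remaining 90% in one jump.
cfg = canary_deployment_config(
    canary_percent=10,
    dwell_seconds=1800,
    alarm_names=["fraud-ep-latency-p99", "fraud-ep-5xx-error-rate"],
)
# Passed as: sm.update_endpoint(EndpointName=..., EndpointConfigName=...,
#                               DeploymentConfig=cfg)
```

The only structural difference from the linear case is Type and the use of CanarySize in place of LinearStepSize.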
When stable multi-variant is the correct choice. When the team wants an explicit A/B to compare V5 and V6 over weeks, not a transient rollout. Configure both with 50/50 weights, gather data, promote based on statistical significance.
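“Statistical significance” can mean several things; one common choice for comparing two rates (for example, recall on each variant’s labeled fraud cases) is a two-proportion z-test. The sketch below is one such option under a normal approximation, not the team’s stated method, and the counts in the example are made up.

```python
import math


def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """Two-sided two-proportion z-test; returns (z, p_value)."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variance: both rates are 0 or both are 1")
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


# Made-up counts: fraud cases caught out of 1,000 labeled fraud cases per variant.
z, p = two_proportion_z(successes_a=730, n_a=1000, successes_b=740, n_b=1000)
print(round(z, 2), round(p, 2))
```

With counts like these, a one-point recall difference is well within noise, which is part of why a stable A/B needs weeks of labeled traffic and not a single deploy window.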
A worked rollout
Deploy day for V6:
- Day -7: shadow variant attached. Capture begins.
- Day -1: shadow analysis clean. No significant per-category divergence. V6 precision on the shadow capture is 0.82 (vs V5’s 0.81); recall is 0.74 (vs V5’s 0.73).
- Day 0, 09:00 UTC: linear shift initiated. Alarms armed.
- 09:05: 10% V6. Alarms quiet. Latency p99 on V6 variant: 38ms.
- 09:10: 20% V6. Alarms quiet.
- 09:15: 30% V6. Alarms quiet. Recall on the combined endpoint starts climbing.
- 09:20: 40% V6. Alarms quiet.
- … through 09:50 at 100% V6. Alarms quiet throughout.
- 10:00: SageMaker tears down V5 instances (TerminationWaitInSeconds elapsed). V6 is the sole variant.
- 10:05: on-call pages go silent; deploy is complete.
Compared to previous “flip and pray” deploys: about 45 minutes of gradual shift with automated safety, vs seconds of cutover followed by three days of manual dashboard-watching. Same engineering effort to set up; much smaller blast radius if the next deploy ever does regress.
What’s worth remembering
- SageMaker endpoints support multiple deployment shapes. Blue/green, canary, linear shift, multi-variant stable, shadow variant.
- Linear and canary live inside DeploymentConfig.BlueGreenUpdatePolicy. TrafficRoutingConfiguration picks the shape; WaitIntervalInSeconds sets dwell; LinearStepSize.Value sets step size.
- AutoRollbackConfiguration is the safety net. Point at CloudWatch alarms; any breach during the shift reverts to the previous variant.
- Shadow variants let V6 see production traffic with zero user exposure. Configure as ShadowProductionVariants; outputs logged to S3 via data capture; compare offline.
- Custom alarms on ML metrics require a labelling pipeline. Precision and recall need labels; the CloudWatch alarm is fed by a job that joins predictions with labels and publishes the metric. Model Monitor can help.
- Multi-variant is for stable A/B, not rollout. Two variants with weights indefinitely; update weights to change the split. Use for experimental comparison over weeks, not for single-deploy safety.
- TerminationWaitInSeconds delays teardown of the old variant. Gives time to manually roll back via UpdateEndpoint with the old config if an alarm didn’t catch a problem.
- Inference Recommender and load testing are pre-deploy safety layers. Benchmark the new model under synthetic load before any traffic shift; de-risks capacity and latency.
“Shifting traffic without holding our breath” is an infrastructure story. The team can’t write better ML by watching dashboards harder; they can write better deployments by letting the endpoint do the shifting and letting alarms do the rolling back. The model getting better is a separate project.