How to Deploy a SageMaker Endpoint Across Multiple Regions

May 03, 2028 · 17 min read

The situation

The fraud-scoring model is a 280 MB XGBoost binary deployed to a SageMaker real-time endpoint in eu-west-1. The endpoint fronts a single variant running on three ml.m6i.xlarge instances behind SageMaker’s managed load balancer. Traffic is 1,200 inferences per second at peak, dominated by European customers. Each call sends ~400 features and gets back a score plus a top-reasons list. Today’s p50 latency is 40 ms end-to-end; the model’s own prediction time is 8 ms.

The Sydney customer is seeing 320-340 ms p50. The round-trip from Sydney to Dublin is about 290 ms; nothing about the model can fix physics. A Tokyo customer lands next month; a São Paulo customer the month after.
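The gap is simple arithmetic. A rough sketch using only the figures above; the additive model (long-haul round trip plus the in-region p50) is a simplifying assumption, not a measurement, and the 60 ms target is the one set out in the constraints that follow.

```python
# Rough latency budget. Figures come from the text; the additive model is an assumption.
EU_P50_MS = 40               # today's end-to-end p50 for European callers
SYDNEY_DUBLIN_RTT_MS = 290   # approximate round trip, Sydney to Dublin
TARGET_P50_MS = 60           # regional p50 target for the multi-region design


def remote_p50(rtt_ms: float, in_region_p50_ms: float = EU_P50_MS) -> float:
    """Estimate p50 for a caller who pays a long-haul round trip on top of the in-region p50."""
    return rtt_ms + in_region_p50_ms


sydney_estimate = remote_p50(SYDNEY_DUBLIN_RTT_MS)

# Extra latency a distant caller may spend before breaching the target.
extra_budget_ms = TARGET_P50_MS - EU_P50_MS
```

The estimate lands inside the observed 320-340 ms band. The leftover budget is small enough that only moving the endpoint closer can meet it.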

Four constraints shape the work:

Latency has to come down to roughly match the European experience, so regional p50 under 60 ms end-to-end.
A single model artifact is the source of truth. The data science team retrains nightly; they do not want four parallel training pipelines.
Consistency across regions matters. If São Paulo is running yesterday’s weights while Dublin has today’s, fraud analysts chasing anomalies will be chasing ghosts.
Cost will scale, but predictably. Four full-size endpoints is the honest cost of four regions; what we don’t want is four over-sized endpoints because we didn’t think about traffic shape per region.

What actually matters

“Multi-region” is not one decision. It unwinds into four pieces that can each be answered on its own:

The first decision is where the artifacts live. The trained model, the inference container, and the preprocessing feature pipeline (binary and YAML files) sit today in a single region’s object store. A multi-region deployment needs either one authoritative region and a replication story, or truly independent buckets with a publish pipeline. Object stores and container registries both have native cross-region replication features that copy on push, usually within seconds, preserving metadata. Both of these make it easy to say “there’s one place we push from, and four places it lands”.

The second decision is where inference runs. A real-time endpoint is a regional resource: it lives in one region’s control plane, scales inside that region, and has a regional DNS name. Moving a workload to four regions means creating and managing four endpoints, one per region, each with its own autoscaling configuration, alarms, and model data reference. Which in turn raises the question of how we create them consistently: multi-account infra-as-code, multi-region modules, or the CI/CD pipeline calling the region-specific API four times. The operational answer matters as much as the architecture.

The third decision is how traffic reaches the nearest endpoint. There are three shapes worth naming. Latency-based DNS with a health check per regional endpoint: clients resolve to the closest healthy endpoint, the pool rebalances if a region fails. A CDN with origin failover groups: the client always talks to the CDN edge, which picks a regional origin per failover config, useful when we want a single URL and caching semantics. Anycast over the cloud backbone: shared IPs that route to the healthiest regional endpoint over the provider’s network, giving some latency benefit on the client-to-network hop, not just the network-to-region hop.

The fourth decision is how the model stays consistent across regions. Nightly retraining produces a new artifact. Four regions each need to pick it up; the question is whether they pick it up simultaneously (synchronous rollout, risky if the new model has an issue) or in waves (canary one region first, then the rest). Deployment guardrails (blue/green, canary traffic shifting, rollback alarms) are configured per endpoint, so each region carries its own set and none of them knows about the others. A cross-region orchestrator has to promote the artifact region by region, with verification between each step.

What we’ll filter on

Regional latency: does the architecture bring the model geographically close to the user?
Failover behaviour: what happens if a region fails mid-request, mid-training, or mid-deploy?
Artifact consistency: how do we know all regions are running the same model version?
Operational overhead: how many moving parts does the team run vs the single-region baseline?
Cost shape: do we pay for capacity in regions that aren’t used?

The multi-region landscape for ML endpoints

1. Single-region endpoint (status quo). One region, one endpoint, one artifact location. Lowest operational overhead, best artifact consistency (there’s only one), worst latency for anyone whose network round trip to the region runs to a few hundred milliseconds. The baseline every other option is measured against.

2. Active-active multi-region endpoints + Route 53 latency routing. N endpoints, one per target region, each a mirror of the others. Route 53 resolves the endpoint hostname to the regional variant with the lowest round-trip from the client. Health checks in Route 53 withdraw a failing region from the pool in 30-60 seconds. Artifact consistency enforced by an S3 CRR pipeline plus a deploy orchestrator. Higher cost (N endpoints, N minimum-instance floors) but native regional-latency story.

3. Active-active multi-region + CloudFront. Same endpoint layout, but CloudFront in front with origin failover groups. Makes more sense when the inference is HTTP-cacheable (rare for real-time fraud scoring) or when a single branded URL matters. Adds CloudFront latency overhead on cache-miss and another thing to operate; useful mostly when a CDN is already in the architecture for other reasons.

4. Active-active multi-region + Global Accelerator. Anycast IPs route clients over the AWS backbone to the healthiest regional endpoint, with sub-30-second failover on region-level health. Latency benefit is on the first hop (client-to-AWS edge) rather than only the AWS-to-region hop; when the client is on mobile or non-AWS carriers, the improvement is real.

5. Active-passive multi-region. One region takes all traffic; a second region is warm-standby with weights pre-loaded and endpoints at minimum capacity, failed over to via DNS when the primary fails. Cheaper than full active-active (the passive region’s endpoints are scaled to near-zero). Doesn’t address latency (distant users still talk to the primary), but is the correct shape when the driver is disaster recovery, not latency. Worth naming because “multi-region” often means this, not active-active.

6. SageMaker Multi-Model Endpoint (MME) per region. Orthogonal to the region question: if we have many models, MME lets one endpoint host thousands, loaded on demand. Combining MME with multi-region deployment is a reasonable design when per-customer or per-tenant models would otherwise need individual endpoints, but it doesn’t answer the regional-latency question by itself.

Side by side

Option	Regional latency	Failover	Artifact consistency	Operational overhead	Cost shape
Single region	✗	✗ (region loss = outage)	✓ (only one)	Low	1x
Active-active + Route 53 latency-based routing	✓	✓ (DNS TTL delay)	Pipeline-enforced	High	~N x
Active-active + CloudFront	✓	✓ (origin failover)	Pipeline-enforced	High	~N x + CDN
Active-active + Global Accelerator	✓ (incl. first hop)	✓ (fast anycast)	Pipeline-enforced	High	~N x + GA
Active-passive	✗ (latency unchanged)	✓ (explicit promotion)	Pipeline-enforced	Medium	~1.2x
MME per region	orthogonal	orthogonal	orthogonal	adds MME complexity	depends on fleet

Reading the table by driver rather than by option:

Driver is latency: active-active, a regional endpoint per geographic cluster of customers. Route 53 latency-based routing is the default router because it’s the simplest; Global Accelerator is the choice when the first hop matters.
Driver is disaster recovery with strict RTO/RPO: active-passive with an orchestrated failover, scaled to the cost the business accepts.
Driver is both: active-active and accept the cost, because active-active is a DR story with lower RTO than active-passive by construction.

The fraud scoring workload wants active-active for latency. The rest of the work is about how to land that cleanly.

The deployment shape

One training pipeline, four regional endpoints. S3 Cross-Region Replication spreads the artifact; CodePipeline promotes it with verification gates; Route 53 latency-based routing steers each client to its closest healthy endpoint.

The picks in depth

S3 Cross-Region Replication for the artifact fan-out. CRR is configured once per source-destination pair, runs asynchronously, and preserves KMS encryption through a per-region CMK that the destination bucket’s replication role can use. The typical latency is seconds for a 280 MB artifact, faster than any pipeline step that follows it. CRR is the transport; the promotion is a separate decision. We don’t want all four regions picking up the latest artifact automatically; we want a pipeline that verifies each region before moving to the next.
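The fan-out can be sketched as one replication rule per destination. This is a minimal sketch, not the configuration actually in use: the bucket names follow the fraud-models-* pattern used in this post, while the role ARN, account ID, and KMS key ARNs are placeholders. The dictionary has the shape boto3's `put_bucket_replication` accepts as `ReplicationConfiguration`. Versioning has to be enabled on the source and destination buckets.

```python
# Hypothetical identifiers: the account ID, role name, and key IDs are placeholders.
REPLICATION_ROLE_ARN = "arn:aws:iam::111122223333:role/fraud-models-crr"
DESTINATIONS = {
    "fraud-models-au": "arn:aws:kms:ap-southeast-2:111122223333:key/EXAMPLE-AU",
    "fraud-models-jp": "arn:aws:kms:ap-northeast-1:111122223333:key/EXAMPLE-JP",
    "fraud-models-br": "arn:aws:kms:sa-east-1:111122223333:key/EXAMPLE-BR",
}


def replication_config(role_arn: str, destinations: dict) -> dict:
    """Build one replication rule per destination bucket, each with a distinct priority."""
    rules = []
    for priority, (bucket, kms_key_arn) in enumerate(sorted(destinations.items()), start=1):
        rules.append({
            "ID": f"replicate-to-{bucket}",
            "Priority": priority,
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # empty prefix: replicate every object
            "DeleteMarkerReplication": {"Status": "Disabled"},
            # Opt in to replicating objects encrypted with KMS at the source.
            "SourceSelectionCriteria": {"SseKmsEncryptedObjects": {"Status": "Enabled"}},
            "Destination": {
                "Bucket": f"arn:aws:s3:::{bucket}",
                # Replicas are re-encrypted with a key that lives in the destination region.
                "EncryptionConfiguration": {"ReplicaKmsKeyID": kms_key_arn},
            },
        })
    return {"Role": role_arn, "Rules": rules}


config = replication_config(REPLICATION_ROLE_ARN, DESTINATIONS)
# With boto3 this would be applied roughly as:
#   s3.put_bucket_replication(Bucket="fraud-models-eu", ReplicationConfiguration=config)
```

Nothing here promotes a model. The rules only guarantee the bytes arrive, which is why the pipeline described next stays a separate concern.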

CodePipeline + Lambda orchestration for region-by-region promotion. The pipeline’s stages: Build (training), Stage Artifact (writes to the eu-west-1 S3 bucket, triggers CRR to the other three), Register (shadow evaluation and approval in SageMaker Model Registry), Promote EU (calls SageMaker UpdateEndpoint with an endpoint config that references the new model, waits for deployment guardrails to finish, checks a canary metric for 30 minutes), then Promote AU, JP, BR in sequence with the same verification. A rollback Lambda watches a CloudWatch composite alarm per region and reverts the endpoint to the previous model if ModelLatency or 5xx rate crosses threshold. The pipeline is the promotion lever; CRR is the transport.
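The promotion loop itself is small once the AWS calls are abstracted away. The sketch below is an illustration, not the pipeline's actual Lambda code. `deploy`, `verify`, and `rollback` are hypothetical callables standing in for the UpdateEndpoint call, the 30-minute canary check, and the rollback Lambda. One design choice is assumed: a failed verification halts the remaining regions (the model is suspect), while a failed deploy call rolls back only that region and moves on (the region is suspect).

```python
from typing import Callable, Dict, List


def promote_in_waves(
    version: str,
    regions: List[str],
    deploy: Callable[[str, str], None],
    verify: Callable[[str, str], bool],
    rollback: Callable[[str], None],
) -> Dict[str, str]:
    """Promote `version` one region at a time and report what happened in each region."""
    outcome: Dict[str, str] = {}
    halted = False
    for region in regions:
        if halted:
            # An earlier region rejected the model; do not spread it further.
            outcome[region] = "skipped"
            continue
        try:
            deploy(region, version)
        except Exception:
            # Regional infrastructure problem: revert this region, keep going.
            rollback(region)
            outcome[region] = "deploy_failed"
            continue
        if verify(region, version):
            outcome[region] = "promoted"
        else:
            # The canary signal says the model is bad: revert and stop the rollout.
            rollback(region)
            outcome[region] = "rolled_back"
            halted = True
    return outcome
```

Passing the regions in descending traffic order puts the strongest signal first. That is the only ordering logic the loop needs.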

SageMaker real-time endpoints in four regions with per-region autoscaling. Each regional endpoint is sized to its local traffic, not to the total. EU has ~600 rps peak; AU has ~200 rps; JP has ~150 rps; BR has ~100 rps. Minimum instance count scales down overnight; autoscaling targets InvocationsPerInstance at a value that keeps per-instance latency under 30 ms. The total hourly capacity across four regions is greater than one combined region (because each has its own minimum floor), but the real cost driver is the minimums, not the peaks. A two-instance minimum per region × four regions = eight instances idle overnight vs. three before. Roughly 2.5× the baseline compute cost; the business is willing to pay it.
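The floor arithmetic and the per-region scaling setup can be sketched together. The region codes are inferred from the cities named in this post. The endpoint name, variant name, maximum capacity, and target value are hypothetical. The two dictionaries have the shape of the Application Auto Scaling `register_scalable_target` and `put_scaling_policy` requests. The predefined metric counts invocations per instance per minute, so a per-second target has to be multiplied by 60.

```python
REGION_PEAK_RPS = {          # peaks from the text; region codes inferred from the cities
    "eu-west-1": 600,
    "ap-southeast-2": 200,
    "ap-northeast-1": 150,
    "sa-east-1": 100,
}
MIN_INSTANCES = 2            # per-region overnight floor
BASELINE_INSTANCES = 3       # today's single-region fleet


def scaling_requests(endpoint_name, variant_name, min_capacity, max_capacity, target_per_minute):
    """Build the scalable-target and target-tracking policy requests for one regional endpoint."""
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    target = {**common, "MinCapacity": min_capacity, "MaxCapacity": max_capacity}
    policy = {
        **common,
        "PolicyName": f"{endpoint_name}-invocations-target",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": float(target_per_minute),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            "ScaleInCooldown": 300,   # scale in slowly
            "ScaleOutCooldown": 60,   # scale out quickly
        },
    }
    return target, policy


# Hypothetical sizing: 150 rps per instance -> 9000 invocations per minute.
target, policy = scaling_requests("fraud-scoring", "AllTraffic", MIN_INSTANCES, 6, 150 * 60)

floor_total = MIN_INSTANCES * len(REGION_PEAK_RPS)
floor_ratio = floor_total / BASELINE_INSTANCES
```

The floor ratio comes out at eight instances against three, a little above the rough 2.5× quoted above. It does not change however low the regional peaks are.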

Route 53 latency-based routing for the client DNS. The endpoint hostname fraud.acme.com has four A-alias records, each targeting a regional front door for the SageMaker endpoint: its regional VPC endpoint or an ALB in front of it. Each record has a latency-based routing policy keyed to the region, and a health check that polls a /health URL exposed by that front door every 30 seconds. Route 53 returns the record with the lowest latency from the resolver’s location. If the EU region’s health check fails, Route 53 stops returning the EU record and clients converge on the next-closest healthy region within a DNS TTL (60 seconds is a common setting). Global Accelerator is a valid alternative (especially for mobile-heavy traffic), but Route 53 is the lower-operational-weight default.
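The four records can be sketched as a single change batch. The hostname comes from this post; every DNS name, hosted-zone ID, and health-check ID below is a placeholder. The dictionary has the shape boto3's `change_resource_record_sets` accepts as `ChangeBatch`.

```python
# Placeholder targets: DNS names, alias zone IDs, and health-check IDs are hypothetical.
REGIONAL_TARGETS = {
    "eu-west-1": {"dns_name": "fraud-eu.example.net", "alias_zone_id": "ZEXAMPLEEU", "health_check_id": "hc-eu"},
    "ap-southeast-2": {"dns_name": "fraud-au.example.net", "alias_zone_id": "ZEXAMPLEAU", "health_check_id": "hc-au"},
    "ap-northeast-1": {"dns_name": "fraud-jp.example.net", "alias_zone_id": "ZEXAMPLEJP", "health_check_id": "hc-jp"},
    "sa-east-1": {"dns_name": "fraud-br.example.net", "alias_zone_id": "ZEXAMPLEBR", "health_check_id": "hc-br"},
}


def latency_change_batch(hostname: str, targets: dict) -> dict:
    """Build one latency-routed alias record per region for the same hostname."""
    changes = []
    for region, target in sorted(targets.items()):
        changes.append({
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": hostname,
                "Type": "A",
                "SetIdentifier": f"fraud-{region}",  # must be unique per record
                "Region": region,                    # selects latency-based routing
                "AliasTarget": {
                    "HostedZoneId": target["alias_zone_id"],
                    "DNSName": target["dns_name"],
                    "EvaluateTargetHealth": True,
                },
                "HealthCheckId": target["health_check_id"],
            },
        })
    return {"Comment": "latency-routed fraud scoring front doors", "Changes": changes}


batch = latency_change_batch("fraud.acme.com", REGIONAL_TARGETS)
# With boto3 this would be applied roughly as:
#   route53.change_resource_record_sets(HostedZoneId="<zone id>", ChangeBatch=batch)
```

All four records share one name and type. The `Region` and `SetIdentifier` fields are what turn them into a latency-routed set rather than four conflicting records.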

The promotion sequence matters. The default is “promote EU first, then AU, JP, BR” because EU has the most traffic and the fastest signal. If the new model breaks, we catch it on EU where we have the statistics, not on BR where 100 rps is too little signal to detect a subtle drift in 30 minutes.
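The signal gap can be put in numbers. A back-of-the-envelope sketch, assuming peak traffic throughout the window (an overestimate for a run around 03:00 UTC) and the 10% canary slice described in the promotion trace; the 1% rate used for comparison is arbitrary.

```python
import math


def canary_requests(rps: int, canary_percent: int = 10, window_s: int = 1800) -> int:
    """Requests that reach the canary fleet during one verification window."""
    return rps * canary_percent * window_s // 100


def rate_standard_error(rate: float, n: int) -> float:
    """Standard error of an observed proportion (binomial approximation)."""
    return math.sqrt(rate * (1 - rate) / n)


eu_n = canary_requests(600)   # EU peak rps from the text
br_n = canary_requests(100)   # BR peak rps from the text

# How much noisier the BR estimate is than the EU estimate for the same true rate.
noise_ratio = rate_standard_error(0.01, br_n) / rate_standard_error(0.01, eu_n)
```

EU sees six times as many canary requests as BR, so the same measured rate in BR carries roughly two and a half times the noise (the square root of six). A subtle drift is much easier to tell from chance in EU.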

A worked promotion trace

A new model version v427 finishes training at 03:00 UTC.

Training job writes s3://fraud-models-eu/v427/model.tar.gz. CRR kicks off; by 03:00:08 the replica exists in fraud-models-au, -jp, -br.
CodePipeline’s “Register” stage creates a new entry in SageMaker Model Registry with status PendingApproval, runs a shadow-evaluation Lambda on 10,000 held-out records, and, if AUC and false-positive rate are within tolerance, moves the status to Approved.
“Promote EU” stage calls UpdateEndpoint on the EU endpoint with the new endpoint config referencing v427. SageMaker’s deployment guardrails do a blue/green with 10% canary for 30 minutes. Alarm on ModelLatency > 20 ms p99 or 5xx > 1% rolls back automatically.
EU canary clears. Pipeline moves to “Promote AU”: same UpdateEndpoint call, same guardrails, but running against the AU endpoint. AWS-region-level operations are independent.
JP and BR follow in sequence. With four 30-minute canary windows run back to back, all four regions are on v427 a little after 05:00 UTC.
Route 53 never changes; the model artifact change is invisible to DNS. Clients see a brief canary period in each region as the blue/green swap happens, but their endpoint addresses don’t move.

What happens if the Tokyo region has a SageMaker control-plane issue at 04:20 while JP is mid-promotion: the promotion Lambda fails, the rollback Lambda reverts to the previous endpoint config (automatic), Route 53 still resolves clients to JP because the data plane is up, and the on-call is paged because the pipeline failed. EU, AU, BR are on v427; JP is back on v426. A re-run tomorrow catches JP up without blocking the others.
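The canary step in the trace can be expressed as an UpdateEndpoint request. This is a sketch under assumptions: the endpoint, endpoint-config, and alarm names are hypothetical, and the termination wait and overall timeout are illustrative values rather than the pipeline's settings. The dictionary has the shape boto3's `update_endpoint` accepts as keyword arguments.

```python
def canary_update_request(endpoint_name, endpoint_config_name, alarm_names,
                          canary_percent=10, bake_seconds=1800):
    """Build a blue/green UpdateEndpoint request with canary traffic shifting and auto-rollback."""
    return {
        "EndpointName": endpoint_name,
        "EndpointConfigName": endpoint_config_name,
        "DeploymentConfig": {
            "BlueGreenUpdatePolicy": {
                "TrafficRoutingConfiguration": {
                    "Type": "CANARY",
                    "CanarySize": {"Type": "CAPACITY_PERCENT", "Value": canary_percent},
                    # How long the canary serves traffic before the rest shifts over.
                    "WaitIntervalInSeconds": bake_seconds,
                },
                # Keep the old fleet around briefly after the shift (illustrative value).
                "TerminationWaitInSeconds": 300,
                # Must cover the bake time plus the termination wait (illustrative value).
                "MaximumExecutionTimeoutInSeconds": 3600,
            },
            # Any of these CloudWatch alarms firing during the update triggers a rollback.
            "AutoRollbackConfiguration": {
                "Alarms": [{"AlarmName": name} for name in alarm_names]
            },
        },
    }


request = canary_update_request(
    "fraud-scoring-jp",            # hypothetical endpoint name
    "fraud-scoring-jp-v427",       # hypothetical endpoint config name
    ["fraud-jp-model-latency-p99", "fraud-jp-5xx-rate"],  # hypothetical alarm names
)
# With boto3, against a client created for that region:
#   sagemaker.update_endpoint(**request)
```

The same function serves every region. Only the client's region and the names change, which matches the point that regional operations are independent.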

What’s worth remembering

Multi-region is four decisions, not one. Artifact location, endpoint placement, client routing, and cross-region promotion are independent. Pick each explicitly.
S3 Cross-Region Replication is the artifact transport. Seconds to replicate a 280 MB model. Preserves KMS encryption through a destination-region key. ECR has the same pattern for containers.
SageMaker endpoints are regional. One endpoint, one region, one set of autoscaling rules. Four regions means four endpoints with four sets of CloudWatch alarms and four minimum-instance floors.
Route 53 latency-based routing is the default router. Lowest operational weight, sub-minute failover via health checks. Global Accelerator when the first hop matters; CloudFront when a branded URL and origin failover are already in the architecture.
Promotion happens region by region, not globally. CodePipeline or Step Functions orchestrates: promote the highest-traffic region first (fastest signal), verify, move on. Rollback is per-region and automatic via alarms.
Deployment guardrails are per-endpoint. Blue/green, canary, linear, all-at-once: these are endpoint configuration choices made when UpdateEndpoint is called. Each region has its own.
Minimum-instance floors drive the multi-region cost. The peak scales with the traffic; the floor scales with the region count. Four regions at a two-instance minimum is eight idle instances overnight, not two. Size minimums regionally, not uniformly.
Active-passive is a different architecture. It solves disaster recovery, not latency. If the driver is “our users are far from our region”, active-active is the only answer that works. If the driver is “our region might fail”, active-passive is cheaper.
Artifact consistency is a pipeline problem, not a storage problem. CRR guarantees the bytes arrive; the pipeline guarantees the endpoints are serving the same version. These are two different jobs and both have to happen.
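That last point can be checked mechanically. A minimal sketch, assuming the version is encoded in the artifact path the way the worked trace shows it (s3://fraud-models-eu/v427/model.tar.gz) and that each region's model data URL has already been collected, for example by walking DescribeEndpoint, DescribeEndpointConfig, and DescribeModel in that region. The function names are hypothetical.

```python
import re
from typing import Dict, Optional

# Matches the version directory in paths shaped like .../v427/model.tar.gz
VERSION_PATTERN = re.compile(r"/(v\d+)/model\.tar\.gz$")


def version_of(model_data_url: str) -> Optional[str]:
    """Extract the version tag from a model artifact URL, or None if the path is unexpected."""
    match = VERSION_PATTERN.search(model_data_url)
    return match.group(1) if match else None


def version_drift(model_data_urls: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Return {} when every region serves the same version, else the region -> version map."""
    versions = {region: version_of(url) for region, url in model_data_urls.items()}
    distinct = set(versions.values())
    if len(distinct) == 1 and None not in distinct:
        return {}
    return versions


drift = version_drift({
    "eu-west-1": "s3://fraud-models-eu/v427/model.tar.gz",
    "ap-southeast-2": "s3://fraud-models-au/v427/model.tar.gz",
    "ap-northeast-1": "s3://fraud-models-jp/v426/model.tar.gz",
    "sa-east-1": "s3://fraud-models-br/v427/model.tar.gz",
})
```

Run on a schedule, a non-empty result is the signal that one region (Tokyo in this example) is behind and analysts should be told before they start chasing ghosts.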

The fraud model lands in four regions, each running the same artifact, each with its own autoscaling and alarms, each reachable by the customers closest to it. The training team still pushes once a day; the promotion happens region by region with signal-driven gates; the DNS layer makes it invisible to the client. What was one endpoint and one pipeline is now four of each: more moving parts, but moving in predictable directions, and Sydney is back down to 55 ms.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.