The situation
A B2B SaaS company serves three customer geographies from a single AWS account. Bedrock is the backbone for in-product AI features, a summarisation endpoint, an extraction endpoint, a chat assistant, all Claude Sonnet under the covers.
Measured over the last quarter of production traffic:
- Primary invocation region: us-east-1. Historical default; the application first shipped there.
- ~40 RPS sustained, bursting to ~80 RPS during the US/EU business-hours overlap.
- Throttles climbing. Bedrock returns
ThrottlingExceptionon roughly 2% of peak-hour requests against Sonnet, and the rate is growing month-on-month as adoption grows. - Idle capacity elsewhere. The team has separately probed eu-west-1 and ap-southeast-1 with the same model. Both regions accept traffic without throttling at the scenario’s peak rates.
- Customer distribution: roughly 50% US, 35% EU, 15% APAC. The product is browser-based; the AWS region the application calls doesn’t need to match the user’s location for latency.
- One product surface. Customers do not think of themselves as “EU customers” or “APAC customers”; the team wants the client-facing experience in one product, not three.
What actually matters
The first observation is that this isn’t a model problem, it’s a capacity-distribution problem. The team can make the throttles disappear without changing the model, changing the prompt, or changing the application logic. What has to change is where the request lands, and whoever owns that decision is also absorbing its operational weight: per-region quota tracking, retry ordering, health checks, version skew when a point release rolls to one region before another. That list is real work, and the cheapest version of the list is “AWS does it behind an API.”
The second is who pays for the spreading? Any mechanism that charges a premium for cross-region routing eats into the margin the product has on AI features, which for most SaaS companies is already tight. Any mechanism that adds a second system to operate charges in engineering time, which is worse. The shape worth holding out for is: same price, same SDK, same request, Bedrock decides which region serves it.
The third is what does residency mean for this product? US customers don’t automatically require US processing, but EU customers often contractually do. If the answer is “all EU customers must have their InferenceRunning a trained model to produce output – as opposed to training it. inside EU regions,” then the spreading mechanism has to respect that at the routing layer, not rely on the application remembering. A geography-scoped routing primitive is the cleanest expression of that constraint, the US call can’t accidentally land in Frankfurt, and the EU call can’t accidentally land in Ohio.
The fourth is what does the application have to know? A design where every call site knows “this tenant goes to eu-west-1, that tenant goes to ap-southeast-1, us-east-1 is the fallback” is a design where an IAM policy update touches twelve files. A design where tenant-to-geography is a lookup and a single virtual identifier is the only thing the SDK sees is a design where a new region entering the EU pool is picked up for free. The second shape is what makes the solution robust to the cloud provider’s own expansion.
The fifth is how does this compose with cost allocation? Finance will eventually ask for a per-feature or per-tenant breakdown of the AI bill, and the raw model invocation log doesn’t carry business context. A wrapper that lets tags propagate through cost reporting is the production-team default once finance asks the question; layering tagging on top of the routing primitive means the two don’t fight each other.
The sixth is what about the models that don’t participate? A managed routing mechanism typically covers the in-demand families but not every model. For a product on a popular tier, that’s fine. For a product using a niche embedding model or a custom imported model, the mechanism may not be available and the conversation shifts back to per-region direct calls or manual SDK-level balancing. The design has to check the coverage, not assume it.
Finally: what’s the fallback when every region in the routing pool is saturated? Rare, but possible during a platform-wide incident. The SDK’s existing exponential backoff on ThrottlingException still works, the pool exhausts every constituent before the client gives up. Raising the home-region quota remains additive: the pool gets the neighbours’ capacity, the quota-bump gets the floor at home.
What we’ll filter on
Five filters.
- Managed cross-region routing. A single API call that fans out across regions with capacity, without the application owning the decision per request.
- Quota relief. Throughput across the union of regional quotas, not the cap of a single region.
- No pricing penalty. The team is already paying per-token; load-spreading should not add a premium on top of the base model price.
- Model coverage. Whatever mechanism is chosen must support the actual models the product uses. Claude Sonnet here, not only a subset that happens to participate.
- Low operational overhead. The two-engineer platform team can’t afford a new dispatcher service. Policy changes should be a config edit, not a code release.
The Bedrock multi-region landscape
Four shapes for spreading Bedrock load across regions.
Cross-region inference profiles (also surfaced as system-defined inference profiles, wrapped by Application Inference Profiles when cost-allocation tagging is added). A Bedrock-managed virtual model ID that accepts a standard InvokeModel or Converse call and routes it to a constituent region with capacity. The ID is the base model ID prefixed with a geography:
us., routes across US Regions (typically us-east-1, us-east-2, us-west-2).eu., routes across EU Regions (Frankfurt, Ireland, Paris, Zurich, depending on the model).apac., routes across Asia-Pacific Regions (Tokyo, Sydney, Singapore, Mumbai, depending on the model).us-gov., routes across AWS GovCloud (US) Regions.global., routes across commercial Regions worldwide where the model is available.
On-demand pricing applies at the base model’s rate; no routing surcharge and no cross-Region data-transfer charge for the inference path. Only some models participate, the major Anthropic, Amazon, Meta, and Mistral tiers with multi-region footprints.
Manual DNS-based or SDK-level load balancing. Run the logic in the application: a list of regional endpoints, pick one per request, catch ThrottlingException and retry against another. Health checks, quota tracking per region, retry ordering, model-version skew, all owned by the application.
Sticky per-region routing based on tenant geography. Route US tenants to us-east-1, EU tenants to eu-west-1, APAC tenants to ap-southeast-1. Simple at the routing layer, but the quota problem doesn’t go away, it gets split, and us-east-1 still carries the 50% that was the bottleneck.
Provisioned throughput as an alternative. Buy model units in us-east-1 on a 1-month or 6-month commitment. Predictable latency and RPS, but spend is committed whether used or not; the scenario’s 80 RPS peak is modest by provisioned standards.
Side by side
| Option | Managed routing | Quota relief | Same price | Model coverage | Low ops |
|---|---|---|---|---|---|
| Cross-region inference profiles | ✓ | ✓ | ✓ | — | ✓ |
| Manual DNS/SDK load balancing | ✗ | ✓ | ✓ | ✓ | ✗ |
| Sticky per-region routing | ✗ | — | ✓ | ✓ | — |
| Provisioned throughput | — | ✓ | ✗ | ✓ | ✓ |
Model coverage on cross-region profiles is honest: they cover the in-demand families but not every model. When the chosen model has a profile, the inference-profile row is the only one ticking every column.
How the prefix routes the call
Three things worth spelling out.
The profile ID is the routing primitive. A call to us.anthropic.claude-sonnet-4-5-v1:0 is indistinguishable from a call to anthropic.claude-sonnet-4-5-v1:0 at the SDK level. The us. prefix tells Bedrock this invocation is fair game for any Region in the US geography that has the model and capacity. Swap the prefix for eu. and the same client routes across EU Regions instead.
Capacity is chosen per request, not per session. Two consecutive calls to the same profile can land in different Regions. The caller does not pin a Region. No session affinity. For a stateless RAG-style workload, this is exactly right. For stateful patterns the statelessness is enforced anyway, because Bedrock does not retain conversation state server-side.
Data stays in geography for geographic profiles. us. keeps inference payload within US Regions; eu. within EU Regions; apac. within APAC. The global. profile can route anywhere the model is available, with the residency trade-off that inference data may traverse geographies.
Why the prefix beats the other options
Manual DNS or SDK load balancing. Spreading by hand works; it isn’t cheap. The application tracks per-Region quota, owns retry ordering, detects region-level outages faster than the client timeout, and handles model-version skew when AWS rolls a point release to some Regions before others. Every time AWS adds a constituent Region, and they do, the application ships a release. The cross-region profile is doing the same job inside Bedrock, updated by AWS, priced the same.
Sticky per-region routing by tenant geography. Legitimate when customer data residency is the driver, an EU tenant whose contract requires EU processing gets calls pinned to eu. profiles. It is not a quota-relief strategy. The right use of per-tenant geography is which profile each tenant uses, not which Region inside the profile.
Provisioned throughput. The tool when the workload sustains enough RPS that committing MUs beats on-demand, or when hard latency SLAs demand isolated capacity. For 80 RPS against Sonnet the maths rarely lands on provisioned, and provisioned is single-Region by default.
Raising the quota in us-east-1 alone. A Support case can lift the per-Region cap; AWS will grant increases against demonstrated usage. This works until it doesn’t, the underlying capacity is a shared pool. Cross-region profiles give access to the pools of multiple Regions at once. Use quota increases to raise the floor in the home Region; use inference profiles to add the neighbouring Regions on top.
Application Inference Profiles for cost allocation
Two related but different things.
System-defined inference profiles are the ones AWS publishes with the us., eu., apac., us-gov., and global. prefixes. They are the mechanism for cross-region routing.
Application Inference Profiles are user-created profiles in the customer’s account that wrap either a direct model ID or a system-defined cross-region profile. They add tagging, invocations via the profile show up in Cost Explorer and Cost and Usage Reports tagged by business unit, tenant, feature, or any other dimension.
The combination most production teams settle on: create an Application Inference Profile per logical feature (summariser, extractor, chat assistant), each wrapping the system-defined us., eu., or apac. profile that matches the model and geography.
A worked configuration
- Model ID selection. The application stops calling
anthropic.claude-sonnet-4-5-v1:0directly. US tenants route tous.anthropic.claude-sonnet-4-5-v1:0, EU tenants toeu., APAC tenants toapac.. The tenant-to-geography mapping lives in tenant configuration. - IAM. The application role gets
bedrock:InvokeModelandbedrock:InvokeModelWithResponseStreamon the base model ARNs and the inference-profile ARNs in each geography. - Cost-allocation wrapping. Each feature wraps the appropriate system-defined profile in an Application Inference Profile tagged
Feature=<name>andEnvironment=<env>. - Throttle handling. SDK exponential backoff on
ThrottlingExceptionstays as-is. A profile exhausting every constituent Region at once is rare; when it happens, the retry behaves the same as single-Region. - Observability. CloudWatch metrics in the Bedrock namespace expose the invocation Region, showing the traffic split across constituents. The first post-migration run typically surfaces a surprising distribution.
- Quota increases, still. Raising us-east-1’s per-Region cap is additive: cross-region spreading gets the other Regions’ capacity; raising the home Region’s floor stacks on top.
- Monitoring for profile changes.
ListInferenceProfilesin a monthly audit notices when AWS adds a new constituent Region to a profile.
The team’s throttling problem ends up with a two-line change to the model ID string in configuration and an IAM policy update. No dispatcher service. No Route 53 record. No per-Region quota tracker. Spend unchanged.
What’s worth remembering
- Cross-region inference profiles are the Bedrock-native way to spread load across Regions. Prefix the model ID with
us.,eu.,apac.,us-gov., orglobal.and Bedrock picks a constituent Region with capacity per call. - The price is the base model on-demand rate. No routing surcharge, no cross-Region data-transfer charge for the inference path.
- Geographic profiles keep data in geography.
us.in US Regions,eu.in EU Regions,apac.in APAC,us-gov.in GovCloud.global.trades the residency guarantee for worldwide availability. - Constituent Regions are AWS-managed. AWS adds and removes Regions from each profile as model rollouts evolve; applications inherit the capacity without code changes.
- Coverage is broad but not universal. The major Anthropic, Nova, Llama, and Mistral tiers have profiles; niche and brand-new models may not. Check
ListInferenceProfiles. - Custom Model Import and provisioned throughput do not participate. Custom models are single-Region by construction; provisioned endpoints are separate from the cross-region path.
- System-defined profiles do the routing; Application Inference Profiles wrap them for tagging. Create an Application Inference Profile per feature to attach tags for cost allocation; wrap the system-defined
us.,eu., orapac.ID underneath. - Sticky per-region routing by tenant is for residency, not quota relief. Pin tenants to a geography to satisfy data-location rules; let the profile spread traffic inside that geography.
- Quota increases and cross-region profiles are additive. Raise the home Region’s cap and spread across the neighbours; they combine cleanly.
The answer: invoke Sonnet via us.anthropic.claude-sonnet-4-5-v1:0, eu.anthropic.claude-sonnet-4-5-v1:0, and apac.anthropic.claude-sonnet-4-5-v1:0 with the profile selected per tenant by residency geography; wrap each feature’s calls in an Application Inference Profile tagged for cost allocation; update the IAM policy to grant invoke on the profile ARNs alongside the base model ARNs. The saturated us-east-1 gets us-east-2 and us-west-2 added to its pool; the idle eu-west-1 becomes part of the EU tenants’ serving plane; the APAC tenants pick up Tokyo, Singapore, Sydney, and Mumbai automatically. One prefix, three Regions, same price, and the application code barely changes.