The situation
Two production services on Bedrock, both hitting Claude Sonnet 4.5, both starting to feel the limits of on-demand.
Service A: the customer-facing assistant. Runs 24/7. Peak traffic is 40 requests per second (US and EU business hours overlapping); trough is 8 requests per second (overnight in both regions). Median request consumes 1,500 input tokens and produces 200 output tokens. Runs ~10 million requests per month. Latency matters, product has a p95 SLA of 2 seconds end-to-end; Bedrock latency is most of that budget. The service has started hitting on-demand throttling during peak, seeing occasional ThrottlingException errors that the retry logic masks but that add latency spikes.
Service B: the weekly report generator. Runs for about 6 hours every Sunday morning. Generates ~80,000 reports in that window, each consuming ~3,000 input tokens and producing ~800 output tokens. Rest of the week, zero traffic. Latency per request doesn’t matter, reports aren’t interactive, but the job has to finish within the 6-hour window because downstream distribution kicks off on Sunday afternoon.
Both services are candidates for Provisioned Throughput. The question is whether PT is correct for each, and if so, what to commit.
What actually matters
Inference pricing for managed foundation models tends to come in two shapes.
Pay-as-you-go is per-TokenThe unit of text an LLM actually sees – usually a short character sequence, not a whole word. : input tokens at one rate, output tokens at another (typically several times higher). No upfront commitment; pay exactly what’s used. Subject to account-level throttling quotas, requests-per-minute and tokens-per-minute, per ModelA trained set of weights plus the architecture that makes them useful – the thing you load up and run inference against. , which can be raised by support request but cap the burst capacity.
Committed capacity is per-month-of-capacity. Reserve a slab of throughput on a model for a fixed term; that slab provides a guaranteed tokens-per-minute number, billed flat regardless of utilisation. Latency is more predictable because the capacity is reserved rather than shared.
The core trade: money for predictability. Committed capacity buys guaranteed throughput at a fixed price; pay-as-you-go charges only for what’s used but can throttle and has variable latency.
The first decision is whether a commitment is economically justified at all. For a workload running 24/7 at meaningful volume, the total token spend is large enough that a well-sized commitment can undercut pay-as-you-go. For a workload that runs a few hours a week, paying for a month of capacity to serve those hours is wasteful no matter how favourable the rate.
The second is how to size the commitment. The commit has to cover peak throughput, not average, or it has to be deliberately sized below peak with a plan for spillover. Over-committing wastes money on idle capacity; under-committing means peak traffic spills back to pay-as-you-go (usually permitted, usually at the pay-as-you-go rate).
The third is how latency behaves on the two modes. Pay-as-you-go latency is driven by shared-tenancy InferenceRunning a trained model to produce output – as opposed to training it. queueing; during peaks, requests wait in queue. Reserved capacity keeps queue depth low because nobody else is using it. The observable shape: p50 latency similar; p95 and p99 materially better on committed capacity during peak hours.
The fourth is commitment length. Shorter terms come at a higher monthly rate; longer terms come with a discount and a longer lock-in. The choice mirrors any other reserved-capacity calculus: cheaper per-month buys less flexibility.
The fifth is whether routing tricks can postpone the decision. Cross-region inference routing, a single call dispatched to whichever region has capacity, can absorb bursts on pay-as-you-go without a commitment. It works with pay-as-you-go; it’s less relevant once capacity is reserved in a single region.
And a softer one: what happens if traffic doubles or halves. A commitment is rigid for its term. If traffic doubles next quarter, the team needs another commitment (or spillover at pay-as-you-go rates). If it halves, the bill stays the same. Committed capacity suits workloads with stable, predictable envelopes, not workloads in the middle of a growth or shrinkage curve.
What we’ll filter on
- Latency predictability, p50, p95, p99 under peak load?
- Monthly cost at expected usage, which pricing shape wins at this traffic profile?
- Cost at worst-case traffic, what happens when actual usage deviates from plan?
- Commitment flexibility, scaling up, down, or out mid-term?
- Operational overhead, what changes in the day-to-day with each shape?
The provisioned-vs-on-demand landscape
-
Pure on-demand. Baseline. No commitment, pay per token, subject to account throttling limits. Variable latency; spiky. Right for unpredictable workloads, low volumes, bursty occasional jobs.
-
On-demand + throttle increase. Request a higher RPM/TPM limit via support. Buys more headroom on on-demand; doesn’t change latency characteristics. Free (you don’t pay more per token), just asks AWS for a bigger queue.
-
On-demand + cross-region inference. Configure an inference profile that routes across multiple regions. Increases effective throughput at the cost of slightly higher cross-region latency. Cross-region inference profiles can be applied to pure on-demand (no PT needed); very useful for bursty workloads.
-
Provisioned Throughput, 1 month. Monthly commitment. Reserved capacity. Higher bill floor, lower per-token effective cost if well-utilised. One-month term lets us re-evaluate monthly.
-
Provisioned Throughput, 6 month. Longer commitment, deeper discount. Six-month term assumes the workload shape is stable for at least six months.
-
PT + on-demand overflow. The hybrid. PT covers the baseline; on-demand absorbs peaks above the committed capacity. Billed for both. PT at monthly, overflow at on-demand per-token. Right for workloads with a steady baseline and known peaks.
-
Batch Inference API. A separate pricing tier for batch workloads. Submit up to 1GB of requests, Bedrock processes them within 24 hours at ~50% the on-demand rate. Not subject to real-time throttling. Perfect for Service B-style batch jobs.
Side by side
| Option | Latency | Cost at expected usage | Cost at worst-case | Commitment | Ops overhead |
|---|---|---|---|---|---|
| On-demand | Variable | Pay per token | Pay per token | None | None |
| On-demand + throttle | Variable | Same | Higher ceiling | None | Support ticket |
| On-demand + cross-region | Slightly higher | Same | Higher ceiling | None | Inference profile config |
| PT 1-month | Predictable | Flat monthly | Overflow at OD | 1 month | Capacity planning |
| PT 6-month | Predictable | Flat monthly (discounted) | Overflow at OD | 6 months | Capacity planning |
| PT + OD overflow | Mostly predictable | Flat + some OD | PT + OD | Same | Capacity planning |
| Batch Inference | N/A (batch) | ~50% of OD | N/A | None | Batch job plumbing |
Service A and Service B, placed
The picks in depth
Service A: PT 1-month + on-demand overflow. The traffic shape, steady daily pattern, stable week to week, fits PT’s monthly commitment model. Size the PT to cover roughly 75% of peak, not 100%. Covering 100% would pay for capacity that sits idle two-thirds of the day; covering 75% means the top of the curve spills to on-demand at on-demand prices, which is fine because those are the hours we were happy paying on-demand for anyway.
Rough math: average daily throughput is ~24M tokens/hour (peak 40 rps × 1700 tokens); 75% of peak is ~30 rps, or roughly 3M tokens/min. At 50k TPM per MU, that’s about 60 MUs. At typical PT pricing (figures vary by model and change regularly), the monthly commitment sits in the low five figures; on-demand for the same 75% of traffic was in the mid-five figures. Savings: meaningful, 30-40% depending on the exact rates at commit time.
Side effect: p95 latency drops because the dedicated capacity removes shared-tenancy queueing during peaks. Product sees the SLA compliance rate improve from 94% to 99%.
Service B: Batch Inference API. Provisioned Throughput would be wildly wasteful here, 6 active hours per 168-hour week. The Batch Inference API is exactly the correct tool: submit a manifest of requests, Bedrock processes them at roughly 50% the on-demand rate within 24 hours. The Sunday-morning window starts earlier and accepts the batch API’s less-than-24-hour turnaround; downstream distribution kicks off Sunday afternoon.
Rough math: 80,000 reports × 3,800 tokens average = 304M tokens per Sunday. At half on-demand, the weekly bill drops by roughly half. No ops overhead beyond the batch submission code. Zero commitment risk, a week when no reports run costs zero.
What we keep on on-demand. Everything else: evaluation jobs, experimentation notebooks, one-off queries, ad-hoc scripts. These are the use cases on-demand was designed for.
Rollout. Service A migrates to PT over two weeks. Week one: commit 30 MUs (half the target), monitor capacity utilisation and on-demand spill. Week two: true up to 60 MUs once the math is confirmed. Service B migrates to Batch Inference API in a sprint, the submit/poll code is straightforward; the existing real-time invocation loop replaces with a batch-job state machine.
A worked example: the bill before and after
Current monthly spend on both services, on-demand:
Service A (24/7 assistant):
17B tokens/month at Sonnet (5% input / 95% output-weighted average)
≈ $20,500/month on-demand
Service B (weekly batch):
300M tokens/week × 4 weeks = 1.2B tokens/month at Sonnet
≈ $1,800/month on-demand
After the changes:
Service A:
PT: 60 MUs × $N/month commitment ≈ $11,500
On-demand overflow (peaks above capacity, ~25% traffic) ≈ $4,200
Subtotal $15,700
Service B:
Batch Inference API (50% of on-demand) ≈ $900
Combined monthly spend drops from $22,300 to $16,600, 25% saving, latency improvement on Service A as a bonus, and two workloads better-matched to the pricing model that fits them.
What’s worth remembering
- On-demand prices predictability at zero; PT prices it per month. The question is whether you need predictability enough to pay for it.
- PT fits steady workloads with stable envelopes. Predictable daily traffic, month-over-month similarity, no imminent order-of-magnitude changes.
- PT sized at 70-80% of peak with on-demand overflow beats PT sized at 100%. You pay for the capacity that matters most; peaks spill at on-demand rates, which you were accepting before.
- The Batch Inference API is the correct tool for batch. ~50% of on-demand, no commitment, up to 24-hour turnaround. Use it for anything that doesn’t need real-time.
- Commitment length matters. 6-month PT discounts more than 1-month but locks you in. Use 1-month until the workload shape is genuinely settled; graduate to 6-month.
- PT latency benefit is real during peaks. p95 and p99 improve; p50 stays roughly the same. Measure the shape before committing.
- Cross-region inference is a free capacity lever on on-demand. Configure an inference profile, route to multiple regions, multiply effective throughput. Mix with or without PT.
- Throttle-limit increases are free headroom. Request higher RPM/TPM quotas via support; doesn’t change per-token pricing, does prevent the ThrottlingException class of errors.
Two workloads, two pricing models, one bill that dropped 25%, and a latency SLA that product stopped complaining about. The lever wasn’t a cheaper model; it was matching each workload to the pricing shape that fits its rhythm.