The situation
Bedrock spend for the company’s AI platform was $18k last quarter. It’s tracking toward $46k this quarter, and the graph keeps curving upward. No single service is responsible; the support assistant is up 40%, the ticket classifier is up 120%, the marketing-copy drafter is up 80%, and a new internal-search tool the research team launched last month is already second on the leaderboard.
Finance wants a plan with a target: get next quarter inside $35k without hurting user-facing quality. Product wants reassurance that the features they’ve scoped for next quarter can still ship. The platform team, which is where the bill lands, wants tools they can apply repeatedly rather than a one-time cost-cut exercise.
Token accounting has been on for a while. The breakdown reads:
- Input tokens: 68% of spend. Long retrieval contexts, uncompressed system prompts, verbose few-shot examples.
- Output tokens: 28% of spend. Chatty default response styles, unstructured output that the model elaborates on.
- EmbeddingA fixed-length vector of floats that represents a piece of text (or image, or other thing) in a space where similar meanings sit close together. calls: 3%.
- Other (evaluations, Fine-tuningContinuing to train an already-trained model on a smaller dataset to adapt its behaviour. runs): 1%.
All the spend is on-demand. No Provisioned Throughput. Model mix is 70% Claude Sonnet 4.5, 20% Nova Pro, 5% Haiku, 5% others.
What actually matters
Cost control for a foundation-model-heavy app has a handful of levers, each with a different blast radius and different unintended consequences. Worth walking through them before picking which to pull.
The first lever is model routing: not every request needs the most expensive model. A seven-category classifier doesn’t need top-tier capability; a smaller, cheaper model in the same family will produce the same classification at a fraction of the price. A summariser generating two sentences doesn’t need the flagship. The correct size of model for the task is often smaller than the default, and the default drifts up because “the best model” is easier to specify than “the smallest model that’s good enough.”
The second is prompt compression: system prompts, few-shot examples, and retrieved context often carry more tokens than they need to. A System promptThe instruction block that frames the model’s behaviour for a session, separate from the user’s messages. that reads like a formal spec can be rewritten in half the tokens without losing instruction coverage. Retrieved chunks that come back from the vector store include boilerplate (headers, navigation, citation markers) that adds tokens and doesn’t help the model. The work is unglamorous; the savings are linear with volume.
The third is caching. Some inference platforms charge a fraction of the normal input-token price when a cache-eligible prefix (system instructions, long context) is reused within a short window. For assistants with stable system prompts and lots of traffic, that shaves a big chunk off input costs for the cacheable portion. A layer up, caching the whole response for repeated user questions (FAQ-style) is even cheaper; the cache never calls the model at all.
The fourth is committed capacity: for high-volume, predictable workloads, paying for dedicated throughput up front is significantly cheaper per-token than pay-as-you-go, at the cost of paying for the commitment whether it’s used or not. Correct for workloads with a consistent baseline; wrong for bursty or unpredictable usage.
The fifth is output shape. A prompt that asks for “a summary” gets a chatty summary; a prompt that asks for “a summary in at most 50 words, no preamble” gets a smaller output. Structured output (JSON with defined fields) is shorter than prose and more predictable. Shaping outputs to their minimum useful form is free once; savings compound per call.
The sixth is retrieval tightening. A retriever that returns top-10 chunks when top-3 would do is paying for seven chunks of context per query. Re-ranking the candidate set before passing to the generator, or tightening the retrieval k, or using a smaller chunk size, all reduce input token count.
And a softer one: what’s worth measuring. The bill is an aggregate; the savings are per-call; the optimisations are per-service. Without per-service and per-model attribution, every optimisation is a hypothesis without a way to check it.
What we’ll filter on
- Savings magnitude, how many percent off the bill, realistically?
- Quality risk, does this change hurt user-facing quality?
- Implementation effort, hours, days, or weeks of engineering?
- Blast radius, how many services does this touch?
- Reversibility, can we roll this back if it goes wrong?
The cost-control landscape
-
Model right-sizing (Sonnet → Haiku for eligible calls). Take the top five services by spend, evaluate each against Haiku or Nova Micro, switch the ones that don’t regress. For the ticket classifier, Haiku at roughly 12% the cost of Sonnet is almost certainly enough. Typical savings when half the calls can move down: 30-50% of the affected service’s bill. Effort: one evaluation per service. Reversible via Prompt Management alias rollback.
-
Prompt caching on stable prefixes. Bedrock’s prompt caching lets us mark up to four cache points in a prompt. The first call at a cache point pays full price; subsequent calls within the 5-minute TTL pay ~10% of the input token price for the cached portion. For a support assistant with a 2000-token system prompt that runs 500 times an hour, 90%+ of those calls hit the cache. Typical savings on input costs: 30-50% when the cached portion is a significant fraction of the prompt. Effort: add cache points to the request; test. Reversible.
-
Provisioned Throughput (PT) for steady workloads. Commit to a fixed MTU per model for 1 or 6 months. PT is priced significantly below on-demand per-token, but billed by the hour whether used or not. Works when a service has a predictable baseline (the daily summariser that runs 100 times an hour around the clock). Doesn’t work for bursty traffic. Typical savings: 40-60% on the committed portion; zero if the usage doesn’t fill the commitment. Commitment lengths vary. Partial reversibility, can’t cancel a 6-month commit but can scale new work to other models.
-
Output-length constraints. System-prompt instructions and
max_tokenscaps that keep outputs minimal. “Respond in at most 75 words” cuts output tokens for chatty models by 30-50% on summary-style tasks. Structured JSON output (via JSON mode or schema) is similarly shorter. Effort: prompt edits. Reversible. -
Retrieval tightening. Top-k from 10 to 5; chunk size from 500 tokens to 300; re-ranker on top-20 candidates to select top-5. Each reduces the context token count. Typical savings on retrieval-heavy services: 20-40% on input. Trade-off: more aggressive trimming can hurt retrieval recall; evaluate before shipping. Effort: Knowledge Base configuration. Reversible.
-
Response caching for repeated queries. A hash of (prompt, model, params) → cached response, stored in ElastiCache or DynamoDB with a TTL. Skips the model call entirely for cache hits. Works for FAQ-style traffic where the same question is asked hundreds of times; works poorly for conversational traffic with long session context. Typical savings: up to 100% on cacheable calls; depends heavily on traffic shape. Effort: cache layer. Reversible.
-
Cross-account / cross-region consolidation. InferenceRunning a trained model to produce output – as opposed to training it. in some regions is cheaper than others; cross-region inference on Bedrock (profile-based routing) can shift traffic to cheaper regions with minimal latency penalty. Typical savings: 5-15% depending on region mix. Effort: enable inference profiles. Reversible.
Side by side
| Lever | Savings % | Quality risk | Effort | Blast radius | Reversibility |
|---|---|---|---|---|---|
| Model right-sizing | 20-40% overall | Medium | Days per service | Per service | Full |
| Prompt caching | 15-30% overall | Low | Hours | Request-level | Full |
| Provisioned Throughput | 10-25% overall | Low | Days + commitment | Per model | Partial |
| Output-length constraints | 5-15% overall | Low if tested | Hours | Per prompt | Full |
| Retrieval tightening | 10-20% overall | Medium | Hours + evals | Per KB | Full |
| Response caching | 5-20% overall | Low for FAQ | Days | New service | Full |
| Cross-region inference | 5-15% overall | Low | Hours | Per service | Full |
Stacking these isn’t additive, some overlap, but a realistic quarterly programme combines three or four of the low-quality-risk levers and produces 30-50% savings without retraining a model or changing a product feature.
The cost levers in order of expected return
The picks in depth
Model routing. The biggest lever, and the most careful. Start with the ticket classifier: a seven-class classifier with clear rubrics is almost certainly fine on Haiku. Run a Bedrock Evaluation comparing Haiku against the current Sonnet baseline on a 500-ticket sample; if within tolerance on the decision-rule (per the earlier evaluation post), switch. Repeat for the marketing-copy drafter’s first-pass generation (revision stays on Sonnet), the internal-search summariser (Haiku), the intent-detection step at the front of the support assistant (Haiku). Keep Sonnet for the conversation-style turns where fluency matters. Expected saving: ~25% of the overall bill, delivered over 2-3 weeks.
Prompt caching. Every service with a stable system prompt gets cache points. The support assistant’s 1800-token system prompt and long few-shot section qualify; the ticket classifier’s 600-token rubric qualifies; the marketing-copy drafter’s style-guide prefix qualifies. Bedrock’s cache points sit in the request, one flag per content block indicates it’s cache-eligible, and Bedrock returns cache-read and cache-write token counts in the response. 5-minute TTL; hot services refresh continuously. Expected saving: ~12% of the overall bill, delivered in a week.
Output-length constraints. System-prompt lines and max_tokens caps. The summariser gets “respond in at most 75 words, no preamble”; the classifier gets JSON output with a fixed schema; the marketing-copy drafter keeps its longer outputs but gains “no meta-commentary, no recap, no follow-up suggestions.” Expected saving: ~5% of the overall bill.
Retrieval tightening. The support assistant and internal search both over-retrieve. Move top-k from 10 to 5 with a re-ranker on top-20 candidates. Chunk size from 500 to 300 tokens, which requires re-indexing but also improves retrieval precision. Evaluate with the retrieval-quality suite from earlier; ship if the evaluation doesn’t regress. Expected saving: ~7%.
Provisioned Throughput. Skip for now. No individual model’s usage on any service is steady enough to fill a PT commitment; PT earns its keep when a service runs a consistent baseline 24/7, which isn’t yet true for this portfolio. Revisit in a quarter when the daily summariser’s traffic has grown enough.
Response caching. Skip for the support assistant (conversational, not cacheable at response level). Enable for a small FAQ-style endpoint the marketing team uses (pre-baked prompts with stable answers). Marginal saving, but the pattern is worth having in the toolkit.
A worked example: the ticket classifier’s drop from Sonnet to Haiku
Two months of traffic: 500,000 classification calls, 600 input tokens average (system prompt + ticket), 20 output tokens average. Current bill on Sonnet:
Input: 500,000 calls × 600 tokens × $3.00/M tokens = $900
Output: 500,000 calls × 20 tokens × $15.00/M tokens = $150
Total: $1,050
On Haiku at ~12% the rate:
Input: 500,000 × 600 × $0.25/M = $75
Output: 500,000 × 20 × $1.25/M = $12.50
Total: $87.50
That’s ~92% off for one service, $960 saved. Running the evaluation first (decision rule: F1 within 0.02 across all categories): Haiku scored F1 = 0.93 vs Sonnet’s 0.95 on the 500-example eval set, within tolerance. Alias moves in Prompt Management; next invocation hits Haiku; CloudWatch per-prompt metrics confirm the quality signal holds in production. Total engineering time: ~2 days for eval design, eval run, rollout.
The classifier is small in the overall bill but demonstrates the method. The same pattern applied to three more services is the $12k of savings in the waterfall.
What’s worth remembering
- Cost control is a lever stack, not a silver bullet. Three or four levers stacked produce more than any single lever.
- Model right-sizing is the biggest lever for most portfolios. The default drifts up to the best available; the bill reflects that drift. Evaluate, then downsize.
- Prompt caching is cheap to enable and recovers a large chunk of input costs. Mark cache points on stable prefixes; watch the response metadata for cache hit counts.
- Output-length constraints are free money. Chatty models get trimmed by prompt instruction and
max_tokens. Test that quality holds. - Retrieval over-fetch is a silent cost. Top-k 10 when top-3 would do is paying for seven chunks of context per call. Tighten with re-ranking.
- Provisioned Throughput earns its keep at steady-state scale. Wrong for bursty portfolios; correct for a 24/7 baseline on a single model. Model the break-even before committing.
- Response caching works for FAQ traffic. Not for conversation; not for retrieval-personalised calls. Narrow but free when it applies.
- Per-service, per-model attribution is the prerequisite for any of this. Without it, every optimisation is a hypothesis without a way to check it. Tag every invocation; dimension every metric.
Next quarter closes at $22k, comfortably inside the $35k target, with headroom for the features product wanted to ship. The bill didn’t shrink because anyone changed a model’s price; it shrank because the work that used to go to the most expensive model now goes to the correct model, with the correct prompt, with the correct context, at the correct time.