Exam Room · Advanced GenAI

Caching LLM Responses Without Stale Answers

July 29, 2026 · 22 min read

The situation

Analytics on the support assistant’s last month of traffic shows a striking shape. Out of ~400,000 distinct user queries, the top 500 phrasings account for 30% of the volume. Another 25% clusters into a few thousand near-duplicate queries differing only in wording. Every one of these calls a foundation model with retrieval, spends ~2,500 input tokensTokenThe unit of text an LLM actually sees – usually a short character sequence, not a whole word. and ~300 output tokens, and comes back with an answer that’s functionally identical to what we gave the last thousand askers.

Product has asked for faster responses and engineering has asked for a lower bill. Caching looks like the obvious lever, but LLM response caching is trickier than caching a REST endpoint. A cache hit on the wrong query returns a confidently-stated wrong answer, which is worse than the slow-but-correct baseline. A cache that’s too strict never hits. A cache that crosses sessions serves another user’s context to the current one.

The team needs a caching strategy that meaningfully reduces model calls without compromising correctness, privacy, or freshness.

What actually matters

A cache is a map from key to value. For a REST endpoint, the key is usually the URL; for an LLM, the key is fuzzier. Two prompts that differ by a word might be the same question; two prompts that differ by a single digit (product ID, date) might have completely different answers.

The first decision is the cache key. Exact-match hashes the full prompt, stable but misses paraphrases. Normalised hash (lowercase, strip punctuation, sort tokens) catches some paraphrases. Semantic hashing (embed the prompt, cluster by cosine similarity to a known set of cached embeddingsEmbeddingA fixed-length vector of floats that represents a piece of text (or image, or other thing) in a space where similar meanings sit close together. ) catches real paraphrases but risks false positives. Structured keys (pull out known slots from the query, intent, product, action, and hash those) are strict but brittle.

The second is what gets cached. Caching the final response is one option; caching intermediate artefacts (retrieved chunksChunkingSplitting documents into retrievable pieces before embedding them – small enough to match precisely, big enough to still make sense. for a given query, parsed intent for a given message) is another. The former skips the whole pipeline; the latter skips parts of it. Both have roles.

The third is freshness and invalidation. A cached response has to know when to expire. Time-based TTL is the simplest, “this response is valid for 6 hours.” Event-based invalidation ties the cache to content changes (the Knowledge Base was re-ingested; invalidate anything that cited changed chunks). Both together is the robust version.

The fourth is context-sensitivity. “How do I cancel?” is a safe cache candidate, answer doesn’t depend on who’s asking. “When is my next payment due?” obviously does. A general cache serves the former cleanly and has to exclude the latter. Classification of cacheability is a prerequisite to caching anything.

The fifth is the cache store itself. The options break into three categories: a managed keyed store with TTL support (cheap, durable, mid-latency), an in-memory store (faster, more expensive per GB, supports vector similarity in some forms), and an in-process LRU cache (fastest, but doesn’t share across instances). Layered on top of any of those, the model-inference layer itself may offer prefix-level caching for stable prompt sections within a short window, a different kind of cache from whole-response caching, but stackable with it.

There’s a user-experience angle as well: what a hit feels like next to a miss. A miss means the normal latency (seconds); a hit means near-instant (milliseconds). The contrast matters: if 30% of queries return in 50ms and 70% in 2s, the UX is choppy. Consistent UX might mean slightly delaying cache hits to match baseline perceived latency.

What we’ll filter on

Hit rate, what fraction of queries hit the cache?
False-positive risk, how likely is a hit to serve the wrong answer?
Context safety, can a cache entry from one user leak to another?
Freshness, how stale can a cached response get?
Implementation complexity, how much plumbing, how many new services?

The caching landscape

Exact-match response cache. Hash the canonical prompt (normalised, with all context expanded); map to the full response. Redis or DynamoDB. TTL at minutes to hours. Simple; high false-positive-safety (an exact match is exact, with no fuzziness); low hit rate (paraphrases miss).
Semantic response cache. Embed each prompt, find the nearest cached prompt by cosine similarity; if above a threshold (e.g., 0.95), return the cached response. Missing that threshold, fall through to the model. Higher hit rate; threshold needs tuning to avoid false positives.
Bedrock prompt caching (prefix-level). Mark cacheable sections of a prompt (system instructions, few-shot examples, stable retrieved context); Bedrock caches them server-side for 5 minutes, charging a fraction of the normal input-token rate on cache hits. Not a whole-response cache; cuts input-token costs for the cached portion even when the rest of the prompt varies.
Retrieval cache. Cache the retrieved chunks for a given query, not the response. Skips the vectorVectorAn ordered list of numbers – in AI usage, almost always an embedding – and by extension the databases that index them for nearest-neighbour search. search; still runs the generator. Useful when retrieval is the expensive step (rarer than generation being expensive) or when the generator’s output depends on session state that changes.
Structured-intent cache. Pre-classify each query into an intent with slots (intent=cancel, product=X). Cache the response keyed by intent + slots. Very high hit rate within an intent; requires an intent classifier upstream.
Session-level memoisation. Cache within a single session, if the user asks the same question twice in one conversation, return the prior answer. Narrow, safe, easy, low hit rate overall but noticeable within long sessions.
No caching (baseline). Pay for every call. Honest answer if the cacheable fraction is small.

Side by side

Cache type	Hit rate	False-pos risk	Context safety	Freshness	Complexity
Exact-match	Low (~5-10%)	Very low	Strict with session-scoped keys	TTL-only	Low
Semantic (vector)	High (~20-40%)	Medium	Strict with cacheability flag	TTL + eventing	Moderate
Bedrock prompt caching	N/A (covers prefix)	Very low	N/A	5 min TTL	Very low (flag on request)
Retrieval cache	Medium	Low	Session-scoped	TTL + eventing	Moderate
Structured-intent cache	Very high (~50-70% of cacheable)	Low if intents tight	Strict	TTL	High (classifier)
Session memoisation	Low overall	Very low	Per-session	Session lifetime	Very low
No caching	0%	N/A	N/A	N/A	None

No single row is the answer. A layered approach, semantic cache for paraphrases, Bedrock prompt caching for the stable prefix, retrieval cache where it helps, all gated by a cacheability classifier, is the realistic shape.

A layered cache design

Cacheability gate first, semantic cache second, retrieval cache third, Bedrock prompt cache fourth. Misses cascade outward; hits short-circuit. Red dashed arrows are invalidation paths.

The picks in depth

Cacheability classifier. The first guard. A cheap upstream classifier, a small Bedrock call to Claude Haiku, or a fine-tuned small model, or a rule-based system, looks at each incoming query and decides: is this query “generic” (cacheable across users) or “personalised” (context-dependent)? Generic: “how do I cancel?”, “what does Pro plan include?”, “where’s the refund policy?”. Personalised: “when is my next payment due?”, “what’s my billing address?”, “why was my last charge $49?”. The classifier routes accordingly. False negatives (personalised misclassified as generic) are the dangerous case, they can leak one user’s data as another’s cached response. Bias the classifier conservative: when uncertain, treat as personalised.

Semantic response cache in ElastiCache for Redis. For cacheable queries, embed the canonical query using Titan Text Embeddings v2 (a cheap call, sub-cent), search Redis for the nearest cached embedding within a cosine threshold of 0.95. On hit, return the cached response. On miss, fall through to the full pipeline, then write the (embedding, query, response, timestamp) tuple back to Redis. Redis’s VSS (vector similarity search) via RediSearch handles the k-NNk-NNThe retrieval question itself: given a query vector, return the k closest vectors under the index’s distance metric – answered exactly by comparing against everything, or quickly by an ANN index. query natively. 6-hour TTL by default. Threshold of 0.95 is strict enough that “how do I cancel?” and “cancel my plan” both hit the same entry but “how do I cancel a charge?” does not. Tune the threshold against evaluation data; too low serves wrong answers, too high misses paraphrases.

Retrieval cache in ElastiCache. For cacheable queries that miss the semantic cache, cache the retrieved chunks separately. The retrieval step has its own cost (vector search, re-rankerRerankingA second pass that re-scores a wide set of retrieved candidates and keeps only the few most relevant, so the expensive model reads less. ); caching chunks by canonical-query-hash skips it on repeat. Shorter TTL (1 hour) because retrieval should respond to Knowledge Base updates faster than full responses.

Bedrock prompt caching on stable prefixes. Regardless of whether the response is in a cache, Bedrock’s own prompt caching on the system prompt and few-shot section charges cache reads at roughly 10% of the normal input-token rate for repeat calls within 5 minutes. Applies to every Bedrock call, cache hit or miss in our layers.

Write-through on every miss. On a miss at the semantic layer, the pipeline runs to completion, gets the response, and writes back: the canonical query, its embedding, the response, a timestamp, and any invalidation tags (cited chunk IDs, intent classification). Next time this query or a paraphrase comes in, we hit the cache.

Invalidation. Two-way. TTL provides the floor (nothing older than 6 hours). Event-driven invalidation handles content updates: when a Knowledge Base chunk is re-ingested, a CloudWatch event triggers a sweep that evicts every cached response tagged with that chunk’s ID. Same for doc-level updates. The combination means stale answers die on schedule or on event, whichever comes first.

Metrics and observability. Cache hit rate by layer (semantic, retrieval), staleness distribution (how old are hits when served), false-positive detection (A/B sample: occasionally run the full pipeline on a “hit” and compare; if the cached response and the fresh response diverge above a threshold, flag for review). A semantic cache with a 30% hit rate but occasional drift-serving is worth knowing about before a customer points it out.

A worked example: one hour of traffic

Traffic: 6,000 requests over one hour. Breakdown:

Cacheability classifier (upstream, Haiku call, ~$0.0001 each):
  Cacheable: 4,200 requests
  Personalised (bypass): 1,800 requests

Semantic cache hits: 1,350 (32% of cacheable)
  → ~50 ms response, no Bedrock call
  → cost: embedding + Redis lookup only

Semantic cache misses: 2,850
  Retrieval cache hits (skip vector search): 900
  Retrieval cache misses: 1,950 → full retrieval

All 2,850 semantic misses invoke Bedrock with prompt-caching flag
  → ~70% pay the discounted prefix rate (within 5-min TTL cluster)

Personalised bypasses: 1,800 → full pipeline, no cache

Savings versus no-cache baseline:

Model calls avoided by semantic hits: 1,350
  At average $0.012 per call = $16.20 saved per hour

Bedrock prompt-cache savings (prefix discount on misses):
  ~70% of 4,650 calls (misses + bypasses) × ~$0.004 saved = $13.00 per hour

Total saved: ~$29/hour = ~$700/day = ~$21,000/month

Added cost:
  Cacheability classifier: 6,000 × $0.0001 = $0.60/hour = $14/day = $430/month
  Embedding + Redis: negligible, maybe $100/month

Net saving: ~$20,500/month on a Bedrock bill that was projecting ~$55k

Plus: 1,350 requests per hour arrive in ~50ms instead of ~2s. Perceived responsiveness improves meaningfully for the cacheable slice.

What’s worth remembering

LLM response caching is not REST caching. Keys are fuzzy; false positives are confidently wrong answers. Design defensively.
A cacheability classifier is the prerequisite. Without it, personalised queries leak across sessions. Bias conservative: personalised-if-unsure.
Semantic cache with a strict threshold beats exact-match. Embed the query, k-NN against cached embeddings, threshold at ~0.95 for paraphrase-safety.
Bedrock prompt caching is free performance. Flag the stable prefix on every request; saves input tokens server-side with no application-layer work beyond the flag.
Retrieval cache sits between the two. Cache chunks by canonical-query-hash; skip vector search; still generate fresh responses.
Invalidation is TTL plus events. TTL catches time-based staleness; event-driven eviction catches content updates.
Measure hit rate and false positives. A/B sample cache hits against fresh runs; alert on drift. A silent false-positive rate above 1% is a quality problem.
Personalised queries don’t cache across users; they might cache within a session. Session memoisation is a narrow but safe extra layer.

Thirty percent of queries return in 50ms instead of 2s, the bill drops by a third on the cacheable slice, and the wrong-answer rate stays at the cacheability-classifier’s false-negative floor rather than creeping upward. Cache what you can, bypass what you can’t, and measure both sides of the line continuously.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.