When a Document Won't Fit the Context Window

July 27, 2026 · 14 min read

Generative AI Developer · AIP-C01 · part of The Exam Room

The situation

The legal team has a recurring task: given a pair of documents (typically a vendor contract and an internal policy), identify clauses that speak to a specific topic (refund disputes, data handling, liability limits) and surface both the clauses and any conflicts between them. They’ve been doing this by hand, which takes a day per pair; they’ve asked whether an LLM can help.

Documents are long. A typical contract is 60,000 tokens of dense legal prose; a policy manual is 100,000 tokens. Together, with a 2,000-token system prompt and room for a useful response, they exceed the comfortable working zone of even a 200k context window, and in practice, very long contexts hurt quality even before they hit the hard limit (“lost in the middle” effects are well documented).

Concrete constraints: Claude Sonnet 4.5’s 200k token window, Bedrock per-token pricing, a budget that prefers five focused calls over one massive call when the total token count is similar, and a user-facing latency target of under 30 seconds per query.
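The budget arithmetic behind those constraints can be checked in a few lines. This is a minimal sketch: the document and system-prompt sizes come from the scenario, while the 4,000-token response reserve and the helper name `prompt_budget` are assumptions made for illustration.

```python
# Token counts from the scenario; RESPONSE_RESERVE is an assumed figure.
CONTEXT_WINDOW = 200_000
CONTRACT_TOKENS = 60_000
POLICY_TOKENS = 100_000
SYSTEM_PROMPT_TOKENS = 2_000
RESPONSE_RESERVE = 4_000  # assumption: room left for the answer


def prompt_budget(*parts: int, window: int = CONTEXT_WINDOW,
                  reserve: int = RESPONSE_RESERVE) -> dict:
    """Return how much of the window a prompt uses and what is left."""
    used = sum(parts)
    return {
        "used": used,                          # tokens spent before the question
        "headroom": window - reserve - used,   # tokens left after the reserve
        "fraction": used / window,             # share of the window consumed
    }


budget = prompt_budget(CONTRACT_TOKENS, POLICY_TOKENS, SYSTEM_PROMPT_TOKENS)
print(budget)
```

Both documents technically fit, but roughly four-fifths of the window is gone before the question is even asked.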

What actually matters

The context window is a budget, and every token in the prompt competes with every other token. A naive “dump both documents, ask the question” approach wastes the budget (most of the 160k document tokens are irrelevant to any specific question) and hurts quality, because the model’s attention degrades as relevant needles get buried in an irrelevant haystack.

The first decision is chunking. A document split into passages of the right size can be searched by relevance before a prompt is built. “The right size” depends on the question: a question about a specific clause wants small, tight chunks; a question about broad themes wants larger chunks that capture context. Chunking also interacts with embedding quality: 300-token chunks with 20% overlap are a reasonable default for QA workloads; 1000-token chunks for summarisation.
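A fixed-size chunker with overlap is only a few lines. The sketch below assumes the document has already been tokenised into a list; the function name `chunk_tokens` and the synthetic token list are illustrative, and a real pipeline would use the tokenizer that matches its embedding model.

```python
def chunk_tokens(tokens: list, size: int = 300, overlap_ratio: float = 0.2) -> list:
    """Split a token list into fixed-size chunks that overlap by a ratio."""
    if size <= 0 or not 0 <= overlap_ratio < 1:
        raise ValueError("size must be positive and 0 <= overlap_ratio < 1")
    overlap = round(size * overlap_ratio)   # 60 tokens for the 300 / 20% default
    stride = max(1, size - overlap)         # how far each chunk start advances
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this chunk already reaches the end of the document
    return chunks


# Synthetic tokens stand in for real tokenizer output (assumption).
doc = [f"tok{i}" for i in range(1000)]
chunks = chunk_tokens(doc)
print(len(chunks), [len(c) for c in chunks])
```

Swapping `size=1000` in for summarisation workloads changes nothing else, which makes chunk size easy to experiment with per query shape.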

The second is retrieval vs windowing. For specific questions, retrieve the top-k relevant chunks from each document, assemble a prompt with those, and answer. For exhaustive questions (“list every clause about X”), retrieval can miss relevant passages the embedding model didn’t rank high enough. A sliding window (process the document in overlapping segments) is the alternative. Trade-offs: retrieval is cheap and focused; windowing is exhaustive but expensive.
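To make the coverage risk concrete, here is a minimal top-k retrieval sketch using cosine similarity over toy vectors. The chunk ids and three-dimensional vectors are invented for illustration; a real system would use embedding-model output and a vector store rather than a Python list.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, index, k=10):
    """index: list of (chunk_id, vector). Return the k best (chunk_id, score)."""
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in index]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy index: hypothetical chunk ids with made-up vectors.
index = [
    ("contract-12", [0.9, 0.1, 0.0]),
    ("contract-47", [0.1, 0.9, 0.0]),
    ("contract-63", [0.7, 0.3, 0.1]),
]
hits = top_k([1.0, 0.0, 0.0], index, k=2)
print(hits)
```

With `k=2`, `contract-47` never reaches the prompt. If that chunk happened to hold a relevant clause phrased in unfamiliar vocabulary, the answer would silently omit it, which is the gap that windowing closes.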

The third is map-reduce patterns. Run the same extraction prompt over every chunk (map), collect results, then combine them (reduce). Produces exhaustive coverage at the cost of many calls. For a 60k-token document split into 200-token chunks, that’s 300 map calls per document, which is plenty of tokens and time. Worth it when exhaustive coverage matters.

The fourth is hierarchical summarisation. Summarise each chunk; summarise the summaries; produce a top-level summary. Useful for producing a structured understanding of a document before running targeted questions. The “parent-child” hierarchical chunking patterns from the RAG side of the house are a retrieval-time version of the same idea.
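The summarise-the-summaries loop can be sketched as below. The `summarise` callable is a stand-in for a model call, and `fan_in` (how many summaries are merged per step) is an assumed parameter, not something the post prescribes.

```python
from typing import Callable, List


def summarise_hierarchy(chunks: List[str],
                        summarise: Callable[[List[str]], str],
                        fan_in: int = 4) -> List[List[str]]:
    """Build summary levels: level 0 is per-chunk, the last level is one summary."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2 or the tree never shrinks")
    if not chunks:
        return []
    levels = [[summarise([chunk]) for chunk in chunks]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        groups = [prev[i:i + fan_in] for i in range(0, len(prev), fan_in)]
        levels.append([summarise(group) for group in groups])
    return levels


def fake_summarise(texts: List[str]) -> str:
    """Placeholder for an LLM call: keeps the first word of each input."""
    return "|".join(text.split()[0] for text in texts)


levels = summarise_hierarchy([f"c{i} clause text" for i in range(10)], fake_summarise)
print([len(level) for level in levels])
```

Storing every level, not just the top one, is what lets a later query pick the granularity it needs.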

The fifth is context-window hygiene. Even when a document fits, the prompt shouldn’t just be “here’s the document, now the question.” Structure matters: headings preserved, chunk boundaries marked with [chunk N from document X], the question clearly separated, the expected output format spelled out. Prompts that treat the context window as a structured container outperform prompts that treat it as a bucket.
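A prompt assembled as a structured container might look like the sketch below. The section labels (`CONTEXT`, `QUESTION`, `OUTPUT FORMAT`) and the chunk dictionary keys are assumptions chosen for illustration; only the `[chunk N from document X]` marker comes from the text.

```python
def build_prompt(question: str, chunks: list, output_format: str) -> str:
    """chunks: list of dicts with 'doc', 'n', and 'text' keys (hypothetical shape)."""
    parts = ["CONTEXT"]
    for chunk in chunks:
        # Mark every chunk boundary so the model can cite its source.
        parts.append(f"[chunk {chunk['n']} from document {chunk['doc']}]\n{chunk['text']}")
    parts.append(f"QUESTION\n{question}")
    parts.append(f"OUTPUT FORMAT\n{output_format}")
    return "\n\n".join(parts)


prompt = build_prompt(
    question="What clauses govern refund disputes?",
    chunks=[
        {"doc": "contract", "n": 12, "text": "Refund disputes are referred to arbitration."},
        {"doc": "policy", "n": 3, "text": "Refund disputes are escalated to Finance."},
    ],
    output_format="A list of clauses, each with its chunk marker as the citation.",
)
print(prompt)
```

Keeping the question and output format in their own labelled sections after the context means they are never buried between chunks.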

And a softer one: the question itself. “What clauses govern refund disputes?” has a different shape from “summarise the contract”, which in turn has a different shape from “are there conflicts between these two documents?” The first wants retrieval; the second wants hierarchical summarisation; the third wants map-reduce pair comparison. One architecture doesn’t fit all; the question shapes the approach.

What we’ll filter on

  1. Coverage: exhaustive or best-effort?
  2. Cost: tokens consumed per query?
  3. Latency: seconds to answer?
  4. Quality at length: does the approach avoid “lost in the middle”?
  5. Complexity: how much orchestration code does this need?

The long-context landscape

  1. Single large prompt (“just use the 200k window”). Put both documents in one prompt with the question. Cheapest orchestration; most expensive tokens (~165k input per query, roughly $0.50 at Sonnet pricing). Quality degrades past ~30-50k tokens in practice, as needles get lost. Correct for short documents; wrong for 100k+-token corpora.

  2. Retrieval-augmented per query. Chunk both documents into 500-token passages, embed, store. For each query, retrieve top-k from each document (say 10 each), assemble a prompt with 10k tokens of context. Fast, cheap, focused. Misses passages that matter but weren’t retrieved. Correct for specific questions; wrong for exhaustive coverage.

  3. Map-reduce extraction. Split each document into chunks. Map: run the extraction prompt (e.g., “does this chunk discuss refund disputes? If so, quote the relevant sentences”) over every chunk. Reduce: feed all extractions into a combining prompt that organises, dedupes, and cross-references. Exhaustive; expensive (many small calls); slow (parallelisable, but even then ~30 seconds for a pair of documents at 500 map calls).

  4. Hierarchical summarisation. Summarise each chunk; cluster summaries by topic; summarise each cluster; produce document-level summaries. Query-time then operates on summaries (cheap, focused, may miss detail). Useful for multi-query workloads where the summary hierarchy is reused.

  5. Sliding window. Walk the document with an overlapping window (e.g., 20k-token windows with 2k overlap); run the query per window; merge. Exhaustive; simpler than map-reduce; still expensive.

  6. Hybrid: hierarchical-summary-first retrieval. Build a hierarchy (section → chapter → whole document summaries). At query time, start at the top, find the relevant sections via the summaries, retrieve detailed chunks only from those sections. Balances exhaustive with cheap.
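The sliding-window option (item 5) reduces to two small pieces: computing the window spans and merging per-window findings so that a clause found in an overlap is reported once. The sketch below uses the 20k / 2k figures from the list; the function names and the exact-string dedupe are illustrative assumptions.

```python
def window_spans(n_tokens: int, window: int = 20_000, overlap: int = 2_000) -> list:
    """Return (start, end) token offsets for overlapping windows over a document."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    stride = window - overlap
    spans, start = [], 0
    while True:
        end = min(start + window, n_tokens)
        spans.append((start, end))
        if end >= n_tokens:
            return spans
        start += stride


def merge_hits(per_window_hits: list) -> list:
    """Deduplicate findings across windows while keeping first-seen order."""
    seen, merged = set(), []
    for hits in per_window_hits:
        for hit in hits:
            if hit not in seen:
                seen.add(hit)
                merged.append(hit)
    return merged


spans = window_spans(60_000)  # the 60k-token contract
print(spans)
print(merge_hits([["clause 4.2", "clause 7.1"], ["clause 7.1", "clause 9.3"]]))
```

Four windows cover the contract, against hundreds of chunk-level map calls, which is why this pattern needs less orchestration than map-reduce while still costing a lot in tokens.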

Side by side

| Approach | Coverage | Cost | Latency | Quality at length | Complexity |
| --- | --- | --- | --- | --- | --- |
| Single large prompt | Full | Very high | 15-30s | Degrades past ~50k | Lowest |
| Retrieval top-k | Best-effort | Low | 2-4s | Strong | Low (KB does it) |
| Map-reduce | Exhaustive | High | 20-60s parallel | Consistent | Moderate |
| Hierarchical summarisation | Summary-level | Medium (amortised) | Prebuilt, fast | Strong | High (pipeline) |
| Sliding window | Exhaustive | High | 20-60s | Consistent | Low-moderate |
| Hierarchical + retrieval | Near-exhaustive | Medium | 3-8s | Strong | High |

The right approach depends on the question type. For the legal team’s three query shapes (find clauses, summarise, find conflicts), no single approach dominates. The realistic system picks per query.

A decision tree for “which approach per question”

Which long-context approach per question?

Question comes in (two documents in scope)
  Narrow, fact-seeking? ("what does the contract say about X?")
    yes → Top-k retrieval: 10 chunks per doc via KB; ~10k tokens context; ~4s · ~$0.02 / query; risk: misses low-ranked passages
    no ↓
  Exhaustive list? ("every clause of type X")
    yes → Map-reduce extraction: per-chunk extract, per-doc reduce; parallelise map across chunks; ~30s · ~$1-2 / query; cheaper with Haiku for the map step
    no ↓
  Cross-document? ("find conflicts")
    yes → Paired map-reduce: extract from each → pair reduce; find semantic overlaps by topic; ~40s · ~$2-3 / query; reuse extraction across queries
    no → Hierarchical summary: prebuilt tree; query on summaries; ~3s · ~$0.05; build cost amortised

Each question shape gets its own architecture. The cost of routing a query to the wrong approach is either wrong answers (missed coverage) or wasted tokens (over-expensive call).

The picks in depth

Top-k retrieval for narrow questions. “What does the contract say about data retention?” is a clause-hunting query. Chunk both documents at 500 tokens with 50-token overlap, embed with Titan Text Embeddings v2, store in OpenSearch Serverless. Query time: embed the question, retrieve top-10 from each document with metadata filter doc_id IN (contract_id, policy_id), assemble a prompt with the 20 chunks marked by source, ask the question. Total prompt: ~12k tokens. Response: focused, cited, ~4s, a couple of cents. Risk: if an answer sits in a chunk that didn’t rank top-10, we miss it. Mitigation: the retriever’s re-ranker reorders candidates; bump top-k to 15 for legal documents where precision matters.

Map-reduce for exhaustive extraction. “List every clause governing refund disputes” is an exhaustive query. The map prompt, run over each chunk: “Does this text contain any clause relating to refund disputes? If so, quote the relevant sentences verbatim and give a one-line explanation of the clause’s effect.” The reduce prompt takes all the map outputs (a few kilobytes of quoted passages) and dedupes, groups by topic, and produces a structured list. Map step can run on Claude Haiku ($0.25/M input, $1.25/M output) to save costs; reduce runs on Sonnet for quality. 300 chunks × ~$0.002 each = $0.60 for the map step; one Sonnet reduce call = ~$0.10. Parallelise map calls via asyncio / Step Functions Map state; total wall-clock ~30 seconds.
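A parallel map step with a concurrency cap can be sketched with asyncio as below. The `extract` and `reduce_fn` callables are placeholders for the Haiku map call and the Sonnet reduce call; their names, the stub implementations, and the `max_concurrency` default are all assumptions for illustration.

```python
import asyncio


async def map_reduce(chunks, extract, reduce_fn, max_concurrency: int = 20):
    """Run extract() over every chunk concurrently, then combine the non-empty hits."""
    semaphore = asyncio.Semaphore(max_concurrency)  # stay under provider rate limits

    async def run_one(chunk):
        async with semaphore:
            return await extract(chunk)

    results = await asyncio.gather(*(run_one(chunk) for chunk in chunks))
    hits = [result for result in results if result]  # drop chunks with nothing relevant
    return await reduce_fn(hits)


async def fake_extract(chunk: str):
    """Stub for the map prompt: returns the chunk if it mentions refunds."""
    await asyncio.sleep(0)  # stands in for network latency
    return chunk if "refund" in chunk else None


async def fake_reduce(hits: list) -> list:
    """Stub for the reduce prompt: dedupe and sort."""
    return sorted(set(hits))


chunks = [
    "refund within 30 days",
    "governing law is England and Wales",
    "refund disputes go to arbitration",
    "refund within 30 days",
]
result = asyncio.run(map_reduce(chunks, fake_extract, fake_reduce))
print(result)
```

The semaphore is the design choice worth noting: without it, 300 simultaneous calls are likely to hit throttling, and retries eat the wall-clock time that parallelism was supposed to save.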

Paired map-reduce for cross-document analysis. “Find conflicts between the contract’s refund policy and the company’s refund policy” is the hardest shape. Extract clauses on “refund” from each document via map-reduce (reusing extractions if they were computed earlier). Then run a pair-comparison prompt: given extractions from Document A and Document B, identify pairs where they speak to overlapping topics and call out differences. The comparison step can be quadratic (every A-clause against every B-clause), but clustering by topic first reduces it to O(topics × clauses_per_topic). Total cost ~$2-3 per query; total time ~40 seconds. Cache extractions so a second cross-document query on the same pair is cheap.
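The saving from clustering by topic is easiest to see by counting candidate pairs. This sketch assumes each extracted clause already carries a topic label (for example, assigned during the reduce step); the labels and clause ids are made up for illustration.

```python
from collections import defaultdict


def candidate_pairs(clauses_a: list, clauses_b: list) -> list:
    """clauses: list of (topic, text). Pair only clauses that share a topic."""
    by_topic_b = defaultdict(list)
    for topic, text in clauses_b:
        by_topic_b[topic].append(text)
    pairs = []
    for topic, text_a in clauses_a:
        for text_b in by_topic_b.get(topic, []):
            pairs.append((topic, text_a, text_b))
    return pairs


contract = [("refund", "A1"), ("refund", "A2"), ("liability", "A3")]
policy = [("refund", "B1"), ("data", "B2"), ("liability", "B3")]

all_pairs = len(contract) * len(policy)        # every A-clause against every B-clause
clustered = candidate_pairs(contract, policy)  # only same-topic comparisons
print(all_pairs, len(clustered))
```

In practice each topic's clauses from both documents would probably go into a single comparison prompt, so the number of calls scales with the number of shared topics rather than with the number of pairs.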

Hierarchical summarisation for summary-style questions. “Give me an executive summary of the contract” or “what’s the shape of this policy manual.” Built offline: summarise each section (chunk), summarise each chapter (group of sections), summarise the whole document. Store as a tree. Query time: traverse the tree to find the correct granularity for the question. Build cost once per document; query cost afterwards is a few cents.

Routing. A thin classifier in front (a cheap Haiku call or a regex-based heuristic) picks the approach per question. “List all”, “every”, “find all” triggers map-reduce. “Compare”, “conflict”, “differ” triggers paired map-reduce. “Summarise”, “overview” triggers hierarchical. Everything else defaults to top-k retrieval. The router is allowed to be crude: when a trigger fires on a simpler question, the cost is a slower, more expensive answer, not a wrong one. The case to watch is the reverse, an exhaustive question that falls through to the retrieval default and inherits retrieval’s coverage risk.
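The regex variant of the router fits in a dozen lines. The trigger words are the ones listed above; the route names, the word-boundary patterns, and the order in which routes are checked are assumptions for this sketch.

```python
import re

# Checked in order; the first matching pattern wins (ordering is an assumption).
ROUTES = [
    ("map_reduce", re.compile(r"\b(list all|every|find all)\b", re.IGNORECASE)),
    ("paired_map_reduce", re.compile(r"\b(compare|conflict\w*|differ\w*)\b", re.IGNORECASE)),
    ("hierarchical", re.compile(r"\b(summari[sz]e|overview)\b", re.IGNORECASE)),
]


def route(question: str) -> str:
    """Pick an approach from trigger words; default to top-k retrieval."""
    for name, pattern in ROUTES:
        if pattern.search(question):
            return name
    return "top_k_retrieval"


for q in [
    "List every clause about liability limits in both documents.",
    "Are there conflicts between these two documents?",
    "Summarise the policy manual for me.",
    "What does the contract say about force majeure?",
    "Does the policy actually align with the contract on refunds?",
]:
    print(f"{route(q):18} <- {q}")
```

The last question contains no trigger word and falls through to retrieval even though it is a cross-document comparison. Phrasing like that is the argument for the Haiku-call variant of the classifier, or for widening the keyword list as real queries accumulate.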

Queries over a typical day on one contract/policy pair:

Q1 "What's the payment schedule in the contract?"
  → narrow, retrieval
  → 4s, $0.02

Q2 "List every clause about liability limits in both documents."
  → exhaustive, map-reduce
  → 35s, $1.80 (Haiku map + Sonnet reduce)

Q3 "Does the policy actually align with the contract on refunds?"
  → cross-document, paired map-reduce
  → 45s, $2.50 (extractions cached for Q4 later)

Q4 "List every refund-related clause in both documents."
  → exhaustive, map-reduce
  → 30s, $1.60 (cached extractions from Q3 reused; re-verify only)

Q5 "Summarise the policy manual for me."
  → summary, hierarchical (prebuilt)
  → 3s, $0.04

Q6 "What does the contract say about force majeure?"
  → narrow, retrieval
  → 4s, $0.02

Total: 6 queries, ~2 minutes of model time, ~$6 in Bedrock spend
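The totals can be reproduced directly from the per-query figures above. The tuple layout and the grouping into "heavy" approaches are just a convenient way to lay out the sum, not part of the system.

```python
# (query, approach, seconds, dollars) copied from the list above.
queries = [
    ("Q1", "retrieval", 4, 0.02),
    ("Q2", "map-reduce", 35, 1.80),
    ("Q3", "paired map-reduce", 45, 2.50),
    ("Q4", "map-reduce", 30, 1.60),
    ("Q5", "hierarchical", 3, 0.04),
    ("Q6", "retrieval", 4, 0.02),
]

total_seconds = sum(seconds for _, _, seconds, _ in queries)
total_cost = round(sum(cost for _, _, _, cost in queries), 2)

# Share of spend that comes from the exhaustive and cross-document queries.
heavy_cost = round(sum(cost for _, approach, _, cost in queries
                       if "map-reduce" in approach), 2)

print(total_seconds, total_cost, heavy_cost)
```

Nearly all of the spend comes from the three map-reduce queries; the retrieval and summary queries together cost about ten cents.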

Compared to a naive “both documents in one prompt” approach:

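6 queries × 165k input tokens × $3/M ≈ $0.50 per query × 6 ≈ $2.97

On input tokens alone that is cheaper than the routed day, because the three heavy queries (Q2, Q3, Q4) account for almost all of the ~$6. The single large prompt loses elsewhere: the narrow and summary queries cost ~$0.50 each instead of a few cents, every query takes 15-30 seconds, and answer quality would be worse on the exhaustive queries because of lost-in-the-middle effects.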

What’s worth remembering

  1. The context window is a budget, not a bucket. Every token competes with every other. Use structure.
  2. Quality degrades well before the hard limit. “Lost in the middle” is a real effect past ~30-50k context tokens, even when the window can hold more.
  3. One approach doesn’t fit all questions. Narrow questions want retrieval; exhaustive questions want map-reduce; summaries want hierarchical pre-builds. Route.
  4. Map-reduce is the hammer for exhaustive coverage. Use Haiku for the map step, Sonnet for the reduce step. Parallelise the map.
  5. Chunk size depends on the query shape. 300-500 tokens for QA; 1000+ for summarisation-style. Experiment; don’t adopt defaults blindly.
  6. Cache what you extract. A cross-document conflict query reuses the per-document extractions. Keep them with a TTL.
  7. Cost scales with approach, not question difficulty. A “hard” question can be cheap if routed correctly; an easy question can be expensive if routed wrong.
  8. The router can be crude. A classifier picks the approach; misrouting toward a heavier approach produces over-expensive answers, not wrong ones. Worry about quality first, router precision later.
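Item 6's extraction cache can be as simple as a dictionary with timestamps. This is a minimal in-memory sketch; the class name, the `(doc_id, topic)` key, and the injectable clock are assumptions, and a deployed system would more likely use an external store with native expiry.

```python
import time


class ExtractionCache:
    """Cache per-document extractions keyed by (doc_id, topic), expiring after a TTL."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so expiry can be tested
        self._store = {}

    def get_or_compute(self, doc_id: str, topic: str, compute):
        key = (doc_id, topic)
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]          # fresh hit: skip the map-reduce run
        value = compute()            # miss or expired: run the extraction
        self._store[key] = (now, value)
        return value


cache = ExtractionCache(ttl_seconds=3600)
runs = []


def extract_refund_clauses():
    runs.append("ran")               # stands in for an expensive map-reduce pass
    return ["Refund disputes go to arbitration."]


first = cache.get_or_compute("contract", "refund", extract_refund_clauses)
second = cache.get_or_compute("contract", "refund", extract_refund_clauses)
print(len(runs), first == second)
```

The TTL matters because contracts and policies get revised; keying on a document version or content hash instead of a bare id would be a reasonable alternative.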

A 400-page contract and a 200-page policy manual, six questions in a day, answers that cite their sources and don’t lose clauses in the middle. Not because the context window is big enough; because the system doesn’t try to stuff everything in every time.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.