Combining RAG and Fine-Tuning for a Legal Contract Assistant

June 03, 2026 · 18 min read

Generative AI Developer Professional · AIP-C01 · part of The Exam Room

The situation

A legal-technology startup is building a contract review assistant for a mid-sized commercial firm. The in-product model answers two shapes of question: “What does this clause mean in the context of our past drafting?” and “Where have we seen this indemnity construction before, and how did we negotiate it?”

The constraints:

  • Corpus: ~200,000 past contracts, amendments, side letters, and internal case studies. Roughly 40 GB of text-heavy PDFs, Word documents, and Markdown notes after extraction. Growing by ~500 new matters a month.
  • Voice: every answer references clauses by section number (§3.2(b)), uses the firm’s preferred hedging (“the drafting is ambiguous on this point” rather than “this is unclear”), and cites internal precedents in the firm’s matter-number format.
  • Refusal: questions outside commercial contract law (tax, immigration, employment) get a structured decline with a pointer to the correct in-house team. Nothing off-domain.
  • Budget: AUD$100,000 end-to-end for customisation – data preparation, TrainingThe process of fitting a model’s weights to data by minimising a loss function. , evaluation, first quarter of InferenceRunning a trained model to produce output – as opposed to training it. .
  • Timeline: three months to a pilot with fee-earners.
  • Platform: Bedrock. Nothing self-hosted.

What actually matters

Three customisation levers are on the table – retrieval-augmented generation, supervised fine-tuning, continued pre-training – and the instinct to pick one of them is the mistake. The levers aren’t substitutes; they answer different questions. The first question is what kind of problem is “be correct about 200,000 contracts”? It’s a retrieval problem. Facts about specific documents live in documents, not in weights, and any approach that tries to memorise 200,000 specific contracts is either astronomically expensive or silently unfaithful. That shape pushes the “what does the corpus say?” half of the design toward retrieval by default, and the choice of VectorAn ordered list of numbers – in AI usage, almost always an embedding – and by extension the databases that index them for nearest-neighbour search. and EmbeddingA fixed-length vector of floats that represents a piece of text (or image, or other thing) in a space where similar meanings sit close together. model becomes the interesting part.

The second question is what kind of problem is “sound like the firm”? It’s a behaviour problem. The firm’s voice is a set of rules – hedged phrasings, §-citations, matter-number formats, the polite decline when the question drifts into tax law. Rules about how to write aren’t facts; they’re patterns of output conditioned on input. Teaching those patterns through the System promptThe instruction block that frames the model’s behaviour for a session, separate from the user’s messages. works up to a point, and then starts drifting under adversarial phrasing or long conversations. Baking the rules into the weights via supervised fine-tuning means a short prompt is enough to invoke them and a jailbreak costs more than a system-prompt line to get around. That pushes the “how should the model say it?” half toward training, with the labelled dataset becoming the artefact that encodes the firm’s style guide as a training signal.

The third is what’s the planning horizon on each piece? The corpus grows by 500 matters a month. The style guide changes when a senior partner wins an argument about hedging. The refusal list changes when a user finds a new way to ask about divorce. A two-person platform team can absorb weekly ingestion (ingest jobs on object-storage events) and quarterly fine-tune refreshes (lawyer curates deltas, trigger a training run) but cannot absorb monthly retrains of anything that reads 40 GB. That cadence asymmetry is the strongest argument against continued pre-training in this project: its refresh cycle is weeks, not days, and its cost is per-token-processed on an unlabelled 40 GB corpus. The pay-off exists only when the base model’s vocabulary is genuinely wrong, and commercial contract English is squarely inside what a modern hosted model has already read.

The fourth is where does the budget actually get spent? AUD$100K in three months looks like training compute at first glance and turns out to be hosting commitments on inspection. Custom-trained models on a managed-model platform typically can’t be served on the standard pay-per-token rate – they need a reserved-capacity commitment – and that is the line item most often under-estimated. The budget shape for any approach that ships custom weights is low-training plus high-fixed-serving, and the architectural consequence is that fine-tuning earns its place only when the behaviour change is worth the always-on hourly burn. A pure retrieval approach has a different cost shape: low-fixed plus variable-per-query, which is correct for a pilot with light traffic.

The fifth is what does a wrong answer look like and who catches it? A model that gets the voice correct but hallucinates clause numbers is worse than an un-tuned model that cites faithfully. The evaluation harness has to score citation faithfulness (every § reference traces back to a retrieved chunk) separately from voice (did the model write like a partner?) because the two signals tell the team different things – citation faithfulness moves when retrieval changes, voice moves when training drifts. Without separate scores the team can’t tell which half to fix.

Finally: what buys the right to change our mind? A retrieval-only baseline ships in weeks and answers faithfully but boringly. Adding a fine-tune on top adds voice without re-doing the retrieval. If the firm decides in year two that Welsh property law has become a practice area, the retrieval corpus picks up the documents immediately and the fine-tune picks up the phrasing on the next quarterly refresh. If instead the team had picked continued pre-training, adding a new sub-domain would mean another round of training on another tranche of unlabelled text.

What we’ll filter on

Five filters to score the landscape against.

  1. Corpus GroundingConstraining a model to answer from provided sources rather than from whatever it absorbed during training. . Two hundred thousand documents the model has never seen, with new ones arriving weekly. The answer has to reflect the current corpus, not a snapshot frozen at training time.
  2. Voice and format. The firm’s phrasing and citation style are rules about how to write, not facts about the world. The model needs to internalise them so a prompt doesn’t re-teach them every turn.
  3. Refusal. Off-domain questions must be declined in a structured way. A behavioural policy that has to hold under adversarial prompting.
  4. Budget and timeline. £50K and 90 days. Any method that blows either is out.
  5. Maintainability. A two-person platform team. Customisation has to be refreshable when the corpus grows or the style guide changes, without a full retrain every time.

The customisation landscape

Bedrock gives five levers that could plausibly shape model behaviour.

Prompt engineering alone. Cheapest. System prompt with the style guide, few-shot examples, refusal instructions. Works well for voice and refusal when the base model is capable – Claude Sonnet follows detailed style instructions to a fault. Fails the corpus attribute: 200,000 documents don’t fit in any prompt.

Retrieval-augmented generation. The corpus lives in a vector store; every question retrieves relevant chunks, and those chunks ride into the prompt alongside the user’s question. Facts stay outside the weights – updating the corpus is an ingestion job, not a training job. Citations fall out naturally because the model knows which chunk each claim came from. On Bedrock: Knowledge Bases plus RetrieveAndGenerate, backed by OpenSearch Serverless, Aurora pgvector, S3 Vectors, or third-party stores.

Supervised fine-tuning. Show a base model a labelled dataset of (prompt, ideal response) pairs; adjust weights so outputs move closer to the ideal. On Bedrock: Claude 3 Haiku (us-west-2), Meta Llama 3.1 / 3.2 / 3.3 across 1B-70B, Amazon Nova Micro / Lite / Pro, plus Titan Text. Writes a custom model that must be served via provisioned throughput – on-demand isn’t available. Training cost is modest (Llama 2 70B fine-tune training is ~$0.00799 per 1,000 TokenThe unit of text an LLM actually sees – usually a short character sequence, not a whole word. ; custom model storage $1.95/month). Teaches style, format, and behaviour; does not reliably teach facts.

Continued pre-training. Keep training a base model on a large body of unlabelled domain text using the same objective that originally pre-trained it. Shifts the model’s distribution of language toward the domain. Historically supported on Amazon Titan Text; not on Claude, Llama, or Nova. Heavyweight; training cost proportional to tokens processed; output still needs provisioned throughput to serve.

Bedrock Custom Model Import. Bring weights trained elsewhere (Llama / Mistral / compatible architectures) and serve them through the Bedrock API. Provisioned-only; us-east-1 and us-west-2. A packaging choice, not a fresh customisation lever.

Side by side

Lever Corpus grounding Voice & format Refusal behaviour Budget/timeline Maintainability
Prompt engineering alone
RAG (Knowledge Bases)
Supervised fine-tuning
Continued pre-training
Custom Model Import

No single lever clears all five. Two stacked clear all five: RAG for the corpus, fine-tuning for voice and refusal.

Matching the levers to the question

"What does the corpus say?" 200K contracts, growing weekly "How should the model say it?" voice, §-citations, refusals "What vocabulary does it know?" commercial contract English -- already fine RAG Knowledge Bases on OpenSearch Serverless Titan V2 1,024-dim, hierarchical chunks ingestion = config change, not a training job Supervised fine-tuning Claude 3 Haiku (us-west-2) ~1,500 (prompt, ideal) pairs custom model = provisioned throughput only Continued pre-training Titan Text on raw 40 GB priced per training token parked: base vocab is already correct; budget can't absorb it Fine-tuned Haiku served on provisioned throughput behind RetrieveAndGenerate corpus chunks pulled at inference; voice + refusal already in the weights ~£37-43K of £50K -- room for evaluation and one iteration cycle after fee-earner feedback RAG refresh = weekly cron. Fine-tune refresh = quarterly.
Three questions, three levers, two picked. The RAG path pulls corpus facts in at inference; the fine-tune path bakes voice and refusal into weights offline. Continued pre-training stays parked -- the base vocabulary is already correct, and the budget can't carry it alongside the two that earn their place.

The RAG + fine-tune split, in depth

The instinct to pick one customisation method comes from treating them as interchangeable. They aren’t. Each answers a different question.

RAG answers “what does the corpus say?” Facts about 200,000 specific contracts live in the vector store. A question about a force-majeure clause retrieves the dozen most relevant past instances; the model reads them at inference time and reasons about them. Adding a new matter is an ingestion job – the vector store grows by one document, the model doesn’t change. Removing a retracted matter is a delete on a few vectors. The corpus is a living index, not a snapshot baked into weights.

Fine-tuning answers “how should the model say it?” The firm’s voice – hedged, precise, §-citing – is a set of stylistic rules. A few hundred labelled examples teach the model those rules in its weights. After fine-tuning, a twenty-line system prompt produces voice-compliant answers where an un-tuned model would need two hundred lines of style-guide text and still drift under pressure.

Continued pre-training answers “what vocabulary does the model know?” Useful when the base model genuinely doesn’t speak the domain’s language – regulatory filings in a rare jurisdiction, argot from a century-old trade, notation from a narrow sub-field. Commercial contract English doesn’t qualify. Claude has read plenty of contracts.

The three aren’t substitutes – they stack. A fully-customised model in a demanding domain might do all three: CPT on domain text, fine-tune on (prompt, response) pairs, then wrap in RAG at inference. For this situation, two of the three clear every attribute and the third is overkill.

A worked decision trace

Attribute 1 – 200,000-document corpus. RAG ingests into OpenSearch Serverless via Knowledge Bases. Titan Text Embeddings V2 at 1,024 dimensions. Hierarchical chunking – child ~300 tokens for retrieval precision, parent ~1,500 tokens for generator context. Metadata sidecars tag each document with matter number, practice area, and client. Weekly refresh via EventBridge calling StartIngestionJob; deltas only. Fine-tuning doesn’t touch this – the fine-tuned model calls the same vector store as an un-tuned one.

Attribute 2 – voice and citation format. A lawyer-in-the-loop curates ~1,500 (prompt, ideal-response) pairs over four to six weeks. Each pair is a real question-and-answer exchange, reviewed and edited to the firm’s style guide: hedged phrasing, §X.Y(z) references, matter-number citations. The dataset trains Claude 3 Haiku via Bedrock fine-tuning in us-west-2 – the only Claude option for fine-tuning today. Llama 3.3 70B would be the alternative if quality required it; a fine-tuned 70B on provisioned throughput is materially more expensive per hour, and Haiku should clear the bar.

Attribute 3 – refusal on off-domain questions. A subset – perhaps 300 of the 1,500 pairs – are refusal examples. Fine-tuning bakes this into the weights. The system prompt reinforces it; default behaviour under a prompt-injection attempt holds much better than a prompt-only approach would.

Attribute 4 – £50K and 90 days. Budget pass below. Both methods fit; CPT doesn’t.

Attribute 5 – maintainability. RAG updates are ingestion; no retrain needed when a new matter lands. Fine-tuning refreshes happen quarterly, when the style guide evolves or refusal patterns grow. A two-person platform team runs ingestion continuously and the fine-tune four times a year.

Cost shape: where the pounds land

The cost profile differs in shape, not just size.

RAG: low fixed, variable with queries. One-off ingestion cost (embedding 40 GB at Titan V2’s per-token rate, a few thousand pounds, plus incremental weekly deltas), baseline vector-store cost (OpenSearch Serverless at 2-OCU minimum, ~£260/month), per-query embedding plus generation cost.

Fine-tuning: low training, high fixed serving. Training a Haiku fine-tune on 1,500 pairs runs in the low hundreds of pounds; custom model storage $1.95/month. The catch is serving: fine-tuned models run on provisioned throughput only, a minimum hourly burn from deployment. Haiku-tier MUs are cheaper than the Llama 2 70B reference ($21.18/hour, ~$15,750/month on a 1-month commit) but still add up to several thousand pounds a month.

Continued pre-training: high training and high fixed serving. Pricing is per token processed; at 40 GB raw text (~10 billion tokens), one pass is a serious bill before fine-tuning or evaluation begin.

Budget pass, GBP:

  • Data preparation – PDF extraction, chunking pipeline, metadata tagging, the 1,500-pair dataset curated by a lawyer: ~£15K.
  • RAG ingestion + 2-OCU OpenSearch Serverless for three months: ~£3K.
  • Fine-tune training plus iteration cycles: ~£1K.
  • Provisioned throughput for the fine-tuned Haiku, three months: ~£15-20K.
  • Bedrock Evaluations weekly against a 200-question golden set: ~£2K.
  • Generation cost for the pilot at low query volume: ~£1-2K.

Total: ~£37-43K of £50K – headroom for a Sonnet evaluation judge and a round of iteration.

Evaluation: the quiet third leg

A contract review assistant that gets the voice correct but hallucinates clauses is worse than one that gets the voice vaguely correct but cites faithfully. Evaluation matters as much as the customisation choice.

The golden dataset: ~200 real questions from the firm’s advice history, with expected answers reviewed by a senior lawyer. Refreshed quarterly. Includes questions the system should refuse.

Automatic metrics via Bedrock Evaluations: citation faithfulness (every § reference traces back to a retrieved chunk), answer accuracy against the lawyer-reviewed reference, and refusal correctness. Citation faithfulness tells you whether RAG is doing its job; refusal correctness tells you whether fine-tuning is doing its job.

Human review: a weekly spot check by a senior lawyer on a random sample, scoring on “would I have said it this way?” When rubric scores drop, the fine-tune dataset needs refreshing; when citation faithfulness drops, retrieval is returning the wrong chunks.

What’s worth remembering

  1. Three customisation methods answer three different questions. RAG: what does the corpus say? Fine-tuning: how should the model say it? CPT: what vocabulary does it know? Treating them as substitutes leads to picking wrong.
  2. RAG via Bedrock Knowledge Bases handles 200K-document corpora with weekly updates through incremental ingestion – no retrain required. Citations fall out of retrieval, not out of weights.
  3. Supervised fine-tuning on Bedrock supports Claude 3 Haiku (us-west-2), Meta Llama 3.1 / 3.2 / 3.3, Amazon Nova Micro / Lite / Pro, and Amazon Titan. Not Sonnet, not Opus, not Llama 4 MoE.
  4. Fine-tuned custom models must be served via provisioned throughput. On-demand isn’t available. The minimum hourly commitment is the line item that most often blows a customisation budget.
  5. Continued pre-training uses unlabelled text and the base pre-training objective to shift the model’s language distribution. Heavyweight, priced per training token, still needs provisioned throughput. Correct when base vocabulary is wrong; wrong when the corpus is just more of what the base already reads.
  6. Cost shapes differ. RAG: low fixed, variable with queries. Fine-tuning: low training, high fixed serving. CPT: high training and high fixed serving. Budget discipline comes from knowing the shape, not just the sticker price.
  7. Custom Model Import packages an externally-trained model into Bedrock’s inference surface – a deployment choice, not a customisation method. Provisioned-only; us-east-1 and us-west-2 only.
  8. Evaluation is the third leg. Bedrock Evaluations for automatic citation faithfulness, accuracy, and refusal correctness; human review for voice. Without it, neither RAG nor fine-tuning is maintainable.

The answer: Bedrock Knowledge Bases for RAG over the 200,000-document corpus – Titan Text Embeddings V2 at 1,024 dimensions, hierarchical chunking, metadata filtering by matter number and practice area, weekly incremental ingestion from S3. Supervised fine-tuning of Claude 3 Haiku in us-west-2 on ~1,500 lawyer-curated (prompt, response) pairs covering voice, §-citation format, and structured refusals. The fine-tuned Haiku serves via provisioned throughput behind RetrieveAndGenerate, so every inference call pulls relevant chunks from the knowledge base and hands them to a model that already knows how to write in the firm’s voice. Continued pre-training is parked – the sub-domain doesn’t need it, and the budget can’t afford it alongside fine-tuning and RAG. Evaluation runs weekly. RAG for the what, fine-tuning for the how.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.