Prompt, Retrieve, or Fine-Tune

May 06, 2026 · 18 min read

AI Practitioner · AIF-C01 · part of The Exam Room

A legal-ops team wants a tool where paralegals can ask questions about the company’s 4,000 in-house contract templates: “what’s our standard limitation-of-liability clause?”, “which templates require notarisation?”, “has this indemnity language been approved for France?” The first prototype – a plain Claude call with the question in the prompt – invents clause numbers that don’t exist. Someone suggests fine-tuning the model on the templates. Someone else suggests RAG. A third voice wants to try better prompts first. The interesting question isn’t which is better, because they solve different problems – it’s which problem the team actually has.

The situation

The in-house legal team maintains 4,000 contract templates across 12 jurisdictions and 30 contract types (employment, NDA, master services, licensing, etc.). Each template is between 5 and 80 pages. They live in SharePoint today; an S3 bucket is being stood up to mirror them. Templates change: roughly 50 are updated every month when a jurisdiction’s law changes or a clause is renegotiated at the enterprise level.

Paralegals currently find the correct clause by searching SharePoint for keywords and skimming results. The turnaround for “what’s the standard force majeure language for a French SaaS contract?” is 10-15 minutes of human grepping. Legal-ops wants this under 30 seconds with a citation back to the exact template and clause.

A first prototype called Claude with a prompt like “Here’s the question: what’s our standard limitation-of-liability clause? Answer with the full text.” It produced confident, plausible, and wrong answers: clauses that sounded correct but used clause numbers the templates don’t use, jurisdictions the clause doesn’t actually cover, and in one case a citation to a template that doesn’t exist. The team calls this “hallucination” and wants it gone.

Three techniques are on the table. Prompt engineering (rewrite the prompt better). RAG (retrieve relevant template excerpts and include them in the prompt). Fine-tuning (train the model on the template corpus until it knows the language natively). The meeting wants a decision.

What we might want from this

The three techniques are not interchangeable. They solve different problems, and the first mistake is treating them as points on a single “quality” axis.

Prompt engineering is changing the text you send to the model. Better instructions, worked examples in the prompt (few-shot), explicit format requirements, a system prompt that sets the model’s persona. It costs nothing: no infrastructure, no training run, no new data pipeline. It is also the only technique that works if what the model is producing is the wrong shape – too long, too short, wrong tone, missing a required field. A hallucinating model doesn’t need a better prompt alone; it needs information it doesn’t have. Prompt engineering is necessary, always, but rarely sufficient on its own for a knowledge-grounding problem.
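
To make that concrete, here’s a minimal sketch of the prompt-engineering layer using the Bedrock Converse API via boto3. The system prompt, the few-shot example, and the template names in it are illustrative, not the team’s actual wording:

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="eu-west-1")

# System prompt: role, constraints, and an explicit refusal instruction.
system = [{"text": (
    "You are a legal research assistant. Answer only from source text provided "
    "in the conversation. If the sources do not contain the answer, say so. "
    "Cite the template name and section number."
)}]

# Few-shot: one worked example showing the desired answer shape, then the real
# question. The example template names are illustrative placeholders.
messages = [
    {"role": "user", "content": [{"text": "Which templates require notarisation?"}]},
    {"role": "assistant", "content": [{"text": "Notarisation is required by [Employment-DE-v3.1, §9.2] and [Property-ES-v2.0, §4.1]."}]},
    {"role": "user", "content": [{"text": "What is our standard limitation-of-liability clause?"}]},
]

response = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
    system=system,
    messages=messages,
    inferenceConfig={"maxTokens": 1024, "temperature": 0},
)
print(response["output"]["message"]["content"][0]["text"])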

Retrieval-augmented generation is pulling relevant documents into the prompt at query time. The model doesn’t need to know the 4,000 templates; it needs to be handed the correct three when the paralegal asks a question. The architecture is: pre-compute vector embeddings of each template chunk (a chunk being a paragraph, a section, or a page), store them in a vector database, and at query time embed the user’s question, retrieve the top-k most similar chunks, and include them in the model’s prompt along with the question. The model answers from the retrieved text. If the retrieval is good, hallucination drops to near-zero on questions the corpus can answer.
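
A minimal sketch of that loop in boto3, with an in-memory list of pre-embedded chunks standing in for the vector database (the chunk content and model IDs are illustrative; a real system would use OpenSearch, pgvector, or a managed knowledge base):

import json
import math
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="eu-west-1")

def embed(text):
    """Turn text into a vector using Titan Text Embeddings v2."""
    resp = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(resp["body"].read())["embedding"]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Pre-computed offline, one entry per chunk; an in-memory stand-in for the vector store.
clause = ("Except for breaches of confidentiality or indemnification obligations, "
          "each party's aggregate liability shall not exceed the fees paid ...")
chunks = [
    {"source": "SaaS-FR-v4.2.docx §14.3", "text": clause, "vector": embed(clause)},
    # ... thousands more chunks, embedded once and stored ...
]

# At query time: embed the question, rank chunks by similarity, inject the top-k.
question = "What is our standard limitation of liability clause for French SaaS agreements?"
q_vec = embed(question)
top = sorted(chunks, key=lambda c: cosine(q_vec, c["vector"]), reverse=True)[:3]

context = "\n\n".join(f"[{c['source']}]\n{c['text']}" for c in top)
prompt = f"Answer only from these sources:\n\n{context}\n\nQuestion: {question}"
# `prompt` then goes to the generation model, as in the previous sketch.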

Fine-tuning adjusts the weights of the model on a training dataset. For generative models, this usually means providing input-output pairs (“when you see X, produce Y”) and running a training job that nudges the model’s behaviour toward those examples. Fine-tuning teaches style, format, or specialised vocabulary – not facts. A fine-tuned model trained on the templates would learn to sound like a legal template, would learn the vocabulary and cadence of the corpus, but would still not reliably cite a specific clause number. Facts go stale the moment the corpus changes; weights don’t update when a template does.
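
For a sense of what that training data looks like: Bedrock text-model fine-tuning reads JSONL of prompt/completion pairs from S3 (the exact schema varies by base model; conversational models use a messages format instead). The pairs below are illustrative only:

import json

# Hand-written (question, ideal answer) pairs; illustrative content only.
pairs = [
    {"prompt": "What is our standard limitation-of-liability clause for French SaaS agreements?",
     "completion": "SaaS-FR-v4.2, §14.3: 'Except for breaches of confidentiality ...'"},
    {"prompt": "Which templates require notarisation?",
     "completion": "Notarisation is required by ... (template list)."},
]

# One JSON object per line, uploaded to S3 as the training dataset.
with open("training.jsonl", "w") as f:
    for pair in pairs:
        f.write(json.dumps(pair) + "\n")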

The second thing worth thinking about is update frequency. The templates change 50 times a month. Fine-tuning takes hours to days to run and costs money per run; running it weekly to stay current would be expensive and would leave a gap between “clause updated” and “model knows.” RAG updates by re-embedding a changed document – minutes, maybe seconds – and the next query sees the new version immediately. Freshness is a first-class requirement for this domain, and fine-tuning does badly on freshness.

The third is data volume and quality for fine-tuning. Bedrock fine-tuning typically wants hundreds to thousands of high-quality labelled examples for the behaviour you’re trying to teach. For the legal-ops team, that would mean writing hundreds of (question, ideal-answer) pairs by hand – the kind of project that takes months and is done infrequently. By contrast, RAG needs the documents (already have them) and an embedding model (off the shelf). The bar to entry is lower by an order of magnitude.

The fourth is cost shape at inference. Prompt engineering and RAG both run on standard on-demand Bedrock pricing. Fine-tuned Bedrock models require provisioned throughput – a commitment of model units for 1 or 6 months. That is a very different cost profile: RAG at 500 queries a day costs cents; provisioned throughput for a custom model is thousands of dollars a month whether anyone uses it or not. Fine-tuning is appropriate when volume is high and predictable enough to saturate a provisioned endpoint. Not a legal-ops team of 20 paralegals.

The fifth is explainability. When the model cites a specific clause, the paralegal wants to click through to the source template. RAG gives that for free – the retrieved chunks are the citation. Fine-tuning erases the provenance: the model emits text that came from somewhere in training, but “somewhere” isn’t a clickable link.

The attributes that matter

Six filters, applied to each of the three techniques.

  1. Corrects hallucination on proprietary data – does the technique give the model access to the actual templates?
  2. Adapts format and tone – does the technique change how the model writes (length, structure, register)?
  3. Handles data freshness – when a template updates, how fast does the system reflect it?
  4. Setup cost – time and data-labelling effort to get to first working version?
  5. Inference cost shape – on-demand per token, provisioned, or something else?
  6. Provides citations – can the paralegal trace the answer to a specific source?

The three-technique landscape

1. Prompt engineering. Refining the text sent to the model. For a question-answering task: system prompt that sets role and constraints (“You are a legal research assistant. Only answer based on provided source text. If the source doesn’t contain the answer, say so.”); few-shot examples showing the desired question-answer shape; instructions on format (“cite the template name and section number”). No new AWS services; the work is in the application code. Inference cost is whatever Bedrock on-demand charges. Setup: hours to days of iteration. Citation: only if the source text is already in the prompt (i.e. paired with retrieval).

2. Retrieval-augmented generation (RAG). Pre-embed the corpus, store embeddings, retrieve-and-inject at query time. AWS building blocks: Bedrock Knowledge Bases (managed: point at S3, configure chunking and embedding model, query via bedrock-agent-runtime:RetrieveAndGenerate), or DIY with a Bedrock embedding model (Titan Text Embeddings v2, Cohere Embed) plus a vector store (OpenSearch Serverless, Aurora PostgreSQL with pgvector, Pinecone, etc.). Inference cost: on-demand Bedrock per token + vector-store running cost. Setup: hours to days, depending on chunking strategy. Citation: native – the retrieved chunks carry their source metadata. A retrieval-only sketch of the DIY path follows this list.

3. Fine-tuning. Adjusting model weights on task-specific training data. Bedrock supports fine-tuning selected models (Nova, Titan, Llama) via bedrock:CreateModelCustomizationJob: upload JSONL training data to S3, configure hyperparameters, start the job. Output is a custom model that requires provisioned throughput to serve. Setup: days to weeks to prepare a training set of hundreds of high-quality examples, plus the training job itself (hours), plus evaluation. Inference cost: provisioned throughput starting at a few dollars per hour per model unit, 1- or 6-month commitments. Citation: none by default; the model produces text without provenance.

4. Continued pre-training. Bedrock’s other customisation path: feed a large corpus of unlabelled domain text (a few gigabytes to tens of gigabytes) and further train the base model on it. Useful for teaching the model a specialised vocabulary or domain (medical, legal, financial) when fine-tuning on labelled pairs isn’t enough. Same cost shape as fine-tuning: provisioned throughput to serve, days of setup. Mentioned for completeness; rarely the correct first answer for a question-answering problem.
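
The retrieval-only sketch promised in item 2 – the DIY path, where you call the Retrieve API yourself and handle prompt construction and generation separately. The knowledge base ID is the example used later in this post; the rest is a hedged boto3 sketch:

import boto3

agent_runtime = boto3.client("bedrock-agent-runtime", region_name="eu-west-1")

# Retrieval only: get the scored chunks and their source locations, then do the
# prompt construction and generation call yourself.
resp = agent_runtime.retrieve(
    knowledgeBaseId="KB-LEGAL-TEMPLATES",  # example knowledge base ID from this post
    retrievalQuery={"text": "standard force majeure language for a French SaaS contract"},
    retrievalConfiguration={
        "vectorSearchConfiguration": {
            "numberOfResults": 5,
            "overrideSearchType": "HYBRID",
        }
    },
)

for result in resp["retrievalResults"]:
    print(result["score"], result["location"]["s3Location"]["uri"])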

The attribute table

| Technique | Corrects hallucination | Adapts format/tone | Handles freshness | Setup cost | Inference cost | Citations |
|---|---|---|---|---|---|---|
| Prompt engineering | ✗ | ✓ | N/A | Low | On-demand | ✗ (unless sources are in the prompt) |
| RAG | ✓ | Partial | ✓ (seconds) | Medium | On-demand + vector store | ✓ |
| Fine-tuning | Partial (style only) | ✓ | ✗ (re-train) | High | Provisioned throughput | ✗ |
| Continued pre-training | Partial (vocabulary) | Partial | ✗ (re-train) | Very high | Provisioned throughput | ✗ |

Reading the table against the legal-ops team’s actual problem: the templates change 50 times a month (fine-tuning fails freshness), the paralegals want citations (fine-tuning doesn’t provide them), and the hallucination is about facts, not style (fine-tuning doesn’t fix facts). RAG is the technique for this problem. Prompt engineering will still be necessary on top of RAG – the retrieved chunks need the correct framing – but it’s not sufficient alone because the model needs the information injected, not just instructed.

When each technique earns its keep

Picking the technique: does the model need knowledge it does not already have?

  • NO – it’s a pure style/format problem. Do you have hundreds of labelled examples of the desired behaviour?
      • NO → prompt engineering: system prompt + few-shot, on-demand Bedrock, iterate on wording.
      • YES → fine-tuning: JSONL training pairs, provisioned throughput – for style, not facts.
  • YES – it’s a knowledge-grounding problem. Does that knowledge change faster than quarterly?
      • YES → RAG: Bedrock Knowledge Bases or a DIY vector store, citations built in.
      • NO → continued pre-training: large unlabelled corpus, provisioned throughput – for stable-vocabulary domains.

All four can combine. RAG is rarely deployed without prompt engineering on top of it. Fine-tuning plus RAG is how some domain-specific chatbots are built: the fine-tune teaches voice and format, retrieval supplies the facts.

Two questions – does the model need new knowledge, and does that knowledge change – partition the techniques. The legal-ops scenario lands on RAG.

The pick in depth

RAG via Bedrock Knowledge Bases, with prompt engineering on top. Bedrock Knowledge Bases is the managed RAG path: point it at an S3 bucket of source documents, configure an embedding model and a vector store, and Bedrock handles chunking, embedding, indexing, and retrieval. At query time, one API call (bedrock-agent-runtime:RetrieveAndGenerate) takes the user’s question, retrieves the most relevant chunks, constructs the prompt, calls the generation model, and returns the answer with source citations.

The configuration surface that matters:

  • Embedding model. The embedding model turns text into a vector (a list of numbers like [0.12, -0.45, ..., 0.08], typically 1024 or 1536 dimensions). Similar meanings produce similar vectors. Bedrock Knowledge Bases supports Titan Text Embeddings v2 (Amazon), Cohere Embed English v3, and Cohere Embed Multilingual v3. For a multi-jurisdiction legal corpus including non-English templates, Cohere Multilingual is the correct default.

  • Chunking strategy. A 50-page template isn’t embedded as one vector; it’s split into chunks – usually a few hundred tokens each – and each chunk gets its own vector. Default chunk size is 300 tokens with 20% overlap. For legal templates where clauses have meaningful boundaries, a semantic chunking strategy (chunks respect paragraph or heading boundaries) often retrieves more cleanly than fixed-size chunks. Bedrock Knowledge Bases supports default, fixed-size, hierarchical, and semantic chunking. A configuration sketch follows this list.

  • Vector store. Where the embeddings live. OpenSearch Serverless is the default (Bedrock can create it for you). Aurora PostgreSQL with pgvector is the alternative if you already run Aurora; Pinecone and Redis Enterprise Cloud are supported third-party options. For 4,000 templates of varying size, OpenSearch Serverless is the lowest-friction choice; Aurora pgvector matters if the legal team already runs metadata in Postgres and wants SQL joins across vector and structured data.

  • Retrieval configuration. How many chunks to retrieve per query (numberOfResults, default 5), and whether to use hybrid search (vector similarity plus keyword matching) versus pure vector. For legal templates where exact clause names matter, hybrid search often retrieves more reliably than pure-vector.

  • Generation model. Separately configurable: Claude Sonnet, Nova Pro, Llama, whichever. The generation model sees the retrieved chunks plus the question and produces the answer. Bedrock’s default prompt template includes the chunks under a $search_results$ placeholder and instructs the model to answer based on them.
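
The chunking sketch promised above: configuring an S3 data source on the knowledge base via boto3. The bucket ARN and names are placeholders; fixed-size chunking is shown because its fields match the defaults described above, with semantic chunking noted in a comment as the alternative:

import boto3

agent = boto3.client("bedrock-agent", region_name="eu-west-1")

data_source = agent.create_data_source(
    knowledgeBaseId="KB-LEGAL-TEMPLATES",  # example knowledge base ID from this post
    name="contract-templates",
    dataSourceConfiguration={
        "type": "S3",
        "s3Configuration": {"bucketArn": "arn:aws:s3:::legal-templates"},  # placeholder bucket
    },
    vectorIngestionConfiguration={
        "chunkingConfiguration": {
            # The defaults described above: 300-token chunks with 20% overlap.
            # "SEMANTIC" (with a semanticChunkingConfiguration) is the
            # boundary-respecting alternative for clause-structured documents.
            "chunkingStrategy": "FIXED_SIZE",
            "fixedSizeChunkingConfiguration": {"maxTokens": 300, "overlapPercentage": 20},
        }
    },
)
print(data_source["dataSource"]["dataSourceId"])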

The prompt-engineering layer still matters. Knowledge Bases lets you override the default prompt template; a custom template for this use case might add instructions like “If the source templates don’t contain the answer, say ‘I don’t have a template matching those criteria’ rather than guessing. Always cite the template name and clause number in the format [Template Name, §Clause Number].” These instructions are why the technique works end-to-end: retrieval feeds the model the correct chunks; the prompt tells the model how to handle missing information without inventing it.
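
A sketch of wiring a custom template into the RetrieveAndGenerate call via boto3 – the template text is illustrative, and $search_results$ is the placeholder Bedrock fills with the retrieved chunks:

import boto3

agent_runtime = boto3.client("bedrock-agent-runtime", region_name="eu-west-1")

template = (
    "You are a legal research assistant. Answer using only these sources:\n"
    "$search_results$\n"
    "If the sources do not contain the answer, say 'I don't have a template "
    "matching those criteria' rather than guessing. Cite every answer as "
    "[Template Name, §Clause Number]."
)

resp = agent_runtime.retrieve_and_generate(
    input={"text": "Which templates require notarisation?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KB-LEGAL-TEMPLATES",
            "modelArn": "arn:aws:bedrock:eu-west-1::foundation-model/anthropic.claude-3-5-sonnet-20241022-v2:0",
            "generationConfiguration": {
                "promptTemplate": {"textPromptTemplate": template},
            },
        },
    },
)
print(resp["output"]["text"])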

Freshness is handled by Bedrock’s ingestion pipeline. A changed template in S3 triggers a re-sync – either on demand via the console or API, or scheduled – that re-embeds only the changed documents and updates the vector store. From template-change to model-knows is minutes, not a training run.
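
Triggering that re-sync from code is a single call; a boto3 sketch, with the data source ID as a placeholder:

import boto3

agent = boto3.client("bedrock-agent", region_name="eu-west-1")

# Re-sync the S3 data source; only new or changed documents are re-chunked and re-embedded.
job = agent.start_ingestion_job(
    knowledgeBaseId="KB-LEGAL-TEMPLATES",   # example knowledge base ID from this post
    dataSourceId="DS-CONTRACT-TEMPLATES",   # placeholder data source ID
)
print(job["ingestionJob"]["status"])        # STARTING, then IN_PROGRESS, then COMPLETE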

A worked query

Priya, a paralegal, has a question. She types it into the team’s internal tool, which calls RetrieveAndGenerate.

$ aws bedrock-agent-runtime retrieve-and-generate \
    --input '{"text": "What is our standard limitation of liability clause for French SaaS agreements, capped at 12 months of fees?"}' \
    --retrieve-and-generate-configuration '{
      "type": "KNOWLEDGE_BASE",
      "knowledgeBaseConfiguration": {
        "knowledgeBaseId": "KB-LEGAL-TEMPLATES",
        "modelArn": "arn:aws:bedrock:eu-west-1::foundation-model/anthropic.claude-3-5-sonnet-20241022-v2:0",
        "retrievalConfiguration": {
          "vectorSearchConfiguration": {
            "numberOfResults": 6,
            "overrideSearchType": "HYBRID"
          }
        }
      }
    }'

{
  "output": {
    "text": "Our standard limitation of liability clause for French SaaS agreements, capped at 12 months of fees, is in SaaS-FR-v4.2 at §14.3. The clause reads: 'Except for breaches of confidentiality or indemnification obligations, each party's aggregate liability under this Agreement shall not exceed the fees paid or payable by Customer to Provider in the twelve (12) months immediately preceding the event giving rise to the claim.' Related carve-outs for gross negligence are in §14.4."
  },
  "citations": [
    {
      "generatedResponsePart": { ... },
      "retrievedReferences": [
        {
          "content": { "text": "..." },
          "location": {
            "type": "S3",
            "s3Location": { "uri": "s3://legal-templates/saas/FR/SaaS-FR-v4.2.docx" }
          }
        }
      ]
    }
  ]
}

What happened:

  1. The retrieve-and-generate call embedded the question with Cohere Multilingual v3.
  2. Bedrock queried the OpenSearch Serverless index with the resulting vector, using hybrid search (vector + keyword for “French SaaS”, “limitation of liability”, “12 months”).
  3. It retrieved the top 6 chunks by relevance. The top hit was §14.3 of SaaS-FR-v4.2.docx; adjacent chunks included §14.4 and the definitions section.
  4. The retrieved chunks were injected into the generation prompt under the $search_results$ placeholder; Claude Sonnet produced the answer, sticking to the retrieved text because the custom prompt template instructs it to.
  5. The citations array on the response links each generated span to the source chunk. Priya’s UI renders a clickable citation: “[SaaS-FR-v4.2, §14.3]” jumps straight to the document in SharePoint.

The round trip is 1-3 seconds. The answer is grounded in actual text. If Priya asks about a jurisdiction the corpus doesn’t cover, the model says so rather than inventing an answer.

Why fine-tuning would have failed

Suppose the team had chosen fine-tuning instead. A training set of (question, answer) pairs would need to be written by hand – realistically, 500-2,000 pairs covering the common question shapes, requiring legal domain expertise and roughly a month of a paralegal’s time. Then the fine-tuning job itself (hours), then provisioned throughput to serve ($2-$20 per hour depending on the model and units, running continuously whether queried or not).
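
The job itself is the cheap part – one API call, sketched below with boto3 (role ARN, bucket paths, base model, and hyperparameters are illustrative placeholders). Everything around it – the hand-written training set, the provisioned endpoint – is where the cost lives:

import boto3

bedrock = boto3.client("bedrock", region_name="eu-west-1")

job = bedrock.create_model_customization_job(
    jobName="legal-templates-ft-001",
    customModelName="legal-templates-assistant",
    roleArn="arn:aws:iam::123456789012:role/BedrockCustomizationRole",   # placeholder
    baseModelIdentifier="amazon.titan-text-express-v1",                  # example base model
    customizationType="FINE_TUNING",     # "CONTINUED_PRE_TRAINING" is the other path
    trainingDataConfig={"s3Uri": "s3://legal-templates-training/training.jsonl"},   # placeholder
    outputDataConfig={"s3Uri": "s3://legal-templates-training/output/"},            # placeholder
    hyperParameters={"epochCount": "2", "batchSize": "1", "learningRate": "0.00001"},
)
print(job["jobArn"])
# The resulting custom model still needs provisioned throughput before it can serve traffic.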

Worse, when a template changes next Tuesday the fine-tuned model doesn’t know. The weights are a snapshot of the training data. Keeping up means retraining – weekly, monthly – each run costing a fresh training bill. And the model wouldn’t cite: the generated text has no provenance back to a specific document.

Fine-tuning is the correct tool when the problem is how the model writes – matching a company’s tone, producing a very specific output format reliably, handling a domain with such specialised vocabulary that the base model can’t parse it correctly. Legal templates have some of that, but the primary problem is grounding, not voice. Solve the primary problem first; fine-tune later if there’s still a style gap after RAG is working.

What’s worth remembering

  1. The three techniques solve three different problems. Prompt engineering changes instructions. RAG adds facts. Fine-tuning changes style and format. Reaching for the wrong one solves nothing.
  2. Hallucination on proprietary data is almost always a retrieval problem, not a training problem. The base model doesn’t have your documents. Hand it the documents at query time; don’t try to bake them into the weights.
  3. Freshness kills fine-tuning for dynamic domains. If the knowledge changes faster than the training cadence, fine-tuned models are stale the moment they land. RAG’s freshness is minutes; fine-tuning’s is the next training run.
  4. Bedrock Knowledge Bases is the managed RAG path. S3 in, vector store out, RetrieveAndGenerate as the single query API. Chunking strategy, embedding model, and retrieval config are the levers worth tuning.
  5. Citations come from retrieval, not generation. RAG’s output carries retrievedReferences pointing to source documents. Fine-tuning produces text without provenance; if citations matter, fine-tuning alone won’t suffice.
  6. Provisioned throughput is the cost shape for fine-tuned models on Bedrock. That’s a commitment of model units for 1 or 6 months, in the thousands of dollars per month. On-demand per-token pricing doesn’t apply to custom models.
  7. Prompt engineering is always part of the answer. Even with perfect retrieval, the model needs instructions: format the citation this way, refuse to answer if the source doesn’t cover it, adopt this tone. Prompt work sits on top of every technique.
  8. Combine where it makes sense. RAG + prompt engineering is the common pair. Fine-tuning + RAG is the domain-chatbot pattern: the fine-tune teaches voice, retrieval supplies facts. Rarely is fine-tuning alone the correct choice for a knowledge-heavy task.

The legal-ops team’s first prototype hallucinated because the model didn’t have the templates. The fix isn’t a smarter prompt or a longer training run – it’s plumbing the templates into the model’s input at query time. RAG, with prompt engineering shaping the output and Bedrock Knowledge Bases doing the retrieval plumbing, is the technique that matches the problem.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.