Prompt, Retrieve, or Fine-Tune

May 06, 2026 · 18 min read

AI Practitioner · AIF-C01 · part of The Exam Room

A legal-ops team wants a tool where paralegals can ask questions about the company’s 4,000 in-house contract templates: “what’s our standard limitation-of-liability clause?”, “which templates require notarisation?”, “has this indemnity language been approved for France?” The first prototype – a plain Claude call with the question in the prompt – invents clause numbers that don’t exist. Someone suggests fine-tuning the model on the templates. Someone else suggests RAG. A third voice wants to try better prompts first. The interesting question isn’t which is better, because they solve different problems – it’s which problem the team actually has.

The situation

The in-house legal team maintains 4,000 contract templates across 12 jurisdictions and 30 contract types (employment, NDA, master services, licensing, etc.). Each template is between 5 and 80 pages. They live in SharePoint today; an S3 bucket is being stood up to mirror them. Templates change: roughly 50 are updated every month when a jurisdiction’s law changes or a clause is renegotiated at the enterprise level.

Paralegals currently find the correct clause by searching SharePoint for keywords and skimming results. The turnaround for “what’s the standard force majeure language for a French SaaS contract?” is 10-15 minutes of human grepping. Legal-ops wants this under 30 seconds with a citation back to the exact template and clause.

A first prototype called Claude with a prompt like “Here’s the question: what’s our standard limitation-of-liability clause? Answer with the full text.” It produced confident, plausible, and wrong answers: clauses that sounded correct but used clause numbers the templates don’t use, jurisdictions the clause doesn’t actually cover, and in one case a citation to a template that doesn’t exist. The team calls this “hallucination” and wants it gone.

Three techniques are on the table. Prompt engineering (rewrite the prompt better). RAG (retrieve relevant template excerpts and include them in the prompt). Fine-tuning (train the model on the template corpus until it knows the language natively). The meeting wants a decision.

What we might want from this

The three techniques are not interchangeable. They solve different problems, and the first mistake is treating them as points on a single “quality” axis.

Prompt engineering is changing the text you send to the model. Better instructions, worked examples in the prompt (few-shot), explicit format requirements, a system prompt that sets the model’s persona. It costs nothing: no infrastructure, no training run, no new data pipeline. It is also the only technique that works if what the model is producing is the wrong shape – too long, too short, wrong tone, missing a required field. A hallucinating model doesn’t need a better prompt alone; it needs information it doesn’t have. Prompt engineering is necessary, always, but rarely sufficient on its own for a knowledge-grounding problem.
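
To make that concrete, here’s a minimal sketch of the prompt-engineering layer using the Bedrock Converse API via boto3. The system prompt, the few-shot example, and the template names in it are illustrative, not the team’s actual wording:

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="eu-west-1")

# System prompt: role, constraints, and an explicit refusal instruction.
system = [{"text": (
    "You are a legal research assistant. Answer only from source text provided "
    "in the conversation. If the sources do not contain the answer, say so. "
    "Cite the template name and section number."
)}]

# Few-shot: one worked example showing the desired answer shape, then the real
# question. The example template names are illustrative placeholders.
messages = [
    {"role": "user", "content": [{"text": "Which templates require notarisation?"}]},
    {"role": "assistant", "content": [{"text": "Notarisation is required by [Employment-DE-v3.1, §9.2] and [Property-ES-v2.0, §4.1]."}]},
    {"role": "user", "content": [{"text": "What is our standard limitation-of-liability clause?"}]},
]

response = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
    system=system,
    messages=messages,
    inferenceConfig={"maxTokens": 1024, "temperature": 0},
)
print(response["output"]["message"]["content"][0]["text"])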

Retrieval-augmented generation is pulling relevant documents into the prompt at query time. The model doesn’t need to know the 4,000 templates; it needs to be handed the correct three when the paralegal asks a question. The architecture is: pre-compute vector embeddings of each template chunk (a chunk being a paragraph, a section, or a page), store them in a vector database, and at query time embed the user’s question, retrieve the top-k most similar chunks, and include them in the model’s prompt along with the question. The model answers from the retrieved text. If the retrieval is good, hallucination drops to near-zero on questions the corpus can answer.
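
A minimal sketch of that loop in boto3, with an in-memory list of pre-embedded chunks standing in for the vector database (the chunk content and model IDs are illustrative; a real system would use OpenSearch, pgvector, or a managed knowledge base):

import json
import math
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="eu-west-1")

def embed(text):
    """Turn text into a vector using Titan Text Embeddings v2."""
    resp = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(resp["body"].read())["embedding"]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Pre-computed offline, one entry per chunk; an in-memory stand-in for the vector store.
clause = ("Except for breaches of confidentiality or indemnification obligations, "
          "each party's aggregate liability shall not exceed the fees paid ...")
chunks = [
    {"source": "SaaS-FR-v4.2.docx §14.3", "text": clause, "vector": embed(clause)},
    # ... thousands more chunks, embedded once and stored ...
]

# At query time: embed the question, rank chunks by similarity, inject the top-k.
question = "What is our standard limitation of liability clause for French SaaS agreements?"
q_vec = embed(question)
top = sorted(chunks, key=lambda c: cosine(q_vec, c["vector"]), reverse=True)[:3]

context = "\n\n".join(f"[{c['source']}]\n{c['text']}" for c in top)
prompt = f"Answer only from these sources:\n\n{context}\n\nQuestion: {question}"
# `prompt` then goes to the generation model, as in the previous sketch.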

Fine-tuning adjusts the weights of the model on a training dataset. For generative models, this usually means providing input-output pairs (“when you see X, produce Y”) and running a training job that nudges the model’s behaviour toward those examples. Fine-tuning teaches style, format, or specialised vocabulary – not facts. A fine-tuned model trained on the templates would learn to sound like a legal template, would learn the vocabulary and cadence of the corpus, but would still not reliably cite a specific clause number. Facts go stale the moment the corpus changes; weights don’t update when a template does.
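
For a sense of what that training data looks like: Bedrock text-model fine-tuning reads JSONL of prompt/completion pairs from S3 (the exact schema varies by base model; conversational models use a messages format instead). The pairs below are illustrative only:

import json

# Hand-written (question, ideal answer) pairs; illustrative content only.
pairs = [
    {"prompt": "What is our standard limitation-of-liability clause for French SaaS agreements?",
     "completion": "SaaS-FR-v4.2, §14.3: 'Except for breaches of confidentiality ...'"},
    {"prompt": "Which templates require notarisation?",
     "completion": "Notarisation is required by ... (template list)."},
]

# One JSON object per line, uploaded to S3 as the training dataset.
with open("training.jsonl", "w") as f:
    for pair in pairs:
        f.write(json.dumps(pair) + "\n")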

The second thing worth thinking about is update frequency. The templates change 50 times a month. Fine-tuning takes hours to days to run and costs money per run; running it weekly to stay current would be expensive and would leave a gap between “clause updated” and “model knows.” RAG updates by re-embedding a changed document – minutes, maybe seconds – and the next query sees the new version immediately. Freshness is a first-class requirement for this domain, and fine-tuning does badly on freshness.

The third is data volume and quality for fine-tuning. Bedrock fine-tuning typically wants hundreds to thousands of high-quality labelled examples for the behaviour you’re trying to teach. For the legal-ops team, that would mean writing hundreds of (question, ideal-answer) pairs by hand – the kind of project that takes months and is done infrequently. By contrast, RAG needs the documents (already have them) and an embedding model (off the shelf). The bar to entry is lower by an order of magnitude.

The fourth is cost shape at inference. Prompt engineering and RAG both run on standard on-demand Bedrock pricing. Fine-tuned Bedrock models require provisioned throughput – a commitment of model units for 1 or 6 months. That is a very different cost profile: RAG at 500 queries a day costs cents; provisioned throughput for a custom model is thousands of dollars a month whether anyone uses it or not. Fine-tuning is appropriate when volume is high and predictable enough to saturate a provisioned endpoint. Not a legal-ops team of 20 paralegals.

The fifth is explainability. When the model cites a specific clause, the paralegal wants to click through to the source template. RAG gives that for free – the retrieved chunks are the citation. Fine-tuning erases the provenance: the model emits text that came from somewhere in training, but “somewhere” isn’t a clickable link.

The attributes that matter

Six filters, applied to each of the three techniques.

  1. Corrects hallucination on proprietary data – does the technique give the model access to the actual templates?
  2. Adapts format and tone – does the technique change how the model writes (length, structure, register)?
  3. Handles data freshness – when a template updates, how fast does the system reflect it?
  4. Setup cost – time and data-labelling effort to get to first working version?
  5. Inference cost shape – on-demand per token, provisioned, or something else?
  6. Provides citations – can the paralegal trace the answer to a specific source?

The three-technique landscape

1. Prompt engineering. Refining the text sent to the model. For a question-answering task: system prompt that sets role and constraints (“You are a legal research assistant. Only answer based on provided source text. If the source doesn’t contain the answer, say so.”); few-shot examples showing the desired question-answer shape; instructions on format (“cite the template name and section number”). No new AWS services; the work is in the application code. Inference cost is whatever Bedrock on-demand charges. Setup: hours to days of iteration. Citation: only if the source text is already in the prompt (i.e. paired with retrieval).

2. Retrieval-augmented generation (RAG). Pre-embed the corpus, store embeddings, retrieve-and-inject at query time. AWS building blocks: Bedrock Knowledge Bases (managed: point at S3, configure chunking and embedding model, query via bedrock-agent-runtime:RetrieveAndGenerate), or DIY with a Bedrock embedding model (Titan Text Embeddings v2, Cohere Embed) plus a vector store (OpenSearch Serverless, Aurora PostgreSQL with pgvector, Pinecone, etc.). Inference cost: on-demand Bedrock per token + vector-store running cost. Setup: hours to days, depending on chunking strategy. Citation: native – the retrieved chunks carry their source metadata. A retrieval-only sketch of the DIY path follows this list.

3. Fine-tuning. Adjusting model weights on task-specific training data. Bedrock supports fine-tuning selected models (Nova, Titan, Llama) via bedrock:CreateModelCustomizationJob: upload JSONL training data to S3, configure hyperparameters, start the job. Output is a custom model that requires provisioned throughput to serve. Setup: days to weeks to prepare a training set of hundreds of high-quality examples, plus the training job itself (hours), plus evaluation. Inference cost: provisioned throughput starting at a few dollars per hour per model unit, 1- or 6-month commitments. Citation: none by default; the model produces text without provenance.

4. Continued pre-training. Bedrock’s other customisation path: feed a large corpus of unlabelled domain text (a few gigabytes to tens of gigabytes) and further train the base model on it. Useful for teaching the model a specialised vocabulary or domain (medical, legal, financial) when fine-tuning on labelled pairs isn’t enough. Same cost shape as fine-tuning: provisioned throughput to serve, days of setup. Mentioned for completeness; rarely the correct first answer for a question-answering problem.
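
The retrieval-only sketch promised in item 2 – the DIY path, where you call the Retrieve API yourself and handle prompt construction and generation separately. The knowledge base ID is the example used later in this post; the rest is a hedged boto3 sketch:

import boto3

agent_runtime = boto3.client("bedrock-agent-runtime", region_name="eu-west-1")

# Retrieval only: get the scored chunks and their source locations, then do the
# prompt construction and generation call yourself.
resp = agent_runtime.retrieve(
    knowledgeBaseId="KB-LEGAL-TEMPLATES",  # example knowledge base ID from this post
    retrievalQuery={"text": "standard force majeure language for a French SaaS contract"},
    retrievalConfiguration={
        "vectorSearchConfiguration": {
            "numberOfResults": 5,
            "overrideSearchType": "HYBRID",
        }
    },
)

for result in resp["retrievalResults"]:
    print(result["score"], result["location"]["s3Location"]["uri"])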

The attribute table

| Technique | Corrects hallucination | Adapts format/tone | Handles freshness | Setup cost | Inference cost | Citations |
|---|---|---|---|---|---|---|
| Prompt engineering | ✗ | ✓ | N/A | Low | On-demand | ✗ (unless sources are in the prompt) |
| RAG | ✓ | Partial | ✓ (seconds) | Medium | On-demand + vector store | ✓ |
| Fine-tuning | Partial (style only) | ✓ | ✗ (re-train) | High | Provisioned throughput | ✗ |
| Continued pre-training | Partial (vocabulary) | Partial | ✗ (re-train) | Very high | Provisioned throughput | ✗ |

Reading the table against the legal-ops team’s actual problem: the templates change 50 times a month (fine-tuning fails freshness), the paralegals want citations (fine-tuning doesn’t provide them), and the hallucination is about facts, not style (fine-tuning doesn’t fix facts). RAG is the technique for this problem. Prompt engineering will still be necessary on top of RAG – the retrieved chunks need the correct framing – but it’s not sufficient alone because the model needs the information injected, not just instructed.

When each technique earns its keep

Picking the technique: does the model need knowledge it does not already have?

  • NO – it’s a pure style/format problem. Do you have hundreds of labelled examples of the desired behaviour?
      • NO → prompt engineering: system prompt + few-shot, on-demand Bedrock, iterate on wording.
      • YES → fine-tuning: JSONL training pairs, provisioned throughput – for style, not facts.
  • YES – it’s a knowledge-grounding problem. Does that knowledge change faster than quarterly?
      • YES → RAG: Bedrock Knowledge Bases or a DIY vector store, citations built in.
      • NO → continued pre-training: large unlabelled corpus, provisioned throughput – for stable-vocabulary domains.

All four can combine. RAG is rarely deployed without prompt engineering on top of it. Fine-tuning plus RAG is how some domain-specific chatbots are built: the fine-tune teaches voice and format, retrieval supplies the facts.

Two questions – does the model need new knowledge, and does that knowledge change – partition the techniques. The legal-ops scenario lands on RAG.

The pick in depth

RAG via Bedrock Knowledge Bases, with prompt engineering on top. Bedrock Knowledge Bases is the managed RAG path: point it at an S3 bucket of source documents, configure an embedding model and a vector store, and Bedrock handles chunking, embedding, indexing, and retrieval. At query time, one API call (bedrock-agent-runtime:RetrieveAndGenerate) takes the user’s question, retrieves the most relevant chunks, constructs the prompt, calls the generation model, and returns the answer with source citations.

The configuration surface that matters:

  • Embedding model. The embedding model turns text into a vector (a list of numbers like [0.12, -0.45, ..., 0.08], typically 1024 or 1536 dimensions). Similar meanings produce similar vectors. Bedrock Knowledge Bases supports Titan Text Embeddings v2 (Amazon), Cohere Embed English v3, and Cohere Embed Multilingual v3. For a multi-jurisdiction legal corpus including non-English templates, Cohere Multilingual is the correct default.

  • Chunking strategy. A 50-page template isn’t embedded as one vector; it’s split into chunks – usually a few hundred tokens each – and each chunk gets its own vector. Default chunk size is 300 tokens with 20% overlap. For legal templates where clauses have meaningful boundaries, a semantic chunking strategy (chunks respect paragraph or heading boundaries) often retrieves more cleanly than fixed-size chunks. Bedrock Knowledge Bases supports default, fixed-size, hierarchical, and semantic chunking. A configuration sketch follows this list.

  • Vector store. Where the embeddings live. OpenSearch Serverless is the default (Bedrock can create it for you). Aurora PostgreSQL with pgvector is the alternative if you already run Aurora; Pinecone and Redis Enterprise Cloud are supported third-party options. For 4,000 templates of varying size, OpenSearch Serverless is the lowest-friction choice; Aurora pgvector matters if the legal team already runs metadata in Postgres and wants SQL joins across vector and structured data.

  • Retrieval configuration. How many chunks to retrieve per query (numberOfResults, default 5), and whether to use hybrid search (vector similarity plus keyword matching) versus pure vector. For legal templates where exact clause names matter, hybrid search often retrieves more reliably than pure-vector.

  • Generation model. Separately configurable: Claude Sonnet, Nova Pro, Llama, whichever. The generation model sees the retrieved chunks plus the question and produces the answer. Bedrock’s default prompt template includes the chunks under a $search_results$ placeholder and instructs the model to answer based on them.
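
The chunking sketch promised above: configuring an S3 data source on the knowledge base via boto3. The bucket ARN and names are placeholders; fixed-size chunking is shown because its fields match the defaults described above, with semantic chunking noted in a comment as the alternative:

import boto3

agent = boto3.client("bedrock-agent", region_name="eu-west-1")

data_source = agent.create_data_source(
    knowledgeBaseId="KB-LEGAL-TEMPLATES",  # example knowledge base ID from this post
    name="contract-templates",
    dataSourceConfiguration={
        "type": "S3",
        "s3Configuration": {"bucketArn": "arn:aws:s3:::legal-templates"},  # placeholder bucket
    },
    vectorIngestionConfiguration={
        "chunkingConfiguration": {
            # The defaults described above: 300-token chunks with 20% overlap.
            # "SEMANTIC" (with a semanticChunkingConfiguration) is the
            # boundary-respecting alternative for clause-structured documents.
            "chunkingStrategy": "FIXED_SIZE",
            "fixedSizeChunkingConfiguration": {"maxTokens": 300, "overlapPercentage": 20},
        }
    },
)
print(data_source["dataSource"]["dataSourceId"])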

The prompt-engineering layer still matters. Knowledge Bases lets you override the default prompt template; a custom template for this use case might add instructions like “If the source templates don’t contain the answer, say ‘I don’t have a template matching those criteria’ rather than guessing. Always cite the template name and clause number in the format [Template Name, §Clause Number].” These instructions are why the technique works end-to-end: retrieval feeds the model the correct chunks; the prompt tells the model how to handle missing information without inventing it.
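
A sketch of wiring a custom template into the RetrieveAndGenerate call via boto3 – the template text is illustrative, and $search_results$ is the placeholder Bedrock fills with the retrieved chunks:

import boto3

agent_runtime = boto3.client("bedrock-agent-runtime", region_name="eu-west-1")

template = (
    "You are a legal research assistant. Answer using only these sources:\n"
    "$search_results$\n"
    "If the sources do not contain the answer, say 'I don't have a template "
    "matching those criteria' rather than guessing. Cite every answer as "
    "[Template Name, §Clause Number]."
)

resp = agent_runtime.retrieve_and_generate(
    input={"text": "Which templates require notarisation?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KB-LEGAL-TEMPLATES",
            "modelArn": "arn:aws:bedrock:eu-west-1::foundation-model/anthropic.claude-3-5-sonnet-20241022-v2:0",
            "generationConfiguration": {
                "promptTemplate": {"textPromptTemplate": template},
            },
        },
    },
)
print(resp["output"]["text"])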

Freshness is handled by Bedrock’s ingestion pipeline. A changed template in S3 triggers a re-sync – either on demand via the console or API, or scheduled – that re-embeds only the changed documents and updates the vector store. From template-change to model-knows is minutes, not a training run.
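
Triggering that re-sync from code is a single call; a boto3 sketch, with the data source ID as a placeholder:

import boto3

agent = boto3.client("bedrock-agent", region_name="eu-west-1")

# Re-sync the S3 data source; only new or changed documents are re-chunked and re-embedded.
job = agent.start_ingestion_job(
    knowledgeBaseId="KB-LEGAL-TEMPLATES",   # example knowledge base ID from this post
    dataSourceId="DS-CONTRACT-TEMPLATES",   # placeholder data source ID
)
print(job["ingestionJob"]["status"])        # STARTING, then IN_PROGRESS, then COMPLETE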

A worked query

Priya, a paralegal, has a question. She types it into the team’s internal tool, which calls RetrieveAndGenerate.

$ aws bedrock-agent-runtime retrieve-and-generate \
    --input '{"text": "What is our standard limitation of liability clause for French SaaS agreements, capped at 12 months of fees?"}' \
    --retrieve-and-generate-configuration '{
      "type": "KNOWLEDGE_BASE",
      "knowledgeBaseConfiguration": {
        "knowledgeBaseId": "KB-LEGAL-TEMPLATES",
        "modelArn": "arn:aws:bedrock:eu-west-1::foundation-model/anthropic.claude-3-5-sonnet-20241022-v2:0",
        "retrievalConfiguration": {
          "vectorSearchConfiguration": {
            "numberOfResults": 6,
            "overrideSearchType": "HYBRID"
          }
        }
      }
    }'

{
  "output": {
    "text": "Our standard limitation of liability clause for French SaaS agreements, capped at 12 months of fees, is in SaaS-FR-v4.2 at §14.3. The clause reads: 'Except for breaches of confidentiality or indemnification obligations, each party's aggregate liability under this Agreement shall not exceed the fees paid or payable by Customer to Provider in the twelve (12) months immediately preceding the event giving rise to the claim.' Related carve-outs for gross negligence are in §14.4."
  },
  "citations": [
    {
      "generatedResponsePart": { ... },
      "retrievedReferences": [
        {
          "content": { "text": "..." },
          "location": {
            "type": "S3",
            "s3Location": { "uri": "s3://legal-templates/saas/FR/SaaS-FR-v4.2.docx" }
          }
        }
      ]
    }
  ]
}

What happened:

  1. The retrieve-and-generate call embedded the question with Cohere Multilingual v3.
  2. Bedrock queried the OpenSearch Serverless index with the resulting vector, using hybrid search (vector + keyword for “French SaaS”, “limitation of liability”, “12 months”).
  3. It retrieved the top 6 chunks by relevance. The top hit was §14.3 of SaaS-FR-v4.2.docx; adjacent chunks included §14.4 and the definitions section.
  4. The retrieved chunks were injected into the generation prompt under the $search_results$ placeholder; Claude Sonnet produced the answer, sticking to the retrieved text because the custom prompt template instructs it to.
  5. The citations array on the response links each generated span to the source chunk. Priya’s UI renders a clickable citation: “[SaaS-FR-v4.2, §14.3]” jumps straight to the document in SharePoint.

The round trip is 1-3 seconds. The answer is grounded in actual text. If Priya asks about a jurisdiction the corpus doesn’t cover, the model says so rather than inventing an answer.

Why fine-tuning would have failed

Suppose the team had chosen fine-tuning instead. A training set of (question, answer) pairs would need to be written by hand – realistically, 500-2,000 pairs covering the common question shapes, requiring legal domain expertise and roughly a month of a paralegal’s time. Then the fine-tuning job itself (hours), then provisioned throughput to serve ($2-$20 per hour depending on the model and units, running continuously whether queried or not).
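
The job itself is the cheap part – one API call, sketched below with boto3 (role ARN, bucket paths, base model, and hyperparameters are illustrative placeholders). Everything around it – the hand-written training set, the provisioned endpoint – is where the cost lives:

import boto3

bedrock = boto3.client("bedrock", region_name="eu-west-1")

job = bedrock.create_model_customization_job(
    jobName="legal-templates-ft-001",
    customModelName="legal-templates-assistant",
    roleArn="arn:aws:iam::123456789012:role/BedrockCustomizationRole",   # placeholder
    baseModelIdentifier="amazon.titan-text-express-v1",                  # example base model
    customizationType="FINE_TUNING",     # "CONTINUED_PRE_TRAINING" is the other path
    trainingDataConfig={"s3Uri": "s3://legal-templates-training/training.jsonl"},   # placeholder
    outputDataConfig={"s3Uri": "s3://legal-templates-training/output/"},            # placeholder
    hyperParameters={"epochCount": "2", "batchSize": "1", "learningRate": "0.00001"},
)
print(job["jobArn"])
# The resulting custom model still needs provisioned throughput before it can serve traffic.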

Worse, when a template changes next Tuesday the fine-tuned model doesn’t know. The weights are a snapshot of the training data. Keeping up means retraining – weekly, monthly – each run costing a fresh training bill. And the model wouldn’t cite: the generated text has no provenance back to a specific document.

Fine-tuning is the correct tool when the problem is how the model writes – matching a company’s tone, producing a very specific output format reliably, handling a domain with such specialised vocabulary that the base model can’t parse it correctly. Legal templates have some of that, but the primary problem is grounding, not voice. Solve the primary problem first; fine-tune later if there’s still a style gap after RAG is working.

What’s worth remembering

  1. The three techniques solve three different problems. Prompt engineering changes instructions. RAG adds facts. Fine-tuning changes style and format. Reaching for the wrong one solves nothing.
  2. Hallucination on proprietary data is almost always a retrieval problem, not a training problem. The base model doesn’t have your documents. Hand it the documents at query time; don’t try to bake them into the weights.
  3. Freshness kills fine-tuning for dynamic domains. If the knowledge changes faster than the training cadence, fine-tuned models are stale the moment they land. RAG’s freshness is minutes; fine-tuning’s is the next training run.
  4. Bedrock Knowledge Bases is the managed RAG path. S3 in, vector store out, RetrieveAndGenerate as the single query API. Chunking strategy, embedding model, and retrieval config are the levers worth tuning.
  5. Citations come from retrieval, not generation. RAG’s output carries retrievedReferences pointing to source documents. Fine-tuning produces text without provenance; if citations matter, fine-tuning alone won’t suffice.
  6. Provisioned throughput is the cost shape for fine-tuned models on Bedrock. That’s a commitment of model units for 1 or 6 months, in the thousands of dollars per month. On-demand per-token pricing doesn’t apply to custom models.
  7. Prompt engineering is always part of the answer. Even with perfect retrieval, the model needs instructions: format the citation this way, refuse to answer if the source doesn’t cover it, adopt this tone. Prompt work sits on top of every technique.
  8. Combine where it makes sense. RAG + prompt engineering is the common pair. Fine-tuning + RAG is the domain-chatbot pattern: the fine-tune teaches voice, retrieval supplies facts. Rarely is fine-tuning alone the correct choice for a knowledge-heavy task.

The legal-ops team’s first prototype hallucinated because the model didn’t have the templates. The fix isn’t a smarter prompt or a longer training run – it’s plumbing the templates into the model’s input at query time. RAG, with prompt engineering shaping the output and Bedrock Knowledge Bases doing the retrieval plumbing, is the technique that matches the problem.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.