Choosing Between RAG, Fine-Tuning, and Prompting

April 03, 2028 · 16 min read

ML Engineer · MLA-C01 · part of The Exam Room

The situation

A B2B software company wants to build an internal chatbot for its customer-success team. The bot should answer questions like “does version 4.3 still support the legacy SFTP connector?” and “what’s the recommended timeout for the webhooks API under heavy load?”

The available knowledge lives in four places:

  • Product documentation: ~8,000 pages in a Hugo-built static site, updated weekly.
  • Changelogs: ~200 release notes per year, written in markdown.
  • Support tickets: ~140,000 resolved tickets from the last decade in Zendesk, averaging 12 messages each.
  • Engineering wiki: ~3,000 Confluence pages; the bit that actually gets maintained is a ~200-page subset.

The base model, let’s say Anthropic Claude via Bedrock, or one of the Titan or Llama options, has world knowledge but no company-specific knowledge. The team has asked “how do we make the model know our stuff?” and the answer turns out to be three levers, each solving a different flavour of the problem.

What actually matters

Before reaching for a technique, it helps to distinguish the three shapes the question can take.

Shape one: the knowledge is factual and the source is a document. “Does version 4.3 support the SFTP connector?” is the kind of question where there’s a correct answer in the changelog. The model doesn’t need to internalise the knowledge; it needs to look it up and reason over what it finds. The canonical technique here is retrieval-augmented generation (RAG): at query time, an EmbeddingA fixed-length vector of floats that represents a piece of text (or image, or other thing) in a space where similar meanings sit close together. search finds the most relevant document chunks, those chunks are stuffed into the prompt as context, and the model answers from what it sees.

RAG has three big virtues:

  • Freshness: update the documents, re-embed, and the bot’s answers change the same day. No retraining.
  • Provenance: the chunks that went into the answer are there, so citing them and letting the user verify is straightforward.
  • Cost: no training compute. Operating cost is embedding storage + per-query retrieval + per-query LLMA neural network trained to predict the next token in a sequence, large enough that it generalises to tasks it wasn’t explicitly trained for. InferenceRunning a trained model to produce output – as opposed to training it. .

RAG has two big limits:

  • Latency includes both the retrieval step and the LLM call. Retrieval from a well-tuned VectorAn ordered list of numbers – in AI usage, almost always an embedding – and by extension the databases that index them for nearest-neighbour search. is typically 50-200ms; the LLM call dominates.
  • Quality depends on retrieval. If the embedding search misses the correct chunk, the model will confidently answer from whatever it did find, which may be wrong or tangential.

Shape two: the knowledge is a style, tone, or behaviour. If the chatbot should answer in the voice of the support team, terse, specific, ending with a “let me know if that didn’t cover it”, no amount of retrieved context will produce that reliably. The model needs to become a thing that answers that way. The technique here is fine-tuning: take a base model, train it further on a curated dataset that demonstrates the desired style on relevant inputs, and the model’s output distribution shifts towards that style.

Fine-tuning has three big virtues:

  • Internalisation: the model acts like it knows, without having to be shown. Style, format, vocabulary, reasoning patterns all shift.
  • No per-query retrieval overhead: latency is just the LLM call.
  • Can be combined with RAG: fine-tune for style, retrieve for facts.

Fine-tuning has four big limits:

  • Cost: requires a training dataset (curation is the expensive bit, not compute), training compute, and either a hosted or self-hosted fine-tuned model.
  • Freshness: knowledge baked in at fine-tune time. Weekly doc updates don’t propagate without a retrain.
  • Catastrophic forgetting: the model may lose capabilities it had before fine-tuning if the dataset is narrow.
  • Bedrock custom models, or SageMaker JumpStart fine-tuning: not every provider’s base model can be fine-tuned cheaply or at all. The choice is constrained.

Shape three: the knowledge is a constraint, a format, or a role. “Only answer from the internal docs; if you don’t know, say so” is a behavioural instruction, not a fact or a style. The technique is prompt engineering: put the instruction in the System promptThe instruction block that frames the model’s behaviour for a session, separate from the user’s messages. , include examples in the user prompt, structure the output with templates. No training, no retrieval, just careful words.

Prompt engineering has three big virtues:

  • Trivially cheap: no training, no infrastructure. Edit a prompt, ship.
  • Composable: the prompt can include retrieved chunks (RAG) and still instruct about format (prompt engineering) and call a fine-tuned model (fine-tuning) all at once.
  • Fast iteration: change the prompt, re-run.

Prompt engineering has one big limit:

  • Everything has to fit in the Context windowThe maximum number of tokens an LLM can attend to in a single call – prompt plus output combined. . A 200k-token model is generous; a 4k-token model is restrictive. And the longer the prompt, the higher the per-query cost.

The three are complementary, not exclusive. The work is figuring out what combination fits each chatbot scenario.

What we’ll filter on

Five filters for picking between RAG, fine-tuning, and prompt engineering:

  1. Knowledge type, factual and document-sourced, stylistic, or behavioural?
  2. Freshness, how often does the knowledge change, and does the bot need to reflect that?
  3. Latency budget, is the per-query retrieval overhead acceptable?
  4. Cost shape, training cost, embedding cost, per-query cost?
  5. Provenance / citability, does the user need to see the source?

The three-lever landscape

1. Prompt engineering alone. System prompt + user prompt, no retrieval, no fine-tune. The baseline. Works when the knowledge fits in the prompt and the base model is capable. For most domains, insufficient; for some narrow ones (style or role-only), enough.

2. Retrieval-augmented generation (RAG). Embed the knowledge into a vector store at ingestion time; at query time, search the vector store for relevant chunks, inject them into the prompt, call the LLM. The canonical pattern for “the bot needs to answer from our docs.” On AWS, the quick-start is Bedrock Knowledge Bases (managed ingestion + retrieval on OpenSearch / Aurora / Pinecone / Redis), with Bedrock Agents as an optional orchestration layer. Build-your-own uses OpenSearch or Aurora pgvector or Kendra with custom ingestion.

3. Fine-tuning (full or LoRA/QLoRA). Further-train a base model on a curated dataset of (input, expected output) pairs. On Bedrock, supported for a subset of base models via Bedrock Custom Models (LoRA-style parameter-efficient fine-tuning); on SageMaker, supported broadly via JumpStart or custom training jobs. Output is a new model artefact, served by Bedrock (if custom-trained there) or on a SageMaker endpoint. Used for style, format, behaviour adaptation; less useful for “the latest docs” (which change faster than fine-tune cycles).

4. Continued pre-training. A heavier version of fine-tuning: rather than tuning on (input, output) pairs, continue the base model’s pre-training objective on large unstructured domain text. Useful when the domain’s vocabulary and concepts differ enough from the base model’s training distribution that the model needs broader re-exposure. Bedrock supports this for some models (Titan); SageMaker supports it via custom training. Expensive (hundreds to thousands of GPU-hours) and rarely the first reach.

5. Hybrid (RAG + fine-tuning + prompt). The production pattern. Fine-tune for style and behaviour; RAG for facts; prompt for structure and constraints. Each lever covers what it’s good at; the combination is robust.

Side by side

Technique Knowledge type Freshness Latency impact Cost shape Provenance
Prompt engineering Behavioural, roles, constraints n/a None Prompt TokenThe unit of text an LLM actually sees – usually a short character sequence, not a whole word.  
</span> per call Only what’s in prompt        
RAG Factual, document-sourced Minutes to hours +50-200ms retrieval Per-call embedding + vector search + LLM ✓ retrieved chunks
Fine-tuning Style, format, internal knowledge Weeks to months None Training once + hosting Less direct
Continued pre-training Vocabulary, concepts Months None Heavy training + hosting Less direct
Hybrid Everything Minutes (for facts) +retrieval All of the above

Reading the table against the B2B support bot:

  • Product docs and changelogs: factual, frequently updated, need citations. RAG is the fit. The bot cites the specific doc and version, so the human can verify.
  • Support-ticket voice and structure: stylistic, stable. Fine-tuning on a curated subset of resolved tickets produces a bot that answers in the team’s actual voice. Not frequently updated; a once-a-quarter fine-tune is plenty.
  • “Only answer from internal sources; if you don’t know, say so”: behavioural. Prompt engineering owns it.

The hybrid is the production pick: a fine-tuned Claude or Llama on Bedrock as the base, Bedrock Knowledge Bases for RAG over docs + changelogs, and a carefully crafted system prompt enforcing the behaviour.

The three levers in one pipeline

Query path through the three levers User "Does 4.3 still support SFTP?" 1. Prompt engineering system: "You are an internal support assistant. Answer from provided docs. If no doc confirms, say so. Cite the doc title and version." user: <question> + <retrieved chunks, formatted> edits: git diff on a text file; no training, no infrastructure change 2. Retrieval (RAG) embed question → vector store → top-K chunks Bedrock Knowledge Base over: product docs, changelogs re-embed nightly after doc updates latency: ~100ms; each chunk carries doc URL + version for citation 3. Fine-tuned base model Bedrock Custom Model: base + LoRA adapter trained on 5,000 curated (question, resolved-ticket-reply) pairs quarterly retrain cadence effect: output in support-team voice without per-query instruction Vector store OpenSearch / Aurora pgvector ~8k doc chunks + 200 changelog re-embed on doc update top-K with rerank Fine-tuned LLM call all three levers together "As of 4.3, SFTP connector is deprecated but still supported. [Changelog 4.3]"
The three levers compose cleanly in a single request path: prompt engineering frames the task, retrieval pulls in the current facts, and fine-tuning shapes the voice the answer comes out in.

The pick in depth

Start with RAG alone. The fastest path to a demonstrably useful bot is RAG over product docs and changelogs, with a carefully engineered system prompt, on an unmodified base model. Bedrock Knowledge Bases gets a team from “raw docs” to “working chatbot” in a few days:

# Rough shape via Bedrock SDK
import boto3

bedrock = boto3.client("bedrock-agent")
kb = bedrock.create_knowledge_base(
    name="support-kb",
    roleArn="arn:aws:iam::111122223333:role/kb-role",
    knowledgeBaseConfiguration={
        "type": "VECTOR",
        "vectorKnowledgeBaseConfiguration": {
            "embeddingModelArn": "arn:aws:bedrock:eu-west-1::foundation-model/amazon.titan-embed-text-v2:0",
        },
    },
    storageConfiguration={
        "type": "OPENSEARCH_SERVERLESS",
        "opensearchServerlessConfiguration": {
            "collectionArn": "...",
            "vectorIndexName": "support-index",
            "fieldMapping": {"vectorField": "vector", "textField": "text", "metadataField": "meta"},
        },
    },
)
bedrock.create_data_source(
    knowledgeBaseId=kb["knowledgeBase"]["knowledgeBaseId"],
    name="product-docs",
    dataSourceConfiguration={
        "type": "S3",
        "s3Configuration": {"bucketArn": "arn:aws:s3:::company-docs"},
    },
)

Query-side:

runtime = boto3.client("bedrock-agent-runtime")
resp = runtime.retrieve_and_generate(
    input={"text": "Does version 4.3 still support the legacy SFTP connector?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": kb["knowledgeBase"]["knowledgeBaseId"],
            "modelArn": "arn:aws:bedrock:eu-west-1::foundation-model/anthropic.claude-sonnet-4-5-20250929-v1:0",
        },
    },
)

The response includes both the generated answer and the citations, the specific chunks retrieved, with their source URLs. No training, nightly re-embedding of updated docs.

Add a fine-tune when style matters. After six weeks in production, the team notices the answers, while correct, don’t sound like the support team. The support team is terse; the bot is chatty. The support team ends replies with “ping us back if that didn’t cover it”; the bot says “I hope this helps!”

Curate 5,000 (customer-question, team-member-reply) pairs from the last two years of resolved Zendesk tickets (filter to resolved tickets where the customer’s follow-up was a confirmation, not a re-ask, as a rough quality signal). Run a Bedrock Custom Model fine-tune on a fine-tunable base (e.g., Claude Haiku 4.5, Titan, or Llama 4) using the curated dataset:

bedrock = boto3.client("bedrock")
job = bedrock.create_model_customization_job(
    jobName="support-voice-q3",
    customModelName="support-voice-v3",
    roleArn="arn:aws:iam::...:role/bedrock-customization",
    baseModelIdentifier="anthropic.claude-haiku-4-5-20251001-v1:0",
    trainingDataConfig={"s3Uri": "s3://bucket/training/q3.jsonl"},
    hyperParameters={"epochCount": "3", "learningRate": "0.00001"},
    customizationType="FINE_TUNING",
)

Output: a custom model ARN; point retrieveAndGenerate at it instead of the base model. The bot’s voice shifts. Same RAG, same prompt, different model.

Iterate the prompt continuously. Prompt changes land daily. System prompt evolves: “Answer in 2-3 sentences unless the question is about configuration, in which case include a code example. If the retrieved docs disagree with each other, highlight the disagreement and cite both.”

When continued pre-training would be worth it. If the company’s domain has heavy jargon that the base model’s training corpus lacked, financial derivatives, legal discovery, a niche industry with specific vocabulary, the base model may struggle to understand queries in that vocabulary, not just generate in it. Continued pre-training on a corpus of unstructured domain text shifts the model’s language distribution, not just its output style. Rarely needed for general B2B SaaS support; occasionally needed for very specialised domains.

A worked evolution

The support bot evolves over nine months:

  • Week 1: RAG-only MVP. Bedrock Knowledge Base over product docs. Claude Sonnet 4.5 as base. System prompt: “Answer from provided docs; cite versions.” Works well for factual queries; awkward voice.
  • Week 2-6: Prompt iteration. System prompt grows from 100 to 400 words. Includes format examples, edge cases, fallback language. Citation format tightened. Still awkward voice.
  • Month 3: First fine-tune. 5,000 ticket-pair dataset. Fine-tuned Claude Haiku 4.5 as the base for the bot. Voice shift is dramatic; cost per query drops (Haiku is cheaper than Sonnet); latency drops.
  • Month 4-6: Expand RAG. Add engineering-wiki subset to the Knowledge Base. Add a separate KB for release notes with a release-date-aware retrieval filter. Tune chunk size from 500 to 800 tokens after evaluation.
  • Month 7: Evaluation harness. 200 curated test questions with expected answers. Run weekly; flag regressions. Reveals that the fine-tune occasionally over-hedges, dataset-quality iteration follows.
  • Month 9: Continued pre-training on a corpus of the company’s own technical forum posts. Marginal improvement on domain-jargon recognition; probably not worth the cost. Dropped after review.

The hybrid isn’t a monolith; it’s a pipeline where each lever can be tuned independently. The team’s instrumentation lets them attribute regressions to the correct layer.

What’s worth remembering

  1. Three levers, not one. RAG for facts, fine-tuning for style, prompt engineering for behaviour. The correct answer is usually a combination.
  2. RAG is for knowledge that changes. Update documents, re-embed, the bot knows. No retraining cycle.
  3. Fine-tuning is for style and behaviour. Takes a curated (input, output) dataset; training cost once, hosting cost ongoing. Doesn’t help with fresh facts.
  4. Prompt engineering is free, fast, and composable. Change the system prompt, ship. Always compatible with the other two.
  5. Bedrock Knowledge Bases is the RAG quick-start on AWS. Managed ingestion from S3, managed vector storage on OpenSearch Serverless / Aurora / Pinecone / Redis, managed retrieval-and-generate API.
  6. Bedrock Custom Models does the fine-tune on Bedrock. LoRA-style parameter-efficient tuning, supported on a subset of base models. SageMaker JumpStart and SageMaker training do the broader set.
  7. Continued pre-training is rare. Only needed when the base model’s vocabulary/concepts genuinely lack the domain; heavier cost, harder to justify.
  8. Evaluation is the missing piece. Without a test set of “this question should get roughly this answer,” you can’t tell if a prompt change or a fine-tune regressed anything. Build the evaluation before the bot goes to production.

“How do we make the model know our stuff?” decomposes into three shapes of knowledge and three techniques that solve them. Picking the correct one, or the correct combination, is less about model quality and more about knowing what kind of knowledge each piece of the domain is.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.