Building RAG When the Source Documents Change Daily

June 22, 2026 · 16 min read

Generative AI Developer · AIP-C01 · part of The Exam Room

The situation

A product team wants a support assistant that answers customer questions from three bodies of knowledge. The product manual is a 400-page PDF that engineering edits weekly. The pricing sheet is a set of Markdown files in a Git repo that finance updates at month-end. The operations runbook is a Confluence space that on-call engineers amend throughout the day, sometimes hourly.

Today there’s no assistant at all: customers raise tickets and humans search the three sources by hand. Leadership wants a first version in front of customers in six weeks, accurate enough that wrong answers are rare and caught fast, and maintainable enough that two engineers can keep it running alongside their other work. The base Bedrock model (Claude, Nova, whichever) doesn’t know any of the three sources; its training cut-off is long past the last pricing change and it has never seen the runbook.

Fine-tuning is off the table for a reason worth naming: the content changes faster than any training pipeline could keep up. Retraining weekly for the manual, monthly for pricing, hourly for the runbook is a job, not a project.

What actually matters

Before reaching for an architecture, it’s worth asking what we’re actually trading.

The core idea behind retrieval-augmented generation is that the model doesn’t have to know the answer; it has to be given the answer at inference time. We turn the three knowledge sources into a searchable corpus, retrieve the relevant passages when a question arrives, stuff those passages into the prompt, and let the model compose the answer. The base model’s job is comprehension and writing; the retrieval system’s job is finding the right passages.
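To make that division of labour concrete, here is a minimal sketch of the retrieve-then-prompt loop. It is a toy, not how Bedrock does it: the `Passage` type and both function names are made up for illustration, and word overlap stands in for the embedding search a real retriever would use.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    source: str  # e.g. "pricing", "manual", "runbook"
    text: str


def retrieve(question: str, corpus: list[Passage], top_k: int = 2) -> list[Passage]:
    """Toy retriever: rank passages by how many question words they share.

    A real system would compare embeddings; word overlap is a stand-in.
    """
    question_words = set(question.lower().split())
    scored = [
        (len(question_words & set(p.text.lower().split())), p) for p in corpus
    ]
    # Drop passages with no overlap, then sort best first (stable for ties).
    scored = [(score, p) for score, p in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in scored[:top_k]]


def build_prompt(question: str, passages: list[Passage]) -> str:
    """Number the retrieved passages and place them ahead of the question."""
    numbered = "\n".join(
        f"[{i}] ({p.source}) {p.text}" for i, p in enumerate(passages, start=1)
    )
    return (
        "Answer using only the passages below. Cite passage numbers.\n"
        "If the passages do not contain the answer, say so.\n\n"
        f"Passages:\n{numbered}\n\n"
        f"Question: {question}"
    )
```

The string returned by `build_prompt` is what the base model would see: it never needs to have been trained on the pricing sheet, only to read the passages handed to it.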

That framing exposes the decisions. The first is ingestion: how documents get from their home (S3, Git, Confluence, SharePoint) into a vector index. Are we writing that pipeline or having AWS run it? The second is chunking: documents are too big to embed whole, so they get split. How we split affects what the retriever can find: a paragraph-level chunk answers “what is the warranty period?” cleanly; a page-level chunk buries the answer. The third is embedding: we need a model that turns text into vectors, and the vectors’ quality caps the retriever’s quality. The fourth is storage: vectors live somewhere searchable. Managed or self-run; cheap or fast or both. The fifth is retrieval strategy: pure vector, hybrid with keyword, re-ranking, metadata filters, or something more elaborate. The sixth is the prompt: how the retrieved passages meet the user’s question, and how the model is steered to cite sources and refuse when it can’t find an answer. The seventh, always, is observability: what did the retriever return, what did the model do with it, and when it got something wrong, which step failed.

Another thing worth thinking about is where the sharp edges sit. A well-tuned retriever with a mediocre model beats a strong model with a bad retriever; most RAG failures trace back to chunking, embedding choice, or metadata filtering long before the generation step. That shapes which knobs matter.

And a softer one: the planning horizon of the team. A managed service that gets us to a working assistant in two weeks and handles the undifferentiated plumbing is worth a lot when there are two engineers. A custom stack that tunes every knob is worth a lot when there are twenty and the product is the retrieval itself.

What we’ll filter on

Distilling that into filters:

  1. Time to first working system: weeks or months?
  2. Flexibility at each step: can we swap embedding model, chunking strategy, retriever?
  3. Operational surface: how much infrastructure do we run ourselves?
  4. Cost shape: per-token, per-vector, per-hour, and how they compound?
  5. Source-of-truth fidelity: how quickly do changes in the underlying docs show up in answers?

The RAG architecture landscape

  1. Bedrock Knowledge Bases. The managed option. Point a knowledge base at a data source (S3 bucket, Confluence, SharePoint, Salesforce, web crawler, or a custom connector), pick an embedding model (Titan Text Embeddings v2, Cohere Embed English/Multilingual), pick a vector store (OpenSearch Serverless, Aurora PostgreSQL with pgvector, Neptune Analytics, Pinecone, MongoDB Atlas, or a quick-create OpenSearch Serverless collection if we don’t care yet), and Bedrock handles ingestion, chunking (fixed-size, hierarchical, semantic, or a custom Lambda), embedding, and retrieval. We call RetrieveAndGenerate with a query and get back a grounded answer plus citations. No servers. Sync is triggered by an API call (StartIngestionJob) or on a schedule. Ticks attributes 1, 3, and 5; gives up ground on 2.

  2. LangChain (or LlamaIndex) on our own infrastructure. A framework-assembled stack. LangChain wraps the pieces: loaders per source type, splitters for chunking, embeddings wrappers around Bedrock or Cohere or OpenAI, vector stores (Chroma, Pinecone, pgvector, OpenSearch, FAISS), retrievers (vector, hybrid BM25+vector, multi-query, parent-document), and a chain that glues retrieval to generation. Runs wherever Python runs: Lambda, Fargate, EKS. More knobs exposed; more moving parts to own. Ticks attributes 2 and (partly) 4; gives up ground on 1 and 3.

  3. Custom pipeline. We write the ingestion, chunking, embedding call, vector write, retriever, and prompt assembly by hand, using Bedrock’s InvokeModel for embeddings and generation and a vector store of our choice. No framework. Maximum control: custom chunking, custom metadata, custom retriever logic, custom prompt assembly, custom evaluation. Maximum code. Ticks 2 to the hilt; heavy cost on 1 and 3.

  4. Bedrock Agents with a Knowledge Base attached. Agents sit above Knowledge Bases and add tool-calling, action groups, and session memory. If the assistant needs to do things beyond answering (look up a customer’s subscription, trigger a refund), Agents make sense. For pure question-answering, Agents add a layer that isn’t earning its keep; Knowledge Bases alone are enough.

  5. Fine-tuning instead of retrieval. Mentioned only to rule it out. Fine-tuning embeds knowledge in model weights, which is slow to update and expensive to retrain. For content that changes weekly or hourly, fine-tuning is the wrong tool: the model would be out of date before it shipped. Fine-tuning earns its keep for style, format, and domain vocabulary, not for facts that mutate.

Side by side

| Option | Time to first system | Flexibility | Ops surface | Cost shape | Source fidelity |
| --- | --- | --- | --- | --- | --- |
| Bedrock Knowledge Bases | Days | Medium | Minimal | Per-query + vector-store hours | Sync on schedule or API |
| LangChain stack | Weeks | High | Moderate | Compute + vector-store + per-token | Whatever we build |
| Custom pipeline | Weeks to months | Total | Heavy | Compute + vector-store + per-token | Whatever we build |
| Agents + KB | Days (for Q&A) | Medium | Minimal | Agent invocation + KB query | Same as KB |
| Fine-tuning | Weeks per update | Wrong axis | Moderate | Training + hosting | Stale between trainings |

Reading the table against our situation: a six-week deadline with two engineers, content changing weekly to hourly, and accuracy that matters but doesn’t need state-of-the-art retrieval research. Knowledge Bases is the path of least resistance that also happens to tick the right attributes.

The three shapes, side by side

| Step | Bedrock Knowledge Bases | LangChain stack | Custom pipeline |
| --- | --- | --- | --- |
| Shape | managed ingest + retrieve + generate | framework-assembled, self-hosted | hand-written end to end |
| Sources | S3 · Confluence · web · connector | loaders (S3, Confluence, web) | hand-rolled ingest Lambda per source |
| Chunking | managed (fixed / hierarchical / semantic) | RecursiveCharacterTextSplitter | bespoke rules per document type |
| Embedding | Titan v2 or Cohere, called for us | BedrockEmbeddings wrapper | boto3 bedrock-runtime InvokeModel |
| Vector store | OpenSearch Serverless or Aurora pgvector | pgvector on RDS or Chroma | OpenSearch cluster we operate |
| Retrieve | Retrieve API, metadata filters | retriever chain (vector / hybrid) | query function + re-rank |
| Generate | RetrieveAndGenerate, cited output; one API call end to end | LLMChain with Bedrock backend; prompt template we maintain | InvokeModel with assembled prompt; citations bolted on by hand |

The same six steps, three different places to draw the line between us and AWS. In the original diagram, green rows are managed; red rows are ours to own.

The pick in depth: Bedrock Knowledge Bases

Knowledge Bases wins the shape test for this team. The interesting work is setting it up well, not building it from parts.

Data sources. Three sources map to three Knowledge Base data sources. The 400-page manual goes into an S3 bucket, with a weekly sync triggered by EventBridge on a schedule. The pricing sheet goes into the same S3 bucket, different prefix, with a sync triggered by a GitHub Action on each commit to main that calls StartIngestionJob. The runbook uses the Confluence connector, pointed at the specific space with OAuth credentials stored in Secrets Manager, synced on a 15-minute schedule so runbook edits propagate within the SLA on-call engineers expect.
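Both the EventBridge schedule and the GitHub Action end up making the same call. A minimal sketch of that call, assuming boto3’s `bedrock-agent` client; the environment variable names and the `start_sync` helper are hypothetical, and error handling (for example, a sync already in flight) is left out.

```python
import os


def start_sync(client, knowledge_base_id: str, data_source_id: str) -> str:
    """Kick off one ingestion job and return its ID for logging."""
    response = client.start_ingestion_job(
        knowledgeBaseId=knowledge_base_id,
        dataSourceId=data_source_id,
    )
    return response["ingestionJob"]["ingestionJobId"]


def handler(event, context):
    """Lambda entry point for a scheduled sync (hypothetical wiring)."""
    import boto3  # imported lazily so start_sync can be tested without AWS

    client = boto3.client("bedrock-agent")
    job_id = start_sync(
        client,
        os.environ["KNOWLEDGE_BASE_ID"],  # assumed env var name
        os.environ["DATA_SOURCE_ID"],     # assumed env var name
    )
    return {"ingestionJobId": job_id}
```

One data source ID per source keeps the three schedules independent: the runbook can sync every fifteen minutes without re-reading the manual.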

Chunking strategy. Fixed-size chunking (300 tokens, 20% overlap) is the default and it’s fine for the pricing sheet and runbook: short, self-contained sections. The manual benefits from hierarchical chunking, which stores parent-chunk and child-chunk together: the retriever finds a fine-grained 300-token child chunk, then the generator receives the coarser 1500-token parent so context around the match comes along for free. Hierarchical is worth it for long structured documents; it’s overkill for short ones.
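The arithmetic behind the two strategies is easy to see in a few lines. This is an illustrative sketch only, not Bedrock’s implementation: the function names are made up, the hierarchical version omits overlap, and the caller supplies a list of tokens (whitespace-split words are a rough stand-in for real tokens).

```python
def chunk_fixed(tokens: list[str], max_tokens: int = 300,
                overlap_pct: int = 20) -> list[list[str]]:
    """Fixed-size chunks where each chunk repeats the tail of the previous one."""
    if max_tokens <= 0 or not 0 <= overlap_pct < 100:
        raise ValueError("max_tokens must be positive and overlap_pct in [0, 100)")
    overlap = max_tokens * overlap_pct // 100  # 300 tokens at 20% -> 60
    step = max_tokens - overlap                # so each chunk starts 240 later
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # this chunk already reached the end of the document
    return chunks


def chunk_hierarchical(tokens: list[str], parent_tokens: int = 1500,
                       child_tokens: int = 300) -> list[tuple[int, list[str]]]:
    """Split into parents, then children that remember their parent's index.

    The retriever would search the children; the generator would be handed
    the parent that the matching child points back to.
    """
    pairs = []
    parents = chunk_fixed(tokens, parent_tokens, overlap_pct=0)
    for parent_index, parent in enumerate(parents):
        for child in chunk_fixed(parent, child_tokens, overlap_pct=0):
            pairs.append((parent_index, child))
    return pairs
```

With the defaults, a 700-token section becomes three fixed-size chunks starting at tokens 0, 240, and 480, and neighbouring chunks share 60 tokens.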

Embedding model. Titan Text Embeddings v2 at 1024 dimensions. Cohere Embed English v3 is a close competitor with slightly better retrieval on some English benchmarks; Multilingual if the corpus crosses languages. Embedding quality caps retrieval quality, but the difference between Titan v2 and Cohere Embed v3 is measured in single percentage points on MTEB, below the noise floor of our corpus. Pick one, measure, switch if needed.
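Knowledge Bases makes this call for us, but it helps to see what an embedding request looks like when measuring one model against another. A minimal sketch, assuming boto3’s `bedrock-runtime` client and the Titan Text Embeddings v2 model ID and request fields as I understand them; check both against the current Bedrock documentation. The helper names are ours.

```python
import json
import math

TITAN_V2_MODEL_ID = "amazon.titan-embed-text-v2:0"  # assumed model ID


def build_embedding_body(text: str, dimensions: int = 1024) -> str:
    """JSON request body for one Titan v2 embedding call."""
    return json.dumps(
        {"inputText": text, "dimensions": dimensions, "normalize": True}
    )


def embed(text: str, client=None) -> list[float]:
    """Return the embedding vector for `text`."""
    if client is None:
        import boto3  # lazy import keeps the helpers testable offline

        client = boto3.client("bedrock-runtime")
    response = client.invoke_model(
        modelId=TITAN_V2_MODEL_ID, body=build_embedding_body(text)
    )
    return json.loads(response["body"].read())["embedding"]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Similarity in [-1, 1]; higher means the two texts sit closer together."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Embedding a handful of real customer questions and their known-correct passages with each candidate model, then comparing similarities, is a cheap way to do the “measure” step on our own corpus instead of on a public benchmark.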

Vector store. OpenSearch Serverless is the path of least friction: quick-create from the Knowledge Base console, no capacity planning, scales on demand. Aurora PostgreSQL with pgvector is cheaper at steady state and worth it once the corpus stabilises; it also lets us query the vectors from application code with plain SQL if we ever want a hybrid retriever driven by metadata. For six-week delivery, OpenSearch Serverless wins. Migrate to pgvector later if the bill justifies it.
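For a sense of what “plain SQL” means on the pgvector side, here is a sketch that builds a parameterised nearest-neighbour query. The table name `bedrock_kb` and the columns `id`, `chunks`, `embedding`, and `metadata` are hypothetical; the real names are whatever we choose when creating the Aurora table. The `<=>` operator is pgvector’s cosine distance.

```python
def build_pgvector_query(query_embedding: list[float], source: str,
                         top_k: int = 5) -> tuple[str, tuple]:
    """Return (sql, params) for a cosine-distance search scoped to one source."""
    # pgvector accepts a text literal like '[0.1,0.2]' cast to the vector type.
    vector_literal = "[" + ",".join(repr(float(x)) for x in query_embedding) + "]"
    sql = (
        "SELECT id, chunks, embedding <=> %s::vector AS distance "
        "FROM bedrock_kb "                  # hypothetical table name
        "WHERE metadata->>'source' = %s "   # assumes a JSON metadata column
        "ORDER BY distance "
        "LIMIT %s"
    )
    return sql, (vector_literal, source, top_k)
```

With a PostgreSQL driver such as psycopg, the pair would be passed straight to `cursor.execute(sql, params)`, which keeps the embedding and the filter value out of the SQL string itself.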

Retrieval. RetrieveAndGenerate handles query-embedding, vector search, and generation in one call, returning an answer plus citations that name the source chunk and the parent document. Retrieve returns the chunks without generation, useful for a debug endpoint that shows what the retriever found, which is the single most important observability surface in a RAG system.
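A debug endpoint of that kind can be very small. The sketch below assumes the request and response shapes of the `retrieve` operation on boto3’s `bedrock-agent-runtime` client as I understand them; the two helper names and the `source` metadata key are ours.

```python
def build_retrieve_request(question: str, knowledge_base_id: str,
                           top_k: int = 5) -> dict:
    """Keyword arguments for client.retrieve(...)."""
    return {
        "knowledgeBaseId": knowledge_base_id,
        "retrievalQuery": {"text": question},
        "retrievalConfiguration": {
            "vectorSearchConfiguration": {"numberOfResults": top_k}
        },
    }


def summarise_results(response: dict, preview_chars: int = 80) -> list[dict]:
    """Flatten a Retrieve response into rows a human can scan quickly."""
    rows = []
    for rank, result in enumerate(response.get("retrievalResults", []), start=1):
        text = result.get("content", {}).get("text", "")
        rows.append(
            {
                "rank": rank,
                "score": result.get("score"),
                "source": result.get("metadata", {}).get("source"),
                "preview": text[:preview_chars],
            }
        )
    return rows
```

Usage would look like `summarise_results(client.retrieve(**build_retrieve_request(question, kb_id)))`. When an answer is wrong, this table shows at a glance whether the right chunk was missing from the results or present but ignored.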

Metadata filtering. Each Knowledge Base data source carries metadata. Tag manual chunks with source: "manual", pricing chunks with source: "pricing", runbook chunks with source: "runbook". The Retrieve call accepts a filter parameter that scopes results to a subset of sources. A customer-facing assistant filters out the runbook (internal-only) without rebuilding the index; an internal assistant lets it through.
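As a sketch of how that scoping might look in code: the filter lives inside the retrieval configuration, and for S3-hosted documents the tag comes from a sidecar file next to each object. The `equals` and `in` operators and the `metadataAttributes` sidecar key are the shapes as I understand them from the Bedrock documentation; the helper names and the two audience lists are ours.

```python
CUSTOMER_FACING = ["pricing", "manual"]           # runbook stays internal
INTERNAL = ["pricing", "manual", "runbook"]


def source_filter(allowed_sources: list[str]) -> dict:
    """Retrieval filter that keeps only chunks tagged with an allowed source."""
    if not allowed_sources:
        raise ValueError("at least one source must be allowed")
    if len(allowed_sources) == 1:
        return {"equals": {"key": "source", "value": allowed_sources[0]}}
    return {"in": {"key": "source", "value": list(allowed_sources)}}


def retrieval_configuration(allowed_sources: list[str], top_k: int = 5) -> dict:
    """Value for the retrievalConfiguration argument of a Retrieve call."""
    return {
        "vectorSearchConfiguration": {
            "numberOfResults": top_k,
            "filter": source_filter(allowed_sources),
        }
    }


def metadata_sidecar(source: str) -> dict:
    """Contents of e.g. manual.pdf.metadata.json, stored beside manual.pdf in S3."""
    return {"metadataAttributes": {"source": source}}
```

The customer-facing assistant would pass `retrieval_configuration(CUSTOMER_FACING)` and the internal one `retrieval_configuration(INTERNAL)`, both against the same index.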

The prompt template. Knowledge Bases uses a default prompt that’s decent; a custom prompt template earns the last mile. Things worth being explicit about: cite every claim by chunk number, refuse when the retrieved passages don’t contain the answer (don’t hallucinate a plausible one), answer in a defined tone (friendly, not chatty), and if the question is ambiguous, ask a follow-up instead of guessing.
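Those four instructions might translate into a template like the one below. Treat it as a starting sketch: the wording is ours, and the `$search_results$` and `$output_format_instructions$` placeholders are the ones I understand Knowledge Bases to substitute at query time, so they are worth checking against the current documentation for the chosen model.

```python
REQUIRED_PLACEHOLDERS = ("$search_results$", "$output_format_instructions$")

PROMPT_TEMPLATE = """You are a customer support assistant.

Rules:
1. Use only the numbered search results below. Cite the chunk number for every claim.
2. If the search results do not contain the answer, say you could not find it.
   Do not guess.
3. Keep the tone friendly and brief.
4. If the question is ambiguous, ask one clarifying question instead of answering.

Search results:
$search_results$

$output_format_instructions$
"""


def generation_configuration(template: str = PROMPT_TEMPLATE) -> dict:
    """Build the generationConfiguration block for a RetrieveAndGenerate call."""
    missing = [p for p in REQUIRED_PLACEHOLDERS if p not in template]
    if missing:
        raise ValueError(f"template is missing placeholders: {missing}")
    return {"promptTemplate": {"textPromptTemplate": template}}
```

The placeholder check is there because a template that silently drops the search results would still produce fluent answers, just ungrounded ones.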

A worked example: a pricing question

A customer asks: “What’s the difference between the Pro and Team plans, and when does the Team plan discount kick in?”

  1. RetrieveAndGenerate embeds the query using Titan v2 (a ~50ms call).
  2. OpenSearch Serverless runs a k-nearest-neighbour search against the indexed chunks, filtered to source IN ("pricing", "manual") (excluding the runbook). Top 5 chunks come back: two pricing sections describing each plan, one manual section on volume tiers, two adjacent pricing sections about overage and billing.
  3. The generator (Claude Sonnet 4.5, as it happens) receives the custom prompt with the five chunks, the conversation history, and the user question. It composes an answer, citing chunks [1] and [3].
  4. The response arrives with an answer string and a citations array; the web front-end renders the citations as clickable links to the source documents.
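The last step, turning the citations array into links, could be sketched like this. The response shape (`citations`, `retrievedReferences`, and a `location` with a `type` plus a type-specific sub-object) is as I understand the RetrieveAndGenerate response; the helper names are ours.

```python
from typing import Optional


def reference_link(reference: dict) -> Optional[str]:
    """Pull the document address out of one retrieved reference."""
    location = reference.get("location", {})
    kind = location.get("type")
    if kind == "S3":
        return location.get("s3Location", {}).get("uri")
    if kind == "CONFLUENCE":
        return location.get("confluenceLocation", {}).get("url")
    return None  # other connector types are not handled in this sketch


def citation_links(response: dict) -> list[str]:
    """Unique source addresses in the order the answer cites them."""
    links = []
    for citation in response.get("citations", []):
        for reference in citation.get("retrievedReferences", []):
            link = reference_link(reference)
            if link and link not in links:
                links.append(link)
    return links
```

S3 locations come back as `s3://` URIs, so the front-end still needs its own mapping from those to something a customer can click, such as a public docs URL.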

Total latency: ~1.2 seconds. Total cost at current Bedrock pricing: a handful of cents. What the team had to build: three data-source configs, one custom prompt template, one GitHub Action, one EventBridge schedule, and a thin API Gateway + Lambda that forwards the user’s question to RetrieveAndGenerate with the right session ID.
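The “thin API Gateway + Lambda” really can be thin. A minimal sketch, assuming an API Gateway proxy event with a JSON body; the body fields `question` and `sessionId` and the environment variable names are hypothetical, and the request shape is `retrieve_and_generate` on boto3’s `bedrock-agent-runtime` client as I understand it.

```python
import json
import os


def make_handler(client, knowledge_base_id: str, model_arn: str):
    """Build a Lambda-style handler around an injected Bedrock client."""

    def handler(event, context):
        body = json.loads(event.get("body") or "{}")
        question = (body.get("question") or "").strip()
        if not question:
            return {"statusCode": 400,
                    "body": json.dumps({"error": "question is required"})}

        request = {
            "input": {"text": question},
            "retrieveAndGenerateConfiguration": {
                "type": "KNOWLEDGE_BASE",
                "knowledgeBaseConfiguration": {
                    "knowledgeBaseId": knowledge_base_id,
                    "modelArn": model_arn,
                },
            },
        }
        # Passing the previous session ID back keeps conversation history.
        if body.get("sessionId"):
            request["sessionId"] = body["sessionId"]

        response = client.retrieve_and_generate(**request)
        return {
            "statusCode": 200,
            "body": json.dumps({
                "answer": response["output"]["text"],
                "citations": response.get("citations", []),
                "sessionId": response.get("sessionId"),
            }),
        }

    return handler


def lambda_handler(event, context):
    """Deployed entry point; env var names are assumptions."""
    import boto3  # lazy import so make_handler is testable without AWS

    client = boto3.client("bedrock-agent-runtime")
    handler = make_handler(
        client, os.environ["KNOWLEDGE_BASE_ID"], os.environ["MODEL_ARN"]
    )
    return handler(event, context)
```

Injecting the client is a small design choice that pays off in tests: the handler can be exercised with a stub that returns a canned response, with no AWS credentials involved.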

What’s worth remembering

  1. RAG separates knowing from writing. The base model writes; the retriever knows. Don’t ask the model to know things that change.
  2. Chunking matters more than model choice. A great model with bad chunks underperforms a mediocre model with great chunks. Hierarchical chunking for long structured documents; fixed-size with overlap for short ones.
  3. Embedding quality caps retrieval quality. Titan v2 and Cohere Embed v3 are close; pick one, measure, switch only with evidence.
  4. Knowledge Bases is the default for question-answering RAG. Managed ingestion, managed vector store, managed retrieve-and-generate, connectors for the common source types, citations in the response shape. Days to first working system.
  5. LangChain earns its keep when you need flexibility Knowledge Bases doesn’t offer. Custom retrievers, multi-step reasoning, odd source types, aggressive cost control. You pay in ops surface.
  6. Custom is right when retrieval is the product. If the team’s differentiator is their retrieval algorithm, write it. Otherwise, don’t.
  7. Agents on top of Knowledge Bases unlock tool use. If the assistant has to do things and not just answer, Agents wrap a Knowledge Base with action groups and session memory.
  8. Fine-tuning is not the tool for mutating knowledge. It’s for style, format, and domain vocabulary: things that don’t change weekly.

The assistant ships in three weeks, not six. The retriever explains its citations; the prompt refuses when it doesn’t know; the runbook propagates in fifteen minutes; and the two engineers owning the thing aren’t spending their week babysitting an ingest pipeline they wrote themselves.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.