You shipped a RAG chatbot last quarter. Embeddings, vector database, prompt template, the lot. Demo went great. Three months in, the support team is finding answers that are technically in the corpus but consistently the wrong ones – close enough on the embedding to rank highly, but not actually what the question was asking. You crank the top-k from 5 to 20, the LLM gets confused by the noise, and the answers get worse. You’re stuck.
The fix is a step you skipped.
In To LLMs… and Beyond! we covered RAG – retrieval-augmented generation – as a two-step pattern: retrieve relevant documents (by embedding the query and looking up its nearest neighbours), then generate the answer. That’s the correct shape for explanation. It’s also the wrong shape for production. Most working RAG systems have three steps, and the missing middle one is where the quality lives.
This post is about that middle step.
Why a single retrieval pass isn’t enough
The retrieval step in RAG uses what’s called a bi-encoder: an encoder model (usually BERT-family, see The Other Transformers) that produces a single vector for each piece of text. The query gets one vector. Each document gets one vector. You compare them by cosine similarity – the closer the angle, the more similar the texts.
This is fast. Embarrassingly fast. You can pre-compute the document vectors once and store them in a database. At query time, you only need to embed the query (a few milliseconds) and find the nearest neighbours (a few more milliseconds, even across millions of documents). It scales to web-search levels.
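To make the mechanics concrete, here’s a minimal sketch of the comparison step. The three-dimensional vectors and document names are made up for illustration (real embeddings come from a model and have hundreds of dimensions), and a real system would use a vector database rather than a dictionary and a sort.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat it as unrelated
    return dot / (norm_a * norm_b)


# Pre-computed once, offline. Toy values, NOT real model output.
doc_vecs: Dict[str, List[float]] = {
    "refund-policy": [0.8, 0.2, 0.1],
    "shipping-times": [0.1, 0.9, 0.2],
}

# Computed at query time. Also a toy value.
query_vec = [0.9, 0.1, 0.0]

# Rank documents by how closely their vector points the same way as the query's.
ranked = sorted(
    doc_vecs,
    key=lambda doc_id: cosine_similarity(query_vec, doc_vecs[doc_id]),
    reverse=True,
)
```

Note that nothing in the ranking step looks at the text of either side. It only compares two vectors that were produced separately.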
It’s also kind of dumb.
The bi-encoder embeds the query and the document independently. The model never sees them together. It produces a vector for the query that captures the query’s meaning in general, and a vector for the document that captures the document’s meaning in general, and then you compare those two general representations. There’s no opportunity for the model to notice that this specific query is asking about a specific aspect of this specific document.
In practice this means bi-encoders are good at finding documents that are topically related to the query. They’re less good at finding the documents that actually answer the query. Two documents about the same topic can have very similar embeddings even if only one of them contains the answer.
For a vague question like “what’s our refund policy?” topical similarity is enough. For a specific question like “can I get a refund on a digital download after 30 days if I haven’t used it?” you need a model that can read the query and the candidate documents together and decide which one actually addresses the conditions.
That’s a cross-encoder.
What a cross-encoder is
A cross-encoder is the same architecture (an encoder Transformer) used a different way. Instead of producing a vector for each text, it takes a pair of texts – query and candidate document – and produces a single relevance score.
The query and document get concatenated with a separator token, fed through the model together, and the model’s full attention mechanism gets to see every query token attend to every document token and vice versa. The output is one number: how well does this document answer this query?
[CLS] can I get a refund on a digital download after 30 days [SEP]
Refund policy: physical goods may be returned within 30 days. Digital
downloads are non-refundable once purchased. [SEP]
The model reads that and outputs, say, 0.91 – the document is highly relevant because it directly addresses both “digital download” and “refund,” even though the answer is “no.” A different document that only mentions the 30-day window for physical goods might score 0.34.
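A rough sketch of the two ends of that process, under some assumptions: the string format mirrors the example above (in practice a tokenizer inserts the special tokens for you), and the sigmoid step assumes the model emits a raw logit, which is common but depends on the model. Both function names are made up for illustration.

```python
import math


def build_pair_input(query: str, document: str) -> str:
    """Join query and document the way a BERT-style cross-encoder sees them.

    Illustrative only: real tokenizers add [CLS] and [SEP] themselves.
    """
    return f"[CLS] {query} [SEP] {document} [SEP]"


def logit_to_score(logit: float) -> float:
    """Squash a raw model output into the 0-1 range with a sigmoid."""
    return 1.0 / (1.0 + math.exp(-logit))


pair = build_pair_input(
    "can I get a refund on a digital download after 30 days",
    "Digital downloads are non-refundable once purchased.",
)
```

Under that assumption, a raw output of 2.3 maps to roughly 0.91, and a raw output of 0 maps to exactly 0.5.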
Cross-encoders are dramatically more accurate than bi-encoders for relevance. They’re also dramatically slower. Because the model has to see the query and document together, you can’t pre-compute anything – every query against every candidate is a fresh forward pass. If you had a million documents and ran the cross-encoder against all of them, that would be a million forward passes per query: at a couple of milliseconds per pair, over half an hour of compute for a single answer.
Which is why you don’t do that. You do retrieve-then-rerank.
The two-stage pattern
The standard production RAG pipeline is:
- Retrieval (bi-encoder). Embed the query, find the top 50-200 candidate documents from the vector database. Fast, parallel, scalable.
- Reranking (cross-encoder). Score each of those candidates against the query using a cross-encoder. Pick the top 3-10 by score.
- Generation (LLM). Pass the top reranked documents into the LLM along with the query. Generate the answer.
The retrieval stage is “we cast a wide net, fast.” The reranking stage is “we read each catch carefully, slowly, but only the ones in the net.” Together they let you get cross-encoder-quality relevance at bi-encoder-scale corpus sizes.
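The first two stages can be sketched as one function with two pluggable scorers. Everything here is a stand-in: `word_overlap` plays the role of the bi-encoder and `shared_bigrams` the role of the cross-encoder, purely so the example runs without a model. The shape is the point, not the scoring.

```python
from typing import Callable, List, Tuple

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(
    query: str,
    corpus: List[str],
    cheap_score: Scorer,
    careful_score: Scorer,
    n_candidates: int = 100,
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    # Stage 1: cheap scoring over the whole corpus, keep a wide net.
    candidates = sorted(
        corpus, key=lambda doc: cheap_score(query, doc), reverse=True
    )[:n_candidates]
    # Stage 2: expensive scoring, but only on what the net caught.
    scored = [(doc, careful_score(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def word_overlap(query: str, doc: str) -> float:
    """Toy stand-in for a bi-encoder: count shared words."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def shared_bigrams(query: str, doc: str) -> float:
    """Toy stand-in for a cross-encoder: count shared adjacent word pairs."""
    def bigrams(text: str) -> set:
        words = text.lower().split()
        return set(zip(words, words[1:]))
    return float(len(bigrams(query) & bigrams(doc)))


corpus = [
    "refund policy for physical goods returned within 30 days",
    "digital download refund rules after 30 days",
    "shipping times for physical goods",
]
result = retrieve_then_rerank(
    "refund on a digital download after 30 days",
    corpus, word_overlap, shared_bigrams, n_candidates=2, top_k=1,
)
```

The third document never reaches the careful scorer at all, which is exactly where the savings come from. The top reranked documents would then go into the prompt for the generation step.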
The numbers are striking. For a corpus of one million documents:
- Bi-encoder only: ~10ms per query, mediocre relevance.
- Cross-encoder only: ~1,000,000 model calls per query. Untenable.
- Bi-encoder + cross-encoder: ~10ms retrieval + ~200ms reranking on 100 candidates = ~210ms total, with relevance approaching cross-encoder-only quality.
That third option is what every serious RAG system is doing. The blog posts that don’t mention it are showing you the demo, not the production system.
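The arithmetic behind those figures is simple enough to check. This sketch assumes a constant per-pair cost and sequential scoring, taking 2ms per pair from the “~200ms on 100 candidates” figure above. Batching on a GPU would change the constants but not the conclusion.

```python
def rerank_latency_ms(n_pairs: int, ms_per_pair: float) -> float:
    """Total cross-encoder time if every pair costs the same (an assumption)."""
    return n_pairs * ms_per_pair


MS_PER_PAIR = 200 / 100   # 2 ms, implied by ~200 ms for 100 candidates
RETRIEVAL_MS = 10         # bi-encoder lookup, from the figures above

# Retrieve, then rerank only the 100 candidates.
two_stage_ms = RETRIEVAL_MS + rerank_latency_ms(100, MS_PER_PAIR)

# Run the cross-encoder against the whole million-document corpus instead.
full_corpus_ms = rerank_latency_ms(1_000_000, MS_PER_PAIR)
full_corpus_minutes = full_corpus_ms / 1000 / 60
```

With these inputs `two_stage_ms` comes to 210 and `full_corpus_minutes` to a little over 33, a gap of roughly four orders of magnitude for one query.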
Models you can actually use
Reranker models are a small but mature corner of the ecosystem, with solid open-source options alongside the hosted APIs.
| Model | Made by | Open / closed | Notable for |
|---|---|---|---|
| BGE Reranker (v2-m3, large) | BAAI | Open | Strong default, multilingual, well-supported |
| Cohere Rerank | Cohere | Closed (API) | Easy integration, multilingual, pay-per-call |
| Voyage Rerank | Voyage AI | Closed (API) | High quality, instruction-tuned variants |
| ms-marco-MiniLM-L-6-v2 | sentence-transformers | Open | Tiny (22M params), runs on CPU, fine for English |
| Jina Reranker | Jina AI | Open / API | Long-context variants for document-level reranking |
The lightweight ones (the MiniLM cross-encoders, around 20-100M parameters) run on a CPU. The heavyweight ones (BGE Reranker v2-m3, around 568M parameters) want a GPU but produce noticeably better rankings. For most projects the correct starting point is the smallest open model that fits your latency budget; you can swap up if quality demands it.
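As a starting point, here’s roughly what wiring up the small MiniLM model looks like with the sentence-transformers library. Treat it as a sketch: it assumes the library is installed, it assumes the Hugging Face model id `cross-encoder/ms-marco-MiniLM-L-6-v2`, and the function names `rerank` and `order_by_score` are invented for this example. The import sits inside the function so the ordering helper can be used without the library.

```python
from typing import List, Sequence, Tuple


def order_by_score(
    docs: Sequence[str], scores: Sequence[float], top_k: int
) -> List[Tuple[str, float]]:
    """Pair each document with its score and return the top_k, best first."""
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return [(doc, float(score)) for doc, score in ranked[:top_k]]


def rerank(
    query: str,
    docs: Sequence[str],
    top_k: int = 5,
    model_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2",  # assumed model id
) -> List[Tuple[str, float]]:
    # Lazy import: needs `pip install sentence-transformers`.
    from sentence_transformers import CrossEncoder

    # Downloads weights on first use. In a real service, load the model
    # once at startup and reuse it rather than constructing it per call.
    model = CrossEncoder(model_name)
    # One (query, document) pair per candidate, one score back per pair.
    scores = model.predict([(query, doc) for doc in docs])
    return order_by_score(docs, scores, top_k)
```

Swapping up to a larger model should then mostly be a matter of changing `model_name`, assuming the replacement is published in a format the library can load.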
When reranking earns its keep
Not every retrieval task needs a reranker. The benefit grows with task difficulty:
- Vague topical queries against a small corpus: bi-encoder is fine. “Tell me about our company values” against a 50-document handbook will return the correct document on cosine similarity alone.
- Specific factual queries against a medium corpus: reranker helps. “What’s the SLA for our enterprise tier?” against a thousand-document knowledge base benefits from the cross-encoder noticing that the document mentioning enterprise tier SLAs specifically is more relevant than the one with the same words in a marketing context.
- Long-tail queries against a large corpus: reranker is essential. Web-scale search, code search, scientific literature search – the bi-encoder will return a heap of plausible-but-not-quite candidates, and the reranker is what separates them.
The pattern: bi-encoders fail by returning plausibly-related but not actually-answering documents. If your eval set is full of cases like that, you need a reranker. If your bi-encoder is missing the correct document entirely (it’s not in the top 200), reranking won’t save you – you need better embeddings or a hybrid retrieval strategy. Different problem.
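One way to tell the two problems apart is to look at where the correct document lands in the bi-encoder’s ranking for each question in your eval set. A minimal sketch, assuming you have the known-correct document id for each question. The function names, labels, and thresholds are illustrative and should match your own pipeline’s settings.

```python
from collections import Counter
from typing import List, Optional, Tuple


def rank_of(correct_id: str, ranked_ids: List[str]) -> Optional[int]:
    """1-based rank of the correct document, or None if it was not retrieved."""
    try:
        return ranked_ids.index(correct_id) + 1
    except ValueError:
        return None


def diagnose(
    correct_id: str, ranked_ids: List[str], final_k: int = 5, candidate_k: int = 200
) -> str:
    rank = rank_of(correct_id, ranked_ids)
    if rank is not None and rank <= final_k:
        return "ok"                 # already in what the LLM sees
    if rank is not None and rank <= candidate_k:
        return "reranker can help"  # retrieved, but ranked too low
    return "retrieval miss"         # reranking cannot recover this one


def summarise(
    cases: List[Tuple[str, List[str]]], final_k: int = 5, candidate_k: int = 200
) -> Counter:
    """Count each diagnosis across (correct_id, ranked_ids) eval cases."""
    return Counter(diagnose(cid, ids, final_k, candidate_k) for cid, ids in cases)
```

If the summary is dominated by “reranker can help”, a cross-encoder is the next thing to try. If it’s dominated by “retrieval miss”, the work is upstream.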
Hybrid retrieval: the other thing you might be missing
While we’re here, the second-most-skipped step in RAG explanations: hybrid retrieval.
Bi-encoders work on semantic meaning. They’re great at handling paraphrase (“how do I cancel?” finds documents about “subscription termination”). They’re weak at exact matches – product codes, person names, error messages, version numbers. The vector for KB-ERR-2847-fatal doesn’t necessarily live near the vector for 2847 in embedding space, because the model has never seen that specific string and treats it as a sequence of arbitrary subword tokens.
Hybrid retrieval combines a semantic search (bi-encoder, dense vectors) with a lexical search (BM25, sparse keyword matching) and merges the results. The semantic search catches paraphrase. The lexical search catches exact matches. The reranker takes the union and sorts it.
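For a feel of what the lexical side does, here’s a compact BM25 scorer. It’s a sketch, not a search engine: it assumes whitespace tokenisation and lowercasing, uses the common `k1`/`b` defaults, and uses the smoothed IDF form found in Lucene-style implementations. In production you would normally lean on a search engine or a library instead.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(
    query: str, docs: List[str], k1: float = 1.5, b: float = 0.75
) -> List[float]:
    """Score every document against the query with BM25 (whitespace tokens)."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # exact-match only: no credit for near misses
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


docs = [
    "error KB-ERR-2847-fatal raised when the disk is full",
    "how to cancel a subscription",
    "general troubleshooting guide for errors",
]
scores = bm25_scores("KB-ERR-2847-fatal", docs)
```

Only the first document gets a non-zero score, because it’s the only one containing the exact token. With this naive tokeniser a query for just `2847` would match nothing, which is why real lexical indexes put some thought into how they split identifiers.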
In production:
- Semantic retrieval returns top 100 by embedding similarity.
- Lexical retrieval returns top 100 by BM25 score.
- Merge – take the union (often 150-200 documents after dedup).
- Rerank with a cross-encoder, take the top 5-10.
- Generate with the LLM.
This pattern – often called hybrid retrieval with cross-encoder reranking – is the realistic shape of a production RAG system in 2026. The blog-post version with one embedding lookup is the simplification.
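The merge step is the simplest part. A minimal sketch, assuming both retrievers return ranked lists of document ids. The function name is made up, and interleaving is just one reasonable choice: since the reranker re-sorts everything, the merged order only matters if you truncate before reranking.

```python
from itertools import zip_longest
from typing import List


def merge_candidates(semantic_ids: List[str], lexical_ids: List[str]) -> List[str]:
    """Union of two ranked lists, deduplicated and interleaved.

    Interleaving keeps the best hits from both retrievers near the front,
    so neither side is dropped wholesale if the list is cut short.
    """
    merged: List[str] = []
    seen = set()
    for pair in zip_longest(semantic_ids, lexical_ids):
        for doc_id in pair:
            if doc_id is not None and doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged


candidates = merge_candidates(["a", "b", "c"], ["c", "d"])
```

Here `candidates` ends up as `["a", "c", "b", "d"]`: four unique documents from five hits, ready to hand to the cross-encoder.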
A decision table
| Symptom | Likely fix |
|---|---|
| "The correct document is in the top 50 but not the top 5" | Add a reranker |
| "The correct document isn't in the top 50 at all" | Better embeddings, or hybrid retrieval (BM25 + semantic), or chunk differently |
| "It can't find specific product codes / IDs" | Hybrid retrieval -- you need lexical matching |
| "The LLM is confused by too many candidates" | Lower top-k after reranking; trust the reranker to filter |
| "Latency is too high" | Smaller reranker (MiniLM cross-encoders), or fewer candidates into the reranker |
| "Quality varies wildly between users" | Likely a chunking or query-rewriting issue, not a reranker issue |
The shortcut version of RAG – embed, look up, generate – works in the demo because the demo corpus is small and the demo questions are vague. The production version has to handle a thousand specific questions against a million documents, and that’s where the bi-encoder’s independence starts to hurt. Embedding the query and the document separately is what makes retrieval scale, and it’s also what stops the model noticing whether the candidate it returned actually answers the question or merely shares a topic with it. The cross-encoder is the cure for that, because it reads the pair together and lets attention work across both halves. The price is speed, which is why nobody runs a cross-encoder against the whole corpus. They run it against the top hundred the bi-encoder fished out, and they merge in BM25 results so the product codes and error strings don’t get lost in the semantic blur.
A reranker can only do its job if the correct document already made it into the candidate set. If the bi-encoder misses entirely, no amount of reranking will recover the answer – the fix lives in the chunking, the embeddings, or the lexical search. Worth knowing which symptom you have before you start tuning.