Why Rerankers Matter in RAG: Fixing the Two-Tower Blind Spot

2025-10-04

Retrieval-Augmented Generation (RAG) combines retrieval and generation to produce grounded, factual answers.
Yet many systems plateau in quality. You improve embeddings, expand the corpus, tweak thresholds—still no gain.
The problem isn’t recall. It’s ranking.
The hidden bottleneck lies in the two-tower architecture.



1. The Two-Tower Bottleneck

Most retrieval systems use bi-encoders (dual encoders):

score(q, d) = dot(E_q(q), E_d(d))

  • One tower encodes the query
  • One tower encodes the document
  • Their dot product defines similarity

This is fast and scalable—you can pre-encode millions of docs and search them in milliseconds.
But it’s also blind: the model never sees query and document together.

No token-level interaction.
No understanding of whether one actually answers the other.
You’re ranking without reading.
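A minimal sketch of the two-tower flow, assuming the sentence-transformers library and the all-MiniLM-L6-v2 checkpoint (my choice for illustration; any bi-encoder behaves the same way). The point to notice is the offline/online split: documents are encoded once ahead of time, and each query costs one encoding plus dot products.

```python
from sentence_transformers import SentenceTransformer, util

# One shared encoder stands in for both towers; some systems train
# separate query and document encoders.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

corpus = [
    "RAG combines a retriever with a generator.",
    "Bi-encoders embed queries and documents independently.",
    "Cross-encoders read the query and document together.",
]

# Offline: encode every document once and store the vectors.
# This is the step that scales to millions of documents.
doc_vectors = encoder.encode(corpus, convert_to_tensor=True)

# Online: encode only the query, then rank by dot product,
# i.e. score(q, d) = dot(E_q(q), E_d(d)).
query_vector = encoder.encode("How does a bi-encoder score documents?", convert_to_tensor=True)
scores = util.dot_score(query_vector, doc_vectors)[0]
for doc, score in sorted(zip(corpus, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {doc}")
```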



2. Semantic ≠ Relevant

Bi-encoders measure semantic similarity, not relevance.

| Query | Document | Cosine Similarity | True Relevance |
|---|---|---|---|
| “How to prevent heart attacks?” | “Heart attacks kill millions yearly.” | High | Low |
| “How to prevent heart attacks?” | “Regular exercise and diet reduce heart attack risk.” | Moderate | High |

Both share terms like “heart attack,” so the embedding space pulls them close.
But only the second is relevant.
That’s the failure mode of pure semantic search.
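You can reproduce this failure mode in a few lines. A sketch using the same bi-encoder as above (exact scores depend on the model, but topically similar non-answers routinely land close to, or above, the real answer):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "How to prevent heart attacks?"
docs = [
    "Heart attacks kill millions yearly.",                  # same topic, not an answer
    "Regular exercise and diet reduce heart attack risk.",  # the actual answer
]

q_emb = model.encode(query, convert_to_tensor=True)
d_emb = model.encode(docs, convert_to_tensor=True)

# Cosine similarity measures "talks about the same thing",
# not "answers the question".
for doc, score in zip(docs, util.cos_sim(q_emb, d_emb)[0].tolist()):
    print(f"{score:.3f}  {doc}")
```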



3. The Retrieval–Reranking Gap

Let’s put rough numbers on it.

  • Retrieval@100 (bi-encoder): ~85% recall
  • Top-10 from those 100: ~60% precision
  • After reranking top-100: ~85% precision

You give up roughly 25 points of precision when you skip reranking.
That’s the gap between a system that “sounds right” and one that is right.
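To measure that gap on your own data, both numbers are easy to compute from a labeled query set. A minimal sketch with hypothetical helper names (recall_at_k and precision_at_k are not from the post), given a ranked list of document IDs and the set of IDs judged relevant:

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant documents that appear in the top-k."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / max(len(relevant_ids), 1)


def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k that is actually relevant."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / k


# Compare precision_at_k(bi_encoder_ranking, relevant, 10) against
# precision_at_k(reranked_ranking, relevant, 10) on the same retrieved
# top-100 to see exactly how much precision reranking buys you.
```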



4. Why Cross-Encoders Fix It

A cross-encoder doesn’t treat the query and document separately.
It encodes them jointly in a single forward pass:

score(q, d) = fθ([q, SEP, d])

That single joint pass changes everything: attention now flows across query and document tokens.
The model now sees:

  1. Word-level query–document interactions
  2. Whether the document actually answers the question
  3. Negations, entailments, and contextual clues

It’s not comparing vectors anymore — it’s reading.
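Here is the heart-attack example from section 2, scored jointly. A sketch assuming sentence-transformers and the cross-encoder/ms-marco-MiniLM-L-6-v2 checkpoint (a commonly used public reranker, not something the post prescribes); any cross-encoder exposes the same pair-scoring interface:

```python
from sentence_transformers import CrossEncoder

# The model receives query and document as one concatenated sequence,
# so attention flows across both: score(q, d) = f_theta([q, SEP, d]).
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How to prevent heart attacks?"
docs = [
    "Heart attacks kill millions yearly.",
    "Regular exercise and diet reduce heart attack risk.",
]

scores = reranker.predict([(query, doc) for doc in docs])
for doc, score in sorted(zip(docs, scores), key=lambda x: -x[1]):
    print(f"{score:.3f}  {doc}")
```

Because the reranker reads each pair in full, it can separate “mentions heart attacks” from “tells you how to prevent them.”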



5. The Computational Cost

So why not just use cross-encoders everywhere? Because they’re expensive: every query-document pair needs its own full transformer forward pass, so scoring n candidates means n forward passes per query.

For 100 retrieved documents:

100 docs × 512 tokens × 12 layers = 614,400 token-layer computations per query

They can’t be precomputed, since encoding depends on the query itself.
That’s why rerankers run after retrieval — not before.

| Stage | Function | Scale | Cost | Latency |
|---|---|---|---|---|
| 1. Retrieval (bi-encoder) | 10M → top-100 | Huge | Cheap | ~1 ms |
| 2. Reranking (cross-encoder) | 100 → top-10 | Small | Costly | ~50 ms |

Skip reranking → fast but wrong
Rerank everything → accurate but bankrupt
Smart pipeline → both fast and right.
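Wired together, the two stages look roughly like this. A sketch of the retrieve-then-rerank pipeline, again using sentence-transformers; in production the brute-force semantic_search call would be replaced by an ANN index (FAISS, HNSW, a vector database), but the shape of the pipeline stays the same:

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")


def index_corpus(corpus):
    # Offline: pre-encode the whole corpus once.
    return bi_encoder.encode(corpus, convert_to_tensor=True)


def retrieve_then_rerank(query, corpus, corpus_vectors, k_retrieve=100, k_final=10):
    # Stage 1: bi-encoder retrieval (huge scale, roughly milliseconds).
    q_vec = bi_encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(q_vec, corpus_vectors, top_k=k_retrieve)[0]
    candidates = [corpus[h["corpus_id"]] for h in hits]

    # Stage 2: cross-encoder reranking (only k_retrieve forward passes, tens of ms).
    scores = cross_encoder.predict([(query, doc) for doc in candidates])
    reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return reranked[:k_final]  # hand these to the generator
```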



6. How to Make Reranking Production-Ready

Reranking doesn’t have to kill your latency budget. Modern systems use several tricks:

  • Distillation: compress 12-layer cross-encoders into 6-layer or smaller students (MiniLM, TinyBERT)
  • Listwise scoring: evaluate multiple docs per forward pass
  • Late interaction (ColBERT): precompute document token embeddings offline, leaving only a lightweight MaxSim step at query time (sketched below)
  • Hybrid cascades: light rerankers (BM25+DistilBERT) first, heavy rerankers last

Each method keeps the benefits of cross-encoders while optimizing for throughput.
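To make the late-interaction bullet concrete, here is the core of ColBERT-style scoring reduced to a few lines of PyTorch (a simplified sketch, not the actual ColBERT implementation). Document token embeddings are precomputed offline; at query time each query token takes its maximum similarity over the document’s tokens, so the interaction step is a matrix multiply rather than a transformer forward pass:

```python
import torch
import torch.nn.functional as F


def maxsim_score(query_tokens: torch.Tensor, doc_tokens: torch.Tensor) -> torch.Tensor:
    """ColBERT-style late interaction (MaxSim).

    query_tokens: [q_len, dim], encoded online, once per query
    doc_tokens:   [d_len, dim], precomputed offline, once per document
    Both are assumed L2-normalized, so the matmul yields cosine similarities.
    """
    token_sims = query_tokens @ doc_tokens.T      # [q_len, d_len]
    return token_sims.max(dim=1).values.sum()     # best doc token per query token, summed


# Toy usage with random vectors (a real system uses a trained token encoder).
q = F.normalize(torch.randn(8, 128), dim=-1)
d = F.normalize(torch.randn(200, 128), dim=-1)
print(maxsim_score(q, d))
```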



7. Key Takeaways

  • Bi-encoders: Fast but shallow — they don’t read, they match patterns.
  • Cross-encoders: Slow but deep — they evaluate true contextual relevance.
  • Reranking bridges recall and precision.
  • A two-stage pipeline (retrieve → rerank → generate) is non-negotiable for reliable RAG.

Without rerankers, your system retrieves similar documents — not useful ones.



In short:
Rerankers fix the two-tower limitation of embeddings.
They give your system the missing ingredient: understanding.

They’re not an optional enhancement; they’re the precision multiplier that makes RAG actually work.