Why Rerankers Matter in RAG: Fixing the Two-Tower Blind Spot

2025-10-04

Retrieval-Augmented Generation (RAG) combines retrieval and generation to produce grounded, factual answers.
Yet many systems plateau in quality. You improve embeddings, expand the corpus, tweak thresholds—still no gain.
The problem isn’t recall. It’s ranking.
The hidden bottleneck lies in the two-tower architecture.



1. The Two-Tower Bottleneck

Most retrieval systems use bi-encoders (dual encoders):

score(q, d) = dot(E_q(q), E_d(d))

  • One tower encodes the query
  • One tower encodes the document
  • Their dot product defines similarity

This is fast and scalable—you can pre-encode millions of docs and search them in milliseconds.
But it’s also blind: the model never sees query and document together.

No token-level interaction.
No understanding of whether one actually answers the other.
You’re ranking without reading.
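A minimal sketch of the two-tower flow, assuming the sentence-transformers library and the all-MiniLM-L6-v2 checkpoint (my choice for illustration; any bi-encoder behaves the same way). The point to notice is the offline/online split: documents are encoded once ahead of time, and each query costs one encoding plus dot products.

```python
from sentence_transformers import SentenceTransformer, util

# One shared encoder stands in for both towers; some systems train
# separate query and document encoders.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

corpus = [
    "RAG combines a retriever with a generator.",
    "Bi-encoders embed queries and documents independently.",
    "Cross-encoders read the query and document together.",
]

# Offline: encode every document once and store the vectors.
# This is the step that scales to millions of documents.
doc_vectors = encoder.encode(corpus, convert_to_tensor=True)

# Online: encode only the query, then rank by dot product,
# i.e. score(q, d) = dot(E_q(q), E_d(d)).
query_vector = encoder.encode("How does a bi-encoder score documents?", convert_to_tensor=True)
scores = util.dot_score(query_vector, doc_vectors)[0]
for doc, score in sorted(zip(corpus, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {doc}")
```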



2. Semantic ≠ Relevant

Bi-encoders measure semantic similarity, not relevance.

| Query | Document | Cosine Similarity | True Relevance |
|---|---|---|---|
| “How to prevent heart attacks?” | “Heart attacks kill millions yearly.” | High | Low |
| “How to prevent heart attacks?” | “Regular exercise and diet reduce heart attack risk.” | Moderate | High |

Both share terms like “heart attack,” so the embedding space pulls them close.
But only the second is relevant.
That’s the failure mode of pure semantic search.
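You can reproduce this failure mode in a few lines. A sketch using the same bi-encoder as above (exact scores depend on the model, but topically similar non-answers routinely land close to, or above, the real answer):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "How to prevent heart attacks?"
docs = [
    "Heart attacks kill millions yearly.",                  # same topic, not an answer
    "Regular exercise and diet reduce heart attack risk.",  # the actual answer
]

q_emb = model.encode(query, convert_to_tensor=True)
d_emb = model.encode(docs, convert_to_tensor=True)

# Cosine similarity measures "talks about the same thing",
# not "answers the question".
for doc, score in zip(docs, util.cos_sim(q_emb, d_emb)[0].tolist()):
    print(f"{score:.3f}  {doc}")
```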



3. The Retrieval–Reranking Gap

Let’s put rough numbers on it.

  • Retrieval@100 (bi-encoder): ~85% recall
  • Top-10 from those 100: ~60% precision
  • After reranking top-100: ~85% precision

You give up roughly 25 points of precision when you skip reranking.
That’s the gap between a system that “sounds right” and one that is right.
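To measure that gap on your own data, both numbers are easy to compute from a labeled query set. A minimal sketch with hypothetical helper names (recall_at_k and precision_at_k are not from the post), given a ranked list of document IDs and the set of IDs judged relevant:

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant documents that appear in the top-k."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / max(len(relevant_ids), 1)


def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k that is actually relevant."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / k


# Compare precision_at_k(bi_encoder_ranking, relevant, 10) against
# precision_at_k(reranked_ranking, relevant, 10) on the same retrieved
# top-100 to see exactly how much precision reranking buys you.
```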



4. Why Cross-Encoders Fix It

A cross-encoder doesn’t treat the query and document separately.
It encodes them jointly in a single forward pass:

score(q, d) = fθ([q, SEP, d])

That single joint pass changes everything: attention now flows across query and document tokens.
The model now sees:

  1. Word-level query–document interactions
  2. Whether the document actually answers the question
  3. Negations, entailments, and contextual clues

It’s not comparing vectors anymore — it’s reading.
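Here is the heart-attack example from section 2, scored jointly. A sketch assuming sentence-transformers and the cross-encoder/ms-marco-MiniLM-L-6-v2 checkpoint (a commonly used public reranker, not something the post prescribes); any cross-encoder exposes the same pair-scoring interface:

```python
from sentence_transformers import CrossEncoder

# The model receives query and document as one concatenated sequence,
# so attention flows across both: score(q, d) = f_theta([q, SEP, d]).
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How to prevent heart attacks?"
docs = [
    "Heart attacks kill millions yearly.",
    "Regular exercise and diet reduce heart attack risk.",
]

scores = reranker.predict([(query, doc) for doc in docs])
for doc, score in sorted(zip(docs, scores), key=lambda x: -x[1]):
    print(f"{score:.3f}  {doc}")
```

Because the reranker reads each pair in full, it can separate “mentions heart attacks” from “tells you how to prevent them.”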



5. The Computational Cost

So why not just use cross-encoders everywhere? Because they’re expensive: every query-document pair needs its own full transformer forward pass, so scoring n candidates means n forward passes per query.

For 100 retrieved documents:

100 docs × 512 tokens × 12 layers = 614,400 token-layer computations per query

They can’t be precomputed, since encoding depends on the query itself.
That’s why rerankers run after retrieval — not before.

| Stage | Function | Scale | Cost | Latency |
|---|---|---|---|---|
| 1. Retrieval (bi-encoder) | 10M → top-100 | Huge | Cheap | ~1 ms |
| 2. Reranking (cross-encoder) | 100 → top-10 | Small | Costly | ~50 ms |

Skip reranking → fast but wrong
Rerank everything → accurate but bankrupt
Smart pipeline → both fast and right.
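Wired together, the two stages look roughly like this. A sketch of the retrieve-then-rerank pipeline, again using sentence-transformers; in production the brute-force semantic_search call would be replaced by an ANN index (FAISS, HNSW, a vector database), but the shape of the pipeline stays the same:

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")


def index_corpus(corpus):
    # Offline: pre-encode the whole corpus once.
    return bi_encoder.encode(corpus, convert_to_tensor=True)


def retrieve_then_rerank(query, corpus, corpus_vectors, k_retrieve=100, k_final=10):
    # Stage 1: bi-encoder retrieval (huge scale, roughly milliseconds).
    q_vec = bi_encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(q_vec, corpus_vectors, top_k=k_retrieve)[0]
    candidates = [corpus[h["corpus_id"]] for h in hits]

    # Stage 2: cross-encoder reranking (only k_retrieve forward passes, tens of ms).
    scores = cross_encoder.predict([(query, doc) for doc in candidates])
    reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return reranked[:k_final]  # hand these to the generator
```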



6. How to Make Reranking Production-Ready

Reranking doesn’t have to kill your latency budget. Modern systems use several tricks:

  • Distillation: compress 12-layer cross-encoders into 6-layer or smaller students (MiniLM, TinyBERT)
  • Listwise scoring: evaluate multiple docs per forward pass
  • Late interaction (ColBERT): precompute document token embeddings offline, leaving only a lightweight MaxSim step at query time (sketched below)
  • Hybrid cascades: light rerankers (BM25+DistilBERT) first, heavy rerankers last

Each method keeps the benefits of cross-encoders while optimizing for throughput.
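To make the late-interaction bullet concrete, here is the core of ColBERT-style scoring reduced to a few lines of PyTorch (a simplified sketch, not the actual ColBERT implementation). Document token embeddings are precomputed offline; at query time each query token takes its maximum similarity over the document’s tokens, so the interaction step is a matrix multiply rather than a transformer forward pass:

```python
import torch
import torch.nn.functional as F


def maxsim_score(query_tokens: torch.Tensor, doc_tokens: torch.Tensor) -> torch.Tensor:
    """ColBERT-style late interaction (MaxSim).

    query_tokens: [q_len, dim], encoded online, once per query
    doc_tokens:   [d_len, dim], precomputed offline, once per document
    Both are assumed L2-normalized, so the matmul yields cosine similarities.
    """
    token_sims = query_tokens @ doc_tokens.T      # [q_len, d_len]
    return token_sims.max(dim=1).values.sum()     # best doc token per query token, summed


# Toy usage with random vectors (a real system uses a trained token encoder).
q = F.normalize(torch.randn(8, 128), dim=-1)
d = F.normalize(torch.randn(200, 128), dim=-1)
print(maxsim_score(q, d))
```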



7. Key Takeaways

  • Bi-encoders: Fast but shallow — they don’t read, they match patterns.
  • Cross-encoders: Slow but deep — they evaluate true contextual relevance.
  • Reranking bridges recall and precision.
  • A two-stage pipeline (retrieve → rerank → generate) is non-negotiable for reliable RAG.

Without rerankers, your system retrieves similar documents — not useful ones.



In short:
Rerankers fix the two-tower limitation of embeddings.
They give your system the missing ingredient: understanding.

They’re not an optional enhancement; they’re the precision multiplier that makes RAG actually work.