Limits of Single-Vector Embeddings & Why BM25 Still Matters
Paper: arXiv:2508.21038
This post introduces each building block (BoW, TF/IDF, BM25, embeddings, rerankers) before diving into the paper On the Theoretical Limitations of Embedding-Based Retrieval (Weller et al., 2025).
By the end, you’ll see why BM25 still matters, when dense embeddings fail, and how hybrid systems win.
1) Bag-of-Words (BoW)
The simplest representation: just count words, ignore order and grammar.
Example
- Doc1: "red apple fresh apple"
  BoW: red=1, apple=2, fresh=1
- Doc2: "green apple and pear"
  BoW: green=1, apple=1, and=1, pear=1
Pros: simple, efficient, exact term matching.
Cons: no order, no synonyms, no semantics.
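In code, a bag-of-words is just a word-count map. A minimal sketch using Python's standard library (tokenization here is naive whitespace splitting; real systems also lowercase, strip punctuation, etc.):

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    # Count word occurrences, ignoring order and grammar.
    return Counter(text.lower().split())

bag_of_words("red apple fresh apple")
# Counter({'apple': 2, 'red': 1, 'fresh': 1})
```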
2) TF, DF, IDF
- Term Frequency (TF): number of times a word appears in a document
- Document Frequency (DF): number of documents that contain that word
- Inverse Document Frequency (IDF): rare words get higher weight
 
IDF formula (common variant):
IDF(t) = log( (N - df(t) + 0.5) / (df(t) + 0.5) + 1 )
where N = total number of documents, df(t) = number of docs containing term t
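This variant translates directly to code (the helper name `idf` is my own):

```python
import math

def idf(df_t: int, n_docs: int) -> float:
    # BM25-style IDF: the fewer documents a term appears in,
    # the higher its weight.
    return math.log((n_docs - df_t + 0.5) / (df_t + 0.5) + 1)

# With 1000 documents: a rare term gets far more weight than a common one.
idf(5, 1000)    # rare term  -> high weight
idf(900, 1000)  # common term -> near zero
```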
3) BM25
BM25 combines TF, IDF, and length normalization.
BM25 scoring formula:
score(q, d) = sum over t in q of [ IDF(t) * ( TF(t,d) * (k1 + 1) ) / ( TF(t,d) + k1 * (1 - b + b * |d| / avgdl) ) ]
- k1 controls TF saturation (usually 1.2–2.0)
- b controls length normalization (0 = no normalization, 1 = full normalization)
- |d| = document length, avgdl = average document length
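The formula above can be sketched as a minimal, self-contained scorer (variable names are my own; a production system would use an inverted index rather than rescanning the corpus per term):

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against a query with BM25.
    corpus is a list of tokenized documents (lists of terms)."""
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n
    score = 0.0
    for t in query_terms:
        tf = doc_terms.count(t)
        if tf == 0:
            continue
        df = sum(1 for d in corpus if t in d)
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
        # TF saturation (k1) and length normalization (b).
        norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc_terms) / avgdl))
        score += idf * norm
    return score
```

With the two example documents from section 1, the query "apple" scores Doc1 (tf=2) above Doc2 (tf=1), but with diminishing returns from the extra occurrence thanks to k1 saturation.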
 
4) Single-Vector Embeddings
- Each doc and query mapped to one dense vector
- Ranked by cosine similarity or dot product
- Strengths: captures synonyms and semantic similarity
- Limits: cannot represent complex AND/OR combinations with one vector
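A minimal sketch of similarity-based ranking (the 3-d vectors are toy values, not real model outputs; actual embeddings have hundreds or thousands of dimensions):

```python
import math

def cosine(u, v):
    # Cosine similarity: dot product divided by the vector norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

query = [0.9, 0.1, 0.0]
doc_a = [0.8, 0.2, 0.1]  # points in nearly the same direction as the query
doc_b = [0.0, 0.1, 0.9]  # nearly orthogonal to the query
cosine(query, doc_a) > cosine(query, doc_b)  # True: doc_a ranks first
```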
 
5) The LIMIT Paper (Weller et al., 2025)
Main result: for fixed embedding dimension d, some top-k relevance patterns cannot be represented.
- Even k=2 (returning a specific pair) can fail
- No amount of training or data fixes this structural limitation
 
LIMIT dataset:
- Queries encode combinatorial constraints (like "A and B together")
- State-of-the-art embeddings achieve less than 20% recall@100
- Even optimized “free embeddings” fail unless d is very large
 
6) Why BM25 Still Wins
BM25 naturally adds scores across query terms.
- Query "A and B" → document with both gets boosted twice
- Handles AND/OR constraints directly
- In LIMIT, BM25 clearly outperforms embeddings
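A toy sketch of why additive scoring handles conjunctions (the scorer and terms here are illustrative, not real BM25 weights):

```python
def additive_score(query_terms, doc_terms, weight=1.0):
    # Lexical scorers like BM25 sum independent per-term contributions,
    # so a document matching both query terms is boosted twice.
    return sum(weight for t in query_terms if t in doc_terms)

q = ["quokka", "basalt"]
additive_score(q, ["quokka", "photo"])            # matches one term -> 1.0
additive_score(q, ["quokka", "basalt", "rocks"])  # matches both    -> 2.0
```

A single query vector has no analogous mechanism: it must encode the conjunction as one point in space, which is exactly where the LIMIT construction breaks it.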
 
7) Component Guide
| Component | Strengths | Weaknesses | 
|---|---|---|
| BoW | exact term match | no semantics | 
| BM25 | boolean queries, rare terms, strong baseline | no synonyms | 
| Single-Vector Embedding | semantics, synonyms | weak on AND/OR | 
| Multi-Vector (ColBERT) | more expressive | higher cost | 
| Reranker | high precision | expensive | 
Best practice: hybrid → BM25 + embeddings + reranker
8) Hybrid Retrieval Recipe
- Retrieve candidates with BM25 and dense embeddings in parallel
- Merge the candidate sets
- Fuse scores (e.g. reciprocal rank fusion)
- Rerank with a cross-encoder
- Use the top passages in the LLM prompt (RAG)
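The fusion step can be sketched with reciprocal rank fusion (the doc ids are illustrative; k=60 is a commonly used default):

```python
def rrf(rankings, k=60):
    """Reciprocal rank fusion: combine several ranked lists of doc ids.
    Each list is ordered best-first; a doc gains 1/(k + rank) per list."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d3", "d1", "d7"]   # lexical ranking
dense_hits = ["d1", "d5", "d3"]  # embedding ranking
rrf([bm25_hits, dense_hits])
# ['d1', 'd3', 'd5', 'd7'] — docs ranked well by both lists rise to the top
```

RRF only needs ranks, not raw scores, so it sidesteps the problem of BM25 and cosine similarity living on incompatible scales.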
 
9) Takeaways
- Single-vector embeddings have theoretical limits
- BM25 is still crucial for combinatorial retrieval
- Hybrid retrieval plus reranking is the most robust solution