Limits of Single-Vector Embeddings & Why BM25 Still Matters
Paper: arXiv:2508.21038
This post introduces each building block (BoW, TF/IDF, BM25, embeddings, rerankers) before diving into the paper On the Theoretical Limitations of Embedding-Based Retrieval (Weller et al., 2025).
By the end, you’ll see why BM25 still matters, when dense embeddings fail, and how hybrid systems win.
1) Bag-of-Words (BoW)
The simplest representation: just count words, ignore order and grammar.
Example
- Doc1: "red apple fresh apple"
  BoW: red=1, apple=2, fresh=1
- Doc2: "green apple and pear"
  BoW: green=1, apple=1, and=1, pear=1
Pros: simple, efficient, exact term matching.
Cons: no order, no synonyms, no semantics.
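In code, a bag-of-words is just a word-count map. A minimal sketch using Python's standard library (tokenization here is naive whitespace splitting; real systems also lowercase, strip punctuation, etc.):

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    # Count word occurrences, ignoring order and grammar.
    return Counter(text.lower().split())

bag_of_words("red apple fresh apple")
# Counter({'apple': 2, 'red': 1, 'fresh': 1})
```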
2) TF, DF, IDF
- Term Frequency (TF): number of times a word appears in a document
- Document Frequency (DF): number of documents that contain that word
- Inverse Document Frequency (IDF): rare words get higher weight
 
IDF formula (common variant):
IDF(t) = log( (N - df(t) + 0.5) / (df(t) + 0.5) + 1 )
where N = total number of documents, df(t) = number of docs containing term t
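This variant translates directly to code (the helper name `idf` is my own):

```python
import math

def idf(df_t: int, n_docs: int) -> float:
    # BM25-style IDF: the fewer documents a term appears in,
    # the higher its weight.
    return math.log((n_docs - df_t + 0.5) / (df_t + 0.5) + 1)

# With 1000 documents: a rare term gets far more weight than a common one.
idf(5, 1000)    # rare term  -> high weight
idf(900, 1000)  # common term -> near zero
```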
3) BM25
BM25 combines TF, IDF, and length normalization.
BM25 scoring formula:
score(q, d) = sum over t in q of [ IDF(t) * ( TF(t,d) * (k1 + 1) ) / ( TF(t,d) + k1 * (1 - b + b * |d| / avgdl) ) ]
- k1 controls TF saturation (usually 1.2–2.0)
- b controls length normalization (0 = no normalization, 1 = full normalization)
- |d| = document length, avgdl = average document length
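The formula above can be sketched as a minimal, self-contained scorer (variable names are my own; a production system would use an inverted index rather than rescanning the corpus per term):

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against a query with BM25.
    corpus is a list of tokenized documents (lists of terms)."""
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n
    score = 0.0
    for t in query_terms:
        tf = doc_terms.count(t)
        if tf == 0:
            continue
        df = sum(1 for d in corpus if t in d)
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
        # TF saturation (k1) and length normalization (b).
        norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc_terms) / avgdl))
        score += idf * norm
    return score
```

With the two example documents from section 1, the query "apple" scores Doc1 (tf=2) above Doc2 (tf=1), but with diminishing returns from the extra occurrence thanks to k1 saturation.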
 
4) Single-Vector Embeddings
- Each doc and query mapped to one dense vector
- Ranked by cosine similarity or dot product
- Strengths: captures synonyms and semantic similarity
- Limits: cannot represent complex AND/OR combinations with one vector
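A minimal sketch of similarity-based ranking (the 3-d vectors are toy values, not real model outputs; actual embeddings have hundreds or thousands of dimensions):

```python
import math

def cosine(u, v):
    # Cosine similarity: dot product divided by the vector norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

query = [0.9, 0.1, 0.0]
doc_a = [0.8, 0.2, 0.1]  # points in nearly the same direction as the query
doc_b = [0.0, 0.1, 0.9]  # nearly orthogonal to the query
cosine(query, doc_a) > cosine(query, doc_b)  # True: doc_a ranks first
```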
 
5) The LIMIT Paper (Weller et al., 2025)
Main result: for fixed embedding dimension d, some top-k relevance patterns cannot be represented.
- Even k=2 (returning a specific pair) can fail
- No amount of training or data fixes this structural limitation
 
LIMIT dataset:
- Queries encode combinatorial constraints (like "A and B together")
- State-of-the-art embeddings achieve less than 20% recall@100
- Even optimized “free embeddings” fail unless d is very large
 
6) Why BM25 Still Wins
BM25 naturally adds scores across query terms.
- Query "A and B" → document with both gets boosted twice
- Handles AND/OR constraints directly
- In LIMIT, BM25 clearly outperforms embeddings
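A toy sketch of why additive scoring handles conjunctions (the scorer and terms here are illustrative, not real BM25 weights):

```python
def additive_score(query_terms, doc_terms, weight=1.0):
    # Lexical scorers like BM25 sum independent per-term contributions,
    # so a document matching both query terms is boosted twice.
    return sum(weight for t in query_terms if t in doc_terms)

q = ["quokka", "basalt"]
additive_score(q, ["quokka", "photo"])            # matches one term -> 1.0
additive_score(q, ["quokka", "basalt", "rocks"])  # matches both    -> 2.0
```

A single query vector has no analogous mechanism: it must encode the conjunction as one point in space, which is exactly where the LIMIT construction breaks it.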
 
7) Component Guide
| Component | Strengths | Weaknesses | 
|---|---|---|
| BoW | exact term match | no semantics | 
| BM25 | boolean queries, rare terms, strong baseline | no synonyms | 
| Single-Vector Embedding | semantics, synonyms | weak on AND/OR | 
| Multi-Vector (ColBERT) | more expressive | higher cost | 
| Reranker | high precision | expensive | 
Best practice: hybrid → BM25 + embeddings + reranker
8) Hybrid Retrieval Recipe
- Retrieve candidates with BM25 and dense embeddings in parallel
- Merge the candidate sets
- Fuse scores (e.g. reciprocal rank fusion)
- Rerank with a cross-encoder
- Use the top passages in the LLM prompt (RAG)
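The fusion step can be sketched with reciprocal rank fusion (the doc ids are illustrative; k=60 is a commonly used default):

```python
def rrf(rankings, k=60):
    """Reciprocal rank fusion: combine several ranked lists of doc ids.
    Each list is ordered best-first; a doc gains 1/(k + rank) per list."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d3", "d1", "d7"]   # lexical ranking
dense_hits = ["d1", "d5", "d3"]  # embedding ranking
rrf([bm25_hits, dense_hits])
# ['d1', 'd3', 'd5', 'd7'] — docs ranked well by both lists rise to the top
```

RRF only needs ranks, not raw scores, so it sidesteps the problem of BM25 and cosine similarity living on incompatible scales.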
 
9) Takeaways
- Single-vector embeddings have theoretical limits
- BM25 is still crucial for combinatorial retrieval
- Hybrid retrieval plus reranking is the most robust solution