Limits of Single-Vector Embeddings & Why BM25 Still Matters

2025-09-07


Paper: arXiv:2508.21038

This post introduces each building block (BoW, TF/IDF, BM25, embeddings, rerankers) before diving into the paper On the Theoretical Limitations of Embedding-Based Retrieval (Weller et al., 2025).
By the end, you’ll see why BM25 still matters, when dense embeddings fail, and how hybrid systems win.



1) Bag-of-Words (BoW)

The simplest representation: just count words, ignore order and grammar.

Example

  • Doc1: "red apple fresh apple"
    BoW: red=1, apple=2, fresh=1

  • Doc2: "green apple and pear"
    BoW: green=1, apple=1, and=1, pear=1

Pros: simple, efficient, exact term matching.
Cons: no order, no synonyms, no semantics.



2) TF, DF, IDF

  • Term Frequency (TF): number of times a word appears in a document
  • Document Frequency (DF): number of documents that contain that word
  • Inverse Document Frequency (IDF): rare words get higher weight

IDF formula (common variant):
IDF(t) = log( (N - df(t) + 0.5) / (df(t) + 0.5) + 1 )
where N = total number of documents, df(t) = number of docs containing term t
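The formula translates directly to code. The corpus size and document frequencies below are made-up numbers, just to show that rare terms get a much larger weight than common ones:

```python
import math

def idf(df: int, n: int) -> float:
    # BM25-style IDF; the +1 inside the log keeps the value positive
    # even when a term appears in more than half the documents.
    return math.log((n - df + 0.5) / (df + 0.5) + 1)

# Hypothetical corpus of 1000 documents
print(idf(df=5, n=1000))    # rare term -> high weight (about 5.2)
print(idf(df=900, n=1000))  # common term -> low weight (about 0.11)
```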



3) BM25

BM25 combines TF, IDF, and length normalization.

BM25 scoring formula:
score(q, d) = sum over t in q of [ IDF(t) * ( TF(t,d) * (k1 + 1) ) / ( TF(t,d) + k1 * (1 - b + b * |d| / avgdl) ) ]

  • k1 controls TF saturation (usually 1.2–2.0)
  • b controls length normalization (0 = no normalization, 1 = full normalization)
  • |d| = document length, avgdl = average document length
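Putting the three pieces together, here is a small self-contained sketch of the scoring formula above. The toy corpus and the k1/b values are illustrative, not tuned:

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """BM25 score of one tokenized doc for a query, over a small corpus."""
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n
    score = 0.0
    for t in query_terms:
        df = sum(1 for d in corpus if t in d)          # document frequency
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
        tf = doc_terms.count(t)                         # term frequency
        denom = tf + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * tf * (k1 + 1) / denom
    return score

corpus = [
    "red apple fresh apple".split(),
    "green apple and pear".split(),
    "a ripe pear".split(),
]
query = ["apple", "pear"]
# The doc containing both query terms outscores the doc with only "apple"
print(bm25_score(query, corpus[1], corpus))
print(bm25_score(query, corpus[0], corpus))
```

Because the score is a sum over query terms, a document matching both terms collects two positive contributions, which is exactly the property section 6 relies on.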


4) Single-Vector Embeddings

  • Each doc and query is mapped to a single dense vector
  • Candidates are ranked by cosine similarity or dot product
  • Strengths: captures synonyms and semantic similarity
  • Limits: a single vector cannot represent arbitrary AND/OR combinations of query constraints
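Ranking with a single vector reduces to one similarity score per document. A minimal sketch, with hand-made 3-dimensional vectors standing in for real model outputs (which have hundreds of dimensions):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical embeddings: query is "close" to doc_a in vector space
query = [0.9, 0.1, 0.0]
doc_a = [0.8, 0.2, 0.1]
doc_b = [0.0, 0.1, 0.9]

print(cosine(query, doc_a) > cosine(query, doc_b))  # True
```

The key point: whatever a query means, it is compressed into one vector, and every document is scored by a single number against it.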


5) The LIMIT Paper (Weller et al., 2025)

Main result: for fixed embedding dimension d, some top-k relevance patterns cannot be represented.

  • Even k=2 (returning a specific pair) can fail
  • No amount of training or data fixes this structural limitation

LIMIT dataset:

  • Queries encode combinatorial constraints (like "A and B together")
  • State-of-the-art embeddings achieve less than 20% recall@100
  • Even optimized “free embeddings” fail unless d is very large
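The structural limitation is easiest to see in an extreme toy case, d = 1 (a deliberately artificial setup, not the paper's construction). With dot-product scoring on a line, scores are monotone in the embedding value, so the middle document can never be excluded from the top 2:

```python
# Three docs embedded in one dimension (values are hypothetical)
docs = {"d1": -1.0, "d2": 0.0, "d3": 1.0}

# Sweep many query scalars and record which top-2 sets are achievable
achievable = set()
for q in [x / 10 for x in range(-50, 51) if x != 0]:
    scores = {name: q * v for name, v in docs.items()}
    top2 = frozenset(sorted(scores, key=scores.get, reverse=True)[:2])
    achievable.add(top2)

print(sorted(sorted(s) for s in achievable))
# -> [['d1', 'd2'], ['d2', 'd3']]
# The pair {d1, d3} never appears: no query can "skip" the middle doc.
```

The paper generalizes this intuition: for any fixed d, there exist top-k relevance patterns (even with k = 2) that no assignment of query and document vectors can realize.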


6) Why BM25 Still Wins

BM25 naturally adds scores across query terms.

  • Query "A and B" → a document containing both terms accumulates score from each, ranking above documents matching only one
  • Handles AND/OR constraints directly
  • In LIMIT, BM25 clearly outperforms embeddings
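The additive behavior can be shown with a stripped-down toy scorer (a flat per-term weight of 1.0 stands in for the full IDF/TF factor):

```python
# Documents as term sets; query asks for A AND B
docs = {
    "doc_ab": {"A", "B"},
    "doc_a":  {"A"},
    "doc_b":  {"B"},
}
query = ["A", "B"]

# Additive scoring: each matched query term contributes independently
scores = {name: sum(1.0 for t in query if t in terms)
          for name, terms in docs.items()}
best = max(scores, key=scores.get)
print(best, scores)  # doc_ab wins with score 2.0
```

A single-vector model has to squeeze "A AND B" into one point in space; the additive scorer simply sums evidence per term, which is why the combinatorial queries in LIMIT favor BM25.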


7) Component Guide

Component                 | Strengths                                    | Weaknesses
BoW                       | exact term match                             | no semantics
BM25                      | boolean queries, rare terms, strong baseline | no synonyms
Single-Vector Embedding   | semantics, synonyms                          | weak on AND/OR
Multi-Vector (ColBERT)    | more expressive                              | higher cost
Reranker                  | high precision                               | expensive

Best practice: hybrid → BM25 + embeddings + reranker



8) Hybrid Retrieval Recipe

  1. Recall with BM25 and embeddings in parallel
  2. Merge candidate sets
  3. Fuse scores (e.g. reciprocal rank fusion)
  4. Rerank with cross-encoder
  5. Use top passages in LLM (RAG)
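Step 3 is the only non-obvious part; reciprocal rank fusion (RRF) needs only the ranks, not the raw scores. A minimal sketch, with hypothetical candidate lists standing in for real BM25 and embedding output (k=60 is the commonly used constant from the original RRF paper):

```python
from collections import defaultdict

def rrf(rankings, k=60):
    """Reciprocal rank fusion over several ranked lists of doc ids."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

# Hypothetical candidate lists from the two retrievers
bm25_hits = ["d3", "d1", "d2"]
dense_hits = ["d1", "d4", "d3"]

print(rrf([bm25_hits, dense_hits]))
# -> ['d1', 'd3', 'd4', 'd2']: docs ranked well by both lists rise to the top
```

Because RRF ignores score magnitudes, it sidesteps the problem that BM25 scores and cosine similarities live on incompatible scales.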


9) Takeaways

  • Single-vector embeddings have theoretical limits
  • BM25 is still crucial for combinatorial retrieval
  • Hybrid retrieval plus reranking is the most robust solution