Designing a Multilingual Semantic Search Architecture
Published at: MDP Group Blog
Introduction
Most enterprise search still relies on keyword matching—which often misses intent, especially in multilingual environments. At MDP Group, we built a semantic retrieval architecture that understands meaning, not just words, enabling smarter, faster, and more accurate search across domains.
In this article, we review the limits of keyword search, introduce a three-stage semantic pipeline, and share key insights from real-world experiments.
Table of Contents
- From Keyword Search to Semantic Understanding
- What Is Semantic Retrieval?
- Understanding the Components
- Three-Stage Architecture
- Experiments
- Key Findings
- Conclusion
- Final Thoughts
From Keyword Search to Semantic Understanding
Keyword search matches strings, not meaning. In multilingual or terminology-heavy corpora, this causes intent mismatches and missed results:
- “Sharing new product ideas” vs. “submitting a new solution proposal” express the same intent with zero shared tokens.
- “Vendor evaluation” and “supplier comparison” belong to the same workflow yet won’t match unless phrased identically.
Users then miss critical information, create duplicates, or give up. Semantic retrieval shifts focus from what was said to what was meant.
What Is Semantic Retrieval?
Semantic retrieval encodes queries and documents into dense vectors using modern language models. These vectors capture intent, enabling matches across paraphrases and languages.
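As a quick illustration, here is a minimal sketch (assuming the sentence-transformers library and one of the multilingual embedding models mentioned later in this article) showing that two phrasings with zero shared tokens still land close together in vector space:

```python
# Minimal sketch: paraphrases with no shared tokens still score high in cosine similarity.
# Assumes sentence-transformers; the model name is illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/multilingual-e5-large")

a = model.encode("Sharing new product ideas", normalize_embeddings=True)
b = model.encode("Submitting a new solution proposal", normalize_embeddings=True)

print(util.cos_sim(a, b))  # high similarity despite zero token overlap
```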
We apply this to:
- Suggestion systems (e.g., “Has this been proposed before?”)
- Procurement (grouping similar supplier offers)
- Knowledge bases (finding answers no matter how they’re asked)
To deliver both flexibility and speed, we use a modular pipeline with three core components.
Understanding the Components
Retriever
Responsible for quickly finding semantically relevant candidates from a large corpus.
How it works
- Query encoding: A multilingual sentence-embedding model converts the query into a high-dimensional vector.
- Vector index lookup: A vector DB (e.g., FAISS, Pinecone) compares the query vector against precomputed document vectors.
- Similarity scoring: Candidates are scored via cosine similarity or inner product.
- Top-k retrieval: The top-k most similar candidates are returned.
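A minimal sketch of this retrieval step, assuming sentence-transformers and a local FAISS flat inner-product index (the documents and model name here are illustrative; in production the index would be prebuilt and persisted):

```python
# Retriever sketch: encode documents once, index them, answer queries with top-k search.
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large")

docs = [
    "Vendor evaluation checklist",
    "Supplier comparison workflow",
    "Travel expense policy",
]
doc_vecs = model.encode(docs, normalize_embeddings=True)

index = faiss.IndexFlatIP(doc_vecs.shape[1])  # exact inner-product search
index.add(doc_vecs)                           # equals cosine similarity on normalized vectors

def retrieve(query: str, k: int = 10):
    q_vec = model.encode([query], normalize_embeddings=True)
    scores, ids = index.search(q_vec, min(k, len(docs)))
    return [(docs[i], float(s)) for i, s in zip(ids[0], scores[0])]

print(retrieve("supplier comparison", k=2))
```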
Reranker
Refines the candidate set so the most contextually relevant results surface to the top—useful when nuance or domain language matters.
How it works
- Pair construction: Each (query, candidate) pair is formed from the top-k list.
- Contextual scoring: A cross-encoder (e.g., MiniLM/BERT) scores semantic alignment.
- Reordering: Candidates are re-ranked by these scores.
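A sketch of the reranking step, again assuming sentence-transformers. The cross-encoder checkpoint named below is a widely used MS MARCO MiniLM model and stands in for whichever multilingual reranker fits your corpus:

```python
# Reranker sketch: a cross-encoder reads query and candidate together,
# so it can weigh context and nuance that bi-encoder retrieval misses.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "submitting a new solution proposal"
candidates = [
    "Sharing new product ideas",
    "Quarterly sales report",
    "Supplier comparison workflow",
]

pairs = [(query, c) for c in candidates]        # pair construction
scores = reranker.predict(pairs)                # contextual scoring
reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)  # reordering
print(reranked)
```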
Judge LLM
Validates semantic equivalence between the query and top result—crucial for deduplication, legal/support search, or decision-making.
How it works
- Input prep: Pair the query with the best candidate.
- LLM evaluation: A prompted/fine-tuned LLM judges equivalence beyond surface similarity.
- Decision: Returns labels like “Equivalent”, “Related but not equivalent”, or “Not relevant”.
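A sketch of the Judge step. The article does not prescribe a specific provider, so the OpenAI client, model name, and prompt wording below are illustrative assumptions; the label set mirrors the one above:

```python
# Judge LLM sketch: prompt a model to label the (query, best candidate) pair.
# Provider, model, and prompt are assumptions, not the production setup.
from openai import OpenAI

client = OpenAI()  # assumes an API key in the environment

LABELS = ["Equivalent", "Related but not equivalent", "Not relevant"]

def judge(query: str, candidate: str) -> str:
    prompt = (
        "Judge whether the two texts are semantically equivalent, "
        "not merely similar on the surface.\n"
        f"Query: {query}\n"
        f"Candidate: {candidate}\n"
        f"Answer with exactly one of: {', '.join(LABELS)}."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

print(judge("Sharing new product ideas", "Submitting a new solution proposal"))
```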
Each layer adds precision and robustness across multilingual, high-variance enterprise data.
Three-Stage Architecture
To balance performance and accuracy, we use a three-stage pipeline:
- Retriever: Embedding-based nearest neighbors. Encode the query with a multilingual model (e.g., all-mpnet-base-v2, multilingual-e5-large, jina-embeddings-v3) and retrieve the top-k candidates (e.g., 10) via Pinecone/FAISS. This stage completes in milliseconds.
- Reranker: Contextual fine-grained scoring. Re-score the initial candidates with a lightweight cross-encoder and promote the truly relevant items.
- Judge LLM: Semantic equivalence check. Ask a Judge LLM: “Is this semantically equivalent, categorically related, or proposing a similar solution?”
This provides an extra layer of trust when paraphrases are common.
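Putting the stages together, a rough orchestration sketch could look like the following. It reuses the hypothetical retrieve, reranker, and judge helpers from the sketches above; the names are ours, not a published API:

```python
# End-to-end sketch of the three-stage flow.
def semantic_search(query: str, k: int = 10):
    # Stage 1: fast vector recall of top-k candidates.
    candidates = [doc for doc, _ in retrieve(query, k=k)]
    # Stage 2: cross-encoder rerank of the candidate set.
    scores = reranker.predict([(query, doc) for doc in candidates])
    best_doc = max(zip(candidates, scores), key=lambda x: x[1])[0]
    # Stage 3: Judge LLM verifies semantic equivalence of the best match.
    verdict = judge(query, best_doc)
    return best_doc, verdict

print(semantic_search("supplier comparison", k=3))
```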
Experiments
We compared several configurations on a single GPU with 16 GB of VRAM, using both local and API-based models.
| Configuration | Top-k | Latency (s) | Notes |
|---|---|---|---|
| Embedder + Reranker + Local Judge | 10 | 17.18 | Fully local |
| Embedder + Reranker + API Judge | 10 | 2.79 | API-based verification |
| Embedder + API Judge (No Reranker) | 10 | 1.78 | Fastest configuration |
| API Judge, Top-k = 5 | 5 | 2.97 | Minimal gain from k drop |
| API Judge, Top-k = 3 | 3 | 2.92 | Similar to k=5 |
Key Findings
- Judge dominates latency. Most time is spent in the Judge step.
- API Judge is ~6× faster than local for our setup.
- Removing the reranker can speed things up further (good for real-time), with acceptable accuracy trade-offs in many cases.
- Changing top-k (3/5/10) barely affects latency; Judge remains the bottleneck.
Conclusion
Keyword search no longer suffices in multilingual or semantically diverse environments. Semantic retrieval captures intent and improves discovery.
Our three-stage approach balances speed and accuracy:
- Retriever: fast semantic candidates (FAISS/Pinecone)
- Reranker: fine-grained relevance via a lightweight model
- Judge LLM: equivalence verification for trust
Practical recommendations
- Real-time: Embedder + Judge (API) without reranker for lowest latency.
- High-stakes/privacy-sensitive: Run all components locally for maximum control.
- Resource-constrained: Embedder + Judge (API) offers strong speed/quality trade-offs.
Final Thoughts
Semantic retrieval goes beyond matching words—it captures meaning. In today’s multilingual and dynamic environments, this shift is essential. By optimizing for intent, we unlock faster, smarter, and more reliable search experiences.
At MDP Group, we see semantic search as a cornerstone of intelligent systems.