Designing a Multilingual Semantic Search Architecture
Published at: MDP Group Blog
Introduction
Most enterprise search still relies on keyword matching—which often misses intent, especially in multilingual environments. At MDP Group, we built a semantic retrieval architecture that understands meaning, not just words, enabling smarter, faster, and more accurate search across domains.
In this article, we review the limits of keyword search, introduce a three-stage semantic pipeline, and share key insights from real-world experiments.
Table of Contents
- From Keyword Search to Semantic Understanding
- What Is Semantic Retrieval?
- Understanding the Components
- Three-Stage Architecture
- Experiments
- Key Findings
- Conclusion
- Final Thoughts
From Keyword Search to Semantic Understanding
Keyword search matches strings, not meaning. In multilingual or terminology-heavy corpora, this causes intent mismatches and missed results:
- “Sharing new product ideas” vs. “submitting a new solution proposal” express the same intent with zero shared tokens.
- “Vendor evaluation” and “supplier comparison” belong to the same workflow yet won’t match unless phrased identically.
Users then miss critical information, create duplicates, or give up. Semantic retrieval shifts focus from what was said to what was meant.
What Is Semantic Retrieval?
Semantic retrieval encodes queries and documents into dense vectors using modern language models. These vectors capture intent, enabling matches across paraphrases and languages.
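As a quick illustration, here is a minimal sketch (assuming the sentence-transformers library and one of the multilingual embedding models mentioned later in this article) showing that two phrasings with zero shared tokens still land close together in vector space:

```python
# Minimal sketch: paraphrases with no shared tokens still score high in cosine similarity.
# Assumes sentence-transformers; the model name is illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/multilingual-e5-large")

a = model.encode("Sharing new product ideas", normalize_embeddings=True)
b = model.encode("Submitting a new solution proposal", normalize_embeddings=True)

print(util.cos_sim(a, b))  # high similarity despite zero token overlap
```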
We apply this to:
- Suggestion systems (e.g., “Has this been proposed before?”)
- Procurement (grouping similar supplier offers)
- Knowledge bases (finding answers no matter how they’re asked)
To deliver both flexibility and speed, we use a modular pipeline with three core components.
Understanding the Components
Retriever
Responsible for quickly finding semantically relevant candidates from a large corpus.
How it works
- Query encoding: A multilingual sentence-embedding model converts the query into a high-dimensional vector.
- Vector index lookup: A vector DB (e.g., FAISS, Pinecone) compares the query vector against precomputed document vectors.
- Similarity scoring: Candidates are scored via cosine similarity or inner product.
- Top-k retrieval: The top-k most similar candidates are returned.
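A minimal sketch of this retrieval step, assuming sentence-transformers and a local FAISS flat inner-product index (the documents and model name here are illustrative; in production the index would be prebuilt and persisted):

```python
# Retriever sketch: encode documents once, index them, answer queries with top-k search.
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large")

docs = [
    "Vendor evaluation checklist",
    "Supplier comparison workflow",
    "Travel expense policy",
]
doc_vecs = model.encode(docs, normalize_embeddings=True)

index = faiss.IndexFlatIP(doc_vecs.shape[1])  # exact inner-product search
index.add(doc_vecs)                           # equals cosine similarity on normalized vectors

def retrieve(query: str, k: int = 10):
    q_vec = model.encode([query], normalize_embeddings=True)
    scores, ids = index.search(q_vec, min(k, len(docs)))
    return [(docs[i], float(s)) for i, s in zip(ids[0], scores[0])]

print(retrieve("supplier comparison", k=2))
```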
Reranker
Refines the candidate set so the most contextually relevant results surface to the top—useful when nuance or domain language matters.
How it works
- Pair construction: Each (query, candidate) pair is formed from the top-k list.
- Contextual scoring: A cross-encoder (e.g., MiniLM/BERT) scores semantic alignment.
- Reordering: Candidates are re-ranked by these scores.
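A sketch of the reranking step, again assuming sentence-transformers. The cross-encoder checkpoint named below is a widely used MS MARCO MiniLM model and stands in for whichever multilingual reranker fits your corpus:

```python
# Reranker sketch: a cross-encoder reads query and candidate together,
# so it can weigh context and nuance that bi-encoder retrieval misses.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "submitting a new solution proposal"
candidates = [
    "Sharing new product ideas",
    "Quarterly sales report",
    "Supplier comparison workflow",
]

pairs = [(query, c) for c in candidates]        # pair construction
scores = reranker.predict(pairs)                # contextual scoring
reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)  # reordering
print(reranked)
```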
Judge LLM
Validates semantic equivalence between the query and top result—crucial for deduplication, legal/support search, or decision-making.
How it works
- Input prep: Pair the query with the best candidate.
- LLM evaluation: A prompted/fine-tuned LLM judges equivalence beyond surface similarity.
- Decision: Returns labels like “Equivalent”, “Related but not equivalent”, or “Not relevant”.
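A sketch of the Judge step. The article does not prescribe a specific provider, so the OpenAI client, model name, and prompt wording below are illustrative assumptions; the label set mirrors the one above:

```python
# Judge LLM sketch: prompt a model to label the (query, best candidate) pair.
# Provider, model, and prompt are assumptions, not the production setup.
from openai import OpenAI

client = OpenAI()  # assumes an API key in the environment

LABELS = ["Equivalent", "Related but not equivalent", "Not relevant"]

def judge(query: str, candidate: str) -> str:
    prompt = (
        "Judge whether the two texts are semantically equivalent, "
        "not merely similar on the surface.\n"
        f"Query: {query}\n"
        f"Candidate: {candidate}\n"
        f"Answer with exactly one of: {', '.join(LABELS)}."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

print(judge("Sharing new product ideas", "Submitting a new solution proposal"))
```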
Each layer adds precision and robustness across multilingual, high-variance enterprise data.
Three-Stage Architecture
To balance performance and accuracy, we use a three-stage pipeline:
- Retriever: Embedding-based nearest neighbors. Encode the query with a multilingual model (e.g., all-mpnet-base-v2, multilingual-e5-large, jina-embeddings-v3) and retrieve the top-k candidates (e.g., 10) via Pinecone/FAISS. This stage completes in milliseconds.
- Reranker: Contextual fine-grained scoring. Re-score the initial candidates with a lightweight cross-encoder and promote the truly relevant items.
- Judge LLM: Semantic equivalence check. Ask a Judge LLM: “Is this semantically equivalent, categorically related, or proposing a similar solution?”
This provides an extra layer of trust when paraphrases are common.
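Putting the stages together, a rough orchestration sketch could look like the following. It reuses the hypothetical retrieve, reranker, and judge helpers from the sketches above; the names are ours, not a published API:

```python
# End-to-end sketch of the three-stage flow.
def semantic_search(query: str, k: int = 10):
    # Stage 1: fast vector recall of top-k candidates.
    candidates = [doc for doc, _ in retrieve(query, k=k)]
    # Stage 2: cross-encoder rerank of the candidate set.
    scores = reranker.predict([(query, doc) for doc in candidates])
    best_doc = max(zip(candidates, scores), key=lambda x: x[1])[0]
    # Stage 3: Judge LLM verifies semantic equivalence of the best match.
    verdict = judge(query, best_doc)
    return best_doc, verdict

print(semantic_search("supplier comparison", k=3))
```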
Experiments
We compared several configurations on a single GPU with 16 GB of VRAM, using both local and API-based models.
| Configuration | Top-k | Latency (s) | Notes |
|---|---|---|---|
| Embedder + Reranker + Local Judge | 10 | 17.18 | Fully local |
| Embedder + Reranker + API Judge | 10 | 2.79 | API-based verification |
| Embedder + API Judge (No Reranker) | 10 | 1.78 | Fastest configuration |
| API Judge, Top-k = 5 | 5 | 2.97 | Minimal gain from k drop |
| API Judge, Top-k = 3 | 3 | 2.92 | Similar to k=5 |
Key Findings
- Judge dominates latency. Most time is spent in the Judge step.
- API Judge is ~6× faster than local for our setup.
- Removing the reranker can speed things up further (good for real-time), with acceptable accuracy trade-offs in many cases.
- Changing top-k (3/5/10) barely affects latency; Judge remains the bottleneck.
Conclusion
Keyword search no longer suffices in multilingual or semantically diverse environments. Semantic retrieval captures intent and improves discovery.
Our three-stage approach balances speed and accuracy:
- Retriever: fast semantic candidates (FAISS/Pinecone)
- Reranker: fine-grained relevance via a lightweight model
- Judge LLM: equivalence verification for trust
Practical recommendations
- Real-time: Embedder + Judge (API) without reranker for lowest latency.
- High-stakes/privacy-sensitive: Run all components locally for maximum control.
- Resource-constrained: Embedder + Judge (API) offers strong speed/quality trade-offs.
Final Thoughts
Semantic retrieval goes beyond matching words—it captures meaning. In today’s multilingual and dynamic environments, this shift is essential. By optimizing for intent, we unlock faster, smarter, and more reliable search experiences.
At MDP Group, we see semantic search as a cornerstone of intelligent systems.