Designing a Multilingual Semantic Search Architecture

2025-08-13

Published at: MDP Group Blog

Introduction

Most enterprise search still relies on keyword matching—which often misses intent, especially in multilingual environments. At MDP Group, we built a semantic retrieval architecture that understands meaning, not just words, enabling smarter, faster, and more accurate search across domains.

In this article, we review the limits of keyword search, introduce a three-stage semantic pipeline, and share key insights from real-world experiments.

From Keyword Search to Semantic Understanding

Keyword search matches strings, not meaning. In multilingual or terminology-heavy corpora, this causes intent mismatches and missed results:

  • “Sharing new product ideas” vs. “submitting a new solution proposal” express the same intent with almost no lexical overlap.
  • “Vendor evaluation” and “supplier comparison” belong to the same workflow yet won’t match unless phrased identically.

Users then miss critical information, create duplicates, or give up. Semantic retrieval shifts focus from what was said to what was meant.

What Is Semantic Retrieval?

Semantic retrieval encodes queries and documents into dense vectors using modern language models. These vectors capture intent, enabling matches across paraphrases and languages.

We apply this to:

  • Suggestion systems (e.g., “Has this been proposed before?”)
  • Procurement (grouping similar supplier offers)
  • Knowledge bases (finding answers no matter how they’re asked)
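
To make the idea concrete, here is a minimal sketch of dense-vector matching using the sentence-transformers library; the model checkpoint is illustrative, and any multilingual embedding model works the same way:

```python
# Minimal sketch of dense-vector matching with sentence-transformers.
# The checkpoint below is illustrative; any multilingual embedder works.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

# Two paraphrases with almost no lexical overlap still land close together.
a = model.encode("Sharing new product ideas", convert_to_tensor=True)
b = model.encode("Submitting a new solution proposal", convert_to_tensor=True)

print(util.cos_sim(a, b).item())  # high cosine score despite different wording
```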

To deliver both flexibility and speed, we use a modular pipeline with three core components.

Understanding the Components

Retriever

Responsible for quickly finding semantically relevant candidates from a large corpus.

How it works

  • Query encoding: A multilingual sentence-embedding model converts the query into a high-dimensional vector.
  • Vector index lookup: A vector DB (e.g., FAISS, Pinecone) compares the query vector against precomputed document vectors.
  • Similarity scoring: Candidates are scored via cosine similarity or inner product.
  • Top-k retrieval: The top-k most similar candidates are returned.
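
In code, these steps might look like the following sketch, using sentence-transformers for encoding and FAISS for the index; the model name and toy corpus are illustrative, not our production setup:

```python
# Minimal retriever sketch: encode, index, and search.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")
docs = [
    "Vendor evaluation checklist for Q3",
    "Supplier comparison template",
    "Annual leave policy update",
]

# Precompute document vectors once; normalizing makes inner product
# equivalent to cosine similarity.
doc_vecs = model.encode(docs, normalize_embeddings=True)
index = faiss.IndexFlatIP(doc_vecs.shape[1])
index.add(np.asarray(doc_vecs, dtype="float32"))

# Query encoding + top-k lookup.
query_vec = model.encode(["vendor evaluation"], normalize_embeddings=True)
scores, ids = index.search(np.asarray(query_vec, dtype="float32"), 2)
for score, i in zip(scores[0], ids[0]):
    print(f"{score:.3f}  {docs[i]}")
```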

Reranker

Refines the candidate set so the most contextually relevant results surface to the top—useful when nuance or domain language matters.

How it works

  • Pair construction: Each (query, candidate) pair is formed from the top-k list.
  • Contextual scoring: A cross-encoder (e.g., MiniLM/BERT) scores semantic alignment.
  • Reordering: Candidates are re-ranked by these scores.
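
A minimal sketch of this step, using the CrossEncoder wrapper from sentence-transformers; the multilingual MiniLM checkpoint and the candidates are illustrative:

```python
# Minimal reranker sketch: score (query, candidate) pairs jointly with a
# cross-encoder, then reorder. Checkpoint and data are illustrative.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/mmarco-mMiniLMv2-L12-H384-v1")

query = "supplier comparison"
candidates = [
    "Vendor evaluation checklist for Q3",
    "Annual leave policy update",
    "Supplier comparison template",
]

# Unlike the retriever, the cross-encoder reads query and candidate together,
# which captures fine-grained interactions between the two texts.
scores = reranker.predict([(query, c) for c in candidates])
for text, score in sorted(zip(candidates, scores), key=lambda x: -x[1]):
    print(f"{score:.3f}  {text}")
```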

Judge LLM

Validates semantic equivalence between the query and top result—crucial for deduplication, legal/support search, or decision-making.

How it works

  • Input prep: Pair the query with the best candidate.
  • LLM evaluation: A prompted/fine-tuned LLM judges equivalence beyond surface similarity.
  • Decision: Returns labels like “Equivalent”, “Related but not equivalent”, or “Not relevant”.
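
As an illustration, here is a minimal Judge sketch against an OpenAI-compatible chat API; the model name, prompt wording, and label set are assumptions for the example, not the exact prompt we run in production:

```python
# Minimal Judge sketch over an OpenAI-compatible chat API. The model name
# and prompt are illustrative, not our production configuration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge(query: str, candidate: str) -> str:
    prompt = (
        "Compare the two texts and reply with exactly one label: "
        "'Equivalent', 'Related but not equivalent', or 'Not relevant'.\n\n"
        f"Query: {query}\nCandidate: {candidate}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; any capable chat model works
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic labels for verification
    )
    return response.choices[0].message.content.strip()

print(judge("sharing new product ideas", "submitting a new solution proposal"))
```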

Each layer adds precision and robustness across multilingual, high-variance enterprise data.

Three-Stage Architecture

To balance performance and accuracy, we use a three-stage pipeline:

  1. Retriever: Embedding-based nearest neighbors
    Encode the query with a multilingual model (e.g., all-mpnet-base-v2, multilingual-e5-large, jina-embeddings-v3) and retrieve the top-k candidates (e.g., 10) via Pinecone/FAISS. This stage completes in milliseconds.

  2. Reranker: Contextual fine-grained scoring
    Re-score the initial candidates with a lightweight cross-encoder; promote truly relevant items.

  3. Judge LLM: Semantic equivalence check
    Ask a Judge LLM: “Is this semantically equivalent, categorically related, or proposing a similar solution?”
    This provides an extra layer of trust when paraphrases are common.
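
Wired together, the three stages reduce to a short function. This sketch reuses the illustrative model, index, docs, reranker, and judge objects from the snippets above and omits batching and error handling:

```python
import numpy as np  # reuses model, index, docs, reranker, judge from above

def semantic_search(query: str, k: int = 10):
    # Stage 1: fast approximate candidates from the vector index.
    k = min(k, index.ntotal)  # guard for the toy corpus
    query_vec = model.encode([query], normalize_embeddings=True)
    _, ids = index.search(np.asarray(query_vec, dtype="float32"), k)
    candidates = [docs[i] for i in ids[0]]

    # Stage 2: cross-encoder rescoring of the shortlist.
    scores = reranker.predict([(query, c) for c in candidates])
    best = max(zip(candidates, scores), key=lambda x: x[1])[0]

    # Stage 3: LLM verdict on the single top candidate.
    return best, judge(query, best)
```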

Experiments

We compared configurations on a 16 GB VRAM GPU using both local and API models.

| Configuration | Top-k | Latency (s) | Notes |
| --- | --- | --- | --- |
| Embedder + Reranker + Local Judge | 10 | 17.18 | Fully local |
| Embedder + Reranker + API Judge | 10 | 2.79 | API-based verification |
| Embedder + API Judge (no reranker) | 10 | 1.78 | Fastest configuration |
| API Judge, top-k = 5 | 5 | 2.97 | Minimal gain from k drop |
| API Judge, top-k = 3 | 3 | 2.92 | Similar to k = 5 |

Key Findings

  • Judge dominates latency. Most time is spent in the Judge step.
  • API Judge is ~6× faster than local for our setup.
  • Removing the reranker can speed things up further (good for real-time), with acceptable accuracy trade-offs in many cases.
  • Changing top-k (3/5/10) barely affects latency; Judge remains the bottleneck.

Conclusion

Keyword search no longer suffices in multilingual or semantically diverse environments. Semantic retrieval captures intent and improves discovery.

Our three-stage approach balances speed and accuracy:

  • Retriever: fast semantic candidates (FAISS/Pinecone)
  • Reranker: fine-grained relevance via a lightweight model
  • Judge LLM: equivalence verification for trust

Practical recommendations

  • Real-time: Embedder + Judge (API) without reranker for lowest latency.
  • High-stakes/privacy-sensitive: Run all components locally for maximum control.
  • Resource-constrained: Embedder + Judge (API) offers strong speed/quality trade-offs.

Final Thoughts

Semantic retrieval goes beyond matching words—it captures meaning. In today’s multilingual and dynamic environments, this shift is essential. By optimizing for intent, we unlock faster, smarter, and more reliable search experiences.

At MDP Group, we see semantic search as a cornerstone of intelligent systems.