Manga Panel Search with Visual Similarity, Reranking, and VLM Explanation

2025-08-29

Published on: Medium


Ever struggled to find that one panel in a manga you vaguely remember?

In this post, I walk through a multimodal RAG system I built for searching and explaining scenes in the manga Frieren: Beyond Journey’s End (Volume 01). The goal is simple but powerful:
Let users search manga using either a visual or textual query and get semantically relevant panels with visual-language model explanations.

For example:

“What is the scene where Frieren and Fern discovered the statue of Himmel?”
should retrieve the exact panel—even if you only describe it.

Image 1: Pages Retrieved via Question



Why Multimodal RAG?

There are several approaches to visual search in manga or comics, each with trade-offs. Here's a quick comparison and why I chose Multimodal Retrieval-Augmented Generation (RAG):

1. Keyword Search (Captions, OCR, Tags)

Using captions, OCR outputs, or manual tags for keyword lookup.

Problems:

  • Manga panels often lack text or contain sparse speech bubbles.
  • OCR may fail on Japanese/Katakana or artistic fonts.
  • Visually significant scenes without text are missed.

2. Visual Similarity (CLIP / CNN + k-NN)

Each image is embedded into a vector, and visually similar panels are returned using cosine similarity.

Pros: Fast and language-independent.
Cons:

  • Cannot answer textual queries.
  • Visually similar ≠ semantically similar (e.g., same layout but different meaning).
  • Lacks context.

3. Caption Generation + Text-Only RAG

Each panel is captioned, and standard text-based RAG is applied.

Pros: Works with traditional RAG pipelines.
Cons:

  • Captions are noisy and unreliable.
  • Highly sensitive to model quality.
  • Different users may describe the same panel differently.

4. Multimodal RAG: Flexible and Intuitive

I opted for Multimodal RAG where both visual and textual queries are embedded in a shared vector space.

  • Users can upload an image or type a query.
  • Both are embedded using the same model.
  • Matches are semantically ranked and optionally reranked with a token-level reranker (sketched below).

Why it works:

  • No need for captions or manual tagging.
  • Captures semantic meaning, not just visual layout.
  • Supports free-form text and direct visual lookup.
  • Allows reranking (e.g., MaxSim) for improved precision.
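
To make the shared-space idea concrete, here is a minimal sketch of embedding a panel image and a free-text query with one model and comparing them by cosine similarity. It uses openai/clip-vit-base-patch16 purely as an illustration (the same encoder family I use later for patch-level reranking); in the actual pipeline the vectors come from Jina v3, and the file path and query below are made up.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# A CLIP-style encoder maps images and text into the same vector space.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

# Hypothetical panel image and a free-text query describing it.
panel = Image.open("pages/vol01_page_012.png").convert("RGB")
query = "Frieren and Fern discover the statue of Himmel"

with torch.no_grad():
    image_inputs = processor(images=panel, return_tensors="pt")
    text_inputs = processor(text=[query], return_tensors="pt", padding=True)
    image_emb = model.get_image_features(**image_inputs)
    text_emb = model.get_text_features(**text_inputs)

# Cosine similarity between the query and the panel in the shared space.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
score = (text_emb @ image_emb.T).item()
print(f"text-image similarity: {score:.3f}")
```

Because both modalities land in the same space, a single k-NN index can serve both image → image and text → image lookups.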


System Architecture Overview

Here’s the full pipeline:

  1. PDFs to PNGs:
    PDF volumes converted to page-wise PNGs using PyMuPDF. Image 2: Example Manga Page that Will Be Used

  2. Embeddings (Jina v3):
    Each image embedded into 1024-dim vectors via jinaai/jina-embeddings-v3.

  3. Vector Store (Qdrant):
    Indexed with metadata like volume, page_number, and image_path. Image 3: Qdrant UI and Example Vector

It is also possible to view the vectors as a graph: Image 4: Qdrant Graph Representation
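
Here is a minimal sketch of steps 1-3, assuming a local Qdrant instance and a recent PyMuPDF (the dpi argument needs 1.19+). The embed_image() helper is a placeholder for whatever image-embedding call you use (Jina v3 in my setup), and the file names are illustrative.

```python
import os

import fitz  # PyMuPDF
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

def embed_image(path: str) -> list[float]:
    """Placeholder: return a 1024-dim embedding for the page image (Jina v3 in my setup)."""
    raise NotImplementedError

# 1. Render each PDF page to a PNG.
os.makedirs("pages", exist_ok=True)
doc = fitz.open("frieren_vol01.pdf")
image_paths = []
for i, page in enumerate(doc, start=1):
    pix = page.get_pixmap(dpi=200)
    path = f"pages/vol01_page_{i:03d}.png"
    pix.save(path)
    image_paths.append(path)

# 2-3. Embed every page and index it in Qdrant with metadata.
client = QdrantClient(url="http://localhost:6333")
client.create_collection(
    collection_name="manga_pages",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)
points = [
    PointStruct(
        id=i,
        vector=embed_image(path),
        payload={"volume": 1, "page_number": i, "image_path": path},
    )
    for i, path in enumerate(image_paths, start=1)
]
client.upsert(collection_name="manga_pages", points=points)
```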

  4. Multimodal Search (sketched after this list):

    • Image → Image: Find visually similar panels.
    • Text → Image: Natural language queries return relevant panels.
  5. Reranker (MaxSim / ColBERT):
    Top-k candidates reranked using token-level similarity.

  6. Explanation (MiniCPM-V-4):
    Selected panels are described using a vision-language model (VLM).
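
And a matching sketch of the text → image lookup in step 4. The embed_text() helper is again a placeholder that must map queries into the same 1024-dim space as the pages; the metadata filter shows how the volume payload field can narrow the search.

```python
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue

def embed_text(query: str) -> list[float]:
    """Placeholder: embed the query into the same 1024-dim space as the pages."""
    raise NotImplementedError

client = QdrantClient(url="http://localhost:6333")

query = "Frieren and Fern discover the statue of Himmel"
hits = client.search(
    collection_name="manga_pages",
    query_vector=embed_text(query),
    query_filter=Filter(  # optional metadata filter, e.g. restrict to volume 1
        must=[FieldCondition(key="volume", match=MatchValue(value=1))]
    ),
    limit=10,  # top-k candidates handed to the MaxSim reranker
)
for hit in hits:
    print(hit.payload["page_number"], hit.score, hit.payload["image_path"])
```

The returned top-k hits are what the MaxSim reranker and the VLM then operate on.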


Models Used

Stage       Model
Embedding   jinaai/jina-embeddings-v3
Vector DB   Qdrant
Reranker    colbert-ir/colbertv2.0
VLM         openbmb/MiniCPM-V-4


MaxSim Reranker for Token-Level Precision

Basic cosine similarity is fast but shallow.
To get truly meaningful matches for textual queries, I added a MaxSim reranker.

How it works:

  • Query tokens and visual patches are embedded.
  • For each query token, the most similar patch token is selected.
  • Final score is the average of these max similarities.
score = mean(max_sim(query_token_i, panel_tokens))

This helps:

  • Disambiguate between visually similar but semantically different panels.

  • Improve retrieval quality for questions like: “What was Frieren doing in this scene?”

To support patch-level embedding, I used clip-vit-base-patch16 as the encoder and reranked with colbertv2.0.
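
For reference, here is a minimal sketch of that scoring rule, assuming the query-token and panel-patch embeddings are already L2-normalized tensors:

```python
import torch

def maxsim_score(query_tokens: torch.Tensor, panel_patches: torch.Tensor) -> float:
    """MaxSim: for every query token take its best-matching patch, then average.

    query_tokens:  (Q, D) L2-normalized token embeddings of the text query
    panel_patches: (P, D) L2-normalized patch embeddings of a candidate panel
    """
    sim = query_tokens @ panel_patches.T        # (Q, P) cosine similarities
    return sim.max(dim=1).values.mean().item()  # max over patches, mean over tokens

def rerank(query_tokens, candidates):
    """Reorder (panel_id, panel_patches) candidates by their MaxSim score."""
    scored = [(pid, maxsim_score(query_tokens, patches)) for pid, patches in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

The top-k hits from the vector search are rescored this way before being shown or passed to the VLM.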

Explaining Panels with MiniCPM-V-4

After retrieving a panel, I wanted to generate natural-language explanations. For this, I integrated MiniCPM-V-4, a lightweight, high-performance vision-language model.

Use Cases:

  • Descriptive Prompts: “What’s happening in this panel?” “Describe the characters and setting.”

  • Targeted Questions: “Who is Frieren talking to here?” “What emotion does this scene convey?”

The model generates contextual answers using character posture, facial expressions, and background features.
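
As an illustration, here is how a panel plus a prompt can be passed to the model through Hugging Face transformers. The chat() call below follows the pattern documented for recent MiniCPM-V checkpoints; the exact signature can vary between releases, so treat it as an assumption and double-check the openbmb/MiniCPM-V-4 model card. The image path and question are made up.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_id = "openbmb/MiniCPM-V-4"
model = AutoModel.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.bfloat16
).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

panel = Image.open("pages/vol01_page_012.png").convert("RGB")
question = "Who is Frieren talking to here, and what emotion does this scene convey?"

# MiniCPM-V exposes a chat() helper via trust_remote_code; a message mixes images and text.
# Note: the argument list may differ between releases; see the model card.
msgs = [{"role": "user", "content": [panel, question]}]
answer = model.chat(image=None, msgs=msgs, tokenizer=tokenizer)
print(answer)
```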

Demo: Examples

Visual → Visual Search

Image 5: Image Search Result
Image 6: Continuation of Image 5 (page numbers and similarity scores are shown below the images)

Text → Visual (RAG + MaxSim)

Image 7: The Reranker Enhances Image Search Results

Query: “graveyard scene” → Top matches reranked based on token-level semantic fit.

VLM Explanation:

Image 8: Result of VLM

Prompt: Describe this manga panel in detail: characters, setting, and events. If possible, guess the chapter/volume and justify briefly. Give a word that describes the emotions in this scene.

Output:

This manga panel depicts a moment of parting between two characters... The word that describes the emotion in this scene is bittersweet.

Insights

  • Visual RAG outperforms traditional text-only search in manga scenarios.

  • Token-level reranking (MaxSim) significantly improves semantic precision.

  • Embedding both text and images in a shared space (Jina v3) offers great flexibility.

  • Qdrant supports fast multimodal similarity search and metadata filtering.

  • MiniCPM-V is lightweight but surprisingly capable of nuanced explanations.

  • Turkish queries don’t work well yet—VLMs still struggle with non-English prompts.

  • Some character names and letters may be misread due to patch overlap.

Future Work

Some exciting enhancements ahead:

  • Panel Segmentation: Detect and crop individual panels using YOLO or SAM instead of using full pages.

  • Semantic Tagging: Add character recognition, emotion detection, and scene classification.

  • Narrative Graphs: Build a timeline of events by tracing character trajectories and interactions.

Conclusion

Finding and understanding a specific manga scene shouldn't require flipping through 200 pages. With Multimodal RAG + VLM, we can now:

  • Retrieve panels using either images or free-text

  • Rerank results with token-level semantics

  • Explain scenes in natural language

The architecture is fast, extendable, and fully open source. Check out the code on GitHub and explore your favorite manga like never before.

GitHub repo: [https://github.com/rabiaedayilmaz/manga-multimodal-rag]