Manga Panel Search with Visual Similarity, Reranking, and VLM Explanation

2025-08-29

Published on: Medium


Ever struggled to find that one panel in a manga you vaguely remember?

In this post, I walk through a multimodal RAG system I built for searching and explaining scenes in the manga Frieren: Beyond Journey’s End (Volume 01). The goal is simple but powerful:
Let users search manga using either a visual or textual query and get semantically relevant panels with visual-language model explanations.

For example:

“What is the scene where Frieren and Fern discovered the statue of Himmel?”
should retrieve the exact panel—even if you only describe it.

Image 1: Pages Retrieved via Question



Why Multimodal RAG?

There are several approaches to visual search in manga or comics, each with trade-offs. Here's a quick comparison and why I chose Multimodal Retrieval-Augmented Generation (RAG):

1. Keyword Search (Captions, OCR, Tags)

Using captions, OCR outputs, or manual tags for keyword lookup.

Problems:

  • Manga panels often lack text or contain sparse speech bubbles.
  • OCR may fail on Japanese/Katakana or artistic fonts.
  • Visually significant scenes without text are missed.

2. Visual Similarity (CLIP / CNN + k-NN)

Each image is embedded into a vector, and visually similar panels are returned using cosine similarity.

Pros: Fast and language-independent.
Cons:

  • Cannot answer textual queries.
  • Visually similar ≠ semantically similar (e.g., same layout but different meaning).
  • Lacks context.

3. Caption Generation + Text-Only RAG

Each panel is captioned, and standard text-based RAG is applied.

Pros: Works with traditional RAG pipelines.
Cons:

  • Captions are noisy and unreliable.
  • Highly sensitive to model quality.
  • Different users may describe the same panel differently.

4. Multimodal RAG: Flexible and Intuitive

I opted for Multimodal RAG where both visual and textual queries are embedded in a shared vector space.

  • Users can upload an image or type a query.
  • Both are embedded using the same model.
  • Matches are semantically ranked and optionally reranked with a token-level reranker (sketched below).

Why it works:

  • No need for captions or manual tagging.
  • Captures semantic meaning, not just visual layout.
  • Supports free-form text and direct visual lookup.
  • Allows reranking (e.g., MaxSim) for improved precision.
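
To make the shared-space idea concrete, here is a minimal sketch of embedding a panel image and a free-text query with one model and comparing them by cosine similarity. It uses openai/clip-vit-base-patch16 purely as an illustration (the same encoder family I use later for patch-level reranking); in the actual pipeline the vectors come from Jina v3, and the file path and query below are made up.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# A CLIP-style encoder maps images and text into the same vector space.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

# Hypothetical panel image and a free-text query describing it.
panel = Image.open("pages/vol01_page_012.png").convert("RGB")
query = "Frieren and Fern discover the statue of Himmel"

with torch.no_grad():
    image_inputs = processor(images=panel, return_tensors="pt")
    text_inputs = processor(text=[query], return_tensors="pt", padding=True)
    image_emb = model.get_image_features(**image_inputs)
    text_emb = model.get_text_features(**text_inputs)

# Cosine similarity between the query and the panel in the shared space.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
score = (text_emb @ image_emb.T).item()
print(f"text-image similarity: {score:.3f}")
```

Because both modalities land in the same space, a single k-NN index can serve both image → image and text → image lookups.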


System Architecture Overview

Here’s the full pipeline:

  1. PDFs to PNGs:
    PDF volumes converted to page-wise PNGs using PyMuPDF. Image 2: Example Manga Page that Will Be Used

  2. Embeddings (Jina v3):
    Each image embedded into 1024-dim vectors via jinaai/jina-embeddings-v3.

  3. Vector Store (Qdrant):
    Indexed with metadata like volume, page_number, and image_path. Image 3: Qdrant UI and Example Vector

It is also possible to view the vectors as a graph: Image 4: Qdrant Graph Representation
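
Here is a minimal sketch of steps 1-3, assuming a local Qdrant instance and a recent PyMuPDF (the dpi argument needs 1.19+). The embed_image() helper is a placeholder for whatever image-embedding call you use (Jina v3 in my setup), and the file names are illustrative.

```python
import os

import fitz  # PyMuPDF
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

def embed_image(path: str) -> list[float]:
    """Placeholder: return a 1024-dim embedding for the page image (Jina v3 in my setup)."""
    raise NotImplementedError

# 1. Render each PDF page to a PNG.
os.makedirs("pages", exist_ok=True)
doc = fitz.open("frieren_vol01.pdf")
image_paths = []
for i, page in enumerate(doc, start=1):
    pix = page.get_pixmap(dpi=200)
    path = f"pages/vol01_page_{i:03d}.png"
    pix.save(path)
    image_paths.append(path)

# 2-3. Embed every page and index it in Qdrant with metadata.
client = QdrantClient(url="http://localhost:6333")
client.create_collection(
    collection_name="manga_pages",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)
points = [
    PointStruct(
        id=i,
        vector=embed_image(path),
        payload={"volume": 1, "page_number": i, "image_path": path},
    )
    for i, path in enumerate(image_paths, start=1)
]
client.upsert(collection_name="manga_pages", points=points)
```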

  4. Multimodal Search (sketched after this list):

    • Image → Image: Find visually similar panels.
    • Text → Image: Natural language queries return relevant panels.
  5. Reranker (MaxSim / ColBERT):
    Top-k candidates reranked using token-level similarity.

  6. Explanation (MiniCPM-V-4):
    Selected panels are described using a vision-language model (VLM).
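
And a matching sketch of the text → image lookup in step 4. The embed_text() helper is again a placeholder that must map queries into the same 1024-dim space as the pages; the metadata filter shows how the volume payload field can narrow the search.

```python
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue

def embed_text(query: str) -> list[float]:
    """Placeholder: embed the query into the same 1024-dim space as the pages."""
    raise NotImplementedError

client = QdrantClient(url="http://localhost:6333")

query = "Frieren and Fern discover the statue of Himmel"
hits = client.search(
    collection_name="manga_pages",
    query_vector=embed_text(query),
    query_filter=Filter(  # optional metadata filter, e.g. restrict to volume 1
        must=[FieldCondition(key="volume", match=MatchValue(value=1))]
    ),
    limit=10,  # top-k candidates handed to the MaxSim reranker
)
for hit in hits:
    print(hit.payload["page_number"], hit.score, hit.payload["image_path"])
```

The returned top-k hits are what the MaxSim reranker and the VLM then operate on.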


Models Used

Stage       Model
Embedding   jinaai/jina-embeddings-v3
Vector DB   Qdrant
Reranker    colbert-ir/colbertv2.0
VLM         openbmb/MiniCPM-V-4


MaxSim Reranker for Token-Level Precision

Basic cosine similarity is fast but shallow.
To get truly meaningful matches for textual queries, I added a MaxSim reranker.

How it works:

  • Query tokens and visual patches are embedded.
  • For each query token, the most similar patch token is selected.
  • Final score is the average of these max similarities.
score = mean(max_sim(query_token_i, panel_tokens))

This helps:

  • Disambiguate between visually similar but semantically different panels.

  • Improve retrieval quality for questions like: “What was Frieren doing in this scene?”

To support patch-level embedding, I used clip-vit-base-patch16 as the encoder and reranked with colbertv2.0.
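
For reference, here is a minimal sketch of that scoring rule, assuming the query-token and panel-patch embeddings are already L2-normalized tensors:

```python
import torch

def maxsim_score(query_tokens: torch.Tensor, panel_patches: torch.Tensor) -> float:
    """MaxSim: for every query token take its best-matching patch, then average.

    query_tokens:  (Q, D) L2-normalized token embeddings of the text query
    panel_patches: (P, D) L2-normalized patch embeddings of a candidate panel
    """
    sim = query_tokens @ panel_patches.T        # (Q, P) cosine similarities
    return sim.max(dim=1).values.mean().item()  # max over patches, mean over tokens

def rerank(query_tokens, candidates):
    """Reorder (panel_id, panel_patches) candidates by their MaxSim score."""
    scored = [(pid, maxsim_score(query_tokens, patches)) for pid, patches in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

The top-k hits from the vector search are rescored this way before being shown or passed to the VLM.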

Explaining Panels with MiniCPM-V-4

After retrieving a panel, I wanted to generate natural-language explanations. For this, I integrated MiniCPM-V-4, a lightweight, high-performance vision-language model.

Use Cases:

  • Descriptive Prompts: “What’s happening in this panel?” “Describe the characters and setting.”

  • Targeted Questions: “Who is Frieren talking to here?” “What emotion does this scene convey?”

The model generates contextual answers using character posture, facial expressions, and background features.
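
As an illustration, here is how a panel plus a prompt can be passed to the model through Hugging Face transformers. The chat() call below follows the pattern documented for recent MiniCPM-V checkpoints; the exact signature can vary between releases, so treat it as an assumption and double-check the openbmb/MiniCPM-V-4 model card. The image path and question are made up.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_id = "openbmb/MiniCPM-V-4"
model = AutoModel.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.bfloat16
).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

panel = Image.open("pages/vol01_page_012.png").convert("RGB")
question = "Who is Frieren talking to here, and what emotion does this scene convey?"

# MiniCPM-V exposes a chat() helper via trust_remote_code; a message mixes images and text.
# Note: the argument list may differ between releases; see the model card.
msgs = [{"role": "user", "content": [panel, question]}]
answer = model.chat(image=None, msgs=msgs, tokenizer=tokenizer)
print(answer)
```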

Demo: Examples

Visual → Visual Search

Image 5: Image Search Result
Image 6: Continuation of Image 5 (page numbers and similarity scores are shown below the images)

Text → Visual (RAG + MaxSim)

Image 7: The Reranker Enhances Image Search Results

Query: “graveyard scene” → Top matches reranked based on token-level semantic fit.

VLM Explanation:

Image 8: Result of VLM

Prompt: Describe this manga panel in detail: characters, setting, and events. If possible, guess the chapter/volume and justify briefly. Give a word that describes the emotions in this scene.

Output:

This manga panel depicts a moment of parting between two characters... The word that describes the emotion in this scene is bittersweet.

Insights

  • Visual RAG outperforms traditional text-only search in manga scenarios.

  • Token-level reranking (MaxSim) significantly improves semantic precision.

  • Embedding both text and images in a shared space (Jina v3) offers great flexibility.

  • Qdrant supports fast multimodal similarity search and metadata filtering.

  • MiniCPM-V is lightweight but surprisingly capable of nuanced explanations.

  • Turkish queries don’t work well yet—VLMs still struggle with non-English prompts.

  • Some character names and letters may be misread due to patch overlap.

Future Work

Some exciting enhancements ahead:

  • Panel Segmentation: Detect and crop individual panels using YOLO or SAM instead of using full pages.

  • Semantic Tagging: Add character recognition, emotion detection, and scene classification.

  • Narrative Graphs: Build a timeline of events by tracing character trajectories and interactions.

Conclusion

Finding and understanding a specific manga scene shouldn't require flipping through 200 pages. With Multimodal RAG + VLM, we can now:

  • Retrieve panels using either images or free-text

  • Rerank results with token-level semantics

  • Explain scenes in natural language

The architecture is fast, extendable, and fully open source. Check out the code on GitHub and explore your favorite manga like never before.

GitHub repo: [https://github.com/rabiaedayilmaz/manga-multimodal-rag]