# Multimodal RAG: You Don't Need CLIP (Until You Do)
In my RAG for Dummies post, I walked through embeddings, chunking, hybrid search, and the full retrieval pipeline. That guide assumed one thing: your knowledge base is text.
In the real world, it’s not.
Your documentation has screenshots. Your research papers have diagrams and tables. Your support tickets include screen recordings. Your training library is full of video and audio. Text-only RAG ignores all of it.
Multimodal RAG fixes this. But here’s what most tutorials get wrong: they jump straight to CLIP and shared embedding spaces like that’s the obvious starting point. It’s not. The most effective multimodal RAG systems I’ve seen start with something much simpler: converting everything to text and running the same pipeline you already know.
There are really only three patterns. Here’s when each one makes sense.
## The Pipeline Stays the Same
If you’ve built text RAG, you already understand 90% of multimodal RAG. The five-stage pipeline is identical:
```mermaid
%%{init: {"layout": "dagre"}}%%
flowchart LR
    I[Ingest] --> E[Encode]
    E --> X[Index]
    X --> R[Retrieve]
    R --> G[Generate]
```
- Ingest: parse files, extract text, images, tables, audio tracks, video frames
- Encode: run modality-specific encoders to produce vectors or textual summaries
- Index: store vectors plus metadata in a vector DB
- Retrieve: embed the query, search one or more indices, fuse results
- Generate: send retrieved context to a multimodal LLM for the final answer
The only thing that changes is step 2. Instead of one text encoder, you now have multiple encoders, one per modality. The rest of the pipeline stays the same: the same chunks, the same vector DB, the same retrieval logic.
The key insight: multimodal RAG is not a new architecture. It’s the same architecture with new encoders plugged in.
## Pattern 1: Ground Everything in Text (Start Here)
This is the pragmatic baseline and where 80% of teams should start. Convert every non-text modality into text, then run your existing text-RAG pipeline unchanged.
Here’s the conversion recipe per modality:
| Modality | Conversion Method | Output |
|---|---|---|
| Images/diagrams | Describe with Gemini, GPT-4o, or Claude | Text summary + link to original image |
| Audio/speech | Transcribe with Whisper or AssemblyAI | Timestamped text chunks |
| Video | Transcribe audio + caption sampled keyframes | Text chunks with timestamps |
| Tables/figures | Parse to markdown or generate natural language summary | Structured text chunks |
Once converted, everything is text. Your existing chunking strategy, embedding model, vector DB, and retrieval logic all work without modification.
```mermaid
%%{init: {"layout": "dagre"}}%%
flowchart TB
    subgraph Ingestion
        IMG[Images] --> CAP[Caption with LLM]
        AUD[Audio] --> ASR[Whisper ASR]
        VID[Video] -->|audio track| ASR
        VID -->|keyframes| CAP
        TAB[Tables] --> PARSE[Parse to Markdown]
        TXT[Text] --> CHUNK[Chunker]
    end
    CAP --> CHUNK
    ASR --> CHUNK
    PARSE --> CHUNK
    CHUNK --> EMB[Text Embeddings]
    EMB --> VDB[(Vector DB)]
```
Why this works: You piggyback on mature text-RAG tooling. No new embedding models, no new indices, no alignment problems. Your retrieval quality depends on caption quality, and modern multimodal LLMs produce surprisingly good descriptions.
A note on model choice: I use Gemini for multimodal extraction specifically because it goes beyond OCR. Gemini doesn’t just read text from a diagram. It understands the relationships: “Service A calls Service B through an API gateway, which routes to a load balancer.” That structural understanding produces text chunks that actually match user queries, unlike flat OCR output that reads like a word salad of labels. For images heavy on spatial relationships (architecture diagrams, flowcharts, org charts), the quality gap between Gemini’s understanding and basic OCR is the difference between useful retrieval and noise.
The trade-off: You lose visual nuance. A text description of a complex architecture diagram captures structure and relationships but misses visual patterns. A transcript of a podcast captures words but loses tone. For most use cases, text conversion is enough. For visual similarity search (“find screenshots that look like this one”), it’s not.
When to stay here:
- Your non-text content is mostly supplementary (screenshots in docs, tables in reports)
- You need something working this week, not this quarter
- Your retrieval queries are text-based (“what does the architecture look like?” not “find images similar to this diagram”)
## Pattern 2: Shared Embedding Space with CLIP
When text conversion isn’t enough, you move to true multimodal embeddings. Models like CLIP map text and images into the same vector space, so a text query can directly retrieve images (and vice versa).
CLIP works by training a text encoder and an image encoder contrastively: matching text-image pairs are pulled close together in vector space, and mismatched pairs are pushed apart. “A golden retriever on a beach” ends up near the photo of exactly that.
```mermaid
%%{init: {"layout": "dagre"}}%%
flowchart LR
    subgraph CLIP Encoding
        T[Text] --> TE[Text Encoder]
        I[Image] --> IE[Image Encoder]
    end
    TE --> VS[(Shared Vector Space)]
    IE --> VS
    Q[Text Query] --> TE2[Text Encoder]
    TE2 --> VS
    VS -->|nearest neighbors| R[Text + Image Results]
```
Now you can do things Pattern 1 can’t: “find diagrams similar to this one,” “given this screenshot, find the relevant documentation,” or “retrieve the most relevant chart for this question.”
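Once everything lives in one space, cross-modal retrieval reduces to a nearest-neighbor search. A minimal sketch, assuming image vectors were already computed offline by a CLIP image encoder and the query vector by the matching text encoder; the 4-dimensional vectors and filenames here are toy placeholders standing in for real 512-dimensional CLIP outputs:

```python
import numpy as np

def normalize(v):
    """L2-normalize vectors so cosine similarity becomes a dot product."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Placeholder vectors standing in for real CLIP image embeddings
image_index = normalize(np.array([
    [0.9, 0.1, 0.0, 0.1],   # diagram of pod networking
    [0.0, 0.8, 0.5, 0.1],   # photo of a dashboard
    [0.1, 0.0, 0.9, 0.3],   # org chart
]))
image_ids = ["pod-networking.png", "dashboard.png", "org-chart.png"]

# Placeholder for the text encoder's output on the user's query
query_vec = normalize(np.array([0.85, 0.15, 0.05, 0.1]))

# Dot product against every image vector, take the best match
scores = image_index @ query_vec
best = image_ids[int(np.argmax(scores))]
```

In production you would hand the same dot-product search to your vector DB instead of numpy, but the shape of the operation is identical.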
Similar models exist for other modalities:
| Model | Modalities | Use Case |
|---|---|---|
| CLIP | Text + Images | Image search, visual Q&A |
| CLAP | Text + Audio | Audio search, sound retrieval |
| ImageBind | Text + Image + Audio + Video + more | Cross-modal retrieval |
The trade-off: CLIP embeddings are optimized for cross-modal alignment, not for fine-grained text similarity. A CLIP text embedding of “Kubernetes pod networking” is worse at finding related Kubernetes docs than a dedicated text embedding model like BGE-M3. You gain cross-modal retrieval but lose some within-modality precision.
When to use this:
- Users search with images, not just text
- You need visual similarity (“find screenshots that look like this”)
- Your content is primarily visual (product catalogs, design systems, medical imaging)
## Pattern 3: Multi-Vector Hybrid Retrieval
Real-world systems often need both text precision and cross-modal retrieval. Pattern 3 combines multiple indices and embedding types, then fuses results at query time.
For each document, you store multiple representations:
```
Document: architecture-guide.pdf, page 12
Stored vectors:
├── Text chunk embedding (BGE-M3) → "Kubernetes uses a pod network..."
├── Image caption embedding (BGE-M3) → "Diagram showing pod-to-pod communication"
└── CLIP embedding of raw image → [visual features of the diagram]
```
At query time, you search all indices and merge:
```mermaid
%%{init: {"layout": "dagre"}}%%
flowchart TB
    Q[User Query] --> TQ[Text Embedding]
    Q --> CQ[CLIP Embedding]
    TQ --> TS[Text Index Search]
    TQ --> CS1[Caption Index Search]
    CQ --> CS2[CLIP Index Search]
    TS --> FUSE[Merge + Rerank]
    CS1 --> FUSE
    CS2 --> FUSE
    FUSE --> LLM[Multimodal LLM]
    LLM --> A[Answer with Text + Images]
```
LangChain calls this the “multi-vector retriever” pattern. You retrieve text, image captions, and raw images from separate indices, then merge them before sending to the LLM.
The fusion problem: When you search three indices, you get three ranked lists. How do you merge them? Common strategies:
| Strategy | How It Works | Best For |
|---|---|---|
| Reciprocal Rank Fusion (RRF) | Combine ranks across lists, not scores | Different embedding models with incomparable scores |
| Score normalization | Normalize scores to 0-1, then weighted sum | Same embedding model across indices |
| LLM reranking | Send all candidates to an LLM for relevance scoring | Highest accuracy, highest latency |
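Reciprocal Rank Fusion is simple enough to fit in a few lines: each document scores the sum of `1/(k + rank)` across every list it appears in. A sketch (the document IDs are made up; `k=60` is the constant from the original RRF paper):

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several best-first ranked lists using ranks only, not scores."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Three indices return three different orderings of overlapping documents
text_hits    = ["doc3", "doc1", "doc7"]
caption_hits = ["doc1", "doc3", "doc9"]
clip_hits    = ["doc1", "doc9", "doc3"]

fused = reciprocal_rank_fusion([text_hits, caption_hits, clip_hits])
```

Because RRF only looks at ranks, it never has to reconcile a CLIP cosine score with a BGE-M3 cosine score, which is exactly why it suits the multi-index case.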
When to use this:
- Semi-structured documents (PDFs with text, tables, charts, screenshots)
- You need both precise text retrieval and visual similarity search
- You’re building a production system where retrieval quality justifies the complexity
## The Modality Cookbook
Each modality has its own ingestion recipe. Here’s the practical breakdown:
### Images and Diagrams
```python
# 1. Extract images from documents
images = extract_images(pdf_path)  # pdfplumber, pymupdf, unstructured

# 2. Caption each image, then store the caption as a text chunk
#    alongside a reference to the raw image
for img in images:
    caption = vision_llm.describe(img)  # Gemini, GPT-4o, Claude
    store_chunk(
        text=caption,
        metadata={"source": pdf_path, "page": img.page, "type": "image"},
        image_path=img.saved_path,  # for display in the answer UI
    )
```
Pro tip: Prompt the vision LLM with context. “Describe this diagram from a Kubernetes networking guide” produces far better captions than “Describe this image.”
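One way to wire that context into the prompt; the template, wording, and parameter names are illustrative, not a fixed API:

```python
def caption_prompt(doc_title: str, section: str) -> str:
    """Build a context-rich captioning prompt for a vision LLM."""
    return (
        f"Describe this diagram from '{doc_title}', section '{section}'. "
        "Name every component and the relationships between them "
        "(what calls what, what flows where). Ignore decorative elements."
    )

prompt = caption_prompt("Kubernetes Networking Guide", "Pod-to-Pod Communication")
```

Passing the source document's title and section costs nothing at ingestion time and pays off in every retrieval afterward.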
### Audio and Speech
```python
import whisper

# 1. Transcribe (openai-whisper returns segments with start/end times)
model = whisper.load_model("base")
result = model.transcribe(audio_path)

# 2. Chunk by natural pauses or fixed intervals (30-60 second segments)
chunks = chunk_transcript(result["segments"], max_seconds=60)

# 3. Embed and store with timestamp metadata
for chunk in chunks:
    store_chunk(
        text=chunk.text,
        metadata={
            "source": audio_path,
            "start_time": chunk.start,
            "end_time": chunk.end,
        },
    )
```
Timestamps let you link users directly to the relevant moment in the audio. “Jump to 14:32” is far more useful than “somewhere in this podcast.”
### Video
Video is audio + sampled frames. Transcribe the audio track, sample keyframes at 1-2 per minute (or on scene changes), caption those frames, and store both as text chunks with timestamps.
```
Video: product-demo.mp4 (15 min)
Stored chunks:
├── 15 transcript chunks (1 per minute of speech)
├── 20 keyframe captions (scene change detection)
└── All chunks indexed with timestamps for seek links
```
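Even sampling is the simplest keyframe policy and a fine starting point; scene-change detection (e.g. PySceneDetect) replaces it in practice. A sketch, where the 1.5-per-minute rate is just a middle value in the 1-2 range above:

```python
def keyframe_timestamps(duration_s: float, per_minute: float = 1.5) -> list[float]:
    """Evenly spaced sample times (in seconds), centered in each interval."""
    interval = 60.0 / per_minute
    t = interval / 2  # center the first sample in its window
    times = []
    while t < duration_s:
        times.append(round(t, 1))
        t += interval
    return times

stamps = keyframe_timestamps(15 * 60)  # the 15-minute demo above
```

Each timestamp then gets a frame grab, a caption, and the timestamp as metadata, exactly like the audio chunks.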
### Tables
Tables are deceptively tricky. A markdown-formatted table embeds poorly because embedding models treat it as flat text and lose the row/column relationships.
Two approaches:
- Flatten to text: “In Q1 2026, revenue was $4.2M, up 23% from Q4 2025.” Embed the natural language version.
- Store structured + summary: Keep the raw table for display, but generate a natural language summary for embedding and retrieval.
Option 2 gives better retrieval and better display. The LLM retrieves via the summary, then gets the full table in context for precise answers.
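The flattening step from option 1 can be as simple as a format template per table schema. A sketch using the revenue example above; the second row's figures and the field names are illustrative:

```python
def flatten_row(row: dict, template: str) -> str:
    """Render one table row as a natural-language sentence for embedding."""
    return template.format(**row)

rows = [
    {"quarter": "Q1 2026", "revenue": "$4.2M", "change": "up 23% from Q4 2025"},
    {"quarter": "Q4 2025", "revenue": "$3.4M", "change": "up 8% from Q3 2025"},  # made-up figures
]
template = "In {quarter}, revenue was {revenue}, {change}."
sentences = [flatten_row(r, template) for r in rows]
```

For option 2, the same sentences become the embedded summaries while the raw table travels along as metadata for display.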
## The Tooling Stack
A practical multimodal RAG stack in 2026:
| Layer | Options |
|---|---|
| Document parsing | Unstructured, pdfplumber, pymupdf |
| Image captioning | Gemini, GPT-4o, Claude, Qwen-VL |
| Audio transcription | Whisper, AssemblyAI, Deepgram |
| Text embeddings | BGE-M3, OpenAI text-embedding-3, Cohere Embed v3 |
| Multimodal embeddings | CLIP, ImageBind, SigLIP |
| Vector DB | pgvector, Pinecone, Weaviate, Qdrant, Milvus |
| Orchestration | LangChain, Haystack, LlamaIndex |
| Generation | GPT-4o, Claude, Qwen-VL (any multimodal LLM) |
My recommendation: Start with Pattern 1 using Unstructured for parsing, Whisper for audio, Gemini for image understanding, and your existing text-RAG stack for everything else. You can add CLIP and multi-vector retrieval later when you have a specific use case that demands it.
## Gotchas That Will Bite You
Bad captions poison retrieval. If your vision LLM describes a network architecture diagram as “a diagram with boxes and arrows,” that caption is useless for retrieval. Invest time in captioning prompts.
Modality imbalance skews results. If you have 10,000 text chunks and 50 image captions, text will dominate retrieval regardless of relevance. Normalize scores per modality or use separate indices.
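One cheap mitigation is a per-modality quota: take a fixed number of top hits from each index before merging, so 10,000 text chunks can't crowd out 50 captions. A sketch; the function name and quota values are illustrative:

```python
def merge_with_quotas(hits_by_modality: dict, quotas: dict) -> list:
    """Keep only the top-N hits per modality so no single index dominates."""
    merged = []
    for modality, hits in hits_by_modality.items():
        merged.extend(hits[: quotas.get(modality, 0)])
    return merged

results = {
    "text":  ["t1", "t2", "t3", "t4", "t5"],  # text index would flood the merge
    "image": ["i1", "i2"],
}
merged = merge_with_quotas(results, {"text": 3, "image": 2})
```

A reranker downstream can still reorder the merged list; the quota only guarantees every modality gets a seat at the table.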
Context windows are not infinite. Sending 5 text chunks plus 3 high-resolution images to the LLM eats context fast. A single image can consume 1,000+ tokens. Budget accordingly.
Evaluation is hard. Text retrieval has established benchmarks. Multimodal retrieval doesn’t (yet). Build domain-specific eval sets: 50-100 queries where you know which documents, images, or audio segments should be retrieved.
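The eval loop itself is small once you have those labeled queries. A sketch of recall@k over one hand-labeled query; the document IDs are made up:

```python
def recall_at_k(retrieved: list, relevant: set, k: int = 5) -> float:
    """Fraction of known-relevant items that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

# One labeled query from a 50-100 query eval set: mixed-modality results,
# with the items a human judged relevant
retrieved = ["doc7", "img2", "doc1", "doc9", "audio3"]
relevant = {"img2", "doc1", "doc4"}
score = recall_at_k(retrieved, relevant, k=5)  # finds 2 of 3 relevant items
```

Averaging this over the full query set, broken down per modality, tells you exactly which encoder to upgrade first.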
Latency compounds. Each additional modality adds encoding time. Whisper transcription on a 30-minute audio file takes 30-60 seconds. Captioning 50 images takes 2-3 minutes. Plan for async ingestion pipelines, not synchronous requests.
## The Decision Framework
Not sure which pattern to use? Start here:
```
Do your users search with text only?
├── Yes: Do you need to retrieve non-text content?
│   ├── No: You don't need multimodal RAG. Stick with text-RAG.
│   └── Yes: Use Pattern 1 (text conversion).
│       └── Are captions losing critical information?
│           ├── No: Stay with Pattern 1.
│           └── Yes: Add Pattern 2 (CLIP) for that modality.
└── No (users search with images/audio):
    └── Use Pattern 2 or Pattern 3 (multi-vector hybrid).
```
Most teams land on Pattern 1. Some graduate to Pattern 3 for production. Very few need Pattern 2 alone.
## The Bottom Line
Multimodal RAG is not a new architecture. It’s the same retrieval pipeline with different encoders plugged in at ingestion.
The pragmatic path: convert everything to text first. Captions, transcripts, and markdown tables get you surprisingly far with zero new infrastructure. When text conversion loses critical information (visual similarity, spatial relationships, audio features), add CLIP or multi-vector retrieval for that specific modality.
Don’t build Pattern 3 on day one. Build Pattern 1, measure where retrieval fails, then upgrade the modalities that need it.
Building multimodal RAG and hitting retrieval quality issues? I’d love to hear what modalities are giving you trouble. Reach out on LinkedIn.