# Multimodal RAG: You Don't Need CLIP (Until You Do)
In my RAG for Dummies post, I walked through embeddings, chunking, hybrid search, and the full retrieval pipeline. That guide assumed one thing: your knowledge base is text.
In the real world, it’s not.
Your documentation has screenshots. Your research papers have diagrams and tables. Your support tickets include screen recordings. Your training library is full of video and audio. Text-only RAG ignores all of it.
Multimodal RAG fixes this. But here’s what most tutorials get wrong: they jump straight to CLIP and shared embedding spaces like that’s the obvious starting point. It’s not. The most effective multimodal RAG systems I’ve seen start with something much simpler: converting everything to text and running the same pipeline you already know.
There are really only three patterns. Here’s when each one makes sense.
## The Pipeline Stays the Same
If you’ve built text RAG, you already understand 90% of multimodal RAG. The five-stage pipeline is identical:
```mermaid
%%{init: {"layout": "dagre"}}%%
flowchart LR
    I[Ingest] --> E[Encode]
    E --> X[Index]
    X --> R[Retrieve]
    R --> G[Generate]
```
- Ingest: parse files, extract text, images, tables, audio tracks, video frames
- Encode: run modality-specific encoders to produce vectors or textual summaries
- Index: store vectors plus metadata in a vector DB
- Retrieve: embed the query, search one or more indices, fuse results
- Generate: send retrieved context to a multimodal LLM for the final answer
The only thing that changes is step 2. Instead of one text encoder, you now have multiple encoders, one per modality. The rest of the pipeline stays the same: the same chunks, the same vector DB, the same retrieval logic.
The key insight: multimodal RAG is not a new architecture. It’s the same architecture with new encoders plugged in.
## Pattern 1: Ground Everything in Text (Start Here)
This is the pragmatic baseline and where 80% of teams should start. Convert every non-text modality into text, then run your existing text-RAG pipeline unchanged.
Here’s the conversion recipe per modality:
| Modality | Conversion Method | Output |
|---|---|---|
| Images/diagrams | Describe with Gemini, GPT-4o, or Claude | Text summary + link to original image |
| Audio/speech | Transcribe with Whisper or AssemblyAI | Timestamped text chunks |
| Video | Transcribe audio + caption sampled keyframes | Text chunks with timestamps |
| Tables/figures | Parse to markdown or generate natural language summary | Structured text chunks |
Once converted, everything is text. Your existing chunking strategy, embedding model, vector DB, and retrieval logic all work without modification.
```mermaid
%%{init: {"layout": "dagre"}}%%
flowchart TB
    subgraph Ingestion
        IMG[Images] --> CAP[Caption with LLM]
        AUD[Audio] --> ASR[Whisper ASR]
        VID[Video] -->|audio track| ASR
        VID -->|keyframes| CAP
        TAB[Tables] --> PARSE[Parse to Markdown]
        TXT[Text] --> CHUNK[Chunker]
    end
    CAP --> CHUNK
    ASR --> CHUNK
    PARSE --> CHUNK
    CHUNK --> EMB[Text Embeddings]
    EMB --> VDB[(Vector DB)]
```
Why this works: You piggyback on mature text-RAG tooling. No new embedding models, no new indices, no alignment problems. Your retrieval quality depends on caption quality, and modern multimodal LLMs produce surprisingly good descriptions.
A note on model choice: I use Gemini for multimodal extraction specifically because it goes beyond OCR. Gemini doesn’t just read text from a diagram. It understands the relationships: “Service A calls Service B through an API gateway, which routes to a load balancer.” That structural understanding produces text chunks that actually match user queries, unlike flat OCR output that reads like a word salad of labels. For images heavy on spatial relationships (architecture diagrams, flowcharts, org charts), the quality gap between Gemini’s understanding and basic OCR is the difference between useful retrieval and noise.
The trade-off: You lose visual nuance. A text description of a complex architecture diagram captures structure and relationships but misses visual patterns. A transcript of a podcast captures words but loses tone. For most use cases, text conversion is enough. For visual similarity search (“find screenshots that look like this one”), it’s not.
When to stay here:
- Your non-text content is mostly supplementary (screenshots in docs, tables in reports)
- You need something working this week, not this quarter
- Your retrieval queries are text-based (“what does the architecture look like?” not “find images similar to this diagram”)
## Pattern 2: Shared Embedding Space with CLIP
When text conversion isn’t enough, you move to true multimodal embeddings. Models like CLIP map text and images into the same vector space, so a text query can directly retrieve images (and vice versa).
CLIP works by training a text encoder and an image encoder contrastively: matching text-image pairs are pulled close together in vector space, and mismatched pairs are pushed apart. “A golden retriever on a beach” ends up near the photo of exactly that.
```mermaid
%%{init: {"layout": "dagre"}}%%
flowchart LR
    subgraph CLIP Encoding
        T[Text] --> TE[Text Encoder]
        I[Image] --> IE[Image Encoder]
    end
    TE --> VS[(Shared Vector Space)]
    IE --> VS
    Q[Text Query] --> TE2[Text Encoder]
    TE2 --> VS
    VS -->|nearest neighbors| R[Text + Image Results]
```
Now you can do things Pattern 1 can’t: “find diagrams similar to this one,” “given this screenshot, find the relevant documentation,” or “retrieve the most relevant chart for this question.”
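Once everything lives in one space, cross-modal retrieval reduces to a nearest-neighbor search. A minimal sketch, assuming image vectors were already computed offline by a CLIP image encoder and the query vector by the matching text encoder; the 4-dimensional vectors and filenames here are toy placeholders standing in for real 512-dimensional CLIP outputs:

```python
import numpy as np

def normalize(v):
    """L2-normalize vectors so cosine similarity becomes a dot product."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Placeholder vectors standing in for real CLIP image embeddings
image_index = normalize(np.array([
    [0.9, 0.1, 0.0, 0.1],   # diagram of pod networking
    [0.0, 0.8, 0.5, 0.1],   # photo of a dashboard
    [0.1, 0.0, 0.9, 0.3],   # org chart
]))
image_ids = ["pod-networking.png", "dashboard.png", "org-chart.png"]

# Placeholder for the text encoder's output on the user's query
query_vec = normalize(np.array([0.85, 0.15, 0.05, 0.1]))

# Dot product against every image vector, take the best match
scores = image_index @ query_vec
best = image_ids[int(np.argmax(scores))]
```

In production you would hand the same dot-product search to your vector DB instead of numpy, but the shape of the operation is identical.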
Similar models exist for other modalities:
| Model | Modalities | Use Case |
|---|---|---|
| CLIP | Text + Images | Image search, visual Q&A |
| CLAP | Text + Audio | Audio search, sound retrieval |
| ImageBind | Text + Image + Audio + Video + more | Cross-modal retrieval |
The trade-off: CLIP embeddings are optimized for cross-modal alignment, not for fine-grained text similarity. A CLIP text embedding of “Kubernetes pod networking” is worse at finding related Kubernetes docs than a dedicated text embedding model like BGE-M3. You gain cross-modal retrieval but lose some within-modality precision.
When to use this:
- Users search with images, not just text
- You need visual similarity (“find screenshots that look like this”)
- Your content is primarily visual (product catalogs, design systems, medical imaging)
## Pattern 3: Multi-Vector Hybrid Retrieval
Real-world systems often need both text precision and cross-modal retrieval. Pattern 3 combines multiple indices and embedding types, then fuses results at query time.
For each document, you store multiple representations:
```
Document: architecture-guide.pdf, page 12
Stored vectors:
├── Text chunk embedding (BGE-M3) → "Kubernetes uses a pod network..."
├── Image caption embedding (BGE-M3) → "Diagram showing pod-to-pod communication"
└── CLIP embedding of raw image → [visual features of the diagram]
```
At query time, you search all indices and merge:
```mermaid
%%{init: {"layout": "dagre"}}%%
flowchart TB
    Q[User Query] --> TQ[Text Embedding]
    Q --> CQ[CLIP Embedding]
    TQ --> TS[Text Index Search]
    TQ --> CS1[Caption Index Search]
    CQ --> CS2[CLIP Index Search]
    TS --> FUSE[Merge + Rerank]
    CS1 --> FUSE
    CS2 --> FUSE
    FUSE --> LLM[Multimodal LLM]
    LLM --> A[Answer with Text + Images]
```
LangChain calls this the “multi-vector retriever” pattern. You retrieve text, image captions, and raw images from separate indices, then merge them before sending to the LLM.
The fusion problem: When you search three indices, you get three ranked lists. How do you merge them? Common strategies:
| Strategy | How It Works | Best For |
|---|---|---|
| Reciprocal Rank Fusion (RRF) | Combine ranks across lists, not scores | Different embedding models with incomparable scores |
| Score normalization | Normalize scores to 0-1, then weighted sum | Same embedding model across indices |
| LLM reranking | Send all candidates to an LLM for relevance scoring | Highest accuracy, highest latency |
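Reciprocal Rank Fusion is simple enough to fit in a few lines: each document scores the sum of `1/(k + rank)` across every list it appears in. A sketch (the document IDs are made up; `k=60` is the constant from the original RRF paper):

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several best-first ranked lists using ranks only, not scores."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Three indices return three different orderings of overlapping documents
text_hits    = ["doc3", "doc1", "doc7"]
caption_hits = ["doc1", "doc3", "doc9"]
clip_hits    = ["doc1", "doc9", "doc3"]

fused = reciprocal_rank_fusion([text_hits, caption_hits, clip_hits])
```

Because RRF only looks at ranks, it never has to reconcile a CLIP cosine score with a BGE-M3 cosine score, which is exactly why it suits the multi-index case.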
When to use this:
- Semi-structured documents (PDFs with text, tables, charts, screenshots)
- You need both precise text retrieval and visual similarity search
- You’re building a production system where retrieval quality justifies the complexity
## The Modality Cookbook
Each modality has its own ingestion recipe. Here’s the practical breakdown:
### Images and Diagrams
```python
# 1. Extract images from documents
images = extract_images(pdf_path)  # pdfplumber, pymupdf, unstructured

# 2. Caption each image, then store the caption as a text chunk
#    alongside a reference to the raw image
for img in images:
    caption = vision_llm.describe(img)  # Gemini, GPT-4o, Claude
    store_chunk(
        text=caption,
        metadata={"source": pdf_path, "page": img.page, "type": "image"},
        image_path=img.saved_path,  # for display in the answer UI
    )
```
Pro tip: Prompt the vision LLM with context. “Describe this diagram from a Kubernetes networking guide” produces far better captions than “Describe this image.”
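One way to wire that context into the prompt; the template, wording, and parameter names are illustrative, not a fixed API:

```python
def caption_prompt(doc_title: str, section: str) -> str:
    """Build a context-rich captioning prompt for a vision LLM."""
    return (
        f"Describe this diagram from '{doc_title}', section '{section}'. "
        "Name every component and the relationships between them "
        "(what calls what, what flows where). Ignore decorative elements."
    )

prompt = caption_prompt("Kubernetes Networking Guide", "Pod-to-Pod Communication")
```

Passing the source document's title and section costs nothing at ingestion time and pays off in every retrieval afterward.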
### Audio and Speech
```python
import whisper

# 1. Transcribe (openai-whisper returns segments with start/end times)
model = whisper.load_model("base")
result = model.transcribe(audio_path)

# 2. Chunk by natural pauses or fixed intervals (30-60 second segments)
chunks = chunk_transcript(result["segments"], max_seconds=60)

# 3. Embed and store with timestamp metadata
for chunk in chunks:
    store_chunk(
        text=chunk.text,
        metadata={
            "source": audio_path,
            "start_time": chunk.start,
            "end_time": chunk.end,
        },
    )
```
Timestamps let you link users directly to the relevant moment in the audio. “Jump to 14:32” is far more useful than “somewhere in this podcast.”
### Video
Video is audio + sampled frames. Transcribe the audio track, sample keyframes at 1-2 per minute (or on scene changes), caption those frames, and store both as text chunks with timestamps.
```
Video: product-demo.mp4 (15 min)
Stored chunks:
├── 15 transcript chunks (1 per minute of speech)
├── 20 keyframe captions (scene change detection)
└── All chunks indexed with timestamps for seek links
```
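Even sampling is the simplest keyframe policy and a fine starting point; scene-change detection (e.g. PySceneDetect) replaces it in practice. A sketch, where the 1.5-per-minute rate is just a middle value in the 1-2 range above:

```python
def keyframe_timestamps(duration_s: float, per_minute: float = 1.5) -> list[float]:
    """Evenly spaced sample times (in seconds), centered in each interval."""
    interval = 60.0 / per_minute
    t = interval / 2  # center the first sample in its window
    times = []
    while t < duration_s:
        times.append(round(t, 1))
        t += interval
    return times

stamps = keyframe_timestamps(15 * 60)  # the 15-minute demo above
```

Each timestamp then gets a frame grab, a caption, and the timestamp as metadata, exactly like the audio chunks.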
### Tables
Tables are deceptively tricky. A markdown-formatted table embeds poorly because embedding models treat it as flat text and lose the row/column relationships.
Two approaches:
- Flatten to text: “In Q1 2026, revenue was $4.2M, up 23% from Q4 2025.” Embed the natural language version.
- Store structured + summary: Keep the raw table for display, but generate a natural language summary for embedding and retrieval.
Option 2 gives better retrieval and better display. The LLM retrieves via the summary, then gets the full table in context for precise answers.
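The flattening step from option 1 can be as simple as a format template per table schema. A sketch using the revenue example above; the second row's figures and the field names are illustrative:

```python
def flatten_row(row: dict, template: str) -> str:
    """Render one table row as a natural-language sentence for embedding."""
    return template.format(**row)

rows = [
    {"quarter": "Q1 2026", "revenue": "$4.2M", "change": "up 23% from Q4 2025"},
    {"quarter": "Q4 2025", "revenue": "$3.4M", "change": "up 8% from Q3 2025"},  # made-up figures
]
template = "In {quarter}, revenue was {revenue}, {change}."
sentences = [flatten_row(r, template) for r in rows]
```

For option 2, the same sentences become the embedded summaries while the raw table travels along as metadata for display.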
## The Tooling Stack
A practical multimodal RAG stack in 2026:
| Layer | Options |
|---|---|
| Document parsing | Unstructured, pdfplumber, pymupdf |
| Image captioning | Gemini, GPT-4o, Claude, Qwen-VL |
| Audio transcription | Whisper, AssemblyAI, Deepgram |
| Text embeddings | BGE-M3, OpenAI text-embedding-3, Cohere Embed v3 |
| Multimodal embeddings | CLIP, ImageBind, SigLIP |
| Vector DB | pgvector, Pinecone, Weaviate, Qdrant, Milvus |
| Orchestration | LangChain, Haystack, LlamaIndex |
| Generation | GPT-4o, Claude, Qwen-VL (any multimodal LLM) |
My recommendation: Start with Pattern 1 using Unstructured for parsing, Whisper for audio, Gemini for image understanding, and your existing text-RAG stack for everything else. You can add CLIP and multi-vector retrieval later when you have a specific use case that demands it.
## Gotchas That Will Bite You
Bad captions poison retrieval. If your vision LLM describes a network architecture diagram as “a diagram with boxes and arrows,” that caption is useless for retrieval. Invest time in captioning prompts.
Modality imbalance skews results. If you have 10,000 text chunks and 50 image captions, text will dominate retrieval regardless of relevance. Normalize scores per modality or use separate indices.
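One cheap mitigation is a per-modality quota: take a fixed number of top hits from each index before merging, so 10,000 text chunks can't crowd out 50 captions. A sketch; the function name and quota values are illustrative:

```python
def merge_with_quotas(hits_by_modality: dict, quotas: dict) -> list:
    """Keep only the top-N hits per modality so no single index dominates."""
    merged = []
    for modality, hits in hits_by_modality.items():
        merged.extend(hits[: quotas.get(modality, 0)])
    return merged

results = {
    "text":  ["t1", "t2", "t3", "t4", "t5"],  # text index would flood the merge
    "image": ["i1", "i2"],
}
merged = merge_with_quotas(results, {"text": 3, "image": 2})
```

A reranker downstream can still reorder the merged list; the quota only guarantees every modality gets a seat at the table.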
Context windows are not infinite. Sending 5 text chunks plus 3 high-resolution images to the LLM eats context fast. A single image can consume 1,000+ tokens. Budget accordingly.
Evaluation is hard. Text retrieval has established benchmarks. Multimodal retrieval doesn’t (yet). Build domain-specific eval sets: 50-100 queries where you know which documents, images, or audio segments should be retrieved.
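The eval loop itself is small once you have those labeled queries. A sketch of recall@k over one hand-labeled query; the document IDs are made up:

```python
def recall_at_k(retrieved: list, relevant: set, k: int = 5) -> float:
    """Fraction of known-relevant items that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

# One labeled query from a 50-100 query eval set: mixed-modality results,
# with the items a human judged relevant
retrieved = ["doc7", "img2", "doc1", "doc9", "audio3"]
relevant = {"img2", "doc1", "doc4"}
score = recall_at_k(retrieved, relevant, k=5)  # finds 2 of 3 relevant items
```

Averaging this over the full query set, broken down per modality, tells you exactly which encoder to upgrade first.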
Latency compounds. Each additional modality adds encoding time. Whisper transcription on a 30-minute audio file takes 30-60 seconds. Captioning 50 images takes 2-3 minutes. Plan for async ingestion pipelines, not synchronous requests.
## The Decision Framework
Not sure which pattern to use? Start here:
```
Do your users search with text only?
├── Yes: Do you need to retrieve non-text content?
│   ├── No: You don't need multimodal RAG. Stick with text-RAG.
│   └── Yes: Use Pattern 1 (text conversion).
│       └── Are captions losing critical information?
│           ├── No: Stay with Pattern 1.
│           └── Yes: Add Pattern 2 (CLIP) for that modality.
└── No (users search with images/audio):
    └── Use Pattern 2 or Pattern 3 (multi-vector hybrid).
```
Most teams land on Pattern 1. Some graduate to Pattern 3 for production. Very few need Pattern 2 alone.
## The Bottom Line
Multimodal RAG is not a new architecture. It’s the same retrieval pipeline with different encoders plugged in at ingestion.
The pragmatic path: convert everything to text first. Captions, transcripts, and markdown tables get you surprisingly far with zero new infrastructure. When text conversion loses critical information (visual similarity, spatial relationships, audio features), add CLIP or multi-vector retrieval for that specific modality.
Don’t build Pattern 3 on day one. Build Pattern 1, measure where retrieval fails, then upgrade the modalities that need it.
Building multimodal RAG and hitting retrieval quality issues? I’d love to hear what modalities are giving you trouble. Reach out on LinkedIn.