Dense vs Sparse Retrieval: Why Your Vector-Only Search Is Missing 25% of Results

· 6 min read ·
AI · RAG · Search · Embeddings · Information Retrieval

Everyone switched to vector search. Embeddings feel like magic. You encode a query, find the nearest neighbors, and suddenly your search understands meaning instead of just matching keywords.

Here’s what nobody mentions: pure dense retrieval misses about 25% of relevant documents. Search for “ERR_CONNECTION_REFUSED” and your embedding model returns results about “network connectivity issues.” Search for a specific drug name or legal clause identifier and the vectors helpfully return semantically similar but factually wrong passages.

I touched on this briefly in my RAG for Dummies guide. But the dense vs sparse decision deserves a deeper look, because it’s the single most impactful architectural choice in any retrieval system. Get it wrong and no amount of reranking or prompt tuning saves you.

Two Ways to Represent Text

The fundamental difference is how each approach turns text into something searchable.

Sparse retrieval (BM25, TF-IDF, SPLADE) represents text as high-dimensional vectors where each dimension maps to a vocabulary term. Most entries are zero, hence “sparse.” A document about “PostgreSQL connection pooling” gets high weights on those exact tokens and zero everywhere else.

Dense retrieval (DPR, sentence-transformers, OpenAI embeddings) maps text into low-dimensional continuous vectors (384-1024 dimensions) where every dimension has a non-zero value. Meaning is distributed across all dimensions. “PostgreSQL connection pooling” and “database connection management” end up near each other in vector space.
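A toy illustration of the two representations (real sparse vectors span the whole vocabulary, and real embeddings have hundreds of dimensions; these values are made up):

```python
# Sparse: conceptually vocabulary-sized, but only non-zero terms are stored,
# so a dict of term -> weight is the natural in-memory representation.
sparse_vec = {"postgresql": 2.1, "connection": 1.4, "pooling": 1.8}

# Dense: every dimension is populated; meaning is smeared across all of them.
# (Toy 8-dim vector standing in for the usual 384-1024 dimensions.)
dense_vec = [0.12, -0.43, 0.88, 0.05, -0.17, 0.36, -0.29, 0.51]

# A term absent from the sparse vector contributes exactly zero...
print(sparse_vec.get("database", 0.0))  # 0.0
# ...whereas a dense vector has no notion of "this term is missing".
print(len(dense_vec))  # 8
```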

| | Sparse (BM25) | Dense (Embeddings) |
|---|---|---|
| Vector size | Vocabulary-sized (~30K-100K dims) | Fixed (384-1024 dims) |
| Non-zero entries | Few (only matching terms) | All dimensions |
| What it captures | Exact lexical overlap | Semantic similarity |
| Interpretability | High (term-level scores) | Low (opaque dimensions) |
| Training data needed | None | Thousands of labeled pairs |

The key insight: sparse retrieval knows exactly which words matter. Dense retrieval knows what words mean. Neither is complete on its own.

Where BM25 Still Wins

Sparse retrieval dominates in scenarios most teams encounter daily but rarely benchmark against:

Exact entity matching. Search for HIPAA_164.312(a)(1) in a compliance database. BM25 finds it instantly because it matches the exact token. Dense retrieval encodes it into a generic “healthcare regulation” neighborhood, returning HIPAA sections about patient rights, privacy notices, and a dozen other clauses that aren’t what you need.

Rare tokens and codes. Error codes, part numbers, chemical names, API endpoint paths. Anything that appears infrequently in training data gets poorly represented in embedding space. BM25 doesn’t care how rare a term is. If it’s in the document, it matches.

Zero-shot performance. BM25 needs no training data. No labeled pairs, no fine-tuning, no GPU inference. Corpus statistics (term frequency, document frequency) are computed at index time. For a new domain with no labeled data, BM25 provides a robust baseline that many fine-tuned dense retrievers struggle to beat.

Debuggability. When BM25 returns a wrong result, you can inspect exactly which terms contributed to the score and adjust analyzers, stopwords, or stemming. When dense retrieval fails, you’re staring at 768 opaque floating-point numbers.
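To make the zero-shot point concrete, here is a minimal BM25 scorer (a toy sketch, not the tuned Lucene implementation; `k1` and `b` are the usual defaults, and all corpus statistics come straight from the documents, with no training):

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against a query. corpus is a list of
    tokenized documents, used only for term and length statistics."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # document frequency
        if df == 0:
            continue
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        f = tf[term]  # term frequency in this document
        norm = 1 - b + b * len(doc_terms) / avgdl  # length normalization
        score += idf * (f * (k1 + 1)) / (f + k1 * norm)
    return score
```

A rare exact token like an error code scores highly in the one document that contains it and exactly zero everywhere else, which is precisely the behavior dense retrieval struggles to reproduce.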

Where Dense Retrieval Wins

Dense retrieval shines when meaning diverges from wording:

Paraphrase handling. “How do I reduce API response times?” and “optimize server latency for REST endpoints” share almost no tokens but mean the same thing. Dense retrieval captures this. BM25 returns nothing.

Conversational queries. Users don’t type keyword queries into chatbots. They ask “why is my deployment taking forever?” Dense retrieval maps this to documentation about build optimization, CI/CD bottlenecks, and Docker layer caching. BM25 matches documents containing “deployment,” “taking,” and “forever,” which is rarely useful.

Cross-lingual retrieval. Multilingual encoders place “base de datos” and “database” near each other in embedding space. BM25 treats them as completely unrelated tokens.
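The paraphrase gap is easy to see by checking raw token overlap between the two example queries from above:

```python
q1 = set("how do i reduce api response times".split())
q2 = set("optimize server latency for rest endpoints".split())

# Zero shared tokens: lexical scoring has literally nothing to match on,
# while a dense encoder places both queries in the same neighborhood.
print(q1 & q2)  # set()
```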

```mermaid
%%{init: {"layout": "dagre"}}%%
flowchart TB
    Q["Search for 'reduce latency'"] --> BM25[BM25]
    Q --> ENC[Encoder]

    subgraph Sparse ["Sparse Path"]
        BM25 --> R1["Docs containing\n'reduce' AND 'latency'"]
    end

    subgraph Dense ["Dense Path"]
        ENC --> ANN[ANN Search]
        ANN --> R2["Docs about performance\noptimization, response times,\ncaching strategies"]
    end
```

Why this matters: dense retrieval finds documents that BM25 structurally cannot find. The semantic gap between query vocabulary and document vocabulary is real, and it grows wider with natural-language interfaces.

The Infrastructure Trade-off

This isn’t just about accuracy. The operational differences are substantial.

| Factor | Sparse (BM25) | Dense (Embeddings) |
|---|---|---|
| Index type | Inverted index | ANN (HNSW, IVF, PQ) |
| Query latency | < 10ms | 10-50ms (plus encoding) |
| GPU required | No | Yes (encoding) or high-CPU |
| Storage per doc | Postings list (compressed) | 768 float32s ≈ 3KB per vector |
| Mature tooling | Elasticsearch, OpenSearch, Lucene | FAISS, Milvus, Pinecone, Qdrant |
| Scaling to 1B docs | Battle-tested | Requires quantization + sharding |

Dense retrieval adds two costs that sparse retrieval avoids entirely: neural model inference at query time (encoding the query into a vector) and storing a fixed-size vector for every document. For a corpus of 100M documents at 768 dimensions, that’s roughly 300GB of vector storage before any compression.
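The vector-storage figure is easy to verify, assuming float32 embeddings and no compression:

```python
docs = 100_000_000       # corpus size
dims = 768               # embedding dimensions
bytes_per_float = 4      # float32

total_bytes = docs * dims * bytes_per_float
print(f"{total_bytes / 1e9:.0f} GB")  # 307 GB, before quantization or compression
```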

Sparse inverted indexes, by contrast, store only non-zero postings. Compression techniques like delta encoding and variable-byte encoding shrink them further. Lucene-based engines like Elasticsearch have served BM25 at web scale for years.

The bottom line on infra: if you’re building a RAG system on a startup budget with no GPUs, BM25 on Elasticsearch or pgvector with keyword search gets you remarkably far.

Hybrid Retrieval: The Actual Answer

In practice, neither approach alone is good enough. Hybrid retrieval runs both and merges results, and it consistently outperforms either standalone retriever across heterogeneous benchmarks like BEIR.

The numbers from production systems I’ve seen: BM25-only hits roughly 75% recall. Dense-only reaches about 80%. Hybrid pushes past 90%. That extra 10-15% often contains the exact document the user needed.

Fusion Strategies

Late score fusion is the simplest approach. Run both retrievers independently, normalize scores, combine with a weighted sum:

```python
def min_max_normalize(results):
    """Rescale (doc_id, score) pairs so scores land in [0, 1]."""
    scores = [s for _, s in results]
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0  # avoid division by zero when all scores are equal
    return [(doc_id, (s - lo) / span) for doc_id, s in results]

def hybrid_search(query, alpha=0.7):
    # alpha controls the dense vs sparse weight
    dense_results = dense_search(query, k=50)   # your dense retriever
    sparse_results = bm25_search(query, k=50)   # your BM25 retriever

    # Normalize scores to [0, 1] so the two scales are comparable
    dense_norm = min_max_normalize(dense_results)
    sparse_norm = min_max_normalize(sparse_results)

    # Weighted combination
    combined = {}
    for doc_id, score in dense_norm:
        combined[doc_id] = alpha * score
    for doc_id, score in sparse_norm:
        combined[doc_id] = combined.get(doc_id, 0.0) + (1 - alpha) * score

    return sorted(combined.items(), key=lambda x: x[1], reverse=True)
```

Tuning alpha matters. For legal search where exact clause matching is critical, I’d push alpha toward 0.3 (more sparse weight). For a conversational FAQ bot, 0.7 or higher (more dense weight).

Candidate pre-filtering uses BM25 as a fast first pass, then applies dense similarity or cross-encoder reranking on the reduced candidate set. This keeps latency low because ANN search over 100 pre-filtered candidates is cheaper than searching millions of vectors.

```mermaid
%%{init: {"layout": "dagre"}}%%
flowchart LR
    Q[Query] --> BM25[BM25: Top 100]
    BM25 --> RE[Dense Rerank]
    RE --> TOP[Top 5 Results]
```
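A sketch of that two-stage pipeline, assuming the BM25 candidate IDs and precomputed document embeddings are already available (the helper names here are illustrative, not a specific library's API):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def rerank(query_vec, candidate_ids, doc_vecs, top_k=5):
    # candidate_ids: output of the fast BM25 first pass (e.g. top 100)
    # doc_vecs: doc_id -> precomputed dense vector
    scored = [(doc_id, cosine(query_vec, doc_vecs[doc_id]))
              for doc_id in candidate_ids]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

Scoring 100 candidates this way is trivially cheap next to an ANN search over millions of vectors; swapping `cosine` for a cross-encoder call gives the higher-precision (and slower) variant.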

When to Use What

| Your Use Case | Recommended Approach | Why |
|---|---|---|
| Code search, error lookups | Sparse-heavy hybrid (alpha 0.3) | Exact tokens matter most |
| FAQ / conversational search | Dense-heavy hybrid (alpha 0.7) | Semantic matching dominates |
| Legal / compliance docs | Sparse-heavy with reranker | Can’t miss exact clauses |
| General knowledge base | Balanced hybrid (alpha 0.5) | Mixed query types |
| Multilingual corpus | Dense-heavy hybrid | Cross-lingual embeddings |
| No labeled data, new domain | BM25 baseline first | No training required |

Neural Sparse: The Middle Ground

Worth mentioning: neural sparse models like SPLADE use transformers to produce sparse, vocabulary-aligned vectors, but with learned term weights instead of raw TF-IDF statistics. They can expand queries with related terms the user didn’t type, while still using inverted indexes for fast retrieval.

SPLADE closes much of the gap between BM25 and dense retrieval while keeping the interpretability and infrastructure simplicity of sparse search. If you want better-than-BM25 without committing to a full vector database, neural sparse models are worth evaluating.

Recent research scaling both paradigms on decoder-only LLMs (Llama-3 variants) found that sparse retrieval with contrastive training often outperforms dense retrieval under the same compute budget. The paradigm choice isn’t settled. It depends on your training data, model scale, and query distribution.

The Bottom Line

Dense retrieval understands meaning. Sparse retrieval understands words. Your users need both.

Start with BM25. It’s free, fast, and battle-tested. Add dense retrieval when you hit the semantic gap: users asking natural-language questions that don’t share vocabulary with your documents. Run them together as hybrid search, tune the interpolation weight for your domain, and add a cross-encoder reranker if precision matters more than speed.

The teams shipping the best RAG systems aren’t picking one paradigm. They’re combining both and tuning the blend.


Building retrieval systems and debating dense vs sparse? I’d love to hear what’s working for your use case. Reach out on LinkedIn.