
If your RAG pipeline misses obvious matches — a user types an exact error code, a SKU, or a function name and the system returns vaguely related fluff — you are running into the classic limits of pure vector retrieval. Hybrid search fixes this by running keyword (BM25) and vector retrieval side by side, then fusing the results into a single ranked list. The result is a retriever that catches both exact terms and semantic matches, which is what production RAG systems actually need.
This tutorial is for engineers building or maintaining a RAG application who already have basic vector search working and want to push retrieval quality higher. You will see how hybrid search works under the hood, two production-ready implementations (Qdrant and PostgreSQL with pgvector), how Reciprocal Rank Fusion combines the two result sets, and when adding a hybrid layer is a bad idea.
What Is Hybrid Search?
Hybrid search runs a sparse keyword retriever (typically BM25) and a dense vector retriever in parallel against the same corpus, then fuses the two ranked result lists into a single output. The sparse side excels at exact tokens like product codes, function names, or rare proper nouns. The dense side captures synonyms, paraphrases, and intent. Together they cover queries that either approach alone would miss.
Most production hybrid systems combine the two with Reciprocal Rank Fusion (RRF), a fusion algorithm that scores each document based on its rank in each list rather than the raw scores. Because BM25 scores and cosine similarities live on different scales, rank-based fusion sidesteps the normalization headaches that plague naive linear combinations.
Why Vector Search Alone Falls Short
Dense embeddings are powerful, but they have well-documented weaknesses. For instance, embedding models compress meaning into a fixed-dimensional vector — typically 768 to 3,072 dimensions — and tokens that appear rarely in training data tend to collapse together. As a result, queries with proper nouns, identifiers, or domain jargon often return documents that are “semantically close” but contain none of the actual terms the user typed.
Consider a customer support knowledge base. A user types ERR_BLOCKED_BY_CLIENT. Pure vector search may surface a document about generic browser errors because the error code itself sits in a region of embedding space surrounded by similar-looking strings. Meanwhile, BM25 would return the one document that mentions the exact code first. Furthermore, vector search struggles with negation, numeric values, and version-specific terminology — all common in technical content.
Why Keyword Search Alone Falls Short
Keyword retrieval has the opposite problem. BM25 cannot bridge synonyms or rephrasings. A user query “how do I cancel my account” will not match a document titled “Closing your subscription” unless someone manually maintains synonym lists. Moreover, BM25 has no concept of intent. For instance, “best Python web framework for async” and “fastest async Python framework” should retrieve overlapping documents, but BM25 treats them as different bags of tokens.
In practice, sparse retrieval also degrades when chunks are short. A 200-token chunk has limited term frequency signal, which is exactly what BM25 depends on. Hybrid search compensates by letting the dense retriever pick up where sparse runs out of evidence.
How Hybrid Search Works: The Core Algorithms
You need a fusion algorithm because BM25 scores and cosine similarities are not comparable. BM25 produces unbounded positive numbers shaped by term frequency and inverse document frequency. Cosine similarity returns a value between -1 and 1. Multiplying or averaging them directly produces meaningless rankings.
Reciprocal Rank Fusion (RRF)
RRF is the default choice for production hybrid systems. For each document d that appears in either result list, compute:
RRF_score(d) = sum over retrievers r of (1 / (k + rank_r(d)))
The constant k is typically 60 (the original paper’s recommendation). Documents that rank high in either list get a strong score. Documents that rank well in both get an even higher score. Documents missing from one list still contribute through the other, which is the whole point.
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
"""Combine ranked lists of document IDs into a single ranking using RRF."""
scores: dict[str, float] = {}
for results in result_lists:
for rank, doc_id in enumerate(results, start=1):
scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
return sorted(scores.items(), key=lambda x: x[1], reverse=True)
This works because rank-based scoring is scale-invariant. As a result, you avoid the trap of trying to normalize wildly different score distributions across retrievers.
Convex Linear Combination
The alternative is alpha * vector_score + (1 - alpha) * keyword_score, where both scores are first normalized (usually min-max scaled). This approach gives you a tuning knob — push alpha toward 1.0 for more semantic, toward 0.0 for more lexical. However, it requires careful score normalization, and the optimal alpha differs per query type. For most teams, RRF wins on simplicity and robustness.
Implementation: Hybrid Search With Qdrant
Qdrant’s Query API supports hybrid search natively as of version 1.10, including server-side RRF fusion. You upload both dense and sparse vectors per point, then query both in one call.
First, install the client and a sparse encoder:
pip install qdrant-client fastembed
Then create a collection with two named vector spaces — one dense, one sparse — and index a few documents:
from qdrant_client import QdrantClient
from qdrant_client.models import (
VectorParams, SparseVectorParams, Distance,
PointStruct, SparseVector, NamedVector
)
from fastembed import TextEmbedding, SparseTextEmbedding
client = QdrantClient(url="http://localhost:6333")
dense_model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")
sparse_model = SparseTextEmbedding(model_name="Qdrant/bm25")
client.recreate_collection(
collection_name="kb",
vectors_config={"dense": VectorParams(size=384, distance=Distance.COSINE)},
sparse_vectors_config={"sparse": SparseVectorParams()},
)
docs = [
"Reset your password from the account settings page.",
"ERR_BLOCKED_BY_CLIENT means an ad blocker is interfering with the request.",
"Cancel your subscription anytime from the billing dashboard.",
]
dense_vecs = list(dense_model.embed(docs))
sparse_vecs = list(sparse_model.embed(docs))
points = [
PointStruct(
id=i,
vector={
"dense": dense_vecs[i].tolist(),
"sparse": SparseVector(
indices=sparse_vecs[i].indices.tolist(),
values=sparse_vecs[i].values.tolist(),
),
},
payload={"text": docs[i]},
)
for i in range(len(docs))
]
client.upsert(collection_name="kb", points=points)
Now run a hybrid query. Qdrant’s query_points accepts a prefetch block where you list each retriever and the number of candidates it should return, then specify a fusion strategy:
from qdrant_client.models import Prefetch, Fusion, FusionQuery
query = "ad blocker stopping requests"
q_dense = list(dense_model.embed([query]))[0].tolist()
q_sparse = list(sparse_model.embed([query]))[0]
results = client.query_points(
collection_name="kb",
prefetch=[
Prefetch(
query=q_dense,
using="dense",
limit=20,
),
Prefetch(
query=SparseVector(
indices=q_sparse.indices.tolist(),
values=q_sparse.values.tolist(),
),
using="sparse",
limit=20,
),
],
query=FusionQuery(fusion=Fusion.RRF),
limit=5,
).points
for r in results:
print(r.score, r.payload["text"])
The dense retriever picks up “stopping requests” as semantically related to “blocked”, while the sparse retriever locks onto the exact ERR_BLOCKED_BY_CLIENT token. RRF promotes the document that ranks well in both lists.
Implementation: Hybrid Search With PostgreSQL
If you are already running Postgres for your application, you can implement hybrid search without adding a new system. The recipe combines pgvector for dense retrieval with Postgres’s built-in tsvector and ts_rank_cd for keyword retrieval. If you want a deeper dive on the keyword side, see our guide on PostgreSQL full-text search vs Elasticsearch vs Algolia.
Set up the table with both columns:
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE documents (
id BIGSERIAL PRIMARY KEY,
content TEXT NOT NULL,
embedding vector(384) NOT NULL,
tsv tsvector GENERATED ALWAYS AS (to_tsvector('english', content)) STORED
);
CREATE INDEX documents_embedding_idx
ON documents USING hnsw (embedding vector_cosine_ops);
CREATE INDEX documents_tsv_idx ON documents USING GIN (tsv);
Then run two queries and combine them in application code with RRF. Postgres can also do this in pure SQL using CTEs:
WITH dense AS (
SELECT id, ROW_NUMBER() OVER (ORDER BY embedding <=> $1) AS rank
FROM documents
ORDER BY embedding <=> $1
LIMIT 20
),
sparse AS (
SELECT id, ROW_NUMBER() OVER (ORDER BY ts_rank_cd(tsv, query) DESC) AS rank
FROM documents, plainto_tsquery('english', $2) query
WHERE tsv @@ query
ORDER BY ts_rank_cd(tsv, query) DESC
LIMIT 20
)
SELECT id,
COALESCE(1.0 / (60 + dense.rank), 0) + COALESCE(1.0 / (60 + sparse.rank), 0) AS rrf
FROM dense FULL OUTER JOIN sparse USING (id)
ORDER BY rrf DESC
LIMIT 5;
The $1 parameter is the query embedding as a vector literal, and $2 is the raw query text. This single statement does what a dedicated vector database does, with no extra infrastructure. For more on building RAG with Postgres, the RAG from scratch guide walks through the full stack.
Add a Reranker on Top
Hybrid search gives you a much better top-20 than either retriever alone, but the ordering within that top-20 is still noisy. Production systems typically add a cross-encoder reranker as a third stage. The pattern looks like this:
- Run hybrid search and get the top 20 to 50 candidates
- Pass each
(query, document)pair through a cross-encoder model - Re-sort by the cross-encoder score and keep the top 3 to 5 for the LLM
from sentence_transformers import CrossEncoder
reranker = CrossEncoder("BAAI/bge-reranker-base")
def rerank(query: str, candidates: list[dict], top_k: int = 5) -> list[dict]:
pairs = [[query, c["text"]] for c in candidates]
scores = reranker.predict(pairs)
ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
return [c for c, _ in ranked[:top_k]]
The cross-encoder reads query and document together, which catches subtle matches that bi-encoder vector search cannot. The cost is latency — typically 50 to 200ms for 20 pairs on a GPU, longer on CPU — and an extra model to deploy. For high-stakes retrieval, the precision gain is worth it.
Real-World Scenario: Internal Engineering Docs
Consider a mid-sized engineering organization with a few thousand internal documents — runbooks, postmortems, design docs, and onboarding pages. A pure vector RAG system typically performs well on conceptual questions like “how does our authentication flow work” but breaks down on specific lookups like “what is the timeout for the payments-service health check” or “which runbook covers SEV-2 incidents in eu-west-1”.
A common pattern in this kind of corpus is that documents reference specific service names, region codes, error codes, and ticket IDs. These tokens are exactly what BM25 handles best and what dense embeddings tend to smear. After switching to hybrid retrieval, teams typically see the most improvement on the long tail of identifier-heavy queries that pure vector search was quietly failing on. The conceptual queries continue to work because the dense side still contributes.
The trade-off is added complexity. You now have two retrievers to monitor, two indexes to rebuild on corpus changes, and a fusion stage to debug when relevance drops. For small corpora — say under 10,000 chunks — the operational cost often outweighs the gain. Hybrid search shines when corpus diversity is high and query patterns include both natural language and structured tokens.
When to Use Hybrid Search
- Your corpus contains identifiers, error codes, product names, or version strings that users search for verbatim
- Query intent is mixed — some users ask conceptual questions, others type exact phrases
- Pure vector retrieval misses obvious keyword matches in your evaluation set
- You have a moderate-to-large corpus (10,000+ chunks) where retrieval quality directly affects answer correctness
- You are already comfortable running and monitoring two retrievers in production
When NOT to Use Hybrid Search
- Your corpus is small (under 5,000 chunks) and dense retrieval already hits acceptable recall
- Queries are exclusively conversational — no identifiers, codes, or rare tokens
- You have not yet measured retrieval quality with a labeled eval set; adding hybrid before measuring is premature optimization
- Operational simplicity matters more than the last 5 percent of relevance — for example, a side project or early prototype
- Your stack does not yet support sparse indexes natively, and adding a second system would dominate your engineering budget
Common Mistakes With Hybrid Search
- Normalizing scores before fusion when using RRF. RRF operates on ranks, not scores. Normalization adds noise without benefit.
- Setting
ktoo low in RRF. Values below 30 cause the fusion to over-weight the very top of each list. The standardk=60is a reasonable default for most corpora. - Skipping evaluation. Switching from vector-only to hybrid without measuring on a labeled query set means you cannot tell if you actually improved anything. Build an eval set first.
- Using the same chunk size for both retrievers. Dense retrieval tolerates larger chunks (500 to 1,000 tokens); BM25 prefers smaller ones. Some teams index the same content at two granularities.
- Forgetting to update both indexes on writes. A document added to the vector index but missing from the BM25 index will silently degrade hybrid recall.
- Tuning
alphain linear combinations on a tiny eval set. The optimal weight is highly query-dependent, and small eval sets produce unreliable weights. Either use RRF or invest in a proper eval set with hundreds of labeled queries.
Hybrid Search vs Pure Vector vs Pure Keyword
| Approach | Strength | Weakness | Typical Use |
|---|---|---|---|
| Pure vector | Captures synonyms, intent, paraphrases | Misses exact identifiers, rare tokens | Conversational corpora, FAQs |
| Pure keyword (BM25) | Exact matches, rare terms, identifiers | No synonyms or intent understanding | Logs, code search, ID lookup |
| Hybrid (RRF) | Covers both patterns, robust defaults | More infra, two indexes to maintain | Production RAG with mixed queries |
| Hybrid + reranker | Highest precision in top-5 | Extra latency, GPU cost | High-stakes Q&A, customer-facing RAG |
For more on the broader retrieval choices, see our guide on vector databases compared. If you are still deciding whether RAG is even the right approach for your use case, the fine-tuning vs RAG comparison covers when each wins.
Conclusion
Hybrid search is one of the highest-leverage upgrades you can make to a RAG pipeline once basic vector retrieval is in place. The setup cost is modest — a second index and a fusion step — and the gain on identifier-heavy queries is often dramatic. Start with Reciprocal Rank Fusion at k=60 and a 20-result prefetch per retriever; this default works for most corpora without tuning. Then measure on a labeled eval set before adding complexity like learned weights or rerankers.
If your retrieval is still struggling after hybrid, the next two places to look are chunking and reranking. Our deep dive on RAG chunking strategies covers how chunk size and overlap interact with retrieval quality, and the cross-encoder reranker pattern above is the natural follow-up once your candidate pool is good but your top-5 is noisy. Treat hybrid search as the retrieval foundation, not the finish line.