Hybrid Search in RAG: Combining Keyword and Vector Retrieval

If your RAG pipeline misses obvious matches — a user types an exact error code, a SKU, or a function name and the system returns vaguely related fluff — you are running into the classic limits of pure vector retrieval. Hybrid search fixes this by running keyword (BM25) and vector retrieval side by side, then fusing the results into a single ranked list. The result is a retriever that catches both exact terms and semantic matches, which is what production RAG systems actually need.

This tutorial is for engineers building or maintaining a RAG application who already have basic vector search working and want to push retrieval quality higher. You will see how hybrid search works under the hood, two production-ready implementations (Qdrant and PostgreSQL with pgvector), how Reciprocal Rank Fusion combines the two result sets, and when adding a hybrid layer is a bad idea.

Hybrid search runs a sparse keyword retriever (typically BM25) and a dense vector retriever in parallel against the same corpus, then fuses the two ranked result lists into a single output. The sparse side excels at exact tokens like product codes, function names, or rare proper nouns. The dense side captures synonyms, paraphrases, and intent. Together they cover queries that either approach alone would miss.

Most production hybrid systems combine the two with Reciprocal Rank Fusion (RRF), a fusion algorithm that scores each document based on its rank in each list rather than the raw scores. Because BM25 scores and cosine similarities live on different scales, rank-based fusion sidesteps the normalization headaches that plague naive linear combinations.

Why Vector Search Alone Falls Short

Dense embeddings are powerful, but they have well-documented weaknesses. For instance, embedding models compress meaning into a fixed-dimensional vector, typically a few hundred to a few thousand dimensions (384 for the small model used later in this tutorial, up to 3,072 for larger ones), and tokens that appear rarely in training data tend to collapse together. As a result, queries with proper nouns, identifiers, or domain jargon often return documents that are “semantically close” but contain none of the actual terms the user typed.

Consider a customer support knowledge base. A user types ERR_BLOCKED_BY_CLIENT. Pure vector search may surface a document about generic browser errors because the error code itself sits in a region of embedding space surrounded by similar-looking strings. Meanwhile, BM25 would return the one document that mentions the exact code first. Furthermore, vector search struggles with negation, numeric values, and version-specific terminology — all common in technical content.

Why Keyword Search Alone Falls Short

Keyword retrieval has the opposite problem. BM25 cannot bridge synonyms or rephrasings. A user query “how do I cancel my account” will not match a document titled “Closing your subscription” unless someone manually maintains synonym lists. Moreover, BM25 has no concept of intent. For instance, “best Python web framework for async” and “fastest async Python framework” should retrieve overlapping documents, but BM25 treats them as different bags of tokens.

In practice, sparse retrieval also degrades when chunks are short. A 200-token chunk has limited term frequency signal, which is exactly what BM25 depends on. Hybrid search compensates by letting the dense retriever pick up where sparse runs out of evidence.
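
To make the synonym gap concrete, here is a minimal sketch of Okapi BM25 scoring in plain Python. The whitespace tokenizer, the k1 and b values, and the two toy documents are assumptions chosen for illustration; a real engine adds stemming, stop-word handling, and an inverted index.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Naive lowercase whitespace tokenizer (an assumption for this sketch).
    return text.lower().split()


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with the Okapi BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # no lexical overlap means no contribution
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


docs = [
    "closing your subscription",
    "cancel my account from the billing page",
]
scores = bm25_scores("how do i cancel my account", docs)
print(scores)
```

The first document shares no tokens with the query, so its score is exactly zero even though it answers the question. That is the gap the dense retriever is there to cover.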

How Hybrid Search Works: The Core Algorithms

You need a fusion algorithm because BM25 scores and cosine similarities are not comparable. BM25 produces unbounded positive numbers shaped by term frequency and inverse document frequency. Cosine similarity returns a value between -1 and 1. Multiplying or averaging them directly produces meaningless rankings.

Reciprocal Rank Fusion (RRF)

RRF is the default choice for production hybrid systems. For each document d that appears in either result list, compute:

RRF_score(d) = sum over retrievers r of (1 / (k + rank_r(d)))

The constant k is typically 60 (the original paper’s recommendation). Documents that rank high in either list get a strong score, and documents that rank well in both get an even higher one. A document that is missing from one list simply receives no term for that retriever, so it can still place on the strength of the list it does appear in, which is the whole point.

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Combine ranked lists of document IDs into a single ranking using RRF."""
    scores: dict[str, float] = {}

    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)

    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

This works because rank-based scoring is scale-invariant. As a result, you avoid the trap of trying to normalize wildly different score distributions across retrievers.
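
A small worked example may help show how the arithmetic plays out. The function is repeated so the snippet runs on its own, and the document IDs and the two ranked lists are made up for illustration.

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Combine ranked lists of document IDs into a single ranking using RRF."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)


# Hypothetical outputs of the two retrievers, best match first.
dense_hits = ["doc_a", "doc_b", "doc_c"]
sparse_hits = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.6f}")

# doc_a: 1/61 + 1/62  (rank 1 dense, rank 2 sparse)
# doc_c: 1/63 + 1/61  (rank 3 dense, rank 1 sparse)
# doc_b: 1/62         (dense only)
# doc_d: 1/63         (sparse only)
# Resulting order: doc_a, doc_c, doc_b, doc_d
```

Note that doc_a and doc_c, which appear in both lists, score roughly twice as high as the single-list documents. With k=60, agreement between retrievers matters more than the exact position within one list.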

Convex Linear Combination

The alternative is alpha * vector_score + (1 - alpha) * keyword_score, where both scores are first normalized (usually min-max scaled). This approach gives you a tuning knob — push alpha toward 1.0 for more semantic, toward 0.0 for more lexical. However, it requires careful score normalization, and the optimal alpha differs per query type. For most teams, RRF wins on simplicity and robustness.
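
For comparison, the convex combination can be sketched as follows. The helper names min_max and convex_fusion, the handling of a constant score list, and the sample scores are all assumptions for illustration; documents missing from one retriever are treated as scoring 0 there.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale raw scores to the [0, 1] range."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores identical: treat every hit as a full match (a design choice).
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def convex_fusion(
    vector_scores: dict[str, float],
    keyword_scores: dict[str, float],
    alpha: float = 0.5,
) -> list[tuple[str, float]]:
    """Fuse with alpha * vector_score + (1 - alpha) * keyword_score after normalization."""
    v = min_max(vector_scores)
    kw = min_max(keyword_scores)
    fused = {
        doc_id: alpha * v.get(doc_id, 0.0) + (1 - alpha) * kw.get(doc_id, 0.0)
        for doc_id in v.keys() | kw.keys()
    }
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)


# Hypothetical raw scores: cosine similarities and BM25 scores live on different scales.
vector_scores = {"a": 0.82, "b": 0.79, "c": 0.40}
keyword_scores = {"c": 12.4, "a": 3.1, "d": 1.0}

balanced = convex_fusion(vector_scores, keyword_scores, alpha=0.5)
lexical_only = convex_fusion(vector_scores, keyword_scores, alpha=0.0)
semantic_only = convex_fusion(vector_scores, keyword_scores, alpha=1.0)
print(balanced)
```

Min-max scaling is sensitive to outliers: one unusually high BM25 score squeezes every other keyword score toward zero, which is one reason the best alpha shifts from query to query.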

Implementation: Hybrid Search With Qdrant

Qdrant’s Query API supports hybrid search natively as of version 1.10, including server-side RRF fusion. You upload both dense and sparse vectors per point, then query both in one call.

First, install the client and a sparse encoder:

pip install qdrant-client fastembed

Then create a collection with two named vector spaces — one dense, one sparse — and index a few documents:

from qdrant_client import QdrantClient
from qdrant_client.models import (
    VectorParams, SparseVectorParams, Distance,
    PointStruct, SparseVector, Modifier
)
from fastembed import TextEmbedding, SparseTextEmbedding

client = QdrantClient(url="http://localhost:6333")

dense_model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")
sparse_model = SparseTextEmbedding(model_name="Qdrant/bm25")

# recreate_collection is deprecated in recent clients, so drop and create explicitly.
if client.collection_exists("kb"):
    client.delete_collection("kb")

client.create_collection(
    collection_name="kb",
    vectors_config={"dense": VectorParams(size=384, distance=Distance.COSINE)},
    # The IDF modifier makes the server apply inverse document frequency to the sparse weights.
    sparse_vectors_config={"sparse": SparseVectorParams(modifier=Modifier.IDF)},
)

docs = [
    "Reset your password from the account settings page.",
    "ERR_BLOCKED_BY_CLIENT means an ad blocker is interfering with the request.",
    "Cancel your subscription anytime from the billing dashboard.",
]

dense_vecs = list(dense_model.embed(docs))
sparse_vecs = list(sparse_model.embed(docs))

points = [
    PointStruct(
        id=i,
        vector={
            "dense": dense_vecs[i].tolist(),
            "sparse": SparseVector(
                indices=sparse_vecs[i].indices.tolist(),
                values=sparse_vecs[i].values.tolist(),
            ),
        },
        payload={"text": docs[i]},
    )
    for i in range(len(docs))
]
client.upsert(collection_name="kb", points=points)

Now run a hybrid query. Qdrant’s query_points accepts a prefetch block where you list each retriever and the number of candidates it should return, then specify a fusion strategy:

from qdrant_client.models import Prefetch, Fusion, FusionQuery

query = "ad blocker stopping requests"
q_dense = list(dense_model.embed([query]))[0].tolist()
q_sparse = list(sparse_model.embed([query]))[0]

results = client.query_points(
    collection_name="kb",
    prefetch=[
        Prefetch(
            query=q_dense,
            using="dense",
            limit=20,
        ),
        Prefetch(
            query=SparseVector(
                indices=q_sparse.indices.tolist(),
                values=q_sparse.values.tolist(),
            ),
            using="sparse",
            limit=20,
        ),
    ],
    query=FusionQuery(fusion=Fusion.RRF),
    limit=5,
).points

for r in results:
    print(r.score, r.payload["text"])

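The dense retriever picks up “stopping requests” as semantically related to “interfering with the request”, while the sparse retriever matches the literal tokens “ad” and “blocker”. Had the user typed ERR_BLOCKED_BY_CLIENT instead, the sparse side would be the one most likely to surface that document. RRF promotes the document that ranks well in both lists.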

Implementation: Hybrid Search With PostgreSQL

If you are already running Postgres for your application, you can implement hybrid search without adding a new system. The recipe combines pgvector for dense retrieval with Postgres’s built-in tsvector and ts_rank_cd for keyword retrieval. If you want a deeper dive on the keyword side, see our guide on PostgreSQL full-text search vs Elasticsearch vs Algolia.

Set up the table with both columns:

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
    id BIGSERIAL PRIMARY KEY,
    content TEXT NOT NULL,
    embedding vector(384) NOT NULL,
    tsv tsvector GENERATED ALWAYS AS (to_tsvector('english', content)) STORED
);

CREATE INDEX documents_embedding_idx
    ON documents USING hnsw (embedding vector_cosine_ops);

CREATE INDEX documents_tsv_idx ON documents USING GIN (tsv);

You can run the two queries separately and fuse them in application code with RRF, or do the whole thing in a single SQL statement using CTEs:

WITH dense AS (
    SELECT id, ROW_NUMBER() OVER (ORDER BY embedding <=> $1) AS rank
    FROM documents
    ORDER BY embedding <=> $1
    LIMIT 20
),
sparse AS (
    SELECT id, ROW_NUMBER() OVER (ORDER BY ts_rank_cd(tsv, query) DESC) AS rank
    FROM documents, plainto_tsquery('english', $2) query
    WHERE tsv @@ query
    ORDER BY ts_rank_cd(tsv, query) DESC
    LIMIT 20
)
SELECT id,
       COALESCE(1.0 / (60 + dense.rank), 0) + COALESCE(1.0 / (60 + sparse.rank), 0) AS rrf
FROM dense FULL OUTER JOIN sparse USING (id)
ORDER BY rrf DESC
LIMIT 5;

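The $1 parameter is the query embedding as a vector literal (add an explicit $1::vector cast if your driver sends it as text), and $2 is the raw query text. Note that ts_rank_cd is a cover-density ranking rather than true BM25, but because RRF only consumes ranks, it slots into the same fusion. This single statement gives you hybrid retrieval with no extra infrastructure. For more on building RAG with Postgres, the RAG from scratch guide walks through the full stack.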
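
On the application side, the main chore is turning the embedding into pgvector's text form, which looks like [0.1,0.2,0.3]. The sketch below shows one way to prepare the two parameters; the helper names to_vector_literal and hybrid_params are hypothetical, and the 384 default simply mirrors the vector(384) column defined above. Placeholder syntax varies by driver (some use $1, others %s), so adapt the SQL to whichever one you use.

```python
def to_vector_literal(embedding: list[float]) -> str:
    """Format a sequence of numbers as a pgvector text literal such as '[0.1,0.2,0.3]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


def hybrid_params(embedding: list[float], query_text: str, expected_dim: int = 384) -> tuple[str, str]:
    """Build the ($1, $2) parameter pair for the hybrid SQL statement."""
    if len(embedding) != expected_dim:
        # Catch a model/column mismatch before the database rejects the query.
        raise ValueError(f"expected {expected_dim} dimensions, got {len(embedding)}")
    if not query_text.strip():
        raise ValueError("query text must not be empty")
    return to_vector_literal(embedding), query_text


# Tiny 3-dimensional example so the output is readable.
params = hybrid_params([0.1, 2, -0.5], "ad blocker stopping requests", expected_dim=3)
print(params)
```

Validating the dimension up front gives a clearer error than the one the database raises when the literal does not match the column type.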

Add a Reranker on Top

Hybrid search gives you a much better top-20 than either retriever alone, but the ordering within that top-20 is still noisy. Production systems typically add a cross-encoder reranker as a third stage. The pattern looks like this:

  1. Run hybrid search and get the top 20 to 50 candidates
  2. Pass each (query, document) pair through a cross-encoder model
  3. Re-sort by the cross-encoder score and keep the top 3 to 5 for the LLM

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-base")

def rerank(query: str, candidates: list[dict], top_k: int = 5) -> list[dict]:
    pairs = [[query, c["text"]] for c in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:top_k]]

The cross-encoder reads query and document together, which catches subtle matches that bi-encoder vector search cannot. The cost is latency — typically 50 to 200ms for 20 pairs on a GPU, longer on CPU — and an extra model to deploy. For high-stakes retrieval, the precision gain is worth it.
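
To show how the stages connect, here is a rough sketch that wires retrieval and reranking together behind plain callables. The function retrieve_then_rerank, the toy retriever, and the token-overlap scorer are all stand-ins invented for this example; in a real pipeline the retriever would be the hybrid query and the scorer would be the cross-encoder's predict method.

```python
from typing import Callable

Candidate = dict  # each candidate is expected to carry at least a "text" key


def retrieve_then_rerank(
    query: str,
    retrieve: Callable[[str, int], list[Candidate]],
    score_pairs: Callable[[list[tuple[str, str]]], list[float]],
    candidates_k: int = 20,
    top_k: int = 5,
) -> list[Candidate]:
    """Stage 1: fetch candidates. Stage 2: score (query, text) pairs. Stage 3: keep the best."""
    candidates = retrieve(query, candidates_k)
    if not candidates:
        return []
    scores = score_pairs([(query, c["text"]) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:top_k]]


# Toy stand-ins so the sketch runs without a database or a model.
CORPUS = [
    {"id": 1, "text": "Reset your password from the account settings page."},
    {"id": 2, "text": "ERR_BLOCKED_BY_CLIENT means an ad blocker is interfering with the request."},
    {"id": 3, "text": "Cancel your subscription anytime from the billing dashboard."},
]


def toy_retrieve(query: str, limit: int) -> list[Candidate]:
    return CORPUS[:limit]


def toy_overlap_scorer(pairs: list[tuple[str, str]]) -> list[float]:
    # Counts shared lowercase tokens; a crude placeholder for a cross-encoder score.
    return [float(len(set(q.lower().split()) & set(d.lower().split()))) for q, d in pairs]


top = retrieve_then_rerank("ad blocker request", toy_retrieve, toy_overlap_scorer, top_k=1)
print(top)
```

Keeping the scorer behind a callable makes it easy to A/B test the pipeline with and without the reranker on the same candidate pool.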

Real-World Scenario: Internal Engineering Docs

Consider a mid-sized engineering organization with a few thousand internal documents — runbooks, postmortems, design docs, and onboarding pages. A pure vector RAG system typically performs well on conceptual questions like “how does our authentication flow work” but breaks down on specific lookups like “what is the timeout for the payments-service health check” or “which runbook covers SEV-2 incidents in eu-west-1”.

A common pattern in this kind of corpus is that documents reference specific service names, region codes, error codes, and ticket IDs. These tokens are exactly what BM25 handles best and what dense embeddings tend to smear. After switching to hybrid retrieval, teams typically see the most improvement on the long tail of identifier-heavy queries that pure vector search was quietly failing on. The conceptual queries continue to work because the dense side still contributes.

The trade-off is added complexity. You now have two retrievers to monitor, two indexes to rebuild on corpus changes, and a fusion stage to debug when relevance drops. For small corpora — say under 10,000 chunks — the operational cost often outweighs the gain. Hybrid search shines when corpus diversity is high and query patterns include both natural language and structured tokens.

When to Use Hybrid Search

  • Your corpus contains identifiers, error codes, product names, or version strings that users search for verbatim
  • Query intent is mixed: some users ask conceptual questions, others type exact phrases
  • Pure vector retrieval misses obvious keyword matches in your evaluation set
  • You have a moderate-to-large corpus (10,000+ chunks) where retrieval quality directly affects answer correctness
  • You are already comfortable running and monitoring two retrievers in production

When NOT to Use Hybrid Search

  • Your corpus is small (under 5,000 chunks) and dense retrieval already hits acceptable recall
  • Queries are exclusively conversational, with no identifiers, codes, or rare tokens
  • You have not yet measured retrieval quality with a labeled eval set; adding hybrid before measuring is premature optimization
  • Operational simplicity matters more than the last 5 percent of relevance, for example in a side project or early prototype
  • Your stack does not yet support sparse indexes natively, and adding a second system would dominate your engineering budget

Common Mistakes With Hybrid Search

  • Normalizing scores before fusion when using RRF. RRF operates on ranks, not scores. Normalization adds noise without benefit.
  • Setting k too low in RRF. Values below 30 cause the fusion to over-weight the very top of each list. The standard k=60 is a reasonable default for most corpora.
  • Skipping evaluation. Switching from vector-only to hybrid without measuring on a labeled query set means you cannot tell if you actually improved anything. Build an eval set first.
  • Using the same chunk size for both retrievers. Dense retrieval tolerates larger chunks (500 to 1,000 tokens); BM25 prefers smaller ones. Some teams index the same content at two granularities.
  • Forgetting to update both indexes on writes. A document added to the vector index but missing from the BM25 index will silently degrade hybrid recall.
  • Tuning alpha in linear combinations on a tiny eval set. The optimal weight is highly query-dependent, and small eval sets produce unreliable weights. Either use RRF or invest in a proper eval set with hundreds of labeled queries.
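
Since several of these points come back to measurement, here is a minimal sketch of scoring a retriever against a labeled query set using recall@k and mean reciprocal rank. The query IDs, document IDs, and both retrieval runs are invented for illustration; a usable eval set would hold far more queries.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(run: dict[str, list[str]], labels: dict[str, set[str]], k: int = 5) -> dict[str, float]:
    """Average both metrics over every labeled query; missing queries count as empty results."""
    n = len(labels)
    recall = sum(recall_at_k(run.get(q, []), rel, k) for q, rel in labels.items()) / n
    mrr = sum(reciprocal_rank(run.get(q, []), rel) for q, rel in labels.items()) / n
    return {"recall@k": recall, "mrr": mrr}


# Hypothetical labels: query ID -> set of relevant document IDs.
labels = {"q1": {"d2"}, "q2": {"d7", "d9"}}

# Hypothetical ranked outputs from two retriever configurations.
vector_run = {"q1": ["d5", "d3", "d2"], "q2": ["d7", "d1", "d4"]}
hybrid_run = {"q1": ["d2", "d5", "d3"], "q2": ["d7", "d9", "d1"]}

vector_metrics = evaluate(vector_run, labels, k=2)
hybrid_metrics = evaluate(hybrid_run, labels, k=2)
print("vector:", vector_metrics)
print("hybrid:", hybrid_metrics)
```

Running the same labeled set through both configurations before and after a change is what lets you tell a real improvement from noise.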

Hybrid Search vs Pure Vector vs Pure Keyword

| Approach | Strength | Weakness | Typical Use |
|---|---|---|---|
| Pure vector | Captures synonyms, intent, paraphrases | Misses exact identifiers, rare tokens | Conversational corpora, FAQs |
| Pure keyword (BM25) | Exact matches, rare terms, identifiers | No synonyms or intent understanding | Logs, code search, ID lookup |
| Hybrid (RRF) | Covers both patterns, robust defaults | More infra, two indexes to maintain | Production RAG with mixed queries |
| Hybrid + reranker | Highest precision in top-5 | Extra latency, GPU cost | High-stakes Q&A, customer-facing RAG |

For more on the broader retrieval choices, see our guide on vector databases compared. If you are still deciding whether RAG is even the right approach for your use case, the fine-tuning vs RAG comparison covers when each wins.

Conclusion

Hybrid search is one of the highest-leverage upgrades you can make to a RAG pipeline once basic vector retrieval is in place. The setup cost is modest — a second index and a fusion step — and the gain on identifier-heavy queries is often dramatic. Start with Reciprocal Rank Fusion at k=60 and a 20-result prefetch per retriever; this default works for most corpora without tuning. Then measure on a labeled eval set before adding complexity like learned weights or rerankers.

If your retrieval is still struggling after hybrid, the next two places to look are chunking and reranking. Our deep dive on RAG chunking strategies covers how chunk size and overlap interact with retrieval quality, and the cross-encoder reranker pattern above is the natural follow-up once your candidate pool is good but your top-5 is noisy. Treat hybrid search as the retrieval foundation, not the finish line.
