
Reranking in RAG: Cohere Rerank and Cross-Encoders Guide

If your RAG pipeline retrieves chunks that look relevant but produce vague answers, the problem is rarely the embedding model. More often, the retriever pulls back twenty plausible candidates and the top three happen to be the wrong three. Reranking in RAG fixes that by adding a second scoring pass that reads the query and each candidate together, then reorders them by true relevance.

This guide walks through both approaches developers actually ship: Cohere Rerank as a hosted API, and self-hosted cross-encoders like BGE Reranker. You will see working Python code for each, a side-by-side comparison, and the latency and cost trade-offs that decide which one fits your stack. By the end, you should know exactly which reranker belongs in your pipeline and how to wire it in without breaking the budget.

What Is Reranking in RAG?

Reranking in RAG is a second-stage retrieval step that takes the top N candidates from a vector or hybrid search and reorders them using a more accurate, more expensive scoring model. The first stage casts a wide net using bi-encoders or BM25; the second stage applies a cross-encoder or hosted reranking API that scores each query-document pair directly. As a result, the chunks fed to the LLM are far more likely to actually answer the user’s question.

The pattern matters because dense retrieval optimizes for speed at the cost of nuance. Embeddings compress meaning into a single vector before the query is even seen, so a chunk about “Postgres connection pooling” and one about “Postgres performance tuning” can sit very close in vector space even when only one answers your question. A reranker reads both texts together and resolves that ambiguity.

Why Vector Search Alone Falls Short

Bi-encoders, the kind that produce embeddings for vector search, are trained to map text into a shared space where similar meanings cluster together. That is fast and scalable. However, it is also lossy. The query and the document never see each other during scoring; they only meet through cosine similarity between two pre-computed vectors.
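To make the "two pre-computed vectors" point concrete, here is a minimal sketch of bi-encoder scoring. The three-dimensional vectors are hand-picked stand-ins for real embeddings (an assumption for illustration only), and the names doc_vectors and query_vector are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical, hand-picked vectors standing in for real embeddings.
# Document vectors are computed at index time, before any query exists.
doc_vectors = {
    "pooling": [0.9, 0.4, 0.1],
    "tuning": [0.8, 0.5, 0.2],
}

# The query is embedded separately at request time.
query_vector = [0.85, 0.45, 0.15]

# Scoring never looks at the texts themselves, only at the two vectors.
scores = {
    name: cosine_similarity(query_vector, vec)
    for name, vec in doc_vectors.items()
}
```

With these toy values, both documents score almost identically against the query, which is the kind of near-tie a reranker is meant to break.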

In practice, this produces three recurring failure modes. First, lexical near-misses ranked too high: a chunk mentioning your exact keywords but in a different context outranks one that semantically answers the question. Second, topical drift in the top-k: the top five results all discuss the same broad topic, so the LLM has nothing fresh to work with. Third, long-document loss: a paragraph buried inside a long chunk gets averaged into mediocrity by the embedding.

If you have not yet tightened the first stage, start with chunking strategies and hybrid search before adding a reranker. Reranking improves what the retriever surfaces, but it cannot fix retrieval that misses the right chunk entirely.

How Two-Stage Retrieval Works

The reranking pattern in RAG has a simple shape: cast a wide net, then refine. Stage one retrieves between 20 and 100 candidates using fast methods. Stage two scores each candidate against the query with a cross-encoder or rerank API and keeps the top 3 to 10 for the LLM.

Here is the flow in words:

  1. User submits a query
  2. Embed the query and run vector search on your store, optionally combined with BM25 (hybrid)
  3. Take the top 20 to 100 candidates from stage one
  4. Send (query, candidate_text) pairs to a reranker
  5. Sort by rerank score and keep the top 3 to 10
  6. Pass those chunks to the LLM as context

The key insight is asymmetry. Stage one is cheap per document but imprecise. Stage two is expensive per pair but precise. By only running the expensive model on a small candidate set, you get the precision of a cross-encoder at a cost closer to vector search. For a refresher on building the first stage, see RAG from scratch.

# Pseudocode for the two-stage flow
def retrieve(query: str, top_k: int = 5) -> list[Document]:
    candidates = vector_store.search(query, top_k=50)
    pairs = [(query, doc.text) for doc in candidates]
    scores = reranker.score(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: -x[1])
    return [doc for doc, _ in ranked[:top_k]]

The 50-candidate window is a reasonable default. Smaller windows miss relevant chunks the bi-encoder buried; larger windows pay more for the reranker without much accuracy gain past 100.
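For readers who want something runnable, the following is a minimal sketch of the same two-stage flow with the retriever and reranker passed in as plain functions. The toy first stage and the word-overlap scorer are stand-ins invented for this example; they are not a real vector store or cross-encoder.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Document:
    id: str
    text: str


def two_stage_retrieve(
    query: str,
    first_stage: Callable[[str, int], list[Document]],
    score_pairs: Callable[[list[tuple[str, str]]], list[float]],
    candidate_k: int = 50,
    top_k: int = 5,
) -> list[Document]:
    """Fetch a wide candidate set, score each (query, text) pair, keep the best."""
    candidates = first_stage(query, candidate_k)
    if not candidates:
        return []
    scores = score_pairs([(query, doc.text) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]


# Toy stand-ins (assumptions for illustration only).
CORPUS = [
    Document("a", "exporting reports to PDF"),
    Document("b", "exporting orders to Excel"),
    Document("c", "exporting invoices to CSV"),
]


def toy_first_stage(query: str, k: int) -> list[Document]:
    # Pretend the vector store returned these in a mediocre order.
    return CORPUS[:k]


def toy_overlap_scorer(pairs: list[tuple[str, str]]) -> list[float]:
    # Fraction of query words that appear in the document text.
    scores = []
    for query, text in pairs:
        query_words = set(query.lower().split())
        text_words = set(text.lower().split())
        scores.append(len(query_words & text_words) / max(len(query_words), 1))
    return scores


top = two_stage_retrieve(
    "exporting invoices to CSV", toy_first_stage, toy_overlap_scorer, top_k=1
)
```

Passing the two stages in as callables keeps the orchestration independent of whichever store or reranker you end up using.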

Cohere Rerank: The Hosted API Approach

Cohere Rerank is the path of least resistance. It is a managed endpoint that accepts a query, a list of documents, and a top-N parameter, then returns the documents sorted by relevance, each with a relevance score. At the time of writing, the production model is rerank-v3.5, which supports over 100 languages and handles up to 4,096 tokens per document.

First, install the client and set your API key:

pip install cohere
export COHERE_API_KEY="your-key-here"

Then wire it into your retrieval pipeline:

import os
import cohere
from dataclasses import dataclass

@dataclass
class Document:
    id: str
    text: str
    metadata: dict

co = cohere.ClientV2(api_key=os.environ["COHERE_API_KEY"])

def rerank_with_cohere(
    query: str,
    candidates: list[Document],
    top_n: int = 5,
) -> list[Document]:
    """Rerank candidates and return the top N most relevant."""
    if not candidates:
        return []

    response = co.rerank(
        model="rerank-v3.5",
        query=query,
        documents=[c.text for c in candidates],
        top_n=top_n,
    )

    return [candidates[result.index] for result in response.results]

Why this works: Cohere’s rerank model is a cross-encoder trained on retrieval pairs across many domains. Because it reads the query and each document together, it picks up on negation, intent, and context that bi-encoder embeddings miss. The top_n parameter handles the truncation server-side, so you do not have to sort yourself.
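The function above discards the scores. If you want them for logging or debugging, a small helper can pair each result with its document. This sketch assumes each result exposes index and relevance_score, as the rerank response does; the SimpleNamespace objects and their score values are made-up stand-ins so the example runs without an API call.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Any, Iterable


@dataclass
class Document:
    id: str
    text: str


def pair_with_scores(
    candidates: list[Document],
    results: Iterable[Any],
) -> list[tuple[Document, float]]:
    """Map each rerank result back to its original document, keeping the score."""
    return [(candidates[r.index], float(r.relevance_score)) for r in results]


# Stand-in for response.results; the scores here are invented for illustration.
fake_results = [
    SimpleNamespace(index=2, relevance_score=0.91),
    SimpleNamespace(index=0, relevance_score=0.12),
]
docs = [
    Document("a", "exporting reports to PDF"),
    Document("b", "exporting orders to Excel"),
    Document("c", "exporting invoices to CSV"),
]

scored = pair_with_scores(docs, fake_results)
```

In a real pipeline you would pass response.results from the rerank call in place of fake_results.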

For production use, wrap the call in retries and a timeout. Reranking is on the critical path of your RAG response, so a hung request degrades user experience:

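import cohere
import httpx
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

# Retry only transient failures: client-side timeouts and 503s from the API.
# This decorator handles retries only; the request timeout is set on the client.
@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=0.5, min=0.5, max=4),
    retry=retry_if_exception_type(
        (httpx.TimeoutException, cohere.errors.ServiceUnavailableError)
    ),
    reraise=True,  # surface the original exception after the last attempt
)
def rerank_with_retry(
    query: str, candidates: list[Document], top_n: int = 5
) -> list[Document]:
    return rerank_with_cohere(query, candidates, top_n)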

In practice, latency for 50 candidates of around 500 tokens each lands in the low hundreds of milliseconds, which is usually acceptable for chat UIs. At the time of writing, pricing is per-search, not per-document, which makes Cohere predictable to budget at small to mid scale.

Cross-Encoders: Self-Hosted Reranking

When you need to keep data on your own infrastructure, control costs at high volume, or run offline, a self-hosted cross-encoder is the way to go. Strong open-weight options today come from the BGE family (BAAI/bge-reranker-v2-m3 for multilingual, BAAI/bge-reranker-large for English-heavy workloads) and from Jina AI.

Install sentence-transformers, which wraps the Hugging Face model and exposes a clean cross-encoder API:

pip install sentence-transformers torch

Then load the model once at startup and score query-document pairs on demand:

import torch
from sentence_transformers import CrossEncoder
from dataclasses import dataclass

@dataclass
class Document:
    id: str
    text: str
    metadata: dict

# Load once at process start; the weights are gigabyte-scale, so avoid reloading per request
reranker = CrossEncoder(
    "BAAI/bge-reranker-v2-m3",
    max_length=512,
    # Passing "cuda" does not fall back on its own, so pick the device explicitly
    device="cuda" if torch.cuda.is_available() else "cpu",
)

def rerank_with_cross_encoder(
    query: str,
    candidates: list[Document],
    top_n: int = 5,
) -> list[Document]:
    """Rerank using a local cross-encoder."""
    if not candidates:
        return []

    pairs = [(query, c.text) for c in candidates]
    scores = reranker.predict(pairs, batch_size=32, show_progress_bar=False)

    ranked = sorted(
        zip(candidates, scores),
        key=lambda x: x[1],
        reverse=True,
    )
    return [doc for doc, _ in ranked[:top_n]]

Why this works: A cross-encoder concatenates the query and document into a single input, runs them through a transformer, and outputs a single relevance score. That joint attention is what gives cross-encoders their accuracy edge over bi-encoders. The trade-off is that you cannot precompute scores. Every query forces a fresh forward pass for every candidate.

For production, two things matter. First, batch size: process all candidates in one or two batches rather than scoring them one at a time. On a GPU, scoring 50 pairs at batch size 32 takes two batched forward passes and is typically far faster than 50 single-pair calls. Second, truncation: max_length=512 is a common setting, and that limit covers the query and the document together. If your chunks are longer, the input is truncated at the end, which can clip your most relevant sentence. Trim chunks upstream rather than letting the reranker do it silently.
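One way to handle length upstream is to split overlong chunks into pieces that each fit the budget, so nothing is dropped silently. The sketch below is an illustration under stated assumptions: split_to_fit and whitespace_token_count are hypothetical helpers, and whitespace word counts are only a rough proxy for the model's subword tokens (real tokenizers usually produce more). Leave headroom for the query, and swap in your tokenizer's count function for accurate budgets.

```python
from typing import Callable


def whitespace_token_count(text: str) -> int:
    """Rough proxy for token count; subword tokenizers usually yield more."""
    return len(text.split())


def split_to_fit(
    text: str,
    max_tokens: int,
    count_tokens: Callable[[str], int] = whitespace_token_count,
) -> list[str]:
    """Greedily pack whole words into pieces that each stay within max_tokens."""
    pieces: list[str] = []
    current: list[str] = []
    for word in text.split():
        candidate = current + [word]
        if current and count_tokens(" ".join(candidate)) > max_tokens:
            # Adding this word would overflow, so close the current piece.
            pieces.append(" ".join(current))
            current = [word]
        else:
            current = candidate
    if current:
        pieces.append(" ".join(current))
    return pieces


# Example: a budget of 2 "tokens" per piece.
pieces = split_to_fit("a b c d e", max_tokens=2)
```

Each piece can then be scored as its own candidate, with the best-scoring piece standing in for the original chunk.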

For CPU-only deployments, swap to a smaller model like BAAI/bge-reranker-base or use ONNX runtime exports. Latency on CPU for 50 candidates ranges from a few hundred milliseconds to over a second depending on chunk size; that is usually too slow for chat UIs but fine for batch document retrieval.

Cohere Rerank vs Cross-Encoders Compared

The decision usually comes down to data residency, scale, and ops appetite. Here is the comparison most teams care about:

Factor | Cohere Rerank | Self-Hosted Cross-Encoder
Setup time | Minutes (API key + SDK) | Hours (model download, GPU provisioning)
Infrastructure | None | GPU recommended for low latency
Cost model | Per search request | Fixed compute cost
Data residency | Sent to Cohere | Stays in your VPC
Multilingual | Strong (100+ languages) | Strong with bge-reranker-v2-m3
Tuning | Limited (managed model) | Full (fine-tune on your domain)
Cold start | None | Model load on process start
Best fit | Startups, small to mid scale, sensitive timelines | High volume, regulated data, custom domains

Reported accuracy on standard benchmarks is close between Cohere Rerank v3.5 and bge-reranker-v2-m3, and adding either one typically improves retrieval quality metrics like NDCG@10 over pure vector search. Confirm this on your own eval set, since results vary by domain. The real differentiator is the operating model around the reranker, not the raw accuracy.

If you want a broader view of where rerankers fit alongside other retrieval components, the vector databases compared guide covers the first-stage layer.

Integrating Reranking With LangChain

If you already use LangChain, you do not need to write the orchestration yourself. The ContextualCompressionRetriever combined with a reranker compressor handles the two-stage flow:

from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

vectorstore = Chroma(
    collection_name="docs",
    embedding_function=OpenAIEmbeddings(model="text-embedding-3-small"),
)

base_retriever = vectorstore.as_retriever(search_kwargs={"k": 50})

compressor = CohereRerank(
    model="rerank-v3.5",
    top_n=5,
)

retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_retriever,
)

docs = retriever.invoke("How do I roll my own token bucket rate limiter in Python?")

This pattern keeps your stage-one retriever and stage-two reranker swappable. Want to switch to a cross-encoder? Replace CohereRerank with CrossEncoderReranker (from langchain.retrievers.document_compressors) wrapping a HuggingFaceCrossEncoder model (from langchain_community.cross_encoders). The rest of the chain does not change. For a deeper introduction to the framework, see LangChain fundamentals.

A Realistic Production Scenario

Consider a mid-sized SaaS company running a customer support RAG bot over their help center. The corpus is around 5,000 articles, embedded with OpenAI’s text-embedding-3-small, stored in pgvector. Initially, the team retrieves the top 5 chunks directly from vector search and sends them to GPT.

Users start complaining that the bot quotes near-misses. For instance, a question about “exporting invoices to CSV” returns chunks about exporting reports to PDF and exporting orders to Excel. The chunks are topically close, but none answer the specific question. Adding a reranking stage that retrieves 50 candidates and reranks down to the top 5 typically resolves this kind of issue, because the cross-encoder reads the word “invoices” against each candidate explicitly rather than relying on embedding proximity.

The cost shift is usually modest. Going from 5 chunks straight to the LLM to 50 reranked down to 5 adds one reranker call per query. The dominant cost in most RAG bots is still the LLM completion, not the rerank step. The latency cost is real but bounded: expect 100 to 300 ms added per query for Cohere Rerank, and similar or better for a GPU-hosted cross-encoder. For a customer support bot where the user already expects a few seconds of response time, that is a fair trade.

The team should also instrument retrieval quality: track which reranked chunks the LLM actually quotes in answers, and compare to the original vector ranking. If the same top chunk wins both rankings 80% of the time, the reranker is probably not earning its keep on that workload. Otherwise, it is paying for itself.
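The agreement check described above can be computed offline from logged rankings. This is a minimal sketch; top1_agreement is a hypothetical helper, and it assumes you log, per query, the ordered chunk IDs from vector search and from the reranker.

```python
def top1_agreement(
    vector_rankings: list[list[str]],
    reranked_rankings: list[list[str]],
) -> float:
    """Fraction of queries where both rankings put the same chunk ID first."""
    if len(vector_rankings) != len(reranked_rankings):
        raise ValueError("Both inputs must cover the same queries")
    # Skip queries where either ranking came back empty.
    comparable = [
        (v, r) for v, r in zip(vector_rankings, reranked_rankings) if v and r
    ]
    if not comparable:
        return 0.0
    same = sum(1 for v, r in comparable if v[0] == r[0])
    return same / len(comparable)


# Made-up logs for four queries; the reranker changed the winner once.
vector_logs = [["a", "b"], ["c", "d"], ["e", "f"], ["g", "h"]]
rerank_logs = [["a", "b"], ["d", "c"], ["e", "f"], ["g", "h"]]

agreement = top1_agreement(vector_logs, rerank_logs)
```

A high agreement rate only says the reranker rarely changes the winner; pair it with the quality metrics from your eval set before deciding to remove it.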

When to Use Reranking in RAG

  • Your RAG bot retrieves relevant-looking chunks that produce vague or incorrect answers
  • You serve a corpus where many documents discuss similar topics with subtle distinctions
  • You can spare 100 to 300 ms of additional latency per query
  • Your queries are short and natural-language, where intent matters more than keyword overlap
  • You have already optimized chunking and tried hybrid search and still see retrieval misses
  • You need multilingual retrieval and your embedding model is weaker on non-English text

When NOT to Use Reranking in RAG

  • Your corpus is small enough (under 1,000 chunks) that the top 5 from vector search are usually correct
  • You operate under strict sub-100 ms retrieval latency budgets, like voice agents or autocomplete
  • Your queries are exact keyword lookups where BM25 alone already returns the correct chunk
  • You have not yet fixed obvious problems in chunking or query understanding
  • The bottleneck in your pipeline is LLM generation cost, not retrieval quality

Common Mistakes With Reranking in RAG

  • Reranking too few candidates. If stage one returns 5 documents, the reranker has nothing to improve on. Pull 30 to 100 candidates so the reranker has room to surface buried gems.
  • Reranking too many candidates. Beyond 100 candidates, the marginal accuracy gain rarely justifies the latency and cost. Diminishing returns kick in fast.
  • Forgetting to truncate documents to the reranker’s context window. Many self-hosted cross-encoders are run with a 512-token limit, and that budget covers the query plus the document. Long chunks get silently truncated and may lose the relevant passage.
  • Treating rerank scores as probabilities. Cross-encoder scores are not calibrated. A score of 0.85 from one model is not comparable to 0.85 from another. Use them only for sorting within a query.
  • Skipping evaluation. Without an eval set of (query, ideal_chunk) pairs, you cannot tell if the reranker is helping or hurting. Build a small golden set before shipping.
  • Adding rerankers to fix bad chunking. Reranking improves ordering, not recall. If the right chunk is not in your top 50, no reranker can save you. Fix chunking and hybrid search first.

How to Evaluate Whether Your Reranker Helps

Before you ship reranking to production, build a small evaluation set. A starter eval can be as simple as 50 to 100 hand-labeled query-and-correct-chunk pairs from real user queries. Then measure two metrics: recall@k (does the correct chunk appear in the top k?) and MRR (mean reciprocal rank, how high does it appear?). Run the same eval set with and without reranking. If MRR jumps and recall@5 climbs meaningfully, the reranker is doing its job.
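Both metrics are short to implement. The sketch below assumes one correct chunk ID per query and a ranked list of chunk IDs per query; the function names and the tiny example data are made up for illustration.

```python
def recall_at_k(rankings: list[list[str]], gold: list[str], k: int) -> float:
    """Share of queries whose correct chunk appears in the top k results."""
    if not gold:
        return 0.0
    hits = sum(1 for ranked, g in zip(rankings, gold) if g in ranked[:k])
    return hits / len(gold)


def mean_reciprocal_rank(rankings: list[list[str]], gold: list[str]) -> float:
    """Average of 1/rank of the correct chunk; a miss contributes 0."""
    if not gold:
        return 0.0
    total = 0.0
    for ranked, g in zip(rankings, gold):
        if g in ranked:
            total += 1.0 / (ranked.index(g) + 1)
    return total / len(gold)


# Made-up eval set: three queries, one correct chunk each.
gold_ids = ["c1", "c2", "c3"]
without_rerank = [["c1", "x", "y"], ["x", "y", "c2"], ["x", "y", "z"]]
with_rerank = [["c1", "x", "y"], ["c2", "x", "y"], ["x", "c3", "y"]]

recall_before = recall_at_k(without_rerank, gold_ids, k=1)
recall_after = recall_at_k(with_rerank, gold_ids, k=1)
mrr_before = mean_reciprocal_rank(without_rerank, gold_ids)
mrr_after = mean_reciprocal_rank(with_rerank, gold_ids)
```

Running both configurations over the same gold set keeps the comparison fair, since any difference comes from the ranking alone.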

If neither metric moves, something else is wrong, usually upstream. Check whether the correct chunk is even in your top 50 candidates. If it is not, the problem is recall, not ranking, and you need to improve embedding quality, chunking, or hybrid search before reranking can help. The fine-tuning vs RAG post covers the broader question of when retrieval improvements stop paying off.
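To separate recall problems from ranking problems, you can label each eval query by where the correct chunk was lost. This is a sketch with hypothetical names; it assumes you keep both the stage-one candidate IDs and the final reranked IDs for every query.

```python
from collections import Counter


def classify_miss(
    candidate_ids: list[str],
    final_ids: list[str],
    gold_id: str,
) -> str:
    """Label one query: where, if anywhere, the correct chunk was lost."""
    if gold_id in final_ids:
        return "hit"
    if gold_id in candidate_ids:
        # Stage one found it, but the reranker ranked it below the cutoff.
        return "ranking_miss"
    # Stage one never surfaced it, so no reranker could recover it.
    return "recall_miss"


# Made-up records: (stage-one candidates, final top results, correct chunk).
records = [
    (["a", "b", "c"], ["a", "b"], "a"),
    (["a", "b", "c"], ["a", "b"], "c"),
    (["a", "b", "c"], ["a", "b"], "z"),
]

breakdown = Counter(classify_miss(c, f, g) for c, f, g in records)
```

If recall_miss dominates the breakdown, work on chunking, embeddings, or hybrid search first; if ranking_miss dominates, the reranker or its cutoff is the place to look.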

Conclusion: Picking the Right Reranker

Reranking in RAG is one of the highest-leverage upgrades you can make to a retrieval pipeline once the basics are solid. Cohere Rerank gets you most of the accuracy gain in an afternoon of integration work, while self-hosted cross-encoders like BGE Reranker buy you control, data residency, and cost predictability at scale. For most teams shipping their first RAG product, Cohere Rerank is the right starting point; you can graduate to a self-hosted cross-encoder later if scale or compliance demand it.

Once your reranker is in place, the next leverage point is usually how the LLM uses the retrieved context. The building AI agents guide walks through structured planning patterns that pair well with a strong reranked retriever, so the model uses the right chunk at the right step.
