Weaviate Hybrid Search: Vector + BM25 in One Query

If your RAG pipeline returns close-but-wrong chunks — semantically related but missing the exact product code, error message, or API name the user typed — pure vector search is the bottleneck. Weaviate hybrid search solves this by running BM25 keyword search and vector similarity in the same query, then fusing the results with a single tunable parameter. This tutorial walks through schema design, the built-in vectorizer modules, the Python v4 client, and the production trade-offs you should know before going live.

By the end, you will have a working Weaviate instance with a hybrid-searchable collection, real ingestion code, query patterns for filters and reranking, and a clear sense of when Weaviate is the right tool versus pgvector, Qdrant, or Pinecone.

What Is Weaviate?

Weaviate is an open-source vector database written in Go that stores objects together with their embeddings and supports hybrid search natively. Unlike pgvector or raw FAISS, it ships with vectorizer modules that call OpenAI, Cohere, Hugging Face, or local transformers automatically at ingestion and query time. As a result, you write your data once and Weaviate handles embedding generation, BM25 indexing, and hybrid fusion in one engine.

The project is maintained by Weaviate B.V. and has been production-ready since 2021. It is available as a fully managed cloud service (Weaviate Cloud), as self-hosted Docker images, and as Kubernetes Helm charts. The Python v4 client, released in 2024, replaced the older v3 client with a typed, gRPC-based API that this tutorial uses throughout.

Pure vector search struggles with exact-match terms because embeddings generalize toward meaning. For example, a query for “error code E2049” embeds close to “system error” or “code failure,” so a document containing the literal string E2049 can drift to position 20 in the results. BM25 scores literal term matches directly, so it keeps that ranking signal intact. Consequently, combining both approaches recovers exact-match precision without losing semantic recall.

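Weaviate fuses the two result sets with one of two algorithms: relative score fusion, the default since v1.24, which normalizes each search's scores and blends them, or ranked fusion, which combines reciprocal ranks in the style of Reciprocal Rank Fusion (RRF). You control the balance with an alpha parameter between 0 and 1: alpha 0 is pure BM25, alpha 1 is pure vector, and alpha 0.5 weights them equally. In practice, alpha values between 0.5 and 0.75 work well for most retrieval-augmented generation workloads.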
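
To make the role of alpha concrete, here is a minimal pure-Python sketch of score-based fusion. It is not Weaviate's implementation: it min-max normalizes raw scores instead of working on ranks, and the function names, document IDs, and sample scores are all invented for illustration.

```python
def normalize(scores):
    """Min-max scale a {doc_id: score} dict into the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every document as a full match.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse(bm25_scores, vector_scores, alpha=0.5):
    """Blend two score dicts: alpha=0 is keyword only, alpha=1 is vector only."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    bm25_n = normalize(bm25_scores)
    vec_n = normalize(vector_scores)
    fused = {}
    for doc in set(bm25_n) | set(vec_n):
        # A document missing from one list contributes 0 from that side.
        fused[doc] = (1 - alpha) * bm25_n.get(doc, 0.0) + alpha * vec_n.get(doc, 0.0)
    # Highest fused score first; ties broken by document ID for stable output.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical raw scores for the query "error code E2049".
bm25 = {"e2049-guide": 9.1, "deadlock-basics": 1.3}
vector = {"deadlock-basics": 0.82, "e2049-guide": 0.78, "pgbouncer": 0.74}

keyword_only = fuse(bm25, vector, alpha=0.0)
vector_only = fuse(bm25, vector, alpha=1.0)
balanced = fuse(bm25, vector, alpha=0.5)
```

With these made-up numbers, the exact-match page leads at alpha 0 and alpha 0.5, while the conceptually similar page takes over at alpha 1.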

For deeper background on the math behind this fusion, see our guide on hybrid BM25 and vector search for RAG.

Prerequisites

To follow this tutorial you need:

  • Python 3.9 or newer
  • Docker Desktop (for local development) or a Weaviate Cloud account
  • An OpenAI API key (for the vectorizer module in the examples) — Cohere and local transformers work the same way
  • Basic familiarity with embeddings and vector search; if either is new, start with our RAG from scratch walkthrough

Step 1: Run Weaviate Locally With Docker

For local development, the fastest path is Docker Compose. Create a docker-compose.yml in your project root:

services:
  weaviate:
    image: cr.weaviate.io/semitechnologies/weaviate:1.27.0
    restart: on-failure
    ports:
      - 8080:8080
      - 50051:50051
    environment:
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      DEFAULT_VECTORIZER_MODULE: 'text2vec-openai'
      ENABLE_MODULES: 'text2vec-openai,reranker-cohere'
      OPENAI_APIKEY: ${OPENAI_API_KEY}
    volumes:
      - weaviate_data:/var/lib/weaviate

volumes:
  weaviate_data:

Port 8080 serves the REST and GraphQL APIs, while port 50051 is gRPC — the v4 Python client requires both. Then start it:

export OPENAI_API_KEY=sk-...
docker compose up -d

The ENABLE_MODULES environment variable controls which vectorizers and rerankers are loaded. Importantly, you must enable a module here before you can reference it in a collection schema later.
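
After docker compose up -d, the container can take a few seconds before it accepts requests. The following is a minimal sketch of a readiness poll, assuming the default local port from the compose file above. The helper names http_ready and wait_until_ready are made up for this example; /v1/.well-known/ready is Weaviate's readiness endpoint.

```python
import time
import urllib.error
import urllib.request

# Assumes the port mapping from the docker-compose.yml above.
READY_URL = "http://localhost:8080/v1/.well-known/ready"


def http_ready(url=READY_URL, timeout=2.0):
    """Return True if the readiness endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx response: not ready yet.
        return False


def wait_until_ready(probe=http_ready, attempts=30, delay=1.0, sleep=time.sleep):
    """Call `probe` up to `attempts` times, pausing `delay` seconds between tries."""
    for _ in range(attempts):
        if probe():
            return True
        sleep(delay)
    return False
```

Calling wait_until_ready() before connecting gives a clear failure if the container never comes up, instead of a connection error deeper in the client.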

For production, Weaviate Cloud removes the Docker operations work; the schema and query code below are identical against either backend.

Step 2: Install the Python v4 Client

pip install weaviate-client==4.9.0

Then connect to your local instance:

import os
import weaviate
from weaviate.classes.init import Auth, AdditionalConfig, Timeout

client = weaviate.connect_to_local(
    headers={
        "X-OpenAI-Api-Key": os.environ["OPENAI_API_KEY"],
    },
    additional_config=AdditionalConfig(
        timeout=Timeout(init=30, query=60, insert=120),
    ),
)

assert client.is_ready()

The X-OpenAI-Api-Key header is what Weaviate uses to call OpenAI for vectorization at import and query time. Because the key travels with each request rather than living only in the server config, the same Weaviate instance can serve clients that use different OpenAI accounts. The OPENAI_APIKEY variable in docker-compose.yml acts as a server-side fallback for requests that omit the header.

For Weaviate Cloud, swap connect_to_local for connect_to_weaviate_cloud with your cluster URL and API key:

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.environ["WEAVIATE_URL"],
    auth_credentials=Auth.api_key(os.environ["WEAVIATE_API_KEY"]),
    headers={"X-OpenAI-Api-Key": os.environ["OPENAI_API_KEY"]},
)

Step 3: Define a Collection With a Vectorizer Module

A Weaviate collection is roughly equivalent to a table in Postgres or an index in Elasticsearch. The schema declares properties, the vectorizer module, and which fields contribute to the vector.

from weaviate.classes.config import Configure, Property, DataType

if client.collections.exists("Article"):
    client.collections.delete("Article")

articles = client.collections.create(
    name="Article",
    vectorizer_config=Configure.Vectorizer.text2vec_openai(
        model="text-embedding-3-small",
        dimensions=1536,
    ),
    properties=[
        Property(
            name="title",
            data_type=DataType.TEXT,
        ),
        Property(
            name="content",
            data_type=DataType.TEXT,
        ),
        Property(
            name="category",
            data_type=DataType.TEXT,
            skip_vectorization=True,
        ),
        Property(
            name="published_at",
            data_type=DataType.DATE,
            skip_vectorization=True,
        ),
    ],
    inverted_index_config=Configure.inverted_index(
        bm25_b=0.75,
        bm25_k1=1.2,
    ),
)

A few choices here matter. First, skip_vectorization=True on category and published_at keeps them out of the embedding — categorical and date fields would only add noise to a semantic vector. Second, bm25_b and bm25_k1 tune the BM25 ranking; the defaults shown match Lucene and Elasticsearch, so you rarely need to change them.
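
To show what bm25_k1 and bm25_b actually control, here is a small self-contained BM25 scorer. It is a sketch under simplifying assumptions, not Weaviate's code: it tokenizes on whitespace, uses the Lucene-style IDF, and the three sample documents are invented.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document in `docs` against `query` with a common BM25 variant."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: how many documents contain each term.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # b scales the document-length penalty; k1 caps the gain from repeated terms.
            length_norm = 1 - b + b * len(toks) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    "error e2049 indicates a deadlock detected by the lock manager",
    "connection pooling with pgbouncer in transaction mode",
    "general system error handling with retry logic for any database code failure in production",
]
default_scores = bm25_scores("error code e2049", docs)
no_length_norm = bm25_scores("error code e2049", docs, b=0.0)
```

The first and third documents each match two query terms of equal rarity, so with the default b the shorter first document scores higher. Setting b to 0 removes the length penalty and the two tie.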

The vectorizer is text2vec-openai with the text-embedding-3-small model. Because the module is configured at the collection level, every insert and query automatically calls OpenAI without you writing the embedding code. Alternatively, you can use text2vec_cohere, text2vec_huggingface, or text2vec_transformers for a self-hosted local model.

Step 4: Insert Data With Batching

For more than a handful of objects, use the batch API. It pipelines inserts over gRPC and handles vectorization concurrency for you:

articles = client.collections.get("Article")

sample_data = [
    {
        "title": "PostgreSQL Connection Pooling with PgBouncer",
        "content": "Connection pooling becomes critical when your app exceeds 100 concurrent users. PgBouncer in transaction mode is the most common production setup...",
        "category": "databases",
        "published_at": "2025-08-12T00:00:00Z",
    },
    {
        "title": "Diagnosing Error E2049 in Production Postgres",
        "content": "Error E2049 indicates a deadlock detected by the lock manager. The fix is usually retry-on-deadlock logic in your transaction wrapper...",
        "category": "databases",
        "published_at": "2025-09-04T00:00:00Z",
    },
    # ... more rows
]

with articles.batch.dynamic() as batch:
    for row in sample_data:
        batch.add_object(properties=row)
        if batch.number_errors > 10:
            print("Batch failed too many times, aborting")
            break

failed = articles.batch.failed_objects
if failed:
    print(f"{len(failed)} objects failed to insert")
    for obj in failed[:5]:
        print(obj.message)

The dynamic() batch automatically tunes batch size based on throughput. For very large ingests (millions of rows), use fixed_size(batch_size=200) instead and run multiple workers in parallel. Additionally, always check failed_objects after the batch closes, because silent insert failures are among the most common production bugs with vector stores.
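
Checking for failures is only half of the job; usually you also want to re-submit them. Below is a minimal retry sketch with exponential backoff. The insert_batch callable is hypothetical: it stands for whatever wrapper you write around the batch API that returns the rows that failed. How you map failed_objects back to source rows is left open here.

```python
import time


def insert_with_retry(rows, insert_batch, max_rounds=3, base_delay=1.0, sleep=time.sleep):
    """Insert rows, re-submitting only the failures with exponential backoff.

    `insert_batch` is a hypothetical callable: it takes a list of rows,
    inserts them, and returns the rows that failed.
    """
    pending = list(rows)
    for attempt in range(max_rounds):
        if not pending:
            break
        if attempt > 0:
            # Wait 1x, 2x, 4x ... the base delay before each retry round.
            sleep(base_delay * 2 ** (attempt - 1))
        pending = insert_batch(pending)
    return pending  # rows that still failed after every round


# Demo with a stand-in that fails each flaky row on its first attempt only.
attempted = set()


def flaky_insert(batch):
    failed = [row for row in batch if row["flaky"] and row["title"] not in attempted]
    attempted.update(row["title"] for row in batch)
    return failed


rows = [
    {"title": "PgBouncer pooling", "flaky": False},
    {"title": "Diagnosing E2049", "flaky": True},
]
still_failed = insert_with_retry(rows, flaky_insert, sleep=lambda seconds: None)
```

Anything left in still_failed after the last round deserves a log entry or a dead-letter file rather than a silent drop.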

Step 5: Run a Hybrid Query

Now the core feature. A hybrid query takes a string, embeds it via the configured vectorizer, runs BM25 against the same string, and fuses the rankings:

from weaviate.classes.query import MetadataQuery

response = articles.query.hybrid(
    query="error code E2049 postgres deadlock",
    alpha=0.6,
    limit=5,
    return_metadata=MetadataQuery(score=True, explain_score=True),
)

for obj in response.objects:
    print(f"{obj.metadata.score:.3f}  {obj.properties['title']}")
    print(f"        {obj.metadata.explain_score}")

The score field shows the fused score, and explain_score breaks down each contributor, which is invaluable when you are tuning alpha. A query like the one above benefits heavily from BM25 because the literal string “E2049” anchors the result, while the semantic vector pulls in related deadlock content.

For comparison, a query like “how do I prevent my database from running out of connections” leans the other way: BM25 will only match “database” and “connections,” but the vector understands that “running out” maps to pooling and limits. Therefore, expect to tune alpha per workload or query type against a representative query set, rather than hand-picking it for each individual query.
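
One way to tune alpha is to sweep a few candidate values against a small labeled query set and compare recall@k. The sketch below assumes a search callable that you supply, taking (query, alpha, k) and returning a list of document identifiers. In a real setup it would wrap the hybrid query from this step; here a toy stand-in with invented behavior keeps the example runnable.

```python
def recall_at_k(search, labeled_queries, alpha, k=5):
    """Fraction of queries whose expected document appears in the top k."""
    hits = 0
    for query, expected_id in labeled_queries:
        if expected_id in search(query, alpha, k):
            hits += 1
    return hits / len(labeled_queries)


def best_alpha(search, labeled_queries, candidates=(0.0, 0.25, 0.5, 0.75, 1.0), k=5):
    """Return (alpha, recall) for the candidate with the highest recall@k.

    On ties, the earliest candidate in `candidates` wins.
    """
    scored = [(recall_at_k(search, labeled_queries, a, k), a) for a in candidates]
    recall, alpha = max(scored, key=lambda pair: pair[0])
    return alpha, recall


def toy_search(query, alpha, k):
    # Hypothetical behavior: keyword-heavy settings surface the error page,
    # vector-heavy settings surface the conceptual page.
    results = []
    if alpha <= 0.75:
        results.append("e2049-guide")
    if alpha >= 0.5:
        results.append("pooling-guide")
    return results[:k]


labeled = [
    ("error code E2049", "e2049-guide"),
    ("running out of connections", "pooling-guide"),
]
alpha, recall = best_alpha(toy_search, labeled)
```

A few dozen labeled queries drawn from real search logs are usually enough to see which end of the alpha range your workload prefers.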

Step 6: Combine Hybrid Search With Filters

Real applications rarely run unfiltered queries. Multi-tenancy, time windows, and category scoping all need filters that compose with the hybrid score:

from weaviate.classes.query import Filter
from datetime import datetime, timezone

cutoff = datetime(2025, 6, 1, tzinfo=timezone.utc)

response = articles.query.hybrid(
    query="connection pool exhaustion",
    alpha=0.65,
    filters=(
        Filter.by_property("category").equal("databases")
        & Filter.by_property("published_at").greater_than(cutoff)
    ),
    limit=10,
)

Filters run before the hybrid score, so they narrow the candidate set without distorting the ranking. Importantly, any property used in a filter needs a filterable inverted index: text and other primitive types get one by default, and you can set it explicitly with Property(..., index_filterable=True).
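
The difference between filtering before and after ranking is easy to miss, so here is a plain-Python illustration. It does not use Weaviate at all; the documents, scores, and function names are invented purely to show why pre-filtering still fills the requested limit.

```python
def prefilter_search(docs, predicate, score, limit):
    """Filter first, then rank only the surviving candidates."""
    candidates = [d for d in docs if predicate(d)]
    return sorted(candidates, key=score, reverse=True)[:limit]


def postfilter_search(docs, predicate, score, limit):
    """Rank everything, cut to `limit`, then filter: may return too few."""
    top = sorted(docs, key=score, reverse=True)[:limit]
    return [d for d in top if predicate(d)]


def is_database_doc(doc):
    return doc["category"] == "databases"


def by_score(doc):
    return doc["score"]


docs = [
    {"title": "Frontend caching", "category": "web", "score": 0.9},
    {"title": "CDN tuning", "category": "web", "score": 0.8},
    {"title": "PgBouncer pooling", "category": "databases", "score": 0.7},
    {"title": "Deadlock retries", "category": "databases", "score": 0.6},
]

pre = prefilter_search(docs, is_database_doc, by_score, limit=2)
post = postfilter_search(docs, is_database_doc, by_score, limit=2)
```

Here pre contains both database articles in score order, while post is empty because the two highest-scoring documents were filtered out after the cut.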

Step 7: Add a Reranker for the Top-K

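For higher-precision retrieval, layer a cross-encoder reranker on top of the hybrid results. Weaviate ships modules for Cohere Rerank, Voyage AI, and Jina AI. Enable the module in docker-compose.yml first (the Step 1 file already lists reranker-cohere) and make a Cohere API key available to Weaviate, for example through an X-Cohere-Api-Key client header or the COHERE_APIKEY environment variable. Then add the reranker to the collection; if your server version rejects the update on an existing collection, pass reranker_config to collections.create instead: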

from weaviate.classes.config import Configure
from weaviate.classes.query import Rerank

articles.config.update(
    reranker_config=Configure.Reranker.cohere(model="rerank-english-v3.0"),
)

response = articles.query.hybrid(
    query="how do I prevent database connection exhaustion in production",
    alpha=0.7,
    limit=20,
    rerank=Rerank(
        prop="content",
        query="database connection exhaustion production fix",
    ),
)

The reranker rescores the top 20 hybrid results using a cross-encoder, which evaluates query and document together rather than as independent embeddings. As a result, precision on the top 3 results typically improves by 10 to 30 percent, at the cost of a few hundred milliseconds of latency and a usage-based Cohere fee. For more on why this works, see our breakdown of reranking with Cohere and cross-encoders.
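
The two-stage shape (cheap retrieval, then a more expensive rescoring of a short list) can be sketched without any model. In the example below, token_overlap is a deliberately crude stand-in for a cross-encoder, and the candidate texts are invented; the point is only the structure of retrieve-then-rerank.

```python
def token_overlap(query, text):
    """Toy relevance score: share of query tokens that appear in the text."""
    query_tokens = set(query.lower().split())
    text_tokens = set(text.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & text_tokens) / len(query_tokens)


def rerank(query, candidates, scorer=token_overlap, top_n=3):
    """Rescore first-stage candidates jointly with the query and keep top_n.

    The sort is stable, so candidates with equal scores keep their
    first-stage order.
    """
    rescored = [(scorer(query, text), text) for text in candidates]
    rescored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in rescored[:top_n]]


# Pretend these came back from a hybrid query, in first-stage order.
first_stage = [
    "pooling basics for web apps",
    "fix database connection exhaustion with pgbouncer",
    "database backups in production",
]
top = rerank("database connection exhaustion production fix", first_stage, top_n=2)
```

Swapping the scorer for a real cross-encoder call changes the quality of the scores, not the shape of the pipeline.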

When to Use Weaviate

  • Your RAG pipeline needs both semantic recall and exact-match precision (product codes, error strings, API names)
  • You want the database to handle embedding generation rather than running a separate embedding service
  • You need multi-tenancy or multi-vector per object (Weaviate supports both natively)
  • Your workload is read-heavy with predictable latency requirements
  • You want a single binary that handles BM25, vectors, filters, and reranking

When NOT to Use Weaviate

  • You already run Postgres and your data volume is under a few million vectors — pgvector keeps your stack simpler
  • You need extreme write throughput or very large scale (Qdrant and Milvus tend to be faster at the high end)
  • You want minimal ops and are happy with a fully managed black box — Pinecone or Weaviate Cloud both work, but Pinecone Serverless has less to configure
  • Your search is purely keyword-based — a Postgres GIN index or Elasticsearch will be faster and cheaper
  • You are prototyping and do not need a separate database yet — Chroma or in-memory FAISS will get you to a demo faster

Common Mistakes with Weaviate

  • Forgetting to enable the vectorizer module in ENABLE_MODULES before referencing it in a collection schema — Weaviate will reject the create call with an unhelpful error
  • Vectorizing every property by default; categorical, date, and numeric fields belong out of the vector via skip_vectorization=True
  • Setting alpha once globally and never tuning per query type — the right alpha depends on whether the query is exact-match heavy or semantic heavy
  • Skipping the batch error check after ingestion; silent failures are common when API keys hit rate limits during vectorization
  • Running a single-node Weaviate in production without backups — the database stores vectors on disk, so a node failure without snapshots loses everything
  • Using the v3 Python client tutorials found in older blog posts; the v4 API is a near-rewrite and the syntax does not transfer

Real-World Scenario: Documentation Search at Scale

Consider a developer documentation site with around 8,000 articles across 30 product areas. The team initially ran pure vector search with pgvector and saw user complaints that searches for specific error codes (like “E2049” or “PGRST116”) returned tangentially related articles rather than the exact troubleshooting page.

Migrating to Weaviate with hybrid search at alpha 0.65 typically resolves this kind of issue. The vector half still surfaces conceptually related content for natural-language queries, while the BM25 half pins exact-match terms to the top. For a team of three engineers, the migration, including a backfill script that streams from Postgres into Weaviate’s batch API, generally takes one to two weeks depending on schema complexity and ingestion volume. The biggest trade-off is the new operational surface area: another database to back up, monitor, and version-upgrade.
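
The reading half of such a backfill script might look like the sketch below. It uses an in-memory SQLite database as a stand-in for Postgres so the example runs anywhere. The table name, column names, and the assumption that published_at is stored as a Unix timestamp are all hypothetical.

```python
import sqlite3
from datetime import datetime, timezone


def stream_rows(conn, batch_size=500):
    """Yield property dicts from the articles table without loading it all."""
    cursor = conn.execute(
        "SELECT title, content, category, published_at FROM articles ORDER BY id"
    )
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            return
        for title, content, category, published_at in batch:
            # Assumes the column holds a Unix timestamp in seconds (UTC).
            published = datetime.fromtimestamp(published_at, tz=timezone.utc)
            yield {
                "title": title,
                "content": content,
                "category": category,
                # RFC 3339 string with an explicit UTC offset.
                "published_at": published.isoformat(),
            }


# Stand-in source database with one sample row.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, content TEXT, "
    "category TEXT, published_at INTEGER)"
)
conn.execute(
    "INSERT INTO articles (title, content, category, published_at) "
    "VALUES ('Diagnosing Error E2049', 'Deadlock detected...', 'databases', 1756944000)"
)
rows = list(stream_rows(conn))
```

Because stream_rows is a generator, each yielded dict can be passed to the batch API one at a time, keeping memory use flat regardless of table size.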

Closing the Connection

One small detail that bites people: the Python v4 client uses gRPC streams that should be closed cleanly:

client.close()

In a Flask or FastAPI app, create one client at startup and reuse it across requests. Then close it in the app shutdown hook. Otherwise you will leak connections and see latency spikes after a few hours.
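
For scripts and batch jobs, a small context manager makes the close call hard to forget. This is a generic sketch: managed_client is a made-up helper name, and connect is any zero-argument callable that returns an object with a close() method, such as weaviate.connect_to_local.

```python
from contextlib import contextmanager


@contextmanager
def managed_client(connect):
    """Open a client via `connect()` and always close it, even on errors."""
    client = connect()
    try:
        yield client
    finally:
        client.close()


# Usage (assumes a running local instance):
#
#   with managed_client(weaviate.connect_to_local) as client:
#       articles = client.collections.get("Article")
#       ...
```

In a long-running web app the same idea applies at a larger scale: open the client once in the startup hook and close it in the shutdown hook.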

Conclusion

Weaviate hybrid search is the right tool when you need both semantic understanding and exact-match precision in the same query, and when you want the database to handle vectorization for you. Start with the local Docker setup, define a collection with text2vec-openai, ingest with the batch API, and tune alpha against your actual query distribution before shipping. If you are evaluating other engines first, our comparison of vector databases covers Pinecone, Qdrant, Milvus, pgvector, and Weaviate side by side.

The next step is to wire your hybrid retrieval into a real RAG pipeline; if you have not picked a chunking approach yet, start with our guide on RAG chunking strategies before you ingest your full corpus.
