
Qdrant Setup and Python Integration: Hands-On Guide

If you are building a RAG pipeline or a semantic search feature and want a Rust-based vector store that runs anywhere from a laptop to a Kubernetes cluster, this Qdrant Python integration guide is for you. We will go from a Docker container on localhost to a filtered semantic search wrapped in a FastAPI service, with the exact commands and code you need to run today.

Qdrant has become one of the most popular open-source vector databases because of its speed, payload filtering, and a Python client that hides almost all of the gRPC plumbing. However, the official docs jump between concepts without a single end-to-end path. This tutorial gives you that path. By the end, you will have a running collection, real embeddings, working searches with metadata filters, and a clear sense of when Qdrant is the right pick versus alternatives like pgvector or Pinecone.

What Is Qdrant?

Qdrant is an open-source vector database written in Rust that stores high-dimensional embeddings alongside arbitrary JSON payloads and supports approximate nearest neighbor search with rich filtering. It runs as a single binary, a Docker container, or a managed cloud service, and exposes both REST and gRPC APIs. Its standout feature is the ability to combine semantic similarity with structured filters in a single query, which is critical for production RAG.

In practice, Qdrant occupies the same niche as Pinecone, Weaviate, and Milvus. However, it leans toward developers who want self-hosting without the operational overhead of a Cassandra-style cluster. For a broader landscape view, see our vector databases compared breakdown.

Why Choose Qdrant for Python Integration

Three properties make Qdrant a strong default. First, the Rust core gives you predictable latency under load — most queries on a million-vector collection finish in single-digit milliseconds on a modest VM. Second, the payload index is a real index, not a post-filter. As a result, queries like “find semantically similar docs where tenant_id = 42 and published_at > 2025-01-01” stay fast even when filters eliminate 99% of the corpus. Third, the Python client mirrors the REST API exactly, so anything you can curl, you can call from Python.

Notably, Qdrant also supports sparse vectors, which means you can run hybrid search (BM25-style + dense) inside a single database instead of gluing Elasticsearch to a vector store. For developers coming from a Django, FastAPI, or Flask background, the Qdrant Python integration feels like using any other ORM-style client: type hints, Pydantic models, and clear exceptions.
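To make the hybrid idea concrete, here is a minimal sketch of reciprocal rank fusion (RRF), one common way to merge a keyword-style ranking with a dense ranking. It is plain Python for illustration only, not Qdrant's internal implementation. The document IDs and the constant k=60 are assumptions chosen for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked ID lists; each list is ordered best-first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # A document earns more credit the higher it ranks in each list.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID so the order is stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from a sparse (keyword) and a dense (embedding) search.
sparse_hits = ["billing-card", "invoice", "export"]
dense_hits = ["billing-card", "2fa", "invoice"]

fused = reciprocal_rank_fusion([sparse_hits, dense_hits])
print(fused)
```

Documents that appear in both lists rise to the top, which is why hybrid search tends to help queries that mix exact terms with looser phrasing.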

Prerequisites

Before you start, make sure you have:

  • Python 3.10 or newer
  • Docker Desktop (or a Linux Docker daemon)
  • An OpenAI API key (we will use it for embeddings; any embedding model works)
  • About 1 GB of free disk for the Qdrant container and a small dataset

If you have never built a RAG pipeline before, skim our RAG from scratch walkthrough first. The chunking and embedding concepts there are assumed here.

Step 1: Run Qdrant Locally with Docker

The fastest way to get a working Qdrant instance is the official Docker image. Run this in any terminal:

docker run -d \
  --name qdrant \
  -p 6333:6333 \
  -p 6334:6334 \
  -v "$(pwd)/qdrant_storage:/qdrant/storage" \
  qdrant/qdrant:latest

A few notes on the flags. Port 6333 exposes the REST API and the built-in web dashboard. Port 6334 exposes gRPC; the Python client talks REST by default and switches to gRPC for higher throughput only if you construct it with prefer_grpc=True. The volume mount keeps your data on the host, so removing and recreating the container does not wipe collections.

Confirm it is up:

curl http://localhost:6333/healthz
# Expected output: healthz check passed

Open http://localhost:6333/dashboard in a browser. You should see Qdrant’s web UI with an empty collections list. This dashboard becomes invaluable for debugging later, especially when inspecting payloads.
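If you script the setup (in CI, for example), a readiness poll saves you from racing the container start. The sketch below uses only the standard library and the same /healthz endpoint as the curl check. The function names, the 30-attempt default, and the injectable probe argument are assumptions made for this example, not part of the Qdrant client.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def probe_healthz(base_url: str) -> bool:
    """Return True if GET {base_url}/healthz answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/healthz", timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or HTTP error: not ready yet.
        return False


def wait_for_qdrant(
    base_url: str = "http://localhost:6333",
    attempts: int = 30,
    delay_s: float = 1.0,
    probe: Callable[[str], bool] = probe_healthz,
) -> bool:
    """Poll until the probe succeeds or the attempts run out."""
    for attempt in range(attempts):
        if probe(base_url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay_s)
    return False
```

Call wait_for_qdrant() before creating collections and fail fast if it returns False.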

Step 2: Install the Python Client

Create a fresh virtual environment and install the dependencies:

python -m venv .venv
source .venv/bin/activate   # On Windows: .venv\Scripts\activate
pip install "qdrant-client[fastembed]" openai python-dotenv

The [fastembed] extra pulls in a lightweight ONNX-based embedding runtime, which is useful if you ever want to skip OpenAI and embed locally. The openai package is for our embedding calls in this tutorial. For dependency management trade-offs, see our pip, Poetry, and uv guide.

Create a .env file for your API key:

OPENAI_API_KEY=sk-...

Step 3: Create Your First Collection

A collection in Qdrant is roughly a table in SQL — it holds vectors plus optional payloads and defines the vector dimensions and distance metric. Here is a minimal setup script.

# create_collection.py
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(host="localhost", port=6333)

COLLECTION = "support_articles"
EMBED_DIM = 1536  # text-embedding-3-small output size

client.recreate_collection(
    collection_name=COLLECTION,
    vectors_config=VectorParams(size=EMBED_DIM, distance=Distance.COSINE),
)

print(client.get_collection(COLLECTION))

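Why recreate_collection instead of create_collection? In a tutorial, you will iterate, and recreate_collection drops and re-creates the collection without raising if it already exists, so every run starts from a clean slate. That also means it deletes any stored points, so never use it in production code where the collection holds real data. Recent qdrant-client releases also mark recreate_collection as deprecated in favor of checking collection_exists and then calling create_collection (after delete_collection, if you really want a reset).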
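For code paths that must never wipe data, a create-if-missing helper is a safer shape. This is a minimal sketch, assuming a client that exposes collection_exists and create_collection as recent qdrant-client versions do. The helper name ensure_collection is hypothetical.

```python
from typing import Any


def ensure_collection(client: Any, name: str, vectors_config: Any) -> bool:
    """Create the collection only if it is missing. Returns True if created."""
    if client.collection_exists(name):
        # Existing points are left untouched, unlike recreate_collection.
        return False
    client.create_collection(collection_name=name, vectors_config=vectors_config)
    return True


# Usage with the real client (needs a running Qdrant instance):
#   from qdrant_client import QdrantClient
#   from qdrant_client.models import Distance, VectorParams
#   client = QdrantClient(host="localhost", port=6333)
#   ensure_collection(
#       client, "support_articles", VectorParams(size=1536, distance=Distance.COSINE)
#   )
```

The return value lets a deployment script log whether it created the collection or found an existing one.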

Why cosine distance? OpenAI’s text-embedding-3-small vectors are already normalized, so cosine and dot product produce the same ranking but cosine is the conventional default for text. Switch to Distance.DOT only if you have a measured throughput reason.
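The claim about identical rankings is easy to check by hand. The sketch below uses tiny made-up three-dimensional vectors (not real embeddings) and normalizes them to unit length, which is the property that makes cosine similarity and dot product agree.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def normalize(a: list[float]) -> list[float]:
    length = math.sqrt(dot(a, a))
    return [x / length for x in a]


# Made-up vectors standing in for a query and three documents.
query = normalize([0.9, 0.1, 0.3])
docs = {
    "billing": normalize([0.8, 0.2, 0.4]),
    "account": normalize([0.1, 0.9, 0.2]),
    "data": normalize([0.3, 0.3, 0.9]),
}

by_cosine = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)
by_dot = sorted(docs, key=lambda name: dot(query, docs[name]), reverse=True)
print(by_cosine == by_dot)  # the two metrics rank unit vectors identically
```

For unit vectors the denominator in the cosine formula is 1, so the two scores are the same number up to floating-point error.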

Run the script:

python create_collection.py

You should see the collection metadata printed, including a green status and points_count=0.

Step 4: Generate and Upsert Embeddings

Now we will embed a small corpus and load it into Qdrant. In a real app, your documents come from a database, S3, or a crawl pipeline. For clarity, we will hard-code five short articles.

# upsert_data.py
import os
import uuid
from dotenv import load_dotenv
from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

load_dotenv()

openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
qdrant = QdrantClient(host="localhost", port=6333)

COLLECTION = "support_articles"
EMBED_MODEL = "text-embedding-3-small"

articles = [
    {"title": "Resetting your password", "body": "Click the forgot password link on the login screen and follow the email.", "category": "account"},
    {"title": "Updating your billing card", "body": "Go to billing settings and replace the saved payment method before the next cycle.", "category": "billing"},
    {"title": "Exporting your data", "body": "Premium plans can export full CSV archives from the data settings page.", "category": "data"},
    {"title": "Two-factor authentication setup", "body": "Enable 2FA from security settings using an authenticator app such as 1Password or Authy.", "category": "account"},
    {"title": "Invoicing for annual plans", "body": "Annual subscriptions are billed upfront and invoices appear in billing history within 24 hours.", "category": "billing"},
]

def embed(texts: list[str]) -> list[list[float]]:
    response = openai_client.embeddings.create(model=EMBED_MODEL, input=texts)
    return [item.embedding for item in response.data]

texts = [f"{a['title']}. {a['body']}" for a in articles]
vectors = embed(texts)

points = [
    PointStruct(
        id=str(uuid.uuid4()),
        vector=vector,
        payload={"title": a["title"], "body": a["body"], "category": a["category"]},
    )
    for vector, a in zip(vectors, articles)
]

qdrant.upsert(collection_name=COLLECTION, points=points)
print(f"Upserted {len(points)} points")

A few points worth noting. The embedding call batches all texts in one request, which saves a network round trip per document and is noticeably faster than embedding one at a time (the token cost is the same either way). The id field accepts unsigned integers or UUID strings; UUIDs are safer when documents originate from multiple sources because you avoid collisions. Furthermore, the payload is plain JSON, so you can store anything serializable: timestamps, user IDs, tags, full document text.
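One caveat about the script above: because it calls uuid.uuid4(), every run generates fresh IDs, so re-running it adds five new points instead of overwriting the old ones. A minimal sketch of deterministic IDs using uuid5 follows. The namespace URL, the source labels, and the helper name are assumptions made for this example.

```python
import uuid

# Hypothetical namespace for this corpus; any fixed UUID works.
ARTICLE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.com/support-articles")


def stable_point_id(source: str, doc_key: str) -> str:
    """Derive the same UUID every time for the same (source, key) pair."""
    return str(uuid.uuid5(ARTICLE_NAMESPACE, f"{source}:{doc_key}"))


first = stable_point_id("helpdesk", "resetting-your-password")
again = stable_point_id("helpdesk", "resetting-your-password")
other = stable_point_id("wiki", "resetting-your-password")
print(first == again, first == other)  # True False
```

With stable IDs, upsert behaves like a true upsert: the same document always lands on the same point, and documents from different sources still do not collide.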

For deeper coverage of how you should chunk longer documents before embedding, see our RAG chunking strategies deep dive.

Run it:

python upsert_data.py
# Expected output: Upserted 5 points
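Sending every text in one embedding request works for five articles, but larger corpora need to be split into chunks to stay under the provider's per-request limits. The sketch below is one way to do that. The batch size of 100 is an arbitrary assumption, and embed_fn stands for any function with the same shape as the embed helper in upsert_data.py.

```python
from collections.abc import Callable, Iterator, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[Sequence[str]], list[list[float]]],
    batch_size: int = 100,
) -> list[list[float]]:
    """Embed texts chunk by chunk, preserving the input order."""
    vectors: list[list[float]] = []
    for chunk in batched(texts, batch_size):
        vectors.extend(embed_fn(chunk))
    return vectors
```

Because order is preserved, the zip(vectors, articles) pattern from the upsert script keeps working unchanged.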

Step 5: Run Vector Searches

With data loaded, semantic search is one call:

# search.py
import os
from dotenv import load_dotenv
from openai import OpenAI
from qdrant_client import QdrantClient

load_dotenv()
openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
qdrant = QdrantClient(host="localhost", port=6333)

COLLECTION = "support_articles"

def embed_query(text: str) -> list[float]:
    return openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=[text],
    ).data[0].embedding

query = "How do I change my credit card on file?"
query_vector = embed_query(query)

results = qdrant.query_points(
    collection_name=COLLECTION,
    query=query_vector,
    limit=3,
).points

for hit in results:
    print(f"score={hit.score:.4f} title={hit.payload['title']}")

Expected output (scores will vary by a small amount):

score=0.6512 title=Updating your billing card
score=0.4187 title=Invoicing for annual plans
score=0.2031 title=Two-factor authentication setup

Notably, the top hit is the billing card article even though the query said “credit card” and the document said “payment method”. This is the whole point of vector search — it matches on meaning, not on tokens.

The query_points method was introduced in qdrant-client 1.10 as the replacement for the older search method. If you see client.search(...) in older tutorials, treat it as deprecated: depending on your client version it may emit a deprecation warning or be missing entirely. query_points is the supported API going forward and unifies dense, sparse, and hybrid queries under one interface.

Step 6: Add Filtering with Payloads

Production RAG almost always needs filters. A multi-tenant app must restrict results to the current tenant. A versioned docs site must restrict to the current product version. In Qdrant, filters live alongside the vector query:

# Continues search.py: reuses qdrant, COLLECTION, and query_vector from Step 5.
from qdrant_client.models import Filter, FieldCondition, MatchValue

billing_only = Filter(
    must=[FieldCondition(key="category", match=MatchValue(value="billing"))]
)

results = qdrant.query_points(
    collection_name=COLLECTION,
    query=query_vector,
    query_filter=billing_only,
    limit=3,
).points

for hit in results:
    print(f"score={hit.score:.4f} title={hit.payload['title']}")

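This returns only billing articles, ranked by semantic similarity. On a five-point collection the filter costs nothing measurable. However, once the collection grows, or you start filtering on high-cardinality keys (tenant IDs across thousands of tenants, for example), build an explicit payload index on every field you filter by. The call below indexes category, the only filter field in our sample data; the same call works for a tenant_id field: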

from qdrant_client.models import PayloadSchemaType

qdrant.create_payload_index(
    collection_name=COLLECTION,
    field_name="category",
    field_schema=PayloadSchemaType.KEYWORD,
)

Without the index, Qdrant still works but falls back to scanning more candidates than necessary. With the index, filtered searches stay sub-10ms even on collections with millions of points.
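The tenant-and-date query mentioned earlier ("tenant_id = 42 and published_at > 2025-01-01") can be sketched as a filter too. The example builds it as the JSON body the REST API accepts, which the Python models (Filter, FieldCondition, and friends) mirror. Both field names are hypothetical, since the sample payload in this tutorial has neither, and datetime ranges assume a reasonably recent Qdrant version.

```python
import json
from datetime import datetime, timezone


def tenant_filter(tenant_id: int, published_after: datetime) -> dict:
    """Build a filter body: tenant_id == X AND published_at > cutoff."""
    return {
        "must": [
            # Exact match on the (hypothetical) tenant_id payload field.
            {"key": "tenant_id", "match": {"value": tenant_id}},
            # Strictly-greater-than range on an RFC 3339 timestamp field.
            {"key": "published_at", "range": {"gt": published_after.isoformat()}},
        ]
    }


cutoff = datetime(2025, 1, 1, tzinfo=timezone.utc)
body = tenant_filter(42, cutoff)
print(json.dumps(body, indent=2))
```

Each field used in a filter like this is a candidate for its own payload index.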

For more advanced retrieval techniques, including reranking the top-k results with a cross-encoder, see our RAG reranking with Cohere tutorial.

Step 7: Wrap It in a FastAPI Service

Real applications expose retrieval over HTTP. Here is a minimal FastAPI service that does embedding plus search in a single endpoint.

# main.py
import os
from contextlib import asynccontextmanager
from dotenv import load_dotenv
from fastapi import FastAPI, HTTPException
from openai import OpenAI
from pydantic import BaseModel
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue

load_dotenv()

clients: dict = {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    clients["openai"] = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    clients["qdrant"] = QdrantClient(host="localhost", port=6333)
    yield
    clients.clear()

app = FastAPI(lifespan=lifespan)

class SearchRequest(BaseModel):
    query: str
    category: str | None = None
    limit: int = 3

class SearchHit(BaseModel):
    title: str
    body: str
    score: float

@app.post("/search", response_model=list[SearchHit])
def search(req: SearchRequest):
    try:
        vector = clients["openai"].embeddings.create(
            model="text-embedding-3-small",
            input=[req.query],
        ).data[0].embedding
    except Exception as exc:
        raise HTTPException(status_code=502, detail=f"embedding failed: {exc}") from exc

    query_filter = None
    if req.category:
        query_filter = Filter(
            must=[FieldCondition(key="category", match=MatchValue(value=req.category))]
        )

    points = clients["qdrant"].query_points(
        collection_name="support_articles",
        query=vector,
        query_filter=query_filter,
        limit=req.limit,
    ).points

    return [
        SearchHit(title=p.payload["title"], body=p.payload["body"], score=p.score)
        for p in points
    ]

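FastAPI and uvicorn were not part of the Step 2 install, so add them first with pip install fastapi uvicorn. Then run the service with uvicorn main:app --reload and hit it: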

curl -X POST http://localhost:8000/search \
  -H "Content-Type: application/json" \
  -d '{"query": "update payment", "category": "billing"}'

A few things make this snippet production-shaped rather than toy. The lifespan context manager keeps both clients alive across requests, which avoids per-request connection overhead. Pydantic validates input, so a malformed body returns a clean 422 instead of a stack trace. The embedding call is wrapped in a try/except that surfaces upstream failures as a 502, so callers can distinguish “OpenAI is down” from “I gave bad input”.

What is intentionally missing is auth, rate limiting, and retries. Those belong in middleware or an upstream API gateway, not in the search handler. Keep handlers focused.
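As an illustration of what a retry layer outside the handler could look like, here is a small standard-library sketch of exponential backoff with jitter. The helper name, the default of three attempts, and the 0.5 s base delay are assumptions. A real service would catch only errors it knows to be transient instead of every Exception.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    attempts: int = 3,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying on any exception with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except Exception:
            if attempt == attempts:
                raise  # Out of attempts: let the caller see the last error.
            # Double the wait each time and add jitter to avoid retry bursts.
            delay = base_delay_s * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))
    raise AssertionError("unreachable")
```

A caller would wrap the upstream request in a zero-argument function or lambda and pass it to with_retries, keeping the search handler itself free of retry logic.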

Deploying to Qdrant Cloud

When you are ready to leave localhost, the smallest viable production setup is Qdrant Cloud’s free 1 GB cluster. Sign up at cloud.qdrant.io, create a cluster, and grab the URL plus API key. Then update your client:

qdrant = QdrantClient(
    url="https://your-cluster.gcp.cloud.qdrant.io:6333",
    api_key=os.environ["QDRANT_API_KEY"],
)

The rest of your code is unchanged — same query_points, same filters, same payloads. This portability is one of Qdrant’s quiet wins compared to vendor-locked alternatives.
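One way to keep a single code path for both environments is to derive the client arguments from environment variables. This is a sketch under assumptions: QDRANT_URL is a hypothetical variable name introduced here, while QDRANT_API_KEY matches the snippet above.

```python
import os
from collections.abc import Mapping


def qdrant_client_kwargs(env: Mapping[str, str] = os.environ) -> dict:
    """Return QdrantClient keyword arguments for cloud or local use."""
    url = env.get("QDRANT_URL")  # hypothetical variable, set only in cloud deploys
    if url:
        # Cloud clusters need the API key; a missing key fails loudly here.
        return {"url": url, "api_key": env["QDRANT_API_KEY"]}
    # No URL configured: fall back to the local Docker container.
    return {"host": "localhost", "port": 6333}


print(qdrant_client_kwargs({}))
```

In application code you would then write QdrantClient(**qdrant_client_kwargs()) and leave everything else untouched.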

For self-hosting on your own infrastructure, the official Helm chart deploys a clustered Qdrant in a few minutes. However, do not run a single-node Qdrant on a spot VM in production — losing the storage volume loses the collection. Either use Qdrant Cloud, run with replicated storage, or accept the operational responsibility.

Real-World Scenario

Consider a mid-sized SaaS company building an in-app support search feature. They have roughly 50,000 knowledge base articles spread across 800 customer workspaces, and each workspace is allowed to see only its own articles plus the global public ones. The team built v1 on Pinecone, then migrated to Qdrant six months later when their search bill crossed $1,200/month and they wanted on-prem options for a regulated enterprise customer.

The migration itself took roughly two weeks. The schema mapped cleanly (each Pinecone vector became a Qdrant point with the same payload), but two things surprised the team during cutover. First, they had been doing tenant filtering as a post-filter in Pinecone, which silently degraded recall when many results were filtered out. Qdrant’s pre-filter exposed how poor the recall actually was. They had to increase the candidate pool (a larger limit plus a higher hnsw_ef search parameter on the HNSW index) to restore quality. Second, payload indexes on tenant_id were non-negotiable. Without one, p99 search latency tripled during peak hours.
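The recall problem with post-filtering is easy to reproduce with a toy ranking. The sketch below is plain Python over made-up data, not a model of how either database filters internally. It only shows why truncating to the top results before applying the tenant filter loses matches.

```python
# Twelve hypothetical hits, already sorted by similarity (best first).
# Only every fourth hit belongs to the tenant we are searching for.
ranked = [(f"doc{i}", "tenant_a" if i % 4 == 0 else "tenant_b") for i in range(12)]


def post_filter(hits: list[tuple[str, str]], tenant: str, limit: int) -> list[str]:
    """Take the top `limit` hits first, then drop other tenants."""
    return [doc for doc, owner in hits[:limit] if owner == tenant]


def pre_filter(hits: list[tuple[str, str]], tenant: str, limit: int) -> list[str]:
    """Drop other tenants first, then take the top `limit` hits."""
    return [doc for doc, owner in hits if owner == tenant][:limit]


print(post_filter(ranked, "tenant_a", limit=3))  # ['doc0']
print(pre_filter(ranked, "tenant_a", limit=3))   # ['doc0', 'doc4', 'doc8']
```

The post-filter variant asked for three results and returned one, even though three matching documents exist further down the ranking.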

The end state: median search latency around 12 ms on a three-node cluster, hosting cost roughly a third of the previous Pinecone bill, and the regulated customer got their on-prem deployment. The trade-off was operational: someone on the team now owns Qdrant uptime instead of an SLA from a vendor.

When to Use Qdrant

  • You want a self-hostable vector database with first-class payload filtering
  • You need predictable low-latency search on 100K to 100M+ vectors
  • Your team is comfortable running a stateful service (or you accept Qdrant Cloud)
  • You need both dense and sparse vectors in the same store for hybrid search
  • You want a Python client that mirrors a stable REST API one-to-one

When NOT to Use Qdrant

  • Your dataset is under 10K vectors and lives next to a Postgres database — pgvector on Postgres is simpler and one less service to operate
  • You want zero ops and have budget — managed Pinecone is hard to beat for pure convenience
  • You need full-text search and document storage in the same engine — Elasticsearch or OpenSearch fit that better
  • Your queries are mostly exact-match keyword lookups, not semantic similarity — a vector store is the wrong tool

Common Mistakes with Qdrant

  • Forgetting to create payload indexes on filter fields, which silently slows queries as the collection grows
  • Using recreate_collection in production code paths and wiping real data
  • Embedding query and documents with different models, producing meaningless similarity scores
  • Storing huge raw documents inside the payload instead of an object store reference, which inflates RAM usage
  • Running single-node Qdrant on ephemeral storage and discovering the failure mode the hard way
  • Skipping the dashboard at :6333/dashboard when debugging — it is the fastest way to inspect what is actually stored

Conclusion

Qdrant gives you a fast, filterable vector database with a Python client that gets out of the way. With the steps above — Docker container, collection, embeddings, filtered search, FastAPI wrapper — you have a working semantic search service that scales from prototype to production with mostly configuration changes. The Qdrant Python integration story is genuinely clean, which is rare in this space.

For the next step, layer reranking on top of your top-k results to push relevance higher, or add sparse vectors for true hybrid search. If you have not yet picked a vector store, our vector databases compared post walks through how Qdrant stacks up against Pinecone, Weaviate, and Chroma side by side.
