
OpenAI vs Voyage vs Cohere Embeddings for Production RAG

If you are building retrieval-augmented generation (RAG) in 2026, the embedding model you pick quietly drives almost everything downstream: recall quality, index size, latency, and your bill. OpenAI vs Voyage vs Cohere embeddings is the practical short-list most production teams end up debating, because each of these providers ships a hosted API, supports multiple dimensions, and updates frequently enough that yesterday’s benchmark is already stale.

This guide is for intermediate to senior engineers choosing an embedding provider for a real workload, not for tutorials. You will get a side-by-side comparison of the current flagship models, what the MTEB leaderboard actually tells you (and what it hides), where each provider quietly wins, and a decision framework you can apply to your own data. By the end, you will know which of OpenAI’s text-embedding-3, Voyage’s voyage-3.5 family, or Cohere’s embed-v4 makes sense for your domain, your budget, and the constraints you already live with.

How Embedding Providers Actually Differ

Every embedding API does the same thing on the surface: it turns text into a fixed-size vector you can search with cosine similarity. The differences show up once you push past prototype scale.
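
To make "search with cosine similarity" concrete, here is a minimal sketch in plain Python. The helper names (cosine_similarity, top_k) and the toy 3-dimensional vectors are illustrative assumptions, not part of any provider SDK. A real system would delegate this step to a vector store.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)

def top_k(query: list[float], docs: dict[str, list[float]], k: int = 3) -> list[tuple[str, float]]:
    """Brute-force nearest neighbours: score every document, keep the k best."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy vectors standing in for real embeddings (hypothetical data).
toy_docs = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [0.7, 0.7, 0.0],
}
results = top_k([1.0, 0.1, 0.0], toy_docs, k=2)
print([doc_id for doc_id, _ in results])
```

Brute force like this is fine for a few thousand vectors. Past that, an approximate index such as HNSW takes over, which is where dimension count starts to drive memory cost.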

First, dimensionality and Matryoshka support. OpenAI and Cohere both expose adjustable dimensions, letting you trade quality for index size on the same model. Voyage’s newer models support a similar truncation approach. This matters when your vector store cost is dominated by RAM, not API calls.
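
Conceptually, Matryoshka truncation keeps the leading components of a vector and rescales what is left to unit length. The sketch below shows that operation client-side. It assumes the model was trained for Matryoshka-style truncation; on other models, dropping trailing dimensions can hurt quality badly. The function name is hypothetical, and the hosted APIs normally do this for you through their dimension parameters.

```python
import math

def truncate_and_renormalize(vector: list[float], dims: int) -> list[float]:
    """Keep the first `dims` components and rescale to unit length.

    Assumption: `vector` comes from a Matryoshka-trained model, so the
    leading components carry most of the signal.
    """
    if not 0 < dims <= len(vector):
        raise ValueError("dims must be between 1 and len(vector)")
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to rescale
    return [x / norm for x in head]

full = [0.6, 0.0, 0.8, 0.0]                  # toy unit-length vector
short = truncate_and_renormalize(full, 2)    # keeps [0.6, 0.0], then rescales
print(short)
```

Renormalizing matters when your vector store uses dot product rather than cosine similarity, because a truncated vector is no longer unit length.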

Second, domain specialization. Voyage publishes domain-tuned variants (voyage-code-3, voyage-law-2, voyage-finance-2) that beat general-purpose models on their target corpora. OpenAI ships a single model family and expects you to fine-tune downstream. Cohere sits in the middle with a strong multilingual model and a code-focused variant.

Third, input modalities and context length. Cohere’s embed-v4 accepts both text and images and stretches to 128K tokens of input context. OpenAI’s models cap at 8,192 tokens and are text-only. Voyage caps at 32,000 tokens for most models. If you embed long contracts or multimodal documents, this constraint alone narrows the choice.
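
One way to apply this constraint is to filter the candidate list before any quality evaluation. The sketch below hardcodes the limits quoted in this article; the ModelLimits class and viable_models helper are hypothetical names, and the numbers should be checked against each provider's current documentation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelLimits:
    name: str
    max_input_tokens: int
    accepts_images: bool

# Limits as quoted in this article; verify before relying on them.
CATALOG = [
    ModelLimits("text-embedding-3-large", 8_192, False),
    ModelLimits("voyage-3.5", 32_000, False),
    ModelLimits("embed-v4.0", 128_000, True),
]

def viable_models(longest_input_tokens: int, needs_images: bool) -> list[str]:
    """Return the models whose hard limits fit the workload."""
    return [
        m.name
        for m in CATALOG
        if m.max_input_tokens >= longest_input_tokens
        and (m.accepts_images or not needs_images)
    ]

print(viable_models(longest_input_tokens=20_000, needs_images=False))
print(viable_models(longest_input_tokens=50_000, needs_images=False))
```

Running the hard constraints first keeps the later recall evaluation focused on models you could actually ship.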

Finally, regional availability and compliance. OpenAI runs in fewer regions than Cohere, which has invested heavily in enterprise residency options. Voyage is now owned by MongoDB, which changes the integration story if you already run Atlas Vector Search.

OpenAI Embeddings: The Safe Default

OpenAI ships two general-purpose models you will hit in production: text-embedding-3-small and text-embedding-3-large. Both support Matryoshka truncation, so you can request fewer dimensions and trade a small amount of recall for a smaller index.

text-embedding-3-small produces 1,536 dimensions by default and costs roughly $0.02 per million tokens. text-embedding-3-large produces 3,072 dimensions and costs around $0.13 per million tokens. The large model scores around 64.6 on MTEB, putting it in the upper-middle of the leaderboard but well short of top open-source models on retrieval-only tasks.

from openai import OpenAI

client = OpenAI()

def embed_openai(texts: list[str], dimensions: int = 1024) -> list[list[float]]:
    """Embed a batch of texts using text-embedding-3-small with truncated dimensions."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
        dimensions=dimensions,
    )
    return [item.embedding for item in response.data]

vectors = embed_openai(["What is RAG?", "How does vector search work?"])
print(len(vectors[0]))  # 1024

The dimensions parameter is the lever most teams overlook. Truncating text-embedding-3-small from 1,536 to 512 cuts your vector storage by two-thirds while only losing about 2 MTEB points. For a 10-million-chunk index stored as float32, that is roughly 41 GB of raw vector data saved (about 61 GB down to about 20 GB), before any index overhead.
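
The storage arithmetic is easy to check yourself. This sketch assumes 4-byte float32 values and counts only the raw vectors; index structures such as HNSW graphs and metadata add overhead on top. The function name is illustrative.

```python
def raw_vector_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw storage for dense vectors, ignoring index and metadata overhead."""
    return num_vectors * dims * bytes_per_value

NUM_CHUNKS = 10_000_000
full_gb = raw_vector_bytes(NUM_CHUNKS, 1536) / 1e9   # default size of text-embedding-3-small
short_gb = raw_vector_bytes(NUM_CHUNKS, 512) / 1e9   # truncated size
saved_gb = full_gb - short_gb

print(f"{full_gb:.2f} GB -> {short_gb:.2f} GB (saves {saved_gb:.2f} GB)")
```

Swap in your own chunk count and dimension choices to see where your index lands before you commit to a vector store tier.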

Where OpenAI shines is operational simplicity. The API is rock-solid, the SDK is in every language, and the rate limits are generous for tier-3 accounts. Where it struggles is domain specialization: a code-heavy or legal-heavy corpus will see noticeably worse recall than with a Voyage domain model.

Voyage AI: Quality Leader for Specialized Domains

Voyage AI publishes the strongest general-purpose embedding model on MTEB at the time of writing, with voyage-3-large scoring around 74 on the retrieval split. Its domain variants (voyage-code-3, voyage-law-2, voyage-finance-2) beat both OpenAI and Cohere on their respective benchmarks by 5-10 points.

The flagship general model is voyage-3.5. It outputs 1,024 dimensions by default, supports 32,000 tokens of input, and costs around $0.06 per million tokens. Voyage also exposes voyage-3.5-lite for cheaper bulk embedding at roughly $0.02 per million tokens.

import voyageai

vo = voyageai.Client()

def embed_voyage(texts: list[str], model: str = "voyage-3.5") -> list[list[float]]:
    """Embed with Voyage, using the appropriate input_type for retrieval."""
    result = vo.embed(
        texts=texts,
        model=model,
        input_type="document",  # use "query" at search time
        output_dimension=1024,
    )
    return result.embeddings

doc_vectors = embed_voyage(["Postgres pgvector supports HNSW indexes."])

One detail teams miss with Voyage: the input_type parameter actually changes the embedding. Passing "document" at index time and "query" at search time produces noticeably better retrieval than treating both the same way. OpenAI does not do this; Cohere does (and calls it input_type as well).

Voyage’s downsides are real. The provider is smaller, the API has had occasional capacity issues during model launches, and the MongoDB acquisition has nudged the roadmap toward Atlas Vector Search integration. If you are already on MongoDB Atlas, that is a feature. If you run pgvector with Postgres or Qdrant, it is neutral at best.

Cohere Embeddings: Multilingual and Multimodal

Cohere’s flagship is embed-v4.0, which handles text and images in a single model and stretches to 128,000 tokens of input context. Default output is 1,536 dimensions with Matryoshka support down to 256. Pricing sits around $0.12 per million tokens, putting it between OpenAI’s small and large tiers.

Cohere’s traditional strength is multilingual retrieval. embed-multilingual-v3.0 covers more than 100 languages with shared embedding space, meaning a French query can retrieve English documents directly. OpenAI’s models handle multilingual queries, but their cross-lingual retrieval is consistently weaker on benchmarks like MIRACL.

import cohere

co = cohere.ClientV2()  # output_dimension and embedding_types are v2 embed options

def embed_cohere(texts: list[str], input_type: str = "search_document") -> list[list[float]]:
    """Embed with Cohere embed-v4, with input_type set for retrieval."""
    response = co.embed(
        texts=texts,
        model="embed-v4.0",
        input_type=input_type,  # "search_document" or "search_query"
        output_dimension=1024,
        embedding_types=["float"],
    )
    # In the v5 Python SDK the float vectors are exposed as `float_`
    # (the trailing underscore maps to the JSON key "float").
    return response.embeddings.float_

vectors = embed_cohere(["Production RAG requires monitoring retrieval quality."])

Of the three flagship models compared here, Cohere’s is the only one that accepts image inputs in the same call as text. For multimodal RAG over slide decks, product catalogs, or scanned documents, this collapses what would otherwise be a two-pipeline architecture into one.

The trade-off is cost at scale. For a pure-text English workload, Cohere is more expensive than voyage-3.5 while scoring slightly lower on English retrieval benchmarks. Choose Cohere when multilingual or multimodal support is an actual requirement, not as a default.

OpenAI vs Voyage vs Cohere: At a Glance

| Feature | OpenAI text-embedding-3-large | Voyage voyage-3.5 | Cohere embed-v4.0 |
| --- | --- | --- | --- |
| Default dimensions | 3,072 | 1,024 | 1,536 |
| Matryoshka truncation | Yes (down to 256) | Yes (256, 512, 1024) | Yes (down to 256) |
| Max input tokens | 8,192 | 32,000 | 128,000 |
| Modalities | Text only | Text only | Text + image |
| MTEB retrieval score | ~64.6 | ~74.0 | ~67 |
| Price per 1M tokens | $0.13 | $0.06 | $0.12 |
| Domain-specialized models | No | Yes (code, law, finance) | Yes (code) |
| Strong multilingual | Moderate | Moderate | Yes (100+ languages) |
| Input type parameter | No | Yes | Yes |

Read this table as a quick answer to the “which embedding model is best” question, with one caveat: MTEB scores are a directional signal, not a verdict. Your domain corpus may rank these models in a completely different order.

Benchmarks Are a Starting Point, Not an Answer

MTEB is useful for ruling out weak models, but the leaderboard rewards general English retrieval and underweights the things that matter in production: long-document handling, domain vocabulary, latency under burst load, and behavior on noisy real-world chunks. A model that scores 74 on MTEB can lose to one that scores 67 on your specific corpus.

The only reliable answer is to evaluate on your own data. Build a small labeled set — 50 to 200 queries with known-relevant chunks — and measure recall@k for each provider. A homemade script with pandas and your retrieval client will get you there in an afternoon. If the difference is under two percentage points, default to the cheaper or operationally simpler option.
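
Here is a minimal sketch of that measurement in plain Python. It uses the "fraction of relevant chunks found in the top k" definition of recall@k; some teams use a hit-rate definition instead (at least one relevant chunk in the top k), so pick one and apply it to every provider. The query IDs, chunk IDs, and function names are made up for illustration.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the known-relevant chunks that appear in the top k results."""
    if not relevant:
        raise ValueError("each query needs at least one relevant chunk")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)

def mean_recall_at_k(runs: dict[str, list[str]], labels: dict[str, set[str]], k: int) -> float:
    """Average recall@k over every labeled query."""
    scores = [recall_at_k(runs[query_id], labels[query_id], k) for query_id in labels]
    return sum(scores) / len(scores)

# Hypothetical labeled set: query id -> ids of chunks known to be relevant.
labels = {"q1": {"c1", "c2"}, "q2": {"c9"}}
# Hypothetical ranked results from one provider: query id -> chunk ids, best first.
runs = {"q1": ["c1", "c5", "c2", "c7"], "q2": ["c3", "c4", "c9"]}

print(mean_recall_at_k(runs, labels, k=2))
print(mean_recall_at_k(runs, labels, k=3))
```

Produce one `runs` dictionary per provider from the same labeled queries and index contents, then compare the means. With only 50 to 200 queries, differences of a point or two are within noise.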

Another factor benchmarks hide is chunking interaction. A model with a 32K input window can embed a full document, which sometimes outperforms multiple 512-token chunks of the same content, especially for summary-style queries. If your chunking strategy produces small fragments, OpenAI’s 8K limit is enough. If you embed legal contracts whole and they run past 32K tokens, Cohere’s 128K window is the only option of the three.
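
To compare whole-document embedding against small chunks in your own eval, you need a chunker whose size you can vary. The sketch below is deliberately crude: it splits on whitespace and treats words as a stand-in for tokens, which is an assumption (real token counts depend on each provider's tokenizer). The function name and parameters are hypothetical.

```python
def chunk_words(text: str, max_words: int = 512, overlap: int = 0) -> list[str]:
    """Split text into windows of at most `max_words` whitespace-separated words.

    `overlap` words are repeated between consecutive chunks.
    Words are only an approximation of model tokens.
    """
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks

sample = " ".join(f"w{i}" for i in range(10))
print(chunk_words(sample, max_words=4, overlap=1))
```

Setting max_words above the document length yields a single chunk, which gives you the whole-document baseline from the same code path.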

Cost Math for Production-Scale RAG

For a workload ingesting 100 million tokens per month and serving 5 million query tokens, the embedding cost difference is real but rarely dominant.

  • OpenAI text-embedding-3-small: ~$2.10/month
  • OpenAI text-embedding-3-large: ~$13.65/month
  • Voyage voyage-3.5: ~$6.30/month
  • Voyage voyage-3.5-lite: ~$2.10/month
  • Cohere embed-v4.0: ~$12.60/month

At this scale, embedding API cost is dwarfed by your LLM inference bill and your vector database hosting. The real cost lever is dimension count, because storage and index memory scale linearly with vector size. Cutting from 3,072 to 1,024 dimensions on a 10-million-vector index saves roughly 80 GB of RAM for float32 vectors (10 million vectors × 2,048 dimensions × 4 bytes) in a typical Pinecone Serverless or Weaviate setup, which moves the monthly bill far more than the embedding API itself.
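
The monthly API figures above are plain multiplication, and a few lines of Python let you rerun them for your own volumes. The prices are the approximate per-million-token rates quoted in this article, not live pricing; the dictionary and function names are illustrative.

```python
# Approximate USD prices per 1M tokens, as quoted in this article.
PRICE_PER_MILLION_TOKENS = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
    "voyage-3.5": 0.06,
    "voyage-3.5-lite": 0.02,
    "embed-v4.0": 0.12,
}

def monthly_embedding_cost(model: str, ingest_tokens: int, query_tokens: int) -> float:
    """Embedding API cost in USD for one month, rounded to cents."""
    total_millions = (ingest_tokens + query_tokens) / 1_000_000
    return round(total_millions * PRICE_PER_MILLION_TOKENS[model], 2)

costs = {
    model: monthly_embedding_cost(model, ingest_tokens=100_000_000, query_tokens=5_000_000)
    for model in PRICE_PER_MILLION_TOKENS
}
for model, usd in costs.items():
    print(f"{model}: ${usd:.2f}/month")
```

Note that a full reindex is a one-off cost on top of this: re-embedding your entire corpus, priced at the same per-token rate.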

In short: pick the embedding model that gives you the best recall, then truncate dimensions to control infrastructure cost. Do not optimize for per-token embedding price unless you are reindexing petabytes nightly.

Switching Providers Without Rewriting Your Pipeline

A common production mistake is hardcoding one provider’s SDK throughout the codebase. Wrap the embedding call behind a thin interface from day one, so swapping providers during evaluation is a one-line change.

from typing import Protocol

class EmbeddingProvider(Protocol):
    def embed_documents(self, texts: list[str]) -> list[list[float]]: ...
    def embed_query(self, text: str) -> list[float]: ...

class VoyageProvider:
    def __init__(self, client, model: str = "voyage-3.5", dim: int = 1024):
        self.client = client
        self.model = model
        self.dim = dim

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        result = self.client.embed(
            texts=texts, model=self.model,
            input_type="document", output_dimension=self.dim,
        )
        return result.embeddings

    def embed_query(self, text: str) -> list[float]:
        result = self.client.embed(
            texts=[text], model=self.model,
            input_type="query", output_dimension=self.dim,
        )
        return result.embeddings[0]

The Protocol pattern lets your retrieval code stay provider-agnostic. When you decide to A/B test Cohere against Voyage, you write one more class and swap it in the factory. This pairs well with frameworks like LlamaIndex or LangChain, both of which already abstract embeddings but reward you for thinking in this shape.
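
The factory side of this can be as small as a dictionary lookup. The sketch below is one possible shape, not a prescribed one: it repeats the Protocol so the snippet runs on its own, and registers a hypothetical FakeProvider that returns deterministic, semantically meaningless vectors. A fake like this is useful for unit-testing retrieval code without network calls; real provider classes would be registered the same way.

```python
import hashlib
from typing import Callable, Protocol

class EmbeddingProvider(Protocol):
    def embed_documents(self, texts: list[str]) -> list[list[float]]: ...
    def embed_query(self, text: str) -> list[float]: ...

class FakeProvider:
    """Offline stand-in for tests. Vectors are hash-derived, NOT semantic."""

    def __init__(self, dim: int = 8):
        if not 0 < dim <= 32:
            raise ValueError("dim must be 1..32 (a SHA-256 digest has 32 bytes)")
        self.dim = dim

    def _embed(self, text: str) -> list[float]:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        return [digest[i] / 255.0 for i in range(self.dim)]

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        return [self._embed(t) for t in texts]

    def embed_query(self, text: str) -> list[float]:
        return self._embed(text)

# Registry of zero-argument constructors; add real providers here.
PROVIDERS: dict[str, Callable[[], EmbeddingProvider]] = {
    "fake": FakeProvider,
}

def make_provider(name: str) -> EmbeddingProvider:
    """Look up a provider by its configured name."""
    try:
        factory = PROVIDERS[name]
    except KeyError:
        raise ValueError(f"unknown embedding provider: {name!r}") from None
    return factory()

provider = make_provider("fake")
print(len(provider.embed_query("What is RAG?")))
```

With the provider name coming from configuration, an A/B test between two providers becomes a config change plus a separate index per provider.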

When to Use Each Embedding Provider

Pick OpenAI When…

  • You want the lowest-friction integration with strong general English performance
  • Your team is already on the OpenAI stack and minimizing vendor sprawl matters
  • Your chunks fit comfortably under 8K tokens (most do)
  • You prefer dimension truncation as your main cost lever
  • You need predictable rate limits and battle-tested SDKs

Pick Voyage When…

  • Your corpus is code, legal, financial, or another domain Voyage has a tuned model for
  • You need the absolute best retrieval quality on a general English corpus
  • You want 32K token windows without paying Cohere prices
  • You are on MongoDB Atlas and want native vector integration

Pick Cohere When…

  • You serve a multilingual user base and need strong cross-lingual retrieval
  • Your documents include images, charts, or scanned PDFs you want embedded directly
  • You embed very long documents (50K+ tokens) and chunking is undesirable
  • You have enterprise compliance requirements that Cohere already meets in your region

When NOT to Use These Embedding APIs

  • You are processing data that cannot leave your network and self-hosting open models like bge-large-en-v1.5 or mxbai-embed-large-v1 is feasible
  • Your latency budget is under 50ms per query and the round-trip to a hosted API is too expensive
  • You are running an offline batch job at petabyte scale where self-hosted GPU inference becomes cheaper than per-token API pricing
  • Your use case is keyword search, not semantic search — in which case BM25 in your existing search engine is faster and more interpretable

Common Mistakes When Choosing Embeddings

  • Picking by MTEB score alone. The leaderboard is general English retrieval. Your corpus is not. Always run a small in-domain eval before committing.
  • Ignoring input_type parameters. Voyage and Cohere both produce better retrieval when you tag documents and queries differently. OpenAI has no such parameter, but on the providers that do, swapping or omitting the tags silently hurts recall.
  • Hardcoding the provider SDK. Wrap embeddings behind an interface so you can swap providers during evaluation without touching retrieval logic.
  • Using maximum dimensions by default. Matryoshka truncation usually loses 1-3 points of MTEB while saving 50-75% of vector storage. Start truncated, and expand only if your eval demands it.
  • Embedding once, never re-embedding. Embedding models improve fast. Plan for a full reindex every 6-12 months as a normal operating cost, not a crisis.
  • Mixing models in one index. Embeddings from different providers (or different model versions) are not comparable. Any reindex is a full reindex.
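
The last mistake is cheap to guard against in code. One possible approach, sketched below under the assumption that your vector store lets you keep a small metadata record next to each index, is to store which provider, model, and dimension count produced the vectors and refuse writes or queries from anything else. EmbeddingSpec, IndexSpecMismatch, and assert_compatible are hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EmbeddingSpec:
    """Identifies exactly which embedding setup produced a set of vectors."""
    provider: str
    model: str
    dims: int

class IndexSpecMismatch(Exception):
    """Raised when vectors from one setup are about to touch another setup's index."""

def assert_compatible(index_spec: EmbeddingSpec, caller_spec: EmbeddingSpec) -> None:
    """Fail loudly instead of silently mixing incomparable vectors."""
    if index_spec != caller_spec:
        raise IndexSpecMismatch(
            f"index was built with {index_spec}, caller is using {caller_spec}; "
            "reindex fully before switching"
        )

index_spec = EmbeddingSpec("voyage", "voyage-3.5", 1024)

assert_compatible(index_spec, EmbeddingSpec("voyage", "voyage-3.5", 1024))  # passes silently

try:
    assert_compatible(index_spec, EmbeddingSpec("openai", "text-embedding-3-small", 1024))
except IndexSpecMismatch as err:
    print(f"blocked: {err}")
```

Matching dimension counts are not enough on their own: two models can both emit 1,024 dimensions and still live in unrelated vector spaces, which is why the check compares the full spec.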

A Realistic Selection Scenario

A mid-sized SaaS company building an internal documentation search tool over 200,000 markdown files faces this choice. The corpus is English, technical, and includes a lot of code blocks. The team is already on Postgres with pgvector, so vector store choice is locked.

Running a 100-query eval on a labeled subset, they see text-embedding-3-small at 1,024 dimensions hit 78% recall@5, voyage-3.5 at 1,024 dimensions hit 84%, and voyage-code-3 hit 89%. Cohere embed-v4.0 lands at 80%. On a $40/month vector hosting bill, the 11-point recall difference between OpenAI and voyage-code-3 is worth orders of magnitude more than the $4/month embedding cost gap.

They ship Voyage’s code model with truncated 1,024 dimensions. Six months later, when Voyage releases a new code model, the wrapper they put around the SDK means the reindex is a configuration change and a background job, not a refactor. That is the shape every embedding decision should take: low-friction to evaluate, cheap to switch.

Final Recommendation

For most production RAG systems in 2026, the right starting point is Voyage voyage-3.5 for quality-sensitive workloads, OpenAI text-embedding-3-small for operational simplicity and cost, and Cohere embed-v4.0 when multilingual or multimodal is non-negotiable. The MTEB leaderboard is a filter, not an answer; your own labeled eval set is the only authority that matters.

Pick one, wrap it behind an interface, run a small recall@k evaluation on real data, and revisit the decision in six months when the next generation of models lands. If you want to go deeper, read about hybrid search with BM25 and vectors to see how reranking changes the equation, or compare LlamaIndex vs LangChain for the framework layer that sits on top.
