
If you ship LLM features to real users, three problems show up fast: OpenAI returns a 500, your bill doubles because the same prompts run over and over, and you have no idea which request caused that angry Slack message at 2 a.m. The Portkey AI Gateway solves all three from a single proxy layer, without rewriting your application code.
This tutorial walks through setting up Portkey AI Gateway with the OpenAI, Anthropic, and Groq SDKs, configuring semantic caching to cut repeat inference cost, building fallback chains that survive provider outages, and using the built-in observability dashboard to debug production issues. Furthermore, you will learn when Portkey is the right choice versus a self-hosted alternative like LiteLLM.
What Is the Portkey AI Gateway?
Portkey is an LLM gateway that sits between your application and 200+ model providers, adding caching, fallbacks, load balancing, retries, observability, and prompt management without changing your SDK code. Specifically, you keep using the official OpenAI or Anthropic SDK, point its base URL at Portkey, and the gateway routes the request, applies your configured policies, and streams the response back.
The product has two deployment modes. The hosted SaaS handles infrastructure, observability storage, and dashboards for you. The open-source AI Gateway, by contrast, runs as a lightweight Node service or Docker container in your VPC, which is the option to pick when SOC 2 or HIPAA scope rules out a third party in the request path.
Portkey is closest in spirit to other LLM gateways like LiteLLM and Bifrost, but it leans harder into observability and prompt versioning. As a result, teams that want both a routing proxy and a control plane for prompts often pick Portkey over the alternatives.
Why Teams Adopt an LLM Gateway
A gateway is the right answer when at least two of the following are true. First, you are using more than one provider and want a single API surface. Second, your inference cost has crossed the point where caching repeat queries materially helps. Third, you need provider failover for SLAs. Finally, your support team keeps asking, “Can we see what the model said to that user?”
Without a gateway, each of these requirements becomes a separate library, a separate dashboard, and a separate failure mode in your code. With Portkey AI Gateway, they collapse into a single configuration object that ships with every request. Consequently, you spend less time on plumbing and more time on the actual product.
Setting Up Portkey in Five Minutes
The fastest path to a working setup uses the hosted gateway. First, sign up at portkey.ai and create a virtual key for each provider you want to route to. A virtual key stores your real OpenAI or Anthropic key inside Portkey and gives you a short identifier to reference instead.
Next, install the Portkey SDK alongside your existing client library:
```bash
# Python
pip install portkey-ai openai anthropic

# TypeScript / Node
npm install portkey-ai openai @anthropic-ai/sdk
```
Now you have two ways to use the gateway. The simplest is the Portkey SDK, which speaks an OpenAI-compatible interface to every provider:
```python
import os
from portkey_ai import Portkey

client = Portkey(
    api_key=os.environ["PORTKEY_API_KEY"],
    virtual_key=os.environ["PORTKEY_OPENAI_VIRTUAL_KEY"],
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize MapReduce in one sentence."}],
)
print(response.choices[0].message.content)
```
The other option keeps the official OpenAI SDK and only swaps the base URL:
```python
import os
from openai import OpenAI
from portkey_ai import PORTKEY_GATEWAY_URL, createHeaders

client = OpenAI(
    api_key="placeholder",  # Portkey ignores this; the virtual key carries the real one
    base_url=PORTKEY_GATEWAY_URL,
    default_headers=createHeaders(
        api_key=os.environ["PORTKEY_API_KEY"],
        virtual_key=os.environ["PORTKEY_OPENAI_VIRTUAL_KEY"],
    ),
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello"}],
)
```
This pattern matters for migrations. If you already have hundreds of OpenAI SDK calls scattered across a codebase, you can adopt the Portkey AI Gateway by changing two lines in your client constructor. In contrast, full SDK replacement would require touching every call site. For an end-to-end OpenAI setup that this slots into cleanly, see the OpenAI API integration guide.
Verifying the Connection
After your first request, open the Portkey dashboard and check the Logs page. You should see the request you just made, the model that handled it, the token counts, the cost in dollars, and the full prompt and response. If nothing appears, the most common cause is a mismatch between the virtual key and the model parameter: for example, sending gpt-4o-mini through an Anthropic virtual key.
Configuring Caching: Simple and Semantic
Caching is where a gateway pays for itself in the first month. Portkey ships two cache modes. Simple cache hashes the exact request body and returns stored responses for byte-identical inputs. Semantic cache embeds the request, compares it against stored requests by cosine similarity, and returns the closest match above a threshold you control.
Caching is enabled through a config object, which you can either define inline or store on the Portkey side and reference by ID. The inline form looks like this:
```python
import os
from portkey_ai import Portkey

cache_config = {
    "cache": {
        "mode": "semantic",
        "max_age": 3600,  # seconds; responses older than this are evicted
    }
}

client = Portkey(
    api_key=os.environ["PORTKEY_API_KEY"],
    virtual_key=os.environ["PORTKEY_OPENAI_VIRTUAL_KEY"],
    config=cache_config,
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)

# Inspect cache hit status from response headers (exact accessor may vary by SDK version)
print(response._headers.get("x-portkey-cache-status"))
# "HIT" on subsequent identical or semantically similar queries
```
For a deeper look at the underlying technique, including embedding choice and threshold tuning, the Anthropic prompt caching guide covers the model-side equivalent that you can combine with gateway caching.
When Simple Cache Is Enough
Simple cache fits any workload where requests repeat byte-for-byte. Typical examples include system prompts paired with deterministic variable substitutions, scheduled batch jobs that re-process the same documents, and internal tools where users hit the same canned queries. In all these cases, semantic cache adds embedding cost without finding more hits.
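If your workload matches these patterns, switching to the exact-match mode is a one-line change in the config object. A minimal sketch, assuming the same config shape as the semantic example above:

```python
# Exact-match caching: only byte-identical request bodies count as hits
simple_cache_config = {
    "cache": {
        "mode": "simple",
        "max_age": 86400,  # a longer TTL suits scheduled batch jobs
    }
}
```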
When Semantic Cache Pays Off
Semantic cache wins on user-facing chatbots where the same intent gets asked twenty different ways. “How do I cancel my subscription,” “can I cancel my plan,” and “where is the cancel button” all hash differently but mean the same thing. With a similarity threshold around 0.95, Portkey returns a cached answer for all three. As a result, support-style applications often see cache hit rates jump from single digits with simple cache to 30-50% with semantic cache.
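You can watch this happen by sending the paraphrased questions through the cached client from the earlier example and reading the same cache-status header (a sketch; as noted above, the exact header accessor may vary by SDK version):

```python
paraphrases = [
    "How do I cancel my subscription?",
    "Can I cancel my plan?",
    "Where is the cancel button?",
]

for question in paraphrases:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    )
    # Expect "MISS" on the first question and "HIT" on the semantically similar rest
    print(question, "->", response._headers.get("x-portkey-cache-status"))
```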
Cache Pitfalls to Watch For
Two failure modes show up in production. The first is over-aggressive thresholds: setting similarity to 0.85 will return answers about Stripe billing to questions about PayPal. Start at 0.95 and only relax after you have logs to compare. The second is forgetting that streaming responses cache differently. Specifically, the gateway must buffer the full response before storing it, so the first cache hit on a streamed query may feel slightly slower than the original — only subsequent hits return instantly.
Building Fallbacks and Load Balancing
A production LLM app cannot assume the primary provider will be up. OpenAI has had multi-hour incidents. Anthropic has rate-limited tier-1 customers during launches. Therefore, the Portkey AI Gateway lets you define a routing strategy that automatically retries on a different provider when the first one fails.
The configuration is declarative. Below, the gateway tries GPT-4o first, falls through to Claude Sonnet on any 5xx or rate limit error, and finally lands on Groq’s Llama 3.1 70B if both upstream models are unavailable:
```python
fallback_config = {
    "strategy": {
        "mode": "fallback",
        "on_status_codes": [429, 500, 502, 503, 504],
    },
    "targets": [
        {
            "virtual_key": os.environ["PORTKEY_OPENAI_VIRTUAL_KEY"],
            "override_params": {"model": "gpt-4o"},
        },
        {
            "virtual_key": os.environ["PORTKEY_ANTHROPIC_VIRTUAL_KEY"],
            "override_params": {"model": "claude-sonnet-4-5"},
        },
        {
            "virtual_key": os.environ["PORTKEY_GROQ_VIRTUAL_KEY"],
            "override_params": {"model": "llama-3.1-70b-versatile"},
        },
    ],
}

client = Portkey(
    api_key=os.environ["PORTKEY_API_KEY"],
    config=fallback_config,
)
```
The gateway handles the retry transparently. Your application code makes one call and either gets a successful response or a final error after all targets have been exhausted. Importantly, the response includes a header indicating which provider actually served the request, which is essential for downstream logging.
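If you want that provider information in your own application logs rather than only in the dashboard, you can read it off the response headers. A sketch; the header name below is an assumption based on Portkey's x-portkey-* naming convention, so confirm the exact key against a real response before relying on it:

```python
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Draft a status update"}],
)

# Hypothetical header name; inspect an actual response for the exact key
served_by = response._headers.get("x-portkey-provider", "unknown")
print(f"Request served by: {served_by}")
```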
Load Balancing for Cost or Latency
Switch the mode from fallback to loadbalance and add a weight to each target; Portkey then distributes requests across providers in the specified ratio. This pattern fits two scenarios well. First, you can route 90% of traffic to a cheaper open-weights provider like Groq and keep 10% on GPT-4o for quality comparisons. Second, you can split traffic across regions of the same provider to stay under per-region rate limits.
```python
loadbalance_config = {
    "strategy": {"mode": "loadbalance"},
    "targets": [
        {
            "virtual_key": "groq-vk",
            "weight": 0.9,
            "override_params": {"model": "llama-3.1-70b-versatile"},
        },
        {
            "virtual_key": "openai-vk",
            "weight": 0.1,
            "override_params": {"model": "gpt-4o-mini"},
        },
    ],
}
```
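Attaching this config follows the same pattern as the fallback example: pass it to the Portkey client constructor and every request is split across the weighted targets. A sketch reusing that pattern:

```python
client = Portkey(
    api_key=os.environ["PORTKEY_API_KEY"],
    config=loadbalance_config,
)

# Roughly 9 in 10 of these requests land on Groq, the rest on gpt-4o-mini
response = client.chat.completions.create(
    model="placeholder",  # overridden per target by override_params
    messages=[{"role": "user", "content": "Classify this ticket: refund not received"}],
)
```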
If your application is built around fast Groq inference for the bulk of traffic, the Groq API tutorial explains the model lineup and where it slots into a routing strategy like this.
Conditional Routing
For more advanced cases, Portkey supports conditional routing based on request metadata. As an example, you can route paid-tier users to GPT-4o and free-tier users to a cheaper model purely from a metadata.tier header attached to each call:
```python
conditional_config = {
    "strategy": {
        "mode": "conditional",
        "conditions": [
            {
                "query": {"metadata.tier": {"$eq": "paid"}},
                "then": "premium",
            },
            {
                "query": {"metadata.tier": {"$eq": "free"}},
                "then": "budget",
            },
        ],
        "default": "budget",
    },
    "targets": [
        {"name": "premium", "virtual_key": "openai-vk", "override_params": {"model": "gpt-4o"}},
        {"name": "budget", "virtual_key": "groq-vk", "override_params": {"model": "llama-3.1-70b-versatile"}},
    ],
}

client = Portkey(
    api_key=os.environ["PORTKEY_API_KEY"],
    config=conditional_config,
)

response = client.with_options(
    metadata={"tier": "paid", "user_id": "u_123"}
).chat.completions.create(
    model="placeholder",  # overridden by the routing config
    messages=[{"role": "user", "content": "Explain transformers"}],
)
```
This eliminates a class of if-else routing logic that otherwise sprawls across application code.
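For contrast, this is a hypothetical sketch of what per-tier routing looks like when it lives in application code instead of the gateway config:

```python
def route_model(user_tier: str) -> str:
    # Routing decision baked into application code: every new tier or
    # model swap means another code change and deploy
    if user_tier == "paid":
        return "gpt-4o"
    return "llama-3.1-70b-versatile"

# With conditional routing, this function disappears and the decision
# lives in the gateway config instead
print(route_model("paid"))  # gpt-4o
print(route_model("free"))  # llama-3.1-70b-versatile
```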
Observability: Logs, Traces, and Metrics
Observability is the feature that pushes most teams from a script-and-pray setup to a real gateway. Portkey captures every request and surfaces three views.
Logs
The Logs view is a searchable, filterable list of every request the gateway has handled. Each entry includes the full prompt, the response, token counts, cost, latency, provider, model, and any metadata you attached. Crucially, you can filter by user ID, tier, environment, or any custom field you pass through the metadata header. For support escalations, this is the difference between a five-minute fix and an afternoon of grep through CloudWatch.
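Attaching that metadata is a small change at call time. A minimal sketch using the with_options pattern from the routing example above, with field names chosen to match the filters this section describes:

```python
response = client.with_options(
    metadata={
        "user_id": "u_123",
        "environment": "production",
        "tier": "paid",
    }
).chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Where is my invoice?"}],
)
# Each metadata field becomes a filter in the Logs view
```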
Traces
Traces group multiple LLM calls under a single trace ID, which matters for agent loops and RAG pipelines. A single user query might fan out into a query rewrite, a vector search, a reranking call, and a final generation. The gateway stitches them all under one trace and shows latency and cost broken down per step.
trace_id = "req_" + str(uuid.uuid4())
response = client.with_options(
trace_id=trace_id,
metadata={"span_name": "query_rewrite"},
).chat.completions.create(...)
response = client.with_options(
trace_id=trace_id, # same trace ID
metadata={"span_name": "final_generation"},
).chat.completions.create(...)
Now both calls appear under the same trace in the dashboard, alongside any spans your application produces from OpenTelemetry exporters.
Metrics and Alerts
The Metrics view aggregates cost per model, per user, per day. You can set alert thresholds — for instance, a Slack ping when daily spend on GPT-4o crosses $200, or an email when p95 latency for any provider exceeds 8 seconds. Compared with rolling your own observability on top of Prometheus and Grafana, the time-to-first-useful-dashboard drops from days to minutes.
Compared to LiteLLM Proxy
LiteLLM offers a similar logging table and metrics, but the dashboards are more bare-bones, and tracing across multi-step agents requires more manual wiring. If observability depth is the deciding factor, Portkey usually wins on the demo. However, if you are already running Langfuse or Helicone, the LiteLLM setup guide covers how to forward gateway logs to those tools instead of using the LiteLLM UI.
A Real-World Production Scenario
Consider a mid-sized B2B SaaS company that has rolled out an in-app AI assistant to roughly 5,000 daily active users. The assistant answers product questions, generates reports, and drafts replies to customer emails. The team is using OpenAI exclusively.
In a typical month, three problems emerge. First, the same fifty support questions account for around 20% of all traffic, and the team is paying for each one to be generated from scratch. Second, a four-hour OpenAI outage during a product launch surfaces angry tweets and a missed demo. Third, when a customer complains that the assistant generated an inappropriate reply, the engineering team needs an hour to dig through their own logs because the user ID was not attached to the LLM call.
A Portkey rollout typically addresses all three within a sprint. Semantic caching with a 0.96 threshold drops repeat-query cost by 30-40% in the first week. A fallback config sending overflow traffic to Claude Sonnet and Llama 3.1 on Groq prevents the next OpenAI incident from being user-visible. Adding user_id and conversation_id as metadata on every request turns post-incident root cause analysis from an hour into ninety seconds in the Portkey logs.
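Put together, that rollout fits in one config object: semantic caching plus a three-provider fallback chain, with user metadata attached per request. A sketch assembled from the earlier sections (this assumes cache and routing settings can be combined in a single Portkey config; the similarity threshold is tuned separately, so it is not shown here):

```python
import os
from portkey_ai import Portkey

production_config = {
    # Cache the repeat support questions that make up ~20% of traffic
    "cache": {"mode": "semantic", "max_age": 3600},
    # Fail over from OpenAI to Anthropic to Groq on rate limits and 5xx errors
    "strategy": {"mode": "fallback", "on_status_codes": [429, 500, 502, 503, 504]},
    "targets": [
        {"virtual_key": os.environ["PORTKEY_OPENAI_VIRTUAL_KEY"], "override_params": {"model": "gpt-4o"}},
        {"virtual_key": os.environ["PORTKEY_ANTHROPIC_VIRTUAL_KEY"], "override_params": {"model": "claude-sonnet-4-5"}},
        {"virtual_key": os.environ["PORTKEY_GROQ_VIRTUAL_KEY"], "override_params": {"model": "llama-3.1-70b-versatile"}},
    ],
}

client = Portkey(api_key=os.environ["PORTKEY_API_KEY"], config=production_config)

# user_id and conversation_id make the request findable in the Logs view
response = client.with_options(
    metadata={"user_id": "u_123", "conversation_id": "c_456"}
).chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Draft a reply to this customer email."}],
)
```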
Importantly, this assumes the team already has a chat backend in place. If you are still building one, the AI chatbot streaming responses guide covers the streaming patterns that work cleanly behind the Portkey AI Gateway.
When to Use Portkey AI Gateway
- You are running production LLM traffic and need caching, fallbacks, and observability without building each one yourself
- You use multiple providers and want one consistent API surface
- You want a managed UI for prompt versioning, A/B tests, and team collaboration
- Compliance is satisfied by a hosted service, or you can run Portkey’s open-source gateway in your VPC
- Your team values fast time-to-dashboard over deep customization of the observability stack
When NOT to Use Portkey AI Gateway
- You have a single provider and trivial volume — direct SDK calls are simpler and cheaper
- You already run a mature Langfuse or Helicone stack and only need request routing — LiteLLM is lighter
- Strict compliance rules out any third-party request proxy and you do not want to operate the self-hosted gateway yourself
- Your workload is internal batch jobs where seconds of added latency from a network hop are unacceptable
Common Mistakes with Portkey
- Setting the semantic cache similarity threshold too low (below 0.93) and watching unrelated answers leak between intents
- Forgetting to attach user_id and environment metadata to every request, then losing the ability to filter logs after an incident
- Hard-coding model names in application code instead of letting the routing config override them, which makes per-tier routing impossible later
- Configuring fallback on_status_codes to include only 500, when rate-limit 429s and gateway 502s account for most real-world failures and must be in the list
- Treating virtual keys as static configuration and not rotating the underlying provider keys, which defeats the security benefit of the virtual key layer
- Skipping the observability piece entirely because “we already have Datadog,” then realizing too late that LLM request bodies are not in Datadog and trace correlation across providers is missing
Conclusion
The Portkey AI Gateway is the right pick when you need caching, multi-provider fallbacks, and observability shipped behind one configuration object rather than three separate libraries. Start by routing one model family through the hosted gateway, enable semantic caching with a conservative threshold, and add a single fallback target — then expand once the dashboard shows you where the next quarter of inference cost is being burned. If you are still deciding between gateways, the Bifrost vs LiteLLM comparison covers the two main open-source alternatives so you can make the call with the full landscape in view.