
If your application talks to OpenAI today and you suddenly need Claude for long-context tasks, Gemini for vision, and a local Llama model for sensitive data, you have a problem. Each provider exposes a slightly different SDK, different parameter names, different error formats, and different rate-limit semantics. A clean LiteLLM setup collapses that complexity into a single OpenAI-compatible interface, with optional proxy features for caching, fallbacks, rate limiting, and cost tracking across every model your team uses.
This tutorial walks through both modes of LiteLLM: the Python SDK you embed in code, and the standalone proxy server you run as a gateway in front of every provider. Furthermore, you will see how to configure routing, set per-user budgets, enable semantic caching, and ship the whole thing to production without leaking API keys.
What Is LiteLLM?
LiteLLM is an open-source library and proxy server that exposes 100+ LLM providers, including OpenAI, Anthropic, Google, Azure, Bedrock, and self-hosted models, through a single OpenAI-style API. As a result, you write code against the OpenAI chat-completions shape once, and LiteLLM translates it to the right provider format under the hood.
The project ships in two layers. First, the Python SDK is a thin drop-in replacement that maps any model name (like claude-sonnet-4-6 or gemini-2.5-pro) to the correct provider request. Second, the LiteLLM Proxy is a self-hostable FastAPI server that adds virtual API keys, budgets, fallbacks, retries, caching, and observability hooks on top of the SDK.
For a wider view of how gateways fit into AI infrastructure, see our API gateway patterns for SaaS applications guide. The same principles apply here, just with LLMs sitting behind the gateway instead of microservices.
When to Use a LiteLLM Setup
Use LiteLLM when:
- Your codebase already speaks OpenAI’s chat-completions format and you want to add a second provider without forking call sites.
- You need fallback logic when one provider has an outage or rate-limit spike.
- You want centralized cost tracking and per-team budgets across multiple LLM accounts.
- You are running an internal AI platform and need to issue virtual keys instead of distributing raw provider tokens to every developer.
When NOT to Use a LiteLLM Setup
Avoid LiteLLM when:
- You only call one provider and have no roadmap to add others. The abstraction adds little value over a direct SDK.
- You need provider-specific features that LiteLLM does not yet translate cleanly, such as Anthropic’s computer-use tool calls or OpenAI’s Realtime API streaming. Check the docs for current coverage before committing.
- Your latency budget is in the low milliseconds and you cannot tolerate any proxy hop. The SDK mode avoids this, but the proxy mode adds a network round trip.
Common Mistakes with LiteLLM
- Hardcoding API keys in config.yaml instead of using os.environ references, which leaks secrets into git history.
- Skipping the database setup for the proxy, then losing all budget and key data on every restart.
- Forgetting to enable streaming in the proxy when downstream code expects it, which silently buffers responses.
- Mixing model alias names with provider-prefixed names (gpt-5 vs openai/gpt-5) and getting confusing routing errors; the sketch below shows the difference.
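That last mistake is worth a concrete illustration. In SDK mode, the provider prefix tells LiteLLM which backend to call; against the proxy, you send the alias exactly as it appears in model_list. A minimal sketch, assuming a proxy like the one built in Step 2 that maps the alias gpt-5 to openai/gpt-5:
# naming_demo.py — provider-prefixed names for the SDK, plain aliases for the proxy
import os
from litellm import completion
from openai import OpenAI
# SDK mode: the "openai/" prefix is how LiteLLM picks the provider
completion(
    model="openai/gpt-5",
    messages=[{"role": "user", "content": "ping"}],
)
# Proxy mode: send the alias from model_list, with no provider prefix
proxy = OpenAI(api_key=os.environ["LITELLM_MASTER_KEY"], base_url="http://localhost:4000/v1")
proxy.chat.completions.create(
    model="gpt-5",
    messages=[{"role": "user", "content": "ping"}],
)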
Prerequisites
Before starting the LiteLLM setup, make sure you have:
- Python 3.9 or newer with pip available.
- API keys for at least two providers you want to compare. The walkthrough uses OpenAI and Anthropic, but any combination works.
- Docker installed if you plan to run the proxy in container mode.
- Optional: a Postgres database for production proxy state. SQLite is fine for local development.
If you are brand new to either provider, our OpenAI API guide and getting started with the Claude API tutorials cover the basics of obtaining keys and making your first request.
Step 1: Install LiteLLM and Run Your First Call
Start with the SDK, even if your end goal is the proxy. The SDK mode is the fastest way to verify your keys work and to feel out how LiteLLM normalizes provider responses.
# Create a fresh virtualenv to avoid polluting system packages
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install LiteLLM plus the extras the proxy server needs
pip install "litellm[proxy]"
The [proxy] extra pulls in FastAPI, uvicorn, and Postgres drivers that the proxy server needs later. For the SDK alone, plain pip install litellm is enough.
Next, set your API keys as environment variables. LiteLLM reads them automatically when it detects matching model names.
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
Now make your first call. Notice how the same completion function handles two completely different providers:
# first_call.py
import os
from litellm import completion
# OpenAI request
openai_response = completion(
model="gpt-5",
messages=[{"role": "user", "content": "Explain mutex vs semaphore in two sentences."}],
)
print("OpenAI:", openai_response.choices[0].message.content)
# Anthropic request — same function, different model alias
claude_response = completion(
model="anthropic/claude-sonnet-4-6",
messages=[{"role": "user", "content": "Explain mutex vs semaphore in two sentences."}],
)
print("Claude:", claude_response.choices[0].message.content)
Why this works: LiteLLM inspects the model prefix and dispatches to the correct provider client. The response object always follows the OpenAI schema, so downstream parsing logic stays identical regardless of where the answer came from.
Run the script with python first_call.py. If both responses print, your keys and network path are good and you can move on to the proxy.
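Because the response follows the OpenAI schema, token usage is attached to every completion, and the SDK also ships a pricing helper. A minimal sketch, assuming the completion_cost helper available in recent LiteLLM releases and that the model appears in LiteLLM's built-in price table:
# cost_check.py — token usage and an estimated dollar cost for a single call
from litellm import completion, completion_cost
response = completion(
    model="gpt-5",
    messages=[{"role": "user", "content": "One-line summary of Raft consensus."}],
)
# usage follows the OpenAI schema regardless of which provider answered
print("prompt tokens:    ", response.usage.prompt_tokens)
print("completion tokens:", response.usage.completion_tokens)
# completion_cost looks the model up in LiteLLM's price table
print("estimated cost: $", completion_cost(completion_response=response))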
Step 2: Configure the LiteLLM Proxy
The proxy is where LiteLLM earns its keep in production. Instead of every service importing the SDK, you run one process that accepts OpenAI-compatible requests and routes them across providers based on a config file.
Create a config.yaml in your project root:
# config.yaml
model_list:
- model_name: gpt-5
litellm_params:
model: openai/gpt-5
api_key: os.environ/OPENAI_API_KEY
- model_name: claude-sonnet
litellm_params:
model: anthropic/claude-sonnet-4-6
api_key: os.environ/ANTHROPIC_API_KEY
- model_name: gemini-pro
litellm_params:
model: gemini/gemini-2.5-pro
api_key: os.environ/GEMINI_API_KEY
  # Smart alias: "fast-default" maps to gpt-5-mini; router_settings below define its fallbacks
- model_name: fast-default
litellm_params:
model: openai/gpt-5-mini
api_key: os.environ/OPENAI_API_KEY
litellm_settings:
drop_params: true # Silently ignore params a provider does not support
set_verbose: false
cache: true
cache_params:
type: redis
host: os.environ/REDIS_HOST
port: os.environ/REDIS_PORT
router_settings:
routing_strategy: simple-shuffle
fallbacks:
- fast-default: ["gpt-5", "claude-sonnet"]
num_retries: 2
timeout: 30
general_settings:
master_key: os.environ/LITELLM_MASTER_KEY
database_url: os.environ/DATABASE_URL
Why this structure works: Each entry in model_list is a “deployment” that pairs a public alias (what clients send) with provider-specific parameters. The drop_params flag is particularly useful because OpenAI clients often send fields like logit_bias that other providers reject, and dropping them silently keeps cross-provider calls working.
Generate a master key once and store it in your secret manager:
# Generate a strong random key — store it securely, never commit it
python -c "import secrets; print('sk-' + secrets.token_urlsafe(32))"
Then export the environment variables and start the proxy:
export LITELLM_MASTER_KEY="sk-your-generated-master-key"
export DATABASE_URL="postgresql://user:pass@localhost:5432/litellm"
export REDIS_HOST="localhost"
export REDIS_PORT="6379"
litellm --config config.yaml --port 4000
The proxy comes up on http://localhost:4000. Confirm with a curl request that uses the OpenAI client format:
curl -X POST http://localhost:4000/v1/chat/completions \
-H "Authorization: Bearer $LITELLM_MASTER_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "claude-sonnet",
"messages": [{"role": "user", "content": "ping"}]
}'
If you get a JSON response with choices, the proxy is routing correctly.
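You can also confirm the drop_params setting from the config by sending an OpenAI-only field to the Claude alias. A minimal sketch, assuming the proxy is still running locally and using the master key for brevity (a virtual key from Step 3 works the same way):
# drop_params_demo.py — logit_bias is OpenAI-specific; with drop_params: true the
# proxy strips it before forwarding to Anthropic instead of returning an error
import os
from openai import OpenAI
client = OpenAI(
    api_key=os.environ["LITELLM_MASTER_KEY"],
    base_url="http://localhost:4000/v1",
)
response = client.chat.completions.create(
    model="claude-sonnet",
    messages=[{"role": "user", "content": "ping"}],
    logit_bias={"1234": 5},  # silently dropped for providers that do not support it
)
print(response.choices[0].message.content)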
Step 3: Issue Virtual Keys for Teams
Distributing the master key to every developer defeats the purpose. Instead, use it once to mint per-team or per-service virtual keys with their own budgets and rate limits.
curl -X POST http://localhost:4000/key/generate \
-H "Authorization: Bearer $LITELLM_MASTER_KEY" \
-H "Content-Type: application/json" \
-d '{
"models": ["gpt-5", "claude-sonnet"],
"max_budget": 50,
"duration": "30d",
"tpm_limit": 100000,
"rpm_limit": 100,
"metadata": {"team": "search-quality"}
}'
The response includes a key field that starts with sk-. Hand that to the team’s service, and the proxy will enforce a $50 budget over the key’s 30-day lifetime, 100 requests per minute, and 100K tokens per minute on that key alone.
Why this matters: A single leaked provider key can drain thousands of dollars before someone notices. Virtual keys cap the blast radius and give you per-team cost reporting without manual log parsing.
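From the caller’s side, a virtual key is used exactly like a provider key, and enforcement shows up as ordinary HTTP errors. A hedged sketch, assuming the key minted above is stored in TEAM_VIRTUAL_KEY; the exact status codes and messages come from the proxy, so treat them as illustrative:
# virtual_key_demo.py — call the proxy with a budget-limited virtual key and
# surface the rejection once the budget or rate limit on that key is exceeded
import os
from openai import OpenAI, APIStatusError
client = OpenAI(
    api_key=os.environ["TEAM_VIRTUAL_KEY"],
    base_url="http://localhost:4000/v1",
)
try:
    response = client.chat.completions.create(
        model="gpt-5",
        messages=[{"role": "user", "content": "Summarize last week's incident report."}],
    )
    print(response.choices[0].message.content)
except APIStatusError as err:
    # budget exhaustion and per-key rate limits come back as HTTP-level errors
    print("Rejected by the proxy:", err.status_code, err.message)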
Step 4: Add Fallbacks and Retries
The single most useful production feature in a LiteLLM setup is automatic fallback. When OpenAI returns a 429 or a 5xx, you do not want to push that error back to your users; you want the proxy to try Claude or Gemini and serve the request anyway.
The fallbacks block in config.yaml already covers the basic case, but for tighter control you also want per-model fallback chains, context-window fallbacks, and cooldowns:
router_settings:
routing_strategy: usage-based-routing-v2
num_retries: 2
timeout: 30
retry_after: 5
fallbacks:
- gpt-5: ["claude-sonnet", "gemini-pro"]
- claude-sonnet: ["gpt-5"]
context_window_fallbacks:
- gpt-5: ["claude-sonnet"] # claude has a longer context window
allowed_fails: 3
cooldown_time: 60 # seconds to skip a deployment after repeated failures
The usage-based-routing-v2 strategy spreads traffic across healthy deployments based on observed tokens-per-minute, which prevents one model from getting rate-limited while another sits idle. Furthermore, cooldown_time quarantines a failing deployment for a minute, giving the provider time to recover instead of hammering it.
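Fallbacks absorb transient failures, but if every deployment in the chain is failing or cooling down, the proxy returns an error to the client, and your application should degrade gracefully rather than surface a stack trace. A minimal sketch, assuming the openai client pointed at the proxy; the exception classes are the standard openai ones because the proxy speaks that protocol:
# fallback_exhausted.py — what the caller sees when gpt-5, claude-sonnet, and
# gemini-pro are all unavailable despite retries and fallbacks
import os
from openai import OpenAI, APIError, RateLimitError
client = OpenAI(
    api_key=os.environ["LITELLM_VIRTUAL_KEY"],
    base_url="http://localhost:4000/v1",
)
def ask(prompt: str) -> str:
    try:
        resp = client.chat.completions.create(
            model="gpt-5",  # proxy falls back to claude-sonnet, then gemini-pro
            messages=[{"role": "user", "content": prompt}],
            timeout=35,  # slightly above the proxy's 30-second timeout
        )
        return resp.choices[0].message.content
    except RateLimitError:
        return "Model capacity is saturated right now; please retry shortly."
    except APIError:
        return "The answer service is temporarily unavailable."
print(ask("ping"))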
If you have not seen routing patterns at the API layer before, the same load-balancing and circuit-breaker patterns apply. The API gateway patterns for SaaS applications post explains the underlying ideas in a non-LLM context.
Step 5: Enable Caching to Cut Costs
LLM calls are expensive, and identical prompts get repeated more often than teams expect: dashboards refreshing, idempotent retries, cached UI states. Redis-backed caching in LiteLLM can cut bills by 20-40% in apps with repeat query patterns.
The config above already turned caching on. To verify it works, send the same request twice and watch the latency drop:
# First call — hits the provider
time curl -s http://localhost:4000/v1/chat/completions \
-H "Authorization: Bearer $VIRTUAL_KEY" \
-H "Content-Type: application/json" \
-d '{"model":"gpt-5","messages":[{"role":"user","content":"What is CAP theorem?"}]}' > /dev/null
# Second call — served from Redis
time curl -s http://localhost:4000/v1/chat/completions \
-H "Authorization: Bearer $VIRTUAL_KEY" \
-H "Content-Type: application/json" \
-d '{"model":"gpt-5","messages":[{"role":"user","content":"What is CAP theorem?"}]}' > /dev/null
The second response typically returns in 5-20 milliseconds versus several seconds for a fresh provider call. For tunable behavior, you can also pass cache: {"no-cache": True} per-request to skip the cache when you need fresh output.
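To skip the cache for a single request from Python, the OpenAI client's extra_body escape hatch can carry LiteLLM's cache control field. A minimal sketch, assuming the proxy honors the per-request cache parameter described above:
# no_cache_demo.py — force a fresh provider call for one request while leaving
# the shared Redis cache enabled for everything else
import os
from openai import OpenAI
client = OpenAI(
    api_key=os.environ["LITELLM_VIRTUAL_KEY"],
    base_url="http://localhost:4000/v1",
)
response = client.chat.completions.create(
    model="gpt-5",
    messages=[{"role": "user", "content": "What is CAP theorem?"}],
    extra_body={"cache": {"no-cache": True}},  # bypass the cache for this call only
)
print(response.choices[0].message.content)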
For semantic caching, where prompts that are similar (but not identical) reuse the same answer, point LiteLLM at a vector store instead of plain Redis. Our vector databases compared post walks through the trade-offs between common backends like Qdrant, Pinecone, and pgvector.
Step 6: Stream Responses Through the Proxy
Most chat UIs stream tokens as they generate. LiteLLM passes streaming through end-to-end, but client code needs the stream: true flag and an SSE-aware reader:
# stream_demo.py
import os
from openai import OpenAI
# Point the OpenAI client at the LiteLLM proxy
client = OpenAI(
api_key=os.environ["LITELLM_VIRTUAL_KEY"],
base_url="http://localhost:4000/v1",
)
stream = client.chat.completions.create(
model="claude-sonnet",
messages=[{"role": "user", "content": "Stream a haiku about distributed systems"}],
stream=True,
)
for chunk in stream:
delta = chunk.choices[0].delta.content or ""
print(delta, end="", flush=True)
Why this works: The OpenAI Python client speaks server-sent events, and LiteLLM forwards SSE frames from any underlying provider in OpenAI’s format. As a result, your frontend code does not need to know whether the tokens came from Anthropic, Google, or a local vLLM instance.
If you are wiring up a chat UI for the first time, our walkthrough on AI chatbot streaming responses covers the SSE patterns and reconnection logic you will need on the frontend side.
Step 7: Add Observability
Without tracing, debugging an LLM gateway is painful. A user complains a response was wrong; you have no idea which model served it, what prompt arrived, or whether the cache returned a stale result. LiteLLM ships with built-in callbacks for Langfuse, Helicone, OpenTelemetry, and several others.
Add a callbacks block to litellm_settings:
litellm_settings:
drop_params: true
cache: true
cache_params:
type: redis
host: os.environ/REDIS_HOST
port: os.environ/REDIS_PORT
success_callback: ["langfuse"]
failure_callback: ["langfuse"]
Then set the Langfuse credentials in the environment:
export LANGFUSE_PUBLIC_KEY="pk-lf-..."
export LANGFUSE_SECRET_KEY="sk-lf-..."
export LANGFUSE_HOST="https://cloud.langfuse.com"
Every request now appears in Langfuse with the prompt, response, latency, tokens, cost, and which deployment served it. Consequently, when something looks wrong, you can filter by user, model, or time window and see the full trace within seconds.
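If some services use the SDK directly rather than the proxy, the same callbacks can be enabled in code so every trace lands in one place. A minimal sketch, assuming the LANGFUSE_* environment variables above are set; the metadata dict is optional and simply tags the trace:
# sdk_callbacks.py — SDK-mode equivalent of the proxy's success/failure callbacks
import litellm
from litellm import completion
litellm.success_callback = ["langfuse"]   # log successful calls
litellm.failure_callback = ["langfuse"]   # log provider errors too
response = completion(
    model="gpt-5",
    messages=[{"role": "user", "content": "ping"}],
    metadata={"team": "search-quality"},  # extra context attached to the trace
)
print(response.choices[0].message.content)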
Real-World Scenario: Multi-Tenant SaaS Platform
A small SaaS team building an analytics product wanted to let each customer pick their preferred LLM, while keeping cost visibility and not leaking customer prompts across tenants. Their initial implementation embedded the OpenAI SDK directly, and every new provider request meant a new code path and a new key management problem.
After moving to a LiteLLM proxy, the architecture became cleaner. Customer prompts hit the proxy with a tenant-scoped virtual key; the key restricted which models that tenant could use and capped monthly spend per account. Furthermore, when the team added Claude support for one enterprise customer, no application code changed: they added a model_list entry and a fallback rule, restarted the proxy, and the customer’s model field in the request just started working.
The trade-off was operational. The team had to run and monitor an extra service, set up Redis and Postgres for cache and budgets, and write runbooks for proxy restarts during deployments. For a single-product startup, that overhead is real. In their case, the time saved on key rotation and per-customer billing reconciliation paid back the operational cost within the first quarter of running it.
Production Deployment Checklist
Before pointing real traffic at your LiteLLM setup, verify:
- Secrets are externalized. Every api_key in config.yaml should use os.environ/NAME syntax, never the raw key.
- Postgres is provisioned. SQLite works locally, but virtual keys and budgets need durable storage for any production load.
- Redis has authentication and TLS. Cached responses can include sensitive customer data; treat the cache like any other data store.
- Health checks point at /health/liveliness and /health/readiness, not at /, so your orchestrator can distinguish “process is up” from “downstream providers are reachable.”
- Master key is rotated quarterly and stored in your secret manager (AWS Secrets Manager, HashiCorp Vault, etc.), not in CI environment files.
- Per-team budgets are set conservatively at first. It is much easier to raise a budget than to explain a $20K bill.
For a Dockerized deployment, use the official image and mount the config:
docker run -d \
--name litellm-proxy \
-p 4000:4000 \
-e LITELLM_MASTER_KEY=$LITELLM_MASTER_KEY \
-e DATABASE_URL=$DATABASE_URL \
-e OPENAI_API_KEY=$OPENAI_API_KEY \
-e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY \
-v $(pwd)/config.yaml:/app/config.yaml \
ghcr.io/berriai/litellm:main-stable \
--config /app/config.yaml --port 4000
The main-stable tag is what the LiteLLM team recommends for production. Pin to a specific SHA in your real deployment so you control upgrade timing.
Comparing LiteLLM to Direct Provider SDKs
| Feature | Direct SDK (per provider) | LiteLLM SDK | LiteLLM Proxy |
|---|---|---|---|
| Single client API | No | Yes | Yes |
| Provider fallbacks | Manual code | Manual code | Built-in |
| Virtual keys | No | No | Yes |
| Cost tracking | Manual | Per-call | Aggregated, by-key |
| Adds a network hop | No | No | Yes |
| Caching | No | In-process | Redis, shared |
| Best for | Single-provider apps | Code-level swap-out | Platform / gateway |
The proxy mode is overkill for a side project that only uses OpenAI. However, once you add a second provider and a second team, the SDK alone leaves too much glue code in your application.
Conclusion
A clean LiteLLM setup turns multi-provider LLM chaos into a single OpenAI-compatible endpoint with built-in fallbacks, caching, virtual keys, and cost tracking. Start with the SDK to verify your provider keys, then graduate to the proxy when you need centralized control across teams or customers. Pair it with observability (Langfuse or OpenTelemetry) from day one, because debugging an opaque gateway is far harder than debugging the underlying providers directly.
For your next step, explore how to add semantic caching with a vector store so that near-duplicate prompts hit your cache instead of the provider; our vector databases compared post is a good starting point for picking a backend. If you are also looking at agent frameworks that sit on top of your gateway, the building AI agents with tools, planning, and execution guide explains how routing decisions interact with agent loops.