
If your app calls OpenAI, Anthropic, or any hosted model, you will eventually hit a 429 Too Many Requests. It happens during a traffic spike, a batch job, or a noisy retry loop that makes things worse. This guide covers LLM rate limiting and retry strategies that hold up in production: how provider limits actually work, how to back off correctly, and how to stop a single failing dependency from taking down your whole service.
The audience here is intermediate-to-senior engineers shipping LLM features who need more than a naive try/except. You will leave with working Python patterns for exponential backoff, jitter, client-side throttling, and circuit breaking, plus a clear sense of when each one earns its keep.
Why LLM APIs Rate-Limit You in the First Place
LLM providers enforce limits to protect shared GPU capacity and to keep one customer from starving another. Unlike a typical CRUD API that caps requests per second, LLM APIs usually enforce two independent ceilings: requests per minute (RPM) and tokens per minute (TPM). You can blow past either one, and hitting the token limit is often the surprise.
For example, a summarization job that sends 50 requests in a minute might sail under the RPM limit but exceed TPM because each request carries a 4,000-token document: 50 × 4,000 is 200,000 input tokens before any output is counted. You have to reason about both dimensions when you size your throughput.
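The arithmetic above can be turned into a quick sizing check. This is a minimal sketch: the function name and the 500 RPM / 150,000 TPM limits are illustrative assumptions, not figures from any provider, and `tokens_per_request` should cover the prompt plus the expected completion.

```python
def max_requests_per_minute(rpm_limit: int, tpm_limit: int, tokens_per_request: int) -> int:
    """Return the sustainable request rate once both ceilings are applied."""
    if tokens_per_request <= 0:
        raise ValueError("tokens_per_request must be positive")
    # Whichever ceiling is lower is the one that actually binds.
    return min(rpm_limit, tpm_limit // tokens_per_request)


# Hypothetical limits: 500 RPM and 150,000 TPM, with 4,000-token requests.
print(max_requests_per_minute(500, 150_000, 4_000))  # 37: TPM binds long before RPM
```

With these assumed numbers the token ceiling allows only 37 requests per minute, even though the request ceiling would allow 500.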
Most providers expose your current standing through response headers. After each call, OpenAI returns headers such as x-ratelimit-remaining-requests, x-ratelimit-remaining-tokens, and x-ratelimit-reset-tokens. Reading these lets you slow down before you get rejected, which is far cheaper than reacting to a 429 after the fact.
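A rough sketch of reading those headers follows. It assumes the reset header arrives as a duration string such as `6m0s` or `250ms`, and that header names are lowercase in the mapping you pass in. The helper names and thresholds are hypothetical. With the OpenAI Python SDK, one way to get at the headers is `client.chat.completions.with_raw_response.create(...)`, which returns an object exposing `.headers` and `.parse()`.

```python
import re

_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 0.001}
# "ms" is listed before "m" so that "250ms" is not read as minutes.
_DURATION_PART = re.compile(r"(\d+(?:\.\d+)?)(ms|h|m|s)")


def parse_duration(value: str) -> float:
    """Convert a duration string such as '6m0s' or '250ms' to seconds."""
    parts = _DURATION_PART.findall(value)
    if not parts:
        raise ValueError(f"unrecognized duration: {value!r}")
    return sum(float(number) * _UNIT_SECONDS[unit] for number, unit in parts)


def read_rate_limit_state(headers) -> dict:
    """Extract remaining budget and token reset time from response headers."""
    def _int(name):
        raw = headers.get(name)
        return int(raw) if raw is not None else None

    reset = headers.get("x-ratelimit-reset-tokens")
    return {
        "remaining_requests": _int("x-ratelimit-remaining-requests"),
        "remaining_tokens": _int("x-ratelimit-remaining-tokens"),
        "tokens_reset_in_s": parse_duration(reset) if reset else None,
    }


def should_slow_down(state, min_requests=5, min_tokens=5_000) -> bool:
    """True when either remaining budget drops below its (assumed) threshold."""
    req, tok = state["remaining_requests"], state["remaining_tokens"]
    return (req is not None and req < min_requests) or (tok is not None and tok < min_tokens)
```

When `should_slow_down` returns true, a caller could pause for `tokens_reset_in_s` or shed low-priority work instead of waiting for the 429.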
What Counts as a Retryable Error
Not every failure deserves a retry. Retrying a genuinely broken request wastes money and latency, and in some cases it amplifies an outage. The rule of thumb: retry transient, server-side, or throttling errors, and fail fast on client mistakes.
| Status | Meaning | Retry? |
|---|---|---|
| 429 | Rate limit or quota exceeded | Yes, with backoff |
| 500 | Internal server error | Yes, with backoff |
| 502 / 503 | Bad gateway / unavailable | Yes, with backoff |
| 529 | Anthropic overloaded | Yes, with backoff |
| 408 | Request timeout | Yes, limited |
| 400 | Malformed request | No, fix the payload |
| 401 / 403 | Auth or permission failure | No, fix credentials |
| 404 | Model or endpoint not found | No |
The distinction matters because a 400 caused by a prompt that exceeds the model’s context window will never succeed on retry. As a result, you should classify errors explicitly rather than blindly retrying everything.
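The table can be encoded as a small classifier so the decision lives in one place. This is a sketch: the `classify` name and the fallback for unlisted codes (treat other 5xx as transient, everything else as permanent) are assumptions, not provider guidance.

```python
RETRYABLE = {408, 429, 500, 502, 503, 529}
NON_RETRYABLE = {400, 401, 403, 404}


def classify(status_code: int) -> str:
    """Return 'retry' for transient failures and 'fail_fast' for client mistakes."""
    if status_code in RETRYABLE:
        return "retry"
    if status_code in NON_RETRYABLE:
        return "fail_fast"
    # Assumption for codes the table does not list: other 5xx responses are
    # treated as transient, anything else as a permanent failure.
    return "retry" if 500 <= status_code < 600 else "fail_fast"
```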
Exponential Backoff With Jitter
The core retry pattern for LLM rate limiting is exponential backoff: wait longer after each failure so the upstream service gets room to recover. A fixed delay is not enough, because if 100 clients all fail at once and all wait exactly two seconds, they retry in a synchronized wave and cause a second pileup. This is the “thundering herd” problem.
Jitter fixes it by randomizing each wait. Instead of everyone retrying at t + 2s, each client picks a random point in the window, spreading the load smoothly.
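A tiny simulation makes the difference visible. It is illustrative only: the 100 clients, the 2-second window, and the 100 ms bucket width are arbitrary choices, and the helper names are hypothetical.

```python
import random


def fixed_delays(clients: int, delay: float) -> list:
    """Every client waits exactly the same amount of time."""
    return [delay] * clients


def full_jitter_delays(clients: int, window: float, rng: random.Random) -> list:
    """Every client picks a uniformly random wait inside the window."""
    return [rng.uniform(0, window) for _ in range(clients)]


def peak_per_bucket(delays, bucket_s: float = 0.1) -> int:
    """Largest number of retries that land in the same bucket_s-wide time slot."""
    counts = {}
    for delay in delays:
        slot = int(delay / bucket_s)
        counts[slot] = counts.get(slot, 0) + 1
    return max(counts.values())


rng = random.Random(42)  # seeded only so repeated runs match
print(peak_per_bucket(fixed_delays(100, 2.0)))             # 100: everyone retries together
print(peak_per_bucket(full_jitter_delays(100, 2.0, rng)))  # far fewer per slot
```

With a fixed delay all 100 retries land in one slot. With full jitter they spread across roughly twenty slots, so the upstream sees a trickle instead of a wave.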
Here is a self-contained implementation using the official OpenAI SDK, with no external retry library:
    import logging
    import random
    import time

    from openai import OpenAI, APIStatusError, APIConnectionError, RateLimitError

    logger = logging.getLogger(__name__)
    # Disable the SDK's built-in retries so this loop is the only retry layer.
    client = OpenAI(max_retries=0)

    RETRYABLE_STATUS = {408, 429, 500, 502, 503, 529}


    def chat_with_backoff(
        messages,
        model="gpt-4o-mini",
        max_retries=6,
        base_delay=1.0,
        max_delay=60.0,
    ):
        """Call the chat API with exponential backoff and full jitter."""
        for attempt in range(max_retries + 1):
            try:
                return client.chat.completions.create(model=model, messages=messages)
            except RateLimitError as err:
                last_err = err
                # Honor the server's own hint when present.
                retry_after = _retry_after_seconds(err)
                delay = retry_after if retry_after else _backoff(attempt, base_delay, max_delay)
            except APIStatusError as err:
                if err.status_code not in RETRYABLE_STATUS:
                    raise  # 400/401/403 etc.: not worth retrying
                last_err = err
                delay = _backoff(attempt, base_delay, max_delay)
            except APIConnectionError as err:
                # Network blip; treat as transient.
                last_err = err
                delay = _backoff(attempt, base_delay, max_delay)

            if attempt == max_retries:
                logger.error("Exhausted retries after %s attempts", attempt + 1)
                # A bare `raise` is invalid outside an except block, so re-raise explicitly.
                raise last_err
            logger.warning("Retry %s in %.2fs", attempt + 1, delay)
            time.sleep(delay)


    def _backoff(attempt, base_delay, max_delay):
        """Full jitter: random point in [0, capped exponential window]."""
        window = min(max_delay, base_delay * (2 ** attempt))
        return random.uniform(0, window)


    def _retry_after_seconds(err):
        """Parse the Retry-After header if the provider sent one."""
        headers = getattr(err.response, "headers", {}) or {}
        value = headers.get("retry-after")
        try:
            return float(value) if value else None
        except ValueError:
            # Retry-After may also be an HTTP date; fall back to computed backoff.
            return None
Why this works: the _backoff helper caps the exponential growth so waits never balloon past a minute, and full jitter (random.uniform(0, window)) prevents synchronized retries. Meanwhile, _retry_after_seconds respects the server when it explicitly tells you how long to wait, which is always more accurate than your guess.
Respect the Retry-After Header
When a provider sends a Retry-After header, use it. It reflects the server’s real reset window, so sleeping for that exact duration avoids both premature retries (which fail again) and over-long waits (which hurt latency). The code above prefers the header and only falls back to computed backoff when it is absent.
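The HTTP specification allows Retry-After to carry either a number of seconds or an HTTP date, so a parser that only handles the numeric form can miss the hint. The sketch below handles both using only the standard library. The function name and the injectable `now` parameter (there to make the function testable) are illustrative choices.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, now=None):
    """Return the wait in seconds from a Retry-After value, or None if unusable."""
    if not value:
        return None
    value = value.strip()
    try:
        # Form 1: delay in seconds, e.g. "120".
        return max(0.0, float(value))
    except ValueError:
        pass
    try:
        # Form 2: HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means "retry now", not a negative sleep.
    return max(0.0, (when - now).total_seconds())
```

A caller would still fall back to computed backoff when this returns `None`.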
Using tenacity for Cleaner Retry Logic
Hand-rolled loops work, but they clutter business logic. The tenacity library lets you declare retry policy as a decorator, which keeps call sites readable. It is the pattern most production Python teams reach for.
    import logging

    from openai import OpenAI, RateLimitError, APIConnectionError, InternalServerError
    from tenacity import (
        retry,
        stop_after_attempt,
        wait_random_exponential,
        retry_if_exception_type,
        before_sleep_log,
    )

    logger = logging.getLogger(__name__)
    client = OpenAI()


    @retry(
        retry=retry_if_exception_type(
            (RateLimitError, APIConnectionError, InternalServerError)
        ),
        wait=wait_random_exponential(multiplier=1, max=60),
        stop=stop_after_attempt(6),
        before_sleep=before_sleep_log(logger, logging.WARNING),
        reraise=True,
    )
    def summarize(text, model="gpt-4o-mini"):
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "Summarize the text in two sentences."},
                {"role": "user", "content": text},
            ],
        )
        return response.choices[0].message.content
Why this works: wait_random_exponential gives you exponential growth plus jitter in one call, retry_if_exception_type restricts retries to transient failures only, and reraise=True surfaces the original exception after the final attempt instead of wrapping it. Notably, before_sleep_log gives you free observability into every retry without extra code.
The OpenAI and Anthropic SDKs also retry automatically (twice by default). You can tune that with max_retries on the client, but wrapping calls yourself gives finer control over which errors qualify and how you log them. In practice, teams often set the SDK’s built-in retries to zero and manage the policy explicitly.
Client-Side Rate Limiting With a Token Bucket
Retries handle failures after they happen. A smarter move is to avoid the 429 entirely by pacing your own requests below the provider’s ceiling. A token bucket is the standard tool: tokens refill at a fixed rate, each request consumes one, and when the bucket is empty callers wait. If you want the theory behind these algorithms, see our breakdown of token bucket, leaky bucket, and fixed window strategies.
For LLM APIs, you often need two buckets — one for requests and one for tokens — because both limits apply. Here is an async limiter that gates on both:
    import asyncio
    import time


    class DualRateLimiter:
        """Throttle by both requests-per-minute and tokens-per-minute."""

        def __init__(self, rpm: int, tpm: int):
            self.rpm = rpm
            self.tpm = tpm
            self._req_tokens = float(rpm)
            self._tok_tokens = float(tpm)
            self._updated = time.monotonic()
            self._lock = asyncio.Lock()

        async def acquire(self, estimated_tokens: int):
            """Block until both request and token budgets allow the call."""
            if estimated_tokens > self.tpm:
                # The bucket never holds more than tpm, so this would wait forever.
                raise ValueError("estimated_tokens exceeds the per-minute token budget")
            while True:
                async with self._lock:
                    self._refill()
                    if self._req_tokens >= 1 and self._tok_tokens >= estimated_tokens:
                        self._req_tokens -= 1
                        self._tok_tokens -= estimated_tokens
                        return
                    wait = self._time_until_ready(estimated_tokens)
                # Sleep outside the lock so other callers can make progress.
                await asyncio.sleep(wait)

        def _refill(self):
            """Top up both buckets in proportion to the time elapsed."""
            now = time.monotonic()
            elapsed = now - self._updated
            self._req_tokens = min(self.rpm, self._req_tokens + elapsed * self.rpm / 60)
            self._tok_tokens = min(self.tpm, self._tok_tokens + elapsed * self.tpm / 60)
            self._updated = now

        def _time_until_ready(self, estimated_tokens):
            """Seconds until both buckets hold enough, with a 50 ms floor."""
            req_wait = max(0, (1 - self._req_tokens)) * 60 / self.rpm
            tok_wait = max(0, (estimated_tokens - self._tok_tokens)) * 60 / self.tpm
            return max(req_wait, tok_wait, 0.05)
Why this works: the limiter refills continuously based on elapsed time rather than resetting on a fixed schedule, so it smooths bursts instead of allowing a full quota every 60 seconds and then stalling. Because it checks the token budget too, a few large prompts cannot silently blow the TPM ceiling. You pass an estimate of the request’s token count, which you can get cheaply before the call — see our guide on token counting and budget management for accurate counts.
Using it looks like this:
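    from openai import AsyncOpenAI

    async_client = AsyncOpenAI()
    limiter = DualRateLimiter(rpm=500, tpm=200_000)


    async def guarded_call(messages, estimated_tokens):
        # Wait for budget first, then make the actual API call.
        await limiter.acquire(estimated_tokens)
        return await async_client.chat.completions.create(
            model="gpt-4o-mini", messages=messages
        )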
Set your limiter’s rpm and tpm slightly below the provider’s actual caps, leaving 10–20% headroom for estimation error and clock skew.
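Both inputs to the limiter can be derived with a couple of small helpers. This is a rough sketch under stated assumptions: the four-characters-per-token ratio is only a rule of thumb for English text (a real tokenizer is more accurate), and the function names, the 256-token output budget, and the 15% default headroom are illustrative.

```python
def estimate_tokens(messages, max_output_tokens: int = 256, chars_per_token: float = 4.0) -> int:
    """Rough estimate: prompt characters / chars_per_token, plus the output budget."""
    prompt_chars = sum(len(m.get("content") or "") for m in messages)
    return int(prompt_chars / chars_per_token) + max_output_tokens


def with_headroom(limit: int, headroom_pct: int = 15) -> int:
    """Shrink a provider cap by headroom_pct percent, using integer math."""
    return limit * (100 - headroom_pct) // 100


# Hypothetical provider caps of 500 RPM and 200,000 TPM.
print(with_headroom(500), with_headroom(200_000))  # 425 170000
```

The reduced values are what you would pass as `rpm` and `tpm` when constructing the limiter, and `estimate_tokens(messages)` is what you would pass to `acquire`.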
Circuit Breakers: Stop Hammering a Dead Service
Backoff assumes the service will recover soon. But when a provider is having a genuine outage, retrying — even politely — wastes time and can slow your own recovery, because every request holds a connection and a worker while it waits. A circuit breaker solves this by tracking failures and, once they cross a threshold, “opening” to reject calls instantly for a cooldown period. This pattern comes straight from microservices resilience; our post on circuit breakers and resilience patterns covers the broader theory.
    import time


    class CircuitBreaker:
        """Fail fast when an upstream is clearly down."""

        def __init__(self, failure_threshold=5, recovery_timeout=30.0):
            self.failure_threshold = failure_threshold
            self.recovery_timeout = recovery_timeout
            self._failures = 0
            self._opened_at = None

        def before_call(self):
            """Raise immediately if the circuit is open; otherwise let the call proceed."""
            if self._opened_at is None:
                return
            if time.monotonic() - self._opened_at >= self.recovery_timeout:
                # Half-open: allow one trial request through, and restart the
                # cooldown so other callers keep failing fast until it reports back.
                self._opened_at = time.monotonic()
                return
            raise RuntimeError("Circuit open: upstream LLM unavailable")

        def record_success(self):
            """Close the circuit and reset the consecutive-failure count."""
            self._failures = 0
            self._opened_at = None

        def record_failure(self):
            """Count a failure and open (or reopen) the circuit at the threshold."""
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = time.monotonic()
Why this works: after five consecutive failures the breaker opens, and every subsequent call fails immediately instead of waiting through six backoff attempts. After the cooldown it enters a half-open state and lets one request test the waters; success closes the circuit, and failure reopens it. Consequently, a downstream outage degrades your app gracefully instead of hanging every request thread.
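A breaker only helps if every call reports its outcome. One way to wire that up is a thin wrapper like the sketch below, which works with any object exposing `before_call`, `record_success`, and `record_failure`. The `call_with_breaker` name and the `trip_on` parameter are hypothetical; `trip_on` exists so that client mistakes such as a 400 do not count toward opening the circuit.

```python
def call_with_breaker(breaker, fn, *args, trip_on=(Exception,), **kwargs):
    """Run fn through the breaker, reporting the outcome back to it."""
    breaker.before_call()  # raises immediately while the circuit is open
    try:
        result = fn(*args, **kwargs)
    except trip_on:
        # Only the exception types in trip_on count as upstream failures.
        breaker.record_failure()
        raise
    breaker.record_success()
    return result
```

In practice `trip_on` would be narrowed to the transient error types you already retry, so a malformed request never opens the circuit.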
In production, combine all three layers: the token bucket prevents self-inflicted 429s, backoff handles occasional throttling, and the breaker protects you during real outages. Many teams get these for free by routing through a gateway like LiteLLM, which centralizes retries, fallbacks, and rate limiting across providers.
Handling Streaming Responses
Streaming complicates retries because a failure can happen mid-stream, after you have already yielded partial tokens to the user. You cannot cleanly retry a request whose first half already reached the client. The safe approach is to only retry failures that occur before the first chunk arrives.
    import time

    # Reuses client, _backoff, RateLimitError, and APIConnectionError
    # from the backoff example earlier in this guide.


    def stream_with_retry(messages, model="gpt-4o-mini", max_retries=3):
        for attempt in range(max_retries + 1):
            first_chunk_seen = False
            try:
                stream = client.chat.completions.create(
                    model=model, messages=messages, stream=True
                )
                for chunk in stream:
                    first_chunk_seen = True
                    if not chunk.choices:
                        continue  # some chunks (e.g. usage-only) carry no choices
                    delta = chunk.choices[0].delta.content
                    if delta:
                        yield delta
                return
            except (RateLimitError, APIConnectionError):
                if first_chunk_seen or attempt == max_retries:
                    raise  # Can't safely restart a partially-sent stream
                time.sleep(_backoff(attempt, 1.0, 30.0))
Why this works: the first_chunk_seen flag draws a hard line. Before any output reaches the caller, a failure is safe to retry from scratch. Once you have streamed even one token, restarting would produce a garbled, duplicated response, so the code re-raises instead. For a deeper look at delivery mechanics, see our comparison of streaming responses over SSE vs WebSockets.
A Real-World Scenario: The Batch Job That Took Down Chat
Consider a mid-sized SaaS product with a live chat assistant and a nightly job that re-summarizes thousands of documents. Both features hit the same OpenAI account, so they share one RPM/TPM budget. When the batch job kicks off, it fires requests as fast as asyncio allows, saturates the token limit, and the interactive chat feature starts throwing 429s to real users during the overlap window.
The naive fix — wrapping chat calls in aggressive retries — makes it worse, because those retries add even more load to an already-saturated account. The retry storm extends the outage instead of shortening it.
The durable fix has three parts. First, put a shared token bucket in front of both features so total throughput stays under the provider ceiling. Next, give the interactive chat a higher-priority lane so batch work yields to user-facing traffic. Finally, add a circuit breaker so that if the provider genuinely degrades, both features fail fast instead of piling on. Over a few weeks of tuning the RPM/TPM headroom, a small team can turn a recurring nightly incident into a non-event.
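The first two parts of that fix can be sketched as a single shared bucket in which batch traffic must leave a reserve for chat. This is an illustration, not a production limiter: the `PriorityBucket` class, its parameters, and the reserve-based priority scheme are assumptions, and the code is not thread-safe. The injectable `clock` exists only to make the behaviour easy to test.

```python
import time


class PriorityBucket:
    """Shared token bucket where batch traffic must leave a reserve for chat."""

    def __init__(self, capacity, refill_per_s, batch_reserve, clock=time.monotonic):
        self.capacity = float(capacity)
        self.refill_per_s = float(refill_per_s)
        self.batch_reserve = float(batch_reserve)
        self._clock = clock
        self._level = float(capacity)
        self._updated = clock()

    def _refill(self):
        """Top up the bucket in proportion to the time elapsed."""
        now = self._clock()
        self._level = min(self.capacity, self._level + (now - self._updated) * self.refill_per_s)
        self._updated = now

    def try_acquire(self, cost, interactive):
        """Take cost tokens if allowed; batch callers cannot dip into the reserve."""
        self._refill()
        floor = 0.0 if interactive else self.batch_reserve
        if self._level - cost >= floor:
            self._level -= cost
            return True
        return False
```

Batch workers call `try_acquire(cost, interactive=False)` and back off when it returns false, while chat calls pass `interactive=True` and can spend the reserved portion.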
When to Use Client-Side Rate Limiting
- Your app has multiple features or workers sharing one provider account and budget
- You run batch or background jobs alongside latency-sensitive user traffic
- You consistently approach your RPM or TPM ceiling during normal operation
- You want predictable spend and need to cap throughput deliberately
When NOT to Use Aggressive Retries
- The error is a 400, 401, or 403: the request will never succeed as-is
- You are streaming and have already sent partial output to the user
- The provider is in a confirmed outage; open a circuit breaker instead of retrying
- Your prompt exceeds the model’s context window; retrying wastes tokens and money
Common Mistakes with LLM Rate Limiting
- Retrying without jitter, which turns many clients into a synchronized thundering herd
- Ignoring the Retry-After header and guessing the wait time instead
- Counting only requests and forgetting the token-per-minute limit that trips large prompts
- Setting an unbounded retry count, so a hard failure hangs the caller indefinitely
- Retrying non-idempotent side effects (like writing the same result twice) without deduplication
- Running your own retries and the SDK’s built-in retries at the same time, silently multiplying attempts
How to Layer These Strategies Together
For a production LLM feature, apply the strategies in this order:
- Estimate tokens before the call and reserve budget from a token bucket
- Pace requests so you stay 10–20% under the provider’s RPM and TPM caps
- Wrap the call in exponential backoff with full jitter for transient errors
- Honor Retry-After whenever the provider sends it
- Trip a circuit breaker after repeated failures to fail fast during outages
- Cache repeated prompts so you never spend a rate-limit slot twice
That last step is easy to overlook. If your traffic includes repeated or near-identical prompts, semantic caching removes those calls from your rate budget entirely, which is often the cheapest capacity win available.
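As a starting point, here is a sketch of an exact-match cache, which is a simpler technique than semantic caching: it only catches byte-identical prompts, whereas a semantic cache would also match near-identical ones (typically via embeddings). The `PromptCache` class is hypothetical, keeps everything in an unbounded in-memory dict, and has no expiry.

```python
import hashlib
import json


class PromptCache:
    """Exact-match cache keyed on a hash of the model name and messages."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(model, messages):
        """Stable hash of the request; sort_keys keeps dict ordering from mattering."""
        payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, model, messages, call):
        """Return the cached result, or invoke call(model, messages) and store it."""
        cache_key = self.key(model, messages)
        if cache_key in self._store:
            self.hits += 1
            return self._store[cache_key]
        self.misses += 1
        result = call(model, messages)
        self._store[cache_key] = result
        return result
```

Every hit is a request that never touches the provider, so it costs neither a request slot nor any tokens from the rate budget.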
Conclusion
Effective LLM rate limiting is not one technique but a stack: a token bucket to avoid 429s, exponential backoff with jitter to absorb the ones you can’t avoid, and a circuit breaker to stay sane during real outages. Get the layering right and traffic spikes become routine instead of incidents.
Start by adding jittered backoff to your highest-traffic call today, then measure how often you actually hit limits before you build the full token bucket. If you are wiring up an LLM app from scratch, pair this with our guides on building apps with the OpenAI API and token counting and budget management to keep both your reliability and your spend under control.