If your app calls OpenAI, Anthropic, or any hosted model, you will eventually hit a 429 Too Many Requests. It happens during a traffic spike, a batch job, or a noisy retry loop that makes things worse. This guide covers LLM rate limiting and retry strategies that hold up in production: how provider limits actually work, how to back off correctly, and how to stop a single failing dependency from taking down your whole service.

The audience here is intermediate-to-senior engineers shipping LLM features who need more than a naive try/except. You will leave with working Python patterns for exponential backoff, jitter, client-side throttling, and circuit breaking, plus a clear sense of when each one earns its keep.

Why LLM APIs Rate-Limit You in the First Place

LLM providers enforce limits to protect shared GPU capacity and to keep one customer from starving another. Unlike a typical CRUD API that caps requests per second, LLM APIs usually enforce two independent ceilings: requests per minute (RPM) and tokens per minute (TPM). You can blow past either one, and hitting the token limit is often the surprise.

For example, a summarization job that sends 50 requests in a minute might sail under the RPM limit but still hit TPM, because each request carries a 4,000-token document and 50 × 4,000 is 200,000 tokens. You therefore have to reason about both dimensions when you size your throughput.
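To make the two ceilings concrete, here is a minimal sketch that works out which limit binds first for a given request size. The function name and the example limits (500 RPM, 200,000 TPM) are illustrative assumptions, not any provider's actual quota.

```python
def effective_requests_per_minute(rpm_limit: int, tpm_limit: int, tokens_per_request: int) -> int:
    """Return how many requests per minute fit under both ceilings."""
    if tokens_per_request <= 0:
        raise ValueError("tokens_per_request must be positive")
    # The token ceiling allows at most tpm_limit // tokens_per_request calls.
    by_tokens = tpm_limit // tokens_per_request
    return min(rpm_limit, by_tokens)


# Hypothetical limits: 500 RPM and 200,000 TPM.
print(effective_requests_per_minute(500, 200_000, 400))    # request-bound: 500
print(effective_requests_per_minute(500, 200_000, 4_000))  # token-bound: 50
```

With 4,000-token requests the token ceiling allows only a tenth of what the request ceiling would suggest.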

Most providers expose your current standing through response headers. After each call, OpenAI returns headers such as x-ratelimit-remaining-requests, x-ratelimit-remaining-tokens, and x-ratelimit-reset-tokens. Reading these lets you slow down before you get rejected, which is far cheaper than reacting to a 429 after the fact.
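As a rough sketch of reading those headers, the helper below turns a header mapping into a small status object. The names parse_reset, RateLimitStatus, and read_rate_limit_status are hypothetical, the header keys are assumed to be lowercase, and the duration format (values such as "6m0s" or "250ms") is an assumption to verify against your provider's documentation.

```python
import re
from dataclasses import dataclass
from typing import Mapping, Optional

_DURATION_PART = re.compile(r"(\d+(?:\.\d+)?)(ms|s|m|h)")
_UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}


def parse_reset(value: str) -> float:
    """Convert a duration string such as '6m0s' or '250ms' to seconds."""
    parts = _DURATION_PART.findall(value)
    if not parts:
        raise ValueError(f"unrecognized duration: {value!r}")
    return sum(float(number) * _UNIT_SECONDS[unit] for number, unit in parts)


@dataclass
class RateLimitStatus:
    remaining_requests: Optional[int]
    remaining_tokens: Optional[int]
    reset_tokens_seconds: Optional[float]


def read_rate_limit_status(headers: Mapping[str, str]) -> RateLimitStatus:
    """Extract the remaining budget from a response's headers."""

    def _int(name: str) -> Optional[int]:
        raw = headers.get(name)
        return int(raw) if raw is not None else None

    reset = headers.get("x-ratelimit-reset-tokens")
    return RateLimitStatus(
        remaining_requests=_int("x-ratelimit-remaining-requests"),
        remaining_tokens=_int("x-ratelimit-remaining-tokens"),
        reset_tokens_seconds=parse_reset(reset) if reset else None,
    )
```

With the OpenAI Python SDK, one way to reach the headers is the with_raw_response accessor (for example client.chat.completions.with_raw_response.create(...), then .headers and .parse() on the result). A caller could then pause for reset_tokens_seconds whenever remaining_tokens drops below the size of the next request.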

What Counts as a Retryable Error

Not every failure deserves a retry. Retrying a genuinely broken request wastes money and latency, and in some cases it amplifies an outage. The rule of thumb: retry transient, server-side, or throttling errors, and fail fast on client mistakes.

Status	Meaning	Retry?
429	Rate limit or quota exceeded	Yes, with backoff
500	Internal server error	Yes, with backoff
502 / 503	Bad gateway / unavailable	Yes, with backoff
529	Anthropic overloaded	Yes, with backoff
408	Request timeout	Yes, limited
400	Malformed request	No, fix the payload
401 / 403	Auth or permission failure	No, fix credentials
404	Model or endpoint not found	No

The distinction matters because a 400 caused by a prompt that exceeds the context window will never succeed on retry, no matter how long you wait. Classify errors explicitly rather than blindly retrying everything.
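One way to make that classification explicit is a small lookup that mirrors the table above. This is a sketch: RetryDecision and classify_status are hypothetical names, and treating every unlisted status as fail-fast is a conservative assumption you may want to relax.

```python
from enum import Enum


class RetryDecision(Enum):
    RETRY = "retry"
    RETRY_LIMITED = "retry_limited"
    FAIL_FAST = "fail_fast"


# Mirrors the table above; 529 is the Anthropic "overloaded" status.
_RETRY = {429, 500, 502, 503, 529}
_RETRY_LIMITED = {408}


def classify_status(status_code: int) -> RetryDecision:
    """Map an HTTP status code to a retry decision."""
    if status_code in _RETRY:
        return RetryDecision.RETRY
    if status_code in _RETRY_LIMITED:
        return RetryDecision.RETRY_LIMITED
    # Assumption: anything not listed (400, 401, 403, 404, ...) fails fast.
    return RetryDecision.FAIL_FAST
```

Keeping the decision in one function means the retry loop, the metrics, and the alerting all agree on what counts as transient.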

Exponential Backoff With Jitter

The core retry pattern for LLM rate limiting is exponential backoff: wait longer after each failure so the upstream service gets room to recover. A fixed delay is not enough, because if 100 clients all fail at once and all wait exactly two seconds, they retry in a synchronized wave and cause a second pileup. This is the “thundering herd” problem.

Jitter fixes it by randomizing each wait. Instead of everyone retrying at t + 2s, each client picks a random point in the window, spreading the load smoothly.

Here is a self-contained implementation using the official OpenAI SDK, with no external retry library:

import random
import time
import logging
from openai import OpenAI, APIStatusError, APIConnectionError, RateLimitError

logger = logging.getLogger(__name__)
client = OpenAI()

RETRYABLE_STATUS = {408, 429, 500, 502, 503, 529}

def chat_with_backoff(
    messages,
    model="gpt-4o-mini",
    max_retries=6,
    base_delay=1.0,
    max_delay=60.0,
):
    """Call the chat API with exponential backoff and full jitter."""
    last_err = None
    for attempt in range(max_retries + 1):
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except RateLimitError as err:
            # Honor the server's own hint when present.
            last_err = err
            retry_after = _retry_after_seconds(err)
            delay = retry_after if retry_after else _backoff(attempt, base_delay, max_delay)
        except APIStatusError as err:
            if err.status_code not in RETRYABLE_STATUS:
                raise  # 400/401/403 etc. are not worth retrying
            last_err = err
            delay = _backoff(attempt, base_delay, max_delay)
        except APIConnectionError as err:
            # Network blip; treat as transient.
            last_err = err
            delay = _backoff(attempt, base_delay, max_delay)

        if attempt == max_retries:
            logger.error("Exhausted retries after %s attempts", attempt + 1)
            # A bare `raise` is invalid outside an except block, so re-raise
            # the exception saved from the last failed attempt.
            raise last_err
        logger.warning("Retry %s in %.2fs", attempt + 1, delay)
        time.sleep(delay)


def _backoff(attempt, base_delay, max_delay):
    """Full jitter: random point in [0, capped exponential window]."""
    window = min(max_delay, base_delay * (2 ** attempt))
    return random.uniform(0, window)


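def _retry_after_seconds(err):
    """Parse a numeric Retry-After header if the provider sent one."""
    headers = getattr(err.response, "headers", {}) or {}
    value = headers.get("retry-after")
    try:
        return float(value) if value else None
    except ValueError:
        # Retry-After can also be an HTTP date; fall back to computed backoff.
        return None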

Why this works: the _backoff helper caps the exponential growth so waits never balloon past a minute, and full jitter (random.uniform(0, window)) prevents synchronized retries. Meanwhile, _retry_after_seconds respects the server when it explicitly tells you how long to wait, which is always more accurate than your guess.
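To see what those defaults imply, the sketch below lists the jitter window for each attempt and the worst-case total sleep. It reuses the default values from chat_with_backoff; the helper names are hypothetical, and the totals ignore any Retry-After override.

```python
def backoff_windows(max_retries=6, base_delay=1.0, max_delay=60.0):
    """Upper bound of the jitter window for each attempt index."""
    return [
        min(max_delay, base_delay * (2 ** attempt))
        for attempt in range(max_retries + 1)
    ]


def worst_case_sleep(max_retries=6, base_delay=1.0, max_delay=60.0):
    """Longest total sleep before the final attempt gives up."""
    windows = backoff_windows(max_retries, base_delay, max_delay)
    # The last attempt raises instead of sleeping, so only the first
    # max_retries windows can contribute.
    return sum(windows[:max_retries])


print(backoff_windows())   # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0]
print(worst_case_sleep())  # 63.0
```

Because full jitter draws uniformly from each window, the expected total sleep is half the worst case, about 31.5 seconds here. That number is worth comparing against your request timeout before you raise max_retries.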

Respect the Retry-After Header

When a provider sends a Retry-After header, use it. It reflects the server’s real reset window, so sleeping for that exact duration avoids both premature retries (which fail again) and over-long waits (which hurt latency). The code above prefers the header and only falls back to computed backoff when it is absent.
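The HTTP specification allows Retry-After to carry either a number of seconds or an HTTP date, so a parser that handles both is a little more robust than a plain float conversion. Here is a minimal sketch using only the standard library; parse_retry_after is a hypothetical helper name, and the now parameter exists purely to make the function easy to test.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_retry_after(value: Optional[str], now: Optional[datetime] = None) -> Optional[float]:
    """Return the wait in seconds for a Retry-After value, or None if unusable."""
    if not value:
        return None
    value = value.strip()
    try:
        return max(0.0, float(value))  # delay-seconds form, e.g. "20"
    except ValueError:
        pass
    try:
        retry_at = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None  # neither form: let the caller fall back to backoff
    if retry_at.tzinfo is None:
        retry_at = retry_at.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means the caller may retry immediately.
    return max(0.0, (retry_at - now).total_seconds())
```

Returning None for unparseable values keeps the fallback decision in the retry loop, where the computed backoff already lives.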

Using tenacity for Cleaner Retry Logic

Hand-rolled loops work, but they clutter business logic. The tenacity library lets you declare retry policy as a decorator, which keeps call sites readable, and it is a common choice in production Python code.

import logging
from openai import OpenAI, RateLimitError, APIConnectionError, InternalServerError
from tenacity import (
    retry,
    stop_after_attempt,
    wait_random_exponential,
    retry_if_exception_type,
    before_sleep_log,
)

logger = logging.getLogger(__name__)
client = OpenAI()

@retry(
    retry=retry_if_exception_type(
        (RateLimitError, APIConnectionError, InternalServerError)
    ),
    wait=wait_random_exponential(multiplier=1, max=60),
    stop=stop_after_attempt(6),
    before_sleep=before_sleep_log(logger, logging.WARNING),
    reraise=True,
)
def summarize(text, model="gpt-4o-mini"):
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Summarize the text in two sentences."},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content

Why this works: wait_random_exponential gives you exponential growth plus jitter in one call, retry_if_exception_type restricts retries to transient failures only, and reraise=True surfaces the original exception after the final attempt instead of wrapping it. Notably, before_sleep_log gives you free observability into every retry without extra code.

The OpenAI and Anthropic SDKs also retry automatically (twice by default). You can tune that with max_retries on the client, but wrapping calls yourself gives finer control over which errors qualify and how you log them. In practice, teams often set the SDK’s built-in retries to zero and manage the policy explicitly.
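The reason to pick one retry layer is that the two multiply. A quick back-of-the-envelope sketch, with worst_case_attempts as a hypothetical helper name:

```python
def worst_case_attempts(wrapper_attempts: int, sdk_max_retries: int) -> int:
    """Total HTTP requests when an outer retry wrapper calls an SDK that also retries."""
    # Each wrapper attempt triggers one initial request plus up to
    # sdk_max_retries retries inside the SDK.
    return wrapper_attempts * (1 + sdk_max_retries)


print(worst_case_attempts(6, 2))  # stop_after_attempt(6) over two SDK retries: 18
print(worst_case_attempts(6, 0))  # SDK retries disabled: 6
```

With the OpenAI SDK, passing max_retries=0 when constructing the client turns the inner layer off, so the decorator's stop_after_attempt(6) means exactly six requests.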

Client-Side Rate Limiting With a Token Bucket

Retries handle failures after they happen. A smarter move is to avoid the 429 entirely by pacing your own requests below the provider’s ceiling. A token bucket is the standard tool: tokens refill at a fixed rate, each request consumes one, and when the bucket is empty callers wait. If you want the theory behind these algorithms, see our breakdown of token bucket, leaky bucket, and fixed window strategies.

For LLM APIs, you often need two buckets — one for requests and one for tokens — because both limits apply. Here is an async limiter that gates on both:

import asyncio
import time

class DualRateLimiter:
    """Throttle by both requests-per-minute and tokens-per-minute."""

    def __init__(self, rpm: int, tpm: int):
        self.rpm = rpm
        self.tpm = tpm
        self._req_tokens = float(rpm)
        self._tok_tokens = float(tpm)
        self._updated = time.monotonic()
        self._lock = asyncio.Lock()

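    async def acquire(self, estimated_tokens: int):
        """Block until both request and token budgets allow the call."""
        if estimated_tokens > self.tpm:
            # The bucket never holds more than tpm, so this would wait forever.
            raise ValueError("estimated_tokens exceeds the per-minute token budget")
        while True: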
            async with self._lock:
                self._refill()
                if self._req_tokens >= 1 and self._tok_tokens >= estimated_tokens:
                    self._req_tokens -= 1
                    self._tok_tokens -= estimated_tokens
                    return
                wait = self._time_until_ready(estimated_tokens)
            await asyncio.sleep(wait)

    def _refill(self):
        now = time.monotonic()
        elapsed = now - self._updated
        self._req_tokens = min(self.rpm, self._req_tokens + elapsed * self.rpm / 60)
        self._tok_tokens = min(self.tpm, self._tok_tokens + elapsed * self.tpm / 60)
        self._updated = now

    def _time_until_ready(self, estimated_tokens):
        req_wait = max(0, (1 - self._req_tokens)) * 60 / self.rpm
        tok_wait = max(0, (estimated_tokens - self._tok_tokens)) * 60 / self.tpm
        return max(req_wait, tok_wait, 0.05)

Why this works: the limiter refills continuously based on elapsed time rather than resetting on a fixed schedule, so it smooths bursts instead of allowing a full quota every 60 seconds and then stalling. Because it checks the token budget too, a few large prompts cannot silently blow the TPM ceiling. You pass an estimate of the request’s token count, which you can get cheaply before the call — see our guide on token counting and budget management for accurate counts.

Using it looks like this:

from openai import AsyncOpenAI

async_client = AsyncOpenAI()
limiter = DualRateLimiter(rpm=500, tpm=200_000)

async def guarded_call(messages, estimated_tokens):
    await limiter.acquire(estimated_tokens)
    return await async_client.chat.completions.create(
        model="gpt-4o-mini", messages=messages
    )

Set your limiter’s rpm and tpm slightly below the provider’s actual caps, leaving 10–20% headroom for estimation error and clock skew.
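As a sketch of both inputs, the helpers below produce a rough token estimate and scale a provider cap down by a safety margin. The names are hypothetical, the four-characters-per-token figure is a rule of thumb for English text rather than a real tokenizer, and the estimator assumes each message's content is a plain string.

```python
import math


def estimate_tokens(messages, max_output_tokens=512, chars_per_token=4.0):
    """Rough budget for one call: prompt estimate plus reserved output."""
    prompt_chars = sum(len(m.get("content") or "") for m in messages)
    # Approximation only; swap in a real tokenizer when accuracy matters.
    return math.ceil(prompt_chars / chars_per_token) + max_output_tokens


def with_headroom(provider_limit: int, headroom_percent: int = 15) -> int:
    """Scale a provider cap down by a safety margin given in whole percent."""
    if not 0 <= headroom_percent < 100:
        raise ValueError("headroom_percent must be between 0 and 99")
    # Integer arithmetic avoids floating-point rounding surprises.
    return provider_limit * (100 - headroom_percent) // 100


print(with_headroom(500))      # 425
print(with_headroom(200_000))  # 170000
```

Reserving max_output_tokens up front matters because the completion counts against the token budget as well as the prompt. If the estimate runs consistently high or low, adjust the headroom rather than the heuristic.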

Circuit Breakers: Stop Hammering a Dead Service

Backoff assumes the service will recover soon. But when a provider is having a genuine outage, retrying — even politely — wastes time and can slow your own recovery, because every request holds a connection and a worker while it waits. A circuit breaker solves this by tracking failures and, once they cross a threshold, “opening” to reject calls instantly for a cooldown period. This pattern comes straight from microservices resilience; our post on circuit breakers and resilience patterns covers the broader theory.

import time

class CircuitBreaker:
    """Fail fast when an upstream is clearly down."""

    def __init__(self, failure_threshold=5, recovery_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._failures = 0
        self._opened_at = None

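    def before_call(self):
        if self._opened_at is None:
            return
        if time.monotonic() - self._opened_at >= self.recovery_timeout:
            # Half-open: let this one trial through and restart the cooldown,
            # so other callers keep failing fast until the trial resolves.
            self._opened_at = time.monotonic()
            return
        raise RuntimeError("Circuit open: upstream LLM unavailable")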

    def record_success(self):
        self._failures = 0
        self._opened_at = None

    def record_failure(self):
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = time.monotonic()

Why this works: after five consecutive failures the breaker opens, and every subsequent call fails immediately instead of waiting through six backoff attempts. After the cooldown it enters a half-open state and lets one request test the waters; success closes the circuit, and failure reopens it. Consequently, a downstream outage degrades your app gracefully instead of hanging every request thread.
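The class only tracks state, so something has to call its three methods around each request. A minimal sketch of that wiring follows; call_with_breaker and the Breaker protocol are hypothetical names, and failure_types is an assumed parameter that lets you count only transient errors toward the threshold.

```python
from typing import Callable, Protocol, Tuple, Type, TypeVar

T = TypeVar("T")


class Breaker(Protocol):
    """Any object exposing the three CircuitBreaker methods."""

    def before_call(self) -> None: ...
    def record_success(self) -> None: ...
    def record_failure(self) -> None: ...


def call_with_breaker(
    breaker: Breaker,
    fn: Callable[[], T],
    failure_types: Tuple[Type[BaseException], ...] = (Exception,),
) -> T:
    """Run fn under the breaker and record the outcome."""
    breaker.before_call()  # raises immediately while the circuit is open
    try:
        result = fn()
    except failure_types:
        breaker.record_failure()
        raise
    # Exceptions outside failure_types propagate without touching the breaker.
    breaker.record_success()
    return result
```

A call site might pass failure_types=(RateLimitError, APIConnectionError, InternalServerError) so that a malformed request does not push the breaker toward opening.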

In production, combine all three layers: the token bucket prevents self-inflicted 429s, backoff handles occasional throttling, and the breaker protects you during real outages. Many teams get these for free by routing through a gateway like LiteLLM, which centralizes retries, fallbacks, and rate limiting across providers.

Handling Streaming Responses

Streaming complicates retries because a failure can happen mid-stream, after you have already yielded partial tokens to the user. You cannot cleanly retry a request whose first half already reached the client. The safe approach is to only retry failures that occur before the first chunk arrives.

def stream_with_retry(messages, model="gpt-4o-mini", max_retries=3):
    for attempt in range(max_retries + 1):
        first_chunk_seen = False
        try:
            stream = client.chat.completions.create(
                model=model, messages=messages, stream=True
            )
            for chunk in stream:
                first_chunk_seen = True
                if not chunk.choices:
                    continue  # some chunks carry no choices; skip them safely
                delta = chunk.choices[0].delta.content
                if delta:
                    yield delta
            return
        except (RateLimitError, APIConnectionError):
            if first_chunk_seen or attempt == max_retries:
                raise  # Can't safely restart a partially-sent stream
            time.sleep(_backoff(attempt, 1.0, 30.0))

Why this works: the first_chunk_seen flag draws a hard line. Before any output reaches the caller, a failure is safe to retry from scratch. Once you have streamed even one token, restarting would produce a garbled, duplicated response, so the code re-raises instead. For a deeper look at delivery mechanics, see our comparison of streaming responses over SSE vs WebSockets.

A Real-World Scenario: The Batch Job That Took Down Chat

Consider a mid-sized SaaS product with a live chat assistant and a nightly job that re-summarizes thousands of documents. Both features hit the same OpenAI account, so they share one RPM/TPM budget. When the batch job kicks off, it fires requests as fast as asyncio allows, saturates the token limit, and the interactive chat feature starts throwing 429s to real users during the overlap window.

The naive fix — wrapping chat calls in aggressive retries — makes it worse, because those retries add even more load to an already-saturated account. The retry storm extends the outage instead of shortening it.

The durable fix has three parts. First, put a shared token bucket in front of both features so total throughput stays under the provider ceiling. Next, give the interactive chat a higher-priority lane so batch work yields to user-facing traffic. Finally, add a circuit breaker so that if the provider genuinely degrades, both features fail fast instead of piling on. Over a few weeks of tuning the RPM/TPM headroom, a small team can turn a recurring nightly incident into a non-event.
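The priority lane is the least obvious of the three parts, so here is one possible sketch built on asyncio.PriorityQueue. PriorityGate, the CHAT and BATCH constants, and the fixed interval are all assumptions for illustration; in a real service the dispatcher would more likely wait on the shared token bucket than on a timer.

```python
import asyncio
import itertools

CHAT, BATCH = 0, 1  # lower number = higher priority (hypothetical lane names)


class PriorityGate:
    """Release queued callers one at a time, interactive traffic first."""

    def __init__(self):
        self._queue = asyncio.PriorityQueue()
        self._order = itertools.count()  # tie-breaker keeps FIFO order per lane

    async def wait_turn(self, priority: int):
        turn = asyncio.get_running_loop().create_future()
        await self._queue.put((priority, next(self._order), turn))
        await turn

    async def run(self, interval: float):
        """Dispatcher: grant one turn per interval, highest priority first."""
        while True:
            _, _, turn = await self._queue.get()
            if not turn.cancelled():
                turn.set_result(None)
            await asyncio.sleep(interval)


async def demo():
    gate = PriorityGate()
    served = []

    async def worker(name, priority):
        await gate.wait_turn(priority)
        served.append(name)

    # Two batch requests enqueue first, then a chat request arrives.
    jobs = [("batch-1", BATCH), ("batch-2", BATCH), ("chat-1", CHAT)]
    tasks = [asyncio.create_task(worker(n, p)) for n, p in jobs]
    await asyncio.sleep(0)  # let every worker enqueue before dispatch starts
    dispatcher = asyncio.create_task(gate.run(interval=0.01))
    await asyncio.gather(*tasks)
    dispatcher.cancel()
    return served


print(asyncio.run(demo()))
```

The chat request is served first even though it arrived last. Strict priority like this can starve batch work under sustained chat load, so a production version would likely reserve a minimum share for the lower lane.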

When to Use Client-Side Rate Limiting

Your app has multiple features or workers sharing one provider account and budget
You run batch or background jobs alongside latency-sensitive user traffic
You consistently approach your RPM or TPM ceiling during normal operation
You want predictable spend and need to cap throughput deliberately

When NOT to Use Aggressive Retries

The error is a 400, 401, or 403 — the request will never succeed as-is
You are streaming and have already sent partial output to the user
The provider is in a confirmed outage; open a circuit breaker instead of retrying
Your prompt exceeds the model’s context window; retrying wastes tokens and money

Common Mistakes with LLM Rate Limiting

Retrying without jitter, which turns many clients into a synchronized thundering herd
Ignoring the Retry-After header and guessing the wait time instead
Counting only requests and forgetting the token-per-minute limit that trips large prompts
Setting an unbounded retry count, so a hard failure hangs the caller indefinitely
Retrying non-idempotent side effects (like writing the same result twice) without deduplication
Running your own retries on top of the SDK’s built-in retries, which silently multiplies the number of attempts

How to Layer These Strategies Together

For a production LLM feature, apply the strategies in this order:

1. Estimate tokens before the call and reserve budget from a token bucket
2. Pace requests so you stay 10–20% under the provider’s RPM and TPM caps
3. Wrap the call in exponential backoff with full jitter for transient errors
4. Honor Retry-After whenever the provider sends it
5. Trip a circuit breaker after repeated failures to fail fast during outages
6. Cache repeated prompts so you never spend a rate-limit slot twice

That last step is easy to overlook. If your traffic includes repeated or near-identical prompts, semantic caching removes those calls from your rate budget entirely, which is often the cheapest capacity win available.
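Semantic caching needs an embedding model and a similarity search, which is beyond a short example. The sketch below shows the simpler exact-match variant instead, which already removes byte-identical repeats. PromptCache is a hypothetical class, the store is an unbounded in-memory dict with no expiry, and it assumes that replaying an earlier answer is acceptable for your prompts.

```python
import hashlib
import json
from typing import Callable, Dict, List


class PromptCache:
    """In-memory exact-match cache keyed on model and messages."""

    def __init__(self):
        self._store: Dict[str, str] = {}

    @staticmethod
    def _key(model: str, messages: List[dict]) -> str:
        # sort_keys makes the key independent of dict key order.
        payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, model: str, messages: List[dict], call: Callable[[], str]) -> str:
        key = self._key(model, messages)
        if key not in self._store:
            self._store[key] = call()  # only a miss spends rate-limit budget
        return self._store[key]
```

Place the cache in front of the limiter so that a hit never reserves tokens from the bucket.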

Conclusion

Effective LLM rate limiting is not one technique but a stack: a token bucket to avoid 429s, exponential backoff with jitter to absorb the ones you can’t avoid, and a circuit breaker to stay sane during real outages. Get the layering right and traffic spikes become routine instead of incidents.

Start by adding jittered backoff to your highest-traffic call today, then measure how often you actually hit limits before you build the full token bucket. If you are wiring up an LLM app from scratch, pair this with our guides on building apps with the OpenAI API and token counting and budget management to keep both your reliability and your spend under control.
