
System Design Interview: Designing a Rate Limiter

Every production API needs a rate limiter. Without one, a single misbehaving client — whether malicious or simply buggy — can overwhelm your servers and degrade the experience for every other user. Designing a rate limiter is also one of the most practical system design interview questions because it touches distributed systems, algorithm trade-offs, data store selection, and failure mode reasoning in a compact, well-scoped problem.

The challenge when designing a rate limiter goes beyond picking an algorithm. You need to decide where in the request path to enforce limits, how to share rate limit state across multiple servers, what happens when the rate limiter itself fails, and how to communicate limits clearly to API consumers. This walkthrough covers each of these decisions with production-grade implementations and the trade-offs that inform them.

If you want a refresher on rate limiting fundamentals before diving into system design, reviewing the core algorithms first will make the architectural decisions ahead easier to follow.

Clarifying Requirements

Rate limiters vary enormously in scope. A rate limiter for a public API like GitHub’s differs from one protecting an internal microservice. Establish the scope before designing.

Functional Requirements

  • Limit requests per client: Enforce a maximum number of requests within a time window (e.g., 100 requests per minute per API key)
  • Multiple limit tiers: Support different limits for different user plans (free, pro, enterprise)
  • Accurate counting: The limiter should not allow significantly more requests than the configured limit
  • Informative responses: Rejected requests receive clear headers indicating the limit, remaining quota, and reset time

Non-Functional Requirements

  • Low latency: The rate limit check must add minimal overhead to each request — ideally under 1ms
  • High availability: If the rate limiter goes down, the system should degrade gracefully rather than blocking all traffic
  • Distributed: The limiter must work correctly across multiple API servers sharing the same rate limit state
  • Consistency: Two servers checking the same client’s limit simultaneously should not both allow requests that together exceed the limit

Scale Estimation

Assume a public API platform with 10,000 active API keys, each limited to 1,000 requests per minute. At peak, roughly 20% of keys approach their limits simultaneously.

Peak rate limit checks (upper bound, assuming every key runs at its full limit): 10,000 keys × 1,000 req/min ÷ 60 s ≈ 170,000 checks per second. With only ~20% of keys near their limits at peak, sustained load is well below this bound, but the store should be provisioned for it.

Storage per key: Each rate limit entry requires a key identifier, counter, and timestamp — roughly 100 bytes. Total active state: 10,000 × 100 bytes = ~1 MB. This is trivially small, which tells you that the bottleneck is throughput (operations per second), not storage capacity.
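As a sanity check, the estimate reduces to a few lines of arithmetic (the per-entry size of 100 bytes is the same rough assumption as above):

```python
keys = 10_000
requests_per_key_per_min = 1_000
bytes_per_entry = 100  # rough assumption: key id + counter + timestamp

# Upper bound on check throughput: every key at its full limit at once
peak_checks_per_sec = keys * requests_per_key_per_min / 60
print(f"{peak_checks_per_sec:,.0f} checks/sec")  # 166,667 checks/sec

# Active rate limit state is trivially small by comparison
total_state_bytes = keys * bytes_per_entry
print(f"{total_state_bytes / 1_000_000} MB")  # 1.0 MB
```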

Where to Place the Rate Limiter

Before choosing an algorithm, decide where in the request path the rate limiter sits. This architectural decision affects latency, accuracy, and operational complexity.

Option 1: API Gateway Layer

Place the rate limiter in the API gateway, before requests reach your application servers. This is the most common pattern for public APIs.

Client → API Gateway (rate limit check) → Application Server → Database

Advantages: Rejected requests never reach your application servers, saving compute resources. The gateway is a natural enforcement point because all traffic flows through it. Additionally, managed gateways like AWS API Gateway and Kong include built-in rate limiting.

Disadvantages: Gateway-level rate limiting typically supports simple rules (requests per key per time window) but struggles with complex rules (different limits per endpoint, tiered limits based on response cost).

For teams already using an API gateway pattern, adding rate limiting at this layer is the path of least resistance.

Option 2: Application Middleware

Implement rate limiting as middleware within your application, after authentication but before the request handler.

Client → Load Balancer → Application Server (auth → rate limit → handler) → Database

Advantages: Full access to request context — you can apply different limits per endpoint, per user tier, or per operation cost. The middleware runs in the same process as your application, which simplifies deployment.

Disadvantages: Requests still consume load balancer and application server resources before rejection. Additionally, every application server needs to query the shared rate limit state, adding a network round-trip per request.

Option 3: Dedicated Rate Limiter Service

Run the rate limiter as a separate service that your application queries before processing each request.

Advantages: Centralized rate limit logic, clean separation of concerns, reusable across multiple APIs.

Disadvantages: Adds another network hop and another service to operate. This complexity is justified for large platforms with multiple APIs sharing rate limit rules, but overkill for most applications.

For most teams, application middleware backed by Redis provides the best balance. You get full request context, straightforward deployment, and shared state across servers. Move to gateway-level rate limiting when you need to protect against DDoS or want to reject traffic before it reaches your application servers.

Rate Limiting Algorithms

The algorithm determines how requests are counted and when they are allowed or rejected. Each algorithm makes different trade-offs between accuracy, memory usage, and burst handling. The sections below cover the four main options (fixed window, sliding window log, sliding window counter, and token bucket) with a focus on implementing them in a distributed system.

Fixed Window Counter

Divide time into fixed windows (e.g., one-minute intervals) and count requests per window. When the count exceeds the limit, reject requests until the next window starts.

import redis
import time

redis_client = redis.Redis(host='limiter.internal', port=6379, decode_responses=True)

def is_allowed_fixed_window(client_id: str, limit: int, window_seconds: int) -> bool:
    current_window = int(time.time() // window_seconds)
    key = f"rl:{client_id}:{current_window}"

    # Atomic increment and expiry
    pipe = redis_client.pipeline()
    pipe.incr(key)
    pipe.expire(key, window_seconds + 1)  # Extra second to prevent race on window boundary
    results = pipe.execute()

    current_count = results[0]
    return current_count <= limit

Why this is the simplest option: One Redis key per client per window, one atomic increment per request. Memory usage is minimal, and expired windows clean up automatically.

The boundary problem: If a client sends 100 requests at 11:00:59 and another 100 at 11:01:01, they effectively send 200 requests in 2 seconds despite a 100-per-minute limit. The window boundary creates a burst vulnerability.
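The exploit is easy to demonstrate with a small in-memory simulation of a fixed window counter (a sketch for illustration, not the Redis implementation above):

```python
def simulate_fixed_window(timestamps, limit, window_seconds):
    """Count how many of the given requests a fixed window counter admits."""
    counts = {}
    allowed = 0
    for t in sorted(timestamps):
        window = int(t // window_seconds)
        counts[window] = counts.get(window, 0) + 1
        if counts[window] <= limit:
            allowed += 1
    return allowed

# 100 requests just before the minute boundary, 100 just after:
burst = [59.5] * 100 + [60.5] * 100
print(simulate_fixed_window(burst, limit=100, window_seconds=60))  # 200
```

All 200 requests are admitted in a two-second span, despite the 100-per-minute limit, because each batch lands in a different window.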

Sliding Window Log

Track the timestamp of every request within the window. When a new request arrives, remove timestamps older than the window and count the remaining ones.

def is_allowed_sliding_log(client_id: str, limit: int, window_seconds: int) -> bool:
    now = time.time()
    window_start = now - window_seconds
    key = f"rl:log:{client_id}"

    pipe = redis_client.pipeline()
    # Remove entries outside the window
    pipe.zremrangebyscore(key, 0, window_start)
    # Add the current request. Rejected requests also record a timestamp,
    # so a client that keeps retrying while over the limit stays blocked;
    # stricter than the configured limit, but usually desirable.
    pipe.zadd(key, {str(now): now})
    # Count entries in the window
    pipe.zcard(key)
    # Set expiry on the whole key
    pipe.expire(key, window_seconds + 1)
    results = pipe.execute()

    current_count = results[2]
    return current_count <= limit

Why this is more accurate: No boundary problem — the window slides continuously with each request. A client limited to 100 requests per minute can never exceed 100 requests in any rolling 60-second period.

The cost: Memory usage grows linearly with the number of requests. A client making 1,000 requests per minute stores 1,000 timestamps in a Redis sorted set. At scale, this adds up — 10,000 active clients × 1,000 entries × ~50 bytes per entry = roughly 500 MB. Consequently, sliding window logs work well for moderate rate limits but become expensive for high-volume endpoints.

Sliding Window Counter

A hybrid approach that approximates the sliding window without storing individual timestamps. Maintain counters for the current and previous windows, then calculate a weighted count based on how far through the current window you are.

def is_allowed_sliding_counter(client_id: str, limit: int, window_seconds: int) -> bool:
    now = time.time()
    current_window = int(now // window_seconds)
    previous_window = current_window - 1
    elapsed_ratio = (now % window_seconds) / window_seconds

    current_key = f"rl:{client_id}:{current_window}"
    previous_key = f"rl:{client_id}:{previous_window}"

    pipe = redis_client.pipeline()
    pipe.get(current_key)
    pipe.get(previous_key)
    results = pipe.execute()

    current_count = int(results[0] or 0)
    previous_count = int(results[1] or 0)

    # Weighted estimate: previous window's contribution decreases as time progresses
    estimated_count = previous_count * (1 - elapsed_ratio) + current_count

    if estimated_count >= limit:
        return False

    # Increment the current window. Note: the read-estimate-increment
    # sequence here is not atomic, so under heavy concurrency a few extra
    # requests can slip through; wrapping the whole flow in a Lua script
    # (as the token bucket below does) makes it fully atomic.
    pipe = redis_client.pipeline()
    pipe.incr(current_key)
    pipe.expire(current_key, window_seconds * 2)
    pipe.execute()

    return True

Why this is the practical choice: Memory usage matches fixed window (two counters per client), while accuracy approaches sliding window. The estimation introduces a small error margin, but in practice the difference is negligible. Most production rate limiters use this approach because it provides the best trade-off between accuracy and resource usage.

Token Bucket

The token bucket algorithm works differently from window-based approaches. Each client has a bucket that fills with tokens at a steady rate. Every request consumes one token. If the bucket is empty, the request is rejected.

def is_allowed_token_bucket(
    client_id: str,
    bucket_capacity: int,
    refill_rate: float,  # tokens per second
) -> bool:
    key = f"rl:tb:{client_id}"
    now = time.time()

    # Atomic Lua script to avoid race conditions
    lua_script = """
    local key = KEYS[1]
    local capacity = tonumber(ARGV[1])
    local refill_rate = tonumber(ARGV[2])
    local now = tonumber(ARGV[3])

    local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
    local tokens = tonumber(bucket[1]) or capacity
    local last_refill = tonumber(bucket[2]) or now

    -- Refill tokens based on elapsed time
    local elapsed = now - last_refill
    tokens = math.min(capacity, tokens + elapsed * refill_rate)

    local allowed
    if tokens >= 1 then
        tokens = tokens - 1
        allowed = 1
    else
        allowed = 0
    end

    -- Persist the updated bucket either way, with a TTL long enough to
    -- outlive a full refill cycle
    redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
    redis.call('EXPIRE', key, math.ceil(capacity / refill_rate) + 1)
    return allowed
    """

    result = redis_client.eval(lua_script, 1, key, bucket_capacity, refill_rate, now)
    return result == 1

Why token bucket excels at burst handling: A client with a bucket capacity of 100 and a refill rate of 10 per second can burst 100 requests instantly, then sustain 10 per second after that. This matches how real API clients behave — they often fetch a batch of data at startup, then make steady requests afterward. Window-based algorithms treat that initial burst as abuse, while token bucket accommodates it naturally.
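That burst-then-sustain behavior can be seen in a minimal in-memory token bucket (a sketch with an injected clock, mirroring the Lua logic above):

```python
class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens per second
        self.tokens = float(capacity)
        self.last_refill = 0.0

    def allow(self, now):
        # Refill based on elapsed time, capped at bucket capacity
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(capacity=100, refill_rate=10)
burst = sum(bucket.allow(now=0.0) for _ in range(120))
print(burst)                  # 100: the full bucket drains instantly
print(bucket.allow(now=1.0))  # True: 10 tokens refilled after one second
```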

Which Algorithm to Choose

Algorithm              | Memory   | Accuracy                 | Burst Handling | Complexity
Fixed Window           | Very low | Boundary spikes          | Poor           | Lowest
Sliding Window Log     | High     | Exact                    | Good           | Medium
Sliding Window Counter | Low      | Near-exact               | Good           | Medium
Token Bucket           | Low      | Exact for sustained rate | Excellent      | Higher

For most public APIs, sliding window counter provides the best overall balance. For APIs where burst tolerance matters (mobile apps, batch processing clients), token bucket is the stronger choice.

Handling Race Conditions

In a distributed system, two requests from the same client can arrive at two different servers simultaneously. Both servers check the rate limit, both see “under limit,” and both allow the request — exceeding the actual limit.

The Read-Check-Increment Race

Server A: READ counter → 99 (under limit of 100)
Server B: READ counter → 99 (under limit of 100)
Server A: INCREMENT → 100 ✓
Server B: INCREMENT → 101 ✗ (should have been rejected)

Solution: Atomic Operations

Redis provides atomic commands that eliminate this race. The INCR command reads and increments in a single atomic operation, so two concurrent increments always produce sequential values.

For more complex algorithms (like token bucket), use Redis Lua scripts. Lua scripts execute atomically on the Redis server — no other command can interleave during execution. The token bucket implementation above uses a Lua script specifically to avoid this race condition.

Solution: Redis Cluster Considerations

If you use Redis Cluster for high availability, be aware that keys for the same client might hash to different shards. Use hash tags to ensure all rate limit keys for a given client route to the same shard:

# Without hash tag — keys might land on different shards
key_a = f"rl:{client_id}:current"
key_b = f"rl:{client_id}:previous"

# With hash tag — both keys guaranteed to be on the same shard
key_a = f"rl:{{{client_id}}}:current"
key_b = f"rl:{{{client_id}}}:previous"

The curly braces tell Redis Cluster to hash only the content between them, ensuring both keys map to the same slot.
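Redis Cluster maps a key to a slot by taking CRC16-CCITT of the key (or of the hash-tag content, if one is present) modulo 16384. A self-contained sketch of that computation shows why the tagged keys land together; in production, cluster clients do this internally:

```python
def crc16(data: bytes) -> int:
    """CRC16-CCITT (XModem variant), the checksum Redis Cluster uses."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def key_slot(key: str) -> int:
    """Hash only the content of the first non-empty {...} tag, if any."""
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end > start + 1:  # tag content must be non-empty
            key = key[start + 1:end]
    return crc16(key.encode()) % 16384

# Both keys hash to the same slot because only "client42" is hashed
print(key_slot("rl:{client42}:current") == key_slot("rl:{client42}:previous"))  # True
```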

Rate Limit Response Headers

Clear communication with API consumers is as important as the rate limiting logic itself. Standard headers tell clients exactly where they stand.

from fastapi import FastAPI, Request, Response
from fastapi.responses import JSONResponse

app = FastAPI()

@app.middleware("http")
async def rate_limit_middleware(request: Request, call_next):
    # Note: request.client can be None (e.g. in some test clients), and a
    # .get() default would be evaluated eagerly even when the key is present
    client_id = request.headers.get("X-API-Key") or (
        request.client.host if request.client else "anonymous"
    )
    limit = get_client_limit(client_id)
    window = 60  # seconds

    allowed, current_count, reset_time = check_rate_limit(client_id, limit, window)

    if not allowed:
        response = JSONResponse(
            status_code=429,
            content={"error": "Rate limit exceeded", "retry_after": reset_time}
        )
    else:
        response = await call_next(request)

    # Always include rate limit headers, even on successful requests
    response.headers["X-RateLimit-Limit"] = str(limit)
    response.headers["X-RateLimit-Remaining"] = str(max(0, limit - current_count))
    response.headers["X-RateLimit-Reset"] = str(reset_time)
    if not allowed:
        # Only set Retry-After on rejections; an empty header value just
        # confuses clients
        response.headers["Retry-After"] = str(reset_time)

    return response

Including X-RateLimit-Remaining on every response — not just rejected ones — lets well-behaved clients throttle themselves before hitting the limit. This proactive communication reduces the total number of rejected requests and improves the developer experience for API consumers.
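On the consumer side, these headers enable that self-throttling. A sketch of the client-side decision (header names match the middleware above; treating X-RateLimit-Reset as an epoch timestamp in seconds is an assumption):

```python
def throttle_delay(headers: dict, now: float) -> float:
    """Seconds a well-behaved client should wait before its next request."""
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    reset = float(headers.get("X-RateLimit-Reset", str(now)))
    if remaining <= 0:
        # Quota exhausted: wait until the window resets
        return max(0.0, reset - now)
    return 0.0  # quota available: proceed immediately

print(throttle_delay({"X-RateLimit-Remaining": "0",
                      "X-RateLimit-Reset": "1700000030"}, now=1700000000.0))  # 30.0
print(throttle_delay({"X-RateLimit-Remaining": "42"}, now=1700000000.0))      # 0.0
```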

Failure Modes and Graceful Degradation

What happens when Redis (your rate limit store) goes down? This question separates production-ready rate limiters from toy implementations.

Fail-Open vs Fail-Closed

Fail-open: If the rate limiter cannot check the limit (Redis is unreachable), allow the request through. This prioritizes availability — your API stays functional even when the rate limiter is down.

Fail-closed: If the rate limiter cannot check, reject the request. This prioritizes safety — no one can exceed their limits, even during outages.

import logging

log = logging.getLogger(__name__)

def check_rate_limit_with_fallback(client_id: str, limit: int, window: int) -> bool:
    try:
        return is_allowed_sliding_counter(client_id, limit, window)
    except redis.ConnectionError:
        # Fail-open: allow the request if Redis is unreachable
        log.warning(f"Rate limiter unavailable, allowing request for {client_id}")
        return True

Most public APIs fail-open because a brief period without rate limiting causes less damage than a complete API outage. However, for financial APIs or authentication endpoints where abuse has severe consequences, fail-closed is the safer choice. Combined with circuit breaker patterns, you can detect persistent Redis failures quickly and switch to a local in-memory fallback rather than choosing between fully open and fully closed.
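A minimal circuit breaker for this purpose might look like the following sketch (the thresholds and injected clock are illustrative choices, not from the original design):

```python
class CircuitBreaker:
    """Trips open after consecutive failures; half-opens after a cooldown."""
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_attempt(self, now: float) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: permit a probe once the cooldown has elapsed
        return now - self.opened_at >= self.reset_timeout

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self, now: float):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = now

breaker = CircuitBreaker()
for _ in range(5):
    breaker.record_failure(now=100.0)
print(breaker.allow_attempt(now=101.0))  # False: tripped, go straight to local fallback
print(breaker.allow_attempt(now=131.0))  # True: cooldown elapsed, probe Redis again
```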

Local Fallback

When Redis is unavailable, fall back to per-server in-memory rate limiting. The limits become less accurate (each server tracks independently), but you maintain some protection against abuse.

from collections import defaultdict
import threading
import time

# Rough count of application servers sharing the traffic; in production
# this would come from service discovery rather than a constant
EXPECTED_SERVER_COUNT = 8

local_counters = defaultdict(lambda: {"count": 0, "window": 0})
local_lock = threading.Lock()

def local_rate_limit(client_id: str, limit: int, window_seconds: int) -> bool:
    current_window = int(time.time() // window_seconds)

    with local_lock:
        entry = local_counters[client_id]
        if entry["window"] != current_window:
            entry["count"] = 0
            entry["window"] = current_window

        entry["count"] += 1
        # Divide the global limit across servers, since each server sees
        # only its own share of a client's traffic
        per_server_limit = max(1, limit // EXPECTED_SERVER_COUNT)
        return entry["count"] <= per_server_limit

Real-World Scenario: Rate Limiting a Multi-Tier API

A developer tools company offers a public API with three pricing tiers: free (60 requests per minute), pro (600 per minute), and enterprise (6,000 per minute). The API runs across eight application servers behind an AWS Application Load Balancer, with a Redis Cluster for rate limit state.

Initially, the team implements a fixed window counter. This works for months until a free-tier user discovers they can double their effective rate by timing requests at window boundaries. The user’s monitoring dashboard sends bursts at the end and start of each minute, getting roughly 120 requests per minute despite a 60-request limit.

The team switches to a sliding window counter, which closes the boundary exploit. They also add per-endpoint rate limits — the search endpoint, which is computationally expensive, gets a separate lower limit of 10 requests per minute for free-tier users.

Six months later, an enterprise client reports sporadic 429 errors despite being well under their 6,000-per-minute limit. Investigation reveals that the client’s requests hash to two different Redis Cluster shards because the rate limit keys lack hash tags. Two shards each see half the requests and allow them, but the combined total occasionally exceeds the limit when both servers increment simultaneously. Adding hash tags ({client_id} in the key pattern) routes all of a client’s keys to the same shard, resolving the issue.

The lesson from this progression is that rate limiter bugs tend to be subtle. They manifest as occasional inconsistencies rather than obvious failures, and they often go undetected until a client actively monitors their usage.

When to Use a Custom Rate Limiter

  • You need rate limiting logic that is more complex than what managed services offer (per-endpoint limits, cost-based limits, adaptive limits)
  • Your application already uses Redis, and adding rate limiting requires minimal new infrastructure
  • You want full visibility into rate limit state for debugging and monitoring
  • You are preparing for a system design interview where designing a rate limiter is the explicit question

When NOT to Build Your Own

  • Your API gateway already includes rate limiting that meets your requirements — most managed gateways (AWS API Gateway, Kong, Nginx) include configurable rate limiting out of the box
  • You need DDoS protection, which operates at a different layer (network/transport) than application-level rate limiting
  • Your rate limiting needs are straightforward (single limit per API key) and a library like express-rate-limit or fastapi-limiter handles it in a few lines of code

Common Mistakes When Designing a Rate Limiter

  • Using the fixed window algorithm without understanding the boundary burst problem, then being surprised when clients exceed their limits by nearly 2x at window transitions
  • Not using atomic operations for the read-check-increment cycle, allowing race conditions that let concurrent requests slip through the limit
  • Forgetting to set TTLs on rate limit keys in Redis, causing memory to grow indefinitely as stale keys for inactive clients accumulate
  • Implementing fail-closed without a fallback, causing a Redis outage to take down the entire API rather than just the rate limiting
  • Returning 429 errors without Retry-After or X-RateLimit-Reset headers, forcing clients to guess when they can retry and leading to unnecessary retry storms
  • Applying the same rate limit to all endpoints when some endpoints cost significantly more to serve than others
  • Not monitoring the rate limiter itself — if you cannot see how often clients hit their limits, you cannot tell whether your limits are too restrictive or too generous

Completing the Rate Limiter Design

Designing a rate limiter combines algorithm selection, distributed state management, and failure mode reasoning into a focused system design problem. The sliding window counter provides the best starting point for most applications — it is memory-efficient, accurate, and avoids the boundary exploit of fixed windows. Redis serves as the shared state store because its atomic operations prevent race conditions and its sub-millisecond latency keeps the overhead negligible.

The production details matter as much as the algorithm: fail-open with a local fallback keeps your API available during Redis outages, hash tags prevent split-brain counting in Redis Cluster, and clear rate limit headers make your API a pleasure to integrate with. When designing a rate limiter in an interview, demonstrating awareness of these operational concerns is what distinguishes a strong answer from an average one.
