Designing Rate Limiters: A System Design Deep Dive

Intermediate 11 min read

What you'll learn

✓ How token bucket, leaky bucket, and sliding window differ
✓ Where to enforce limits in a distributed topology
✓ How to handle clock skew and counter contention
✓ When to use Redis vs in-process limits
✓ How to expose limit state to clients gracefully

Prerequisites

• Familiar with how APIs work
• Basic Redis knowledge

What and Why

Rate limiters protect a system from overload and abuse. They are the first guardrail between hostile traffic and your core services. Without them, a single misbehaving client, a runaway script, or a DDoS attempt can starve every legitimate user. With them, you get predictable tail latency, fair multi-tenant behavior, and a clear contract for API consumers.

A rate limiter answers one question: should this request proceed right now, given some budget? Everything else is implementation detail.

Mental Model

Think of a rate limiter as a budget plus a clock. The budget says how many requests fit in a window. The clock decides when budget is refilled. Different algorithms encode different policies.

Fixed window: count requests per discrete time bucket. Simple, but a burst that straddles a window boundary can pass up to twice the limit.
Sliding window: count requests in a rolling time range. More accurate, more memory.
Token bucket: tokens drip in at a steady rate, requests consume tokens. Permits bursts up to bucket capacity.
Leaky bucket: requests queue and drain at a fixed rate. Smooths traffic at the cost of latency.
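
The edge-burst weakness of the fixed window is easiest to see with numbers. Below is a minimal sketch, assuming a limit of 5 requests per 1-second window; `FixedWindowLimiter` and its parameters are illustrative names, not part of any library.

```python
class FixedWindowLimiter:
    """Counts requests per discrete window of `window_s` seconds."""

    def __init__(self, limit: int, window_s: float) -> None:
        self.limit = limit
        self.window_s = window_s
        self.counts = {}  # window index -> requests admitted in that window

    def allow(self, now_s: float) -> bool:
        window = int(now_s // self.window_s)
        used = self.counts.get(window, 0)
        if used >= self.limit:
            return False
        self.counts[window] = used + 1
        return True


limiter = FixedWindowLimiter(limit=5, window_s=1.0)

# Five requests at the very end of window 0, five more at the start of window 1.
burst = [0.90, 0.92, 0.94, 0.96, 0.98, 1.00, 1.02, 1.04, 1.06, 1.08]
admitted = [t for t in burst if limiter.allow(t)]

# All ten pass although they arrive within 0.18 s of each other:
# twice the nominal limit inside a single rolling second.
print(len(admitted))  # 10
```

A sliding window or token bucket would reject part of this burst, which is the accuracy the extra state buys.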

For most APIs, token bucket is the right default. It mirrors how clients actually behave: bursty workloads with steady averages.
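
To make the burst-then-steady behavior concrete, here is a small in-process token bucket. It is a sketch under stated assumptions: the class name, the injectable `clock` parameter, and the fake clock in the demo are all illustrative, and it is not thread-safe.

```python
import time


class TokenBucket:
    """In-process token bucket: refills at `rate` tokens/second up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock          # injectable so the demo below is deterministic
        self.tokens = capacity      # start full, so an initial burst is allowed
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Credit tokens for the time elapsed since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens < cost:
            return False
        self.tokens -= cost
        return True


# Deterministic demo with a fake clock (a hypothetical helper, not a library API).
fake_now = [0.0]
bucket = TokenBucket(rate=2.0, capacity=5.0, clock=lambda: fake_now[0])

burst = [bucket.allow() for _ in range(6)]   # 5 pass on the full bucket, the 6th is rejected
fake_now[0] = 1.0                            # one second later, 2 tokens have dripped in
later = [bucket.allow() for _ in range(3)]   # 2 pass, the 3rd is rejected
print(burst, later)
```

Capacity sets how large a burst you tolerate and rate sets the long-run average, so the two can be tuned independently.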

Architecture

In a distributed system, every node must agree on the budget. The usual approach is a centralized counter store (Redis) updated by atomic Lua scripts, or a dedicated rate limit service that proxies call out to, such as Envoy’s ratelimit service.

Client -> Edge LB -> API Gateway --check--> Redis (token bucket)
                        |                       ^
                        | allow                 | atomic INCR/EVAL
                        v                       |
                     Service Pod ---------------+
                        |
                        v
                     Downstream

Distributed rate limiter at the edge

The gateway evaluates the limit before forwarding. Redis holds one key per (tenant, route) tuple. A Lua script does the read-modify-write atomically so two pods cannot double-spend a token.

For lower latency, push the limit into the gateway process itself with a small in-memory cache and periodic Redis reconciliation. This trades accuracy for speed: a tenant might briefly exceed the global limit during sync windows.
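
The overshoot during sync windows can be sketched in a few lines. This is an illustration only: `CentralStore` is an in-memory stand-in for Redis, `LocalLimiter` and `sync()` are hypothetical names, and window expiry is left out to keep the example short.

```python
class CentralStore:
    """Stand-in for Redis in this sketch: one shared counter per key."""

    def __init__(self) -> None:
        self.totals = {}

    def add_and_get(self, key: str, delta: int) -> int:
        # Plays the role of an atomic INCRBY that returns the new total.
        self.totals[key] = self.totals.get(key, 0) + delta
        return self.totals[key]


class LocalLimiter:
    """Decides from local memory; talks to the store only in sync()."""

    def __init__(self, store: CentralStore, limit: int) -> None:
        self.store = store
        self.limit = limit
        self.known_total = {}   # key -> global count as of the last sync
        self.pending = {}       # key -> requests admitted locally since then

    def allow(self, key: str) -> bool:
        seen = self.known_total.get(key, 0) + self.pending.get(key, 0)
        if seen >= self.limit:
            return False
        self.pending[key] = self.pending.get(key, 0) + 1
        return True

    def sync(self) -> None:
        # Push local deltas and pull fresh totals. Keys with nothing pending are
        # refreshed too, so traffic admitted by other gateways becomes visible.
        for key in set(self.pending) | set(self.known_total):
            delta = self.pending.pop(key, 0)
            self.known_total[key] = self.store.add_and_get(key, delta)


store = CentralStore()
gw_a = LocalLimiter(store, limit=10)
gw_b = LocalLimiter(store, limit=10)

# Between syncs each gateway sees only its own traffic, so both admit 10.
admitted = sum(gw_a.allow("tenant-1") for _ in range(12))
admitted += sum(gw_b.allow("tenant-1") for _ in range(12))
print(admitted)  # 20: double the global limit of 10

gw_a.sync()
gw_b.sync()
gw_a.sync()  # second pass so gw_a also sees what gw_b pushed
print(gw_a.allow("tenant-1"), gw_b.allow("tenant-1"))  # False False
```

In this sketch the worst-case overshoot grows with the number of gateways and the time between syncs, which is why the approach suits soft limits better than hard ones.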

Trade-offs

Every limiter design picks a point in a triangle of accuracy, latency, and cost.

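Centralized Redis: high accuracy, one network hop per request, single point of contention. Every check for a key lands on the same shard, so hot keys become bottlenecks once a tenant exceeds a few thousand RPS on that shard; the exact threshold depends on hardware and script cost.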
Local counters with gossip: low latency, eventually consistent, can overshoot during a burst. Good for soft limits where occasional overshoot is acceptable.
Sharded Redis with consistent hashing: spreads hot keys across nodes. Adds complexity in failure modes when a shard goes down.
Probabilistic (e.g., approximate sliding window): trades exactness for memory. Useful at extreme scale where you cannot afford a counter per tenant per minute.
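
For the sharded option, the property that matters is what happens to key placement when a shard disappears. The following is a minimal consistent-hash ring, assuming SHA-256 for hashing and 64 virtual nodes per shard; `HashRing` and the shard names `redis-0` to `redis-2` are made up for illustration.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring with virtual nodes; maps a limiter key to a shard."""

    def __init__(self, shards, vnodes: int = 64) -> None:
        self._points = []  # (hash, shard) pairs, sorted by hash
        for shard in shards:
            for i in range(vnodes):
                self._points.append((self._hash(f"{shard}#{i}"), shard))
        self._points.sort()
        self._hashes = [h for h, _ in self._points]

    @staticmethod
    def _hash(value: str) -> int:
        return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

    def shard_for(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect_right(self._hashes, self._hash(key)) % len(self._hashes)
        return self._points[idx][1]


keys = [f"tenant-{n}:/v1/orders" for n in range(1000)]
before = HashRing(["redis-0", "redis-1", "redis-2"])
after = HashRing(["redis-0", "redis-1"])  # redis-2 has failed

moved = sum(before.shard_for(k) != after.shard_for(k) for k in keys)
on_failed = sum(before.shard_for(k) == "redis-2" for k in keys)
print(moved, on_failed)  # the two counts match: only keys from the lost shard move
```

Keys that move start with fresh counters on their new shard, so tenants on the failed shard briefly get a reset budget. Note also that the ring spreads keys, not load within a key: a single very hot key still lands on one shard.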

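Clock skew matters more than people expect. If two pods disagree on the current second, fixed window limiters charge the same instant to different windows and produce phantom overflows. Token bucket is less exposed because it measures elapsed time, not absolute time, but only if every update reads the same clock: take the timestamp from one source, such as the Redis TIME command, rather than from each pod.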
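
A phantom overflow under a fixed window can be reproduced with two pods and one shared counter. This sketch assumes pod-b's wall clock runs 400 ms ahead; the pod names, the `SKEW` table, and the limit of 5 per second are invented for the example.

```python
WINDOW_S = 1.0
LIMIT = 5

# Assumed offsets of each pod's wall clock from true time, in seconds.
SKEW = {"pod-a": 0.0, "pod-b": 0.4}

counts = {}  # window index -> requests admitted (the shared counter store)


def allow(pod: str, true_time_s: float) -> bool:
    """Fixed-window check where the window is derived from the pod's own clock."""
    window = int((true_time_s + SKEW[pod]) // WINDOW_S)
    if counts.get(window, 0) >= LIMIT:
        return False
    counts[window] = counts.get(window, 0) + 1
    return True


# Twelve requests arrive at the same instant, alternating between the two pods.
same_instant = sum(allow(pod, 99.8) for pod in ["pod-a", "pod-b"] * 6)
print(same_instant)  # 10: pod-a charges window 99, pod-b already charges window 100

# Shortly after, real second 100 begins, but its budget is already spent.
print(allow("pod-a", 100.1))  # False
```

The tenant gets double the limit in one instant and is then rejected in a second where it has sent nothing, all without any bug in the counting logic.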

Another quiet trap is retry storms. A 429 with no Retry-After header invites every client to retry simultaneously, recreating the spike. Always include Retry-After and document jittered backoff in your client SDK.
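
Both halves of that contract fit in a few lines. The sketch below assumes a token-bucket limiter on the server and an exponential backoff with full jitter on the client; the function names and the default `base_s` and `cap_s` values are illustrative choices, not taken from any SDK.

```python
import math
import random


def retry_after_seconds(tokens: float, rate_per_s: float, cost: float = 1.0) -> int:
    """Whole seconds until a token bucket holding `tokens` can cover `cost`."""
    deficit = max(0.0, cost - tokens)
    return math.ceil(deficit / rate_per_s)


def backoff_delay(attempt: int, retry_after_s: float = 0.0,
                  base_s: float = 0.5, cap_s: float = 30.0, rng=random) -> float:
    """Full-jitter exponential backoff that never retries before Retry-After."""
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return retry_after_s + rng.uniform(0.0, ceiling)


# Server side: the bucket holds 0.25 tokens and refills at 0.5 tokens/s.
header_value = retry_after_seconds(tokens=0.25, rate_per_s=0.5)
print(f"Retry-After: {header_value}")  # Retry-After: 2

# Client side: 1000 clients that received the same 429 spread out their retries.
rng = random.Random(42)
delays = [backoff_delay(attempt=2, retry_after_s=header_value, rng=rng)
          for _ in range(1000)]
print(min(delays) >= 2.0, max(delays) <= 4.0)  # True True
```

Treating Retry-After as a floor and adding the jitter on top keeps clients from retrying early while still spreading them across the following seconds.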

Practical Tips

Limit by identity, not IP. IPs are NAT-shared, especially on mobile. Limit per API key, user ID, or tenant.
Layer your limits. Global (per service), tenant (fair share), and per-endpoint (protect expensive routes). Reject at the cheapest layer first.
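Use sorted sets in Redis for sliding window. ZREMRANGEBYSCORE to evict entries older than the window, ZCARD to count what remains, then ZADD the new request with its timestamp as the score and a unique member (two requests in the same millisecond would otherwise collapse into one entry). One Lua script, atomic.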
Emit limit headers. X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset. Clients can self-throttle and you reduce 429 volume.
Separate read and write budgets. Writes are usually scarcer; treat them as a different bucket.
Plan for the limiter being down. Fail open for low-risk endpoints, fail closed for write paths. Make that policy explicit.
Test with adversarial workloads. Replay traffic with realistic burst patterns; don’t rely on synthetic uniform load.
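
The sorted-set recipe for sliding windows is easier to reason about once you have seen the same steps in memory. The class below is an in-process analogue, not a Redis client: `SlidingWindowLog` is a hypothetical name, a deque stands in for the sorted set, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque


class SlidingWindowLog:
    """In-memory analogue of the sorted-set approach: one timestamp per admitted request."""

    def __init__(self, limit: int, window_ms: int) -> None:
        self.limit = limit
        self.window_ms = window_ms
        self.logs = {}  # key -> deque of admission timestamps (ms), oldest first

    def allow(self, key: str, now_ms: int) -> bool:
        log = self.logs.setdefault(key, deque())
        # 1. Evict entries that have left the window (ZREMRANGEBYSCORE in Redis).
        while log and log[0] <= now_ms - self.window_ms:
            log.popleft()
        # 2. Count what remains (ZCARD).
        if len(log) >= self.limit:
            return False
        # 3. Record this request only if it was admitted (ZADD).
        log.append(now_ms)
        return True


limiter = SlidingWindowLog(limit=3, window_ms=1000)
results = [limiter.allow("tenant-1", t) for t in (0, 100, 900, 950, 1000, 1099, 1100)]
print(results)  # [True, True, True, False, True, False, True]
```

Memory grows with the limit, because every admitted request in the window keeps an entry. That is the cost of exactness and the reason approximate variants exist.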

A worked token-bucket Lua snippet:

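-- KEYS[1] = bucket key
-- ARGV[1] = now_ms (ideally from one shared clock), ARGV[2] = refill rate in tokens/second, ARGV[3] = capacity
local now = tonumber(ARGV[1])
local rate = tonumber(ARGV[2])
local capacity = tonumber(ARGV[3])
local data = redis.call("HMGET", KEYS[1], "tokens", "ts")
-- A missing key means a new bucket: start full, stamped with the current time.
local tokens = tonumber(data[1]) or capacity
local ts = tonumber(data[2]) or now
-- Refill for the time since the last update; clamp so a lagging clock never yields negative time.
local elapsed = math.max(0, now - ts)
tokens = math.min(capacity, tokens + elapsed * rate / 1000)
-- Denied: leave the stored state untouched so elapsed keeps accruing from the old timestamp.
if tokens < 1 then return 0 end
tokens = tokens - 1
-- HSET replaces HMSET, which is deprecated since Redis 4.0.
redis.call("HSET", KEYS[1], "tokens", tokens, "ts", ARGV[1])
-- Keep the key for at least one full refill; an earlier expiry would reset the bucket to full too soon.
redis.call("PEXPIRE", KEYS[1], math.ceil(capacity / rate * 1000) + 1000)
return 1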

Wrap-up

A rate limiter looks trivial on a whiteboard and turns into a distributed systems problem in production. Pick token bucket as your default, enforce at the gateway, store state in Redis with atomic scripts, and design the failure mode before the happy path. Most outages I have seen from rate limiters were not from the algorithm being wrong; they were from the limiter itself becoming the bottleneck. Keep the hot path cheap, the failure mode loud, and the client contract honest.