System Design: Token Bucket Rate Limiter

What you'll learn

✓ How the token bucket algorithm works
✓ When to choose token bucket vs leaky bucket
✓ Implementing rate limiting with Redis
✓ Handling distributed counters safely
✓ Surfacing the right errors to clients

Prerequisites

• Familiarity with HTTP and databases

What and Why

A rate limiter caps how many requests a client can make within a given time window. Without one, a single buggy client or attacker can starve your service for everyone else. Even well-behaved clients can misbehave after a bad deploy. Rate limiting is your first line of defense for shared resources.

The token bucket algorithm is popular because it allows short bursts while enforcing a sustained rate. AWS, Stripe, and many CDN providers use variants of it.

Mental Model

Imagine a bucket that holds tokens. Tokens drip in at a constant rate, say 10 per second, up to a maximum capacity, say 100. Every request consumes one token. If the bucket is empty, the request is rejected or queued.

This gives you two knobs: a refill rate that defines steady-state throughput and a capacity that defines burst tolerance.
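The two knobs can be sketched as a small in-memory bucket. This is a minimal single-process illustration, not a production implementation: the class and method names are hypothetical, the bucket is assumed to start full, and tokens are refilled lazily on each call rather than by a background timer.

```python
import time


class TokenBucket:
    """Single-process token bucket: a float counter plus a timestamp."""

    def __init__(self, rate, capacity, now=None):
        self.rate = float(rate)          # tokens added per second
        self.capacity = float(capacity)  # maximum tokens the bucket can hold
        self.tokens = float(capacity)    # assumption: the bucket starts full
        self.updated = time.monotonic() if now is None else now

    def allow(self, now=None, cost=1.0):
        """Refill from elapsed time, then try to spend `cost` tokens."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        # Accrue tokens for the elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With rate=10 and capacity=100, a client can burst 100 requests at once and is then held to roughly 10 requests per second until the bucket refills. The optional now argument exists only to make the behavior easy to test with a fake clock.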

Architecture

For a single process, a counter and a timestamp are enough. For a distributed service, the bucket must live in a shared store so all instances see the same state. Redis is a common choice because it runs each Lua script atomically, so the read-compute-write cycle cannot interleave with other clients.

Client -> API Gateway -> Redis (Lua script)
                          |
                          v
                 tokens available? 
                   yes -> 200 OK
                   no  -> 429 Too Many Requests

Distributed token bucket with Redis

A Lua script atomically reads the current token count and last-refill timestamp, computes how many tokens have accrued since the last call, deducts one if available, and writes the new state back. This avoids race conditions between read and write.
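The steps above might look roughly like the following. Treat it as a sketch under assumptions: the hash field names (tokens, ts), the RedisTokenBucket wrapper, and the choice to pass the current time in from the caller are all illustrative, and the client is assumed to be a redis-py style object whose register_script returns a callable taking keys and args. Because the timestamp comes from the application server, the servers' clocks need to be reasonably in sync.

```python
import time

# Lua source executed atomically by Redis. Field names are illustrative.
TOKEN_BUCKET_LUA = """
local key = KEYS[1]
local rate = tonumber(ARGV[1])
local capacity = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

local state = redis.call('HMGET', key, 'tokens', 'ts')
local tokens = tonumber(state[1])
local ts = tonumber(state[2])
if tokens == nil or ts == nil then
  -- No stored state: treat the bucket as full.
  tokens = capacity
  ts = now
end

-- Accrue tokens for the time elapsed since the last call.
local elapsed = math.max(0, now - ts)
tokens = math.min(capacity, tokens + elapsed * rate)

local allowed = 0
if tokens >= 1 then
  tokens = tokens - 1
  allowed = 1
end

redis.call('HSET', key, 'tokens', tokens, 'ts', now)
-- Expire idle keys once the bucket would have refilled completely.
redis.call('EXPIRE', key, math.ceil(capacity / rate) + 1)
-- Return tokens as a string so the fractional part is not truncated.
return {allowed, tostring(tokens)}
"""


class RedisTokenBucket:
    """Thin wrapper around the script; `client` is a redis-py style client."""

    def __init__(self, client, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self._script = client.register_script(TOKEN_BUCKET_LUA)

    def allow(self, key, now=None):
        now = time.time() if now is None else now
        allowed, tokens = self._script(
            keys=[key], args=[self.rate, self.capacity, now]
        )
        if isinstance(tokens, bytes):
            tokens = tokens.decode("ascii")
        return bool(allowed), float(tokens)
```

A call such as RedisTokenBucket(client, rate=10, capacity=100).allow("rl:user:42:search") returns whether the request may proceed and how many tokens remain.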

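Key naming usually includes the client identity and the resource, like rl:user:42:search. Set the TTL slightly longer than the time an empty bucket needs to refill completely (capacity divided by refill rate). That keeps memory bounded, and an expired key is indistinguishable from a full bucket, so expiry never grants extra tokens.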
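As a small illustration of the key scheme and TTL sizing, the helpers below build a key in the rl:user:42:search shape and derive a TTL. Both function names are hypothetical, and the TTL formula assumes the key should outlive the time an empty bucket needs to refill to capacity, plus a second of slack.

```python
import math


def bucket_key(subject_type, subject_id, resource):
    """Build a key such as rl:user:42:search."""
    return f"rl:{subject_type}:{subject_id}:{resource}"


def bucket_ttl_seconds(rate, capacity, slack=1):
    """Seconds for an empty bucket to refill fully, plus a little slack."""
    return math.ceil(capacity / rate) + slack
```

For a bucket with capacity 100 refilled at 10 tokens per second, this gives a TTL of 11 seconds.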

Trade-offs

Token bucket allows bursts, which is friendly to real apps but can spike load briefly. Leaky bucket smooths traffic to a constant rate but is harsher on legitimate spikes.

A central Redis can become a bottleneck and a single point of failure. Sharding by user ID distributes load but complicates global limits, because no single shard sees all traffic. Some systems trade a little accuracy for speed by limiting locally per node and accepting that the global cap may be exceeded by a small factor.
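Sharding by user ID can be as simple as a stable hash modulo the shard count. The sketch below is an assumption about one way to do it: shard_for is a hypothetical helper, and CRC32 is used only because it is stable across processes (Python's built-in hash() is randomized per process for strings). Plain modulo remaps most keys when the shard count changes, which consistent hashing would mitigate.

```python
import zlib


def shard_for(user_id, shard_count):
    """Map a user ID to a shard index in [0, shard_count)."""
    digest = zlib.crc32(str(user_id).encode("utf-8"))
    return digest % shard_count
```

Every API instance computes the same index for the same user, so all of that user's bucket state lands on one Redis shard.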

Sliding window counters give more precise per-window enforcement but cost more memory because they store several counters per key (for example, one per sub-window) instead of a single token count and timestamp.
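For comparison, here is a minimal sketch of one common sliding window counter variant, which keeps two counters per key and weights the previous window by how much of it still overlaps the sliding window. The class name is hypothetical, the state is in-memory, and timestamps are assumed to be non-decreasing.

```python
import math


class SlidingWindowCounter:
    """Approximate sliding window using the current and previous fixed windows."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = float(window_seconds)
        self.index = None   # index of the current fixed window
        self.current = 0    # requests counted in the current window
        self.previous = 0   # requests counted in the window before it

    def allow(self, now):
        index = math.floor(now / self.window)
        if self.index is None:
            self.index = index
        elif index == self.index + 1:
            # Rolled into the next window: current becomes previous.
            self.previous, self.current = self.current, 0
            self.index = index
        elif index > self.index + 1:
            # More than one full window passed: old counts no longer apply.
            self.previous, self.current = 0, 0
            self.index = index
        # Fraction of the current fixed window that has elapsed.
        fraction = (now - index * self.window) / self.window
        # Weight the previous window by the part still inside the sliding window.
        estimate = self.previous * (1.0 - fraction) + self.current
        if estimate + 1 <= self.limit:
            self.current += 1
            return True
        return False
```

Unlike the token bucket, this enforces a count per window rather than a refill rate, so it has no separate burst capacity knob.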

Practical Tips

Return a clear 429 Too Many Requests with Retry-After and X-RateLimit-Remaining headers so clients can self-throttle. Hiding the limit forces clients to guess and retry aggressively, which makes things worse.
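A sketch of how those headers might be derived from the limiter's result follows. The function name and return shape are hypothetical, and Retry-After is computed under the assumption that the client needs one whole token, rounded up to at least one second.

```python
import math


def rate_limit_response(allowed, tokens_left, rate):
    """Return (status_code, headers) for a limiter decision."""
    remaining = max(0, math.floor(tokens_left))
    headers = {"X-RateLimit-Remaining": str(remaining)}
    if allowed:
        return 200, headers
    # Seconds until one whole token is available again, rounded up.
    retry_after = max(1, math.ceil((1.0 - tokens_left) / rate))
    headers["Retry-After"] = str(retry_after)
    return 429, headers
```

The 200 here only stands in for "let the request through"; in a real gateway the upstream handler decides the final status code.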

Make limits configurable per plan or per endpoint. A search endpoint may tolerate 5 req/s while a write endpoint should be tighter. Store these in a config service so you can adjust without redeploying.
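One possible shape for such configuration is a table keyed by plan and endpoint with a conservative default. Everything below is illustrative: the plan names, the Limit type, and all numbers other than the 5 req/s search example are made up, and in practice the table would be loaded from the config service rather than hard-coded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limit:
    rate: float    # sustained requests per second
    capacity: int  # burst size


# Hypothetical values; a real system would fetch these from a config service.
LIMITS = {
    ("free", "search"): Limit(rate=5, capacity=10),
    ("free", "write"): Limit(rate=1, capacity=2),
    ("pro", "search"): Limit(rate=50, capacity=100),
}
DEFAULT_LIMIT = Limit(rate=1, capacity=1)


def limit_for(plan, endpoint, limits=LIMITS):
    """Look up the limit for a plan and endpoint, falling back to a default."""
    return limits.get((plan, endpoint), DEFAULT_LIMIT)
```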

Always test under failure. Decide in advance what happens when Redis is unreachable: fail open (allow traffic) or fail closed (block). Fail open is friendlier but riskier; many teams pick fail closed for write paths and fail open for reads.
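The fail-open versus fail-closed choice can be made explicit with a small wrapper around the limiter call. This is a sketch: check_with_fallback is a hypothetical helper, and the default exception types are Python built-ins. With redis-py, you would likely pass redis.exceptions.RedisError as errors instead, since its connection errors do not derive from the built-in ConnectionError.

```python
def check_with_fallback(check, key, fail_open,
                        errors=(ConnectionError, TimeoutError)):
    """Run the limiter check; on a store failure, apply the chosen policy.

    `check` is any callable that takes a key and returns True (allow)
    or False (reject). `fail_open=True` allows traffic when the store
    is down; `fail_open=False` blocks it.
    """
    try:
        return check(key)
    except errors:
        return fail_open
```

A read path might call it with fail_open=True and a write path with fail_open=False, matching the split described above.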

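Cache the limiter decision briefly at the edge to avoid hitting Redis on every request from the same client. Rejections are the safest to cache, since they stay valid until tokens refill; caching an allow decision lets that client exceed its limit for as long as the cached entry lives.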
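One conservative way to cache decisions is to remember only rejections, and only until the client's retry time has passed. The sketch below assumes an in-process dictionary on each edge node; DenyCache and its methods are hypothetical names.

```python
import time


class DenyCache:
    """Remember rejected keys until their retry time has passed."""

    def __init__(self):
        self._blocked_until = {}

    def block(self, key, retry_after, now=None):
        """Record that `key` was rejected and may retry in `retry_after` seconds."""
        now = time.monotonic() if now is None else now
        self._blocked_until[key] = now + retry_after

    def is_blocked(self, key, now=None):
        """Return True while a cached rejection for `key` is still valid."""
        now = time.monotonic() if now is None else now
        until = self._blocked_until.get(key)
        if until is None:
            return False
        if now >= until:
            # The rejection has expired; forget it and consult Redis again.
            del self._blocked_until[key]
            return False
        return True
```

The edge checks is_blocked first and answers 429 locally on a hit; only on a miss does it call the shared limiter.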

Wrap-up

Token bucket strikes a good balance between burst tolerance and steady-state control. A few lines of Redis Lua plus a thoughtful key scheme are enough to protect most services. The interesting work is in fault tolerance, observability, and the client experience when limits are hit.

Design the limiter as a small, dedicated module. You will reuse it across services for years.