System Design: Design a Rate Limiter

Intermediate 10 min read

What you'll learn

✓Compare token bucket, leaky bucket, and sliding window
✓Implement a distributed limiter with Redis
✓Choose between centralized and per-node counters
✓Decide where the limiter lives in your stack
✓Handle clock skew and bursts cleanly

Prerequisites

•Familiarity with HTTP APIs. See [REST API Design Best Practices](/blog/rest-api-design-best-practices).
•Basic understanding of caching and Redis.

Every public API has a rate limiter, even if it is a sloppy one. Stripe limits per key, GitHub limits per token, internal services limit to avoid noisy-neighbor problems. The job is to count events per identity within a window and reject the excess fast, ideally before any business logic runs.

Functional Requirements

Limit requests per identity (user, IP, API key) per time window.
Support multiple rules: per-second, per-minute, per-day.
Return HTTP 429 with Retry-After and X-RateLimit-* headers.
Allow per-route configuration.
Support burst allowance.

Non-Functional Requirements

Latency overhead: under 5 ms p99. The limiter is in the hot path of every request.
Throughput: 100k QPS aggregate across a fleet.
Accuracy: small over-allowance is acceptable; under-allowance (rejecting legitimate traffic) is not.
Availability: if the limiter fails, fail open or fail closed depending on the route. Public reads usually fail open, payment writes fail closed.

High-Level Architecture

Limiter middleware sits in the API gateway or service mesh sidecar.
Each request extracts an identity key, looks up the rule, and asks a counter store whether to allow or deny.
Counter store is Redis (or a Redis-compatible service) sharded by identity key.
Rules live in a config service with a local cache to avoid a round trip per request.

The limiter must be a separate concern from the business code. Mixing them invites bugs and makes load tests lie.

Data Model

The counter is the entire data model. For a fixed window:

key:   rl:{rule_id}:{identity}:{window_start}
value: integer count
ttl:   window size + small slack

For token bucket:

key:   rl:{rule_id}:{identity}
value: { tokens: float, last_refill_ts: int }
ttl:   long enough to avoid cold starts

Rule config:

{
  "rule_id": "public_read",
  "limit": 100,
  "window_seconds": 60,
  "algorithm": "token_bucket",
  "burst": 20
}

Key APIs

The user-facing API is whatever your service exposes. Internally:

POST /limiter/check
  body: rule_id, identity
  returns: allowed (bool), remaining, reset_at

Headers on the outer response:

X-RateLimit-Limit: 100
X-RateLimit-Remaining: 42
X-RateLimit-Reset: 1718000000
Retry-After: 17

Algorithms

Fixed window. Increment a counter keyed by identity:floor(ts/window). Cheap and simple but allows 2x burst at window boundaries.

Sliding window log. Store a sorted set of timestamps and trim expired entries on each request. Accurate but memory-heavy.

Sliding window counter. Keep the current window count and the previous window count, then weight the previous by how much of the sliding window it still covers. Good accuracy, low memory, the default I reach for.

Token bucket. Tokens refill at a steady rate. Each request consumes one token. Burst is the bucket size. Great for APIs that want smooth average rates with occasional spikes.

Leaky bucket. A queue drained at a constant rate. Smooths bursts into a steady stream. Useful when downstream cannot tolerate spikes at all.

Scaling and Tradeoffs

Centralized vs distributed counters. A single Redis cluster is simple and exact. The cost is a network hop on every request. For 5 ms budgets, this is fine if Redis is in the same availability zone.

Per-node approximate counters. Each app server tracks its own counts and syncs to a central store every N requests or every T milliseconds. Much faster but loses accuracy under load. Good for soft limits.

Sharding. Hash identity into N Redis shards. Hot identities (a single huge customer) can pin one shard — handle this by splitting that customer across sub-keys.

Atomicity. Use Redis INCR with EXPIRE carefully — the two commands are not atomic. A Lua script or SET with EX and NX is the right pattern. Token bucket needs a Lua script to read, refill, and decrement in one round trip.

Clock skew. Never trust client time. Use the server’s monotonic clock or Redis TIME. For multi-region, accept that two regions may slightly disagree and pick one as authoritative for global limits.

Where to enforce. API gateway is cheapest. Service mesh sidecar gives per-service granularity. Inside the application is the most flexible but also the slowest to deploy changes.

For background on why request middleware order matters and how containers package these limiters, see What is Docker and What is Kubernetes.

Fail mode. If Redis is down, public reads should fail open and log loudly. Anything money-related fails closed.

What to Say in an Interview

Name the algorithm before drawing anything. Most candidates say “rate limiter” without picking one.
Specify where the limiter runs: gateway, sidecar, or in-process. The interviewer wants to hear you have thought about latency.
Mention atomicity. Saying “INCR and EXPIRE are not atomic, so I use a Lua script” is the kind of detail that signals experience.
Cover the fail-open vs fail-closed choice explicitly per route.
Talk through the hot-identity problem and how you would split a whale customer across sub-keys.

Wrap up

A rate limiter is small, hot, and unforgiving. Pick sliding-window-counter or token-bucket as your default, put it in Redis with a Lua script, and run it at the gateway. The interesting questions are about atomicity, hot keys, and what to do when the limiter itself is down — the rest is bookkeeping.