REST API Throttling and Rate Limiting

Intermediate 9 min read

What you'll learn

✓ The difference between throttling and rate limiting
✓ Token bucket vs leaky bucket algorithms
✓ Standard rate-limit headers
✓ Per-user vs per-IP vs per-key strategies
✓ Communicating limits to clients

Prerequisites

• Familiar with HTTP

What and Why

Every public API gets hammered eventually. A buggy client retries in a tight loop, a scraper goes wide, or a free-tier user discovers your most expensive endpoint. Rate limiting is the shield: a policy that caps how many requests a client can send in a window.

Throttling is related but slightly different. Throttling smooths traffic by delaying or queueing requests; rate limiting outright rejects requests over the cap. Most production systems combine both.

The goal is fairness and stability. One noisy client should not degrade everyone else.

Mental Model

The two classic algorithms are token bucket and leaky bucket.

A token bucket holds up to N tokens. Each request consumes one. Tokens refill at a constant rate until the bucket is full again. If the bucket is empty, the request is rejected. A full bucket permits a short burst of up to N requests, while the refill rate bounds the long-run average.

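A leaky bucket holds incoming requests in a bounded queue and processes them at a fixed rate. Bursts are smoothed into a steady stream rather than rejected; a request is turned away only when the queue is already full.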
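As a rough illustration of the leaky bucket, the sketch below models it as a queue drained at a fixed rate. The class name `LeakyBucket`, the bounded `capacity`, and the caller-supplied `now` timestamps are assumptions made for this example; a real service would drain the queue from a worker or timer.

```python
from collections import deque


class LeakyBucket:
    """Bounded queue drained at a fixed rate (illustrative sketch)."""

    def __init__(self, capacity, leak_per_sec, now=0.0):
        self.capacity = capacity          # assumed bound on queued requests
        self.leak_per_sec = leak_per_sec  # requests processed per second
        self.queue = deque()
        self.last_leak = now

    def offer(self, request):
        """Queue a request; return False when the bucket is already full."""
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append(request)
        return True

    def drain(self, now):
        """Return the requests whose processing slot has arrived by `now`."""
        due = int((now - self.last_leak) * self.leak_per_sec)
        if due <= 0:
            return []
        # Advance by whole slots only, so fractional time carries over.
        self.last_leak += due / self.leak_per_sec
        count = min(due, len(self.queue))
        return [self.queue.popleft() for _ in range(count)]
```

With a capacity of 3 and a leak rate of 1 per second, a burst of five requests queues the first three and turns away the last two; the queued ones then leave one per second regardless of how quickly they arrived.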

Refill: +1 token / sec, capacity = 10

time   bucket before   request   bucket after   result
0s     [##########]    req ->    [######### ]   OK
0s     [######### ]    req ->    [########  ]   OK
...    8 more quick requests -> bucket empty
0s     [          ]    req ->                   REJECTED 429
1s     [#         ]    req ->    [          ]   OK
2s     [#         ]    req ->    [          ]   OK

Token bucket: bursts allowed, average bounded

Token bucket is the default for most REST APIs because it is simple and forgives small bursts.
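The token bucket rules above fit in a few lines of code. This is a minimal in-memory sketch for a single client, not a production limiter; the class name `TokenBucket` and the injectable `clock` parameter are assumptions introduced here so the behavior can be checked without waiting in real time.

```python
import time


class TokenBucket:
    """In-memory token bucket for one client (illustrative sketch)."""

    def __init__(self, capacity, refill_per_sec, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.clock = clock              # injectable for testing
        self.tokens = float(capacity)   # the bucket starts full
        self.updated = clock()

    def allow(self):
        """Consume one token if available; return True when the request passes."""
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        # Refill for the time elapsed, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        if self.tokens < 1:
            return False
        self.tokens -= 1
        return True


# Replay the timeline above with a fake clock: capacity 10, +1 token/sec.
fake_now = [0.0]
bucket = TokenBucket(10, 1, clock=lambda: fake_now[0])
burst = [bucket.allow() for _ in range(11)]   # ten pass, the eleventh is rejected
fake_now[0] = 1.0
after_one_second = bucket.allow()             # one token has refilled
```

Refilling lazily on each call, rather than with a background timer, keeps the state to two numbers per client: the token count and the time of the last update.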

Hands-on Example

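Imagine /v1/messages allows 60 requests per minute per API key. A token bucket with capacity 60 and a refill rate of 1 token per second fits this policy. Because a full bucket can be spent at once and then keeps refilling, a client can send close to 120 requests in its first minute; choose a smaller capacity if the 60-per-minute cap must be strict.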

When the client is within the limit, return the resource plus standard headers:

HTTP/1.1 200 OK
RateLimit-Limit: 60
RateLimit-Remaining: 42
RateLimit-Reset: 18

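The RateLimit-* fields come from an IETF draft and are gaining adoption; in that draft, RateLimit-Reset is the number of seconds until the quota resets. The older X-RateLimit-* variants are still common, and some APIs send a Unix timestamp in X-RateLimit-Reset instead. Pick one convention and document it.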

When the client exceeds the limit, return 429 Too Many Requests with Retry-After:

HTTP/1.1 429 Too Many Requests
RateLimit-Limit: 60
RateLimit-Remaining: 0
RateLimit-Reset: 12
Retry-After: 12
Content-Type: application/json

{
  "error": {
    "code": "rate_limited",
    "message": "Too many requests. Retry in 12 seconds."
  }
}
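The two responses above can be produced by a pair of small helpers. This is a sketch under assumptions: the function names `rate_limit_headers` and `too_many_requests` are hypothetical, and the limiter is assumed to report the limit, the remaining budget, and the seconds until reset.

```python
import json
import math


def rate_limit_headers(limit, remaining, reset_seconds):
    """Build the RateLimit-* headers sent on every response."""
    return {
        "RateLimit-Limit": str(limit),
        "RateLimit-Remaining": str(max(0, int(remaining))),
        # Round up so a client never retries a fraction of a second too early.
        "RateLimit-Reset": str(math.ceil(reset_seconds)),
    }


def too_many_requests(limit, reset_seconds):
    """Build (status, headers, body) for a request that exceeded the limit."""
    wait = math.ceil(reset_seconds)
    headers = rate_limit_headers(limit, 0, reset_seconds)
    headers["Retry-After"] = str(wait)
    headers["Content-Type"] = "application/json"
    body = json.dumps({
        "error": {
            "code": "rate_limited",
            "message": f"Too many requests. Retry in {wait} seconds.",
        }
    })
    return 429, headers, body
```

Keeping the header construction in one place makes it harder for the 200 and 429 paths to drift apart, for example by reporting different reset values.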

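A minimal Redis-backed token bucket in Python, assuming the redis-py client and a Redis server on localhost:

import time

import redis

r = redis.Redis(decode_responses=True)  # return str instead of bytes from hgetall

def allow(key, capacity=60, refill_per_sec=1):
    now = time.time()
    # A missing key returns {}, so a new client starts with a full bucket.
    state = r.hgetall(key) or {"tokens": capacity, "ts": now}
    elapsed = max(0.0, now - float(state["ts"]))
    # Refill for the time elapsed, capped at capacity.
    tokens = min(capacity, float(state["tokens"]) + elapsed * refill_per_sec)
    if tokens < 1:
        return False, tokens  # nothing to store; refill is derived from ts
    tokens -= 1
    r.hset(key, mapping={"tokens": tokens, "ts": now})
    r.expire(key, int(capacity / refill_per_sec) + 1)  # drop idle buckets
    return True, tokens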

Use a Lua script in production to make the read-modify-write atomic. Without it, two concurrent requests can read the same token count and both be allowed.

Common Pitfalls

Limiting by IP only. NAT, corporate proxies, and mobile carriers share IPs. Whole organizations get blocked. Prefer API key or user ID with IP as a secondary signal.

Ignoring request cost. A search query may cost ten times as much as a simple GET. Charge tokens in proportion to the work done, not a flat one per request.
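One way to charge by work is a cost table consulted when tokens are deducted. The sketch below is illustrative only: the costs, the /v1/search path, and the `charge` helper are hypothetical, and real values would come from measuring each endpoint.

```python
# Hypothetical per-endpoint costs in tokens; tune these to measured work.
ENDPOINT_COST = {
    ("GET", "/v1/messages"): 1,
    ("POST", "/v1/messages"): 2,
    ("GET", "/v1/search"): 10,
}
DEFAULT_COST = 1


def charge(tokens, method, path):
    """Deduct the endpoint's cost from `tokens`; return (allowed, tokens_left)."""
    cost = ENDPOINT_COST.get((method, path), DEFAULT_COST)
    if tokens < cost:
        return False, tokens      # not enough budget; leave the bucket untouched
    return True, tokens - cost
```

A bucket of 60 tokens then covers 60 plain reads but only 6 searches, which matches the load each client actually places on the backend.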

Forgetting to limit unauthenticated traffic. Login and signup endpoints are favorite targets. Apply stricter limits to these pre-authentication endpoints, where no API key is available yet to key the limit on.

Silent throttling. Slowing requests without telling the client looks like a server bug. Always communicate via headers or status codes.

Global limits only. A single shared bucket across users means one user can starve another. Almost always layer per-key and global limits together.
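Layering can be sketched as a list of buckets that must all have budget before any of them is charged. The names `Bucket` and `LayeredLimiter` and the default capacities are assumptions for illustration. Checking every layer before deducting avoids spending a client's per-key token on a request that the global limit then rejects.

```python
class Bucket:
    """Token bucket state with an explicit refill step (illustrative sketch)."""

    def __init__(self, capacity, refill_per_sec, now=0.0):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.updated = now

    def refill(self, now):
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated = now


class LayeredLimiter:
    """Per-key buckets plus one shared global bucket."""

    def __init__(self, per_key=(60, 1.0), global_=(1000, 100.0)):
        self.per_key = per_key                  # (capacity, refill_per_sec)
        self.global_bucket = Bucket(*global_)
        self.key_buckets = {}

    def allow(self, api_key, now):
        """Return (allowed, name_of_layer_that_tripped_or_None)."""
        if api_key not in self.key_buckets:
            self.key_buckets[api_key] = Bucket(*self.per_key, now=now)
        layers = [("per-key", self.key_buckets[api_key]),
                  ("global", self.global_bucket)]
        for _, bucket in layers:
            bucket.refill(now)
        # Check every layer first, so a rejection charges no bucket at all.
        for name, bucket in layers:
            if bucket.tokens < 1:
                return False, name
        for _, bucket in layers:
            bucket.tokens -= 1
        return True, None
```

Returning the name of the layer that tripped is optional, but it makes 429 logs far easier to interpret.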

Wrong status code. 429 is the correct response. 503 means the whole service is unavailable; reserve it for that.

Practical Tips

• Implement multiple tiers: per-key, per-IP, and global. Reject on the first one that trips.
• Use sliding windows or token buckets, not fixed windows. With a fixed window, a client can spend a full quota at the end of one window and another at the start of the next, so up to twice the limit passes within a short span.
• Expose limits in your docs and on the response. Clients that know their budget behave better.
• Add jitter to Retry-After suggestions so clients do not all retry at the same instant.
• Test your limiter under load. A limiter that makes several Redis round trips per request can itself become the bottleneck.
• Allowlist health checks and internal services so monitoring does not eat the budget.
• Track the 429 rate as a service-level indicator. A spike often signals a real client bug worth investigating.
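The boundary problem with fixed windows is easy to reproduce. The sketch below compares a fixed-window counter with a sliding-window log; both class names are hypothetical, and timestamps are passed in by the caller so the comparison does not depend on the wall clock.

```python
from collections import deque


class FixedWindow:
    """Counter that resets at each window boundary."""

    def __init__(self, limit, window):
        self.limit, self.window = limit, window
        self.current, self.count = None, 0

    def allow(self, now):
        index = int(now // self.window)
        if index != self.current:          # a new window starts with a fresh count
            self.current, self.count = index, 0
        if self.count >= self.limit:
            return False
        self.count += 1
        return True


class SlidingWindowLog:
    """Keeps the timestamps of accepted requests from the last `window` seconds."""

    def __init__(self, limit, window):
        self.limit, self.window = limit, window
        self.stamps = deque()

    def allow(self, now):
        # Forget requests that have aged out of the window.
        while self.stamps and self.stamps[0] <= now - self.window:
            self.stamps.popleft()
        if len(self.stamps) >= self.limit:
            return False
        self.stamps.append(now)
        return True


# 60 requests just before the minute boundary, 60 more just after it.
burst = [59.9] * 60 + [60.1] * 60
fixed = FixedWindow(limit=60, window=60)
sliding = SlidingWindowLog(limit=60, window=60)
fixed_allowed = sum(fixed.allow(t) for t in burst)        # 120
sliding_allowed = sum(sliding.allow(t) for t in burst)    # 60
```

The fixed window admits all 120 requests within 0.2 seconds, while the sliding log admits only the first 60. The log stores one timestamp per accepted request, so at high limits a token bucket or an approximate sliding counter is the cheaper choice.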

Wrap-up

Rate limiting is one of those features users only notice when it is missing or broken. Choose a token bucket for most APIs, return 429 with Retry-After and RateLimit-* headers, and key your limits by API key plus IP.

Done right, rate limiting is invisible to good citizens and decisive against bad ones. That is the whole job.