LLM Rate Limit and Retry Patterns
How to handle provider rate limits, transient failures, and quota exhaustion in production LLM apps with backoff, queues, and graceful degradation.
What you'll learn
- How LLM rate limits actually work
- Why naive retries make outages worse
- How to implement exponential backoff with jitter
- When to queue, when to shed load
- How to degrade gracefully across providers
Prerequisites
- Basic LLM API usage
- Familiar with HTTP error codes
What and Why
Every LLM provider rate-limits you. The limits are usually expressed as requests per minute and tokens per minute, and some providers meter input and output tokens separately. The first time you hit a limit is rarely at your traffic peak; the cause is usually a buggy loop, a retry storm, or a launch-day spike. How your code handles that moment determines whether your users see slow responses or a hard outage.
Resilience here is not exotic. It is the same retry, backoff, queue, and circuit-breaker patterns you would apply to any flaky external dependency, with LLM-specific tweaks for token accounting and idempotency.
Mental Model
Rate limits exist at three layers: the provider’s per-key quota, the provider’s global capacity (which can throttle you even within your quota during incidents), and your own internal budget. A robust client treats all three the same way: detect a limit, back off, queue if possible, and eventually shed load if the queue grows too long.
The cardinal sin is the retry storm. A naive loop such as while not success: retry(), run against a rate-limited endpoint, turns a brief 429 into a sustained outage and burns through your quota. Every retry must be backed off, jittered, and capped.
Hands-on Example
Consider a service that takes user requests and forwards them to an LLM. Under normal load it works fine. Then a viral moment hits and traffic triples in five minutes.
          request
             |
             v
    +----------------+
    |  token budget  |   reject early if quota gone
    +--------+-------+
             |
             v
    +----------------+
    |  primary call  |
    +--------+-------+
       |            |
       ok       429 / 5xx
       |            |
       v            v
    [done]   +---------------+
             |  retry w/     |
             |  exp backoff  |   base * 2^n + jitter
             |  + jitter     |
             +--------+------+
                      |
                still failing
                      |
                      v
             +----------------+
             | fallback model |   cheaper, different provider
             +--------+-------+
                      |
                still failing
                      |
                      v
             +----------------+
             | queue or       |   return 503 with retry hint
             | graceful fail  |
             +----------------+
The shape is the same as any resilient HTTP client, with two LLM-specific additions. First, the token budget check up front lets you fail fast when you know you have no quota. Second, the fallback path can route to a different model family entirely, which protects you from single-provider incidents.
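The up-front token budget check could look roughly like the sketch below. It assumes a simple fixed one-minute window and a caller-supplied token estimate; the class name TokenBudget and its parameters are hypothetical, and real provider accounting may use sliding windows or different counting rules.

```python
import time


class TokenBudget:
    """Fixed-window budget: at most `limit` tokens per `window` seconds."""

    def __init__(self, limit, window=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable so the class can be tested without waiting
        self.window_start = clock()
        self.used = 0

    def try_reserve(self, tokens):
        """Reserve `tokens` if the current window has room; return True on success."""
        now = self.clock()
        if now - self.window_start >= self.window:
            # A new window has started, so the budget resets.
            self.window_start = now
            self.used = 0
        if self.used + tokens > self.limit:
            return False  # fail fast: do not call the provider at all
        self.used += tokens
        return True
```

A request handler would call try_reserve with its estimated prompt plus completion tokens and return an error immediately when it gets False, instead of spending a network round trip to learn the same thing from a 429.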
Trade-offs
Aggressive retries hide failures from your monitoring. If every error is silently retried, you never see the underlying problem until the queue blows up. Always log the original error, even when you recover.
Queueing improves success rate at the cost of latency. A request that succeeds after thirty seconds in a queue is often worse than a request that fails fast and lets the client retry. Match queue depth to your latency budget.
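One way to tie queue depth to a latency budget is sketched below, using only the standard library. The class name BoundedRequestQueue and the max_depth and max_wait_s parameters are illustrative assumptions, not part of any provider SDK.

```python
import queue
import time


class BoundedRequestQueue:
    """Queue that sheds load when full and skips requests that waited too long."""

    def __init__(self, max_depth, max_wait_s, clock=time.monotonic):
        self._q = queue.Queue(maxsize=max_depth)
        self.max_wait_s = max_wait_s
        self.clock = clock

    def submit(self, request):
        """Return True if queued, False if shed because the queue is full."""
        try:
            self._q.put_nowait((self.clock(), request))
            return True
        except queue.Full:
            return False  # caller should answer 503 with a retry hint

    def next_fresh(self):
        """Return the next request still inside its latency budget, or None."""
        while True:
            try:
                enqueued_at, request = self._q.get_nowait()
            except queue.Empty:
                return None
            if self.clock() - enqueued_at <= self.max_wait_s:
                return request
            # Stale entry: the client has probably given up, so skip it.
```

Rejecting at submit time keeps the queue from growing without bound, and skipping stale entries at dequeue time avoids spending quota on answers nobody is waiting for.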
Cross-provider fallback adds resilience but doubles your prompt-engineering work. The fallback model behaves differently and may need a different prompt. Test the fallback path regularly; the worst time to discover it is broken is during the outage you built it for.
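A fallback chain that lets each model carry its own prompt wording might be sketched as follows. The provider callables here are stand-in stubs (primary simulates an outage), and call_with_fallback is a hypothetical helper, not a real SDK function.

```python
class AllProvidersFailed(Exception):
    """Raised when every provider in the chain has failed."""


def call_with_fallback(user_input, providers):
    """Try each (name, build_prompt, call) in order; return (name, reply).

    `build_prompt` gives each model its own prompt wording.
    """
    errors = []
    for name, build_prompt, call in providers:
        try:
            return name, call(build_prompt(user_input))
        except Exception as exc:  # narrow this to your clients' error types
            errors.append((name, exc))
    raise AllProvidersFailed(errors)


def primary(prompt):
    raise TimeoutError("primary unavailable")  # simulated provider incident


def fallback(prompt):
    return "echo: " + prompt  # stub standing in for a second provider


providers = [
    ("primary", lambda text: "Answer concisely: " + text, primary),
    ("fallback", lambda text: "Q: " + text + "\nA:", fallback),
]
```

Because the stubs are plain callables, the same structure doubles as a test harness: forcing the first entry to raise on a schedule is a cheap way to exercise the fallback path regularly.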
Practical Tips
Use exponential backoff with full jitter, not fixed intervals. Synchronized retries cause thundering herds. A formula like sleep = random(0, base * 2^attempt) spreads retries out and lets the provider recover.
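As a minimal sketch of that formula, with an added upper bound so the delay cannot grow indefinitely: the base and cap defaults below are illustrative assumptions, not recommended values.

```python
import random


def full_jitter_delay(attempt, base=0.5, cap=30.0, rng=random.random):
    """Return a delay in seconds drawn uniformly from [0, min(cap, base * 2**attempt))."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling
```

With base=0.5, the ceiling doubles from 0.5 s at attempt 0 to 1 s, 2 s, 4 s, and so on until it reaches the cap; the actual sleep is a random point below that ceiling, so clients that failed at the same instant do not retry at the same instant.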
Cap the number of retries. Three to five attempts is plenty for transient errors. Anything more is a queue in disguise; make it an explicit queue.
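A capped retry loop could be sketched like this. TransientError is a hypothetical stand-in for whatever retryable exception your client raises, and the default of four attempts is only an example within the range suggested above.

```python
import logging
import random
import time

logger = logging.getLogger("llm_client")


class TransientError(Exception):
    """Hypothetical stand-in for a retryable failure (429, 5xx, timeout)."""


def call_with_retries(call, max_attempts=4, base=0.5, cap=30.0, sleep=time.sleep):
    """Call `call()` up to `max_attempts` times, backing off between failures."""
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientError as exc:
            # Log the original error even if a later attempt recovers.
            logger.warning("attempt %d/%d failed: %r", attempt + 1, max_attempts, exc)
            if attempt == max_attempts - 1:
                raise  # attempts exhausted: surface the failure to the caller
            # Full jitter: sleep a random time below the capped exponential ceiling.
            sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

The sleep function is a parameter so tests can run without real delays; once the attempts are used up the error propagates, which is the point where an explicit queue or fallback should take over.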
Honor the Retry-After header when the provider sends one. Providers know their recovery timeline better than your backoff formula does.
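The HTTP specification allows Retry-After to be either a number of seconds or an HTTP date, so a parser has to handle both. The helper below is a sketch using only the standard library; the function name retry_after_seconds is an assumption, and some providers may also send their own non-standard rate-limit headers, which this does not cover.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, now=None):
    """Parse a Retry-After value (delta-seconds or HTTP-date) into seconds, or None."""
    if header_value is None:
        return None
    value = header_value.strip()
    if value.isdigit():
        return float(value)  # delta-seconds form, e.g. "120"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None  # unparseable: fall back to the computed backoff
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

A reasonable policy is to sleep for the larger of this value and your own jittered backoff, so the client never retries earlier than the provider asked.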
Treat 429 (rate limit) and 5xx (server error) as different categories. 429 means slow down; 5xx means try again or fail over. Mixing them up leads to hammering a rate-limited endpoint when you should back off, or waiting on a failing provider when you should fail over.
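The split can be made explicit with a small classifier. The Action names are hypothetical, and treating every other non-2xx status as non-retryable is a simplifying assumption of this sketch.

```python
from enum import Enum


class Action(Enum):
    SUCCESS = "success"
    SLOW_DOWN = "slow_down"          # 429: back off and reduce send rate
    RETRY_OR_FAILOVER = "retry"      # 5xx: retry briefly, then fail over
    FAIL = "fail"                    # anything else: retrying will not help


def classify(status_code):
    """Map an HTTP status code to the action the client should take."""
    if 200 <= status_code < 300:
        return Action.SUCCESS
    if status_code == 429:
        return Action.SLOW_DOWN
    if 500 <= status_code < 600:
        return Action.RETRY_OR_FAILOVER
    return Action.FAIL
```

Routing on the returned action rather than on raw status codes keeps the backoff logic and the failover logic from leaking into each other.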
Make requests idempotent where possible. Pass an idempotency key where the API supports one, and deduplicate on your own side before triggering side effects, so a duplicate request from a retry does not duplicate a side effect.
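On your own side of the call, deduplication by key can be sketched as below. SideEffectGuard is a hypothetical in-memory helper; a production version would need shared, persistent storage and key expiry, and whether a given provider accepts an idempotency key is something to check in its documentation.

```python
import uuid


class SideEffectGuard:
    """Run a side effect at most once per idempotency key (in-memory sketch)."""

    def __init__(self):
        self._results = {}

    def run_once(self, key, effect):
        """Run `effect()` the first time `key` is seen; replay its result afterwards."""
        if key not in self._results:
            self._results[key] = effect()
        return self._results[key]


def new_idempotency_key():
    """Create one key per logical request; reuse it on every retry."""
    return str(uuid.uuid4())
```

The key must be generated once, outside the retry loop, so that every attempt for the same logical request carries the same value.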
Track retry rate as a first-class metric. A rising retry rate is an early warning of provider trouble or of a runaway client.
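A sliding-window retry rate can be computed with very little code. The tracker below is a sketch with hypothetical names; in practice you would more likely emit two counters to your existing metrics system and compute the ratio there.

```python
import time
from collections import deque


class RetryRateTracker:
    """Fraction of calls in the last `window_s` seconds that were retries."""

    def __init__(self, window_s=60.0, clock=time.monotonic):
        self.window_s = window_s
        self.clock = clock
        self._events = deque()  # (timestamp, was_retry) pairs, oldest first

    def record(self, was_retry):
        """Record one outgoing call and whether it was a retry."""
        self._events.append((self.clock(), was_retry))

    def rate(self):
        """Return retries / total calls within the window, or 0.0 if no calls."""
        cutoff = self.clock() - self.window_s
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()  # drop events older than the window
        if not self._events:
            return 0.0
        return sum(1 for _, r in self._events if r) / len(self._events)
```

Alerting on this ratio rather than on raw error counts makes the signal independent of traffic volume.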
Wrap-up
Rate limits are not a bug to work around; they are a contract to design with. Build backoff, jitter, queues, and fallbacks into your LLM client from day one, and treat retry rate as a signal you watch as carefully as latency or error rate. The systems that stay up during the next provider incident are the ones that were already prepared for one.