The Circuit Breaker Pattern Explained

Intermediate 9 min read

What you'll learn

✓ Why dependencies fail in clusters
✓ The three states of a circuit breaker
✓ How to tune thresholds and timeouts
✓ When to combine with retries and bulkheads
✓ Common monitoring signals

Prerequisites

• Familiarity with HTTP and databases

What and Why

When a downstream service is unhealthy, retrying every call piles work on top of an already failing system. Threads block, queues fill, and the failure spreads upstream. A circuit breaker stops this cascade by failing fast once a dependency looks unhealthy.

The pattern comes from electrical engineering: trip the breaker when current spikes; reset after a cool-down.

Mental Model

A breaker wraps each call to a remote dependency and tracks recent successes and failures. It has three states:

Closed: calls flow through. Failures are counted.
Open: calls fail immediately for a cool-down period.
Half-open: a few trial calls pass through. If they succeed, close the breaker; if they fail, open it again.
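
The three states can be sketched as a small state machine. This is a minimal, single-threaded illustration and not the implementation of any particular library: the class name, the parameter names, the consecutive-failure counting, and the injectable clock are all assumptions made for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Hypothetical breaker that trips on consecutive failures."""

    def __init__(self, failure_threshold=5, cooldown_seconds=30.0,
                 trial_calls=3, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.trial_calls = trial_calls
        self.clock = clock  # injectable so tests can control time
        self.state = "closed"
        self.failures = 0
        self.trial_successes = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.cooldown_seconds:
                raise CircuitOpenError("failing fast during cool-down")
            # Cool-down expired: move to half-open and allow trial calls.
            self.state = "half_open"
            self.trial_successes = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # Any failed trial call reopens the breaker immediately.
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
            self.failures = 0

    def _on_success(self):
        if self.state == "half_open":
            self.trial_successes += 1
            if self.trial_successes >= self.trial_calls:
                self.state = "closed"
                self.failures = 0
        else:
            self.failures = 0  # a success resets the consecutive-failure count
```

A caller would wrap each remote call as `breaker.call(fetch_profile, user_id)`, where `fetch_profile` is a placeholder for your own client function. A production breaker also needs locking and a cap on concurrent trial calls, which this sketch leaves out.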

Architecture

[Closed] --failures reach threshold--> [Open] <----------------------+
 ^                                       |                           |
 |                               cool-down expires                   |
 |                                       v                           |
 +---------trials succeed---------- [Half-Open] --a trial fails------+

Circuit breaker states

Three settings control the breaker: the failure rate or count that opens it, the cool-down duration, and the number of trial calls allowed while half-open. A typical starting point: open at 50% failures over 20 calls, cool down 30 seconds, allow 3 trial calls.
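
As a rough illustration of the starting point above, the following sketch holds the settings in a config object and computes the failure rate over the most recent calls. The names `BreakerConfig` and `FailureWindow` are hypothetical, and waiting for a full window before tripping is a design assumption, not a rule from any library.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class BreakerConfig:
    failure_rate_threshold: float = 0.5  # open at 50% failures
    window_size: int = 20                # measured over the last 20 calls
    cooldown_seconds: float = 30.0       # time spent open before trial calls
    trial_calls: int = 3                 # trial calls allowed while half-open


class FailureWindow:
    """Tracks the outcomes of the most recent calls (True means failure)."""

    def __init__(self, config):
        self.config = config
        self.outcomes = deque(maxlen=config.window_size)

    def record(self, failed):
        self.outcomes.append(failed)

    def should_open(self):
        # Wait for a full window so one early failure cannot trip the breaker.
        if len(self.outcomes) < self.config.window_size:
            return False
        rate = sum(self.outcomes) / len(self.outcomes)
        return rate >= self.config.failure_rate_threshold
```

With the defaults, 10 failures in the last 20 calls opens the breaker and 9 does not.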

Libraries such as Resilience4j (Java) and Polly (.NET) implement this for you, as did Hystrix, which Netflix has since placed in maintenance mode. Proxies and service meshes such as Envoy and Istio offer circuit breaking too.

Trade-offs

A trigger-happy breaker trips on minor blips, turning brief hiccups into outright failures and confusing operators. A sleepy one fails to protect during a real incident. Tune with production data, not guesses.

Retries inside a circuit breaker can amplify load when the breaker is closed but the dependency is degraded. Use jittered backoff and limit total attempts.
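
One way to bound retries is sketched below, assuming "full jitter" exponential backoff and a fixed attempt budget. The function name and every parameter are illustrative. `give_up_on` is a hypothetical hook for exception types that should never be retried, such as the error your breaker raises when it is open.

```python
import random
import time


def retry_with_jitter(fn, max_attempts=3, base_delay=0.1, max_delay=2.0,
                      give_up_on=(), sleep=time.sleep, rng=random.random):
    """Call fn, retrying failures with jittered exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except give_up_on:
            raise  # e.g. the breaker is open: retrying only adds load
        except Exception:
            if attempt == max_attempts:
                raise  # attempt budget exhausted: surface the last error
            # Full jitter: sleep a random time up to the capped exponential delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(rng() * cap)
```

The jitter spreads retries from many callers over time so they do not arrive in synchronized waves, and the attempt cap keeps the worst-case load multiplier at `max_attempts`.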

Per-instance breakers protect each caller but cannot share signal across the fleet. A global breaker (in a central store or service mesh) costs more to build and operate, but it reacts faster because it pools failure signal from every caller.

Practical Tips

Pair breakers with timeouts. Without a timeout, slow calls accumulate and your service falls over even though the breaker is technically closed. Every remote call needs a hard ceiling.
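
A hard ceiling can be sketched with the standard library's thread pool, as below. The helper name, pool size, and `slow_dependency` are assumptions for illustration. Note that this only stops the caller from waiting: the worker thread keeps running until the call returns, so prefer your client's own connect and read timeouts where they exist.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

_pool = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(fn, timeout_seconds):
    """Return fn() or raise TimeoutError once timeout_seconds has elapsed."""
    future = _pool.submit(fn)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        future.cancel()  # only has an effect if the task has not started yet
        raise TimeoutError(f"call exceeded {timeout_seconds}s") from None


def slow_dependency():
    time.sleep(0.5)  # stands in for a hung remote call
    return "late"
```

Calling `call_with_timeout(slow_dependency, 0.05)` raises `TimeoutError`, which a breaker can then count as a failure like any other.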

Add bulkheads: isolate calls to different dependencies in separate thread pools or semaphores. One sick dependency should not exhaust the resources of healthy callers.
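
A semaphore-based bulkhead might look like the sketch below. The class name, the error type, the dependency names, and the slot counts are all hypothetical. Rejecting immediately when no slot is free, instead of queueing, is a design choice made for this example.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a dependency's concurrency budget is exhausted."""


class Bulkhead:
    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Reject immediately instead of queueing behind a sick dependency.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free slots for this dependency")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()


# One bulkhead per dependency, so a slow "payments" cannot starve "search".
bulkheads = {"payments": Bulkhead(10), "search": Bulkhead(20)}
```

Each dependency gets its own budget, so in-flight calls to one can never consume the slots reserved for another.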

Surface breaker state in metrics: open count, trip rate, time spent open. These often reveal trouble in a dependency before it grows into an incident for upstream callers.
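
The signals above could be collected with a small recorder like the one below. This is a sketch under assumptions: the class, hook names, and metric names are invented for illustration, and a real service would export the values through its own metrics client.

```python
import time


class BreakerMetrics:
    """Hypothetical recorder driven by a breaker's state transitions."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.trips = 0
        self.seconds_open = 0.0
        self._opened_at = None

    def on_open(self):
        # Call on every transition into the open state.
        self.trips += 1
        self._opened_at = self.clock()

    def on_leave_open(self):
        # Call when the cool-down expires and the breaker goes half-open.
        if self._opened_at is not None:
            self.seconds_open += self.clock() - self._opened_at
            self._opened_at = None

    def snapshot(self):
        open_now = self._opened_at is not None
        total = self.seconds_open
        if open_now:
            total += self.clock() - self._opened_at  # include the current open spell
        return {
            "trips_total": self.trips,    # rate of change gives the trip rate
            "open": int(open_now),        # sum across instances for the open count
            "seconds_open_total": total,  # time spent open
        }
```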

Test breakers with fault injection. A breaker you have never seen trip is a breaker you cannot trust.
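
A simple way to inject faults in a test is to wrap the dependency so that a chosen fraction of calls fail. The helper below is a hypothetical sketch, not part of any fault-injection tool: `with_faults`, `InjectedFault`, and the seeded random generator are assumptions for the example.

```python
import random


class InjectedFault(Exception):
    """Synthetic failure raised by the fault injector."""


def with_faults(fn, failure_rate, rng=None):
    """Wrap fn so that roughly failure_rate of calls raise InjectedFault."""
    rng = rng or random.Random()

    def wrapped(*args, **kwargs):
        # rng.random() is in [0.0, 1.0), so 1.0 always fails and 0.0 never does.
        if rng.random() < failure_rate:
            raise InjectedFault("injected failure")
        return fn(*args, **kwargs)

    return wrapped
```

In a test, wrap the real client with `failure_rate=1.0`, drive calls through the breaker, and assert that it opens and later recovers once the rate drops back to `0.0`. Passing a seeded `random.Random` keeps intermediate rates reproducible.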

Wrap-up

Circuit breakers are simple state machines that prevent a small fire from becoming a wildfire. Drop one in front of every remote dependency, tune thresholds with real data, and pair with timeouts and bulkheads. Your incidents get shorter, your dashboards get clearer, and your team gets to sleep through more nights.