System Design: Design a Load Balancer

Intermediate 11 min read

What you'll learn

✓ Distinguish L4 and L7 load balancing
✓ Pick a balancing algorithm for your workload
✓ Run health checks without false positives
✓ Terminate TLS and handle session affinity
✓ Scale the load balancer itself

Prerequisites

• Familiarity with TCP and HTTP.
• Comfort with DNS basics. See [What is AWS](/blog/what-is-aws).

A load balancer takes one virtual IP and turns it into N healthy backends. Easy at small scale, surprisingly subtle at large scale: TLS termination, connection draining, hot keys, sticky sessions, and the load balancer’s own scaling all become real problems.

Functional Requirements

Distribute incoming traffic across N backend servers.
Health check backends and remove unhealthy ones from rotation.
Support TLS termination.
Support HTTP routing rules (path, host, header).
Support sticky sessions when needed.

Non-Functional Requirements

Throughput: 10M packets per second per node, 1M HTTP requests per second per cluster.
Latency overhead: under 1 ms for L4, under 5 ms for L7.
Availability: 99.999 percent — the LB is in the critical path of every request.
New connection rate: hundreds of thousands per second.
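
To make the availability target concrete, the short calculation below converts a percentage into a yearly downtime budget. It is a back-of-the-envelope sketch: the function name is illustrative and the year is assumed to have 365 days.

```python
# Downtime budget implied by an availability target.
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: ignores leap years


def downtime_budget_seconds(availability_percent: float) -> float:
    """Seconds per year the service may be unavailable at this target."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    minutes = downtime_budget_seconds(target) / 60
    print(f"{target}% allows {minutes:.1f} minutes of downtime per year")
```

At 99.999 percent the budget is a little over five minutes per year, which is why the LB cannot rely on manual failover.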

High-Level Architecture

Anycast or DNS round-robin spreads traffic across LB clusters in multiple regions.
ECMP (equal-cost multipath) at the network layer spreads packets across LB nodes within a cluster.
LB nodes pick a backend per connection (L4) or per request (L7), then forward.
A control plane pushes config: backend lists, health, routing rules.
A health checker probes backends and updates the data plane.

Layer 4 vs Layer 7

L4 (TCP/UDP). The LB sees connections, not requests. It forwards bytes based on the 5-tuple (source IP, source port, destination IP, destination port, protocol). Cheap, fast, indifferent to the protocol on top.

L7 (HTTP/gRPC). The LB parses requests and routes by path, host, header, or cookie. More expensive — terminates TCP, often terminates TLS, re-establishes a connection to the backend. Lets you do path-based routing, header rewrites, retries, rate limiting, and auth.

In practice you stack them: L4 in front, L7 behind. The L4 layer absorbs raw traffic and DDoS; the L7 layer does smart routing.
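
The difference in what each layer can see is easier to spot in code. The sketch below is illustrative only: `FiveTuple`, `pick_l4`, `pick_l7`, the `ROUTES` table, and the pool names are all hypothetical, and the L4 pick uses a plain modulo hash for brevity (which reshuffles most flows when the backend list changes).

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class FiveTuple:
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str  # "tcp" or "udp"


def pick_l4(flow: FiveTuple, backends: list) -> str:
    """L4: hash the 5-tuple so every packet of a connection maps to one backend."""
    key = f"{flow.src_ip}:{flow.src_port}-{flow.dst_ip}:{flow.dst_port}-{flow.protocol}"
    digest = hashlib.sha256(key.encode()).digest()
    return backends[int.from_bytes(digest[:8], "big") % len(backends)]


# L7: ordered (host, path prefix, pool) rules; first match wins.
# A host of None matches any host. All names here are made up.
ROUTES = [
    ("api.example.com", "/v1/", "api-pool"),
    (None, "/static/", "static-pool"),
    (None, "/", "web-pool"),
]


def pick_l7(host: str, path: str) -> str:
    """L7: route on parsed request attributes, which L4 never sees."""
    for rule_host, prefix, pool in ROUTES:
        if rule_host in (None, host) and path.startswith(prefix):
            return pool
    raise LookupError("no route matched")
```

The L4 function never looks past the transport headers, while the L7 function needs a fully parsed request before it can decide anything.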

Algorithms

Round robin. Simple, ignores load. Fine when backends are uniform.
Least connections. Send to the backend with the fewest open connections. Good for long-lived connections.
Weighted round robin. Account for unequal backend capacity.
Power of two choices. Pick two backends at random, send to the less-loaded. Surprisingly close to optimal, dirt cheap.
Consistent hashing. Hash a request attribute (user ID, cache key) to a backend. Stable across small membership changes — critical for caches and stateful backends.
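
As a rough illustration of two of the algorithms above, the sketch below implements least connections and power of two choices over a dictionary of open-connection counts. The function names, backend names, and the simulation (10,000 connections that never close, fixed seed) are assumptions made for the example.

```python
import random


def least_connections(open_conns: dict) -> str:
    """Scan every backend and return the one with the fewest open connections."""
    return min(open_conns, key=open_conns.get)


def power_of_two_choices(open_conns: dict, rng: random.Random) -> str:
    """Sample two distinct backends at random and return the less loaded one."""
    a, b = rng.sample(list(open_conns), 2)
    return a if open_conns[a] <= open_conns[b] else b


# Simulate 10,000 connections that never close and look at the spread.
rng = random.Random(42)
conns = {f"backend-{i}": 0 for i in range(10)}
for _ in range(10_000):
    conns[power_of_two_choices(conns, rng)] += 1

spread = max(conns.values()) - min(conns.values())
print(f"gap between busiest and idlest backend: {spread}")
```

Power of two choices reads only two counters per decision, so it stays cheap even when the counters are slightly stale or spread across many LB nodes.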

Default to least-connections for HTTP and consistent hashing for stateful services.
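
The "stable across small membership changes" property of consistent hashing can be checked with a small ring. This is a minimal sketch, not a production implementation: the `HashRing` class, the choice of SHA-256, and the 100 virtual nodes per backend are all assumptions.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Consistent hash ring with virtual nodes to even out the key distribution."""

    def __init__(self, backends, vnodes: int = 100):
        self._ring = sorted(
            (_hash(f"{b}#{i}"), b) for b in backends for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def lookup(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


keys = [f"user-{i}" for i in range(10_000)]
before = HashRing(["a", "b", "c", "d"])
after = HashRing(["a", "b", "c", "d", "e"])
moved = [k for k in keys if before.lookup(k) != after.lookup(k)]
print(f"{len(moved) / len(keys):.0%} of keys moved after adding one backend")
```

Only keys that now belong to the new backend move; with a modulo hash, most keys would be remapped instead.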

Health Checks

Active checks: LB periodically probes a backend’s /healthz. Mark unhealthy after K failures, healthy again after M successes. Hysteresis prevents flapping.
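
The K-failures / M-successes rule can be sketched as a small state machine. The class and parameter names below are hypothetical, and the sketch assumes the thresholds count consecutive probe results, which the text above does not specify.

```python
class HealthState:
    """Tracks one backend. The state flips only after K consecutive failed
    probes (healthy to unhealthy) or M consecutive passing probes (back)."""

    def __init__(self, fail_threshold: int = 3, rise_threshold: int = 2):
        self.fail_threshold = fail_threshold  # K
        self.rise_threshold = rise_threshold  # M
        self.healthy = True
        self._streak = 0  # consecutive results that contradict the current state

    def record(self, probe_ok: bool) -> bool:
        """Feed one probe result and return the (possibly updated) state."""
        if probe_ok == self.healthy:
            self._streak = 0  # result agrees with the current state
            return self.healthy
        self._streak += 1
        threshold = self.rise_threshold if probe_ok else self.fail_threshold
        if self._streak >= threshold:
            self.healthy = probe_ok
            self._streak = 0
        return self.healthy
```

A single failed probe between successes resets the streak, so one dropped packet does not pull a backend out of rotation.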

Passive checks: track real request outcomes (5xx rate, timeouts). Faster signal than active probes.
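
One possible shape for a passive check is a sliding window over recent request outcomes. The `PassiveMonitor` class, its thresholds, and the convention that a status of `None` means a timeout are assumptions for illustration.

```python
from collections import deque
from typing import Optional


class PassiveMonitor:
    """Ejects a backend when too many of its recent real requests failed."""

    def __init__(self, window: int = 100, max_error_rate: float = 0.5,
                 min_samples: int = 20):
        self._failures = deque(maxlen=window)  # True means the request failed
        self.max_error_rate = max_error_rate
        self.min_samples = min_samples  # avoid deciding on too little traffic

    def observe(self, status_code: Optional[int]) -> None:
        """Record one outcome; None stands for a timeout (an assumption here)."""
        self._failures.append(status_code is None or status_code >= 500)

    def should_eject(self) -> bool:
        if len(self._failures) < self.min_samples:
            return False
        return sum(self._failures) / len(self._failures) > self.max_error_rate
```

The minimum sample count matters on low-traffic backends, where two failed requests out of three would otherwise look like a 67 percent error rate.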

Pitfall: a /healthz that does too much work (touches the database, calls dependencies) fails on every backend at once when a downstream blips, so the LB marks the whole healthy fleet as unhealthy. Keep healthchecks shallow; use a separate readiness probe for deeper checks.

TLS Termination

LB owns the certificate, decrypts the request, and forwards plaintext to the backend on a private network. This saves backends from certificate management and crypto cost. For sensitive workloads, re-encrypt the hop to the backend and use mTLS so the backend can still validate the LB’s identity.

For very high throughput, use hardware crypto offload to cut per-connection CPU cost, and TLS session resumption to avoid full handshakes for returning clients.

Session Affinity

Some backends are stateful (in-memory session, sticky cache warming). Affinity options:

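Cookie-based. LB inserts a cookie naming the backend. Explicit, and it survives client IP changes.
Source IP hash. Maps client IP to backend. Breaks under NATs (many clients share one IP and pile onto one backend) and mobile networks (the client’s IP changes mid-session).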
Consistent hashing. Hash on user ID from a header.

Cookie is usually right for web. Consistent hashing is right for service-to-service.
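
Cookie-based affinity can be sketched in a few lines. The cookie name `lb_backend` and the function below are hypothetical; a real LB would typically store an opaque or signed backend ID rather than the raw backend name used here.

```python
import random
from typing import Optional, Tuple

COOKIE_NAME = "lb_backend"  # hypothetical cookie name


def route_with_affinity(cookies: dict, healthy: list,
                        rng: random.Random) -> Tuple[str, Optional[str]]:
    """Return (backend, set_cookie_header).

    The header is None when the existing cookie is honored. If the pinned
    backend is missing or unhealthy, pick a new one and re-pin the client.
    """
    pinned = cookies.get(COOKIE_NAME)
    if pinned in healthy:
        return pinned, None
    backend = rng.choice(healthy)
    return backend, f"{COOKIE_NAME}={backend}; Path=/; HttpOnly"
```

The fallback branch is what keeps affinity resilient: a client pinned to a dead backend is moved to a healthy one instead of failing.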

Scaling the Load Balancer

A single LB node is a single point of failure. Run a fleet behind anycast or ECMP.

Connection draining. When removing a backend, stop sending new connections, let existing ones finish, then close. Same for the LB itself during deploy.
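
The draining sequence (stop new connections, wait for existing ones, then close) might look like the sketch below. `DrainingBackend` is a hypothetical name, the drain timeout is an added assumption, and the class is not thread-safe; a real LB would guard the counter with a lock or atomics.

```python
import time


class DrainingBackend:
    """Minimal drain sketch for one backend (single-threaded illustration)."""

    def __init__(self, name: str):
        self.name = name
        self.active = 0  # connections currently open
        self.draining = False

    def try_acquire(self) -> bool:
        """Called for each new connection; refused once draining starts."""
        if self.draining:
            return False
        self.active += 1
        return True

    def release(self) -> None:
        """Called when a connection finishes."""
        self.active -= 1

    def drain(self, timeout_s: float, poll_s: float = 0.05) -> bool:
        """Stop accepting, then wait until idle or until the timeout expires.

        Returns True if every connection finished on its own. On False the
        caller is expected to force-close whatever remains.
        """
        self.draining = True
        deadline = time.monotonic() + timeout_s
        while self.active > 0 and time.monotonic() < deadline:
            time.sleep(poll_s)
        return self.active == 0
```

The timeout bounds how long a deploy can be held up by one long-lived connection.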

State sharing. Connection tables can be huge. Two designs:

Stateful pair: active-passive with state replication.
Stateless: each LB node makes routing decisions from a shared config. State is recreated on connection setup.

Modern designs (Google Maglev, AWS Hyperplane) are stateless with consistent hashing so any node can handle any packet.
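
To show how independent nodes can agree on a backend without sharing a connection table, the sketch below uses rendezvous (highest-random-weight) hashing. This technique is swapped in for brevity and is not the lookup-table algorithm Maglev itself uses; the function name and flow key format are assumptions.

```python
import hashlib


def pick_backend(flow_key: str, backends: list) -> str:
    """Rendezvous hashing: score every backend against the flow and take the
    highest score. The result depends only on the flow and the backend set."""

    def score(backend: str) -> bytes:
        return hashlib.sha256(f"{backend}|{flow_key}".encode()).digest()

    return max(backends, key=score)


flow = "10.0.0.7:51234->203.0.113.10:443/tcp"
node_a_view = ["b1", "b2", "b3"]
node_b_view = ["b3", "b1", "b2"]  # same backends, different local ordering
print(pick_backend(flow, node_a_view) == pick_backend(flow, node_b_view))
```

Because each decision is a pure function of the flow and the configured backend set, a packet that ECMP sends to a different LB node still reaches the same backend.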

DDoS. Push SYN cookies and rate limiting to the L4 layer. Block at the edge before WAF and L7 logic run.
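
Rate limiting at this layer is often described in terms of a token bucket. The sketch below is one minimal version, with a hypothetical `TokenBucket` class; the caller passes in the current time (for example from `time.monotonic()`) so the logic is easy to test.

```python
class TokenBucket:
    """Allows `rate_per_s` events per second with bursts of up to `burst`."""

    def __init__(self, rate_per_s: float, burst: float, now: float = 0.0):
        self.rate = rate_per_s
        self.burst = burst
        self.tokens = burst  # start full so an initial burst is allowed
        self.last = now

    def allow(self, now: float) -> bool:
        """Refill based on elapsed time, then spend one token if available."""
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

In practice a limiter like this would be keyed per source IP or per prefix, so one noisy client cannot spend the budget of the others.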

Multi-region. DNS-level geo routing sends users to the nearest cluster. Fail over by shifting weighted records away from the unhealthy region; keep TTLs low ahead of time (60 seconds is reasonable) so resolvers pick up the change quickly.

For deployment, the LB is typically a managed service like ALB/NLB, an Envoy fleet, or HAProxy on dedicated nodes. On Kubernetes, ingress controllers play this role — see What is Kubernetes. For broader context on REST-based health endpoints, see REST API Design Best Practices.

What to Say in an Interview

Open by separating L4 and L7. Most candidates blur them.
Pick least-connections or power-of-two-choices, not bare round robin, and say why.
Explain healthcheck hysteresis. Saying “shallow healthcheck, separate deep readiness” signals operational experience.
Cover connection draining on backend removal and LB redeploy.
Mention anycast or ECMP for scaling the LB itself. Otherwise the LB becomes the very bottleneck you just removed from the backends.

Wrap up

A load balancer is a routing decision plus a healthcheck plus a way to scale itself. Stack L4 and L7, default to least-connections, use consistent hashing for state, terminate TLS at the edge, and drain connections on shutdown. The interview-worthy detail is always healthchecks and self-scaling — get those right and the rest is mechanical.