Kubernetes Readiness vs Liveness Probes: A Practical Guide

Intermediate 9 min read

What you'll learn

✓ The difference between readiness and liveness probes
✓ When to use HTTP, TCP, or exec probes
✓ How to tune timeouts, periods, and thresholds
✓ How probes affect service traffic and pod restarts
✓ Production patterns that avoid cascading restarts

Prerequisites

• Familiarity with YAML and containers

What and Why

Kubernetes provides three probe types to keep your workloads healthy: liveness, readiness, and startup. Liveness probes answer “is this container still alive, or should I kill and restart it?” Readiness probes answer “is this container ready to serve traffic right now?” They look similar in YAML but produce very different behaviors.

Without readiness probes, traffic hits pods before they finish warming caches or opening database connections, causing 5xx errors during rollouts. Without liveness probes, deadlocked processes silently consume capacity. Used together they make rolling updates smooth and self-healing reliable.

Mental Model

Think of readiness as a switch on the Service load balancer. When a readiness probe fails failureThreshold times in a row, the kubelet marks the pod as not ready, and the control plane then removes that pod’s IP from the Service’s ready endpoints. Traffic stops, but the container keeps running. When the probe passes again, traffic resumes.

Liveness is a restart trigger. When a liveness probe fails the configured number of times, the kubelet kills the container and lets the restart policy bring it back. Liveness does not touch endpoints; it touches PIDs.

A startup probe is a one-shot gate that disables the other probes until the app finishes a slow boot. Use it for legacy apps that take a minute to start.

Hands-on Example

Here is a Deployment using all three probe types correctly.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: orders-api
  template:
    metadata:
      labels:
        app: orders-api
    spec:
      containers:
        - name: app
          image: registry.example.com/orders-api:1.4.2
          ports:
            - containerPort: 8080
          startupProbe:
            httpGet:
              path: /healthz/start
              port: 8080
            failureThreshold: 30
            periodSeconds: 5
          readinessProbe:
            httpGet:
              path: /healthz/ready
              port: 8080
            periodSeconds: 5
            timeoutSeconds: 2
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /healthz/live
              port: 8080
            periodSeconds: 10
            timeoutSeconds: 2
            failureThreshold: 5

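The startup probe gives the JVM up to 150 seconds to boot (failureThreshold 30 × periodSeconds 5). If it has still not passed by then, the kubelet kills the container and the restart policy applies. Once it passes, readiness and liveness take over.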


  +----------------+        readiness pass        +-----------+
  |  Service / LB  | <-------------------------- |   Pod A   |
  +----------------+                              +-----------+
          |                                            ^
          | endpoint removed                           | liveness fail
          v                                            v
  +----------------+        readiness fail        +-----------+
  |  Service / LB  |  --x-->                      |  kubelet  |
  +----------------+                              |  restart  |
                                                  +-----------+

How probe results map to traffic and restarts
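
For context, the "Service / LB" box in the diagram corresponds to a Service along the lines of the minimal sketch below. The Service name and port 80 are assumptions made for illustration; only the app: orders-api label and container port 8080 come from the Deployment above.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: orders-api        # hypothetical name, not given in the article
spec:
  selector:
    app: orders-api       # matches the pod labels in the Deployment
  ports:
    - port: 80            # assumed Service port
      targetPort: 8080    # containerPort from the Deployment
```

Only pods that are currently Ready are served as endpoints of this Service, which is why a failing readiness probe takes a pod out of rotation without restarting it.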

Common Pitfalls

The biggest mistake is pointing liveness at a deep health check that touches the database. If the database hiccups, every pod fails liveness, every pod restarts, and you create a thundering herd that crashes the database harder. Liveness should only check the process itself.

Another trap is identical readiness and liveness endpoints. If readiness flaps, you remove traffic; that is fine. If liveness flaps, you restart pods in a loop. Keep them separate.

Setting initialDelaySeconds too low causes false failures on cold starts. Prefer startupProbe instead, since it gives you a long warm-up window without making steady-state checks slow.
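
As a rough sketch of that trade-off, the two fragments below contrast a fixed delay with a startup probe for the same container. The 120-second delay is an illustrative value, not a recommendation; the paths and port are the ones used in the Deployment example.

```yaml
# Option A: fixed delay. It has to be sized for the slowest cold start:
# too short and slow starts get killed, too long and a hang during the
# first 120 s goes unnoticed.
livenessProbe:
  httpGet:
    path: /healthz/live
    port: 8080
  initialDelaySeconds: 120   # illustrative value
  periodSeconds: 10
---
# Option B: startup probe. Liveness checks begin as soon as the startup
# probe passes, whether that takes 10 s or the full 150 s budget.
startupProbe:
  httpGet:
    path: /healthz/start
    port: 8080
  failureThreshold: 30       # 30 x 5 s = up to 150 s to start
  periodSeconds: 5
livenessProbe:
  httpGet:
    path: /healthz/live
    port: 8080
  periodSeconds: 10
```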

Production Tips

Use failureThreshold generously for liveness (5 or more) to absorb GC pauses and brief network blips. Keep readiness failureThreshold small (2-3) so you fail fast out of the load balancer during real incidents.
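
To make those numbers concrete, here is a sketch of the two probes with the approximate reaction time worked out in comments. The period and timeout values are taken from the Deployment example; the arithmetic ignores timeouts and scheduling jitter, so treat the results as rough estimates.

```yaml
readinessProbe:
  httpGet:
    path: /healthz/ready
    port: 8080
  periodSeconds: 5
  timeoutSeconds: 2
  failureThreshold: 3    # about 3 x 5 s = 15 s of failures before traffic is removed
livenessProbe:
  httpGet:
    path: /healthz/live
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 5    # about 5 x 10 s = 50 s of failures before a restart
```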

Expose three distinct HTTP endpoints in your application: /healthz/live returns 200 as long as the event loop is running, /healthz/ready checks downstream dependencies your pod truly needs, and /healthz/start returns 200 once initialization is complete.

For batch jobs or workers, use exec probes that check the age of a heartbeat file the worker updates each iteration. For TCP-only services like databases, a tcpSocket probe is often enough.
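
A minimal sketch of both patterns follows. The heartbeat path /tmp/heartbeat, the 60-second staleness limit, and port 5432 are hypothetical values, and the exec command assumes the container image ships sh, date, and a stat that supports -c %Y.

```yaml
# Worker container: fail liveness when the heartbeat file is older than 60 s.
# A missing file also makes the command exit non-zero, so the probe fails.
livenessProbe:
  exec:
    command:
      - sh
      - -c
      - "test $(( $(date +%s) - $(stat -c %Y /tmp/heartbeat) )) -lt 60"
  periodSeconds: 30
  failureThreshold: 3
---
# TCP-only service: the probe passes if the port accepts a connection.
readinessProbe:
  tcpSocket:
    port: 5432             # hypothetical database port
  periodSeconds: 10
```

Note that a tcpSocket probe only confirms the port is open; it says nothing about whether the service can answer queries.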

Finally, watch the kube_pod_container_status_restarts_total metric, which kube-state-metrics exposes. Frequent restarts with no crash or OOM kill behind them often point to a misconfigured liveness probe rather than a sick app.
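
One way to watch that metric is an alerting rule. The sketch below assumes Prometheus is scraping kube-state-metrics; the group name, alert name, threshold of 3 restarts per hour, and severity label are all illustrative choices.

```yaml
groups:
  - name: probe-restarts                  # hypothetical group name
    rules:
      - alert: ContainerRestartingFrequently
        # More than 3 restarts of one container within the last hour.
        expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.container }} in {{ $labels.namespace }}/{{ $labels.pod }} is restarting frequently"
```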

Wrap-up

Readiness gates traffic; liveness gates lifetime. Keep the two endpoints distinct, keep liveness shallow, and have each one check only what it actually needs to verify. Done right, probes turn rolling updates into a non-event and give your cluster real self-healing.