Docker Healthchecks and Restart Policies Explained

Intermediate 9 min read

What you'll learn

✓ How Docker decides a container is healthy
✓ The four restart policies and when to pick each
✓ Writing a HEALTHCHECK in a Dockerfile
✓ Coordinating start-period with slow boots
✓ Production tips for orchestrators and logs

Prerequisites

• Familiar with shell and YAML

A running container is not the same as a working container. A Node process can be alive while it cannot reach its database. A Python worker can be stuck in a retry loop. Docker has two primitives to handle this: healthchecks describe how to ask the container if it is okay, and restart policies describe what Docker does when something goes wrong.

What and Why

A healthcheck is a command Docker runs periodically inside the container. If it exits 0, the check passes and the container is reported healthy. If it exits non-zero for the configured number of consecutive retries (3 by default), the container is marked unhealthy. Before either happens, the status is starting. Healthchecks do not by themselves restart anything. They expose a status that humans, orchestrators, and load balancers can read.
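
As a rough illustration of that rule, the sketch below replays a sequence of healthcheck exit codes and reports the status after each one. It is a toy model in Python, not Docker's implementation: HealthTracker and replay are hypothetical names, and the start period is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class HealthTracker:
    """Toy model of the status Docker derives from healthcheck exit codes."""

    retries: int = 3           # consecutive failures needed to become unhealthy
    status: str = "starting"   # initial status when a healthcheck is defined
    failing_streak: int = 0

    def record(self, exit_code: int) -> str:
        if exit_code == 0:
            # One passing check resets the streak and reports healthy.
            self.failing_streak = 0
            self.status = "healthy"
        else:
            self.failing_streak += 1
            if self.failing_streak >= self.retries:
                self.status = "unhealthy"
        return self.status


def replay(exit_codes, retries=3):
    """Return the status after each exit code in the sequence."""
    tracker = HealthTracker(retries=retries)
    return [tracker.record(code) for code in exit_codes]


if __name__ == "__main__":
    print(replay([0, 1, 1, 0, 1, 1, 1]))
```

Isolated failures do not change the status in this model; only an unbroken run of them does.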

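A restart policy is how Docker reacts when a container exits. no, the default, means do nothing. on-failure restarts when the exit code is non-zero, optionally up to a maximum number of attempts (on-failure:5). always restarts whenever the container stops; a container you stopped by hand stays down until the daemon restarts, and then it comes back. unless-stopped behaves like always, except that a container you stopped by hand stays stopped even after a daemon restart. These policies look at exit, not at health, which is the source of most confusion.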
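
The four policies can be summarized as a small decision function. This is a simplified Python sketch of the documented behaviour, not Docker's code: the function names and parameters are hypothetical, and details such as restart back-off delays are ignored.

```python
def restarts_after_exit(policy, exit_code, stopped_by_user=False,
                        restart_count=0, max_retries=None):
    """Would Docker restart a container that has just exited?"""
    if stopped_by_user or policy == "no":
        # A manual `docker stop` is not undone immediately by any policy.
        return False
    if policy == "on-failure":
        if exit_code == 0:
            return False
        # on-failure:N gives up after N restart attempts.
        return max_retries is None or restart_count < max_retries
    if policy in ("always", "unless-stopped"):
        return True
    raise ValueError(f"unknown restart policy: {policy!r}")


def stopped_container_returns_on_daemon_restart(policy):
    """For a container the user stopped by hand: does a daemon restart revive it?"""
    # This is the one place where `always` and `unless-stopped` differ.
    return policy == "always"


if __name__ == "__main__":
    for policy in ("no", "on-failure", "always", "unless-stopped"):
        print(policy, restarts_after_exit(policy, exit_code=1))
```

Note that neither function takes a health status as input, which mirrors the point above: the policy reacts to exits only.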

Combine the two and you get a feedback loop: the healthcheck flags a sick container, your orchestrator or your own tooling decides to kill it, and the restart policy brings it back.

Mental Model

Think of a container as having two channels of state. One is the exit channel: the process is either running or it stopped with some code. The other is the health channel: starting, healthy, or unhealthy. Docker mainly acts on the exit channel. The health channel is informational unless something above is listening: Compose for startup ordering, Swarm for replacing unhealthy tasks, or Kubernetes, which ignores HEALTHCHECK and runs its own probes instead.

   starting --(check passes)--> healthy <--(check passes)--+
      |                            |                       |
 (3 fails after               (3 fails)                    |
  start period)                    |                       |
      v                            v                       |
      +----------------------> unhealthy ------------------+

   any state --(process exits)--> exited --(restart policy)--> starting

States and transitions for a containerized process

Hands-on Example

Below is a Dockerfile for a small HTTP service. The HEALTHCHECK instruction tells Docker to request a local endpoint with wget every 10 seconds (the Alpine base image ships BusyBox wget but not curl). It allows 30 seconds for startup before failures count.

FROM node:20-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY . .
EXPOSE 3000
HEALTHCHECK --interval=10s --timeout=2s --start-period=30s --retries=3 \
  CMD wget -qO- http://127.0.0.1:3000/healthz || exit 1
CMD ["node", "server.js"]

Now run it with a restart policy. unless-stopped is a sensible default for long-lived services.

docker run -d --name api \
  --restart unless-stopped \
  -p 3000:3000 \
  myorg/api:1.4.2

Inspect health status at any time:

docker inspect --format '{{.State.Health.Status}}' api
docker inspect --format '{{json .State.Health}}' api | jq
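
If you would rather consume that JSON from a script than from jq, a minimal Python sketch could look like the following. The function name summarize_health is hypothetical; the field names (Status, FailingStreak, Log, ExitCode, Output) are the ones Docker emits for .State.Health, and the template prints null when the container has no healthcheck.

```python
import json


def summarize_health(raw: str) -> dict:
    """Condense the output of `docker inspect --format '{{json .State.Health}}'`."""
    health = json.loads(raw)
    if health is None:
        # The container was created without a healthcheck.
        return {"status": "none", "failing_streak": 0,
                "last_exit_code": None, "last_output": ""}
    log = health.get("Log") or []
    last = log[-1] if log else {}
    return {
        "status": health.get("Status", "none"),
        "failing_streak": health.get("FailingStreak", 0),
        "last_exit_code": last.get("ExitCode"),
        "last_output": (last.get("Output") or "").strip(),
    }


if __name__ == "__main__":
    import sys

    # Usage: docker inspect --format '{{json .State.Health}}' api | python summarize.py
    print(summarize_health(sys.stdin.read()))
```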

For Compose, the same idea lives under the service definition:

services:
  api:
    image: myorg/api:1.4.2
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "wget", "-qO-", "http://127.0.0.1:3000/healthz"]
      interval: 10s
      timeout: 2s
      start_period: 30s
      retries: 3
  worker:
    image: myorg/worker:1.4.2
    depends_on:
      api:
        condition: service_healthy

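depends_on with service_healthy is what turns a passive signal into orchestration: Compose waits until the API reports healthy before it starts the worker. This gates startup order only; it does not by itself restart the worker if the API turns unhealthy later.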

Common Pitfalls

The first pitfall is checking the wrong thing. A healthcheck that hits / and returns 200 even when the database is down is worse than useless: it lies. Add a dedicated /healthz endpoint that verifies the critical dependencies your container actually needs to do its job.
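
A dedicated endpoint of that kind might be sketched as follows. The article's service is a Node app, so treat this Python version as a stand-in for the shape of the handler rather than a drop-in: evaluate, make_handler, and the "database" check are hypothetical, and the real checks would call whatever your service actually depends on.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def evaluate(checks):
    """Run each named dependency check and return (http_status, body)."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failed"
        except Exception as exc:  # a crashing check counts as a failure
            results[name] = f"error: {exc}"
    healthy = all(value == "ok" for value in results.values())
    body = {"status": "ok" if healthy else "unavailable", "checks": results}
    return (200 if healthy else 503), body


def make_handler(checks):
    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/healthz":
                self.send_error(404)
                return
            status, body = evaluate(checks)
            payload = json.dumps(body).encode()
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

        def log_message(self, *args):
            pass  # keep successful probes out of the access log

    return HealthHandler


if __name__ == "__main__":
    # Placeholder check: replace with a cheap ping to the real dependency.
    checks = {"database": lambda: True}
    ThreadingHTTPServer(("127.0.0.1", 3000), make_handler(checks)).serve_forever()
```

Returning 503 when any dependency fails is what makes the wget-based HEALTHCHECK exit non-zero.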

The second pitfall is omitting start-period. If your app takes 20 seconds to boot but you set retries to 3 with a 5-second interval and no start-period, the third failed check lands about 15 seconds in, and the container is marked unhealthy before it has finished starting. Use start-period to set a grace window during which failed checks do not count against retries.
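
The arithmetic behind this pitfall can be sketched in a few lines of Python. The model is an approximation under stated assumptions: the first check runs one interval after start, each later check runs one interval after the previous one finishes, a failure counts only if the check began after the start period, and the container never passes a check. The function name is hypothetical.

```python
def seconds_until_unhealthy(interval, retries, start_period=0.0, check_duration=0.0):
    """Earliest time, in seconds after start, at which a container that
    never passes its healthcheck would be marked unhealthy."""
    counted = 0
    finished_at = 0.0
    while counted < retries:
        started_at = finished_at + interval
        finished_at = started_at + check_duration
        if started_at >= start_period:
            counted += 1  # failures inside the start period are ignored
    return finished_at


if __name__ == "__main__":
    boot_time = 20
    for start_period in (0, 30):
        deadline = seconds_until_unhealthy(5, 3, start_period=start_period)
        verdict = "too early" if deadline < boot_time else "enough room"
        print(f"start_period={start_period}s: unhealthy at {deadline}s ({verdict})")
```

With the numbers from the pitfall (interval 5, retries 3, no start period) the model gives 15 seconds, short of the 20-second boot; a 30-second start period pushes the earliest unhealthy mark to 40 seconds.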

The third pitfall is conflating restart policy with health. Standalone Docker Engine will not restart an unhealthy container by itself. It only restarts on exit. If you want automatic restart on health failure, run an orchestrator that watches health, or have your app crash on its own when it detects unrecoverable state.
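
If you go the "own tooling" route, a watcher can be as small as the sketch below. It relies on two real CLI features, the health filter of docker ps and docker restart; everything else (the function names, the injectable run parameter used for testing, the 30-second polling interval) is an assumption for illustration. A production version would also need back-off and logging.

```python
import subprocess
import time


def unhealthy_container_ids(run=subprocess.run):
    """Return the IDs of running containers whose health status is unhealthy."""
    result = run(
        ["docker", "ps", "--filter", "health=unhealthy", "--format", "{{.ID}}"],
        capture_output=True, text=True, check=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


def restart_unhealthy(run=subprocess.run):
    """Restart every unhealthy container and return the IDs that were restarted."""
    ids = unhealthy_container_ids(run)
    for container_id in ids:
        run(["docker", "restart", container_id], check=True)
    return ids


if __name__ == "__main__":
    while True:
        for container_id in restart_unhealthy():
            print(f"restarted unhealthy container {container_id}")
        time.sleep(30)  # polling interval, chosen arbitrarily
```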

The fourth pitfall is heavy healthchecks. A check that does a full database query every five seconds adds load. Keep checks cheap. Verify connectivity, not correctness.

Production Tips

Treat the health endpoint as part of the contract. Keep it documented, stable across versions, and free of authentication. Load balancers and orchestrators need to reach it.

Differentiate liveness (am I running?) from readiness (am I able to serve traffic?). In Docker alone, the line blurs. In Kubernetes, they are separate probes. Even in Docker-only environments, an internal split lets you fail readiness during warmup without triggering restart loops.

Log a single line on every failed healthcheck with the reason. When debugging at 2 AM, you want to know which dependency caused the flap, not just that the container went unhealthy.
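
One way to get that line is to let the healthcheck command itself print the reason, since Docker keeps the command's output in the health log shown by docker inspect. The sketch below is a Python probe written for illustration (the article's Node image does not ship Python, so the same shape would need porting to a small Node script there). The function name probe and the opener parameter are hypothetical.

```python
import sys
import urllib.error
import urllib.request


def probe(url, timeout=2.0, opener=urllib.request.urlopen):
    """Return (exit_code, one_line_reason) for a single healthcheck attempt."""
    try:
        with opener(url, timeout=timeout) as response:
            return 0, f"healthy: HTTP {response.status}"
    except urllib.error.HTTPError as exc:
        # The endpoint answered with an error status; include its body as the reason.
        body = exc.read().decode(errors="replace").strip()
        return 1, f"unhealthy: HTTP {exc.code} {body[:200]}"
    except OSError as exc:
        # Connection refused, DNS failure, timeout, and similar errors.
        return 1, f"unhealthy: {exc}"


if __name__ == "__main__":
    code, message = probe("http://127.0.0.1:3000/healthz")
    print(message)  # ends up in the Output field of the health log
    sys.exit(code)
```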

Set realistic timeouts. A 2-second timeout for a check that calls a slow service will produce spurious failures on a busy day. Tune the values against real latency distributions, not guesses.
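
As one possible way to turn measurements into a number, the sketch below takes observed healthcheck latencies and suggests a timeout from the 99th percentile plus headroom. The function name and the 1.5x headroom factor are assumptions, not a recommendation from Docker; pick the percentile and margin that match your own tolerance for flapping.

```python
import statistics


def suggest_timeout(latencies_ms, headroom=1.5):
    """Suggest a healthcheck timeout in seconds from observed latencies.

    Uses the 99th percentile of the samples multiplied by a headroom factor.
    Needs at least two samples.
    """
    # quantiles(n=100) returns 99 cut points; index 98 is the 99th percentile.
    p99_ms = statistics.quantiles(latencies_ms, n=100)[98]
    return round(p99_ms * headroom / 1000, 2)


if __name__ == "__main__":
    # Hypothetical samples in milliseconds, e.g. pulled from access logs.
    samples = [12, 15, 14, 18, 250, 16, 13, 17, 19, 900]
    print(f"--timeout={suggest_timeout(samples)}s")
```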

Finally, monitor the health status itself. Export the unhealthy count to your metrics system. Flapping containers are a leading indicator of a real problem.
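
Flapping is easy to quantify once you sample the status on a schedule. The sketch below counts healthy-to-unhealthy transitions in a list of sampled statuses; count_flaps is a hypothetical helper, and how you collect the samples and export the number depends on your metrics system.

```python
def count_flaps(statuses):
    """Count how many times consecutive samples go from healthy to unhealthy."""
    samples = list(statuses)
    return sum(
        1
        for previous, current in zip(samples, samples[1:])
        if previous == "healthy" and current == "unhealthy"
    )


if __name__ == "__main__":
    history = ["starting", "healthy", "unhealthy", "healthy", "unhealthy", "unhealthy"]
    print(count_flaps(history))
```

A container that went unhealthy once and stayed there scores 1; a container that keeps recovering and failing scores higher, which is the pattern worth alerting on.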

Wrap-up

Healthchecks describe what alive means for your container, and restart policies describe how Docker reacts when the process dies. Used together, they form the bottom layer of reliability for containerized services. Get them right and you build everything else, from orchestration to alerting, on solid ground.