Topic 67

Healthchecks


A running container is not the same as a working one. driftwood/web can be up, listening on its port, and still returning 500s because its database connection died mid-shift. The process is alive; the service is not. A HEALTHCHECK is a command Docker runs inside the container on an interval to answer the only question that matters — "is this actually serving?" — and turns the answer into a health status that other tools can act on.

On a single host, though, Docker reports the status and does almost nothing about it on its own. It will flip a container to unhealthy and leave it running, broken, until something else intervenes. That gap — between knowing a container is sick and doing something about it — is the whole point of this topic. The healthcheck is an input to a restart policy and to Compose's startup ordering, not a self-healing mechanism by itself.

Defining the Check

You declare a healthcheck in the Dockerfile with HEALTHCHECK, or in Compose under a healthcheck: key. The command runs inside the container, against its own service; for driftwood/web that is usually a request to a route that exercises the real dependency path. Exit code 0 means healthy and 1 means unhealthy; Docker reserves exit code 2, so the command should not return it. A typical definition uses curl to probe a /health endpoint on the container's own port:

Dockerfile — a healthcheck that hits the app's own health route

HEALTHCHECK --interval=15s --timeout=3s --start-period=30s --retries=3 \
  CMD curl -f http://localhost:8000/health || exit 1

The -f flag makes curl exit non-zero on an HTTP error status, so a 500 from /health registers as a failed probe rather than a success that happened to return error text. The command runs in the container's own namespaces, which is why it targets localhost — it is talking to the same process it is checking.
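Images that do not ship curl can run the same kind of probe with the language runtime already in the image. The following is a minimal sketch in Python, assuming the image contains a Python interpreter; the function name probe and the idea of passing the URL as a command-line argument are illustrative choices, not something the Dockerfile above requires.

```python
import sys
import urllib.request


def probe(url: str, timeout: float = 3.0) -> int:
    """Return 0 if the URL answers with a 2xx status, otherwise 1."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 0 if 200 <= resp.status < 300 else 1
    except Exception:
        # urlopen raises HTTPError for 4xx/5xx responses, and URLError or
        # OSError for refused connections and timeouts. Any failure to get
        # a clean answer is treated as a failed probe.
        return 1


# Only act as a script when a URL is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(probe(sys.argv[1]))
```

Saved under a hypothetical path such as /healthcheck.py, it could stand in for the curl call as HEALTHCHECK CMD python /healthcheck.py http://localhost:8000/health. Like curl -f, it turns a 500 from /health into a non-zero exit.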

The Three Statuses

A container with a healthcheck moves through three states. During the start-period grace window it is starting, and failures don't count against it. After that it is healthy or unhealthy depending on the most recent probes. docker ps shows the status in parentheses next to the container — Up 4 minutes (healthy) — so a quick listing tells you which containers are actually serving. For the detail of why a check fails, docker inspect exposes the recent probe output:

The three health statuses a container moves through

starting → probe runs → healthy → unhealthy

Read the last health-check results and their output

docker inspect --format '{{json .State.Health}}' web | jq

That returns the current status, the failing-streak counter, and a rolling log of the last few probe invocations with their exit codes and stdout — enough to see that the check is timing out, or that /health is returning a 503 because the database is down.
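When jq is not available, the same JSON can be condensed with a few lines of Python. This is a sketch that assumes the Status, FailingStreak, and Log fields (each log entry carrying ExitCode and Output) that docker inspect emits under .State.Health; the function name summarize_health is made up for illustration.

```python
import json


def summarize_health(raw: str) -> str:
    """Condense the .State.Health JSON into one line per probe."""
    health = json.loads(raw)
    lines = [
        f"status={health['Status']} failing_streak={health['FailingStreak']}"
    ]
    # Log can be null for a container that has not been probed yet.
    for entry in health.get("Log") or []:
        output = entry["Output"].strip().splitlines()
        last_line = output[-1] if output else ""
        lines.append(f"  exit={entry['ExitCode']} {last_line}")
    return "\n".join(lines)
```

Feeding it the output of the docker inspect command above gives a quick view of the failing streak and the last line each probe printed, which is usually where a curl error or a timeout message shows up.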

What Docker Does — and Doesn't — Do on a Single Host

This is the part that surprises people. By itself, the daemon flips the status to unhealthy and stops there. It does not restart the container, replace it, or stop it on a health failure. The restart policy (Chapter 3, topic 17) reacts only to the process exiting — and an unhealthy container whose process is still running has not exited. Only an orchestrator acts on unhealthy directly. So on one host, a healthcheck that goes red and a container that keeps running side by side is the default, not a bug.

The single-host operator has to bridge that gap deliberately: combine the healthcheck with something that actually exits or restarts on failure. The status is a signal; you still have to wire up the response.
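One way to wire up a response is a small watcher, run from cron or a systemd timer on the host, that restarts whatever Docker currently reports as unhealthy. The sketch below assumes the docker CLI is on the PATH and that restarting is an acceptable reaction for every container on the host; the function names are hypothetical, and the run parameter exists only so the logic can be exercised without a daemon.

```python
import subprocess
from typing import Callable, List


def run_docker(args: List[str]) -> str:
    """Run a docker CLI command and return its stdout."""
    result = subprocess.run(
        ["docker", *args], check=True, capture_output=True, text=True
    )
    return result.stdout


def restart_unhealthy(run: Callable[[List[str]], str] = run_docker) -> List[str]:
    """Restart every container currently reported as unhealthy."""
    # `docker ps --filter health=unhealthy --quiet` prints one ID per line.
    ids = run(["ps", "--filter", "health=unhealthy", "--quiet"]).split()
    for container_id in ids:
        run(["restart", container_id])
    return ids
```

This is a blunt instrument: it restarts on the status alone and knows nothing about why the probe failed, so it suits stateless services better than databases.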

Tuning Interval, Timeout, Retries, Start-Period

Four knobs shape the check. interval sets how often it runs; timeout caps how long a single probe may take before it counts as a failure; retries sets how many consecutive failures flip the container to unhealthy; and start-period is a grace window in which failures don't count yet. Each can be mistuned. A start-period that is too short marks a slow-booting driftwood/web unhealthy before it has finished running its database migrations, so the image looks broken when the timing was simply wrong. A long timeout does not make probes overlap, because Docker waits interval seconds after the previous check completes before starting the next one, but it does let a hung probe hold up the schedule and delay detection. And too long an interval means a service that died seconds ago can still show healthy for minutes.
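A rough way to sanity-check a set of values is to estimate how long a dead service can keep showing healthy. The sketch below assumes checks run one at a time, with the next starting interval seconds after the previous one completes, that the service dies just after a passing probe, and that every failing probe hangs until its timeout. It is an upper-bound estimate under those assumptions, not a guarantee.

```python
def worst_case_detection_seconds(
    interval: float, timeout: float, retries: int
) -> float:
    """Upper bound on the time from service death to the unhealthy status.

    Each of the `retries` failing probes is preceded by a wait of up to
    `interval` seconds and may itself run for up to `timeout` seconds.
    """
    return retries * (interval + timeout)


# Values from the Dockerfile example: interval=15s, timeout=3s, retries=3.
print(worst_case_detection_seconds(15, 3, 3))  # prints 54
```

With the example settings the status can lag the real failure by close to a minute, which is worth knowing before relying on it for startup gating or alerting.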

Tie to Compose depends_on

Compose's depends_on: condition: service_healthy (Chapter 8, topic 48) holds web from starting until db reports healthy, turning the healthcheck into an ordering gate. Without it, depends_on waits only for the container to start — not to be ready — and web races ahead of a Postgres still opening its socket, failing on first boot with a connection-refused error. The healthcheck on db is what makes "wait until it's actually accepting connections" expressible instead of "wait until the process launched."
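The waiting that service_healthy performs can be pictured as a polling loop over the dependency's health status. The following is only a sketch of that idea, not how Compose is implemented; it assumes the docker CLI is available, and the function names, poll interval, and max_wait are illustrative.

```python
import subprocess
import time
from typing import Callable


def health_status(container: str) -> str:
    """Read the current health status of a container via docker inspect."""
    result = subprocess.run(
        ["docker", "inspect", "--format", "{{.State.Health.Status}}", container],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def wait_until_healthy(
    container: str,
    get_status: Callable[[str], str] = health_status,
    poll_seconds: float = 1.0,
    max_wait: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the container reports healthy; False if max_wait elapses."""
    waited = 0.0
    while waited <= max_wait:
        if get_status(container) == "healthy":
            return True
        sleep(poll_seconds)
        waited += poll_seconds
    return False
```

A start script for web could call wait_until_healthy("db") before launching the app, but declaring the condition in Compose expresses the same gate without extra code.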

Tie to Restart Policy

Pairing a healthcheck with an app that exits on persistent, unrecoverable failure lets the restart policy (Chapter 3, topic 17) recycle the container. The common single-host pattern is three parts working together: the healthcheck reports status, the app exits on fatal failure, and restart: unless-stopped brings it back from the exit. Docker will not connect "unhealthy" to "restart" for you — but "process exited" to "restart" is exactly what the restart policy does, so the trick is to make a fatal health condition cause the process to exit.
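The "app exits on fatal failure" part can be as small as a counter of consecutive dependency failures. The sketch below is one possible shape, assuming the app already has some way to test its dependency; FatalFailureGuard, check_or_exit, and the threshold of 5 are hypothetical names and values chosen for illustration.

```python
import sys
from typing import Callable


class FatalFailureGuard:
    """Track consecutive dependency failures and decide when to give up."""

    def __init__(self, max_failures: int = 5) -> None:
        self.max_failures = max_failures
        self.failures = 0

    def record(self, ok: bool) -> bool:
        """Record one check; return True when the process should exit."""
        # A success resets the streak, so only persistent failure is fatal.
        self.failures = 0 if ok else self.failures + 1
        return self.failures >= self.max_failures


def check_or_exit(guard: FatalFailureGuard, dependency_ok: Callable[[], bool]) -> None:
    """Exit non-zero once the dependency has failed too many times in a row."""
    if guard.record(dependency_ok()):
        # A non-zero exit is what the restart policy reacts to.
        sys.exit(1)
```

Calling check_or_exit on a timer inside the app means a database connection that stays dead eventually ends the process, and restart: unless-stopped then brings up a fresh one.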

Common Mistakes

Setting start-period too short for a container that runs database migrations or warms a cache on boot: driftwood/web is flagged unhealthy (and killed, under an orchestrator) before it has finished starting, which looks like a broken image when the timing was the only problem.
Assuming Docker restarts an unhealthy container on a single host — it does not; the status changes and nothing acts on it unless an orchestrator or a deliberate restart-on-exit pattern is in place.
Writing a healthcheck that only confirms the port is open, not that the app works — a process stuck in a deadlock still accepts the TCP connection, so an nc -z style check reports healthy while every real request hangs.
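Giving the check a timeout as long as or longer than the interval, so a hung probe holds up the schedule (Docker starts the next check only after the previous one completes) and failures take far longer to surface, or running an expensive query every few seconds that adds real load to driftwood/db.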
Relying on depends_on without condition: service_healthy to order startup — Compose waits only for db to start, so web connects before Postgres is accepting connections and fails on first boot.

Best Practices

Probe an endpoint that exercises the real dependency path — a /health route that touches the database — not just the listening port, so healthy means actually serving.
Set start-period to comfortably cover the container's slowest legitimate boot, including migrations and cache warm, so startup failures aren't counted as health failures.
Gate dependent services with Compose depends_on: condition: service_healthy so web waits for db to report healthy, not merely to start.
Pair the healthcheck with a restart policy and an app that exits on fatal, unrecoverable failure, since Docker on one host will not restart an unhealthy container for you.
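The first practice, a /health route that touches the database, can be sketched as a function returning a status code and body. The example uses SQLite from the standard library purely as a stand-in for the real database; the function name health, the db_path parameter, and the response text are assumptions for illustration.

```python
import sqlite3
from typing import Tuple


def health(db_path: str) -> Tuple[int, str]:
    """Return (status_code, body) for a /health route that touches the DB."""
    try:
        conn = sqlite3.connect(db_path, timeout=1.0)
        try:
            # A trivial query proves the connection works without adding
            # meaningful load on every probe.
            conn.execute("SELECT 1").fetchone()
        finally:
            conn.close()
        return 200, "ok"
    except sqlite3.Error as exc:
        # 503 makes `curl -f` exit non-zero, so the probe fails.
        return 503, f"database unavailable: {exc}"
```

Whatever web framework serves the route, the point is the same: the response is only a success if the dependency path actually worked, and the query is cheap enough to run on every interval.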

Comparable tools

Kubernetes: liveness, readiness, and startup probes that do act on failure
Podman · Compose: honor the same HEALTHCHECK directive
Healthchecks.io · Blackbox Exporter: probe from outside the host rather than inside the container

Knowledge Check

On a single Docker host with no orchestrator, what does the daemon do when a container's healthcheck flips it to unhealthy?

It records the unhealthy status and otherwise leaves the container running — it does not restart, stop, or replace it
It immediately restarts the container in place to try to recover it, the way an orchestrator would
It stops and removes the unhealthy container automatically so that a fresh replacement can be scheduled and started in its place
It triggers the configured restart policy, which reacts directly to the reported health status

What does the start-period setting control, and what goes wrong if it is too short?

It is a grace window where failing probes don't count; too short, and a slow-booting container is flagged unhealthy before it finishes starting
It sets how often the probe runs between successive checks; too short, and the probe runs almost constantly and adds measurable, continuous load on the host
It caps how long a single probe invocation may take before it is killed; too short, and slow-but-valid probes are counted as failures
It sets how many consecutive failed probes flip the container to unhealthy; too short, and a single transient blip marks it unhealthy

Why does a healthcheck that only checks whether the port is open give a false sense of health?

A deadlocked process still accepts the TCP connection, so the check reports healthy while real requests hang
A port-open check is too expensive to run and adds enough load on each interval to make the container itself appear unhealthy
Docker rejects port-only checks as invalid and forces the container to stay in the starting state forever
A port check always runs from outside the container's namespace and therefore cannot reach the app's real endpoint

In Compose, why does depends_on need condition: service_healthy to correctly order web after db?

Plain depends_on waits only for db to start; service_healthy waits for its healthcheck to pass, so web starts after Postgres accepts connections
Plain depends_on already waits for db to be fully ready and accepting client connections first, so adding the explicit condition here is purely cosmetic
It controls shutdown order on teardown, ensuring web always stops cleanly before db goes down
It makes Compose restart db automatically whenever its healthcheck reports unhealthy at runtime
