Topic 17

Restart Policies and Exit Codes

Resilience · Exit codes

A container's exit code is its main process's exit code, and it is the input the restart policy acts on. A restart policy, set with --restart, tells the daemon what to do when the main process exits: leave it stopped, restart it always, or restart it only on failure. Set this wrong and you get one of two failures: a crashed service that stays down silently, or a run-once job that thrashes in a restart loop forever.

The exit code and the policy work together. The code tells you why the process died; the policy decides what happens next. Reading the code first, before assuming a bug, saves you from chasing an application fault that was really the kernel doing the killing.

Exit Codes Carry Meaning

Exit 0 is success; any non-zero code is a failure of some kind. A handful are worth memorizing, and the ones above 128 follow the shell convention of 128 plus the signal number. 137 means the process was SIGKILLed (128 + 9): typically an OOM kill from the memory cap in the next topic, a docker kill, or a docker stop whose grace period ran out. 143 means it took SIGTERM (128 + 15) and exited, which is usually a requested stop rather than a crash. 139 is a segfault (128 + 11). docker ps -a and docker inspect both surface the code, and it is the first clue to why a container died.

The exit code is the first thing to read when a container dies

$ docker inspect --format '{{.State.ExitCode}}' driftwood-web
137
$ docker inspect --format '{{.State.OOMKilled}}' driftwood-web
true                # 137 + OOMKilled → the kernel OOM-killed it, not a bug
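The 128 + N arithmetic can be checked locally without Docker. The sketch below is a minimal illustration, assuming a POSIX shell; exit_status_for_signal and signal_for_exit are hypothetical helper names, not Docker commands.

```shell
# Hypothetical helper: start a child shell that signals itself, then print
# the exit status the parent observes (128 + signal number).
exit_status_for_signal() {
  status=0
  sh -c "kill -s $1 \$\$" 2>/dev/null || status=$?
  echo "$status"
}

# Hypothetical helper: map an exit status above 128 back to a signal name.
signal_for_exit() {
  if [ "$1" -gt 128 ]; then
    kill -l "$(( $1 - 128 ))"
  else
    echo "none"
  fi
}

exit_status_for_signal KILL   # SIGKILL is signal 9, so this prints 137
exit_status_for_signal TERM   # SIGTERM is signal 15
signal_for_exit 137           # subtracts 128 and asks the shell for the name
```

A process can also return a value above 128 on its own, so treat the mapping as a strong hint rather than proof, and confirm with OOMKilled or the logs.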

The Four Policies

There are four. no is the default and never restarts. on-failure[:N] restarts only on a non-zero exit, and stops retrying after N attempts if you supply N. always restarts whenever the container exits, even on a clean exit 0, and again whenever the daemon starts (a manual docker stop is honored only until then). unless-stopped behaves like always but does not bring back a container you deliberately stopped when the daemon next starts, for example after a reboot. The whole choice between always and unless-stopped is one question: should a manual stop survive a daemon restart?

The four --restart policies — when each one brings the container back

no

The default. Never restarts — once the process exits, the container stays stopped.

on-failure[:N]

Restarts only on a non-zero exit, up to N times, then gives up. For jobs that should retry transient failures.

always

Restarts on any exit — even a clean exit 0 — and again on daemon start, even after a manual stop.

unless-stopped

Like always, but a container you deliberately stopped stays stopped across a reboot.
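The four descriptions above can be read as a small decision table. The following is a toy model of that table only, not the daemon's code: should_restart and its arguments are hypothetical, the retry cap of on-failure:N is left out, and the one case the text above does not describe (on-failure at daemon start) is reported as unknown rather than guessed.

```shell
# Toy model of the restart decision described above (hypothetical helper).
# Usage: should_restart POLICY EVENT EXIT_CODE MANUALLY_STOPPED
#   EVENT is "exit" (the main process exited on its own) or "daemon-start".
should_restart() {
  policy=$1 event=$2 code=$3 stopped=$4
  case "$event:$policy" in
    *:no)
      echo no ;;
    exit:always|exit:unless-stopped)
      echo yes ;;
    exit:on-failure)
      if [ "$code" -ne 0 ]; then echo yes; else echo no; fi ;;
    daemon-start:always)
      echo yes ;;                      # comes back even after a manual stop
    daemon-start:unless-stopped)
      if [ "$stopped" = "true" ]; then echo no; else echo yes; fi ;;
    *)
      echo unknown ;;                  # not covered by the text above
  esac
}

should_restart always exit 0 false                  # clean exit, still restarted
should_restart on-failure exit 0 false              # clean exit, left alone
should_restart unless-stopped daemon-start 0 true   # manual stop is respected
```

The last two arguments are what separate the policies: on-failure looks at the exit code, and unless-stopped looks at whether you stopped the container yourself.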

Restart Backoff

The daemon does not restart a crashing container in a tight loop. It backs off with an increasing delay — starting around 100ms and doubling each attempt — so a container that crashes immediately on start does not hammer the host with thousands of restarts per second. The visible effect is that a crash-looping container restarts more slowly over time, which is the daemon protecting the machine, not the container recovering.
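To make the doubling concrete, here is a minimal sketch that prints the delay before each of the first few restart attempts. The 100 ms start and the doubling come from the paragraph above; the one-minute ceiling is an assumption taken from Docker's documentation, which also describes the delay resetting once a container has stayed up for about ten seconds. backoff_schedule is a hypothetical name.

```shell
# Hypothetical helper: print the delay in milliseconds before each of the
# first N restart attempts, doubling from 100 ms up to an assumed ceiling.
backoff_schedule() {
  attempts=$1
  delay=100      # starting delay, per the text above
  max=60000      # assumption: one-minute ceiling from Docker's documentation
  i=0
  out=""
  while [ "$i" -lt "$attempts" ]; do
    out="$out${out:+ }$delay"
    delay=$((delay * 2))
    if [ "$delay" -gt "$max" ]; then
      delay=$max
    fi
    i=$((i + 1))
  done
  echo "$out"
}

backoff_schedule 11
# prints: 100 200 400 800 1600 3200 6400 12800 25600 51200 60000
```

Under these assumptions a container that crashes on start reaches the ceiling after about ten attempts, which is why a crash loop looks slower the longer it runs.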

Restart Policy vs HEALTHCHECK

A restart policy reacts to the process exiting. A HEALTHCHECK (Chapters 5 and 11) detects a process that is alive but unhealthy: hung, deadlocked, holding the port open but answering nothing. A wedged gunicorn that never exits will not trigger a restart policy at all, because nothing exited; the policy has no event to fire on. That gap is the limit of what a restart policy can do on a single host, and it is why a liveness check is a separate mechanism. Note that on a standalone Docker Engine the health check only marks the container unhealthy; the restart policy does not act on that status, so something else has to.
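One way to close that gap on a single host is a small watchdog that reads the health status and restarts the container itself. This is a sketch under assumptions, not a recommended production setup: restart_if_unhealthy is a hypothetical helper, it assumes the container defines a HEALTHCHECK, and it would need to be run periodically (from cron or a loop) to be useful.

```shell
# Hypothetical watchdog step: restart a container whose health check has
# marked it unhealthy. Does nothing when the status is healthy or starting.
restart_if_unhealthy() {
  name=$1
  # .State.Health is only present when a HEALTHCHECK is defined; without one
  # the inspect call fails and this function returns non-zero.
  status=$(docker inspect --format '{{.State.Health.Status}}' "$name") || return 1
  if [ "$status" = "unhealthy" ]; then
    docker restart "$name"
  fi
}

# Example call (requires a running Docker daemon):
# restart_if_unhealthy driftwood-web
```

The restart policy still handles exits; the watchdog only covers the case where the process is alive but the health check is failing.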

Where Restart Policies Stop

--restart is a single-host, single-container mechanism the daemon enforces. It does not reschedule the container onto another node, replace a failed host, or maintain a replica count — it brings this container back on this machine and nothing more. When you need rescheduling, replicas, or host failure handling, you have crossed into orchestration (Kubernetes, Chapter 12), not a bigger restart flag. The restart policy is where single-host resilience ends.

always vs unless-stopped vs on-failure

--restart always — brings the container back on any exit and again on daemon start, even after you stopped it manually. Use it for services that must always run regardless of how they went down.
--restart unless-stopped — same as always but respects a manual docker stop across a reboot. Usually the better default for a long-lived service like Driftwood web, because a deliberate stop stays stopped.
--restart on-failure:5 — restarts only on a non-zero exit, capped at five attempts. Use it for jobs that should retry a few times on transient failure, then give up rather than loop forever.
--restart no — the default; never restarts. Use it for one-off commands and interactive runs you don't want resurrected.
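In command form, the choices above might look like the following. The commands are wrapped in functions so nothing runs until you call one; the image name driftwood/web:latest and the container name driftwood-job are hypothetical placeholders, while --restart, docker update --restart, and the HostConfig.RestartPolicy field are standard Docker CLI features.

```shell
IMAGE=driftwood/web:latest   # hypothetical image name

# Long-lived service: survives crashes and reboots, honors a manual stop.
start_web() {
  docker run -d --name driftwood-web --restart unless-stopped "$IMAGE"
}

# Retryable job: retry up to five times on a non-zero exit, then give up.
# Extra arguments are passed through as the job's command.
run_retryable_job() {
  docker run --name driftwood-job --restart on-failure:5 "$IMAGE" "$@"
}

# Change the policy of an existing container without recreating it.
# Usage: change_policy CONTAINER POLICY
change_policy() {
  docker update --restart "$2" "$1"
}

# Print the policy a container currently has.
show_policy() {
  docker inspect --format '{{.HostConfig.RestartPolicy.Name}}' "$1"
}
```

For example, change_policy driftwood-web unless-stopped followed by show_policy driftwood-web confirms the switch on a container that was started with the default.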

Common Mistakes

Leaving the Driftwood web container on the default --restart no and assuming it'll come back after a crash or a host reboot — it stays exited and the service is down until someone notices.
Putting --restart always on a one-shot job or migration container — a clean exit 0 triggers a restart, so the job runs forever in a loop instead of finishing once.
Reading a 137 exit and assuming the app crashed when it was OOM-killed by the cgroup memory limit (next topic): 137 together with OOMKilled true means the kernel killed it, not a bug in the code.
Relying on a restart policy to recover a hung-but-alive process — the process never exits, so the policy never fires; a HEALTHCHECK is what covers liveness, not a restart policy.
Choosing always when you actually want a manual stop to stick across reboots — after a daemon restart the container comes back even though you stopped it on purpose; unless-stopped is what respects the stop.

Best Practices

Set --restart unless-stopped on long-lived services like Driftwood web so they survive crashes and reboots but still honor a deliberate stop.
Use --restart on-failure:N for retryable jobs so transient failures retry but a genuinely broken job gives up instead of looping forever.
Read the exit code first when a container dies (docker inspect --format '{{.State.ExitCode}}'), mapping 137 to SIGKILL or OOM and 143 to a clean SIGTERM, before assuming an application bug.
Pair a restart policy with a HEALTHCHECK (Chapter 11) so that exits are handled by the policy and hung-but-alive states are at least detected by the health check, since the policy alone can't see a deadlock.
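The "read the exit code first" habit can be folded into a one-line summary. The sketch below is illustrative only: triage is a hypothetical helper, and the wording of each diagnosis is an assumption layered on the three inspect fields it reads (State.ExitCode, State.OOMKilled, RestartCount).

```shell
# Hypothetical helper: summarize why a container stopped, in one line.
triage() {
  name=$1
  info=$(docker inspect \
    --format '{{.State.ExitCode}} {{.State.OOMKilled}} {{.RestartCount}}' \
    "$name") || return 1
  # Split the three space-separated fields into variables.
  read -r code oom restarts <<EOF
$info
EOF
  if [ "$oom" = "true" ]; then
    reason="OOM-killed by the kernel"
  elif [ "$code" -eq 0 ]; then
    reason="clean exit"
  elif [ "$code" -eq 137 ]; then
    reason="SIGKILL (docker kill or a timed-out docker stop)"
  elif [ "$code" -eq 143 ]; then
    reason="SIGTERM (requested stop)"
  else
    reason="non-zero exit, check the logs"
  fi
  echo "$name: exit $code, $reason, $restarts restarts"
}

# Example call (requires a running Docker daemon):
# triage driftwood-web
```

Checking OOMKilled before the exit code matters, because 137 on its own does not say who sent the SIGKILL.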

Comparable tools

Podman: supports the same --restart policies and adds podman generate systemd for boot persistence.
systemd service units: the host-level analog of unless-stopped.
Kubernetes: restartPolicy plus controllers (replica count, rescheduling) are the multi-host version (Chapter 12).

Knowledge Check

What is the difference between --restart always and --restart unless-stopped?

unless-stopped won't restart a container you manually stopped across a reboot, while always does
always restarts the container only on a non-zero failure exit, while unless-stopped restarts on any exit at all
always retries a fixed five times and then gives up, while unless-stopped goes on retrying forever
unless-stopped watches the container's HEALTHCHECK for liveness while always only ever watches the exit code

Your container keeps dying with exit code 137 and OOMKilled: true. What does that tell you?

The kernel's OOM killer SIGKILLed it for exceeding the memory limit — not an application crash
The application itself has a bug that crashes it on exit, and 137 is the error code it deliberately returns
It received a clean SIGTERM and shut itself down gracefully, since 137 is the code for a normal stop
It segfaulted on a bad pointer, since 137 is the standard exit code for a memory access violation

Why can't a restart policy recover a gunicorn that has deadlocked but never exits?

A restart policy fires only on the process exiting, and a hung process never exits — liveness needs a check
The restart backoff delay grows too long, so the policy gives up and stops retrying before the deadlock clears
The policy only watches CPU usage to decide, and a deadlocked process is still busy-looping and using CPU
Only the unless-stopped policy can detect a hang like this, and the container was instead set to always

What happens when you set --restart always on a one-shot database migration container?

It finishes, exits 0, and is immediately restarted — running the migration in an endless loop
It runs once, exits cleanly with 0, and then stays stopped because always deliberately ignores clean exits
It reschedules the migration job across multiple hosts in the cluster to run all the copies in parallel
It retries the job up to five times whenever it fails, then finally gives up exactly like on-failure:5
