Restart Policies and Exit Codes
A container's exit code is its process's exit code, and it decides whether Docker should bring the container back. A restart policy, set with --restart, tells the daemon what to do when the main process exits: leave it stopped, restart it always, or restart it only on failure. Set this wrong and you get one of two failures — a crashed service that stays down silently, or a run-once job that thrashes in a restart loop forever.
The exit code and the policy work together. The code tells you why the process died; the policy decides what happens next. Reading the code first, before assuming a bug, saves you from chasing an application fault that was really the kernel doing the killing.
Exit Codes Carry Meaning
Exit 0 is success; any non-zero code is a failure of some kind. A handful are worth memorizing. 137 means the process was SIGKILLed (128 + 9) — typically an OOM kill from the memory cap in the next topic, or a docker kill. 143 means it took SIGTERM (128 + 15) and exited — a clean stop. 139 is a segfault (128 + 11). docker ps -a and docker inspect both surface the code, and it is the first clue to why a container died.
$ docker inspect --format '{{.State.ExitCode}}' driftwood-web
137
$ docker inspect --format '{{.State.OOMKilled}}' driftwood-web
true # 137 + OOMKilled → the kernel OOM-killed it, not a bug
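The 128 + N arithmetic above can be sketched as a small shell function. This is a minimal illustration, not part of Docker: the name explain_exit_code is hypothetical, and it assumes the usual convention that a code above 128 means the process died from signal number (code - 128). An application can also return such a code on its own, so treat the result as a first clue rather than proof.

```shell
# Hypothetical helper: turn a container exit code into a short explanation.
# Assumes the common convention that code > 128 means "killed by signal code-128".
explain_exit_code() {
  code=$1
  if [ "$code" -eq 0 ]; then
    echo "success"
  elif [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    case "$sig" in
      9)  echo "killed by signal 9 (SIGKILL)" ;;
      11) echo "killed by signal 11 (SIGSEGV)" ;;
      15) echo "killed by signal 15 (SIGTERM)" ;;
      *)  echo "killed by signal $sig" ;;
    esac
  else
    echo "application error (exit $code)"
  fi
}
```

Feed it the value printed by docker inspect, for example explain_exit_code 137. The exit code alone only says SIGKILL; whether the kernel OOM killer sent it still needs the separate .State.OOMKilled check.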
The Four Policies
There are four. no is the default and never restarts. on-failure[:N] restarts only on a non-zero exit, up to N times. always restarts whenever the container stops — even on a clean exit 0 — and again whenever the daemon starts. unless-stopped behaves like always but does not restart a container you deliberately stopped before a reboot. The whole choice between always and unless-stopped is one question: should a manual stop survive a daemon restart.
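--restart policies: when each one brings the container back
- no: never restarts (the default).
- on-failure[:N]: restarts only on a non-zero exit, up to N times.
- always: restarts on any exit, including a clean exit 0, and again on daemon start, even after a manual stop.
- unless-stopped: like always, but a container you deliberately stopped stays stopped across a reboot.
Restart Backoff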
The daemon does not restart a crashing container in a tight loop. It backs off with an increasing delay — starting around 100ms and doubling each attempt — so a container that crashes immediately on start does not hammer the host with thousands of restarts per second. The visible effect is that a crash-looping container restarts more slowly over time, which is the daemon protecting the machine, not the container recovering.
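To make the doubling concrete, the sketch below prints the delay before each restart attempt. The 100ms start and the doubling come from the description above. The function name backoff_schedule and the 60000ms ceiling are assumptions added for illustration, so treat the cap as a placeholder rather than a documented value.

```shell
# Hypothetical helper: print the wait before each restart attempt.
# $1 = number of attempts, $2 = optional ceiling in ms (assumed default: 60000).
backoff_schedule() {
  attempts=$1
  cap_ms=${2:-60000}
  delay_ms=100   # starting delay, per the text above
  i=1
  while [ "$i" -le "$attempts" ]; do
    echo "attempt $i: wait ${delay_ms}ms"
    delay_ms=$((delay_ms * 2))          # double after every attempt
    if [ "$delay_ms" -gt "$cap_ms" ]; then
      delay_ms=$cap_ms                  # never wait longer than the ceiling
    fi
    i=$((i + 1))
  done
}
```

Running backoff_schedule 5 shows why a crash loop looks slower over time: the waits grow from 100ms to 1600ms within five attempts, even though the container itself is no closer to recovering.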
Restart Policy vs HEALTHCHECK
A restart policy reacts to the process exiting. A HEALTHCHECK (Chapters 5 and 11) reacts to the process being alive but unhealthy — hung, deadlocked, holding the port open but answering nothing. A wedged gunicorn that never exits will not trigger a restart policy at all, because nothing exited; the policy has no event to fire on. That gap is the limit of what a restart policy can do on a single host, and it is why a liveness check is a separate mechanism.
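The split between the two mechanisms can be seen by reading both signals from docker inspect. The sketch below is a hypothetical helper (failure_mode is not a Docker command) that reports which mechanism can see a given container's failure. It assumes the container name you pass exists, and it guards the health lookup because .State.Health is only populated when a HEALTHCHECK is defined.

```shell
# Hypothetical helper: say which mechanism can see this container's failure.
failure_mode() {
  name=$1
  # Process-level state: running, exited, and so on.
  state=$(docker inspect --format '{{.State.Status}}' "$name") || return 1
  # Health is empty without a HEALTHCHECK, so fall back to "none".
  health=$(docker inspect --format '{{if .State.Health}}{{.State.Health.Status}}{{else}}none{{end}}' "$name") || return 1

  case "$state:$health" in
    exited:*)
      echo "exited: the restart policy decides what happens next" ;;
    running:unhealthy)
      echo "running but unhealthy: nothing exited, so the restart policy will not fire" ;;
    running:none)
      echo "running with no HEALTHCHECK: a hang would go unnoticed" ;;
    *)
      echo "state=$state health=$health" ;;
  esac
}
```

For a wedged gunicorn with a HEALTHCHECK, failure_mode driftwood-web would land in the second branch: the process is still there, so only the health status shows the problem.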
Where Restart Policies Stop
--restart is a single-host, single-container mechanism the daemon enforces. It does not reschedule the container onto another node, replace a failed host, or maintain a replica count — it brings this container back on this machine and nothing more. When you need rescheduling, replicas, or host failure handling, you have crossed into orchestration (Kubernetes, Chapter 12), not a bigger restart flag. The restart policy is where single-host resilience ends.
- --restart always brings the container back on any exit and again on daemon start, even after you stopped it manually. Use it for services that must always run regardless of how they went down.
- --restart unless-stopped is the same as always but respects a manual docker stop across a reboot. Usually the better default for a long-lived service like Driftwood web, because a deliberate stop stays stopped.
- --restart on-failure:5 restarts only on a non-zero exit, capped at five attempts. Use it for jobs that should retry a few times on transient failure, then give up rather than loop forever.
- --restart no is the default and never restarts. Use it for one-off commands and interactive runs you don't want resurrected.
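A policy can also be checked and changed on a container that already exists, using docker inspect and docker update --restart. The sketch below wraps those two commands in a hypothetical helper, ensure_restart_policy. It compares only the policy name, so it suits no, always, and unless-stopped; an on-failure retry count lives in a separate field and is not handled here. Some Docker versions may report the default policy as an empty string rather than no, which this sketch does not account for.

```shell
# Hypothetical helper: set a container's restart policy only if it differs.
# $1 = container name, $2 = wanted policy (no, always, or unless-stopped).
ensure_restart_policy() {
  name=$1
  wanted=$2
  # Read the policy the daemon currently has recorded for this container.
  current=$(docker inspect --format '{{.HostConfig.RestartPolicy.Name}}' "$name") || return 1
  if [ "$current" = "$wanted" ]; then
    echo "$name already uses $wanted"
  else
    # docker update changes the policy in place, without recreating the container.
    docker update --restart "$wanted" "$name" >/dev/null || return 1
    echo "$name: $current -> $wanted"
  fi
}
```

For example, ensure_restart_policy driftwood-web unless-stopped would move a container left on the default to the policy recommended above for long-lived services.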
- Leaving the Driftwood web container on the default --restart no and assuming it'll come back after a crash or a host reboot. It stays exited and the service is down until someone notices.
- Putting --restart always on a one-shot job or migration container. A clean exit 0 triggers a restart, so the job runs forever in a loop instead of finishing once.
- Reading a 137 exit and assuming the app crashed when it was OOM-killed by the cgroup memory limit (next topic). The code points at the kernel killing it, not a bug in the code.
- Relying on a restart policy to recover a hung-but-alive process. The process never exits, so the policy never fires; a HEALTHCHECK is what covers liveness, not a restart policy.
- Choosing always when you actually want a manual stop to stick across reboots. After a daemon restart the container comes back even though you stopped it on purpose; unless-stopped is what respects the stop.
- Set --restart unless-stopped on long-lived services like Driftwood web so they survive crashes and reboots but still honor a deliberate stop.
- Use --restart on-failure:N for retryable jobs so transient failures retry but a genuinely broken job gives up instead of looping forever.
- Read the exit code first when a container dies (docker inspect --format '{{.State.ExitCode}}'), mapping 137 to SIGKILL or OOM and 143 to a clean SIGTERM, before assuming an application bug.
- Pair a restart policy with a HEALTHCHECK (Chapter 11) so you cover both exited (policy) and hung-but-alive (health) failure modes, since the policy alone can't see a deadlock.
- Podman: supports --restart policies and adds podman generate systemd for boot persistence
- systemd service units: the host-level analog of unless-stopped
- Kubernetes: restartPolicy plus controllers (replica count, rescheduling) are the multi-host version (Ch12)
Knowledge Check
What is the difference between --restart always and --restart unless-stopped?
- unless-stopped won't restart a container you manually stopped across a reboot, while always does
- always restarts the container only on a non-zero failure exit, while unless-stopped restarts on any exit at all
- always retries a fixed five times and then gives up, while unless-stopped goes on retrying forever
- unless-stopped watches the container's HEALTHCHECK for liveness while always only ever watches the exit code
Your container keeps dying with exit code 137 and OOMKilled: true. What does that tell you?
- The kernel's OOM killer SIGKILLed it for exceeding the memory limit — not an application crash
- The application itself has a bug that crashes it on exit, and 137 is the error code it deliberately returns
- It received a clean SIGTERM and shut itself down gracefully, since 137 is the code for a normal stop
- It segfaulted on a bad pointer, since 137 is the standard exit code for a memory access violation
Why can't a restart policy recover a gunicorn that has deadlocked but never exits?
- A restart policy fires only on the process exiting, and a hung process never exits — liveness needs a check
- The restart backoff delay grows too long, so the policy gives up and stops retrying before the deadlock clears
- The policy only watches CPU usage to decide, and a deadlocked process is still busy-looping and using CPU
- Only the unless-stopped policy can detect a hang like this, and the container was instead set to always
What happens when you set --restart always on a one-shot database migration container?
- It finishes, exits 0, and is immediately restarted, running the migration in an endless loop
- It runs once, exits cleanly with 0, and then stays stopped because always deliberately ignores clean exits
- It reschedules the migration job across multiple hosts in the cluster to run all the copies in parallel
- It retries the job up to five times whenever it fails, then finally gives up exactly like on-failure:5