Chapter Eleven
Observability & Operations
Keeping one Docker host alive in production: bounding logs before they fill the disk, healthchecks that report but do not act on their own, the built-in stats-events-inspect tools that diagnose an OOM kill without a metrics stack, pruning the four things that silently eat disk, configuring the daemon, and debugging a container that won't stay up.
Building and running driftwood/web is one job; keeping it running on a host that doesn't fall over is another. This chapter is the operations side of a single Docker host: the decisions you make once a container is meant to live for weeks instead of minutes. It covers logs that grow until the disk is full, a healthcheck that reports a dead service but does nothing about it, a host that quietly accumulates dangling images and stopped containers until /var/lib/docker is out of space, and a container that crash-loops at 3am and leaves you exactly four built-in tools to find out why.
The through-line is that Docker on one host reports more than it acts. It captures your logs but won't rotate them unless you tell it to; it computes a health status but won't restart on it; it tracks disk usage but won't reclaim it on a schedule. Each topic here is a place where the default is "Docker tells you" and the operator's job is to wire up "and then something happens." Where the answer is genuinely fleet-scale — cluster-wide log collection, probes that act on failure, history and alerting — the chapter points at Kubernetes (Chapter 12, topic 76) rather than pretending a single host scales there.
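As a rough illustration of "Docker tells you, the operator wires it up" for logs, the sketch below builds a daemon.json fragment that bounds the json-file driver and estimates the worst-case disk use per container. The option names log-driver, log-opts, max-size and max-file are Docker's; the helper names, the 10m and 3 values, and the treatment of size suffixes as binary multiples are assumptions made for the example. Daemon-level defaults like these apply only to containers created after the change.

```python
import json

# Assumption: size suffixes are binary multiples (k = 1024 bytes, and so on).
UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3}


def parse_size(text):
    """Convert a size string such as '10m' into a number of bytes."""
    text = text.strip().lower()
    if text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)


def worst_case_log_bytes(log_opts):
    """Approximate upper bound for one container: max-size times max-file."""
    return parse_size(log_opts["max-size"]) * int(log_opts["max-file"])


# Illustrative limits only; choose values that fit your own disk.
# Note that log-opts values are written as strings, including the file count.
daemon_config = {
    "log-driver": "json-file",
    "log-opts": {"max-size": "10m", "max-file": "3"},
}

print(json.dumps(daemon_config, indent=2))
print(worst_case_log_bytes(daemon_config["log-opts"]), "bytes per container")
```

Multiplying that per-container bound by the number of long-lived containers on the host gives a first estimate of how much of the disk logs can claim.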
Topics in This Chapter
Logs: the json-file default grows unbounded until the disk fills; the local driver rotates by default. Which drivers docker logs can read, and why daemon defaults skip running containers.
Healthchecks: why Docker marks a container unhealthy but won't restart on it, and how the check gates Compose depends_on and pairs with a restart policy.
Built-in tools: docker stats for live resource use, docker events for the daemon stream, docker inspect for full runtime state, and docker top for processes inside. Converging four views on one cause: an OOM kill or a restart loop.
Pruning: docker system df accounts for the space; system prune -a --volumes reclaims it, and can delete your database.
Configuring the daemon: dockerd. What daemon.json controls, why overlay2 is the storage driver to use, what lives in /var/lib/docker and survives a restart, and why a trailing comma in the config takes every container offline.
Debugging a container that won't stay up: … .State. Reading 137 as an OOM kill, exec'ing into a running container, docker debug for a distroless image with no shell, and reproducing a failed build from the last cached layer.
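To make the "reading 137" item concrete: exit codes above 128 conventionally mean the process was ended by signal number (code minus 128), so 137 is 128 + 9, SIGKILL. The sketch below is a minimal, hypothetical helper that reads a dictionary shaped like the State object from docker inspect (the ExitCode and OOMKilled fields) and summarises it. The function name, the signal table, and the returned wording are illustrative and not part of Docker.

```python
# Common Linux signal numbers; a deliberately small table for the example.
LINUX_SIGNALS = {2: "SIGINT", 9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}


def describe_exit(state):
    """Summarise a dict shaped like the State object from docker inspect."""
    code = state.get("ExitCode", 0)
    # OOMKilled is checked first: a 137 alone only says "SIGKILL".
    if state.get("OOMKilled"):
        return f"exit {code}: killed by the kernel OOM killer"
    if code > 128:
        signum = code - 128
        name = LINUX_SIGNALS.get(signum, f"signal {signum}")
        return f"exit {code}: terminated by {name}"
    if code == 0:
        return "exit 0: clean exit"
    return f"exit {code}: application error"


# Hand-written sample states, not captured from a real container.
print(describe_exit({"ExitCode": 137, "OOMKilled": True}))
print(describe_exit({"ExitCode": 137, "OOMKilled": False}))
print(describe_exit({"ExitCode": 143, "OOMKilled": False}))
```

On a real host, the input would come from something like docker inspect --format '{{json .State}}' followed by the container name. Checking OOMKilled as well as the exit code matters because a 137 can also come from docker kill or any other SIGKILL.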