Chapter 11: Observability & Operations
Topic 68

Stats, Events, and Inspection

Operations · Diagnostics

When driftwood/web is slow or restarting, three built-in tools answer different questions without any external monitoring stack. docker stats shows live resource use, docker events streams what the daemon is doing as it happens, and docker inspect dumps the full runtime state of a container — including why it stopped. Add docker top for the process list inside a container, and you have enough to diagnose an OOM kill or a restart loop on one host before reaching for Prometheus.

The discipline these four enforce is to stop guessing. A container that keeps dying has a cause that is observable: an exit code, a memory column climbing toward a limit, a repeating event cycle. The mistake is to read the application code looking for a bug when the kernel killed the process for exceeding its memory cap. These commands point you at the real cause before you waste an hour on the wrong one.

Three commands, three questions
docker stats
Live resource use — per-container CPU, memory against its limit, and network and block I/O, updating about once a second.
docker events
The daemon event stream: start, die, oom, and health_status as they happen.
docker inspect
Point-in-time state — .State.OOMKilled, .State.ExitCode, and the rest of a container's runtime config.

docker stats — Live Resource Use

docker stats streams a live table — per-container CPU percentage, memory usage against its limit, network I/O, and block I/O — updating about once a second. It is the fast way to see driftwood/web pinned at 100% CPU, or sitting at the edge of its --memory limit moments before an OOM kill. Run it on a single container to watch one closely:

Stream live resource use for one container
docker stats web

The output refreshes in place, so you watch the numbers move rather than reading a snapshot. A container climbing steadily is telling you something the first line alone would not.

Reading Memory Against the Limit

The memory column shows usage and the cgroup limit side by side — 480MiB / 512MiB. A container creeping toward its limit is about to be OOM-killed by the kernel (Chapter 1, topic 03), and stats makes that approach visible before it happens rather than after. If the number is parked just under the cap and the container periodically dies, you have found the cause without reading a line of application code: the memory limit is too low for the workload, or the workload leaks.
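
To make the "parked just under the cap" check repeatable, a small filter can flag containers whose memory percentage is at or above a threshold. This is a minimal sketch, not part of Docker: the function name near_limit and the default threshold of 90 are illustrative assumptions, and it expects input lines of the form "name percent%", which is what the .Name and .MemPerc placeholders of docker stats --format produce.

```shell
# Print containers whose memory use is at or above a threshold percentage.
# Reads "<name> <percent>%" lines on stdin; threshold defaults to 90.
# near_limit is a hypothetical helper name, not a Docker command.
near_limit() {
  threshold="${1:-90}"
  awk -v t="$threshold" '{
    p = $2
    sub(/%/, "", p)                    # strip the trailing percent sign
    if (p + 0 >= t + 0) print $1, $2   # numeric comparison against the threshold
  }'
}

# Usage (needs a running Docker daemon):
#   docker stats --no-stream --format '{{.Name}} {{.MemPerc}}' | near_limit 90
```

Because --no-stream takes a single sample, run it a few times before drawing conclusions; one reading near the cap is a hint, a run of them is a pattern.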

docker events — The Daemon Event Stream

docker events prints a live feed of daemon-level activity: container start, die (with its exit code), oom, kill, health_status changes, image pulls, volume mounts. Watching it while a container crash-loops shows the exact die → start cycle and the exit code each time. Because it is a live stream, a past event is gone unless you ask for a window:

Replay daemon events for one container over a past window
docker events --since 30m --until 0s \
  --filter container=web

--since and --until turn the stream into a query over history, which is how you recover the die that already happened. Without them, events shows only what occurs from the moment you run it forward.
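
Once the window is replayed, it helps to count how often each exit code appeared. The sketch below assumes die events carry an exitCode attribute reachable as .Actor.Attributes.exitCode in the format template; tally_exits is a hypothetical helper name, and the input it expects is one "action exitCode" pair per line.

```shell
# Tally exit codes from "die" events.
# Reads "<action> <exitCode>" lines on stdin and prints "<exitCode> <count>".
# tally_exits is a hypothetical helper name, not a Docker command.
tally_exits() {
  awk '$1 == "die" { n[$2]++ } END { for (c in n) print c, n[c] }' | sort -n
}

# Usage (needs a running Docker daemon):
#   docker events --since 30m --until 0s \
#     --filter container=web --filter event=die \
#     --format '{{.Action}} {{.Actor.Attributes.exitCode}}' | tally_exits
```

A window dominated by a single code such as 137 points at one repeating cause rather than several unrelated failures.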

docker inspect — Full Runtime State

docker inspect returns a container's complete config and live state as JSON: .State.Status, .State.ExitCode, .State.OOMKilled, .State.Health, the top-level .RestartCount, plus its mounts, networks, and environment. Reading the whole blob is rarely what you want; --format with a Go template pulls out the one or two fields that settle the question:

Extract just the fields that distinguish an OOM kill from a crash
docker inspect \
  --format '{{.State.OOMKilled}} {{.State.ExitCode}} {{.RestartCount}}' \
  web

That one line tells you whether the kernel killed it, what code it exited with, and how many times it has restarted — the difference between "the app threw an exception" and "the kernel reaped it for exceeding its limit."
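
The two fields can be turned into a plain-language verdict. This is a sketch under stated assumptions: classify_exit and its output labels are hypothetical, and it relies on the common convention that an exit code above 128 means the process was ended by signal number (code minus 128), so 137 corresponds to signal 9 (SIGKILL) and 143 to signal 15 (SIGTERM).

```shell
# Classify a container exit from its OOMKilled flag and exit code.
# classify_exit and its labels are hypothetical, chosen for illustration.
classify_exit() {
  oom="$1"
  code="$2"
  if [ "$oom" = "true" ]; then
    echo "oom-kill"                    # the kernel enforced the memory limit
  elif [ "$code" -eq 0 ]; then
    echo "clean-exit"
  elif [ "$code" -gt 128 ]; then
    echo "signal-$((code - 128))"      # ended by a signal, e.g. 137 -> 9
  else
    echo "app-error-$code"             # the process chose this exit code itself
  fi
}

# Usage (needs a running Docker daemon):
#   classify_exit $(docker inspect --format '{{.State.OOMKilled}} {{.State.ExitCode}}' web)
```

Checking the OOMKilled flag first matters: an OOM kill usually also reports 137, so the exit code alone cannot separate it from any other SIGKILL.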

docker top — Processes Inside a Container

docker top web lists the processes running inside a container by running the host's ps and keeping only the processes that belong to that container. It confirms whether the container's main process is the expected gunicorn and not a stray shell, and whether a zombie or an unexpected child process has appeared. The PID column shows host PIDs, so the process that is PID 1 inside the container appears under a different number. It is a direct view of the fact that a container is just host processes (Chapter 1, topic 03), selected by container membership so you never mis-attribute a process to the wrong container the way filtering the host's ps by guesswork would.
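
docker top accepts ps options after the container name, which makes the zombie check scriptable. The sketch below assumes a procps-style ps on the host and the column order pid, ppid, stat, comm; find_zombies is a hypothetical helper that prints the PID and command of every process whose state starts with Z.

```shell
# List zombie processes from ps-style output.
# Expects a header row, then columns: PID PPID STAT COMMAND.
# find_zombies is a hypothetical helper name, not a Docker command.
find_zombies() {
  awk 'NR > 1 && $3 ~ /^Z/ { print $1, $4 }'
}

# Usage (needs a running Docker daemon):
#   docker top web -eo pid,ppid,stat,comm | find_zombies
```

Empty output means no zombies were present at that instant; a zombie that persists across several runs suggests the main process is not reaping its children.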

Spotting OOM and Restart Loops

Four views converge on one diagnosis. docker ps shows a Restarting status or an uptime that keeps resetting. inspect reports OOMKilled: true, a non-zero ExitCode, and a climbing RestartCount. events shows the repeating oom → die → start cycle with the exit code each time. And stats confirms memory pressure by showing usage pinned at the cgroup limit. Four views, one cause: once you have seen them line up, an OOM kill stops looking like a mysterious crash and becomes a memory-limit configuration you can fix.

Common Mistakes
  • Reading docker stats once and treating the first sample as steady state — the initial CPU sample is often skewed; the stream needs a few seconds to settle, and a single reading misleads.
  • Ignoring .State.OOMKilled and chasing an application bug when the kernel killed the container for exceeding its memory limit — the exit looks like a crash, but the cause is the cgroup cap, not the code.
  • Forgetting that docker events is a live stream and missing past events — without --since and --until it shows only what happens from now on, so the die that already occurred is gone unless you query a time window.
  • Running docker stats across hundreds of containers as a monitoring strategy — it is an interactive diagnostic, not a metrics pipeline, and gives no history, alerting, or aggregation.
  • Confusing docker top with filtering the host's ps by guesswork — top maps the container's namespace directly, so it never mis-attributes a process to the wrong container the way a host-level filter can.
Best Practices
  • Reach for docker stats first when a container is slow or restarting, to see CPU and memory-against-limit live before instrumenting anything.
  • Check docker inspect --format '{{.State.OOMKilled}} {{.State.ExitCode}}' on any container that exited unexpectedly, so an OOM kill is not mistaken for an application crash.
  • Tail docker events while reproducing a crash loop to capture the exact die exit codes and the oom and health_status transitions in order.
  • Treat these as single-host diagnostics and graduate to a metrics stack — cAdvisor with Prometheus, or Kubernetes-native monitoring (Chapter 12, topic 76) — once you need history, alerting, or fleet-wide views.
Comparable Tools
  • cAdvisor · Prometheus · Grafana: the standard step up from docker stats for history and alerting.
  • ctop: a top-like TUI over the same per-container data.
  • nerdctl · Podman: mirror stats, events, inspect, and top.

Knowledge Check

A container keeps exiting and you suspect the kernel killed it. Which check distinguishes an OOM kill from an application crash?

  • Read .State.OOMKilled via docker inspect: true means the kernel killed it for exceeding its memory limit
  • Read docker logs — an OOM kill always prints a distinctive Python traceback the application cannot suppress
  • Run docker top — an OOM kill always leaves a telltale zombie process still visible in the listing
  • Read .RestartCount — any non-zero restart count means the kernel was the one that OOM-killed it

Why does docker events often show nothing relevant when you run it after a container has already crashed?

  • It is a live stream that shows only events from the moment you run it; past events need a --since/--until window
  • It only reports events for containers that are currently running, so a container that has already crashed is excluded entirely
  • The logging driver must be set to json-file before events has any container activity to show
  • It shows only broad daemon-wide events and never per-container lifecycle ones like die

What does the memory column in docker stats let you see before an OOM kill happens?

  • Usage shown against the container's cgroup limit, so you watch it climb toward the cap that will trigger the kill
  • Only the total host RAM in use, which on its own tells you nothing about any individual container's risk of being killed
  • A predicted future timestamp for the exact moment the kernel will issue the OOM kill
  • The OOMKilled flag itself, which flips to true a few seconds before the process actually dies

Why is docker stats the wrong tool to run across hundreds of containers as a monitoring strategy?

  • It is an interactive diagnostic with no history, alerting, or aggregation — fleet monitoring needs a metrics pipeline
  • It hard-refuses to run on more than a small fixed number of containers at once, strictly capping how much you can watch live
  • It depends on the configured logging driver and silently breaks whenever the logs are shipped to a remote backend
  • It restarts each container in turn to sample its resource use, disrupting the running workload every time
