Topic 68

Stats, Events, and Inspection

Operations · Diagnostics

When driftwood/web is slow or restarting, three built-in tools answer different questions without any external monitoring stack. docker stats shows live resource use, docker events streams what the daemon is doing as it happens, and docker inspect dumps the full runtime state of a container — including why it stopped. Add docker top for the process list inside a container, and you have enough to diagnose an OOM kill or a restart loop on one host before reaching for Prometheus.

The discipline these four enforce is to stop guessing. A container that keeps dying has a cause that is observable: an exit code, a memory column climbing toward a limit, a repeating event cycle. The mistake is to read the application code looking for a bug when the kernel killed the process for exceeding its memory cap. These commands point you at the real cause before you waste an hour on the wrong one.

Three commands, three questions

docker stats

Live resource use — per-container CPU, memory against its limit, and network and block I/O, updating about once a second.

docker events

The daemon event stream — start, die, oom, and health_status as they happen.

docker inspect

Point-in-time state — .State.OOMKilled, .State.ExitCode, and the rest of a container's runtime config.

docker stats — Live Resource Use

docker stats streams a live table — per-container CPU percentage, memory usage against its limit, network I/O, and block I/O — updating about once a second. It is the fast way to see driftwood/web pinned at 100% CPU, or sitting at the edge of its --memory limit moments before an OOM kill. Run it on a single container to watch one closely:

Stream live resource use for one container

docker stats web

The output refreshes in place, so you watch the numbers move rather than reading a snapshot. A container climbing steadily is telling you something the first line alone would not.
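Once the live view has pointed at a suspect, a one-shot sample is easier to paste into a ticket or pipe into another command. The sketch below wraps docker stats with its --no-stream and --format options; the function name stats_snapshot and the choice of columns are illustrative assumptions, not something this topic prescribes.

```shell
# stats_snapshot: print a single sample of CPU and memory for the named
# containers, or for every running container when no name is given.
# The function name and the chosen columns are illustrative.
stats_snapshot() {
  docker stats --no-stream \
    --format 'table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}' \
    "$@"
}

# Usage (needs a running Docker daemon):
#   stats_snapshot web
```

A snapshot is a record of one moment, not a trend, so it complements the streaming view rather than replacing it.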

Reading Memory Against the Limit

The memory column shows usage and the cgroup limit side by side — 480MiB / 512MiB. A container creeping toward its limit is about to be OOM-killed by the kernel (Chapter 1, topic 03), and stats makes that approach visible before it happens rather than after. If the number is parked just under the cap and the container periodically dies, you have found the cause without reading a line of application code: the memory limit is too low for the workload, or the workload leaks.
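The same reading can be done mechanically. In the example above, 480MiB of 512MiB is 93.75% of the limit, which is the figure docker stats exposes through its MemPerc format field. The following is a minimal sketch, assuming a hypothetical helper named near_limit and an arbitrary 90% threshold; it only filters text, so the docker call stays in the usage comment.

```shell
# near_limit THRESHOLD: read "name percent%" lines on stdin and print the
# containers whose memory use is at or above THRESHOLD percent of their limit.
# The helper name and the threshold you pass are illustrative choices.
near_limit() {
  awk -v t="$1" '{
    pct = $2
    sub(/%/, "", pct)              # "93.75%" -> "93.75"
    if (pct + 0 >= t + 0) print $1, $2
  }'
}

# Usage (needs a running Docker daemon):
#   docker stats --no-stream --format '{{.Name}} {{.MemPerc}}' | near_limit 90
```

Keep in mind that a container started without --memory reports the host's total memory as its limit, so a low percentage there says nothing about a per-container cap.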

docker events — The Daemon Event Stream

docker events prints a live feed of daemon-level activity: container start, die (with its exit code), oom, kill, health_status changes, image pulls, volume mounts. Watching it while a container crash-loops shows the exact die→start cycle and the exit code each time. Because it is a live stream, a past event is gone unless you ask for a window:

Replay daemon events for one container over a past window

docker events --since 30m --until "$(date +%s)" \
  --filter container=web

--since and --until turn the stream into a query over history, which is how you recover the die that already happened. Without them, events shows only what occurs from the moment you run it forward.
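A replayed window gets easier to read when it is narrowed to the events that matter and reduced to a tally. The sketch below assumes a hypothetical helper, tally_exit_codes, that counts die events per exit code. The usage comment shows one way to feed it, relying on the exitCode attribute that die events carry; the event names and the container name web come from the text above.

```shell
# tally_exit_codes: read "action exitCode" lines on stdin and count how many
# die events carried each exit code. Lines for other actions (such as oom)
# are skipped. The helper name is illustrative.
tally_exit_codes() {
  awk '$1 == "die" { n[$2]++ } END { for (c in n) print c, n[c] }' | sort -n
}

# Usage (needs a running Docker daemon):
#   docker events --since 30m --until "$(date +%s)" \
#     --filter container=web --filter event=die --filter event=oom \
#     --format '{{.Action}} {{.Actor.Attributes.exitCode}}' | tally_exit_codes
```

Repeating --filter with the same key (event=die, event=oom) matches either value, while filters with different keys must all match.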

docker inspect — Full Runtime State

docker inspect returns a container's complete config and live state as JSON: .State.Status, .State.ExitCode, .State.OOMKilled, .State.Health (present only when a health check is configured), the top-level .RestartCount, plus its mounts, networks, and environment. Reading the whole blob is rarely what you want; --format with a Go template pulls out the one or two fields that settle the question:

Extract just the fields that distinguish an OOM kill from a crash

docker inspect \
  --format '{{.State.OOMKilled}} {{.State.ExitCode}} {{.RestartCount}}' \
  web

That one line tells you whether the kernel killed it, what code it exited with, and how many times it has restarted — the difference between "the app threw an exception" and "the kernel reaped it for exceeding its limit."
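Reading those fields follows a small decision rule, sketched below as a hypothetical helper named classify_exit. The verdict wording is illustrative, and the signal branch relies on the common convention that an exit code above 128 means 128 plus the signal number (137 is SIGKILL, 143 is SIGTERM); an application can also return such a code on its own, so treat that branch as a hint rather than proof.

```shell
# classify_exit OOMKILLED EXITCODE: turn the two inspect fields into a
# one-line reading. Helper name and verdict wording are illustrative.
classify_exit() {
  oom=$1
  code=$2
  if [ "$oom" = "true" ]; then
    echo "oom-kill: the kernel killed a process at the memory limit"
  elif [ "$code" -eq 0 ]; then
    echo "clean exit: the main process returned 0"
  elif [ "$code" -gt 128 ]; then
    echo "signal $((code - 128)): likely terminated by a signal, not flagged as OOM"
  else
    echo "application exit: the main process returned $code"
  fi
}

# Usage (needs a running Docker daemon):
#   classify_exit $(docker inspect --format '{{.State.OOMKilled}} {{.State.ExitCode}}' web)
```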

docker top — Processes Inside a Container

docker top web lists the processes that belong to a container by running the host's ps and keeping only that container's processes. The PIDs it prints are host PIDs, so the container's PID 1 shows up under a different number; its host PID is the value of .State.Pid in docker inspect. Use the listing to confirm that the main process is the expected gunicorn and not a stray shell, and to check whether a zombie or an unexpected child process has appeared. It is a direct view of the fact that a container is just host processes (Chapter 1, topic 03), selected by the daemon so you never mis-attribute a process to the wrong container the way filtering the host's ps by guesswork would.
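docker top accepts ps options after the container name, which makes it possible to request a process-state column and scan it for defunct processes. The sketch below assumes a procps-style ps on the host and a hypothetical helper named find_zombies; note that the PIDs docker top prints are host PIDs, and .State.Pid from docker inspect gives the host PID of the container's main process.

```shell
# find_zombies: read ps-style output (header row first) on stdin and print
# the rows whose STAT column starts with Z, the state ps reports for
# defunct (zombie) processes. The helper name is illustrative.
find_zombies() {
  awk 'NR == 1 { for (i = 1; i <= NF; i++) if ($i == "STAT") col = i; next }
       col && $col ~ /^Z/ { print }'
}

# Usage (needs a running Docker daemon and a procps-style ps on the host):
#   docker top web -eo pid,ppid,stat,comm | find_zombies
#   docker inspect --format '{{.State.Pid}}' web   # host PID of the main process
```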

Spotting OOM and Restart Loops

These tools converge on one diagnosis. docker ps shows a Restarting status or an uptime that keeps resetting to a few seconds. inspect reports OOMKilled: true, typically with ExitCode 137 (128 plus signal 9, SIGKILL), or some other non-zero ExitCode for a plain crash. events shows the repeating oom→die→start cycle with the exit code each time. And stats confirms memory pressure by showing usage pinned at the cgroup limit. Four views, one cause: once you have seen them line up, an OOM kill stops looking like a mysterious crash and becomes a memory-limit configuration you can fix.
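The first two views can be chained into a quick triage pass. The sketch below, built around a hypothetical function named triage, lists containers that docker ps currently reports as restarting and prints the OOM flag, last exit code, and restart count for each. A crash-looping container alternates between running and restarting, so an empty result on one run does not rule a loop out.

```shell
# triage: for every container currently in the restarting state, print its
# name followed by the OOM flag, last exit code, and restart count.
# The function name and output layout are illustrative.
triage() {
  docker ps --filter status=restarting --format '{{.Names}}' |
  while read -r name; do
    printf '%s ' "$name"
    docker inspect \
      --format 'oom={{.State.OOMKilled}} exit={{.State.ExitCode}} restarts={{.RestartCount}}' \
      "$name"
  done
}

# Usage (needs a running Docker daemon):
#   triage
```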

Common Mistakes

Reading docker stats once and treating the first sample as steady state — the initial CPU sample is often skewed; the stream needs a few seconds to settle, and a single reading misleads.
Ignoring .State.OOMKilled and chasing an application bug when the kernel killed the container for exceeding its memory limit — the exit looks like a crash, but the cause is the cgroup cap, not the code.
Forgetting that docker events is a live stream and missing past events — without --since and --until it shows only what happens from now on, so the die that already occurred is gone unless you query a time window.
Running docker stats across hundreds of containers as a monitoring strategy — it is an interactive diagnostic, not a metrics pipeline, and gives no history, alerting, or aggregation.
Confusing docker top with filtering the host's ps by guesswork: top asks the daemon which host processes belong to the container, so it never mis-attributes a process to the wrong container the way a host-level filter can.

Best Practices

Reach for docker stats first when a container is slow or restarting, to see CPU and memory-against-limit live before instrumenting anything.
Check docker inspect --format '{{.State.OOMKilled}} {{.State.ExitCode}}' on any container that exited unexpectedly, so an OOM kill is not mistaken for an application crash.
Tail docker events while reproducing a crash loop to capture the exact die exit codes and the oom and health_status transitions in order.
Treat these as single-host diagnostics and graduate to a metrics stack — cAdvisor with Prometheus, or Kubernetes-native monitoring (Chapter 12, topic 76) — once you need history, alerting, or fleet-wide views.

Comparable tools

cAdvisor · Prometheus · Grafana: the standard step up from docker stats for history and alerting.
ctop: a top-like TUI over the same per-container data.
nerdctl · Podman: mirror stats, events, inspect, and top.

Knowledge Check

A container keeps exiting and you suspect the kernel killed it. Which check distinguishes an OOM kill from an application crash?

Read .State.OOMKilled via docker inspect — true means the kernel killed it for exceeding its memory limit
Read docker logs — an OOM kill always prints a distinctive Python traceback the application cannot suppress
Run docker top — an OOM kill always leaves a telltale zombie process still visible in the listing
Read .RestartCount: any non-zero restart count means the kernel was the one that OOM-killed it

Why does docker events often show nothing relevant when you run it after a container has already crashed?

It is a live stream that shows only events from the moment you run it; past events need a --since/--until window
It only reports events for containers that are currently running, so a container that has already crashed is excluded entirely
The logging driver must be set to json-file before events has any container activity to show
It shows only broad daemon-wide events and never per-container lifecycle ones like die

What does the memory column in docker stats let you see before an OOM kill happens?

Usage shown against the container's cgroup limit, so you watch it climb toward the cap that will trigger the kill
Only the total host RAM in use, which on its own tells you nothing about any individual container's risk of being killed
A predicted future timestamp for the exact moment the kernel will issue the OOM kill
The OOMKilled flag itself, which flips to true a few seconds before the process actually dies

Why is docker stats the wrong tool to run across hundreds of containers as a monitoring strategy?

It is an interactive diagnostic with no history, alerting, or aggregation — fleet monitoring needs a metrics pipeline
It hard-refuses to run on more than a small fixed number of containers at once, strictly capping how much you can watch live
It depends on the configured logging driver and silently breaks whenever the logs are shipped to a remote backend
It restarts each container in turn to sample its resource use, disrupting the running workload every time
