Chapter 11: Observability & Operations
Topic 71

Debugging Containers


A container that won't stay up gives you more to work with than it looks. The exit code says how it died, docker logs says what it said on the way out, and docker inspect .State says whether the kernel killed it or it crashed on its own. Three artifacts, three different questions, and together they classify almost any failure before you touch the application code.

Triaging a crash, in order: container exited → read exit code → docker logs → inspect .State → exec / docker debug

For a container that is still running, you exec a shell into it and look around. When driftwood/web is distroless or built from scratch and has no shell to exec (Chapter 2, topic 12), docker debug or an ephemeral debug container attaches the tools the image deliberately left out, so the production image stays lean and you still have a way in when it misbehaves.

Why It Exited — Start With the Exit Code

docker ps -a and docker inspect give the exit code, and the code narrows the cause before you read a single log line. The common ones each tell a story:

Read the exit code and the OOM flag first
docker inspect --format '{{.State.ExitCode}} oom={{.State.OOMKilled}}' web

0 is a clean exit; a non-zero code below 128 is normally the application's own failure code; a code above 128 means the process died from a signal, and the value is 128 plus the signal number. So 137 is SIGKILL (128 + 9), usually the OOM killer or a docker stop that timed out, which is why you cross-check .State.OOMKilled; 143 is SIGTERM (128 + 15), the normal stop signal; 139 is SIGSEGV (128 + 11), a segfault. The number alone often tells you whether to read the logs or the memory limit.
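The reading of exit codes above can be sketched as a small shell helper. The function name explain_exit_code is hypothetical, and the fallback branch leans on the usual convention that a code above 128 is 128 plus the signal number; treat it as a triage aid, not a complete table.

```shell
# explain_exit_code CODE: print a one-line reading of a container exit code.
# (Hypothetical helper; covers only the codes discussed in this topic.)
explain_exit_code() {
  code=$1
  case "$code" in
    0)   echo "clean exit" ;;
    137) echo "SIGKILL (9): check .State.OOMKilled, or a docker stop that timed out" ;;
    139) echo "SIGSEGV (11): segfault" ;;
    143) echo "SIGTERM (15): stop signal" ;;
    *)
      if [ "$code" -gt 128 ]; then
        # Convention: exit code = 128 + signal number
        echo "killed by signal $((code - 128))"
      else
        echo "non-signal exit code $code: read docker logs"
      fi
      ;;
  esac
}
```

A typical call feeds it the value from inspect: explain_exit_code "$(docker inspect --format '{{.State.ExitCode}}' web)".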

docker logs and .State

docker logs <container> replays the captured STDOUT and STDERR, driver permitting (topic 66), and that output usually carries the stack trace or the fatal message. docker inspect .State adds the context the logs lack: Status, Error, OOMKilled, StartedAt, and FinishedAt, alongside RestartCount, which sits at the top level of the inspect output rather than under .State. Read together, they distinguish "the app threw an exception" from "the kernel killed it" from "it never started at all", three failures that can look identical from the outside.
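As a minimal sketch, the fields named above can be pulled in one inspect call. The function name state_summary is hypothetical; the template fields are the ones listed in the text, with the restart counter read from the top-level .RestartCount.

```shell
# state_summary NAME: print the container's state fields, one per line.
# (Hypothetical helper name; NAME is a container name or ID.)
state_summary() {
  docker inspect --format \
'status={{.State.Status}}
exit={{.State.ExitCode}}
oom={{.State.OOMKilled}}
error={{.State.Error}}
started={{.State.StartedAt}}
finished={{.State.FinishedAt}}
restarts={{.RestartCount}}' "$1"
}
```

Run as state_summary web. A non-empty error with a started time that never moved points at "never started"; oom=true points at the kernel; otherwise the logs are the next stop.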

Crash Loops

A container with restart: always or unless-stopped that keeps dying shows a climbing RestartCount and a Restarting status. The trouble is that the logs scroll past fast and the container is never sitting still when you look. Docker restarts the same container rather than creating a new one, so docker logs still holds the output of earlier runs for as long as the container exists. The way to read a loop that won't hold still is to pull that output with docker logs and pair it with docker events (topic 68), which records the exit code of every die event, so the failure reason survives each restart instead of vanishing.
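A sketch of that capture, assuming a container named web. The helper names watch_deaths and recent_output are hypothetical; the exitCode attribute is what Docker attaches to container die events.

```shell
# watch_deaths NAME: stream one line per "die" event for the container,
# with the exit code taken from the event attributes. Runs until Ctrl-C.
watch_deaths() {
  docker events \
    --filter "container=$1" \
    --filter "event=die" \
    --format '{{.Time}} {{.Actor.Attributes.name}} exit={{.Actor.Attributes.exitCode}}'
}

# recent_output NAME: the last 100 timestamped log lines captured so far,
# including lines written by runs before the latest restart.
recent_output() {
  docker logs --timestamps --tail 100 "$1" 2>&1
}
```

Leave watch_deaths web running in one terminal and call recent_output web in another; the timestamps let you line each die event up with the last thing the process printed.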

exec Into a Running Container

docker exec -it <container> sh (or bash) opens a shell in a running container to inspect its filesystem, environment, and live processes:

Open a shell in a running container
docker exec -it web sh

It does nothing for a container that already exited — there is no process to attach to — and it depends on the image actually shipping a shell, which a slim or distroless image may not. When exec errors with "executable file not found," the image has no sh, and you need the next approach.

Distroless and No Shell — docker debug and Ephemeral Debug Containers

When driftwood/web is distroless or built FROM scratch (Chapter 2, topic 12), there is no sh to exec into — that is the point of those base images. docker debug <container> attaches a temporary toolbox image into the running container's namespaces, giving you a shell and utilities without rebuilding or bloating the production image. The older, runtime-agnostic pattern does the same by joining the target's namespaces directly:

Attach a debug toolbox to a shell-less container by joining its namespaces
docker run -it --rm \
  --pid=container:web \
  --network=container:web \
  nicolaka/netshoot sh

The debug container brings its own shell and tools but shares the target's process and network view, so you inspect the shell-less container's processes and connections from inside a container that does have a shell.

Failed Builds

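A build that fails leaves every step before the failure in the cache, which is the fastest way back in. docker build shows which instruction failed and its output; BuildKit's --progress=plain un-collapses the log so you see the full output of the failing step rather than the folded summary. From there, running an interactive container from the last good point lets you reproduce the failing RUN step by hand. The legacy builder prints an image ID after each step that you can run directly; BuildKit does not expose intermediate images, so build up to the last good stage with --target (or temporarily cut the Dockerfile off before the failing instruction) and run the result. Either way the cached steps are reused, which is faster and more precise than editing the Dockerfile and rebuilding from scratch each time (ties to Chapters 4 and 5).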
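One way to get a runnable image that stops short of the failing step, sketched under two assumptions: the Dockerfile is multi-stage with a named stage that ends before the failure, and that stage's base image ships sh. The helper names and the example stage and tag below are hypothetical.

```shell
# build_to_stage STAGE TAG [CONTEXT]: build only up to STAGE, with the
# full unfolded log, and tag the result so it can be run.
build_to_stage() {
  docker build --progress=plain --target "$1" --tag "$2" "${3:-.}"
}

# shell_in TAG: open a throwaway shell in that image to rerun the
# failing command by hand.
shell_in() {
  docker run -it --rm "$1" sh
}
```

For example, build_to_stage deps web-debug && shell_in web-debug, where deps is a stage name in your Dockerfile. Steps already in the cache are not re-executed, so the build returns quickly.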

Common Mistakes
  • Trying to docker exec into a container that already exited — exec works only on a running container; a crashed one needs docker logs, inspect, and possibly a re-run with an overridden entrypoint to get a shell before the failing command runs.
  • Reading exit code 137 as a generic crash and chasing the app — it is SIGKILL, and .State.OOMKilled: true shows the kernel killed it for exceeding its memory limit (Chapter 1, topic 03), a config problem, not a code bug.
  • Restarting a crash-looping container repeatedly and losing the failure output each time — without grabbing the previous run's docker logs or watching docker events, the exit reason scrolls away on every restart.
  • Assuming you can exec a shell into a distroless or scratch image to debug it. There is no shell or coreutils inside; docker debug or an ephemeral debug container is the way in (Chapter 2, topic 12).
  • Debugging a failed build by repeatedly editing the Dockerfile and rebuilding from scratch — the cache already holds the last good layer, so running a container from that intermediate and reproducing the failing step by hand is faster and more precise.
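The re-run with an overridden entrypoint mentioned in the first item can be sketched as follows. The helper name shell_before_entrypoint is hypothetical, and the approach assumes the image ships a shell; for a distroless or scratch image it does not apply.

```shell
# shell_before_entrypoint IMAGE [SHELL]: start a fresh, throwaway container
# from IMAGE with the entrypoint replaced by a shell, so the failing
# command never runs and you can start it by hand.
shell_before_entrypoint() {
  docker run -it --rm --entrypoint "${2:-sh}" "$1"
}
```

Called as shell_before_entrypoint driftwood/web, this gives a shell in the same filesystem the crashed container started from. Environment variables, mounts, and limits from the original run are not carried over unless you pass them again.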
Best Practices
  • Read the exit code first (.State.ExitCode and .State.OOMKilled) to classify the failure — clean exit, app error, OOM kill, or signal — before diving into the logs.
  • Capture a crash-looping container's output with docker logs and watch docker events so the failure reason survives each restart instead of scrolling past.
  • Use docker debug, or an ephemeral debug container joining the target's namespaces, for distroless and scratch images — keeping the production image shell-free while still being debuggable (Chapter 2, topic 12).
  • Debug failed builds from the last cached intermediate image with --progress=plain rather than rebuilding blind, reproducing the failing RUN step interactively.
Comparable tools
  • kubectl debug ephemeral containers: the orchestrated equivalent for shell-less images
  • nerdctl · Podman: matching logs, inspect, and exec
  • cdebug: attaches a toolbox to a running container
  • dive: debugs image bloat from an oversized build rather than a runtime crash

Knowledge Check

A container exits with code 137 and .State.OOMKilled is true. What does that tell you?

  • The kernel killed it with SIGKILL for exceeding its memory limit — a config problem, not an app bug
  • The application exited cleanly on its own and 137 is simply its normal success code
  • It received a SIGTERM from docker stop and then proceeded to shut itself down gracefully
  • The application segfaulted on a bad pointer dereference, which is what exit code 137 always indicates

Why does docker exec -it web sh fail on a crashed container?

  • exec attaches to a running container; an exited one has no live process to attach to
  • The container's network namespace was torn down on exit, so exec can no longer reach it
  • The container's restart policy must be removed first before exec will agree to connect
  • The configured log driver must be json-file for exec to be able to open a shell

driftwood/web is a distroless image with no shell and it is misbehaving while running. How do you get a shell to inspect it?

  • Use docker debug or an ephemeral debug container that joins the target's namespaces and brings its own shell
  • Run docker exec -it web sh directly — every distroless image still ships a minimal busybox shell for exactly this
  • Rebuild the production image with a shell layer added so you can exec into the running container
  • Read docker logs, which conveniently opens an interactive shell into the container's live filesystem

A Dockerfile build fails at a RUN step. What is the fastest way to reproduce and fix it?

  • Run an interactive container from the last cached intermediate image and rerun the failing step by hand
  • Delete the entire build cache and rebuild the image from scratch after every single edit until it finally passes
  • Switch the daemon's log driver over to json-file so that the full build output is captured to disk
  • docker exec into the half-built final image and rerun the failing step interactively there
