Chapter 10: Security
Topic 59

The Container Threat Model

Threat model · Shared kernel

One fact drives every other topic in this chapter: a container shares the host's kernel. Chapter 1 established that a container is a process the kernel has fenced off with namespaces and cgroups, not a small machine with a kernel of its own. The security consequence of that architecture is blunt — the kernel is the blast radius. A process that breaks out of its namespaces is not in another container or a sandbox; it is on the host.

So a container escape is host compromise, not a contained incident. A container is not a security boundary by default. It is an isolated process that you must deliberately harden, one layer at a time, and this chapter sets out the order in which you do it. The rest of these pages assume you accept that premise, because every control after this one exists to make a compromise land somewhere harmless.

The Shared Kernel Is the Blast Radius

Every container on a host calls into the same kernel. There is exactly one, and it does the real work for all of them — system calls, memory management, the network stack, the filesystem. That sharing is what makes containers light: no guest kernel to boot, milliseconds to start. It is also the security trade-off named back in Chapter 1, topic 02. A virtual machine's boundary is enforced in hardware by a hypervisor; a container's boundary is enforced in software by the kernel itself.

The implication follows directly. One kernel vulnerability, exploited from inside any container, can cross into the host and from there into every other container sharing that kernel. In a VM, a guest-kernel exploit compromises only that guest; the attacker still needs a separate hypervisor escape to reach the host. A container's kernel exploit has nothing of that strength in its way. This is why a container boundary is weaker than a VM's, and why "we run it in a container" is not the same sentence as "we sandboxed it."
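You can see the shared kernel for yourself by asking for the kernel release on the host and again inside a container. The following is a minimal sketch, assuming Python is available in both places; `kernel_release` is a hypothetical helper name, not part of any tool discussed here.

```python
import platform


def kernel_release() -> str:
    """Return the release string of the kernel this process is running on."""
    # platform.release() reports the uname data, which the running kernel supplies.
    return platform.release()


if __name__ == "__main__":
    # Run once on the host and once inside a container, then compare the output.
    print(kernel_release())
```

With a conventional runtime, both runs should print the same string, because there is only one kernel to answer the question. Under a sandboxed or VM-backed runtime the answer may differ.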

What "Escape" Means

Namespaces hide the host from a container: its own process namespace means it cannot see the host's processes, its own mount namespace means it cannot see the host's filesystem, its own network namespace means it cannot see the host's interfaces. That hiding is real and useful. But hiding is not the same as preventing. A kernel bug, a misconfigured mount, or an excess of capabilities can let a container process reach out and act on the host directly, past the things namespaces were hiding.
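To make the "hiding" concrete, Linux exposes the namespaces a process belongs to as symlinks under /proc/self/ns. The sketch below reads them; `read_namespaces` is a hypothetical helper, and it assumes a Linux host with /proc mounted (it returns an empty result anywhere else).

```python
import os


def read_namespaces(ns_dir: str = "/proc/self/ns") -> dict[str, str]:
    """Map each namespace type (pid, mnt, net, ...) to its identifier."""
    namespaces: dict[str, str] = {}
    if not os.path.isdir(ns_dir):
        return namespaces  # not Linux, or /proc is not mounted
    for name in sorted(os.listdir(ns_dir)):
        path = os.path.join(ns_dir, name)
        if os.path.islink(path):
            # Each entry is a symlink whose target looks like "pid:[4026531836]".
            namespaces[name] = os.readlink(path)
    return namespaces


if __name__ == "__main__":
    for name, ident in read_namespaces().items():
        print(f"{name:10} {ident}")
```

Two processes that print the same identifier for a namespace type share that namespace. Run it on the host and in a container and the pid, mnt, and net entries should differ. That difference is the fence, and it is only a difference in what the process can see.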

The uncomfortable part is that the common escapes are not exotic kernel 0-days. They are self-inflicted. Mounting /var/run/docker.sock into a container hands it root-equivalent control of the daemon, which is root on the host (Chapter 1, topic 05). Running with --privileged disables nearly every isolation control at once. A writable host bind mount lets a container process edit host files directly. The exotic exploit is rare; the misconfiguration that makes one unnecessary is common.
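The three self-inflicted escapes named above are all visible in a container's configuration, so they can be checked mechanically. Here is a minimal sketch, assuming the input is one container object from the JSON that `docker inspect` prints and that the `HostConfig.Privileged` and `Mounts` fields look the way they do in current Docker releases. `find_escape_risks` and the sample data are hypothetical.

```python
def find_escape_risks(inspect_doc: dict) -> list[str]:
    """Flag --privileged, a mounted daemon socket, and writable host bind mounts."""
    risks = []
    host_config = inspect_doc.get("HostConfig") or {}
    if host_config.get("Privileged"):
        risks.append("runs with --privileged")
    for mount in inspect_doc.get("Mounts") or []:
        source = mount.get("Source", "")
        if source.endswith("docker.sock"):
            risks.append(f"daemon socket mounted from {source}")
        elif mount.get("Type") == "bind" and mount.get("RW"):
            risks.append(f"writable host bind mount: {source}")
    return risks


if __name__ == "__main__":
    # Hypothetical sample, shaped like one element of `docker inspect` output.
    sample = {
        "HostConfig": {"Privileged": True},
        "Mounts": [
            {"Type": "bind", "Source": "/var/run/docker.sock", "RW": True},
            {"Type": "bind", "Source": "/etc", "RW": True},
            {"Type": "bind", "Source": "/srv/static", "RW": False},
        ],
    }
    for risk in find_escape_risks(sample):
        print(risk)
```

A check like this does not make a container safe. It only catches the cases where no exploit is needed.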

A Container Is Not a Security Boundary by Default

Take the default container apart. Its main process runs as root — UID 0, the same UID 0 the host uses, because the user namespace is off unless you turn it on (topic 65). It holds a broad set of Linux capabilities. Its root filesystem is writable, so anything that lands inside can drop a binary or rewrite a config. None of that is a bug; it is convenience, tuned so that images "just work" without the author thinking about permissions. Convenient and safe are different settings.

The whole rest of this chapter is the removal of those defaults, in order. Each topic takes one convenience away and replaces it with the least-privilege version: root becomes a non-root user, the broad capability set becomes drop-all-add-back, the writable filesystem becomes read-only, and so on. None of it is automatic, and none of it is on until you put it there.
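The defaults are easy to observe from inside a container, because the kernel reports a process's UID and effective capabilities in /proc/<pid>/status. The sketch below parses that text; `effective_identity` is a hypothetical helper, and the sample mask `a80425fb` is the value commonly reported for a default Docker container, which may differ on your engine version.

```python
def effective_identity(status_text: str) -> tuple[int, int]:
    """Return (effective UID, count of effective capabilities) from status text."""
    euid = -1
    cap_count = 0
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key == "Uid":
            # Fields are real, effective, saved, and filesystem UID.
            euid = int(value.split()[1])
        elif key == "CapEff":
            # CapEff is a hexadecimal bitmask, one bit per capability.
            cap_count = bin(int(value.strip(), 16)).count("1")
    return euid, cap_count


# Assumed sample of what a default container's main process reports.
SAMPLE = "Name:\tsh\nUid:\t0\t0\t0\t0\nGid:\t0\t0\t0\t0\nCapEff:\t00000000a80425fb\n"

if __name__ == "__main__":
    # Inside a real container, pass the contents of /proc/self/status instead.
    print(effective_identity(SAMPLE))
```

A hardened process should report a non-zero UID and a capability count of zero, which is where the following topics are headed.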

Defense in Depth

No single control is enough, because each one can fail or be misconfigured. The answer is to stack independent controls so a failure in one does not hand over the host. Run as non-root (topic 60). Drop the capabilities the workload never uses (61). Confine syscalls and file access with seccomp and the LSMs (62). Make the filesystem read-only and block privilege escalation (63). Keep secrets out of the image (64). Drop the daemon's own root (65). Each layer assumes the one before it was breached, and an attacker has to defeat all of them to reach the host.

Figure: Defense in depth. Each layer assumes the one before it was breached. Layers shown: non-root, drop capabilities, seccomp / LSM, read-only fs, rootless daemon.

That stacking is the difference between a contained incident and a headline. If a process is compromised but it runs as an unprivileged user, holds no capabilities, sits on a read-only filesystem, and cannot escalate, the attacker has a foothold with nothing to stand on. The kernel is still the shared blast radius — defense in depth does not change the architecture — but it makes the path from a compromised app to host root long enough that most attackers never finish it.
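Because the layers are independent, you can report on each one separately. The following sketch lists which container-level layers are absent from one container's `docker inspect` data. It assumes the `Config.User`, `HostConfig.CapDrop`, `HostConfig.ReadonlyRootfs`, and `HostConfig.SecurityOpt` fields as current Docker releases emit them, and it assumes the default seccomp profile is active unless it was explicitly set to unconfined. `missing_layers` is a hypothetical name, and the rootless daemon is a host-level property that this data does not show.

```python
def missing_layers(inspect_doc: dict) -> list[str]:
    """List the hardening layers that one container's configuration lacks."""
    config = inspect_doc.get("Config") or {}
    host_config = inspect_doc.get("HostConfig") or {}
    security_opts = host_config.get("SecurityOpt") or []
    cap_drop = [cap.upper() for cap in host_config.get("CapDrop") or []]

    missing = []
    user = config.get("User") or ""
    # An empty user means the image default applies; treated here as root.
    if user.split(":")[0] in ("", "0", "root"):
        missing.append("non-root user")
    if "ALL" not in cap_drop:
        missing.append("drop capabilities")
    if any(o.startswith("seccomp") and o.endswith("unconfined") for o in security_opts):
        missing.append("seccomp")
    if not host_config.get("ReadonlyRootfs"):
        missing.append("read-only fs")
    if not any(
        o.startswith("no-new-privileges") and not o.endswith("false")
        for o in security_opts
    ):
        missing.append("no-new-privileges")
    return missing


if __name__ == "__main__":
    print(missing_layers({}))  # an unhardened container lacks most layers
```

An empty result means every layer this sketch knows about is present. It does not mean the container cannot be compromised.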

Least Privilege as the Spine

Every control in this chapter answers one question: what is the smallest set of permissions, capabilities, syscalls, and writable paths this container needs to do its job? Not "what does it have" — that is the default, and the default is generous. The least-privilege question is what does it need, and then everything else gets taken away. A web app serving requests as a non-root user on a high port needs almost nothing the default grants.
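For capabilities, the least-privilege question reduces to a set difference: what the default grants minus what the workload needs. The sketch below assumes the default capability set that Docker documents at the time of writing (verify it against your engine version); `excess_capabilities` and `least_privilege_flags` are hypothetical helpers.

```python
# Assumed: Docker's documented default capability set; check your engine version.
DEFAULT_CAPS = frozenset({
    "CHOWN", "DAC_OVERRIDE", "FSETID", "FOWNER", "MKNOD", "NET_RAW",
    "SETGID", "SETUID", "SETFCAP", "SETPCAP", "NET_BIND_SERVICE",
    "SYS_CHROOT", "KILL", "AUDIT_WRITE",
})


def excess_capabilities(needed: set[str]) -> list[str]:
    """Return the default capabilities the workload holds but does not need."""
    return sorted(DEFAULT_CAPS - needed)


def least_privilege_flags(needed: set[str]) -> list[str]:
    """Return drop-all-add-back flags for the capabilities a workload needs."""
    return ["--cap-drop=ALL"] + [f"--cap-add={cap}" for cap in sorted(needed)]


if __name__ == "__main__":
    # A web app on a high port is assumed to need no capabilities at all.
    print(len(excess_capabilities(set())), "default capabilities are excess")
    print(least_privilege_flags(set()))
```

The design choice is to start from nothing and add back, not to start from the default and subtract, so a capability you forgot about is absent instead of present.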

Driftwood's web container is the worked example for the whole chapter, and it starts at the worst case: running as root, holding the full default capability set, on a writable root filesystem, with its database password sitting in an environment variable. By the end of these pages it runs as the non-root user app, with --cap-drop=ALL, on a read-only filesystem with a tmpfs for scratch, with no-new-privileges set, its DB password delivered as a runtime secret, and — on a hardened host — under a rootless daemon. The hardening accumulates; each topic adds one more line.
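As a preview of where the chapter ends up, the sketch below assembles a `docker run` command line for that end state. The flags are standard Docker CLI options, but the tmpfs path `/tmp` is an assumption, `hardened_run_args` is a hypothetical helper, and secret delivery and the rootless daemon are left out because they are configured elsewhere (topics 64 and 65).

```python
import shlex


def hardened_run_args(image: str = "driftwood/web", user: str = "app") -> list[str]:
    """Build the argument list for a hardened `docker run` of the web container."""
    return [
        "docker", "run",
        "--user", user,                               # non-root user
        "--cap-drop=ALL",                             # hold no capabilities
        "--read-only",                                # read-only root filesystem
        "--tmpfs", "/tmp",                            # scratch space (assumed path)
        "--security-opt", "no-new-privileges=true",   # block privilege escalation
        image,
    ]


if __name__ == "__main__":
    # Print the command for review; nothing is executed here.
    print(shlex.join(hardened_run_args()))
```

Each flag corresponds to one topic in the chapter, which is why the hardening reads as one added line at a time.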

Common Mistakes
  • Treating namespace isolation as a security sandbox and running untrusted code in a default container — namespaces hide the host, they do not sandbox a kernel exploit, and the shared kernel is the shared blast radius across every container.
  • Mounting /var/run/docker.sock into a container so it can "talk to Docker" — that grants the container root-equivalent control of the host (Chapter 1, topic 05), and it is the single most common self-inflicted escape.
  • Reaching for --privileged to make something work — it disables nearly every isolation control at once and turns the container into an effectively unconfined root shell on the host.
  • Assuming the cloud provider or the base image "handles security" — the defaults are tuned for convenience, and an unhardened driftwood/web running as root with a writable rootfs is your team's problem, not Docker's.

Best Practices
  • Assume any container can be compromised and design so the compromise lands as an unprivileged process with nothing useful to escalate — that posture is exactly what topics 60 through 65 build, one layer at a time.
  • Stack independent controls — non-root, dropped capabilities, seccomp, a read-only rootfs, no-new-privileges — so no single misconfiguration hands over the host.
  • Reserve a stronger boundary (gVisor's userspace kernel, Kata Containers' VM-backed runtime) for genuinely untrusted multi-tenant code, since even a fully hardened container is still a shared-kernel boundary, not a hardware one.
  • Audit what each container can reach — its mounts, sockets, capabilities, and the daemon socket — and remove anything the workload does not strictly need.

Comparable tools
  • gVisor · Kata Containers re-add a stronger boundary under the container interface: a userspace kernel and a VM-backed runtime, respectively.
  • Podman runs rootless by default, so it starts from a safer posture than a root daemon.
  • Falco watches running containers for the escape behaviors this model is built to prevent.

Knowledge Check

Why is the shared kernel the blast radius of a container compromise?

  • Every container calls into the same one kernel, so a single kernel exploit from any container can reach the host and the rest
  • Each container boots its own private kernel, and an exploit has to spread between those kernels over the container network
  • Each container is isolated by its own lightweight hypervisor, so a compromise stays trapped inside that one virtual machine
  • The thin writable layer is shared between every container on the host, so a file written in one immediately appears in all the others

Why is a default Docker container not a security boundary?

  • It runs as root with a broad capability set and a writable root filesystem — defaults tuned for convenience, not for safety
  • Namespaces fully sandbox the process the way a hypervisor would, so nothing running inside can ever reach the host kernel
  • It already runs as a dedicated unprivileged user with capabilities dropped, so there is no further boundary left to add
  • Its image layers sit unencrypted in the overlay store, so any attacker with disk access can read them straight off the host

What is the point of defense in depth here, given the kernel is still shared?

  • No single control suffices, so independent controls are stacked and an attacker must defeat all of them to reach the host
  • Stacking enough independent controls eventually turns the container into a hardware-enforced boundary as strong as a full VM
  • It runs the same single control several times over so that at least one redundant copy is sure to hold under attack
  • A single well-chosen control prevents every compromise on its own, so the remaining layers exist purely as documentation

Which of these is the most common kind of container escape in practice?

  • A self-inflicted misconfiguration such as mounting docker.sock, using --privileged, or a writable host bind mount
  • A novel kernel 0-day, discovered and weaponized specifically against your host's exact kernel build before any patch exists
  • The PID and mount namespaces randomly disabling themselves once the container has been running for a while under load
  • The image registry being breached, which then lets attackers reach in and edit the already-running container remotely
