Topic 77

Containers from First Principles


A container is not an object the kernel knows about. There is no container system call, no container type in /proc, nothing the scheduler treats specially. A container is an ordinary process the kernel has been told to wrap in two sets of primitives: namespaces, which give it isolated views of PIDs, mounts, the network, users, and hostname; and cgroups, which cap the CPU, memory, and I/O it can consume. Add a root filesystem swapped in with pivot_root, a reduced capability set, and a seccomp filter, and you have what everyone calls a container.

That framing decides how you reason about every operational question that follows. A container shares your running kernel — there is exactly one kernel on the host, and every container makes its system calls against it. So a kernel CVE is a host problem that affects all containers at once, "isolation" is a process boundary rather than a hardware one, and root inside the container is, without a user namespace, root on the host. "Container" is a userland convention assembled from kernel features, not a sandbox the kernel hands you.
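One way to see that a container is only a process with different namespace memberships is to read the namespace handles the kernel exposes under /proc. This is a minimal sketch, assuming a Linux host with /proc mounted; the helper name ns_id is made up for illustration.

```shell
# ns_id TYPE [PID]: print a namespace handle such as "pid:[4026531836]".
# Hypothetical helper; it reads the symlink the kernel exposes in /proc/<pid>/ns.
ns_id() {
  readlink "/proc/${2:-self}/ns/$1"
}

# Two processes share a namespace exactly when their handles are identical.
for type in pid mnt net uts user; do
  printf '%-5s %s\n' "$type" "$(ns_id "$type")"
done
```

Run it once on the host and once inside a container: the handles that differ are the namespaces the runtime created. Reading another process's handles (for example ns_id net 1) generally needs root.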

The Assembly of Primitives

Four kernel mechanisms, combined, produce the illusion. Namespaces isolate what a process can see: a PID namespace makes the container's first process appear as PID 1 with no visibility of host processes, a mount namespace gives it its own filesystem tree, a network namespace gives it its own interfaces and routing table, and a user namespace can map container UID 0 to an unprivileged host UID. Cgroups (v2 on modern Debian and Ubuntu) bound what it can take: a memory limit triggers the OOM killer inside the container, a CPU quota throttles it, and an I/O weight keeps it from starving the host. A root filesystem — usually an overlay mount, entered with pivot_root rather than the older chroot — gives it a self-contained userland. Capabilities and a seccomp profile then strip the privileges and syscalls it does not need.

You can build one by hand and the magic disappears. unshare creates new namespaces, nsenter joins existing ones, and the cgroup tree under /sys/fs/cgroup is just files you write limits into. The runtime does this for you in the right order, but the steps are plain kernel features.

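# Build a minimal "container" by hand, no Docker involved
sudo unshare --pid --mount --net --uts --fork --mount-proc bash
# Inside: this shell is PID 1, with its own /proc, network, and hostname
ps -e          # sees only processes in this PID namespace
hostname isolated
# From a second terminal on the host: cap memory at 100 MB through cgroup v2.
# Assumes the memory controller is enabled in /sys/fs/cgroup/cgroup.subtree_control.
sudo mkdir /sys/fs/cgroup/demo        # the cgroup must exist before it has a memory.max
echo 100M | sudo tee /sys/fs/cgroup/demo/memory.max
# The limit only applies to processes placed in the cgroup. Set CONTAINER_PID to
# the host PID of the unshared bash (find it with ps -ef on the host).
echo "$CONTAINER_PID" | sudo tee /sys/fs/cgroup/demo/cgroup.procs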
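
To check which cgroup a process belongs to and what memory limit applies, you can read the same files a runtime writes. This is a minimal sketch, assuming a cgroup v2 (unified) hierarchy mounted at /sys/fs/cgroup; the function name cgroup_v2_path is illustrative.

```shell
# cgroup_v2_path [FILE]: print the cgroup v2 path from a /proc/<pid>/cgroup file.
# On cgroup v2 that file holds one line of the form "0::/some/path".
cgroup_v2_path() {
  sed -n 's/^0:://p' "${1:-/proc/self/cgroup}"
}

cg="/sys/fs/cgroup$(cgroup_v2_path)"
echo "this shell is in: $cg"
# memory.max holds a byte count or the literal "max" (no limit).
# The root cgroup has no memory.max, hence the fallback message.
cat "$cg/memory.max" 2>/dev/null || echo "no memory.max at this level"
```

Pointing the function at /proc/<pid>/cgroup for a container's process shows where the runtime placed it in the tree.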

Images and Layers

An image is the root filesystem the container starts from, plus metadata describing how to run it. It is not a single blob: it is a stack of layers, each a tarball of filesystem changes (added, modified, and deleted files) identified by a SHA-256 digest. At run time these layers are stacked into one tree by a union filesystem (overlayfs on current Debian and Ubuntu), with the read-only image layers as the lower dirs and a thin writable layer on top. Reads resolve to the topmost layer that contains the file, so an upper layer shadows everything beneath it; writes first copy the file up into the writable layer, which is copy-on-write.

Layering is what makes images cheap to ship and store: two images built on the same debian:12 base share that base layer on disk and over the wire, so only the differing layers transfer. The cost is real but bounded — every layer adds an entry the kernel must walk on a lookup, and a deleted file in an upper layer is recorded as a whiteout rather than reclaiming the space in the layer below. Hundreds of layers or a multi-gigabyte base inflate both storage and the time overlayfs spends resolving paths.
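
The layer lookup can be imitated with plain directories, no mount required. The sketch below is a simulation, not overlayfs itself: it treats each directory as a layer, searches from the top layer down, and honours the .wh.<name> marker that OCI layer tarballs use to record deletions (overlayfs represents whiteouts differently on disk). The resolve function and the layer names are invented for illustration.

```shell
# resolve REL_PATH TOP_LAYER ... BOTTOM_LAYER
# Prints the file that wins the lookup; returns 1 if it is deleted or absent.
resolve() {
  rel=$1; shift
  dir=$(dirname "$rel"); base=$(basename "$rel")
  for layer in "$@"; do
    if [ -e "$layer/$dir/.wh.$base" ]; then
      echo "deleted by whiteout in $layer"
      return 1
    fi
    if [ -e "$layer/$rel" ]; then
      echo "$layer/$rel"
      return 0
    fi
  done
  echo "not found"
  return 1
}

# Demo: three layers, listed top to bottom as upper, app, base.
root=$(mktemp -d)
mkdir -p "$root/base/etc" "$root/app/etc" "$root/upper/etc"
echo "from base" > "$root/base/etc/motd"
echo "from app"  > "$root/app/etc/motd"        # app shadows base
echo "old"       > "$root/base/etc/old.conf"
: > "$root/upper/etc/.wh.old.conf"             # upper deletes old.conf

(
  cd "$root" || exit 1
  resolve etc/motd     upper app base          # the app copy wins
  resolve etc/old.conf upper app base || true  # the whiteout hides the base copy
)
```

After the demo, base/etc/old.conf is still on disk. The whiteout only hides it, which is why deleting a file in a later layer does not shrink the image.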

The OCI Model

The ecosystem agrees on two specifications under the Open Container Initiative so that images and runtimes from different vendors interoperate. The image spec defines the on-disk format: the layered tarballs, a config JSON with the default command and environment, and a manifest that ties them together by digest. The runtime spec defines the other end: a directory containing an unpacked root filesystem plus a config.json describing the namespaces, cgroups, capabilities, and mounts to apply. "OCI-compliant" means a tool reads or writes these formats, which is why an image built by docker build runs unchanged under Podman, containerd, or CRI-O.
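
Addressing content by digest is easy to reproduce. The following sketch computes a digest in the sha256:<hex> form that manifests use and derives the path under which an OCI image layout stores that blob (blobs/sha256/<hex>). It assumes sha256sum from GNU coreutils is installed; the function names are illustrative.

```shell
# oci_digest FILE: print the content digest in the "sha256:<hex>" form.
oci_digest() {
  printf 'sha256:%s\n' "$(sha256sum "$1" | cut -d' ' -f1)"
}

# blob_path DIGEST: map a digest to its location inside an OCI image layout.
blob_path() {
  printf 'blobs/%s/%s\n' "${1%%:*}" "${1#*:}"
}

layer=$(mktemp)
printf 'hello layer\n' > "$layer"
digest=$(oci_digest "$layer")
echo "$digest"
blob_path "$digest"
rm -f "$layer"
```

Because the name is derived from the bytes, two images that reference the same layer digest are referring to identical content, which is what lets registries and local stores keep a single copy.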

runc is the reference implementation of the runtime spec — a small Go binary that takes a runtime bundle, makes the clone/unshare, pivot_root, and cgroup calls in the correct sequence, executes your process, and exits. Every mainstream stack ends in runc or a drop-in replacement. The container is then a child of whatever supervises it, not of runc itself.

Runtimes and the Stack

The word "runtime" covers two layers that do different jobs. The low-level runtime is the piece that actually talks to the kernel — runc, or crun, a C rewrite that starts faster and uses less memory. The high-level runtime manages images, storage, and the lifecycle of many containers, then calls the low-level runtime per container: containerd and CRI-O are the two that matter, and CRI-O exists specifically to implement Kubernetes' Container Runtime Interface.

Docker and Podman sit above that as the user-facing tools. Docker is a client plus the dockerd daemon, which delegates down to containerd and then runc. Podman is daemonless: each container is launched directly from the podman command and supervised by a small per-container monitor (conmon) rather than by a central daemon, which makes rootless operation and systemd integration cleaner. The container it produces is still the same OCI bundle, handed to a low-level runtime such as runc or crun. The differences are in management and trust model, not in what a container fundamentally is.

# The same stack, bottom to top, on Debian/Ubuntu
runc / crun        # low-level: makes the kernel calls, runs the process
containerd / cri-o # high-level: images, storage, lifecycle
dockerd / podman   # user-facing: build, pull, run; podman is daemonless
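
A quick way to see which of these pieces a given host actually has is to look for the binaries. This is a rough sketch: the names are the usual upstream binary names, and it only searches the current PATH, so a runtime installed under /usr/sbin may be missed when run as an unprivileged user. The function name report_stack is invented here.

```shell
# Report which pieces of the stack are installed, bottom to top.
report_stack() {
  for bin in runc crun containerd crio dockerd podman; do
    if path=$(command -v "$bin" 2>/dev/null); then
      printf '%-11s %s\n' "$bin" "$path"
    else
      printf '%-11s %s\n' "$bin" "not installed"
    fi
  done
}

report_stack
```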

Containers versus Virtual Machines

The dividing line is the kernel. A virtual machine runs its own guest kernel on virtualized hardware presented by a hypervisor (KVM on Linux), so a VM is isolated at the hardware boundary and a compromise of the guest kernel does not reach the host kernel. A container has no kernel of its own; it borrows the host's. That single fact produces every difference downstream: containers start in tens of milliseconds because there is no kernel to boot, pack far more densely because they carry no guest OS, and share the host's patch state. The price is that their isolation is only as strong as the kernel's namespace and cgroup code plus your seccomp and capability hardening.

For multi-tenant or hostile workloads where a kernel exploit escaping to the host is unacceptable, that gap matters, which is why sandboxed runtimes like gVisor (a user-space kernel) and Kata Containers (a thin VM per container) exist to buy VM-grade isolation at container-like ergonomics. For your own trusted services, the shared-kernel model is the point — the density and startup speed are why containers won.

Containers vs Virtual Machines

Containers — isolated processes sharing the host kernel through namespaces and cgroups. Startup in milliseconds, near-zero memory overhead, and high density. Choose them for your own trusted services and for anything where deployment speed and packing density dominate.

Virtual machines — full guests with their own kernel on virtualized hardware. Seconds to boot, hundreds of megabytes of overhead each, but isolation at the hardware boundary that a guest-kernel exploit cannot cross. Choose them for untrusted or multi-tenant workloads, for a different kernel or OS than the host, and where a kernel escape must be impossible rather than merely unlikely.

Common Mistakes

Treating a container as a lightweight VM. It shares your host kernel, so a kernel-level exploit from inside one container reaches the host and every other container — there is no guest kernel between them to stop it.
Running the process as root inside the container without a user namespace. Container UID 0 is host UID 0; a breakout from such a container lands you root on the host. Map it with userns-remap or run rootless under Podman.
Leaving the default capability set in place, or disabling the seccomp profile, and then assuming the container is a hard security boundary. The default set still grants capabilities like CAP_NET_RAW, and removing the syscall filter widens the kernel attack surface dramatically.
Persisting important data only in the container's writable layer. That layer is discarded when the container is removed, so anything not on a volume or bind mount is gone — and copy-on-write makes those writes slower than a real filesystem anyway.
Treating image layers as free. Each added layer is an extra overlayfs lookup and a deleted file becomes a whiteout that never reclaims space below it, so a sprawling layer count bloats both image size and path-resolution cost.
Setting no cgroup memory or CPU limit. An unbounded container can consume all host memory and trigger the host OOM killer, taking down neighbours that did nothing wrong.
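
Whether container root is really host root can be checked from inside the process by reading its UID map. The sketch below assumes Linux and the /proc/<pid>/uid_map format (inside ID, outside ID, range length); the function name is illustrative. The outside ID is relative to the parent user namespace, which is the host when namespaces are nested only one level deep.

```shell
# root_maps_to_host_root [FILE]: succeed if UID 0 maps to UID 0 outside.
# In the initial user namespace the map is the identity: "0 0 4294967295".
root_maps_to_host_root() {
  awk '$1 == 0 && $2 == 0 { found = 1 } END { exit !found }' \
    "${1:-/proc/self/uid_map}"
}

if root_maps_to_host_root; then
  echo "UID 0 here is UID 0 outside: no user-namespace remapping"
else
  echo "UID 0 here maps to a different UID outside"
fi
```

Run inside a container started without a user namespace, this reports the identity mapping, which is the situation the second mistake above describes.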

Best Practices

Patch the host kernel on the same urgency as the workloads — every container runs against it, so a single host kernel CVE is a fleet-wide exposure, not a per-image one.
Run the container process as a non-root user. Set a numeric USER in the image and enable user namespaces (rootless Podman, or userns-remap on Docker) so container UID 0 never maps to host root.
Drop all capabilities and add back only what the process needs (--cap-drop=ALL then --cap-add), and keep the default seccomp profile enabled — never run with --privileged outside controlled debugging.
Set explicit cgroup limits (--memory, --cpus) on every container, and keep all durable data on named volumes or bind mounts rather than the writable layer.
Build small, few-layered images on a minimal base (debian:12-slim or distroless) using multi-stage builds, so overlay lookups stay cheap and the attack surface inside the image stays small.
Reach for a VM-grade sandbox (gVisor or Kata Containers) when running untrusted or multi-tenant code, where the shared-kernel boundary is not strong enough to accept a kernel escape.
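
Several of these practices fit on one command line. The sketch below prints, rather than executes, a docker run invocation so it can be reviewed first. The image name example/app:1.0, the UID 10001, the limits, and the single re-added capability are placeholder assumptions, not recommendations for any particular workload.

```shell
# hardened_run IMAGE [CMD...]: print (not execute) a docker run command line
# that applies a non-root user, dropped capabilities, and cgroup limits.
hardened_run() {
  echo docker run --rm \
    --user 10001:10001 \
    --cap-drop=ALL --cap-add=NET_BIND_SERVICE \
    --memory=256m --cpus=0.5 \
    "$@"
}

hardened_run example/app:1.0
```

The default seccomp profile stays active because nothing here overrides it. To run the container for real, remove the echo from the function.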

Comparable tools

Docker / Podman: the user-facing OCI runtimes that assemble namespaces, cgroups, and overlay images for you; Podman is daemonless and rootless by default
LXC / LXD: system containers that boot a full init and userland in namespaces, closer to a lightweight machine than a single-process container
FreeBSD jails / Solaris Zones: the original OS-level virtualization that pioneered the shared-kernel, isolated-userland model years before Linux namespaces

Knowledge Check

Why does a kernel vulnerability matter differently for containers than for virtual machines?

Containers share the single host kernel, so an exploit in it hits every container at once, whereas each VM runs its own guest kernel behind the hardware boundary
Containers each run their own stripped-down kernel image baked into the layers, so a kernel exploit stays confined to that one container while the VMs share a single hypervisor kernel
The kernel is irrelevant to containers because they issue no system calls of any kind; only full virtual machines ever touch the kernel at runtime
Containers patch the host kernel automatically from layers baked into the image, so a kernel CVE only ever exposes the VMs running beside them

Without a user namespace, what is the security consequence of running a container's process as root?

Container UID 0 is the same as host UID 0, so a breakout from the container hands you full root on the host
The container simply refuses to start, because running as root inside a container is blocked by default on Linux
It only affects files inside the image layers; the host's built-in UID mapping isolates it regardless of any namespaces
Root inside the container is automatically remapped to the unprivileged nobody account on the host by the kernel

What does it actually mean to say a tool is "OCI-compliant"?

It reads or writes the OCI image and runtime specs, so an image built by one tool runs unchanged under another such as Podman or containerd
It carries a certification from Docker Inc. attesting that the tool runs correctly on the official Docker daemon and nowhere else
It guarantees the runtime starts each and every container inside its own dedicated lightweight virtual machine for full hardware-level isolation between tenants
It means the image bundles a matching kernel inside itself so the container no longer depends on the host kernel at all

Why are container images shipped as a stack of layers rather than a single filesystem blob?

Shared base layers are stored and transferred once across images, and a union filesystem stacks them read-only under one copy-on-write writable layer at run time
Each layer is placed in its own private namespace, giving genuine per-layer process isolation inside the running container so a fault in one layer stays fully contained
Layering encrypts every part of the image with a separate key so the kernel can verify and decrypt each one independently at load
A single filesystem blob is hard-capped at 2 GB, so layers exist purely to split larger images and work around that kernel limit

When is a VM-grade sandbox such as gVisor or Kata Containers worth its overhead over a plain container?

When running untrusted or multi-tenant code, where the shared-kernel boundary is too weak to accept the risk of a kernel escape
For your own trusted first-party services, where the extra isolation layer measurably speeds up container startup and image pulls
Whenever you need higher density, since these sandboxes pack noticeably more containers onto a single host than plain runc does
Only when the host has no hypervisor at all, since they replace the KVM hardware boundary with ordinary namespaces
