Topic 03

Namespaces and cgroups

Kernel Isolation

A container is not a special kind of object the kernel knows about. It is an ordinary process the kernel has been told to isolate, using two Linux features that predate Docker by years. Namespaces control what a process can see — its own process tree, its own network interfaces, its own filesystem mounts, its own hostname. cgroups control what a process can use — how much CPU, how much memory, how much block I/O.

Put a fence around what a process can see and a budget around what it can consume, and you have a container. There is no virtualization, no guest kernel, no emulated hardware. The "container" is a label for a process that happens to be running with its own set of namespaces and a cgroup limiting its resources — which is why everything you already know about Linux processes still applies to it.

Namespaces — What a Process Can See

The Linux kernel offers a handful of namespace types, and a container typically gets its own instance of most of them: a PID namespace for the process tree, a mount namespace for the filesystem, a network namespace for interfaces and routing, a UTS namespace for the hostname, and an IPC namespace for shared memory and message queues. A user namespace for UID and GID mapping is also available, but a default Docker install does not enable it. Each one narrows a single dimension of what the process perceives.
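
The namespace types listed above are visible from user space: on Linux, each process has a directory /proc/<pid>/ns whose entries are symbolic links of the form kind:[inode]. The following is a minimal sketch, not part of the original lesson; the function names are made up for illustration. Two processes share a namespace exactly when the corresponding links point to the same inode.

```python
import os
import re

# Link targets under /proc/<pid>/ns look like "pid:[4026531836]".
NS_LINK = re.compile(r"^(?P<kind>[a-z_]+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target):
    """Split a namespace link target into (kind, inode)."""
    match = NS_LINK.match(target)
    if match is None:
        raise ValueError(f"not a namespace link: {target!r}")
    return match.group("kind"), int(match.group("inode"))


def namespaces_of(pid="self"):
    """Map each entry in /proc/<pid>/ns to its namespace inode (Linux only)."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    for entry in sorted(os.listdir(ns_dir)):
        _kind, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, entry)))
        result[entry] = inode
    return result
```

On a Linux host, and with enough privileges to read another process's links, comparing namespaces_of("self") with namespaces_of(some_container_pid) should show different inodes for the namespaces the container does not share with the host.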

The combined effect is that inside the container, PID 1 is the application itself, the only filesystem visible is the one assembled from the image, and the only network interface is the one Docker wired up. The process is not lied to so much as fenced in: it sees a complete, consistent world that happens to be a small slice of the real one. Almost none of this slows the process down, because the kernel does the same scheduling and serves the same syscalls it always does.

The PID Namespace, Concretely

Run a process inside a container and ask it for its PID, and it answers 1. Inside the PID namespace it is the first process, the init of its own little tree, and it cannot see, signal, or even refer to any of the host's processes, because none of them has a PID in its namespace. Run ps inside an interactive container and you will see two or three entries, not the few hundred the host is actually running.

That same process, viewed from the host, has an ordinary high PID like 48213 and sits in the host's process table alongside everything else. One process, two PIDs — 1 inside its namespace and 48213 on the host — is the cleanest single proof that a container is just a fenced-off process. The host kernel scheduled it, the host's ps lists it, and docker stop ultimately sends it a signal the way you would any process.

One process, two views

Inside the container

The app is PID 1. ps shows two or three processes, the hostname defaults to the container ID, and the only network interface is the one Docker wired up.

On the host

The same process has an ordinary high PID, sits in the host process table, and is scheduled by the host kernel like any other process.
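
The two views can be read from a single place. On kernels 4.1 and later, /proc/<pid>/status contains an NSpid line listing the process's PID in each nested PID namespace, outermost first. The sketch below parses that line; the sample text is a hypothetical excerpt that reuses the example PID 48213 from above, and the function name is illustrative.

```python
def parse_nspid(status_text):
    """Return the PIDs on the NSpid line of /proc/<pid>/status, outermost first."""
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split()[1:]]
    return []  # kernels older than 4.1 have no NSpid line


# Hypothetical excerpt, as the host might report it for a containerized process.
SAMPLE_STATUS = "Name:\tapp\nPid:\t48213\nNSpid:\t48213\t1\n"

host_pid, container_pid = parse_nspid(SAMPLE_STATUS)
print(host_pid, container_pid)  # 48213 1
```

On a real Linux host, the same function can be fed the contents of /proc/<pid>/status for any process the host's ps lists.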

cgroups — What a Process Can Use

Namespaces hide things; cgroups meter them. A control group caps and accounts for a process's CPU shares, memory, and block I/O, and a container runs inside one. When you pass --memory=512m to docker run, you are writing a memory limit into that container's cgroup, and the kernel enforces it.
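
To make the --memory=512m example concrete, here is a small sketch of the arithmetic behind the flag. It assumes the documented b, k, m, and g suffixes with binary multiples and handles nothing else; the helper name is hypothetical. The resulting byte count is the value that ends up in the container's cgroup (the memory.max file on cgroup v2, memory.limit_in_bytes on cgroup v1).

```python
# Binary multiples for the suffixes documented for docker run --memory.
UNITS = {"b": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def memory_flag_to_bytes(value):
    """Convert a value such as '512m' to the byte count stored in the cgroup."""
    value = value.strip().lower()
    if value and value[-1] in UNITS:
        return int(value[:-1]) * UNITS[value[-1]]
    return int(value)  # a bare number is taken as bytes


print(memory_flag_to_bytes("512m"))  # 536870912
```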

The enforcement is the kernel's, not Docker's. Exceed a memory cgroup's limit and the kernel's OOM killer terminates a process in that cgroup. Docker is not in the loop at the moment of death; it only configured the limit beforehand. CPU limits work the same way: --cpus=1.5 tells the scheduler how much CPU time the cgroup may claim in each period, and the scheduler throttles the cgroup once that quota is spent. This is why an unlimited container on a shared host is a liability: with no cgroup ceiling, one leaking process can claim all the host's memory.
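
Both halves of this can be sketched in a few lines, assuming a cgroup v2 host. The first helper shows how a --cpus value maps to a quota and period pair in the format of the cpu.max file, using the default 100000 microsecond period. The second parses the kernel-maintained memory.events file, whose oom_kill counter records kills the kernel performed in that cgroup. Function names and the sample text are illustrative.

```python
def cpus_to_cpu_max(cpus, period_us=100_000):
    """Render a --cpus value as 'quota period', the format of cgroup v2 cpu.max."""
    quota_us = int(round(cpus * period_us))
    return f"{quota_us} {period_us}"


def parse_memory_events(text):
    """Parse cgroup v2 memory.events ('key value' per line) into a dict."""
    events = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            events[parts[0]] = int(parts[1])
    return events


# Hypothetical contents of memory.events after one OOM kill.
SAMPLE_EVENTS = "low 0\nhigh 0\nmax 12\noom 1\noom_kill 1\n"

print(cpus_to_cpu_max(1.5))                            # 150000 100000
print(parse_memory_events(SAMPLE_EVENTS)["oom_kill"])  # 1
```

The counter lives in the kernel's own accounting for the cgroup, which is consistent with the point that Docker only sets the limit and the kernel does the enforcing.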

The Union Filesystem, Named

There is a third leg to the stool. The filesystem the container sees is not a copy of the image but a union mount — the image's read-only layers stacked under a thin writable layer, presented as one filesystem. Chapter 2 takes the layering apart and Chapter 6 returns to where writes actually go; here it is enough to name it as the third primitive.
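
As a toy model only, the lookup and write behavior of a union mount resembles Python's ChainMap: reads fall through the layers from top to bottom, and writes land in the top layer without touching the ones beneath. The sketch below uses dictionaries as stand-ins for layers, with made-up paths and contents. It ignores deletions and everything else a real union filesystem has to handle.

```python
from collections import ChainMap


def union_view(image_layers, writable):
    """Stack layers (given bottom to top) under a writable layer on top."""
    return ChainMap(writable, *reversed(image_layers))


# Hypothetical layers: each dict maps a path to its contents.
base = {"/etc/os-release": "base image", "/bin/sh": "shell"}
app = {"/app/server.py": "v1", "/etc/os-release": "patched by app layer"}
writable = {}

rootfs = union_view([base, app], writable)
rootfs["/tmp/scratch"] = "written at runtime"

print(rootfs["/etc/os-release"])  # patched by app layer
print(sorted(writable))           # ['/tmp/scratch']
print(len(base), len(app))        # 2 2
```

The image layers are unchanged after the write, which is the property that lets many containers share one image.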

Namespaces for visibility, cgroups for resources, a union mount for the filesystem: that triad is the entire trick. No hypervisor, no guest OS, no emulation. A container is an ordinary process on the host kernel, fenced by namespaces, budgeted by cgroups, and rooted in a layered filesystem.

Why "Just a Process" Matters

Internalizing that a container is a host process turns most of its behavior from magic into prediction. The host's ps and top show container processes because they are host processes. The host kernel schedules them, so a noisy container competes with everything else on the box for CPU. A kernel bug or a bad driver degrades every container at once, because there is one kernel under all of them.
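
Since container processes appear in the host's process table, the host's /proc can also tell which cgroup each one belongs to. Every line of /proc/<pid>/cgroup has the form hierarchy-ID:controller-list:cgroup-path. The sketch below parses that format and then guesses a container ID from the path. The guess rests on an assumption, not a guarantee: that the runtime embeds the full 64-character hexadecimal container ID in the cgroup path. Names and sample data are illustrative.

```python
import re

# Assumption: the runtime embeds the 64-hex-digit container ID in the path.
CONTAINER_ID = re.compile(r"[0-9a-f]{64}")


def cgroup_paths(cgroup_text):
    """Map controller list to cgroup path; the cgroup v2 entry has the key ''."""
    paths = {}
    for line in cgroup_text.splitlines():
        parts = line.split(":", 2)
        if len(parts) == 3:
            paths[parts[1]] = parts[2]
    return paths


def guess_container_id(cgroup_text):
    """Return the first 64-hex-digit ID found in any cgroup path, or None."""
    for path in cgroup_paths(cgroup_text).values():
        match = CONTAINER_ID.search(path)
        if match:
            return match.group(0)
    return None


# Hypothetical ID and path, for illustration only.
SAMPLE_ID = "ab12" * 16
SAMPLE_CGROUP = f"0::/system.slice/docker-{SAMPLE_ID}.scope\n"

print(guess_container_id(SAMPLE_CGROUP) == SAMPLE_ID)  # True
```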

This is also why docker stop sends SIGTERM and then SIGKILL after a grace period — it is process signalling, nothing more. Every later chapter of this course rests on this fact. When something a container does seems mysterious, the first question is "what would this do to a normal Linux process," because that is exactly what it is.
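
The SIGTERM-then-SIGKILL pattern is easy to reproduce against any ordinary process. The sketch below is not Docker's implementation, only the same idea applied to a child started with subprocess on a POSIX system; the function name is hypothetical, and the 10 second default mirrors docker stop's default grace period.

```python
import signal
import subprocess
import sys


def stop_with_grace(proc, grace_seconds=10.0):
    """Send SIGTERM, wait up to grace_seconds, then fall back to SIGKILL."""
    proc.terminate()  # SIGTERM on POSIX
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL cannot be caught or ignored
        return proc.wait()


if __name__ == "__main__":
    # A stand-in workload: a child Python process that just sleeps.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
    code = stop_with_grace(child, grace_seconds=5.0)
    # On POSIX, a negative return code is the number of the terminating signal.
    print(code == -signal.SIGTERM)
```

A process that handles SIGTERM and exits cleanly returns its own exit code instead, and one that ignores SIGTERM is killed when the grace period runs out.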

Namespaces vs cgroups

Namespaces — control what a process can see: its own PID tree, mounts, network interfaces, hostname, IPC, and user mapping. They partition the kernel's view so the process perceives a private slice of the system. They hide; they do not meter, and they do not sandbox a kernel exploit.

cgroups — control what a process can use: CPU shares, memory ceiling, and block I/O budget. They account and cap, and the kernel enforces the limit — exceeding a memory cgroup is an OOM kill by the kernel. They meter; they do not hide anything. A container needs both: a fence on visibility and a budget on consumption.

Common Mistakes

Believing a container has its own kernel or operating system — it has neither; it has a fenced view from namespaces and a resource budget from cgroups over the host's single kernel, and nothing more.
Running a container with no memory cgroup limit on a shared host — one container's memory leak consumes all host memory and triggers the OOM killer against unrelated containers, so a single leak takes down neighbors.
Assuming process isolation means security isolation: namespaces hide objects from view but do not contain a kernel exploit, so treating them as a complete security boundary for untrusted code invites container escapes.
Being surprised that the host's ps and top show container processes — that is correct and useful behavior, because the processes really are host processes, not something hidden inside a separate machine.

Best Practices

Set a memory limit with --memory on every container on a shared host, so one workload cannot starve the others or take the whole node down through the OOM killer.
Debug containers with the host's own tools — ps, top, and /proc — because container processes are visible there as ordinary host processes, no special tooling required.
Treat namespaces as isolation for correctness, not as a security sandbox, and add the dedicated security layers from Chapter 10 when you need a real boundary around untrusted code.
Reason about any container behavior — signals, limits, visibility — by asking what it would do to a normal Linux process, because that is precisely what a container is.

Comparable tools

Podman · LXC · systemd-nspawn: use the same namespace and cgroup primitives
runc: the OCI runtime that actually sets up the namespaces and cgroups
FreeBSD jails · Solaris zones: the older OS-level isolation lineage this descends from

Knowledge Check

What is the division of labor between namespaces and cgroups?

Namespaces control what a process can see; cgroups control what it can use
cgroups control what a process can see; namespaces control what it can use
Namespaces encrypt the filesystem; cgroups sandbox the process against kernel exploits
Namespaces give each container its own kernel; cgroups give it its own scheduler

Why does the same process report PID 1 inside the container but a high PID on the host?

It has its own PID namespace where it is the first process, but the host's process table still lists it normally
Docker launches two separate copies of the process, one running inside the container and a matching proxy on the host
The container runs its own private guest kernel, which assigns PIDs entirely independently of the host's kernel
Docker simply relabels the entrypoint as PID 1 by convention, a cosmetic number with no real effect on the process

When a container exceeds its --memory limit and is killed, what actually killed it?

The kernel's OOM killer, enforcing the memory cgroup limit Docker configured
Docker, which polls memory usage and kills the container once it crosses the limit
The container's own guest operating system, which ran out of RAM and panicked
The CPU scheduler, which paused the process indefinitely after it used too much memory

Why is it a mistake to treat namespace isolation as a security boundary for untrusted code?

Namespaces only hide objects from view; they do not contain a kernel exploit, and the kernel is shared with the host
Namespaces do not actually isolate anything at all, so the process can simply see and reach the host directly
cgroups already provide a complete security sandbox on their own, which makes namespace isolation redundant for that purpose
Each container runs its own separate kernel, so an escape can only ever reach that one container's private kernel
