Topic 76

Namespaces and cgroups


Namespaces and cgroups are the two kernel features that make Linux containers possible. A namespace virtualizes a single global resource — process IDs, mount points, network interfaces, hostnames, user IDs — so that processes inside it see their own private instance of that resource. A cgroup (control group) does the opposite job: it accounts for and limits how much CPU, memory, I/O, and how many processes a set of tasks may consume. Neither feature is a container by itself; a container runtime such as runc or crun simply assembles a bundle of namespaces plus a cgroup plus a root filesystem and starts a process inside it.

The operational consequence is that there is no "container" object in the kernel to inspect — only processes wearing namespace and cgroup memberships. When a container behaves strangely, you debug it with the same tools you use for any process: ls -l /proc/<pid>/ns to see which namespaces it joined, and cat /proc/<pid>/cgroup to see which control group constrains it. Understanding the two primitives directly is what separates someone who can fix a wedged container from someone who can only restart it.
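As a minimal sketch of that inspection, the snippet below prints the namespace links and the cgroup path of the current shell. It assumes a cgroup v2 host, where /proc/<pid>/cgroup holds a single line of the form 0::<path>; the helper name cgroup_v2_path is illustrative and not part of any tool.

```shell
# Hypothetical helper: extract the cgroup v2 path from /proc/<pid>/cgroup
# content supplied on stdin. On a v2 host the file has one "0::<path>" line.
cgroup_v2_path() {
    sed -n 's/^0:://p'
}

# Show the namespaces and the cgroup of the current shell.
# (Reading another user's /proc/<pid>/ns entries requires root.)
if [ -d "/proc/$$/ns" ]; then
    ls -l "/proc/$$/ns"
fi
if [ -r "/proc/$$/cgroup" ]; then
    cgroup_v2_path < "/proc/$$/cgroup"
fi
```

Replacing $$ with the PID of a container's main process gives the same view for that container.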

The seven namespace types

Linux defines eight namespace types as of kernel 5.6, though seven are in routine container use. Each isolates one class of global resource. A process can be a member of one namespace of each type at a time, and child processes inherit their parent's namespaces unless explicitly given new ones. The kernel creates them through three syscalls: clone() with CLONE_NEW* flags when forking, unshare() to move the calling process into fresh namespaces, and setns() to join an existing one.

Namespace	clone flag	Isolates
Mount (mnt)	CLONE_NEWNS	Filesystem mount points and propagation
PID	CLONE_NEWPID	Process ID number space; first process is PID 1
Network (net)	CLONE_NEWNET	Interfaces, routes, iptables/nftables, sockets
IPC	CLONE_NEWIPC	System V IPC, POSIX message queues
UTS	CLONE_NEWUTS	Hostname and NIS domain name
User	CLONE_NEWUSER	UID/GID mappings and capabilities
Cgroup	CLONE_NEWCGROUP	The root of the cgroup hierarchy a process sees

The eighth type, Time (CLONE_NEWTIME, added in 5.6), virtualizes CLOCK_MONOTONIC and CLOCK_BOOTTIME offsets and is used mainly for checkpoint/restore (CRIU); container runtimes rarely create it by default. The user namespace is special because it is the only one an unprivileged process can create without CAP_SYS_ADMIN, which is why it underpins rootless containers.
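The unprivileged nature of the user namespace can be checked with a short sketch like the one below. It assumes the util-linux unshare tool and a host that permits unprivileged user namespaces; some hosts restrict them through sysctl or security-module policy, in which case the hypothetical helper userns_root_uid reports "unavailable" instead.

```shell
# Hypothetical helper: print the UID seen inside a fresh user namespace,
# or "unavailable" if the host forbids unprivileged user namespaces.
userns_root_uid() {
    unshare --user --map-root-user id -u 2>/dev/null || echo "unavailable"
}

echo "outside: $(id -u)"
echo "inside:  $(userns_root_uid)"
```

When the call succeeds without sudo, the inner UID is 0 even though the outer UID is an ordinary user, which is the mapping rootless containers build on.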

Inspecting and entering namespaces

Every namespace a process belongs to appears as a symbolic link under /proc/<pid>/ns/; the inode number in the link target is the namespace's identity. Two processes share a namespace if and only if those inode numbers match, which is how you prove that a sidecar really is in the same network namespace as its main container rather than merely on the same host.

# What namespaces does PID 4123 belong to?
ls -l /proc/4123/ns
# lrwxrwxrwx net -> 'net:[4026532567]'  <- the inode is the identity

# Run a shell in a brand-new net + UTS + PID namespace
sudo unshare --net --uts --pid --fork --mount-proc /bin/bash

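# Enter the namespaces of an existing process (e.g. a container)
# With --mount, the ip binary is looked up in the container's filesystem;
# drop --mount to run the host's copy against the container's network stack
sudo nsenter --target 4123 --net --mount --uts ip addr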

On Debian and Ubuntu the unshare and nsenter tools ship in the util-linux package, which is part of the base system, so no installation is needed. The --mount-proc flag is important when creating a PID namespace: without remounting /proc, the new shell would still show the host's full process list and break PID isolation. Red Hat and Fedora ship the identical binaries from the same upstream util-linux.
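The inode comparison described above can be scripted. This is a minimal sketch, with same_ns as a hypothetical helper name; it compares the link targets (strings such as net:[4026532567]) for two PIDs and a namespace type.

```shell
# Hypothetical helper: succeed if two PIDs share the namespace of a given type.
# Usage: same_ns <pid1> <pid2> <type>, where type is net, pid, mnt, uts, ...
# Returns 0 if shared, 1 if different, 2 if either link cannot be read.
same_ns() {
    a=$(readlink "/proc/$1/ns/$3" 2>/dev/null) || return 2
    b=$(readlink "/proc/$2/ns/$3" 2>/dev/null) || return 2
    [ "$a" = "$b" ]
}

# A child started normally inherits its parent's namespaces.
sleep 5 &
child=$!
if same_ns "$$" "$child" net; then
    echo "PID $$ and PID $child share a net namespace"
fi
kill "$child" 2>/dev/null
```

Comparing PIDs owned by another user needs root, because the links under /proc/<pid>/ns are only readable with sufficient privilege.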

cgroup v2 resource control

Modern Debian (11+) and Ubuntu (21.10+) boot with the unified cgroup v2 hierarchy mounted at /sys/fs/cgroup, replacing the v1 design where each controller had its own separate hierarchy. In v2 there is one tree, and every process belongs to exactly one cgroup in it. Controllers — cpu, memory, io, pids — are enabled per subtree by writing to a parent's cgroup.subtree_control file, subject to the "no internal processes" rule: a cgroup that has child cgroups with controllers enabled may not itself contain processes.
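Before writing any limits it can help to confirm which hierarchy the host actually mounted. The sketch below relies on the filesystem type reported by GNU stat for /sys/fs/cgroup; the helper name cgroup_mode and its labels are illustrative.

```shell
# Hypothetical helper: map the filesystem type of /sys/fs/cgroup to a mode.
# "cgroup2fs" is the unified v2 hierarchy; "tmpfs" is the usual v1 layout,
# where each controller is mounted in its own subdirectory.
cgroup_mode() {
    case "$1" in
        cgroup2fs) echo "v2" ;;
        tmpfs)     echo "v1 or hybrid" ;;
        *)         echo "unknown" ;;
    esac
}

fstype=$(stat -fc %T /sys/fs/cgroup 2>/dev/null)
echo "cgroup mode: $(cgroup_mode "$fstype")"

# On v2: controllers available at the root, then those enabled for its children.
cat /sys/fs/cgroup/cgroup.controllers /sys/fs/cgroup/cgroup.subtree_control 2>/dev/null || true
```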

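# Cap a group at 50% of one CPU and 256 MiB of memory (cgroup v2)
# Make sure the cpu and memory controllers are enabled for children of the
# root (often already the case on systemd hosts); otherwise cpu.max and
# memory.max will not exist in the new cgroup
echo "+cpu +memory" | sudo tee /sys/fs/cgroup/cgroup.subtree_control
sudo mkdir /sys/fs/cgroup/demo
echo "50000 100000" | sudo tee /sys/fs/cgroup/demo/cpu.max
echo "268435456" | sudo tee /sys/fs/cgroup/demo/memory.max
# Move the current shell into it
echo $$ | sudo tee /sys/fs/cgroup/demo/cgroup.procs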

The first cpu.max number is the quota in microseconds, the second the period: 50000 100000 means 50 ms of CPU time every 100 ms, or half a core. memory.max applies to the cgroup as a whole, not to any single process: when the group's total usage reaches the limit and the kernel cannot reclaim enough memory, the OOM killer terminates a task in that cgroup rather than picking a victim system-wide, so the limit is enforced locally. Read accounting and pressure back from memory.current, cpu.stat, and memory.pressure (PSI).
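The quota arithmetic can be checked with a small sketch. The helper name cpu_max_to_cores is hypothetical; it takes a line in the cpu.max format and prints the equivalent number of cores, treating the literal "max" as no limit.

```shell
# Hypothetical helper: convert a cpu.max line ("<quota> <period>") to cores.
cpu_max_to_cores() {
    echo "$1" | awk '{
        if ($1 == "max") print "unlimited";
        else printf "%.2f\n", $1 / $2
    }'
}

cpu_max_to_cores "50000 100000"    # 0.50
cpu_max_to_cores "200000 100000"   # 2.00
cpu_max_to_cores "max 100000"      # unlimited

# 256 MiB expressed in bytes, as written to memory.max
echo $((256 * 1024 * 1024))        # 268435456
```

On a live system the same helper can be fed the real file, for example cpu_max_to_cores "$(cat /sys/fs/cgroup/demo/cpu.max)".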

systemd as the cgroup manager

On Debian and Ubuntu you almost never write to /sys/fs/cgroup by hand in production, because systemd owns the cgroup tree as the single writer. Every service, scope, and slice systemd starts gets its own cgroup, and you set limits through unit directives instead of raw files. This avoids two managers fighting over the same hierarchy, which is the cause of most "my limit keeps disappearing" incidents.

# Limit a service via a drop-in (preferred over editing the unit)
sudo systemctl edit nginx.service
# In the editor, under [Service]:
CPUQuota=50%
MemoryMax=256M
TasksMax=512

# Run an ad-hoc command inside a transient, capped cgroup
sudo systemd-run --scope -p MemoryMax=512M -p CPUQuota=80% stress-ng --vm 2

Inspect live accounting with systemd-cgls to see the tree and systemd-cgtop for a per-cgroup top-style view of CPU, memory, and I/O. The same CPUQuota= and MemoryMax= directives are honored on current Red Hat and Fedora releases because they run the same systemd; the main divergence is that RHEL 7, and RHEL 8 by default, use cgroup v1, where the legacy CPUShares= and MemoryLimit= directives apply instead of CPUWeight= and MemoryMax=.
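To connect the unit directives to the raw cgroup files, the sketch below predicts what a directive should turn into on a cgroup v2 host. Both helper names are hypothetical, and the CPU conversion assumes systemd's default quota period of 100 ms; a unit that sets a different period would produce different numbers.

```shell
# Hypothetical helper: predict cpu.max for a CPUQuota= percentage.
# Assumes the default 100 ms (100000 us) period.
cpuquota_to_cpu_max() {
    pct=${1%\%}                       # strip a trailing "%"
    echo "$((pct * 1000)) 100000"
}

# Hypothetical helper: predict memory.max for a MemoryMax= value.
# systemd treats the K, M, and G suffixes as powers of 1024.
memorymax_to_bytes() {
    n=${1%[KMG]}
    case "$1" in
        *K) echo $((n * 1024)) ;;
        *M) echo $((n * 1024 * 1024)) ;;
        *G) echo $((n * 1024 * 1024 * 1024)) ;;
        *)  echo "$1" ;;
    esac
}

cpuquota_to_cpu_max 50%     # 50000 100000
memorymax_to_bytes 256M     # 268435456
```

The predictions can be compared with the unit's actual cgroup files, which for a system service typically live under /sys/fs/cgroup/system.slice/<unit>/, or with the output of systemctl show -p MemoryMax <unit>.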

Namespaces vs cgroups

Namespaces control what a process can see: its own PIDs, mounts, network stack, hostname, and user IDs. They partition visibility, not capacity. A process in a fresh net namespace simply cannot observe the host's interfaces; nothing about how much CPU or memory it uses changes.

cgroups control what a process can use: a ceiling on CPU time, memory, I/O bandwidth, and task count, plus the accounting to measure it. They constrain consumption, not visibility. A process can sit in a capped cgroup while still using the host's initial namespaces and be held to half a core all the same. A container is both halves applied to the same process tree at once.

Common Mistakes

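Creating a PID namespace with unshare --pid but omitting --fork --mount-proc. Without --fork the shell itself stays in the old PID namespace and the first command it runs becomes PID 1 of the new one; once that command exits, later forks from the shell fail. Without --mount-proc, /proc still describes the host's processes, so ps and similar tools show the wrong process list.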
Editing files under /sys/fs/cgroup directly on a systemd host. systemd rewrites the tree on the next daemon-reload or unit restart and your manual limit silently vanishes; use unit directives or systemd-run instead.
Assuming cgroup v1 behavior on a v2 system. Tunables like cpu.cfs_quota_us and the per-controller mount points do not exist in v2; the equivalents are cpu.max and a single unified tree.
Running a container with no memory.max at all, so one runaway process consumes host RAM and triggers the system-wide OOM killer — which can kill an unrelated, critical service instead of the offender.
Believing a container is a security boundary equal to a VM. Namespaces hide resources but do not virtualize the kernel; a kernel-level vulnerability or an over-broad capability set turns a container escape into host compromise.
Forgetting that a new network namespace starts with only a loopback that is administratively down. Without a veth pair, bridge, or ip link set lo up, the namespace has no usable connectivity at all.
Running rootless containers without configuring /etc/subuid and /etc/subgid. The user namespace has no UID range to map into, so image extraction and chown inside the container fail.
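The subordinate ID requirement in the last item can be checked before starting a rootless runtime. This is a minimal sketch that assumes the standard name:start:count line format of /etc/subuid and /etc/subgid; subid_range is a hypothetical helper name.

```shell
# Hypothetical helper: print the subordinate ID range for a user, reading
# subuid/subgid-formatted lines ("name:start:count") from stdin.
# Prints "none" when the user has no entry.
subid_range() {
    awk -F: -v u="$1" '
        $1 == u { printf "%d-%d\n", $2, $2 + $3 - 1; found = 1 }
        END     { if (!found) print "none" }'
}

if [ -r /etc/subuid ]; then
    echo "subuid range for $(id -un): $(subid_range "$(id -un)" < /etc/subuid)"
fi
```

A result of "none" for the user who will run the containers points at the failure mode described above.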

Best Practices

Set resource limits through systemd unit directives — CPUQuota=, MemoryMax=, TasksMax=, IOWeight= — so a single manager owns the cgroup tree and the limits survive restarts.
Always set a memory.max (or MemoryMax=) on every container so a runaway workload is killed inside its own cgroup instead of dragging the host into a system-wide OOM event.
Prefer MemoryHigh= for throttling and reserve MemoryMax= as the hard ceiling, giving the workload back-pressure before the OOM killer engages.
Run containers rootless with a user namespace: populate /etc/subuid and /etc/subgid so root inside the container maps to an unprivileged host UID, and drop every capability you do not need.
Always pass --fork --mount-proc when creating a PID namespace with unshare so /proc reflects the new namespace and PID 1 semantics work.
Verify isolation by comparing inode numbers from /proc/<pid>/ns/ rather than trusting tool labels, and debug live with nsenter --target <pid> instead of installing tooling into the image.
Monitor saturation with PSI files (cpu.pressure, memory.pressure, io.pressure) and systemd-cgtop rather than host-wide averages, which hide per-cgroup contention.
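For the PSI monitoring in the last item, a sketch of reading one value from a pressure file might look like this. The helper name psi_avg10 is hypothetical, and the sample numbers are invented purely to show the file format of "some" and "full" lines.

```shell
# Hypothetical helper: extract the avg10 value from PSI-formatted input.
# $1 selects the line: "some" or "full".
psi_avg10() {
    awk -v kind="$1" '$1 == kind { sub(/^avg10=/, "", $2); print $2 }'
}

# Invented sample in the PSI file format
sample='some avg10=1.50 avg60=0.80 avg300=0.20 total=12345
full avg10=0.25 avg60=0.10 avg300=0.05 total=678'

printf '%s\n' "$sample" | psi_avg10 some   # 1.50
printf '%s\n' "$sample" | psi_avg10 full   # 0.25
```

Against a live cgroup the input would come from its memory.pressure, cpu.pressure, or io.pressure file instead of the sample string.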

Comparable tools

Windows: containers backed by job objects and silos for resource limits and isolation
BSD: FreeBSD jails, a single combined visibility-plus-resource isolation primitive
Kubernetes: the orchestration layer that schedules and composes these primitives across hosts

Knowledge Check

A teammate edits /sys/fs/cgroup/myapp/memory.max directly on an Ubuntu 24.04 host, but after the next service restart the limit is gone. Why?

systemd owns and rewrites the cgroup tree, so manual changes are overwritten; limits should be set with unit directives or systemd-run.
cgroup v2 does not support a memory.max file at all, so the write was silently discarded by the kernel without any error.
Memory limits are deliberately stored in kernel RAM only and never persist across any process restart on this host by design.
The host is still running legacy cgroup v1, where memory limits live under an entirely separate controller hierarchy that gets reset on each restart.

You want to harden a container against a kernel-level escape. Which statement is correct about namespaces?

Namespaces hide and partition global resources but share the host kernel, so a kernel vulnerability can cross the boundary that a VM would hold.
Each namespace runs its own private copy of the kernel, so any kernel bug stays fully contained within the single namespace that happened to trigger it.
The user namespace alone already provides full kernel-level isolation that is equivalent to the boundary a hypervisor enforces.
It is cgroups, not namespaces, that actually isolate and protect the host kernel from the container's processes.

A new net namespace has been created but a process inside it cannot reach 127.0.0.1. What is the most likely cause?

A fresh network namespace contains only a loopback interface that is administratively down; it needs ip link set lo up.
Loopback traffic is dropped by default inside the namespace until an explicit nftables accept rule for it is added there.
The namespace inherits the full host routing table, which wrongly sends loopback traffic out the default gateway instead of lo.
cgroup v2 keeps loopback disabled until the dedicated net controller is explicitly enabled in that subtree.

For a latency-sensitive service that occasionally spikes memory, which cgroup v2 configuration best avoids hard kills while still capping usage?

Set MemoryHigh below MemoryMax so the kernel throttles and reclaims under pressure before the hard ceiling triggers the OOM killer.
Set MemoryMax exactly equal to the expected spike size with no MemoryHigh at all, relying solely on the kernel OOM killer for protection.
Disable the memory controller entirely for that subtree so the occasional spikes are never accounted against any limit.
Use CPUQuota on the service instead, since throttling its CPU time indirectly prevents the memory spikes.
