Topic 61

Linux Capabilities

Root is not all-or-nothing. The kernel splits root's powers into ~40 discrete capabilities — bind a low port, change file ownership, load a kernel module, override file permissions, trace another process — and a process holds a subset rather than the whole. Docker already drops most of them for you, so "root in a container" is a reduced root, not full host root. Container UID 0 cannot, for example, load a kernel module, because CAP_SYS_MODULE was never granted.

The hardening move is to finish the job the default started: drop everything with --cap-drop=ALL and add back only the one or two capabilities the workload actually uses. For driftwood/web running gunicorn on port 8000 as app, that add-back list is empty — the app needs no special kernel power at all, so it runs with zero capabilities.

Capabilities Split Up Root

Instead of a single privileged-or-not flag, the kernel grants powers individually. CAP_NET_BIND_SERVICE lets a process bind ports below 1024. CAP_CHOWN lets it change file ownership. CAP_SYS_ADMIN is a grab-bag that covers mount, namespace, and many other administrative operations. CAP_SYS_MODULE loads kernel modules. There are roughly 40 of them, and each process carries its own set. Dropping a capability removes that specific power even from UID 0: root without CAP_CHOWN genuinely cannot change a file's owner.

That granularity is the whole point. The question stops being "is this process root" and becomes "which of root's ~40 powers does this process actually hold," which is a question you can answer per workload and trim to almost nothing.
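Each capability is one bit in a set, and the kernel reports those sets as hex masks in /proc/<pid>/status (the CapEff line is the effective set). As a minimal sketch of asking "which powers does this process hold", the helper below tests a single bit in such a mask. The name has_cap is hypothetical; the bit numbers are the ones defined in <linux/capability.h>.

```shell
# has_cap MASK BIT: succeed if bit number BIT is set in MASK.
# MASK is a hex string without the 0x prefix, as printed in /proc/<pid>/status.
# has_cap is a hypothetical helper name, not a standard tool.
has_cap() {
  [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# Bit numbers from <linux/capability.h>
CAP_CHOWN=0
CAP_NET_BIND_SERVICE=10
CAP_SYS_MODULE=16

# Read the effective set of the current process (only works on Linux).
eff=$(awk '/^CapEff:/ { print $2 }' /proc/self/status 2>/dev/null || true)
if [ -n "$eff" ]; then
  if has_cap "$eff" "$CAP_SYS_MODULE"; then
    echo "this process may load kernel modules"
  else
    echo "this process may not load kernel modules"
  fi
fi
```

Run inside a container, the same check answers the question per workload rather than per user: UID 0 with the CAP_SYS_MODULE bit cleared takes the second branch.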

Three capability postures, from default to extremes

Default set (~14 caps)

Docker grants a curated subset, including CHOWN, NET_BIND_SERVICE, and SETUID, and drops the dangerous ones. A reduced root, but more than most apps need.

--cap-drop=ALL + --cap-add=NET_BIND_SERVICE

Least privilege. Start from zero and add back only the one power the workload provably needs — often none.

--privileged

Avoid. Grants ALL capabilities, disables seccomp, and exposes host devices — an effectively unconfined root on the host.

Docker's Default Set Is Already Reduced

A default container does not get the full capability list. Docker grants a curated subset of about 14, including common ones like CAP_CHOWN, CAP_NET_BIND_SERVICE, and CAP_SETUID, and drops the dangerous ones such as CAP_SYS_ADMIN, CAP_SYS_MODULE, and CAP_SYS_PTRACE outright. This is why container root cannot load a kernel module or arbitrarily mount filesystems: the first layer of least privilege is already on before you do anything.

It is a reduced root, but it is still more than most apps need. The default set is sized to make the average image work without surprises, not sized to your specific workload — which is the gap the next section closes.
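To see what the default set contains, a capability mask can be decoded bit by bit. The sketch below assumes the mask 00000000a80425fb, the value commonly reported as CapEff for root in a default Docker container; your runtime or Docker version may report something different. decode_caps is a hypothetical helper, and `capsh --decode=<mask>` does the same job where libcap's tools are installed.

```shell
# decode_caps MASK: print the name of every capability whose bit is set.
# MASK is a hex string without the 0x prefix. Names are listed in bit order
# (bit 0 first), following <linux/capability.h>.
# decode_caps is a hypothetical helper name.
decode_caps() {
  mask=$1
  bit=0
  for name in chown dac_override dac_read_search fowner fsetid kill setgid \
              setuid setpcap linux_immutable net_bind_service net_broadcast \
              net_admin net_raw ipc_lock ipc_owner sys_module sys_rawio \
              sys_chroot sys_ptrace sys_pacct sys_admin sys_boot sys_nice \
              sys_resource sys_time sys_tty_config mknod lease audit_write \
              audit_control setfcap mac_override mac_admin syslog wake_alarm \
              block_suspend audit_read perfmon bpf checkpoint_restore; do
    if [ $(( (0x$mask >> $bit) & 1 )) -eq 1 ]; then
      printf 'cap_%s\n' "$name"
    fi
    bit=$(($bit + 1))
  done
}

# Assumed mask for root in a default Docker container (see lead-in).
decode_caps 00000000a80425fb
```

For that mask the output is 14 names, among them cap_chown, cap_net_bind_service, and cap_setuid. cap_sys_admin, cap_sys_module, and cap_sys_ptrace do not appear.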

Drop All, Add Back What You Need

The hardened pattern is --cap-drop=ALL followed by --cap-add= for the specific capabilities the workload uses. You start from zero and grant only what the app provably needs, instead of starting from Docker's default set and hoping none of it is exploitable. For driftwood/web that add-back list is empty, because gunicorn on port 8000 as the non-root app user needs no capability at all.

Run driftwood/web with no capabilities

# drop every capability; the app needs none of them
docker run -d --name web \
  --user app \
  --cap-drop=ALL \
  -p 8000:8000 \
  driftwood/web

# a service that must bind a low port adds back exactly one
# (illustrative name, chosen so it does not collide with the container above)
docker run -d --name web-lowport \
  --cap-drop=ALL \
  --cap-add=NET_BIND_SERVICE \
  -p 80:80 \
  driftwood/web

The first form is what Driftwood uses: dropped to app, every capability gone, no add-back. The second shows the discipline when a capability is genuinely required — drop all, then add back the single named one, never the whole default set "just in case."
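One way to confirm the first form took effect is to read the container's capability masks and check that they are all zero. The sketch below is a hypothetical helper, caps_all_zero, that reads /proc/<pid>/status text on stdin. It checks the bounding set (CapBnd) as well as the permitted and effective sets, because the bounding set limits what the process could ever regain. The docker exec line in the comment assumes the image ships a cat binary.

```shell
# caps_all_zero: read /proc/<pid>/status text on stdin and succeed only if
# the permitted (CapPrm), effective (CapEff), and bounding (CapBnd) sets are
# all present and all zero. Hypothetical helper name.
caps_all_zero() {
  awk '
    /^Cap(Prm|Eff|Bnd):/ { seen++; if ($2 !~ /^0+$/) bad = 1 }
    END { exit !(seen == 3 && bad == 0) }
  '
}

# Usage against the container started above (assumes cat exists in the image):
#   docker exec web cat /proc/1/status | caps_all_zero && echo "no capabilities"
```

A container started without --cap-drop=ALL would fail this check on CapBnd alone, since the bounding set would still hold Docker's default capabilities.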

NET_BIND_SERVICE, the Common Add-Back

NET_BIND_SERVICE is the capability apps most often need back. It lets a non-root process bind ports below 1024, which is the precise, minimal alternative to running as root just to listen on 80. Where topic 60 solved the low-port problem by listening high behind a proxy, this is the other answer: keep the non-root user, drop every capability, and grant back exactly the one power that lets it hold the privileged port — and nothing else.

The --privileged Footgun

--privileged is the opposite of this entire topic. It grants every capability, disables the seccomp profile (topic 62), and exposes the host's devices — all in one flag. It turns the container into an effectively unconfined root on the host, undoing the reduced-root default and every layer this chapter adds. It exists for narrow cases like Docker-in-Docker or specific hardware access, and it is never the right answer to "this container needs a bit more access."

When a container can't do something, the correct response is to find the one capability or device it is missing and grant exactly that. Reaching for --privileged because diagnosing the missing capability is slower trades a one-line precise grant for an open door, and the door tends to stay open long after the original problem is forgotten.

Common Mistakes

Reaching for --privileged because a container "needs more access" — it grants all ~40 capabilities and disables seccomp at once; identify the one capability actually needed and --cap-add only that.
Running as root purely to bind port 80 instead of dropping to non-root and adding NET_BIND_SERVICE — you keep all of root's other powers for no reason beyond one port.
Leaving Docker's default capability set in place on a sensitive container — --cap-drop=ALL plus a tiny add-back list is a one-line reduction of the attack surface that costs nothing for most apps.
Adding CAP_SYS_ADMIN to make something work — it is so broad it's often called "the new root" and re-opens many of the escape paths the default set had already closed.

Best Practices

Start every hardened container with --cap-drop=ALL and add back only the specific capabilities the workload provably needs — for many apps, including driftwood/web, that add-back list is empty.
Use --cap-add=NET_BIND_SERVICE rather than root when a non-root process must bind a low port, so it gets exactly that power and no other.
Never use --privileged as a troubleshooting shortcut; diagnose the missing capability or device and grant exactly that one thing.
Treat any request for CAP_SYS_ADMIN as a design smell and look for a narrower capability or a different approach before granting it.
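These practices can be checked mechanically before a container ever starts. The sketch below is a hypothetical lint_run_flags helper that scans docker run style arguments for the patterns this topic warns about. It only recognizes the --flag=value spellings used in this topic (not the space-separated form), so treat it as an illustration rather than a complete policy check.

```shell
# lint_run_flags ARGS...: inspect docker-run style arguments.
# Returns 0 when --cap-drop=ALL is present and no overly broad grant is found,
# 1 otherwise. Hypothetical helper; matches only the --flag=value spelling.
lint_run_flags() {
  dropped_all=0
  rc=0
  for arg in "$@"; do
    case $arg in
      --privileged|--privileged=true)
        echo "FAIL: --privileged grants every capability"
        rc=1 ;;
      --cap-add=SYS_ADMIN|--cap-add=CAP_SYS_ADMIN|--cap-add=ALL)
        echo "FAIL: $arg is too broad"
        rc=1 ;;
      --cap-drop=ALL|--cap-drop=all)
        dropped_all=1 ;;
    esac
  done
  if [ "$dropped_all" -eq 0 ]; then
    echo "WARN: no --cap-drop=ALL, so Docker's default set stays in place"
    rc=1
  fi
  return "$rc"
}

# The hardened driftwood/web invocation passes silently.
lint_run_flags --user app --cap-drop=ALL -p 8000:8000 driftwood/web
```

A --cap-add=NET_BIND_SERVICE argument passes as well, since a single named add-back on top of --cap-drop=ALL is the pattern this topic recommends.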

Comparable tools

Podman · nerdctl: honor the same --cap-drop / --cap-add flags, since capabilities are a kernel feature
Kubernetes: sets them under securityContext.capabilities
capsh · getpcaps: inspect a process's capability set on the host

Knowledge Check

What does it mean that "root in a container is already a reduced root"?

Docker grants only a curated subset of the ~40 capabilities and drops the dangerous ones, so container root can't load a module
The user namespace remaps it by default, so container root is already a harmless unprivileged non-root UID as seen from the host
Its CPU shares and memory are capped tightly by cgroups, and that resource ceiling is exactly what 'reduced root' refers to
It can only read files and never write them, because container root is mounted read-only by the runtime by default

What is the recommended pattern for capabilities on a hardened container?

Drop everything with --cap-drop=ALL, then add back only the specific capabilities the workload provably needs
Keep Docker's default capability set and add CAP_SYS_ADMIN on top so the container has plenty of room to grow
Use --privileged from the very start so that you never once have to reason about which individual capabilities the workload actually needs
Run with --cap-auto so the kernel profiles the workload and grants exactly the capabilities it calls for

When does a non-root container legitimately need NET_BIND_SERVICE added back?

When a non-root process must bind a privileged port below 1024, such as listening directly on port 80
Whenever the container opens any outbound network connection at all, even to an ordinary high port
For Driftwood listening on port 8000, which is exactly why the app adds this capability back by default
Whenever the container needs to resolve DNS names over UDP to reach other backing services

Why is --privileged a footgun rather than a convenience?

It grants every capability, disables seccomp, and exposes host devices at once — an effectively unconfined root on the host
It silently drops all capabilities and clears the bounding set, so a privileged container can't actually do anything useful at all
It only makes the container start a little more slowly, with no real security impact on the host at all
It forces the root filesystem read-only and blocks all device access, which breaks most ordinary applications
