Topic 62

seccomp and AppArmor/SELinux


Capabilities limit what powers a process holds. The controls in this topic limit two other things: seccomp limits which syscalls the process can make, and the Linux Security Modules (AppArmor on Debian and Ubuntu, SELinux on RHEL and Fedora) limit which files and resources it can touch. These are the layers that contain a process that somehow holds root and capabilities anyway, by blocking the calls and file accesses it would need to do damage.

Docker ships a default seccomp profile that blocks roughly 44 dangerous syscalls out of the 300-plus available, and on most hosts an LSM profile is already confining your containers. Both are on unless you opt out. The skill in this topic is mostly knowing they exist and not turning them off — because the standard wrong move is to disable the whole layer the first time an app trips over it.

Three kernel controls, on by default

seccomp

Filters which syscalls a process may make. The default profile blocks ~44 dangerous calls out of 300-plus, permitting the ones a normal app uses.

AppArmor

Path-based mandatory access control, default on Debian and Ubuntu via the docker-default profile.

SELinux

Label-based mandatory access control, default on RHEL and Fedora. Confines which files and resources a container can reach, even as root.

seccomp Filters Syscalls

A seccomp profile is a filter on system calls, the narrow interface through which every process asks the kernel to do anything. A profile can be written as an allowlist or a denylist; Docker's default is an allowlist that rejects anything not listed, which works out to ~44 blocked calls that a normal container never makes but an escape would: mount, reboot, kexec_load, kernel module loading, and similar low-level operations (ptrace was also blocked on kernels older than 4.8). It permits the hundreds of calls an ordinary application actually uses (read, write, open, socket), so the filter is invisible to normal workloads and a wall to abnormal ones.

The Default Profile Is On

Unless you opt out, every container runs under Docker's default seccomp profile. You rarely need to write your own. When you do — for a service you want locked down further, like driftwood/web — you supply a stricter profile with --security-opt seccomp=profile.json. The important thing is that the protection is already there by default; the failure mode is removing it, not forgetting to add it.
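To confirm that a filter really is active, one option is to read the Seccomp field the kernel exposes in /proc/<pid>/status. The sketch below is a minimal illustration, assuming a Linux host with /proc mounted; the helper name seccomp_mode_name is hypothetical.

```shell
# Map the numeric Seccomp field from /proc/<pid>/status to a readable name.
# 0 = no filter, 1 = strict mode, 2 = filter mode (a profile is loaded).
seccomp_mode_name() {
  case "$1" in
    0) echo "disabled" ;;
    1) echo "strict" ;;
    2) echo "filter" ;;
    *) echo "unknown" ;;
  esac
}

# Read the mode of the current process; stays empty if /proc is unavailable.
mode="$(awk '/^Seccomp:/ {print $2}' /proc/self/status 2>/dev/null || true)"
echo "seccomp mode: $(seccomp_mode_name "${mode:-}")"
```

Run inside a container (for example through docker exec), this should report filter under the default profile and disabled when the container was started with seccomp=unconfined.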

Apply a custom seccomp profile (built from the default, not from empty)

# lock driftwood/web down with a stricter profile
docker run -d --name web \
  --user app \
  --cap-drop=ALL \
  --security-opt seccomp=driftwood-web.json \
  -p 8000:8000 \
  driftwood/web

# the wrong move: never do this to get past one blocked syscall
# docker run --security-opt seccomp=unconfined ...

The custom profile driftwood-web.json starts as a copy of Docker's default and removes the few extra syscalls the app demonstrably never makes. The commented-out unconfined line is what you must not reach for: it switches the entire filter off to permit a single call.

AppArmor and SELinux Confine Resources

The LSMs enforce mandatory access control — rules the kernel applies regardless of file permissions or process identity. AppArmor is path-based and default on Debian and Ubuntu through the docker-default profile; SELinux is label-based and default on RHEL and Fedora. Both restrict which files, devices, and capabilities a container can reach even as root, so a confined process cannot read /etc/shadow on the host even if it breaks out of its mount namespace. Where seccomp gates the syscall, the LSM gates the resource the syscall would touch.

Applying Profiles

Both LSMs attach through --security-opt: apparmor=<profile> for AppArmor, label=<option> for SELinux. In practice the defaults are right, and the work is not writing profiles — it is knowing they are there and leaving them on. A custom AppArmor or SELinux profile is occasionally worth it for a tightly scoped service, but the common case is the default profile doing its job silently.
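As a rough illustration of the two option forms, the sketch below only prints the commands rather than running them. The helper lsm_opt is hypothetical, and type:web_app_t is an assumed SELinux type used purely as a placeholder; docker-default is the AppArmor profile named above.

```shell
# Build the --security-opt value for one LSM.
# $1 is "apparmor" or "selinux"; $2 is the profile name or label option.
lsm_opt() {
  case "$1" in
    apparmor) echo "--security-opt apparmor=$2" ;;
    selinux)  echo "--security-opt label=$2" ;;
    *)        echo "unknown LSM: $1" >&2; return 1 ;;
  esac
}

# Dry run: print the commands instead of executing them.
echo "docker run -d --name web $(lsm_opt apparmor docker-default) driftwood/web"
# type:web_app_t is a placeholder type, not one that ships with any policy.
echo "docker run -d --name web $(lsm_opt selinux type:web_app_t) driftwood/web"
```

Only one of the two applies on a given host, depending on which LSM the host runs.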

Don't Disable to "Fix" a Bug

The standard wrong move is reaching for --security-opt seccomp=unconfined — or apparmor=unconfined, or label=disable for SELinux — the moment an app hits a blocked syscall or a denied file access. That removes the entire layer to fix one call. It works, the error goes away, and the container now runs with no syscall filter or no mandatory access control at all, which nobody remembers six months later.

The right move is surgical. For a blocked syscall, copy Docker's default seccomp profile and add only the one call the app legitimately needs. For a denied file path, adjust the AppArmor or SELinux policy to allow that path. You keep the layer and open exactly the one hole the workload requires, rather than tearing the wall down because one brick was in the way.
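The surgical fix for a blocked syscall can be sketched as follows. This is a minimal illustration, not a complete workflow: allow_rule is a hypothetical helper, keyctl is only a placeholder syscall name, and default.json and custom.json are assumed file names for a local copy of Docker's published default profile and the edited result.

```shell
# Print a seccomp rule object that allows one syscall by name.
# The shape ("names" plus "action") follows Docker's JSON profile format.
allow_rule() {
  printf '{"names": ["%s"], "action": "SCMP_ACT_ALLOW"}\n' "$1"
}

# "keyctl" is only a placeholder; substitute the call your app needs.
allow_rule keyctl

# With jq installed, the rule can be appended to a copy of the default
# profile (file names here are assumptions):
#   jq --argjson rule "$(allow_rule keyctl)" '.syscalls += [$rule]' \
#     default.json > custom.json
#   docker run --security-opt seccomp=custom.json ...
```

Everything else in the default profile stays in force; the resulting file differs from it by exactly one rule.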

Common Mistakes

Running --security-opt seccomp=unconfined to get past a blocked syscall: it disables the entire ~44-syscall filter to permit one call. Copy the default profile and add only that syscall instead.
Disabling AppArmor or setting --security-opt label=disable for SELinux to silence a permission error — that removes mandatory access control wholesale and re-opens the file-access paths it was confining.
Assuming --privileged still leaves seccomp on — it disables the seccomp profile entirely, which is one more reason that flag is a footgun (topic 61).
Writing a custom seccomp profile from scratch and accidentally blocking syscalls the runtime needs, so the container won't even start — begin from Docker's default and subtract, never build from empty.
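One way to catch the first three mistakes before they ship is a small check over the arguments of a docker run command, for example in a deploy script. The sketch below is a hypothetical helper, not part of Docker; it only matches the flag spellings used in this topic.

```shell
# Report docker run arguments that switch off a whole protection layer.
# Returns 0 when nothing risky is found, 1 otherwise.
check_security_opts() {
  found=0
  for arg in "$@"; do
    case "$arg" in
      --privileged|--privileged=true)
        echo "disables the seccomp profile: $arg"; found=1 ;;
      *seccomp=unconfined)
        echo "removes the whole seccomp filter: $arg"; found=1 ;;
      *apparmor=unconfined)
        echo "removes AppArmor confinement: $arg"; found=1 ;;
      *label=disable)
        echo "removes SELinux confinement: $arg"; found=1 ;;
    esac
  done
  return "$found"
}

# A custom profile passes; going unconfined is reported.
check_security_opts --cap-drop=ALL --security-opt seccomp=driftwood-web.json || true
check_security_opts --security-opt seccomp=unconfined || true
```

The check is deliberately narrow: it flags the options that remove a layer and says nothing about whether a custom profile is itself well written.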

Best Practices

Leave the default seccomp and LSM profiles on; they cost nothing and block the syscalls and file accesses a normal workload never makes.
When an app needs a blocked syscall, copy the default seccomp profile and add only that one syscall, rather than going unconfined.
Keep AppArmor (Debian/Ubuntu) or SELinux (RHEL/Fedora) enabled on the host so containers inherit mandatory access control by default.
Verify the host's LSM is actually active with aa-status or getenforce on production hosts, since a silently disabled module removes a layer you assumed was there.
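Alongside aa-status and getenforce, a quick first check is the list of loaded security modules that recent kernels expose under securityfs. The sketch below assumes /sys/kernel/security/lsm is present and readable on the host; mac_module is a hypothetical helper name.

```shell
# Pick the mandatory-access-control module out of a comma-separated LSM list.
mac_module() {
  case ",$1," in
    *,apparmor,*) echo "apparmor" ;;
    *,selinux,*)  echo "selinux" ;;
    *)            echo "none" ;;
  esac
}

lsm_file=/sys/kernel/security/lsm
if [ -r "$lsm_file" ]; then
  echo "MAC module: $(mac_module "$(cat "$lsm_file")")"
else
  echo "MAC module: unknown (cannot read $lsm_file)"
fi
```

This only shows which module is loaded. Whether it is enforcing still has to be confirmed with aa-status or getenforce.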

Comparable tools

Podman · containerd honor the same seccomp, AppArmor, and SELinux kernel features
Falco · Tracee watch syscalls at runtime to detect what these profiles prevent
gVisor intercepts syscalls in userspace as a stronger alternative to seccomp filtering

Knowledge Check

What is the division of labor between seccomp and an LSM like AppArmor or SELinux?

seccomp filters which syscalls the process can make; the LSM confines which files and resources it can touch
seccomp limits which Linux capabilities a process holds while the LSM caps its CPU shares and memory usage
seccomp scans the image layers for known vulnerabilities while the LSM scans the running process for malware
They are simply two names for the same underlying syscall filter, one shipped on Debian and one on RHEL

What is true of Docker's default seccomp profile?

It is on by default and blocks ~44 dangerous syscalls while permitting the hundreds a normal app uses
It is off by default and must be enabled explicitly per container with a flag before it filters anything
It blocks nearly every syscall by default, which is why most ordinary containers need a hand-written custom profile to run
It only applies to containers explicitly started with the --privileged flag, and to nothing else

An app hits a syscall blocked by the default seccomp profile. What is the right response?

Copy the default profile and add back only the one extra syscall the app legitimately needs
Run with seccomp=unconfined to drop the whole filter and let every syscall through unchecked
Add --privileged to the run command so the syscall restriction no longer applies to the container
Write a brand-new profile from an empty allowlist that permits only that single syscall

How does an LSM protect the host even if a container process has root?

It enforces mandatory access control regardless of identity, so a confined process can't read host files like /etc/shadow
It strips the process of root entirely at exec time, turning every container process into an unprivileged ordinary non-root user
It transparently encrypts every file on the host at rest so a container process can only ever read back ciphertext
It boots a separate private kernel for the container so host files sit on a different kernel and stay unreachable
