Topic 63

Read-Only Filesystems and no-new-privileges

Immutability · Read-only

A default container has a writable layer (Chapter 6), so a process that lands inside can drop a binary, rewrite a config, or persist a backdoor that survives a restart. Running with --read-only makes the root filesystem immutable and forces you to declare exactly which directories actually need to be writable, mounting those as tmpfs. Paired with --security-opt no-new-privileges, which blocks setuid escalation, driftwood/web becomes a process that can neither modify itself nor gain privileges it didn't start with.

These two flags close the last common in-container attack paths. The read-only rootfs stops persistence and tampering; no-new-privileges stops a non-root process from climbing back up through a setuid binary. Together with the non-root user, the dropped capabilities, and the seccomp and LSM profiles from the earlier topics, a compromise inside driftwood/web has nowhere to write, nothing to escalate through, and very little it can do.

Three flags that close the in-container paths

--read-only

Immutable rootfs. The image and OS files can't be altered at run time, so a compromise can't drop a binary, rewrite a config, or persist a backdoor.

--tmpfs /tmp

Writable scratch in memory. Declares exactly where the app may write — an ephemeral filesystem that vanishes on stop — while the rootfs stays immutable.

no-new-privileges

Blocks setuid escalation. Sets the kernel flag that stops a non-root process from climbing up through a setuid-root binary.

The Writable Layer Is the Problem

Every container gets a thin read-write layer stacked over the read-only image layers (Chapter 6). That layer is convenient — it is where an app writes temp files and logs — and it is also where an attacker writes. Malware persists by dropping a file there; a tampered binary survives because the write sticks for the life of the container. Immutability removes that surface: if the layer can't be written, none of it can happen.

`--read-only` Forces Immutability

The --read-only flag mounts the entire root filesystem read-only, so the application code and the OS files inside the image cannot be altered at run time. The container can read its image but not rewrite it. A useful side effect is that the flag surfaces every place the app secretly expected to write — the moment it tries, you get an error, and now you know about a write path you didn't know existed.
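One way to confirm the flag took effect is to look at the mount options of / from a shell inside the container. The following is a minimal sketch, assuming a Linux container where /proc/mounts is available; the function name rootfs_is_readonly is hypothetical, and the optional file argument exists only so the check can be tried on sample data.

```shell
#!/bin/sh
# Sketch: succeed if the root filesystem is mounted read-only.
# Assumption: run inside the container; /proc/mounts is the default input.
rootfs_is_readonly() {
    mounts_file="${1:-/proc/mounts}"
    # Fields per line: device, mount point, fs type, options, dump, pass.
    # If "/" is listed more than once, the last entry is the one kept.
    awk '
        $2 == "/" { opts = $4 }
        END {
            n = split(opts, o, ",")
            for (i = 1; i <= n; i++) if (o[i] == "ro") found = 1
            exit (found ? 0 : 1)
        }
    ' "$mounts_file"
}
```

Called with no argument, for example as `rootfs_is_readonly && echo "rootfs is read-only"`, it inspects the live mount table of whatever environment it runs in.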

tmpfs for Writable Scratch

Apps still need some writable paths — /tmp, a cache directory, a pid file. The answer is not to drop --read-only but to declare those paths explicitly with --tmpfs (or --mount type=tmpfs), which gives each one an in-memory, ephemeral filesystem that vanishes when the container stops. Writes are allowed exactly where you declared them and nowhere durable, so the app works while the rootfs stays immutable.
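Both spellings can carry mount options, which is useful for capping how much memory a scratch path may consume. The sketch below is illustrative only: the /run mount and both size limits are assumptions, not values taken from driftwood/web, and the wrapper function name run_web_readonly is hypothetical.

```shell
#!/bin/sh
# Sketch: read-only rootfs with two bounded in-memory scratch mounts.
# Assumptions: the app writes to /tmp and /run; the sizes are placeholders.
run_web_readonly() {
    # --tmpfs takes mount options after a colon.
    # --mount type=tmpfs uses key=value pairs; tmpfs-size is given in bytes.
    docker run -d --name web \
        --read-only \
        --tmpfs /tmp:rw,noexec,nosuid,size=64m \
        --mount type=tmpfs,destination=/run,tmpfs-size=1048576 \
        -p 8000:8000 \
        driftwood/web
}
```

The noexec option on /tmp is a design choice worth considering: it keeps the one writable path from doubling as a place to drop and run a binary, though some apps do need to execute files from their temp directory.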

driftwood/web read-only, with tmpfs scratch and escalation blocked

docker run -d --name web \
  --user app \
  --cap-drop=ALL \
  --read-only \
  --tmpfs /tmp \
  --security-opt no-new-privileges \
  -p 8000:8000 \
  driftwood/web

This is the accumulated hardening so far in one command: the non-root app user from topic 60, the dropped capabilities from topic 61, the read-only rootfs with a tmpfs for /tmp, and no-new-privileges. A process that lands inside has no root, no capabilities, no writable rootfs, and no way to escalate.

no-new-privileges Stops setuid Escalation

--security-opt no-new-privileges sets the kernel flag that prevents a process from gaining privileges through setuid or setgid binaries. Even if a setuid-root binary exists somewhere in the image — sudo, ping, or one that slipped in with a base image — it cannot be used to escalate. That closes a classic in-container privilege-escalation path: a compromised non-root process finds a setuid-root binary and rides it up to root. With the flag set, the ride goes nowhere.
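Two quick checks from a shell inside the container can make this concrete: whether the kernel flag is actually set, and which setuid or setgid files the image carries. This is a sketch under assumptions: it relies on the NoNewPrivs line that recent Linux kernels expose in /proc/self/status, and both function names are hypothetical.

```shell
#!/bin/sh
# Sketch: two checks meant to be run from a shell inside the container.

# Succeeds if the no_new_privs flag is set. The flag is inherited by child
# processes, so the value awk sees for itself matches the calling shell's.
# The file argument exists only so the check can be tried on sample data.
no_new_privs_set() {
    status_file="${1:-/proc/self/status}"
    # The kernel reports the flag as a line of the form "NoNewPrivs:<tab>1".
    [ "$(awk '$1 == "NoNewPrivs:" { print $2 }' "$status_file")" = "1" ]
}

# Lists regular files under a directory that carry the setuid or setgid bit.
list_setid_files() {
    root="${1:-/}"
    # -xdev keeps find on one filesystem, skipping /proc, /sys, and volumes.
    find "$root" -xdev -type f \( -perm -4000 -o -perm -2000 \) 2>/dev/null
}
```

Anything list_setid_files reports is a candidate for removal from the image; with the flag set, those files can still be executed but no longer grant extra privileges.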

Driftwood, Hardened

Stack everything this chapter has added and driftwood/web runs as the non-root app user with --cap-drop=ALL, --read-only with a tmpfs for /tmp, and --security-opt no-new-privileges. A process that compromises the app has no root, no capabilities, no writable root filesystem, and no path to escalate. It is not invulnerable: it can still abuse the network or try to read data the app legitimately reads. But the easy wins are gone.
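To confirm that a running container really carries all of these settings, the values Docker recorded can be read back with docker inspect. A minimal sketch, assuming the container is named web as in the run command earlier; the function name show_hardening and the label text in the format string are hypothetical.

```shell
#!/bin/sh
# Sketch: print the hardening-related settings recorded for a container.
# Assumption: the container name defaults to "web".
show_hardening() {
    name="${1:-web}"
    # Each {{...}} is a Go template field from the container's inspect data.
    docker inspect --format \
        'user={{.Config.User}} readonly={{.HostConfig.ReadonlyRootfs}} capdrop={{.HostConfig.CapDrop}} secopt={{.HostConfig.SecurityOpt}} tmpfs={{.HostConfig.Tmpfs}}' \
        "$name"
}
```

Running show_hardening after each deploy is a cheap way to catch a flag that was dropped from a script or compose file without anyone noticing.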

A read-only rootfs is one layer, not the whole answer. It stops persistence and tampering; it does not stop a process from exfiltrating data over a connection the app is allowed to make, and it does not protect a writable volume mounted into the container. Treat it as one more independent control in the defense-in-depth stack, valuable precisely because it does not depend on the others holding.

Common Mistakes

Adding --read-only without mounting a tmpfs for the paths the app writes — the container starts, then crashes the first time it writes a temp file or a pid; declare the writable directories explicitly.
Assuming --read-only protects mounted volumes too — named volumes and bind mounts stay writable unless you mount them :ro themselves; the flag covers the container's own layers, not your data mounts.
Skipping no-new-privileges and leaving setuid binaries in the image — a compromised non-root process can still escalate through a setuid-root binary that slipped into the base image.
Treating a read-only rootfs as the whole answer — it stops persistence and tampering, but a process can still exfiltrate data or abuse the network; it is one layer in the stack, not the stack itself.

Best Practices

Run hardened services with --read-only and mount only the specific scratch paths the app needs as tmpfs, so writes are ephemeral and contained.
Add --security-opt no-new-privileges to every container that doesn't legitimately rely on setuid, closing the setuid escalation path for free.
Mount data volumes :ro wherever the container only needs to read them, rather than leaving every mount writable by default.
Use the read-only run as a design test — every path the app tried to write reveals undeclared state that probably belongs in a volume or a tmpfs anyway.

Comparable tools

Kubernetes · maps these to readOnlyRootFilesystem and allowPrivilegeEscalation: false in a security context
Podman · honors the same --read-only, --tmpfs, and no-new-privileges flags
distroless · minimal base images pair naturally with an ephemeral-root pattern (Chapter 1, topic 12)

Knowledge Check

What problem does --read-only solve that the other controls don't?

It makes the root filesystem immutable, so a compromised process can't drop a binary, rewrite a config, or persist a backdoor
It drops all of the container's remaining Linux capabilities at once with a single convenient runtime flag
It runs the main process as a dedicated non-root user without needing any USER instruction in the Dockerfile
It encrypts the container's image layers at rest so they can't be read straight off the host's disk by an attacker

Why mount a tmpfs alongside --read-only instead of just dropping the flag?

Apps still need a few writable paths; a tmpfs gives those an ephemeral in-memory filesystem while the rootfs stays locked
A tmpfs is exactly where you store the app's durable data so that it reliably survives container restarts and full host reboots
A tmpfs is required because read-only filesystems otherwise make every disk read noticeably slower under load
A tmpfs re-enables writing across the entire root filesystem, quietly undoing the read-only flag everywhere

What does --security-opt no-new-privileges block?

A process gaining privileges through a setuid or setgid binary, closing the non-root escalation path
The container from opening any outbound network connections to other hosts or services on the network
The container from acquiring any new Linux capabilities at build time, beyond the curated default set
All writes to the entire root filesystem, which makes the separate --read-only flag completely redundant

Does --read-only protect a named volume mounted into the container?

No — the flag covers the container's own layers; volumes and bind mounts stay writable unless mounted :ro
Yes — it makes every single mounted path read-only too, including all named data volumes and bind mounts alike
No — because --read-only blocks the runtime from mounting any volume into the container at all
Yes — but only for in-memory tmpfs mounts, never for persistent named volumes or bind mounts
