Chapter 10: Security
Topic 63

Read-Only Filesystems and no-new-privileges

Immutability · Read-only

A default container has a writable layer (Chapter 6), so a process that lands inside can drop a binary, rewrite a config, or persist a backdoor that survives a restart. Running with --read-only makes the root filesystem immutable and forces you to declare exactly which directories actually need to be writable, mounting those as tmpfs. Paired with --security-opt no-new-privileges, which blocks setuid escalation, driftwood/web becomes a process that can neither modify itself nor gain privileges it didn't start with.

These two flags close the last common in-container attack paths. The read-only rootfs stops persistence and tampering; no-new-privileges stops a non-root process from climbing back up through a setuid binary. Together with the non-root user, the dropped capabilities, and the seccomp and LSM profiles from the earlier topics, a compromise inside driftwood/web has nowhere durable to write, no way to escalate, and very little it can do.

Three flags that close the in-container paths
--read-only
Immutable rootfs. The image and OS files can't be altered at run time, so a compromise can't drop a binary, rewrite a config, or persist a backdoor.
--tmpfs /tmp
Writable scratch in memory. Declares exactly where the app may write — an ephemeral filesystem that vanishes on stop — while the rootfs stays immutable.
--security-opt no-new-privileges
Blocks setuid escalation. Sets the kernel flag that stops a non-root process from climbing up through a setuid-root binary.

The Writable Layer Is the Problem

Every container gets a thin read-write layer stacked over the read-only image layers (Chapter 6). That layer is convenient — it is where an app writes temp files and logs — and it is also where an attacker writes. Malware persists by dropping a file there; a tampered binary survives because the write sticks for the life of the container. Immutability removes that surface: if the layer can't be written, none of it can happen.

--read-only Forces Immutability

The --read-only flag mounts the entire root filesystem read-only, so the application code and the OS files inside the image cannot be altered at run time. The container can read its image but not rewrite it. A useful side effect is that the flag surfaces every place the app secretly expected to write — the moment it tries, you get an error, and now you know about a write path you didn't know existed.
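
One way to find those write paths ahead of time, sketched here under assumptions: run the container once without --read-only, exercise it, and list what changed in its writable layer with docker diff, which prefixes each path with A (added), C (changed), or D (deleted). The helper name written_paths is hypothetical, and the sample input below only stands in for real docker diff output.

```shell
#!/bin/sh
# Hypothetical helper: reduce `docker diff` output to the paths that were
# added (A) or changed (C) in the container's writable layer.
written_paths() {
  awk '/^[AC] / { print substr($0, 3) }' | LC_ALL=C sort -u
}

# Sample input standing in for `docker diff web` (illustrative paths only).
printf '%s\n' 'C /var' 'A /var/run/app.pid' 'A /tmp/upload-1' 'D /etc/old.conf' \
  | written_paths
```

Against a real container the call would be docker diff web | written_paths. Each path it prints is a candidate for a tmpfs mount or a volume before the read-only flag goes on.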

tmpfs for Writable Scratch

Apps still need some writable paths — /tmp, a cache directory, a pid file. The answer is not to drop --read-only but to declare those paths explicitly with --tmpfs (or --mount type=tmpfs), which gives each one an in-memory, ephemeral filesystem that vanishes when the container stops. Writes are allowed exactly where you declared them and nowhere durable, so the app works while the rootfs stays immutable.
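
A tmpfs mount can also carry explicit mount options after a colon, which keeps each scratch path small and non-executable. The sketch below is a dry run that only prints flags. The helper name tmpfs_flags, the /run path, and the 64m size cap are assumptions for illustration, not values taken from driftwood/web.

```shell
#!/bin/sh
# Hypothetical helper: print one --tmpfs flag per scratch path, with explicit
# mount options. The option set and size cap are assumptions to tune per app.
tmpfs_flags() {
  opts="rw,noexec,nosuid,size=64m"
  for path in "$@"; do
    printf '%s %s:%s\n' --tmpfs "$path" "$opts"
  done
}

# Dry run: show the flags for two assumed scratch paths.
tmpfs_flags /tmp /run
```

The printed flags can be pasted into a docker run command next to --read-only. Keeping the list of paths in one place makes it easy to review exactly where the container is allowed to write.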

driftwood/web read-only, with tmpfs scratch and escalation blocked
docker run -d --name web \
  --user app \
  --cap-drop=ALL \
  --read-only \
  --tmpfs /tmp \
  --security-opt no-new-privileges \
  -p 8000:8000 \
  driftwood/web

This is the accumulated hardening so far in one command: the non-root app user from topic 60, the dropped capabilities from topic 61, the read-only rootfs with a tmpfs for /tmp, and no-new-privileges. A process that lands inside has no root, no capabilities, no writable rootfs, and no way to escalate.
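
To confirm that a running container actually carries these settings, the configuration can be read back with docker inspect. This is a minimal sketch: the function names are hypothetical, and it assumes a container named web as in the command above. The checks only query the daemon when the docker CLI is installed and the container exists.

```shell
#!/bin/sh
# Hypothetical checks that read back a container's settings via docker inspect.
is_readonly() {
  [ "$(docker inspect --format '{{.HostConfig.ReadonlyRootfs}}' "$1")" = "true" ]
}

has_no_new_privs() {
  docker inspect --format '{{.HostConfig.SecurityOpt}}' "$1" \
    | grep -q 'no-new-privileges'
}

# Only query the daemon when the docker CLI is present and the container exists.
if command -v docker >/dev/null 2>&1 && docker inspect web >/dev/null 2>&1; then
  if is_readonly web; then echo "rootfs: read-only"; else echo "rootfs: writable"; fi
  if has_no_new_privs web; then
    echo "no-new-privileges: set"
  else
    echo "no-new-privileges: missing"
  fi
fi
```

A check like this fits in a deploy script or CI step, so a flag that was dropped from the run command by accident is noticed before the container serves traffic.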

no-new-privileges Stops setuid Escalation

--security-opt no-new-privileges sets the kernel's no_new_privs flag, which prevents a process from gaining privileges through setuid or setgid binaries (or through file capabilities) when it executes them. Even if a setuid-root binary exists somewhere in the image (sudo, ping, or one that slipped in with a base image), it cannot be used to escalate. That closes a classic in-container privilege-escalation path: a compromised non-root process finds a setuid-root binary and rides it up to root. With the flag set, the ride goes nowhere.
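
On Linux the flag is visible per process as the NoNewPrivs field in /proc/<pid>/status (1 when set, 0 otherwise). The sketch below reads that field. The helper name nnp_flag is hypothetical, and the script only reports on the process tree it runs in.

```shell
#!/bin/sh
# Hypothetical helper: print the NoNewPrivs value (0 or 1) from a
# /proc/<pid>/status-style file; defaults to the calling process.
nnp_flag() {
  awk '/^NoNewPrivs:/ { print $2 }' "${1:-/proc/self/status}"
}

# On a Linux host this reports the setting inherited by this script.
if [ -r /proc/self/status ]; then
  echo "NoNewPrivs=$(nnp_flag)"
fi
```

Inside the hardened container, reading /proc/1/status the same way (for example with docker exec web grep NoNewPrivs /proc/1/status, assuming the image ships grep) would be expected to show 1 for the main process.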

Driftwood, Hardened

Stack everything this chapter has added and driftwood/web runs --read-only with a tmpfs for /tmp, --security-opt no-new-privileges, as the non-root app user, with --cap-drop=ALL. A process that compromises the app has no root, no capabilities, no writable root filesystem, and no path to escalate. It is not invulnerable — it can still abuse the network or try to read data the app legitimately reads — but the easy wins are gone.

A read-only rootfs is one layer, not the whole answer. It stops persistence and tampering; it does not stop a process from exfiltrating data over a connection the app is allowed to make, and it does not protect a writable volume mounted into the container. Treat it as one more independent control in the defense-in-depth stack, valuable precisely because it does not depend on the others holding.

Common Mistakes
  • Adding --read-only without mounting a tmpfs for the paths the app writes: the container starts, then fails the first time it writes a temp file or a pid file. Declare the writable directories explicitly.
  • Assuming --read-only protects mounted volumes too — named volumes and bind mounts stay writable unless you mount them :ro themselves; the flag covers the container's own layers, not your data mounts.
  • Skipping no-new-privileges and leaving setuid binaries in the image — a compromised non-root process can still escalate through a setuid-root binary that slipped into the base image.
  • Treating a read-only rootfs as the whole answer — it stops persistence and tampering, but a process can still exfiltrate data or abuse the network; it is one layer in the stack, not the stack itself.
Best Practices
  • Run hardened services with --read-only and mount only the specific scratch paths the app needs as tmpfs, so writes are ephemeral and contained.
  • Add --security-opt no-new-privileges to every container that doesn't legitimately rely on setuid, closing the setuid escalation path for free.
  • Mount data volumes :ro wherever the container only needs to read them, rather than leaving every mount writable by default.
  • Use the read-only run as a design test — every path the app tried to write reveals undeclared state that probably belongs in a volume or a tmpfs anyway.
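
The volume caveat can be checked from the mount table, where each mount lists ro or rw among its options. The sketch below parses a /proc/mounts-style file. The helper name mount_mode is hypothetical, and the sample table, device names, and /data path are made up to show a read-only root next to a data mount that was left writable.

```shell
#!/bin/sh
# Hypothetical helper: report "ro" or "rw" for a mount point, reading a
# /proc/mounts-style file (fields: device, mount point, type, options, ...).
mount_mode() {
  awk -v mp="$1" '$2 == mp {
    n = split($4, opt, ",")
    for (i = 1; i <= n; i++)
      if (opt[i] == "ro" || opt[i] == "rw") print opt[i]
  }' "${2:-/proc/mounts}"
}

# Illustrative mount table for a --read-only container with one data volume
# that was mounted without :ro (device names and paths are made up).
sample="$(mktemp)"
cat > "$sample" <<'EOF'
overlay / overlay ro,relatime 0 0
tmpfs /tmp tmpfs rw,nosuid,nodev,noexec,relatime 0 0
/dev/sda1 /data ext4 rw,relatime 0 0
EOF

echo "/     -> $(mount_mode / "$sample")"
echo "/data -> $(mount_mode /data "$sample")"
rm -f "$sample"
```

In this sample the root reports ro while /data still reports rw, which is the gap the :ro suffix on the volume mount closes.
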
Comparable tools
  • Kubernetes maps these to readOnlyRootFilesystem: true and allowPrivilegeEscalation: false in a security context.
  • Podman honors the same --read-only, --tmpfs, and no-new-privileges flags.
  • distroless: minimal base images pair naturally with an ephemeral-root pattern (Chapter 1, topic 12).

Knowledge Check

What problem does --read-only solve that the other controls don't?

  • It makes the writable layer immutable, so a compromised process can't drop a binary, rewrite a config, or persist
  • It drops all of the container's remaining Linux capabilities at once with a single convenient runtime flag
  • It runs the main process as a dedicated non-root user without needing any USER instruction in the Dockerfile
  • It encrypts the container's image layers at rest so they can't be read straight off the host's disk by an attacker

Why mount a tmpfs alongside --read-only instead of just dropping the flag?

  • Apps still need a few writable paths; a tmpfs gives those an ephemeral in-memory filesystem while the rootfs stays locked
  • A tmpfs is exactly where you store the app's durable data so that it reliably survives container restarts and full host reboots
  • A tmpfs is required because read-only filesystems otherwise make every disk read noticeably slower under load
  • A tmpfs re-enables writing across the entire root filesystem, quietly undoing the read-only flag everywhere

What does --security-opt no-new-privileges block?

  • A process gaining privileges through a setuid or setgid binary, closing the non-root escalation path
  • The container from opening any outbound network connections to other hosts or services on the network
  • The container from acquiring any new Linux capabilities at build time, beyond the curated default set
  • All writes to the entire root filesystem, which makes the separate --read-only flag completely redundant

Does --read-only protect a named volume mounted into the container?

  • No — the flag covers the container's own layers; volumes and bind mounts stay writable unless mounted :ro
  • Yes — it makes every single mounted path read-only too, including all named data volumes and bind mounts alike
  • No — because --read-only blocks the runtime from mounting any volume into the container at all
  • Yes — but only for in-memory tmpfs mounts, never for persistent named volumes or bind mounts
