Topic 33

The Writable Layer and Copy-on-Write


Every running container gets exactly one thin read-write layer stacked on top of its image's read-only layers (topic 04, topic 07), and every byte the process writes outside a mounted volume lands there. That layer is bound to the container's lifecycle: remove the container and the writable layer goes with it. This is precisely why Driftwood's db container loses every bookmark when it's replaced: Postgres wrote into the disposable layer, and docker rm took the data with the container.

Copy-on-write is the mechanism underneath, and it explains two things people are routinely surprised by: why launching another container from the same image is nearly free, and why the first write to a large file that came from the image is slow. Reads fall through to the layer that owns the file; writes copy the whole file up first. Once you hold those two rules, the storage behavior of a container stops being mysterious.

The Writable Layer Is the Container

docker run takes an image's read-only layers and adds a single read-write layer on top. The process inside sees one merged filesystem tree (it cannot tell which file came from which layer), but everything it creates, modifies, or deletes is recorded only in that top layer. That writable layer exists exactly as long as the container does: it is created along with the container, survives stops and restarts, and is destroyed when the container is removed. That makes it the most literal expression of the disposability model: the container is its writable layer plus a reference to a shared read-only image.
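To make the merged view concrete, here is a minimal Python sketch, assuming each layer is just a dictionary from path to file contents. It is a hypothetical toy model of the lookup order, not how overlay2 is implemented, and it simplifies a modification to a whole-file replacement: the process sees one tree, but every create or modify touches only the writable dictionary.

```python
# Hypothetical toy model of a container's merged filesystem view.
# Each layer is a plain dict of path -> contents; image layers are listed base first.

def merged_view(image_layers, writable):
    """The single tree the process sees: a higher layer shadows the same path below it."""
    view = {}
    for layer in [*image_layers, writable]:  # apply the base first so higher layers win
        view.update(layer)
    return view

def write_file(writable, path, data):
    """Every create or modify is recorded in the writable layer only."""
    writable[path] = data

base = {"/etc/os-release": b"debian"}        # read-only base layer
app = {"/app/config.ini": b"debug=false"}    # read-only image layer on top of it
writable = {}                                # this container's read-write layer

write_file(writable, "/app/config.ini", b"debug=true")  # modify an image file
write_file(writable, "/tmp/cache", b"scratch")          # create a new file

view = merged_view([base, app], writable)
print(view["/app/config.ini"])  # b'debug=true' (the writable layer wins)
print(app["/app/config.ini"])   # b'debug=false' (the image layer is untouched)
```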

The layer stack, base to top:
Base image layer: read-only, shared, never modified
More image layers: read-only, stacked on the base
Writable layer: read-write, thin, discarded on docker rm

Because the read-only layers are shared, ten containers off the same postgres:16 image do not each carry a copy of Postgres; they share the image's layers and each get their own thin writable layer for whatever diverges. That sharing is why a second container starts without copying any image data and costs almost nothing extra on disk. The cost only appears when a container writes.
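The disk arithmetic behind that sharing fits in a few lines. This is a back-of-the-envelope sketch; the 400 MB image and 5 MB of divergence per container are hypothetical numbers picked only to make the comparison visible.

```python
# Back-of-the-envelope disk model for containers sharing one image.
# All sizes are hypothetical numbers chosen for illustration.

def shared_disk_mb(image_mb, writable_mbs):
    """Read-only layers are stored once; each container adds only its writable layer."""
    return image_mb + sum(writable_mbs)

def naive_disk_mb(image_mb, writable_mbs):
    """What it would cost if every container carried its own copy of the image."""
    return sum(image_mb + w for w in writable_mbs)

image_mb = 400.0           # hypothetical size of the image's read-only layers
writable_mbs = [5.0] * 10  # ten containers, each diverging by 5 MB (hypothetical)

print(shared_disk_mb(image_mb, writable_mbs))  # 450.0
print(naive_disk_mb(image_mb, writable_mbs))   # 4050.0
```

Adding a container grows the shared total only by its own writable layer, which is why the tenth container is as cheap as the second.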

Copy-on-Write, File by File

Reading an unmodified file is direct: the storage driver — overlay2 on a current Linux Docker — resolves the path down through the layers and hands back the bytes from whichever read-only layer owns the file, with no copying at all. The first write to such a file is where copy-on-write earns its name. Before the change can be applied, overlay2 copies the entire file up from the lower read-only layer into the container's writable layer, then modifies the copy. The original in the lower layer is never touched, which is what keeps it shareable across every other container using that image.

Once a file has been copied up, subsequent writes to it are ordinary writes against the copy in the writable layer — the copy-up tax is paid only on the first modification. Files the container creates from scratch live in the writable layer from birth and pay no copy-up at all; the cost is specific to the first write of a file that originated in a read-only layer.
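The three rules in this section (reads fall through, the first write to an image file copies the whole file up, later writes and brand-new files pay nothing extra) can be modeled with a small hypothetical class. This is a sketch of the behavior, not overlay2's code; the class name CowFS and its byte counter are inventions for illustration.

```python
# Toy copy-on-write model (hypothetical; not overlay2's implementation).
# Reads fall through; the first write to a lower-layer file copies the whole file up.

class CowFS:
    def __init__(self, lower):
        self.lower = lower         # read-only: path -> bytes (shared with other containers)
        self.upper = {}            # this container's writable layer: path -> bytearray
        self.bytes_copied_up = 0   # total bytes duplicated by copy-up

    def read(self, path):
        # Reads never copy: use the upper copy if one exists, else the lower file.
        if path in self.upper:
            return bytes(self.upper[path])
        return self.lower[path]

    def write(self, path, offset, data):
        if path not in self.upper:
            if path in self.lower:
                # First modification of an image file: copy the entire file up.
                self.upper[path] = bytearray(self.lower[path])
                self.bytes_copied_up += len(self.lower[path])
            else:
                # Brand-new file: born in the writable layer, no copy-up.
                self.upper[path] = bytearray()
        buf = self.upper[path]
        if offset > len(buf):                  # writing past the end: zero-fill the gap
            buf.extend(b"\0" * (offset - len(buf)))
        buf[offset:offset + len(data)] = data  # patch the copy in place

fs = CowFS({"/etc/app.conf": b"x" * 1000})   # one 1000-byte file in the image
fs.read("/etc/app.conf")
print(fs.bytes_copied_up)                    # 0: reads fall through, nothing copied
fs.write("/etc/app.conf", 0, b"y")           # first write: whole file copied up
fs.write("/etc/app.conf", 1, b"y")           # later writes hit the copy directly
fs.write("/var/log/new.log", 0, b"hello")    # new file: born in the writable layer
print(fs.bytes_copied_up)                    # 1000
```

The lower dictionary is never mutated, which mirrors why the original file stays shareable across every other container using the image.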

Why Big Files Are Slow to First-Write

Copy-up duplicates the whole file regardless of how few bytes you actually change. Append one line to a 2 GB file that lives in a lower layer, and overlay2 copies all 2 GB into the writable layer before writing your one line. For a config tweak measured in kilobytes this is invisible. For a database file or a large media asset modified in place, it is ruinous: the latency is the size of the file, not the size of the edit, and it lands on the first write to that file in every new container, since each one starts with an empty writable layer.
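Putting numbers on the 2 GB example, here is a minimal sketch of the bytes that hit the writable layer for one in-place edit. It treats "2 GB" as 2 GiB, and the 80-byte appended line is a hypothetical figure; bytes_written is an illustrative helper, not a Docker API.

```python
# Back-of-the-envelope copy-up cost (toy model; treats "2 GB" as 2 GiB).
GIB = 1024 ** 3

def bytes_written(file_size, edit_size, from_lower_layer, already_copied_up):
    """Bytes written into the writable layer for one in-place edit."""
    if from_lower_layer and not already_copied_up:
        return file_size + edit_size  # copy the whole file up, then apply the edit
    return edit_size                  # an ordinary write: only the edit itself

line = 80  # hypothetical length of the appended line, in bytes
first = bytes_written(2 * GIB, line, from_lower_layer=True, already_copied_up=False)
later = bytes_written(2 * GIB, line, from_lower_layer=True, already_copied_up=True)
print(first)  # 2147483728
print(later)  # 80
```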

This is the mechanical reason databases do not belong in the container filesystem. A database rewrites pages in place across files that grow into gigabytes; any of those files that came from a lower layer turns its first page write into a full-file copy, and all the write-heavy I/O after that still runs through the layered filesystem instead of going straight to disk. The fix is not a faster storage driver; it is to keep the data out of the layered filesystem entirely, on a volume, which the rest of this chapter is about.

Deletes Are Whiteouts, Not Reclaims

Deleting a file that lives in a lower read-only layer does not free its bytes. overlay2 cannot edit a read-only layer, so instead it writes a small "whiteout" marker in the writable layer that hides the file from the merged view. The file still occupies space in the layer that owns it; it has only become invisible. This is why a container's apparent disk usage and the real storage its layers consume can diverge sharply, and it is the same fact that drives the image-size lessons in chapter 5: deleting a large file in a later layer never shrinks the earlier layer that added it.
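A hypothetical toy model shows why a delete hides bytes without reclaiming them. Real overlayfs records the whiteout as a special file in the upper directory; a Python sentinel object stands in for it here, and the function names are illustrative only.

```python
# Toy model of deletes in an overlay-style filesystem (hypothetical). Real overlayfs
# records a special whiteout file; a sentinel object stands in for it here.
WHITEOUT = object()

def delete(lower, upper, path):
    if path in lower:
        upper[path] = WHITEOUT  # the read-only layer can't be edited, so hide the file
    else:
        upper.pop(path, None)   # a file that only ever lived in the writable layer is really gone

def visible(lower, upper):
    """The merged view: whiteouts hide lower-layer files."""
    view = dict(lower)
    for path, data in upper.items():
        if data is WHITEOUT:
            view.pop(path, None)
        else:
            view[path] = data
    return view

def stored_bytes(layer):
    """Bytes a layer actually holds (a whiteout marker counts as zero here)."""
    return sum(len(d) for d in layer.values() if d is not WHITEOUT)

lower = {"/data/big.bin": b"\0" * 5000, "/etc/hosts": b"127.0.0.1"}
upper = {}
delete(lower, upper, "/data/big.bin")
print("/data/big.bin" in visible(lower, upper))  # False: hidden from the container
print(stored_bytes(lower))                       # 5009: the bytes are still on disk
```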

Why Data Dies With the Container

Put the pieces together and the writable layer is scratch space by definition. It is created with the container, it holds everything the process writes, and it is destroyed on docker rm, on a compose down (which removes the container), on a compose up that recreates the container, and on any redeploy, since a redeploy is a new container with a fresh, empty writable layer. Anything that must survive those events has to live outside the layer.
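The lifecycle difference between the writable layer and a volume can be sketched as a hypothetical Python model: the writable layer is owned by the container object and disappears with it, while the volume is a separate object that the next container simply mounts again. The classes and mount logic are illustrative assumptions, not Docker internals.

```python
# Hypothetical lifecycle model: a writable layer lives and dies with its container,
# while a volume is an independent object that outlives any one container.

class Volume:
    def __init__(self):
        self.files = {}

class Container:
    def __init__(self, image, volumes=None):
        self.image = image            # shared read-only layers (not copied)
        self.writable = {}            # fresh and empty for every new container
        self.volumes = volumes or {}  # mount point -> Volume

    def write(self, path, data):
        for mount, vol in self.volumes.items():
            if path.startswith(mount + "/"):
                vol.files[path[len(mount) + 1:]] = data  # lands in the volume
                return
        self.writable[path] = data                       # lands in the writable layer

image = ({"/usr/bin/postgres": b"<binary>"},)
data = Volume()

old = Container(image, {"/var/lib/postgresql/data": data})
old.write("/var/lib/postgresql/data/base/1", b"rows")
old.write("/tmp/scratch", b"temp")

del old                                                     # docker rm: writable layer gone
new = Container(image, {"/var/lib/postgresql/data": data})  # redeploy
print(new.writable)  # {}: fresh layer, /tmp/scratch is lost
print(data.files)    # {'base/1': b'rows'}: the volume survived
```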

That is the entire motivation for the rest of the chapter. Durable data — Driftwood's Postgres data directory, user uploads, anything you would mourn losing — must live in a volume or a bind mount that is independent of the container's lifecycle. The writable layer is for truly throwaway state and nothing else. The catalog of where durable data goes instead, and how to choose among the options, starts on the next page.

Common Mistakes

Running postgres:16 with no volume, so its data directory writes into the container's writable layer: a docker rm, a compose down, or a compose up that recreates the container deletes every row, which is the exact Driftwood failure this chapter opens on.
Writing application logs, user uploads, or a SQLite file into the container filesystem and being surprised they vanish on the next deploy — a redeploy is a new container with a fresh, empty writable layer, not the old one with its data intact.
Editing a file inside a running container to "fix" something — the change lives only in that container's writable layer and disappears on removal; the fix belongs in the Dockerfile and a rebuilt image (topic 04).
Doing heavy in-place writes to large files in the container filesystem and blaming Docker for the latency — copy-on-write copies each large file up on its first modification, so the cost is the full file size, not the size of the edit.

Best Practices

Treat the writable layer as ephemeral scratch and put every byte you can't afford to lose in a volume — assume docker rm can happen at any moment, because in a redeploy it does.
Keep databases, uploads, and anything mutated in place out of the container filesystem, so copy-on-write never has to copy large files up on write.
Reason about disk usage and write latency by remembering that reads fall through and writes copy up; it explains both why launching another container is nearly free and why in-place writes cost the full file.
Verify a container's storage layout with docker inspect before trusting it in production, so you know exactly what sits on a volume and what is in the disposable layer.
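One way to act on that last practice is to feed the JSON that docker inspect prints into a small check. This is a sketch under stated assumptions: it inspects a single container, relies on the "Mounts" entries' "Type" and "Destination" keys, and uses a trimmed, hypothetical sample rather than real output; unprotected_paths and the path list are illustrative names.

```python
import json

PERSISTENT_TYPES = {"volume", "bind"}  # tmpfs mounts live in memory and vanish too

def unprotected_paths(inspect_json, must_persist):
    """Return paths you expect to survive docker rm that no volume or bind mount covers.

    inspect_json is the text printed by `docker inspect <container>` for ONE
    container: a JSON array whose first element has a "Mounts" list.
    """
    container = json.loads(inspect_json)[0]
    mounts = [m["Destination"] for m in container.get("Mounts", [])
              if m.get("Type") in PERSISTENT_TYPES]

    def covered(path):
        return any(path == d or path.startswith(d.rstrip("/") + "/") for d in mounts)

    return [p for p in must_persist if not covered(p)]

# Trimmed, hypothetical output for a container named "db". The real text could come from
# subprocess.run(["docker", "inspect", "db"], capture_output=True, text=True, check=True).stdout
sample = """[{"Name": "/db",
  "GraphDriver": {"Name": "overlay2", "Data": {}},
  "Mounts": [{"Type": "volume", "Name": "pgdata",
              "Source": "/var/lib/docker/volumes/pgdata/_data",
              "Destination": "/var/lib/postgresql/data", "RW": true}]}]"""

print(unprotected_paths(sample, ["/var/lib/postgresql/data", "/app/uploads"]))
# ['/app/uploads']: still in the disposable writable layer
```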

Comparable tools
Podman · nerdctl: the same writable-layer/copy-on-write model over overlay2 or fuse-overlayfs
aufs · devicemapper · btrfs snapshots: the union/copy-on-write idea that predates Docker
qcow2: a VM's copy-on-write disk image, the closest analog in the VM world

Knowledge Check

Where do a running container's file writes actually go, and why don't they survive removal?

Into a single thin read-write layer on top of the image's read-only layers, which is destroyed when the container is removed
Directly into the image's read-only layers, which is why a rebuilt image from the same base carries the data forward into every new container without a volume
Into a Docker-managed volume by default, so the data persists across removal unless you explicitly opt out
Nowhere — the host kernel blocks writes to a running container's filesystem entirely

What does copy-on-write do the first time a container writes to a file that lives in a read-only layer?

It copies the entire file up into the writable layer first, then modifies the copy, leaving the lower layer untouched
It writes only the changed bytes into the writable layer as a small block-level diff and leaves the rest of the file untouched in the lower read-only layer, reassembling the two on every read
It edits the file in place in the read-only layer it came from, sharing that change with other containers
It refuses the write outright because the target file belongs to an immutable read-only layer

Why does appending one line to a 2 GB file inside a container cost the full 2 GB?

Copy-on-write copies the whole file up to the writable layer on first modification, so the cost is the file size, not the edit size
The writable layer transparently compresses every file on write to save space, and a 2 GB file has to be re-compressed in full on each append, which is what makes the operation expensive every single time
Docker rebuilds the entire writable layer from scratch on any change to a single file inside it
Each appended byte triggers its own separate copy-up of the whole 2 GB file in turn

What happens when a container deletes a file that lives in a lower read-only layer?

overlay2 writes a whiteout marker in the writable layer that hides the file, without freeing its bytes in the lower layer
The bytes are immediately reclaimed from the lower read-only layer the instant the delete runs, shrinking the image's on-disk storage and trimming the layer that originally added the file
The file is removed directly from the read-only layer it originally came from, freeing the space
The delete silently fails because read-only layers reject all file removals outright
