Topic 07

Image Layers and the Union Filesystem


An image is not a single blob. It is an ordered stack of read-only layers, each one a set of filesystem changes, merged into a single coherent view by a union filesystem (overlay2 by default on most modern Linux hosts). Each instruction in a Dockerfile that touches the filesystem adds a layer on top of the one below; at run time the union mount stacks them all and the container sees one root filesystem, with a thin writable layer added on top for anything it changes.

This structure is the reason images are cheap to store and fast to ship, and it is also the source of two persistent surprises: deleting a file rarely makes an image smaller, and ten images built on the same base do not cost ten times the base. Both fall out of how layers stack and how they are shared, and both shape every size lesson later in the book.

A Layer Is a Filesystem Diff

Each layer records only what changed relative to the layer beneath it (files added, files modified, files deleted), not a fresh copy of the whole tree. The bottom layer of an image might be the entire Debian userland; the layer above it might be nothing but the handful of files an apt install dropped in; the one above that, your application code. Applied in order from the bottom up, those diffs reconstruct the full filesystem, but each layer on disk is just its own delta.

That diff model is why a Dockerfile's instruction count and ordering matter so much. Every filesystem-changing instruction is one more layer in the stack, and the contents of each layer are fixed the moment it is built. You cannot reach back into a lower layer and edit it; you can only add a new layer on top that masks or replaces what's below.
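The diff model above can be sketched in a few lines of Python. This is a toy illustration, not how Docker computes layers: a filesystem is modeled as a plain dict of path to contents, and diff_layer is a hypothetical helper name invented for the example.

```python
def diff_layer(lower: dict, upper: dict) -> dict:
    """Describe what changed going from the `lower` tree to the `upper` tree."""
    return {
        # Paths that exist only in the newer tree.
        "added": {p: c for p, c in upper.items() if p not in lower},
        # Paths present in both trees whose contents differ.
        "modified": {p: c for p, c in upper.items() if p in lower and lower[p] != c},
        # Paths that disappeared in the newer tree.
        "deleted": sorted(p for p in lower if p not in upper),
    }


# Toy "filesystems": the base userland, then the same tree after a package install.
base = {"/bin/sh": b"shell", "/etc/os-release": b"debian"}
after_install = {**base, "/usr/bin/curl": b"curl-binary"}

layer = diff_layer(base, after_install)
print(layer["added"])     # only the newly installed file
print(layer["modified"])  # nothing was changed in place
print(layer["deleted"])   # nothing was removed
```

The layer holds one entry, the file the install added, even though the tree it sits on has three files. That is the sense in which each layer is only its own delta.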

The Union Mount

overlay2 presents the stacked layers as a single merged directory tree. The lower layers are mounted read-only; the container's own writable layer sits on top; and a read for any path falls through the stack from the top down until it hits the first layer that contains that file. The container has no idea it is looking at a composite — it sees one ordinary filesystem rooted at /.

When two layers both contain a file at the same path, the higher one wins — its version is what the container sees, and the lower copy is shadowed but still physically present on disk. That shadowing is the mechanism behind both file replacement and, as you'll see below, deletion.
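The top-down lookup and the shadowing rule map closely onto Python's collections.ChainMap, which searches a list of dicts in order and returns the first match. The sketch below uses it as an analogy only; the layer contents are made up, and a real overlay mount works on directories, not dicts.

```python
from collections import ChainMap

# Read-only layers, described bottom to top.
base = {"/etc/os-release": "debian", "/etc/motd": "welcome"}
packages = {"/usr/bin/curl": "curl-binary"}
app = {"/app/main.py": "print('hi')", "/etc/motd": "app banner"}

# The container's writable layer starts empty and sits on top.
writable = {}

# ChainMap searches its maps left to right, so list the top of the stack first.
merged = ChainMap(writable, app, packages, base)

print(merged["/etc/motd"])      # the higher layer's version wins
print(base["/etc/motd"])        # the shadowed copy is still stored below
print(merged["/usr/bin/curl"])  # falls through to the first layer that has it
```

Nothing is copied to build the merged view; it is only a lookup order over layers that stay separate, which is why the shadowed lower copy still occupies space.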

Figure: How a union mount builds one filesystem from a stack of diffs

The read-only stack: the Debian userland at the bottom, the files apt install added above it, and your app code above that. Each layer is only its own diff of files added, changed, or deleted.

The merged view: overlay2 stacks them into one root filesystem. A read falls through top to bottom to the first layer that has the file; the writable layer on top holds runtime changes.

Copy-on-Write

The lower layers are read-only, so what happens when a container modifies a file that lives in one of them? overlay2 copies that file up into the container's writable layer first, then applies the change there — copy-on-write. The original in the read-only layer is never touched, which is exactly what lets many containers share the same lower layers safely: each one's edits land in its own private writable layer.

The cost shows up with large files. Modify a single byte of a 500 MB file that lives in a lower layer, and overlay2 must copy the entire 500 MB up before the write completes, costing disk space and latency you didn't expect. This previews the storage-driver chapter; for now the rule is that runtime modification of big files in lower layers is expensive, and durable data belongs in a volume, not the writable layer.
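A minimal sketch of copy-up, assuming a toy model in which layers are dicts and each container owns a private upper dict. The Container class, its method names, and the file path are hypothetical, and the file is scaled down to 5,000,000 bytes to stand in for the 500 MB case.

```python
class Container:
    """Toy container: shared read-only lower layers plus a private writable layer."""

    def __init__(self, lower_layers):
        self.lower = lower_layers  # list of dicts, top first; never mutated here
        self.upper = {}            # this container's own writable layer
        self.bytes_copied = 0      # running total of bytes duplicated by copy-up

    def read(self, path):
        # Search the writable layer first, then the lower layers from the top down.
        for layer in [self.upper, *self.lower]:
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    def write_byte(self, path, offset, value):
        if path not in self.upper:
            # Copy-up: the whole file is duplicated before the first write.
            original = self.read(path)
            self.upper[path] = bytearray(original)
            self.bytes_copied += len(original)
        self.upper[path][offset] = value


# One image layer holding a large file, shared by two containers.
image = [{"/data/big.bin": bytes(5_000_000)}]
a = Container(image)
b = Container(image)

a.write_byte("/data/big.bin", 0, 0xFF)  # a one-byte change
print(a.bytes_copied)                   # the entire file was copied up
print(b.read("/data/big.bin")[0])       # container b still sees the original byte
```

A second write to the same file in container a copies nothing more, because the file already lives in its writable layer. The expense is paid once per file, on first write.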

Layer Sharing

Layers are content-addressed — named by the hash of their contents — so identical layers are stored once and reused everywhere. Pull two images that both build on python:3.12-slim and the shared base layers are downloaded and stored a single time; only the layers that differ cost extra bytes. The same holds across running containers: a hundred containers from one image share its read-only layers and add only their own thin writable layers.

This is why the arithmetic of "ten images, each on a 120 MB base, equals 1.2 GB" is wrong. The base is stored once at 120 MB, and the ten images add only their own distinct upper layers — often a few megabytes each. Building your images on a small, common base turns that sharing into a real disk and bandwidth saving.
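The arithmetic can be checked with a toy content-addressed store. The sketch below is an illustration under stated assumptions: LayerStore is a hypothetical class, sizes are scaled so that 1,000 bytes stand for 1 MB, and each image's upper layer is assumed to be 5 MB, a stand-in for "a few megabytes".

```python
import hashlib


class LayerStore:
    """Toy store that keys each layer blob by the SHA-256 of its contents."""

    def __init__(self):
        self.blobs = {}  # digest -> bytes

    def add(self, content: bytes) -> str:
        digest = "sha256:" + hashlib.sha256(content).hexdigest()
        # Identical content hashes to the same key, so it is stored only once.
        self.blobs.setdefault(digest, content)
        return digest

    def total_bytes(self) -> int:
        return sum(len(blob) for blob in self.blobs.values())


store = LayerStore()
base = b"B" * 120_000  # scaled stand-in for a 120 MB base layer

images = []
for i in range(10):
    app = f"app-{i}".encode() * 1_000  # a distinct 5,000-byte upper layer per image
    images.append([store.add(base), store.add(app)])

naive = 10 * (120_000 + 5_000)  # if every image carried a private copy of the base
print(naive)                    # 1250000, i.e. 1.25 GB at scale
print(store.total_bytes())      # 170000, i.e. 170 MB at scale
print(len(store.blobs))         # 11 blobs: one shared base plus ten app layers
```

All ten images reference the same base digest, so the base contributes its bytes once no matter how many images are built on it.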

Deletions Don't Shrink Lower Layers

Deleting a file in a later layer does not remove its bytes. overlay2 records the deletion as a whiteout marker — a small entry that says "this path is gone" — and the merged view hides the file, but the original still sits in the layer that added it, taking the same space it always did. The image's total size is the sum of all its layers, and a whiteout adds to that sum rather than subtracting from it.

The consequence cuts two ways, size and security. A package cache written in one layer and removed with rm in the next is still in the image, fully present and fully extractable, which is why running RUN apt install … && rm -rf /var/lib/apt/lists/* as a single instruction matters. Likewise, a secret added in one layer and deleted in a later one is still recoverable by anyone with the image. This is the seed of every image-size and image-hygiene lesson that follows.
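The whiteout behavior can be sketched with the same toy dict model. This is an illustration, not the on-disk format: WHITEOUT is a made-up sentinel, the paths and secret value are invented, and the sketch counts a whiteout as zero bytes, whereas a real marker takes a small amount of space.

```python
WHITEOUT = object()  # hypothetical sentinel standing in for a whiteout marker


def merged_view(layers):
    """Apply layers bottom to top; a whiteout hides the path from the result."""
    view = {}
    for layer in layers:
        for path, content in layer.items():
            if content is WHITEOUT:
                view.pop(path, None)
            else:
                view[path] = content
    return view


def image_size(layers):
    """Sum the bytes stored in every layer, hidden files included."""
    return sum(
        len(content)
        for layer in layers
        for content in layer.values()
        if content is not WHITEOUT
    )


layers = [
    {"/etc/os-release": b"debian"},               # base layer
    {"/run/secret.key": b"hunter2-private-key"},  # layer that adds a secret
    {"/run/secret.key": WHITEOUT},                # later layer that "deletes" it
]

print("/run/secret.key" in merged_view(layers))      # False: hidden from the container
print(layers[1]["/run/secret.key"])                  # still readable from its own layer
print(image_size(layers) == image_size(layers[:2]))  # True: no bytes were reclaimed
```

The merged view no longer shows the secret, yet anyone who opens the second layer directly can read it, and the image is no smaller than before the deletion.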

Common Mistakes

Assuming a RUN rm in a later instruction reclaims the deleted file's space — it doesn't; the file stays in the earlier layer and only a whiteout is added, so the image stays the same size and any deleted secret is still extractable from its layer.
Treating image size as the sum of the files you see inside the container — duplicated and shadowed copies across layers inflate the real total well beyond what ls in the running container suggests.
Modifying many large files at run time and being surprised by disk use and latency — copy-on-write duplicates each one into the writable layer on first write, so a one-byte change to a 500 MB file copies all 500 MB.
Thinking each image carries a private full copy of its base — content-addressed layers are stored once, so ten images on the same base cost the base once plus each image's own upper layers, not ten times the base.

Best Practices

Combine related filesystem work into one RUN — install, use, and clean up in a single instruction — so intermediate files never persist in a lower layer where a later deletion can't remove them.
Build on a small, shared base such as python:3.12-slim or alpine so the heavy common layers are pulled and stored once across all your images.
Order instructions from least- to most-frequently-changed so the stable bottom layers stay cached and shared while only the volatile top layers rebuild — the full cache lesson lands in the Dockerfile chapter.
Never rely on a later deletion to remove a secret or a large file added in an earlier layer. The bytes remain in that earlier layer and ship with the image; keep them out of the layer in the first place.

Comparable tools

overlay2: the default storage driver implementing the union mount today
aufs · devicemapper · btrfs: older or alternative drivers built on the same layered copy-on-write idea, implemented differently
OCI image spec: standardizes the layer format so Podman and containerd read the same images

Knowledge Check

What does a single image layer actually contain?

The filesystem changes relative to the layer below it — files added, modified, or deleted
A complete, self-contained copy of the entire root filesystem as it stands at that point in the build
The literal Dockerfile instruction text that the daemon re-executes when the container runs
The container's private runtime writable data captured for that stage of execution

How does the union mount produce a single filesystem from a stack of layers?

overlay2 stacks the layers and a read falls through from the top down to the first layer that has the file
It physically copies and merges all the layers into one brand-new directory tree each time the container starts
It concatenates the files drawn from every layer in strict alphabetical order into a single flat tree
It permanently deletes the lower copy from disk whenever two layers contain a file at the same path

Why does deleting a file in a later Dockerfile instruction not shrink the image?

The deletion adds a whiteout marker in a later layer, while the file's bytes stay in the earlier layer
The deletion removes the file from the merged view and reclaims its bytes from the lower layer
overlay2 transparently compresses the deleted file in place instead of removing it, so some of its space still remains
The space is reclaimed automatically, but only after the daemon restarts and re-reads the full layer stack

Why do ten images built on the same base not cost ten times the base size on disk?

Layers are content-addressed, so the shared base layers are stored once and each image adds only its own distinct layers
Docker periodically scans the whole local store and removes duplicate base copies as a background housekeeping job
The base layer is aggressively compressed once on disk and each image references that single compressed copy at run time
Each image re-downloads its own base, but the daemon detects and merges them into a single shared copy on first run
