Topic 56

Swap and Memory


Swap is disk space the kernel uses as an overflow for RAM. When physical memory fills, the kernel writes the least-recently-used anonymous pages — heap and stack memory that has no backing file — out to a swap area, freeing RAM for whatever needs it now. The pages come back in on demand when a process touches them again. On Debian and Ubuntu, swap is either a dedicated partition or a file (commonly /swapfile), wired up in /etc/fstab and activated with swapon.

The operational point is that swap does not make a machine faster; it changes how it fails. A box with no swap and no headroom hits the out-of-memory killer the moment allocation outruns RAM, and the kernel shoots a process dead with no warning. A box with swap degrades first: latency climbs as pages are paged in and out, giving you a window to react. A page-in from NVMe swap costs on the order of tens of microseconds; from a spinning disk it costs milliseconds, which is why a thrashing server with rotational swap can become unusable while still technically "up."

Virtual Memory and Page Reclaim

Every process sees a private virtual address space; the kernel maps those virtual pages onto physical RAM in 4 KiB units (the default page size on x86-64). A process's allocated memory is rarely all resident at once, and the kernel keeps two kinds of pages it can reclaim under pressure. File-backed pages (the page cache holding executables, libraries, and read files) can be dropped instantly if clean, or written back first if dirty. Anonymous pages have no file behind them, so the only place to evict them to is swap.
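To see the anonymous versus file-backed split for one process, the RssAnon, RssFile, and RssShmem fields of /proc/<pid>/status can be read directly. The sketch below is a minimal helper, assuming a kernel recent enough to expose those fields; the function name rss_breakdown is made up for illustration.

```shell
# Split a process's resident memory into anonymous and file-backed parts.
# Argument: path to a /proc/<pid>/status file (defaults to this shell's own).
# rss_breakdown is a hypothetical helper name, not a standard tool.
rss_breakdown() {
    awk '
        /^RssAnon:/  { anon = $2 }
        /^RssFile:/  { file = $2 }
        /^RssShmem:/ { shmem = $2 }
        END { printf "anon_kB=%d file_kB=%d shmem_kB=%d\n", anon, file, shmem }
    ' "${1:-/proc/$$/status}"
}

# Page size the kernel reports for this machine, in bytes.
getconf PAGESIZE

# Breakdown for the current shell, if /proc is available.
if [ -r "/proc/$$/status" ]; then
    rss_breakdown "/proc/$$/status"
fi
```

A process whose anon_kB dominates is one the kernel can only relieve by swapping; one dominated by file_kB can shed pages by dropping cache.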

When free memory drops below a watermark, kswapd wakes and reclaims pages in the background; if allocation outruns it, the allocating process is forced into direct reclaim and stalls until pages are freed. This is the latency you feel before swap thrashing becomes visible. Without swap, the kernel has only file-backed pages to reclaim, so a workload dominated by anonymous memory leaves it nothing to free — and it goes straight to the OOM killer.
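Background versus direct reclaim can be observed in the cumulative counters in /proc/vmstat. The following is a rough sketch, assuming a kernel that exposes the aggregated pgscan_kswapd and pgscan_direct counters (older kernels split them per zone); reclaim_counters is a hypothetical name.

```shell
# Print the reclaim and swap counters from a vmstat-format file.
# Argument: file to read (defaults to /proc/vmstat).
# Counters are cumulative since boot; compare two samples to get a rate.
reclaim_counters() {
    awk '
        $1 == "pgscan_kswapd" || $1 == "pgscan_direct" ||
        $1 == "pswpin" || $1 == "pswpout" { print $1 "=" $2 }
    ' "${1:-/proc/vmstat}"
}

if [ -r /proc/vmstat ]; then
    reclaim_counters /proc/vmstat
fi
```

A pgscan_direct value that keeps growing between samples suggests allocating processes are stalling in direct reclaim, not just kswapd working in the background.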

Swappiness and Reclaim Tuning

The vm.swappiness sysctl, 0 to 200, sets how aggressively the kernel prefers reclaiming anonymous pages (swapping) versus dropping file-backed page cache. The Debian/Ubuntu default is 60. Lower it toward 10 on a database or application server where the working set is hot anonymous memory you do not want paged out; the kernel then leans on dropping page cache first. Setting it to 0 does not disable swap — it only makes the kernel avoid swapping until it is nearly out of other options.

# inspect and set at runtime
cat /proc/sys/vm/swappiness            # 60 by default
sudo sysctl vm.swappiness=10

# persist across reboots
echo 'vm.swappiness=10' | sudo tee /etc/sysctl.d/99-swappiness.conf
sudo sysctl --system

A companion knob, vm.vfs_cache_pressure (default 100), controls how readily the kernel reclaims the dentry and inode caches. Leave it at 100 unless a measured metadata-heavy workload justifies change. These knobs shift behavior under pressure; they do not add capacity, and tuning them is no substitute for sizing RAM correctly.

Sizing and Creating Swap

The old "twice RAM" rule is dead for servers with double-digit gigabytes of memory. Size swap for the failure mode you want: a few gigabytes to absorb transient spikes and give the kernel reclaim headroom, or — if you need suspend-to-disk (hibernation), which writes all of RAM to swap — at least as much swap as RAM. A pure-throughput server with monitored memory and a hard "never swap the working set" requirement may legitimately run with a small swap file as a safety margin rather than none at all.
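For the hibernation case, the floor can be derived from MemTotal in /proc/meminfo. This is a small sketch only: hibernate_swap_gib is a hypothetical helper, and because MemTotal is slightly below installed RAM, the result is rounded up to a whole GiB as an approximation.

```shell
# Minimum swap, in whole GiB rounded up, needed to hold all of RAM.
# Argument: a meminfo-format file (defaults to /proc/meminfo).
hibernate_swap_gib() {
    awk '
        /^MemTotal:/ { kib = $2 }
        END { print int((kib + 1048575) / 1048576) }   # 1 GiB = 1048576 KiB
    ' "${1:-/proc/meminfo}"
}

if [ -r /proc/meminfo ]; then
    echo "hibernation needs at least $(hibernate_swap_gib) GiB of swap"
fi
```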

# create a 4 GiB swap file on Debian/Ubuntu
sudo fallocate -l 4G /swapfile     # or dd if=/dev/zero for old filesystems
sudo chmod 600 /swapfile           # root-only; world-readable swap leaks memory
sudo mkswap /swapfile
sudo swapon /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab   # persist

# verify what is active and how full it is
swapon --show
free -h

A swap file and a swap partition perform the same once active; the file is easier to resize and is the Ubuntu installer's default. On btrfs the file must be no-copy-on-write (set chattr +C before allocating) and uncompressed, or activation fails — a common surprise when moving from ext4.
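On btrfs the order of operations matters, because the no-COW attribute only takes effect on an empty file. The sketch below wraps the usual sequence in a dry-run helper so it can be previewed safely; the run and make_btrfs_swapfile names, the /swapfile path, and the 4G size are illustrative assumptions, and a real run needs root.

```shell
# Print commands when DRY_RUN=1, otherwise execute them.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Create and activate a swap file on btrfs. Arguments: path, size.
make_btrfs_swapfile() {
    path=${1:-/swapfile}
    size=${2:-4G}
    # Start from an empty file so the no-COW flag applies to all its data.
    run truncate -s 0 "$path" &&
    run chattr +C "$path" &&
    # Preallocate the full size before formatting it as swap.
    run fallocate -l "$size" "$path" &&
    run chmod 600 "$path" &&
    run mkswap "$path" &&
    run swapon "$path"
}

# Preview only: print the command sequence without touching the system.
DRY_RUN=1
make_btrfs_swapfile /swapfile 4G
```

Unsetting DRY_RUN and running the function as root would execute the same commands, stopping at the first one that fails.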

zram and zswap

Compressed in-memory swap trades CPU for effective capacity. zram creates a block device in RAM that compresses pages (lz4 or zstd), and you swap to that device instead of disk; a 2:1 to 3:1 ratio is typical, so 2 GiB of zram holds roughly 4–6 GiB of cold pages with no disk I/O at all. Debian and Ubuntu ship systemd-zram-generator (configured in /etc/systemd/zram-generator.conf) to set this up, and it is the right default for memory-constrained or latency-sensitive nodes.
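A zram-generator configuration is a short INI file. The sketch below writes one to a scratch file for review; the size expression, algorithm, and priority are illustrative values rather than recommendations, and write_zram_conf is a hypothetical helper name.

```shell
# Write a zram-generator config to the given path.
write_zram_conf() {
    cat > "$1" <<'EOF'
[zram0]
# Device size in MiB; "ram" stands for total memory in MiB.
zram-size = min(ram / 2, 4096)
compression-algorithm = zstd
# Higher priority than disk swap, so zram is filled first.
swap-priority = 100
EOF
}

# Preview in a scratch file instead of writing to /etc directly.
preview=$(mktemp)
write_zram_conf "$preview"
cat "$preview"
```

Once the content looks right, it can be written as root to /etc/systemd/zram-generator.conf; after a reboot, swapon --show should list the zram device if the generator picked it up.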

zswap is different: it is a compressed cache that sits in front of a real disk swap device, compressing pages in RAM and only spilling the coldest ones to disk. Use zram when you have no disk swap and want to extend RAM; use zswap when you already have disk swap and want to soften its latency. Running both for the same purpose is redundant — pick the one that matches whether a real backing device exists.

The OOM Killer

When the kernel cannot reclaim or swap fast enough to satisfy an allocation, the out-of-memory killer selects a victim and kills it to recover memory. Selection is driven by a per-process oom_score, which scales with memory footprint and is biased by the tunable oom_score_adj (−1000 to 1000) in /proc/<pid>/oom_score_adj. The largest memory consumer is usually the target — which on a database host is often the database itself, exactly the process you least want killed.
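To see which processes currently rank highest, the per-process files under /proc can be sorted by score. This is a minimal sketch; oom_candidates is a hypothetical helper, and the proc root is a parameter only so the function can be pointed at a test directory.

```shell
# List the processes the OOM killer would rank highest.
# Arguments: proc root (default /proc), number of rows (default 5).
# Output columns: oom_score oom_score_adj pid command
oom_candidates() {
    root=${1:-/proc}
    rows=${2:-5}
    for dir in "$root"/[0-9]*; do
        [ -r "$dir/oom_score" ] || continue
        # Processes can exit mid-scan, so skip any that vanish.
        score=$(cat "$dir/oom_score" 2>/dev/null) || continue
        adj=$(cat "$dir/oom_score_adj" 2>/dev/null) || continue
        name=$(cat "$dir/comm" 2>/dev/null) || continue
        printf '%s %s %s %s\n' "$score" "$adj" "${dir##*/}" "$name"
    done | sort -rn | head -n "$rows"
}

if [ -d /proc/1 ]; then
    oom_candidates /proc 5
fi
```

If the process at the top of this list is the one you most need to keep, that is the signal to adjust its oom_score_adj.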

# find what the OOM killer has shot, with the kernel's reasoning
journalctl -k | grep -iE 'killed process|out of memory'
sudo dmesg -T | grep -i oom        # dmesg is often restricted to root

# protect a critical service via its systemd unit
# [Service]
# OOMScoreAdjust=-800

On systems running systemd with cgroup v2, systemd-oomd can act earlier than the kernel killer, using pressure-stall information (PSI) to terminate a whole cgroup before the machine seizes up. Ubuntu enables systemd-oomd on the desktop by default; on servers, decide deliberately whether you want PSI-based cgroup kills or only the kernel's last-resort behavior, because the two pick different victims.
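The PSI figures that systemd-oomd relies on are readable in /proc/pressure/memory, where the "some" line reports the share of time at least one task was stalled on memory. The sketch below extracts its 10-second average and compares it with a threshold; psi_memory_check is a hypothetical helper and the threshold of 10 percent is an arbitrary illustration, not a recommended value.

```shell
# Print the "some" avg10 value from a PSI memory file.
# Returns non-zero when the value is at or above the threshold.
# Arguments: threshold in percent (default 10), file (default /proc/pressure/memory).
psi_memory_check() {
    threshold=${1:-10}
    file=${2:-/proc/pressure/memory}
    awk -v limit="$threshold" '
        $1 == "some" {
            split($2, kv, "=")            # $2 looks like avg10=0.00
            print "some_avg10=" kv[2]
            if (kv[2] + 0 >= limit + 0) over = 1
        }
        END { exit (over ? 1 : 0) }
    ' "$file"
}

# PSI is only present when the kernel was built and booted with it enabled.
if [ -r /proc/pressure/memory ]; then
    psi_memory_check 10 || echo "memory pressure at or above threshold"
fi
```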

Common Mistakes

Running production with zero swap on a workload dominated by anonymous memory — the kernel has nothing to reclaim under pressure and goes straight to the OOM killer, turning a brief spike into a killed process with no warning.
Reading a low free figure in free -h as a problem when most of the remaining memory is page cache: Linux deliberately uses spare RAM for cache, and the number that matters is available, not free.
Creating a swap file with permissions other than 600 — a world-readable /swapfile exposes paged-out process memory, and swapon warns about it for that reason.
Setting vm.swappiness=0 expecting it to disable swap — it only delays swapping to the last moment; to truly disable swap you must swapoff and remove the fstab entry.
Adding a swap file on btrfs without the no-COW attribute — copy-on-write breaks swap's fixed-block assumption and swapon refuses to activate it; the file must also be uncompressed and preallocated.
Leaving an fstab swap entry pointing at a device that no longer exists after a disk swap or reinstall — boot stalls or drops to emergency mode waiting on the missing swap unit.
Treating swap thrashing as a tuning problem — once a server is paging the active working set in and out every few milliseconds, the only real fix is more RAM or less workload, not a swappiness change.
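The stale-fstab mistake above can be caught before a reboot by checking that every swap entry still resolves to something on disk. This is a sketch under simplifying assumptions: check_fstab_swap is a hypothetical helper, and it only understands plain paths, UUID=, and LABEL= specifiers.

```shell
# Report fstab swap entries whose backing file or device is missing.
# Argument: an fstab-format file (defaults to /etc/fstab).
check_fstab_swap() {
    awk '$1 !~ /^#/ && $3 == "swap" { print $1 }' "${1:-/etc/fstab}" |
    while read -r spec; do
        case $spec in
            UUID=*)  path=/dev/disk/by-uuid/${spec#UUID=} ;;
            LABEL=*) path=/dev/disk/by-label/${spec#LABEL=} ;;
            *)       path=$spec ;;
        esac
        if [ -e "$path" ]; then
            echo "ok      $spec"
        else
            echo "MISSING $spec"
        fi
    done
}

if [ -r /etc/fstab ]; then
    check_fstab_swap /etc/fstab
fi
```

Any MISSING line is an entry worth fixing or removing before the next boot.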

Best Practices

Provision a few gigabytes of swap even on RAM-rich servers — it gives the kernel reclaim headroom and lets the box degrade in latency rather than OOM-killing instantly.
Lower vm.swappiness to 10 on database and application servers via /etc/sysctl.d/, and confirm with sysctl vm.swappiness after reboot.
Read memory from free -h using the available column, and watch si/so in vmstat 1 — sustained nonzero swap-in/out is your early thrashing signal.
Set chmod 600 on every swap file and pin a critical service's OOMScoreAdjust=-800 in its systemd unit so the killer targets something else first.
Use zram via systemd-zram-generator on memory-constrained or latency-sensitive nodes — compressed RAM swap avoids disk I/O entirely and beats a slow disk swap device.
Size swap to at least RAM only if you actually need hibernation; otherwise size it for spike absorption, not the obsolete twice-RAM rule.
Grep journalctl -k for OOM events after any unexplained process death, and alert on PSI memory pressure (/proc/pressure/memory) before the killer ever runs.
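For monitoring scripts, the same figures that free -h shows can be read from /proc/meminfo. The sketch below reports available memory as a percentage of total and the amount of swap in use; mem_summary is a hypothetical helper name.

```shell
# Summarize available memory and swap usage from a meminfo-format file.
# Argument: file to read (defaults to /proc/meminfo).
mem_summary() {
    awk '
        /^MemTotal:/     { total = $2 }
        /^MemAvailable:/ { avail = $2 }
        /^SwapTotal:/    { stotal = $2 }
        /^SwapFree:/     { sfree = $2 }
        END {
            # Percentage is truncated to a whole number by %d.
            printf "available_pct=%d swap_used_kB=%d\n",
                   (total ? avail * 100 / total : 0), stotal - sfree
        }
    ' "${1:-/proc/meminfo}"
}

if [ -r /proc/meminfo ]; then
    mem_summary /proc/meminfo
fi
```

A falling available_pct together with a rising swap_used_kB is the combination to watch; either number alone says little.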

Comparable tools

Windows: the pagefile (pagefile.sys) plays swap's role; the memory manager compresses pages in RAM much as zram does
macOS: dynamic swap files under /var/vm with built-in memory compression, no fixed partition
FreeBSD: swap partitions or files plus a kernel OOM behavior tuned by separate VM sysctls

Knowledge Check

A database server runs with zero swap. Under a memory spike, how does it fail compared with the same server with a few gigabytes of swap?

With no swap the kernel has only file-backed pages to reclaim, so an anonymous-memory workload hits the OOM killer immediately; with swap it degrades in latency first, giving a window to react
With no swap it runs faster under the spike because it never pages to disk, so the swap version is always the slower of the two
Both servers fail identically under the spike, because swap only matters for hibernation, not for runtime memory pressure
With no swap the kernel transparently spills anonymous pages into the page cache instead of evicting clean file pages, so the database server treats free cache as overflow memory and never truly runs out under the spike

What does setting vm.swappiness=0 actually do?

It makes the kernel avoid swapping anonymous pages until it is nearly out of other reclaim options — it does not disable swap
It disables swap entirely and deactivates the swap device, exactly as if you had run swapoff -a
It tells the kernel to swap anonymous pages as aggressively as possible to keep RAM mostly empty
It caps the usable swap device at zero bytes, so any new memory allocation that would otherwise spill over into swap fails fast with an error

When should you reach for zram rather than zswap?

When there is no disk swap device and you want compressed RAM to act as the swap, avoiding disk I/O entirely
When you already have a fast disk swap device and only want to compress the coldest pages before they spill to it
When you need to hibernate the machine, since zram keeps its compressed pages persistent across reboots
When you want to disable the kernel OOM killer entirely, since zram takes over that role under pressure

On a host that just lost its main process to the OOM killer, which setting best protects a critical service from being chosen next?

A negative oom_score_adj (for example OOMScoreAdjust=-800 in its systemd unit), which lowers the service's OOM score so the killer targets a different process
A positive oom_score_adj on its systemd unit, which raises the process's priority so the kernel spares it
Raising vm.swappiness to its maximum of 200 so the kernel aggressively swaps that service's pages out to disk under pressure instead of ever choosing it as the OOM victim
Setting vm.vfs_cache_pressure=0 so the kernel reclaims that critical service's memory pages last of all
