Topic 75

sysctl and Kernel Tuning


sysctl reads and writes kernel parameters while the system is running. Every tunable is a file under /proc/sys: write 1 to /proc/sys/net/ipv4/ip_forward and the kernel starts forwarding packets immediately, with no reboot and no recompile. The sysctl command is a thin wrapper over that tree: it translates the dotted name net.ipv4.ip_forward into the path net/ipv4/ip_forward and reads or sets it. Hundreds of these parameters control networking, memory reclaim, OOM behavior, filesystem limits, and scheduler details.

The operational catch is that a write to /proc/sys lives only in RAM. It is gone on the next boot. A value that survives a reboot has to be declared in a configuration file under /etc/sysctl.d/ (or the older /etc/sysctl.conf) and re-applied at boot. Mixing up the runtime write and the persistent declaration is the single most common sysctl mistake — the tuning works in testing, then silently reverts the first time the box restarts.

The /proc/sys Tree

Kernel parameters are grouped into namespaces, and each namespace is a top-level directory under /proc/sys. The four that matter on a server are net (the entire networking stack — TCP, IP, the bridge, netfilter), vm (virtual memory: reclaim, overcommit, dirty-page writeback, swap behavior), kernel (process limits, panic behavior, the OOM and core-dump policy, shared-memory caps), and fs (open-file ceilings, inotify watches, aio limits). The dotted sysctl name maps directly onto the path: vm.swappiness is /proc/sys/vm/swappiness.
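The name-to-path mapping can be sketched as a small shell helper. This is an illustration only, not how the sysctl binary is implemented: the function name sysctl_path and the SYSCTL_ROOT override are hypothetical, and the sketch assumes the key contains no literal dots inside a component (some VLAN interface names do).

```shell
# Root of the tunables tree; overridable so the helper can be tried on a copy.
SYSCTL_ROOT=${SYSCTL_ROOT:-/proc/sys}

# Translate a dotted sysctl name into the file that backs it.
sysctl_path() {
    printf '%s/%s\n' "$SYSCTL_ROOT" "$(printf '%s' "$1" | tr . /)"
}

# Usage: cat "$(sysctl_path vm.swappiness)"
```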

You read the whole tree with sysctl -a, a single parameter with sysctl net.ipv4.ip_forward, or just cat the file. All three are equivalent because sysctl reads the same kernel state the file exposes. Not every parameter is writable at runtime: a handful are read-only or can only be set at boot via the kernel command line, and those will not change with sysctl -w no matter how you phrase it.

# read one parameter (dotted name or file path are equivalent)
sysctl vm.swappiness
cat /proc/sys/vm/swappiness

# list every tunable the running kernel exposes
sysctl -a | grep '^net.ipv4.tcp'

Runtime versus Persistent Changes

sysctl -w net.core.somaxconn=1024 changes the value immediately. The kernel applies the cap when a socket calls listen(), so sockets that start listening after the change pick it up, while a service that is already listening keeps its old backlog until it restarts or listens again. The command does not write anything to disk, so the value does not survive a reboot. Use it to test a value or to make an emergency change during an incident, never as the way you ship a setting. A bare echo 1024 > /proc/sys/net/core/somaxconn does exactly the same thing and is just as temporary.

For a setting that must stick, drop a file in /etc/sysctl.d/ with a numeric prefix — for example /etc/sysctl.d/60-network-tuning.conf — containing key = value lines. Apply every drop-in file at once with sysctl --system, which reads the full set of directories in precedence order; sysctl -p <file> applies a single named file. On Debian and Ubuntu the procps service replays these files at boot, which is what makes the value persistent. Red Hat behaves the same way through systemd-sysctl.service; the file format and the --system flag are identical across both.

# temporary — gone on reboot
sysctl -w net.core.somaxconn=1024

# persistent — /etc/sysctl.d/60-net.conf
net.core.somaxconn = 1024
net.ipv4.tcp_max_syn_backlog = 4096

# apply all drop-ins in precedence order
sysctl --system
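Promoting a value that was tested with sysctl -w into a drop-in could look roughly like the helper below. It is a minimal sketch: persist_sysctl is a hypothetical name, the target file is whatever drop-in you pass in, and it assumes one key = value assignment per line. It only edits the file; running sysctl --system afterwards is still up to you.

```shell
# persist_sysctl KEY VALUE FILE
# Ensure FILE contains exactly one "KEY = VALUE" line, replacing any older one.
persist_sysctl() {
    key=$1 value=$2 file=$3
    # Escape dots so the key is matched literally by grep -E.
    pattern="^[[:space:]]*$(printf '%s' "$key" | sed 's/\./\\./g')[[:space:]]*="
    tmp=$(mktemp) || return 1
    # Keep every existing line that does not assign this key.
    if [ -f "$file" ]; then
        grep -v -E "$pattern" "$file" > "$tmp" || true
    fi
    printf '%s = %s\n' "$key" "$value" >> "$tmp"
    # Copy rather than move so an existing file keeps its owner and mode.
    cat "$tmp" > "$file" && rm -f "$tmp"
}

# Usage (as root):
#   persist_sysctl net.core.somaxconn 1024 /etc/sysctl.d/60-net.conf
#   sysctl --system
```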

Network Tunables

The networking namespace is where tuning earns its keep on a busy server. net.core.somaxconn caps the accept-queue depth a listening socket can request; the kernel default is 4096 on current kernels but was 128 for years, and an application asking for a larger backlog is silently clamped to it — a classic cause of dropped connections under a burst. net.ipv4.tcp_max_syn_backlog sizes the half-open SYN queue. net.ipv4.ip_forward must be 1 for any box acting as a router, NAT gateway, or container/VPN host — it is off by default.

Two more deserve a warning. net.ipv4.conf.all.rp_filter controls reverse-path filtering, which drops packets arriving on an interface the kernel would not have routed them back through; it breaks asymmetric routing and multi-homed setups, so set it to 2 (loose mode) rather than 1 (strict) when a host has multiple uplinks. And do not copy net.ipv4.tcp_tw_recycle from an old blog post — it was removed in kernel 4.12 and broke clients behind NAT before that; the parameter no longer exists and sysctl -w will error on it.
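One way to keep a removed parameter such as net.ipv4.tcp_tw_recycle from breaking a deployment script is to check that the running kernel exposes the key before setting it. The sketch below assumes the dotted-name-to-path mapping described earlier; has_sysctl and SYSCTL_ROOT are hypothetical names introduced for illustration.

```shell
# Root of the tunables tree; overridable for testing against a copy.
SYSCTL_ROOT=${SYSCTL_ROOT:-/proc/sys}

# Succeeds only if the running kernel exposes the given dotted key.
has_sysctl() {
    [ -e "$SYSCTL_ROOT/$(printf '%s' "$1" | tr . /)" ]
}

# Usage (hypothetical deployment step):
#   if has_sysctl net.ipv4.conf.all.rp_filter; then
#       sysctl -w net.ipv4.conf.all.rp_filter=2
#   else
#       echo "rp_filter not exposed by this kernel, skipping" >&2
#   fi
```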

Memory and Filesystem Tunables

vm.swappiness (0–200 on kernel 5.8 and later, 0–100 before that; default 60) biases how aggressively the kernel reclaims anonymous pages to swap versus dropping file-backed page cache. On a database or latency-sensitive service, lowering it to 10 keeps the working set in RAM longer; setting it to 0 does not disable swap, it just makes the kernel avoid swapping until it is nearly out of memory. vm.overcommit_memory governs whether the kernel grants more virtual memory than it can back: mode 2 with a tuned vm.overcommit_ratio refuses allocations past a hard limit, which trades the OOM killer for honest malloc failures.
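The hard limit in mode 2 is reported as CommitLimit in /proc/meminfo and is roughly swap plus RAM scaled by the ratio. The arithmetic can be sketched as below, assuming no huge pages are reserved and vm.overcommit_kbytes is left at 0; commit_limit_kb is a hypothetical helper name.

```shell
# commit_limit_kb MEM_TOTAL_KB SWAP_TOTAL_KB RATIO
# Approximate CommitLimit (in kB) under vm.overcommit_memory = 2.
commit_limit_kb() {
    echo $(( $2 + $1 * $3 / 100 ))
}

# Example: 16 GiB RAM, 4 GiB swap, vm.overcommit_ratio = 80
#   commit_limit_kb 16777216 4194304 80    -> 17616076
# Compare with what the kernel reports:
#   grep -E 'CommitLimit|Committed_AS' /proc/meminfo
```

With those example numbers, allocations start failing once committed memory approaches about 16.8 GiB, even though the machine has 20 GiB of RAM plus swap.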

fs.file-max is the system-wide ceiling on open file descriptors, and fs.inotify.max_user_watches caps how many files a single user can watch. The long-standing default of 8192 (and even the 65536 that some systems ship) is far too low for IDEs, file-sync daemons, and build watchers, which fail with ENOSPC long before the disk is full; kernels from 5.11 onward scale the default with available memory, so check the live value before raising it. Raising fs.file-max is necessary but not sufficient: the per-process limit is governed separately by ulimit -n and the systemd unit's LimitNOFILE, and a process hits that wall first.

# /etc/sysctl.d/70-memory-fs.conf
vm.swappiness = 10
vm.overcommit_memory = 2
vm.overcommit_ratio = 80
fs.inotify.max_user_watches = 524288
fs.file-max = 2097152
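To see which ceiling a process will actually hit, its own limit can be read from /proc/<pid>/limits and set beside the system-wide value. A minimal sketch, assuming the usual column layout of that file; nofile_limits is a hypothetical helper name.

```shell
# nofile_limits LIMITS_FILE
# Print the soft and hard "Max open files" values from a /proc/<pid>/limits file.
nofile_limits() {
    awk '/^Max open files/ { print $4, $5 }' "$1"
}

# Usage: compare a process's ceiling with the system-wide one.
#   nofile_limits /proc/self/limits
#   cat /proc/sys/fs/file-max
```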

Applying and Auditing

Precedence is what bites people who split settings across files. sysctl --system reads from a fixed list of directories (/etc/sysctl.d/, /run/sysctl.d/, /usr/local/lib/sysctl.d/, /usr/lib/sysctl.d/, then /etc/sysctl.conf last), but it does not simply apply them in that directory order. All the drop-in files are merged into one list sorted lexically by filename, so the numeric prefix is what decides order; when the same key appears in two differently-named files, the lexically later name wins. Two files that share a filename are an exception: only the copy in the highest-priority directory is read (/etc shadows /run shadows /usr/lib). A package-shipped /usr/lib/sysctl.d/50-default.conf is overridden by your /etc/sysctl.d/60-tuning.conf because 60 sorts after 50. With sysctl --system, /etc/sysctl.conf is applied last of all and overrides any drop-in. systemd-sysctl.service does not read /etc/sysctl.conf itself, so at boot that file counts only where the distribution links it into /etc/sysctl.d/ (commonly as 99-sysctl.conf).
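The ordering rule for drop-ins can be simulated in a few lines of shell. This is a sketch of the behavior described above, not the actual implementation in sysctl or systemd-sysctl; effective_dropins is a hypothetical name, directories are passed highest priority first, and filenames are assumed to contain no spaces.

```shell
# effective_dropins DIR...
# List the drop-in files that would be applied, in application order.
effective_dropins() {
    for dir in "$@"; do
        for f in "$dir"/*.conf; do
            [ -f "$f" ] || continue
            # Emit "basename fullpath" so later stages can key on the name.
            printf '%s %s\n' "$(basename "$f")" "$f"
        done
    done |
        awk '!seen[$1]++' |          # same filename: keep the highest-priority copy
        LC_ALL=C sort -t' ' -k1,1 |  # merge all directories, order by filename
        cut -d' ' -f2-
}

# Usage:
#   effective_dropins /etc/sysctl.d /run/sysctl.d /usr/local/lib/sysctl.d /usr/lib/sysctl.d
```

The last file in the output that sets a given key is the one whose value sticks.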

Verify the running value with sysctl <key>, not by reading your config file — the file is your intent, the /proc/sys value is the truth, and they diverge whenever someone made a runtime -w change or a later drop-in won the precedence fight. After editing config, run sysctl --system and re-read the keys you changed. A few parameters take effect only at boot regardless (those consumed by the kernel command line or set before userspace starts), so when a change refuses to apply at runtime, that is the reason — schedule a reboot rather than fighting sysctl -w.
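Comparing intent with truth can be automated. The sketch below reads key = value lines from one config file and reports every key whose live value differs. It is illustrative only: check_drift, normalize, and SYSCTL_ROOT are hypothetical names, and it assumes plain assignments (no leading "-" prefix on keys, no glob patterns).

```shell
# Root of the tunables tree; overridable for testing against a copy.
SYSCTL_ROOT=${SYSCTL_ROOT:-/proc/sys}

# Collapse runs of whitespace and trim the ends, so tab-separated live
# values compare equal to space-separated values in the config file.
normalize() {
    printf '%s' "$1" | tr -s '[:space:]' ' ' | sed 's/^ //; s/ $//'
}

# check_drift CONF_FILE
# Print one line per key whose live value differs; return 1 if any do.
check_drift() {
    status=0
    while IFS='=' read -r key want || [ -n "$key" ]; do
        key=$(normalize "$key")
        # Skip blank lines and comments.
        case $key in ''|'#'*|';'*) continue ;; esac
        want=$(normalize "$want")
        path=$SYSCTL_ROOT/$(printf '%s' "$key" | tr . /)
        if [ ! -r "$path" ]; then
            # Key is absent on this kernel or not readable by this user.
            echo "MISSING $key"; status=1; continue
        fi
        have=$(normalize "$(cat "$path")")
        if [ "$have" != "$want" ]; then
            echo "DRIFT $key: want $want, have $have"; status=1
        fi
    done < "$1"
    return "$status"
}

# Usage: check_drift /etc/sysctl.d/70-memory-fs.conf
```

A non-zero exit status makes it easy to wire into a monitoring check or a post-deploy step.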

Runtime sysctl -w vs persistent sysctl.d

sysctl -w (runtime) — writes straight to /proc/sys in RAM and takes effect instantly. Nothing touches disk, so the value is lost on the next reboot. Reach for it to test a candidate value or to make an emergency change during an incident — never as the way you deploy a setting.

/etc/sysctl.d/ (persistent) — a key = value drop-in that the boot-time service replays on every start, so the setting survives reboots and is version-controllable. It does not apply itself the moment you save the file: run sysctl --system to load it now. This split is the source of the "my tuning vanished after reboot" bug — the change was only ever a runtime -w.

Common Mistakes

Tuning with sysctl -w or echo > /proc/sys during an incident and never writing it to /etc/sysctl.d/ — the fix works until the next reboot, then the problem returns with no obvious cause.
Cargo-culting net.ipv4.tcp_tw_recycle from an old performance guide — it was removed in kernel 4.12, errors out on modern systems, and even when it existed it silently broke clients sitting behind NAT.
Raising fs.file-max and assuming the descriptor limit is lifted, while the process still hits the per-process ulimit -n / systemd LimitNOFILE wall first and fails with "too many open files".
Editing files under /proc/sys directly to make a change "permanent" — it is a virtual filesystem in RAM, so the edit reverts on reboot just like sysctl -w.
Leaving fs.inotify.max_user_watches at its low default, then watching an IDE, file-sync daemon, or build watcher fail with ENOSPC on a disk that is nowhere near full.
Setting net.ipv4.conf.all.rp_filter = 1 (strict) on a multi-homed or asymmetrically-routed host, silently dropping legitimate return traffic; loose mode (2) is the correct choice there.
Splitting the same key across two drop-ins without thinking about precedence — the lexically later file (and /etc over /usr/lib) wins, so the value you see is not the one you expected.

Best Practices

Persist every real setting in a numbered drop-in under /etc/sysctl.d/ (for example 60-network.conf) and keep it in version control; reserve sysctl -w for testing and emergencies only.
Apply changes with sysctl --system so all drop-ins load in the correct precedence order, rather than relying on a single sysctl -p that ignores the rest of the tree.
Confirm the result by re-reading the live value with sysctl <key> after applying — the /proc/sys value is the truth; the config file is only your intent.
Raise fs.file-max and the matching per-process limit (LimitNOFILE in the systemd unit, or /etc/security/limits.conf) together, since the process hits the per-process ceiling first.
Benchmark before and after a network or memory change — measure connection drops, latency, or reclaim behavior rather than copying tunables blind from a tuning list.
Use the numeric filename prefix deliberately to control precedence, and verify with sysctl --system which file ultimately set a contested key.
Know which parameters apply only at boot (those taken from the kernel command line) and schedule a reboot for them instead of fighting a sysctl -w that will never take.

Comparable tools

Windows: registry tuning under HKLM\SYSTEM\CurrentControlSet plus netsh int tcp for the network stack
FreeBSD: sysctl for runtime knobs and /boot/loader.conf for boot-time-only parameters
macOS: the same sysctl command, with persistent values in /etc/sysctl.conf

Knowledge Check

You run sysctl -w vm.swappiness=10, confirm it took effect, and move on. After a reboot the value is back to 60. Why?

sysctl -w only writes to /proc/sys in RAM; without a drop-in in /etc/sysctl.d/ there is nothing for the boot-time service to replay
The kernel automatically resets vm.swappiness back to its compiled-in default whenever physical memory pressure is detected during the shutdown sequence
sysctl -w changes are quietly reverted by systemd at boot unless you also run systemctl daemon-reload immediately afterward
A value of 10 sits below the minimum the kernel will accept here, so it was silently clamped back up to the 60 default at boot

A process keeps failing with "too many open files" even after you set fs.file-max to two million and applied it. What is the most likely cause?

The per-process descriptor limit (ulimit -n / systemd LimitNOFILE) is separate from fs.file-max and the process hits it first
fs.file-max is actually a read-only parameter that cannot be changed at runtime at all, so the new two-million value never really applied
The change needs a full reboot first, because every fs.* parameter is only ever consumed from the kernel command line at boot
fs.file-max actually counts inotify watches rather than file descriptors, so raising it has no effect on open files at all

The same key is set in both /usr/lib/sysctl.d/50-default.conf and your /etc/sysctl.d/60-tuning.conf when you run sysctl --system. Which value wins, and why?

Your /etc/sysctl.d/60-tuning.conf value: the filenames differ, so both files are merged and sorted by name, and 60 sorts after 50, so your value is applied later and overrides
The /usr/lib default value wins, because vendor-shipped files are deliberately applied last in order to protect the original package defaults
sysctl --system refuses to apply either value and prints a hard conflict error until you manually remove the duplicate key
Whichever file has the lower numeric prefix wins, so 50-default.conf takes precedence over your 60-tuning.conf

Why is copying net.ipv4.tcp_tw_recycle=1 from an old tuning guide a bad idea on a current server?

The parameter was removed in kernel 4.12 (so sysctl -w errors), and before removal it broke connections from clients behind NAT
It still works on current kernels but forces every TCP connection to stay in TIME_WAIT permanently, steadily exhausting the host's ephemeral ports
It silently enables IP packet forwarding as an undocumented side effect, quietly turning the server into an open router on the network
It can only ever be set at boot through the kernel command line, so a value pasted into a runtime guide is always ignored
