sysctl and Kernel Tuning
sysctl reads and writes kernel parameters while the system is running. Every tunable is a file under /proc/sys — write 1 to /proc/sys/net/ipv4/ip_forward and the kernel starts forwarding packets immediately, no reboot, no recompile. The sysctl command is a thin wrapper over that tree: it translates the dotted name net.ipv4.ip_forward into the path net/ipv4/ip_forward and reads or sets it. Hundreds of these parameters control networking, memory reclaim, the OOM behavior, filesystem limits, and scheduler details.
The operational catch is that a write to /proc/sys lives only in RAM. It is gone on the next boot. A value that survives a reboot has to be declared in a configuration file under /etc/sysctl.d/ (or the older /etc/sysctl.conf) and re-applied at boot. Mixing up the runtime write and the persistent declaration is the single most common sysctl mistake — the tuning works in testing, then silently reverts the first time the box restarts.
The /proc/sys Tree
Kernel parameters are grouped into namespaces, and each namespace is a top-level directory under /proc/sys. The four that matter on a server are net (the entire networking stack — TCP, IP, the bridge, netfilter), vm (virtual memory: reclaim, overcommit, dirty-page writeback, swap behavior), kernel (process limits, panic behavior, the OOM and core-dump policy, shared-memory caps), and fs (open-file ceilings, inotify watches, aio limits). The dotted sysctl name maps directly onto the path: vm.swappiness is /proc/sys/vm/swappiness.
You read the whole tree with sysctl -a, a single parameter with sysctl net.ipv4.ip_forward, or just cat the file. Both views are equivalent because sysctl is reading the same kernel state the file exposes. Not every parameter is writable at runtime: a handful are read-only or can only be set at boot via the kernel command line — those will not change with sysctl -w no matter how you phrase it.
    # read one parameter (dotted name or file path are equivalent)
    sysctl vm.swappiness
    cat /proc/sys/vm/swappiness

    # list every tunable the running kernel exposes
    sysctl -a | grep '^net.ipv4.tcp'
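The name-to-path mapping described above can be sketched as a small shell function. This is an illustration only, not how the sysctl binary is implemented: sysctl_path is a hypothetical helper name, and it assumes no path component contains a literal dot (VLAN interface names such as eth0.100 would need extra handling).

```shell
#!/bin/sh
# sysctl_path: hypothetical helper that maps a dotted sysctl name to its /proc/sys file.
# Assumption: no component of the name contains a literal dot.
sysctl_path() {
    printf '/proc/sys/%s\n' "$(printf '%s' "$1" | tr '.' '/')"
}

sysctl_path vm.swappiness          # prints /proc/sys/vm/swappiness
sysctl_path net.ipv4.ip_forward    # prints /proc/sys/net/ipv4/ip_forward
```

Passing the result to cat gives the same value that sysctl prints for the dotted name.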
Runtime versus Persistent Changes
sysctl -w net.core.somaxconn=1024 changes the value now. The new cap applies to any socket that calls listen() after the change; a service that is already listening keeps its old backlog until it restarts or calls listen() again. Nothing is written to disk, so the value does not survive a reboot. Use it to test a value or to make an emergency change during an incident, never as the way you ship a setting. A bare echo 1024 > /proc/sys/net/core/somaxconn does exactly the same thing and is just as temporary.
For a setting that must stick, drop a file in /etc/sysctl.d/ with a numeric prefix — for example /etc/sysctl.d/60-network-tuning.conf — containing key = value lines. Apply every drop-in file at once with sysctl --system, which reads the full set of directories in precedence order; sysctl -p <file> applies a single named file. On Debian and Ubuntu the procps service replays these files at boot, which is what makes the value persistent. Red Hat behaves the same way through systemd-sysctl.service; the file format and the --system flag are identical across both.
    # temporary: gone on reboot
    sysctl -w net.core.somaxconn=1024

    # persistent: /etc/sysctl.d/60-net.conf
    net.core.somaxconn = 1024
    net.ipv4.tcp_max_syn_backlog = 4096

    # apply all drop-ins in precedence order
    sysctl --system
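If you have tested a value with sysctl -w and want to promote it, one option is to generate the drop-in lines from the same key=value pairs you used at the prompt. The sketch below assumes a hypothetical helper named render_dropin. It only prints text, so writing the file and running sysctl --system remain explicit steps you take as root.

```shell
#!/bin/sh
# render_dropin: hypothetical helper that turns key=value arguments into drop-in lines.
# It prints to stdout only; it never touches /etc or the running kernel.
render_dropin() {
    for pair in "$@"; do
        case $pair in
            ?*=?*) ;;                                   # looks like key=value
            *) echo "skipping malformed entry: $pair" >&2
               continue ;;
        esac
        key=${pair%%=*}       # text before the first '='
        value=${pair#*=}      # text after the first '='
        printf '%s = %s\n' "$key" "$value"
    done
}

render_dropin net.core.somaxconn=1024 net.ipv4.tcp_max_syn_backlog=4096
```

Redirecting that output to /etc/sysctl.d/60-net.conf and then running sysctl --system gives the persistent form of the same two settings.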
Network Tunables
The networking namespace is where tuning earns its keep on a busy server. net.core.somaxconn caps the accept-queue depth a listening socket can request; the kernel default is 4096 on current kernels but was 128 for years, and an application asking for a larger backlog is silently clamped to it — a classic cause of dropped connections under a burst. net.ipv4.tcp_max_syn_backlog sizes the half-open SYN queue. net.ipv4.ip_forward must be 1 for any box acting as a router, NAT gateway, or container/VPN host — it is off by default.
Two more deserve a warning. net.ipv4.conf.all.rp_filter controls reverse-path filtering, which drops packets arriving on an interface the kernel would not have routed them back through; it breaks asymmetric routing and multi-homed setups, so set it to 2 (loose mode) rather than 1 (strict) when a host has multiple uplinks. And do not copy net.ipv4.tcp_tw_recycle from an old blog post — it was removed in kernel 4.12 and broke clients behind NAT before that; the parameter no longer exists and sysctl -w will error on it.
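One way to guard against pasting a removed parameter is to check that its file exists under /proc/sys before applying anything. The sketch below assumes a hypothetical filter named known_keys. It reads key = value lines on standard input and passes through only those the running kernel exposes. The optional root argument exists so the function can be pointed at a test directory.

```shell
#!/bin/sh
# known_keys: hypothetical filter. Reads "key = value" lines on stdin and prints only
# the lines whose parameter file exists under ROOT (default /proc/sys).
known_keys() {
    root=${1:-/proc/sys}
    while IFS= read -r line; do
        case $line in ''|'#'*|';'*) continue ;; esac    # skip blanks and comments
        key=$(printf '%s' "${line%%=*}" | tr -d ' \t')  # name without surrounding blanks
        if [ -e "$root/$(printf '%s' "$key" | tr '.' '/')" ]; then
            printf '%s\n' "$line"
        else
            echo "not present on this kernel: $key" >&2
        fi
    done
}
```

Running an old tuning list through known_keys before saving it as a drop-in flags entries such as net.ipv4.tcp_tw_recycle on standard error instead of leaving them to fail at apply time.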
Memory and Filesystem Tunables
vm.swappiness (0–200, default 60) biases how aggressively the kernel reclaims anonymous pages to swap versus dropping file-backed page cache. On a database or latency-sensitive service, lowering it to 10 keeps the working set in RAM longer; setting it to 0 does not disable swap, it just makes the kernel avoid swapping until it is nearly out of memory. vm.overcommit_memory governs whether the kernel grants more virtual memory than it can back — mode 2 with a tuned vm.overcommit_ratio refuses allocations past a hard limit, which trades the OOM killer for honest malloc failures.
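In mode 2 the kernel reports the resulting ceiling as CommitLimit in /proc/meminfo, and the kernel documentation describes it as swap plus overcommit_ratio percent of RAM. The arithmetic can be sketched as follows. commit_limit_kb is a hypothetical helper, and it deliberately ignores memory reserved for huge pages and the alternative vm.overcommit_kbytes setting, so treat the result as an estimate.

```shell
#!/bin/sh
# commit_limit_kb: hypothetical helper estimating the mode-2 commit ceiling in KiB.
# Simplification: ignores huge-page reservations and vm.overcommit_kbytes.
commit_limit_kb() {
    ram_kb=$1
    swap_kb=$2
    ratio=$3
    echo $(( ram_kb * ratio / 100 + swap_kb ))
}

# 16 GiB of RAM, 2 GiB of swap, vm.overcommit_ratio = 80
commit_limit_kb 16777216 2097152 80    # prints 15518924
```

On that example host, allocations start failing once committed memory approaches roughly 14.8 GiB, which is less than the installed RAM plus swap. Compare the estimate with the CommitLimit line in /proc/meminfo before relying on it.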
fs.file-max is the system-wide ceiling on open file descriptors, and fs.inotify.max_user_watches caps how many files a single user can watch — the default of 8192 (or even 65536) is far too low for IDEs, file-sync daemons, and build watchers, which fail with ENOSPC long before the disk is full. Raising fs.file-max is necessary but not sufficient: the per-process limit is governed separately by ulimit -n and the systemd unit's LimitNOFILE, and a process hits that wall first.
    # /etc/sysctl.d/70-memory-fs.conf
    vm.swappiness = 10
    vm.overcommit_memory = 2
    vm.overcommit_ratio = 80
    fs.inotify.max_user_watches = 524288
    fs.file-max = 2097152
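To see the gap between the system-wide ceiling and the per-process limit, you can read both side by side. The sketch below assumes a hypothetical helper named nofile_soft that pulls the soft "Max open files" value out of a file in the /proc/<pid>/limits format; the reporting lines run only where those /proc files are readable.

```shell
#!/bin/sh
# nofile_soft: hypothetical helper that extracts the soft "Max open files" limit
# from a file laid out like /proc/<pid>/limits (fields: name, soft, hard, units).
nofile_soft() {
    awk '/^Max open files/ { print $4 }' "$1"
}

# Report this shell's own limit next to the system-wide ceiling (Linux only).
if [ -r /proc/self/limits ] && [ -r /proc/sys/fs/file-max ]; then
    echo "per-process soft limit:  $(nofile_soft /proc/self/limits)"
    echo "system-wide fs.file-max: $(cat /proc/sys/fs/file-max)"
fi
```

Pointing nofile_soft at /proc/<pid>/limits for a running service shows the limit that process actually received, which is the number it will hit first.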
Applying and Auditing
Precedence is what bites people who split settings across files. sysctl --system reads from a fixed list of directories — /etc/sysctl.d/, /run/sysctl.d/, /usr/local/lib/sysctl.d/, /usr/lib/sysctl.d/, then /etc/sysctl.conf last — but it does not simply apply them in that directory order. All the drop-in files are merged into one list sorted lexically by filename, so the numeric prefix is what decides order; when the same key appears in two differently-named files, the lexically later name wins. Two files that share a filename are an exception: only the copy in the highest-priority directory is read (/etc shadows /run shadows /usr/lib). A package-shipped /usr/lib/sysctl.d/50-default.conf is overridden by your /etc/sysctl.d/60-tuning.conf because 60 sorts after 50; /etc/sysctl.conf is applied last of all and overrides any drop-in.
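The merge-and-sort rule is easier to trust once you can see it. The sketch below is an approximation of that rule, not the actual sysctl implementation: dropin_order is a hypothetical helper that takes directories from highest to lowest priority and prints the .conf files in the order they would be applied. It assumes filenames contain no tabs or newlines, and it does not add /etc/sysctl.conf, which is applied afterwards.

```shell
#!/bin/sh
# dropin_order: hypothetical helper. Arguments are directories, highest priority first.
# Prints *.conf paths in application order: one copy per filename, sorted by filename.
dropin_order() {
    tab=$(printf '\t')
    for dir in "$@"; do
        for f in "$dir"/*.conf; do
            [ -e "$f" ] || continue                 # directory missing or empty
            printf '%s\t%s\n' "${f##*/}" "$f"       # basename <TAB> full path
        done
    done |
        awk -F'\t' '!seen[$1]++' |                  # same name: keep the highest-priority copy
        LC_ALL=C sort -t "$tab" -k1,1 |             # order by filename, not by directory
        cut -f2
}
```

A call such as dropin_order /etc/sysctl.d /run/sysctl.d /usr/local/lib/sysctl.d /usr/lib/sysctl.d lists the files top to bottom in load order, so for any contested key the file nearest the bottom that sets it is the one that wins.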
Verify the running value with sysctl <key>, not by reading your config file — the file is your intent, the /proc/sys value is the truth, and they diverge whenever someone made a runtime -w change or a later drop-in won the precedence fight. After editing config, run sysctl --system and re-read the keys you changed. A few parameters take effect only at boot regardless (those consumed by the kernel command line or set before userspace starts), so when a change refuses to apply at runtime, that is the reason — schedule a reboot rather than fighting sysctl -w.
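Comparing intent with truth can be scripted. The sketch below assumes a hypothetical helper named sysctl_drift that reads one drop-in file, looks up each key under /proc/sys, and reports any value that differs. Whitespace is normalized on both sides so multi-value parameters compare cleanly, and the optional second argument lets you point it at a test directory instead of the live tree.

```shell
#!/bin/sh
# sysctl_drift: hypothetical helper comparing a drop-in file with live values.
# Usage: sysctl_drift FILE [ROOT]      (ROOT defaults to /proc/sys)
# Prints one line per mismatching key and returns 1 if any key differs.
sysctl_drift() {
    conf=$1
    root=${2:-/proc/sys}
    status=0
    while IFS= read -r line; do
        case $line in ''|'#'*|';'*) continue ;; esac            # skip blanks and comments
        key=$(printf '%s\n' "${line%%=*}" | awk '{$1=$1; print}')   # trim the name
        want=$(printf '%s\n' "${line#*=}" | awk '{$1=$1; print}')   # trim and squeeze the value
        path=$root/$(printf '%s' "$key" | tr '.' '/')
        if [ ! -r "$path" ]; then
            echo "$key: not readable at $path"
            status=1
            continue
        fi
        have=$(awk '{$1=$1; print}' "$path")
        if [ "$have" != "$want" ]; then
            echo "$key: config wants '$want', kernel has '$have'"
            status=1
        fi
    done < "$conf"
    return "$status"
}
```

Running it against each of your drop-ins after sysctl --system gives a quick audit: silence and a zero exit status mean the live values match the file.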
sysctl -w (runtime) — writes straight to /proc/sys in RAM and takes effect instantly. Nothing touches disk, so the value is lost on the next reboot. Reach for it to test a candidate value or to make an emergency change during an incident — never as the way you deploy a setting.
/etc/sysctl.d/ (persistent) — a key = value drop-in that the boot-time service replays on every start, so the setting survives reboots and is version-controllable. It does not apply itself the moment you save the file: run sysctl --system to load it now. This split is the source of the "my tuning vanished after reboot" bug — the change was only ever a runtime -w.
- Tuning with sysctl -w or echo > /proc/sys during an incident and never writing it to /etc/sysctl.d/: the fix works until the next reboot, then the problem returns with no obvious cause.
- Cargo-culting net.ipv4.tcp_tw_recycle from an old performance guide: it was removed in kernel 4.12, errors out on modern systems, and even when it existed it silently broke clients sitting behind NAT.
- Raising fs.file-max and assuming the descriptor limit is lifted, while the process still hits the per-process ulimit -n / systemd LimitNOFILE wall first and fails with "too many open files".
- Editing files under /proc/sys directly to make a change "permanent": it is a virtual filesystem in RAM, so the edit reverts on reboot just like sysctl -w.
- Leaving fs.inotify.max_user_watches at its low default, then watching an IDE, file-sync daemon, or build watcher fail with ENOSPC on a disk that is nowhere near full.
- Setting net.ipv4.conf.all.rp_filter = 1 (strict) on a multi-homed or asymmetrically-routed host, silently dropping legitimate return traffic; loose mode (2) is the correct choice there.
- Splitting the same key across two drop-ins without thinking about precedence: the lexically later filename wins (and when two files share a name, only the /etc copy is read), so the value you see is not the one you expected.
- Persist every real setting in a numbered drop-in under /etc/sysctl.d/ (for example 60-network.conf) and keep it in version control; reserve sysctl -w for testing and emergencies only.
- Apply changes with sysctl --system so all drop-ins load in the correct precedence order, rather than relying on a single sysctl -p that ignores the rest of the tree.
- Confirm the result by re-reading the live value with sysctl <key> after applying: the /proc/sys value is the truth; the config file is only your intent.
- Raise fs.file-max and the matching per-process limit (LimitNOFILE in the systemd unit, or /etc/security/limits.conf) together, since the process hits the per-process ceiling first.
- Benchmark before and after a network or memory change: measure connection drops, latency, or reclaim behavior rather than copying tunables blind from a tuning list.
- Use the numeric filename prefix deliberately to control precedence, and verify with sysctl --system which file ultimately set a contested key.
- Know which parameters apply only at boot (those taken from the kernel command line) and schedule a reboot for them instead of fighting a sysctl -w that will never take.
- Windows: HKLM\SYSTEM\CurrentControlSet plus netsh int tcp for the network stack
- FreeBSD: sysctl for runtime knobs and /boot/loader.conf for boot-time-only parameters
- macOS: the same sysctl command, with persistent values in /etc/sysctl.conf
Knowledge Check
You run sysctl -w vm.swappiness=10, confirm it took effect, and move on. After a reboot the value is back to 60. Why?
- sysctl -w only writes to /proc/sys in RAM; without a drop-in in /etc/sysctl.d/ there is nothing for the boot-time service to replay
- The kernel automatically resets vm.swappiness back to its compiled-in default whenever physical memory pressure is detected during the shutdown sequence
- sysctl -w changes are quietly reverted by systemd at boot unless you also run systemctl daemon-reload immediately afterward
- A value of 10 sits below the minimum the kernel will accept here, so it was silently clamped back up to the 60 default at boot
A process keeps failing with "too many open files" even after you set fs.file-max to two million and applied it. What is the most likely cause?
- The per-process descriptor limit (ulimit -n / systemd LimitNOFILE) is separate from fs.file-max and the process hits it first
- fs.file-max is actually a read-only parameter that cannot be changed at runtime at all, so the new two-million value never really applied
- The change needs a full reboot first, because every fs.* parameter is only ever consumed from the kernel command line at boot
- fs.file-max actually counts inotify watches rather than file descriptors, so raising it has no effect on open files at all
The same key is set in both /usr/lib/sysctl.d/50-default.conf and your /etc/sysctl.d/60-tuning.conf when you run sysctl --system. Which value wins, and why?
- Your /etc/sysctl.d/60-tuning.conf value: drop-ins from every directory are merged and sorted by filename, and 60 sorts after 50, so it is read later and overrides
- The /usr/lib default value wins, because vendor-shipped files are deliberately applied last in order to protect the original package defaults
- sysctl --system refuses to apply either value and prints a hard conflict error until you manually remove the duplicate key
- Whichever file has the lower numeric prefix wins, so 50-default.conf takes precedence over your 60-tuning.conf
Why is copying net.ipv4.tcp_tw_recycle=1 from an old tuning guide a bad idea on a current server?
- The parameter was removed in kernel 4.12 (so sysctl -w errors), and before removal it broke connections from clients behind NAT
- It still works on current kernels but forces every TCP connection to stay in TIME_WAIT permanently, steadily exhausting the host's ephemeral ports
- It silently enables IP packet forwarding as an undocumented side effect, quietly turning the server into an open router on the network
- It can only ever be set at boot through the kernel command line, so a value pasted into a runtime guide is always ignored