Topic 46

The Network Stack


The Linux network stack is the in-kernel path a packet travels between an application socket and the wire. When your program calls send(), the kernel copies the data into a socket buffer (an sk_buff), runs it down through the transport layer (TCP or UDP), the network layer (IP routing and netfilter), and the link layer (the device driver and the NIC's transmit queue); inbound packets walk the same layers in reverse, from the driver's receive interrupt up to the socket the application is reading. Every interface, route, address, firewall rule, and tuning knob you touch is a configuration of one of these layers.

For a server operator the practical consequence is that "the network is broken" is never one question; it is a question about which layer. A missing default route, an nftables rule that drops the packet, a full receive queue, and a NIC that negotiated half-duplex are four different failures spread across the layers, and the tool that diagnoses one is useless on the others. Knowing where a packet is when it dies is most of the job.

The Layered Packet Path

Linux implements the TCP/IP model, not the seven-layer OSI model of the textbooks. Four layers carry real weight: the link layer (Ethernet frames, the driver, the NIC), the network layer (IPv4/IPv6 addressing, routing, and the netfilter hooks), the transport layer (TCP's ordered reliable streams and UDP's fire-and-forget datagrams), and the application layer (your process, holding a socket file descriptor). The kernel owns the lower three; the application owns only the socket at the top.

The socket is the boundary. Above it your code reads and writes bytes with ordinary read() and write() calls and never sees a packet. Below it the kernel does segmentation, retransmission, checksumming, routing, and queueing. This split is why you can run strace on a hung service and see it blocked in recvfrom() — the application has done its part and is waiting on a layer it does not control.

# Inspect each layer with iproute2 — the modern, scriptable interface
ip link show          # link layer: interfaces, MAC, MTU, up/down state
ip addr show          # network layer: assigned IPv4/IPv6 addresses
ip route show         # network layer: routing table, default gateway
ss -tulpn             # transport layer: listening TCP/UDP sockets + owning PID
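
Layer triage often starts with one question: is there a default route at all? The following is a minimal sketch, assuming the usual one-route-per-line output of ip route show. The function name has_default_route is hypothetical, and it reads the table from standard input so it can also be run against saved output.

```shell
# has_default_route: hypothetical helper. Exits 0 if the routing table on
# stdin contains a default route, 1 otherwise.
has_default_route() {
    awk '$1 == "default" { found = 1 } END { exit found ? 0 : 1 }'
}

# Usage on a live host (requires iproute2):
#   ip route show | has_default_route || echo "no default route: look at the network layer"
```

Reading from stdin rather than calling ip inside the function keeps the check usable on output captured from another machine.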

Sockets and the System-Call Boundary

A socket is a kernel object exposed to the process as a file descriptor, created with socket(AF_INET, SOCK_STREAM, 0) for IPv4 TCP. The address family (AF_INET, AF_INET6, AF_UNIX) chooses the protocol world, and the type (SOCK_STREAM for TCP, SOCK_DGRAM for UDP) chooses the delivery semantics. Because it is a file descriptor, a socket lives in the same table as open files and pipes — which is why a busy server can exhaust its per-process file-descriptor limit and fail to accept new connections with EMFILE while CPU and memory sit idle.
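
To see how close a process is to its descriptor limit, compare the entries under /proc/<pid>/fd with the soft limit in /proc/<pid>/limits. This is a rough sketch; the helper names soft_nofile and fd_count are hypothetical, and the parsing assumes the standard layout of the limits file.

```shell
# soft_nofile: hypothetical helper. Prints the soft "Max open files" limit
# from text in the format of /proc/<pid>/limits, read on stdin.
soft_nofile() {
    awk '/^Max open files/ { print $4 }'
}

# fd_count: hypothetical helper. Counts the entries in a directory such as
# /proc/<pid>/fd, where each entry is one open descriptor.
fd_count() {
    ls -1 "$1" | wc -l | tr -d ' '
}

# Usage on a live host, with the process ID in $pid
# (reading another user's process may need root):
#   echo "$(fd_count "/proc/$pid/fd") of $(soft_nofile < "/proc/$pid/limits") descriptors in use"
```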

Each socket carries kernel-side send and receive buffers. The receive buffer fills when the application reads slower than the peer sends; once full, TCP advertises a zero window and the sender stalls — backpressure with no packet loss. This is the mechanism behind a slow consumer throttling a fast producer, and it is invisible unless you look at ss -m, which prints per-socket memory and the queue depths the kernel sees.

# Listening sockets, the accept-queue backlog, and per-socket memory
ss -ltn                # Recv-Q = current backlog, Send-Q = max backlog on a LISTEN socket
ss -tm state established     # skmem: r/t = receive/send buffer bytes in use, rb/tb = their size limits
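
The Recv-Q and Send-Q columns on LISTEN sockets can be compared mechanically. The sketch below assumes the column order that ss -ltn prints (State, Recv-Q, Send-Q, Local Address:Port) and flags sockets whose backlog has reached the limit. The name full_accept_queues is hypothetical, and "at or over the limit" is a deliberately simple threshold, not the kernel's exact overflow test.

```shell
# full_accept_queues: hypothetical helper. Reads `ss -ltn` output on stdin
# and prints "address current/max" for every LISTEN socket whose accept
# queue (Recv-Q) has reached its configured backlog (Send-Q).
full_accept_queues() {
    awk '$1 == "LISTEN" && $3 + 0 > 0 && $2 + 0 >= $3 + 0 { print $4, $2 "/" $3 }'
}

# Usage on a live host:
#   ss -ltn | full_accept_queues
```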

Netfilter and the Packet Hooks

Filtering and NAT happen at the network layer through netfilter: five hook points (prerouting, input, forward, output, postrouting), placed around the routing decision, where the kernel runs rules against every packet. On Debian and Ubuntu the modern frontend is nftables with the single nft command; the older iptables still works but on current releases is usually iptables-nft, a translation shim writing into the same kernel subsystem. Connection tracking (conntrack) sits here too, recording the state of every flow so that a stateful rule can accept return traffic for a connection it already allowed.

The order matters operationally: a packet destined for a local socket passes prerouting then input; a packet being routed through the box passes prerouting, forward, then postrouting. A host that is supposed to route between interfaces but silently drops transit traffic almost always has either net.ipv4.ip_forward left at 0 or a forward-chain policy of drop with no matching accept rule — both at the network layer, neither visible from the application.

# Enable IPv4 forwarding persistently (Debian/Ubuntu)
echo 'net.ipv4.ip_forward=1' | sudo tee /etc/sysctl.d/99-forward.conf
sudo sysctl --system

# Inspect the ruleset and the live conntrack flows
sudo nft list ruleset
sudo conntrack -L      # from the conntrack-tools package
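
Before reading the ruleset on a router that drops transit traffic, it is worth confirming the live forwarding switch. A minimal sketch, assuming the standard procfs path /proc/sys/net/ipv4/ip_forward; the function name forwarding_state is hypothetical, and its optional file argument exists only so the check can be exercised against a test file.

```shell
# forwarding_state: hypothetical helper. Reports whether IPv4 forwarding is
# enabled by reading the kernel's live value (1 = on, 0 = off).
forwarding_state() {
    file=${1:-/proc/sys/net/ipv4/ip_forward}
    if [ "$(cat "$file")" = "1" ]; then
        echo "forwarding on"
    else
        echo "forwarding off"
    fi
}

# Usage on a live host:
#   forwarding_state
```

If this prints "forwarding on" and transit traffic still disappears, the forward-chain policy is the next place to look.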

Queues, Buffers, and Tuning Knobs

Between the IP layer and the driver sits the queueing discipline (qdisc), the kernel's packet scheduler. Modern Debian and Ubuntu default to fq_codel, which fights bufferbloat by dropping or marking packets early instead of letting a deep buffer add latency. On the receive side the NIC raises an interrupt, and the kernel's NAPI mechanism switches to polling under load so a flood of packets does not drown the CPU in one interrupt per frame; multi-queue NICs spread this across cores with RSS.

The knobs that actually move throughput live in sysctl under net.core and net.ipv4: net.core.somaxconn caps the accept queue (kernels before 5.4 defaulted to 128, newer ones to 4096; check it on any high-connection server), and the TCP autotuning ranges net.ipv4.tcp_rmem and net.ipv4.tcp_wmem govern how large a single connection's buffers, and with them its window, can grow. Set these in /etc/sysctl.d/ so they survive a reboot; a value set with sysctl -w changes only the running kernel and is gone the next time the machine boots.
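
The sketch below shows one way to read the autotuning triple and to stage a persistent drop-in file. Both function names are hypothetical, the file name 99-net-tuning.conf is an arbitrary choice, and the values in the usage comment are illustrative rather than recommendations.

```shell
# tcp_buf_limits: hypothetical helper. Labels the min/default/max triple
# found in /proc/sys/net/ipv4/tcp_rmem or tcp_wmem (read on stdin).
tcp_buf_limits() {
    awk '{ printf "min=%s default=%s max=%s\n", $1, $2, $3 }'
}

# write_sysctl_dropin: hypothetical helper. Writes one "key=value" line per
# remaining argument to the target file given as the first argument.
write_sysctl_dropin() {
    target=$1
    shift
    printf '%s\n' "$@" > "$target"
}

# Usage on a live host (illustrative values):
#   tcp_buf_limits < /proc/sys/net/ipv4/tcp_rmem
#   write_sysctl_dropin /tmp/99-net-tuning.conf 'net.core.somaxconn=4096'
#   sudo install -m 644 /tmp/99-net-tuning.conf /etc/sysctl.d/ && sudo sysctl --system
```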

Common Mistakes

Reaching for the deprecated net-tools commands — ifconfig, route, netstat — which are not installed by default on current Ubuntu and which hide secondary IPs and full routing policy that only ip addr and ip route show correctly.
Setting ip_forward=1 or any tuning value with sysctl -w only — it works until the next reboot, then routing silently dies because the value was never written to /etc/sysctl.d/.
Blaming "the firewall" for dropped transit traffic when the real cause is the forward chain policy or a missing conntrack accept rule — filtering and forwarding are distinct decisions at the same layer.
Leaving net.core.somaxconn at a low value, such as the old default of 128, on a high-connection server, so the accept queue overflows and the kernel drops SYNs while the application looks healthy.
Diagnosing a stalled transfer by staring at CPU graphs when the receive buffer is full and TCP has advertised a zero window — backpressure shows up in ss -tm, not in top.
Exhausting the per-process file-descriptor limit on a busy server, so accept() fails with EMFILE and new connections are refused even though there is plenty of free memory.
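
The backpressure case above can be spotted by comparing the r and rb fields that ss -tm prints inside skmem:(...). The following is a rough sketch, assuming the skmem group begins with r<bytes>,rb<bytes> as described earlier; the name rcvbuf_usage is hypothetical.

```shell
# rcvbuf_usage: hypothetical helper. For each skmem:(...) group on stdin,
# prints receive-buffer bytes in use (r), the limit (rb), and the
# percentage of the limit in use.
rcvbuf_usage() {
    awk 'match($0, /skmem:\(r[0-9]+,rb[0-9]+/) {
        # Strip the 8-character "skmem:(r" prefix, leaving e.g. "4096,rb131072".
        s = substr($0, RSTART + 8, RLENGTH - 8)
        split(s, p, ",rb")
        if (p[2] + 0 > 0)
            printf "r=%d rb=%d used=%d%%\n", p[1], p[2], p[1] * 100 / p[2]
    }'
}

# Usage on a live host:
#   ss -tm state established | rcvbuf_usage
```

A socket that sits near 100% while the transfer is stalled points at a slow reader rather than at packet loss.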

Best Practices

Standardize on the iproute2 suite — ip and ss — for every interface, address, route, and socket query; it is the maintained interface and the only one that reflects the full kernel state.
Isolate the failing layer before changing anything: ip link for link, ip route get for routing, nft list ruleset for filtering, ss -tulpn for the listening socket.
Persist every tuning change in /etc/sysctl.d/ and apply it with sysctl --system, never with a bare sysctl -w that vanishes on reboot.
Write firewall rules with nftables on Debian and Ubuntu rather than raw iptables; on Red Hat and Fedora firewalld over nftables is the equivalent default.
Use ip route get <dest> to ask the kernel exactly which route and source address a packet will use — it answers the routing question without guessing from the table.
Raise net.core.somaxconn and the application's own listen backlog together on any server expecting connection bursts; one without the other still caps the accept queue.
Read socket buffers and queue depths with ss -tm when a transfer stalls, so you distinguish a network-layer drop from transport-layer backpressure.
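
The answer from ip route get can be reduced to the three facts that matter when isolating a routing problem: outgoing device, next hop, and source address. A minimal sketch, assuming the keyword-value layout that ip route get prints (via, dev, src); the function name route_decision is hypothetical, and the address in the usage comment is a documentation placeholder.

```shell
# route_decision: hypothetical helper. Reads `ip route get <dest>` output on
# stdin and prints the device, next hop, and source address the kernel
# chose. An on-link destination has no "via" and is reported as "direct".
route_decision() {
    awk '{
        for (i = 1; i < NF; i++) {
            if ($i == "dev") dev = $(i + 1)
            if ($i == "src") src = $(i + 1)
            if ($i == "via") via = $(i + 1)
        }
    }
    END { printf "dev=%s via=%s src=%s\n", dev, (via == "" ? "direct" : via), src }'
}

# Usage on a live host:
#   ip route get 203.0.113.7 | route_decision
```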

Comparable tools

Windows: the TCP/IP stack configured through netsh and the WFP firewall; iproute2 has no single equivalent, the functions are spread across cmdlets
BSD: FreeBSD/OpenBSD ship their own stack with pf for filtering and ifconfig/route still primary, a separate lineage from netfilter
macOS: a BSD-derived stack using pf and ifconfig, closer to FreeBSD than to Linux

Knowledge Check

A transfer to a slow client stalls, but CPU and memory on the server are idle. What is the most likely cause?

The socket receive buffer filled, TCP advertised a zero window, and the sender is throttled by backpressure — visible in ss -tm
The default route was flushed mid-transfer, so the server's reply packets to this particular client suddenly have no next hop and are silently discarded
A forward-chain rule is dropping the connection's packets
The NIC negotiated half-duplex and is colliding on every frame

A host should route between two interfaces but silently drops all transit traffic. Where do you look first?

net.ipv4.ip_forward and the netfilter forward-chain policy — forwarding is off or the chain default is drop
The application's socket backlog, which is too small to accept the connections
The interface qdisc, since the default fq_codel scheduler is built to discard any packet merely passing through the host
The per-process file-descriptor limit, which blocks new flows

Why does a tuning value set with sysctl -w net.core.somaxconn=4096 stop working after a reboot?

-w changes only the live kernel value; without an entry in /etc/sysctl.d/ the setting is not reapplied at boot
The kernel resets somaxconn to a hardware-derived value on every boot
somaxconn can only be raised by recompiling the kernel
The value persists across the reboot, but the fq_codel qdisc silently overrides the accept-queue limit on the first incoming connection

Why is ip addr show preferred over ifconfig on a current Ubuntu server?

iproute2 is the maintained tool and shows full state — secondary addresses and routing policy — that the deprecated ifconfig omits
ifconfig can only display IPv4 and is incapable of showing any IPv6 address
ifconfig requires root for read-only queries while ip does not
ip addr writes every change directly into the /etc/network/interfaces file automatically, so its edits are persisted across reboots without any extra step
