Chapter 12
Performance and Troubleshooting
A method and a toolkit for finding which layer is actually broken, plus the three classic performance bugs that hide as "the network is slow."
Everything before this chapter built the model; this chapter is about using it under pressure, when a pager is going off and someone is saying "the network is down." Most of the time the network is fine and one specific layer is not — a link, an address, a route, a name, a port, or the app itself — and the entire skill is isolating that layer before you touch anything. The first three topics are a method and the tools that drive it: a layer-by-layer checklist, then ping/traceroute/mtr for reachability and paths, then dig/ss/lsof for names and sockets.
The last three topics are the performance bugs that masquerade as a slow network and waste the most engineer-hours: latency versus jitter versus loss (each breaks a different class of application), the small-works/large-hangs signature of a broken Path MTU Discovery, and the bandwidth-delay product that explains why a single TCP flow crawls across a transcontinental link no matter how fat the pipe. Each one has a number you can compute and a fix you can apply — guessing is not on the menu.
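As a taste of "a number you can compute," the bandwidth-delay product is plain arithmetic. The sketch below is illustrative only: the function names, the 1 Gbit/s link, and the 80 ms round-trip time are assumptions chosen for the example, not figures from this chapter. The 65,535-byte window is the largest a TCP receiver can advertise when window scaling is not in use.

```python
def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bytes that must be in flight to keep the pipe full."""
    return bandwidth_bps * rtt_s / 8


def max_throughput_bps(window_bytes: int, rtt_s: float) -> float:
    """Upper bound for one TCP flow: at most one window per round trip."""
    return window_bytes * 8 / rtt_s


if __name__ == "__main__":
    link_bps = 1e9   # assumed: 1 Gbit/s link
    rtt_s = 0.080    # assumed: 80 ms round-trip time
    window = 65535   # largest TCP window without window scaling

    print(f"BDP: {bdp_bytes(link_bps, rtt_s) / 1e6:.1f} MB in flight to fill the link")
    print(f"Single-flow ceiling: {max_throughput_bps(window, rtt_s) / 1e6:.2f} Mbit/s")
```

Under these assumed numbers the link needs roughly 10 MB in flight, while an unscaled window caps a single flow at about 6.5 Mbit/s regardless of the link's capacity.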
Topics in This Chapter
ping answers "reachable and how far," traceroute maps the hops, and mtr merges them into live per-hop stats. How ICMP deprioritization and asymmetric paths make all three lie.
dig interrogates DNS precisely, ss shows socket state and listeners, and lsof maps a socket to its process. Telling "port not listening" from "listening but firewalled" from "DNS wrong."
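The three-way distinction in the last topic can be approximated from the client side with nothing but the standard library. This is a rough sketch under stated assumptions, not a replacement for dig, ss, or lsof: the function name `classify_tcp_probe` and its return labels are hypothetical, and the interpretation of each outcome is a heuristic. A refused connection usually means nothing is listening, but a firewall configured to reject can look the same; a timeout usually means packets are being silently dropped, but a dead host can look the same. It also only catches names that fail to resolve, not names that resolve to the wrong address.

```python
import socket


def classify_tcp_probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Return a coarse label for one TCP connection attempt."""
    try:
        # Name resolution is tried first so DNS failures are reported separately.
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns-failure"      # the name did not resolve at all

    # Only the first resolved address is probed in this sketch.
    family, socktype, proto, _canonname, sockaddr = infos[0]
    with socket.socket(family, socktype, proto) as s:
        s.settimeout(timeout)
        try:
            s.connect(sockaddr)
        except ConnectionRefusedError:
            return "refused"      # likely: host reachable, port not listening
        except socket.timeout:
            return "timeout"      # likely: silently dropped (firewall) or host down
        except OSError:
            return "unreachable"  # e.g. no route to host or network unreachable
    return "open"                 # handshake completed: something is listening


if __name__ == "__main__":
    print(classify_tcp_probe("localhost", 22))
```

Running the same probe from the server itself and from a remote client is what makes the result useful: "open" locally but "timeout" remotely points at filtering in between rather than at the application.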