Chapter 12

Performance and Troubleshooting

A method and a toolkit for finding which layer is actually broken, plus the three classic performance bugs that masquerade as "the network is slow."

6 topics

Everything before this chapter built the model; this chapter is about using it under pressure, when a pager is going off and someone is saying "the network is down." Most of the time the network is fine and one specific layer is not — a link, an address, a route, a name, a port, or the app itself — and the entire skill is isolating that layer before you touch anything. The first three topics are a method and the tools that drive it: a layer-by-layer checklist, then ping/traceroute/mtr for reachability and paths, then dig/ss/lsof for names and sockets.

The last three topics are the performance bugs that masquerade as a slow network and waste the most engineer-hours: latency versus jitter versus loss (each breaks a different class of application), the small-works/large-hangs signature of a broken Path MTU Discovery, and the bandwidth-delay product that explains why a single TCP flow crawls across a transcontinental link no matter how fat the pipe. Each one has a number you can compute and a fix you can apply — guessing is not on the menu.
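As a rough illustration of the "number you can compute" for the bandwidth-delay product, the sketch below works the arithmetic in Python. The 1 Gbit/s link, the 80 ms round-trip time, and the 65,535-byte window (the largest TCP window without window scaling) are assumed figures chosen for illustration, not measurements from this chapter, and the function names are hypothetical.

```python
def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bytes that must be in flight to keep the link full for one round trip."""
    return bandwidth_bps * rtt_s / 8


def max_throughput_bps(window_bytes: int, rtt_s: float) -> float:
    """Upper bound on one flow's throughput: at most one window per round trip."""
    return window_bytes * 8 / rtt_s


# Assumed example values (illustrative only).
LINK_BPS = 1_000_000_000   # 1 Gbit/s link
RTT_S = 0.080              # 80 ms round-trip time
WINDOW_BYTES = 65_535      # largest TCP window without window scaling

if __name__ == "__main__":
    bdp = bdp_bytes(LINK_BPS, RTT_S)
    cap = max_throughput_bps(WINDOW_BYTES, RTT_S)
    print(f"BDP: {bdp / 1e6:.1f} MB in flight needed to fill the link")
    print(f"Window-limited throughput: {cap / 1e6:.2f} Mbit/s")
    print(f"Share of link used: {cap / LINK_BPS:.2%}")
```

Under these assumptions the link needs about 10 MB in flight to stay full, while the unscaled window caps the flow at roughly 6.5 Mbit/s, which is under 1% of the link's capacity. Adding bandwidth changes neither number in the flow's favor; only a larger window or a shorter round trip does.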

The layer-by-layer method: stop at the first “no”
1. Link up?
2. Has IP?
3. Route?
4. DNS?
5. Port open?
6. App responds?
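The stop-at-the-first-"no" idea can be sketched as a small runner that walks an ordered list of checks and reports the first one that fails, without running anything after it. This is a minimal sketch, not the chapter's own tooling: `first_failure`, `resolves`, and `port_open` are hypothetical names, only the DNS and port probes are implemented (with Python's standard `socket` module), and the remaining layers are stubbed in the demo because they depend on platform-specific tools.

```python
import socket
from typing import Callable, Optional, Sequence, Tuple

Check = Tuple[str, Callable[[], bool]]


def first_failure(checks: Sequence[Check]) -> Optional[str]:
    """Run checks in order; return the name of the first that fails, else None."""
    for name, probe in checks:
        try:
            ok = probe()
        except OSError:
            # A probe that raises a socket/OS error counts as a "no".
            ok = False
        if not ok:
            return name  # stop here: later layers are not probed
    return None


def resolves(host: str) -> bool:
    """DNS layer: does the name resolve to at least one address?"""
    try:
        return bool(socket.getaddrinfo(host, None))
    except socket.gaierror:
        return False


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Port layer: does a TCP connection to host:port succeed?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Simulated results so the demo needs no network access.
    simulated = [
        ("Link up?", lambda: True),
        ("Has IP?", lambda: True),
        ("Route?", lambda: True),
        ("DNS?", lambda: False),       # pretend name resolution is broken
        ("Port open?", lambda: True),  # never reached
        ("App responds?", lambda: True),
    ]
    print("First failing layer:", first_failure(simulated))
```

In real use, the DNS and port entries would wrap `resolves` and `port_open` for the host under investigation, and the lower layers would call whatever the platform provides. The value of the ordering is that a failure at one layer makes every result above it uninformative, so the runner deliberately does not collect them.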

Topics in This Chapter