Chapter Thirteen

Performance and Troubleshooting

A method for finding the bottleneck instead of guessing at it, and the tools that measure CPU, memory, disk, and network so the slow part names itself.

5 topics

Most production slowdowns are diagnosed by intuition, and intuition is wrong often enough to waste hours restarting the wrong service. The fix is a method: measure the load, find the saturated resource, then drill into the one subsystem that is actually constrained instead of tuning four that are not.

This chapter builds that method, then arms it. It starts with the USE-style approach and the baseline tools that Debian and Ubuntu either ship by default or provide in the sysstat package, then takes each resource in turn (CPU and memory, disk and I/O, the network) and ends with the tracing tools that show you exactly which syscall a stuck process is blocked on. By the end you can walk up to a misbehaving server and identify the bottleneck in minutes, by measurement rather than luck.

Troubleshoot by method, not by guessing

Hypothesis (what + where) → Measure (against a baseline) → Localize (resource, then process) → Confirm (fix, then re-measure)

Topics in This Chapter

Performance Methodology and Tools

The USE method (utilization, saturation, errors) applied resource by resource, and the baseline toolkit (top and vmstat from procps, iostat and sar from sysstat) that turns a vague "it's slow" into a measured bottleneck.

Methodology, Tools

CPU and Memory Analysis

Load average versus run-queue length, user versus system versus iowait time, and reading /proc/meminfo so the page cache stops looking like a leak. Where swap, OOM kills, and steal time actually come from.
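As a small illustration of why the page cache can look like a leak, the sketch below parses /proc/meminfo-style text and compares two readings of "used" memory: one based on MemFree, which counts cache as used, and one based on MemAvailable, the kernel's own estimate of memory that can be handed to applications without swapping. The function names (`parse_meminfo`, `memory_summary`, `read_meminfo`) are hypothetical, chosen for this example, and the sketch assumes a kernel recent enough to report MemAvailable (3.14 or later).

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integers (kB for most fields)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def memory_summary(info):
    """Contrast the naive 'used' figure with one that treats cache as reclaimable."""
    total = info["MemTotal"]
    free = info["MemFree"]
    # Fall back to MemFree if the kernel does not report MemAvailable.
    available = info.get("MemAvailable", free)
    cache = info.get("Cached", 0) + info.get("Buffers", 0)
    return {
        "cache_kb": cache,
        # Counts page cache as used, so it climbs as files are read.
        "naive_used_pct": round(100 * (total - free) / total, 1),
        # Counts only memory the kernel could not easily reclaim.
        "real_used_pct": round(100 * (total - available) / total, 1),
    }


def read_meminfo(path="/proc/meminfo"):
    """Read and summarize the live file on a Linux host."""
    with open(path) as f:
        return memory_summary(parse_meminfo(f.read()))
```

On a Linux host, `read_meminfo()` returns the live figures. A large gap between the two percentages usually means the "missing" memory is cache that the kernel will give back under pressure, not a leak.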

Disk and I/O Analysis

Reading iostat for await, %util, and queue depth to tell a saturated disk from a busy one, and using iotop and pidstat to pin the I/O on the process causing it rather than the one waiting behind it.
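The distinction between busy and saturated can be sketched as a small decision rule. The idea is that %util only says the device had at least one request in flight, while a growing queue together with latency well above its baseline says requests are waiting. Everything in this sketch is an assumption made for illustration: the function name `classify_disk`, its parameters, and especially the thresholds, which are placeholders to be replaced with values from your own baseline.

```python
def classify_disk(util_pct, queue_depth, await_ms, baseline_await_ms,
                  busy_util=70.0, queue_limit=2.0, await_factor=3.0):
    """Label a device 'saturated', 'busy', or 'ok' from iostat-style figures.

    The three threshold defaults are illustrative assumptions, not
    recommendations; tune them against a baseline for the device.
    """
    # Saturation: requests are queueing AND latency has risen well above normal.
    queueing = queue_depth >= queue_limit
    slow = await_ms >= await_factor * baseline_await_ms
    if queueing and slow:
        return "saturated"
    # Busy: the device is in use most of the time but still keeping up.
    if util_pct >= busy_util:
        return "busy"
    return "ok"


# A device at 98% util with a short queue and normal latency is only busy.
print(classify_disk(util_pct=98.0, queue_depth=0.9, await_ms=0.4, baseline_await_ms=0.3))
# A deep queue with latency nine times the baseline is saturated.
print(classify_disk(util_pct=99.0, queue_depth=14.0, await_ms=45.0, baseline_await_ms=5.0))
```

The inputs would come from the extended iostat report. Column names differ between sysstat versions, so check which latency and queue columns your installed version prints before wiring this to real output.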

Network Troubleshooting

Working the stack from the bottom up — ip, ss, and ping for the local end, then mtr, dig, and tcpdump for the path and the payload. Separating a DNS problem from a routing problem from an application that never reads the socket.

Network, Diagnostics
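The separation of failure stages can be sketched with the standard socket module: resolve the name first, then attempt a TCP connection, and report which step failed. This is a minimal sketch and not a replacement for dig, mtr, or tcpdump. The function name `check_endpoint` and its return labels are hypothetical, and it only tries the first address the resolver returns.

```python
import socket


def check_endpoint(host, port, timeout=3.0):
    """Return the first stage that fails when reaching host:port, or 'ok'."""
    # Stage 1: name resolution. A failure here is a DNS problem, not routing.
    try:
        addrinfo = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns"

    # Stage 2: TCP connect to the first resolved address only.
    family, socktype, proto, _canonname, sockaddr = addrinfo[0]
    sock = socket.socket(family, socktype, proto)
    sock.settimeout(timeout)
    try:
        sock.connect(sockaddr)
    except ConnectionRefusedError:
        return "refused"      # host answered, but nothing listens on the port
    except OSError:
        return "unreachable"  # timeout, no route, or filtered
    finally:
        sock.close()
    return "ok"
```

An "ok" result only proves that the kernel on the far side completed the handshake. If the service still hangs after that, the application may be the one not reading its socket, which is where ss and the tracing tools take over.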

strace, lsof, and Tracing

When the metrics say a process is stuck but not why: strace to see the syscall it's blocked on, lsof to list the files and sockets it holds open, and a look at the modern eBPF-based tools that trace without the strace performance penalty.

Tracing, Debugging
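To make the lsof idea concrete, the sketch below lists a process's open descriptors by reading the symlinks under /proc/<pid>/fd, which is a rough approximation of the per-process view lsof gives. It assumes a Linux host with /proc mounted and permission to inspect the target process. The function name `open_files` is hypothetical.

```python
import os


def open_files(pid):
    """Map each open file descriptor of pid to its target via /proc/<pid>/fd."""
    fd_dir = f"/proc/{pid}/fd"
    targets = {}
    for name in os.listdir(fd_dir):
        try:
            # Regular files resolve to a path; sockets and pipes appear as
            # 'socket:[inode]' and 'pipe:[inode]'.
            targets[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed between listdir and readlink
    return targets


if __name__ == "__main__" and os.path.isdir("/proc/self/fd"):
    for fd, target in sorted(open_files(os.getpid()).items()):
        print(fd, target)
```

Run against its own PID, the script prints the descriptors the interpreter itself holds. Pointing it at another process generally requires being that process's owner or root.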