Topic 69

Performance Methodology and Tools

Operations

Performance methodology is a repeatable process for locating a bottleneck before you touch a single tunable. It tells you which resource to look at, which metric proves it, and when to stop looking. Without one, you fall into the tools-first antipattern: run top, run htop, run whatever someone tweeted last week, stare at the numbers, and pattern-match against a problem you saw two years ago. That works until the obvious cause is not the cause.

The consequence is concrete. When you skip the method you fix symptoms instead of causes, and you burn the outage window doing it. A method-first approach — the USE method, workload characterization — narrows the search systematically, so the time you spend maps to the size of the system, not to how lucky your guesses are. On a 64-core database host at 3 a.m., that difference is the whole incident.

The USE Method

The USE method, from Brendan Gregg, gives you a fixed checklist for every resource: Utilization, Saturation, Errors. Utilization is the fraction of time the resource was busy. Saturation is the extra work that could not be serviced and had to queue. Errors are operations that failed outright. You walk CPU, memory, disk, and network in turn and answer all three for each. The first resource that shows high saturation or errors is your suspect.

Saturation is the column people skip, and it is the one that matters. A CPU at 100% utilization is not a problem if nothing is waiting; a CPU at 80% utilization with a run queue of 40 is a problem. Utilization tells you the resource is busy. Saturation tells you the resource is the bottleneck.
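
The utilization-versus-saturation distinction can be sketched as a comparison between the number of runnable tasks and the number of CPUs. The function name cpu_saturated and the "runnable greater than CPU count" threshold are illustrative assumptions, a rule of thumb rather than a fixed standard.

```shell
# cpu_saturated RUNNABLE NCPU
# Prints "saturated" when more tasks are runnable than there are CPUs,
# otherwise "ok". The threshold is a rule of thumb, not a hard limit.
cpu_saturated() {
    local runnable=$1 ncpu=$2
    if [ "$runnable" -gt "$ncpu" ]; then
        echo "saturated"
    else
        echo "ok"
    fi
}

# Live usage: take the r column from the second vmstat sample
# (the first sample is an average since boot).
# r=$(vmstat 1 2 | tail -n 1 | awk '{print $1}')
# cpu_saturated "$r" "$(nproc)"
```

With the figures from the paragraph above, cpu_saturated 40 16 reports a saturated CPU regardless of what the utilization column says.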

Resource	Utilization	Saturation	Errors
CPU	mpstat -P ALL 1 (%idle)	vmstat 1 (r column)	perf, MCE in dmesg
Memory	free -m, /proc/meminfo	vmstat 1 (si/so swap)	dmesg (OOM kills, EDAC)
Disk	iostat -xz 1 (%util)	iostat -xz 1 (aqu-sz)	smartctl, /sys/.../ioerr_cnt
Network	sar -n DEV 1	sar -n EDEV 1, ss -ti	ip -s link (drops/errs)

Workload Characterization

Where the USE method examines the resource, workload characterization examines the demand placed on it. Four questions: who is causing the load (which PID, user, or IP), why (which code path or system call), what (the operations and their sizes), and how much (the rate). You can answer all four without changing anything, and the answers frequently make tuning unnecessary — a single misbehaving cron job or a retry storm explains far more incidents than a badly chosen kernel parameter.
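
As a rough sketch of the "who" question, the helper below totals CPU share per user from ps output. The name cpu_by_user is hypothetical, and it covers only one of the four questions; the "why" and "what" usually need a tracer attached to the suspect process.

```shell
# cpu_by_user: read "USER %CPU" pairs on stdin and print the total %CPU
# per user, highest first.
cpu_by_user() {
    awk '{ cpu[$1] += $2 }
         END { for (u in cpu) printf "%s %.1f\n", u, cpu[u] }' \
        | sort -k2,2 -nr
}

# Live usage: who is causing the CPU load right now?
# ps -eo user=,pcpu= | cpu_by_user | head -n 5
```

Nothing here changes the system, which is the point: the demand is identified before any tunable is touched.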

Keep three metrics distinct because they fail differently. Throughput is operations per second. Utilization is how busy the resource is. Latency is how long each operation took, and it is what users actually feel. A system can run at 50% utilization and high throughput while p99 latency is awful, because averages hide the tail. Optimize for the metric your users experience, not the one that is easiest to graph.
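
To see how an average hides the tail, the following sketch computes the mean and a nearest-rank p99 from one latency value per line. The function name latency_summary and the nearest-rank definition of p99 are choices made for this illustration; other percentile definitions give slightly different values.

```shell
# latency_summary: read one latency value per line on stdin and print
# the mean and the nearest-rank 99th percentile.
latency_summary() {
    sort -n | awk '
        { v[NR] = $1; sum += $1 }
        END {
            if (NR == 0) exit 1
            rank = int((NR * 99 + 99) / 100)   # ceil(NR * 0.99) in integer math
            printf "mean=%.1f p99=%s\n", sum / NR, v[rank]
        }'
}

# Example: 98 requests at 10 ms and 2 requests at 1000 ms.
# { for i in $(seq 98); do echo 10; done; echo 1000; echo 1000; } | latency_summary
```

In that example the mean stays under 30 ms while the p99 is 1000 ms, so a graph of the average would look healthy while two requests in every hundred take a full second.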

The Tool Landscape

Linux performance tools fall into three families. Counters are cheap, always-on tallies the kernel keeps in /proc and /sys — the source of truth that almost every other tool reads. Tracing records individual events (a system call, a block I/O, a scheduler switch) and is more expensive. Profiling samples the stack at a fixed rate to show where CPU time goes. Reach for counters first; escalate to tracing and profiling only once the counters point you at a resource.
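
Because the counters are plain text, a utilization figure can be derived by hand from two samples of the aggregate cpu line in /proc/stat. This is a minimal sketch of that arithmetic; the function name cpu_busy_pct is an assumption, and it treats iowait as idle time, which is one common convention rather than the only one.

```shell
# cpu_busy_pct BEFORE AFTER
# Each argument is an aggregate "cpu ..." line from /proc/stat.
# Prints the percentage of CPU time that was not idle or iowait
# between the two samples.
cpu_busy_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            # Fields 2..9 are user, nice, system, idle, iowait, irq,
            # softirq, steal. Guest time is already counted in user.
            for (i = 2; i <= 9; i++) total += $i
            idle = $5 + $6
            if (NR == 1) { t0 = total; i0 = idle }
            else         { dt = total - t0; di = idle - i0 }
        }
        END {
            if (dt <= 0) exit 1
            printf "%.1f\n", 100 * (dt - di) / dt
        }'
}

# Live usage: two samples one second apart.
# a=$(grep '^cpu ' /proc/stat); sleep 1; b=$(grep '^cpu ' /proc/stat)
# cpu_busy_pct "$a" "$b"
```

Tools such as mpstat do essentially this subtraction for you, which is why the counters are the place to start.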

The per-resource counter tools come from two packages. vmstat, for virtual memory and the run queue, ships with procps and is normally already installed. The sysstat package adds iostat for disks, mpstat for per-CPU breakdown, pidstat for per-process attribution, and the sar collector. On Debian and Ubuntu one command installs it.

# Debian/Ubuntu: counter tools + the sar history collector
sudo apt install sysstat
# enable collection so you have history before the next incident
sudo sed -i 's/ENABLED="false"/ENABLED="true"/' /etc/default/sysstat
sudo systemctl enable --now sysstat

# Red Hat/Fedora equivalent
sudo dnf install sysstat
sudo systemctl enable --now sysstat

When counters are not enough, perf profiles CPU and traces kernel events, and bpftrace (eBPF) attaches custom in-kernel probes, for example to build latency histograms per disk or per syscall. bpftrace overhead is low for infrequent events but grows with the event rate, and both tools need a recent kernel and root, so they are escalation tools, not the first thing you run.
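
The escalation step might look like the wrappers below. The function names are hypothetical, the 99 Hz sampling rate and 10-second default are conventional choices rather than requirements, and the vfs_read probe is a simpler stand-in for the per-disk or per-syscall histograms mentioned above. All three need root and a kernel with perf and eBPF support.

```shell
# profile_cpu [SECONDS]: sample on-CPU stacks system-wide at 99 Hz,
# then print a text report from the perf.data file it wrote.
profile_cpu() {
    sudo perf record -F 99 -a -g -- sleep "${1:-10}" && sudo perf report --stdio
}

# syscalls_by_process: count system calls per process name until Ctrl-C.
syscalls_by_process() {
    sudo bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
}

# read_latency_hist: histogram of vfs_read() latency in nanoseconds,
# keyed by process name, printed on Ctrl-C.
read_latency_hist() {
    sudo bpftrace -e '
        kprobe:vfs_read { @start[tid] = nsecs; }
        kretprobe:vfs_read /@start[tid]/ {
            @ns[comm] = hist(nsecs - @start[tid]);
            delete(@start[tid]);
        }'
}
```

Defining these as functions keeps them from running by accident; call one only after the counters have pointed at a resource.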

A Triage Checklist

Gregg's 60-second checklist is the fixed opening move on an unfamiliar box. Run ten commands in order, glance at each, and you have a coarse map of where the trouble is before you commit to any deep dive. Each line answers a specific question.

# Gregg's first 60 seconds on a sick host
uptime              # load avg trend: rising, flat, or falling?
dmesg | tail        # OOM kills, MCE, dropped packets, disk errors
vmstat 1            # r = run queue (CPU saturation), si/so = swapping
mpstat -P ALL 1     # one hot CPU vs. evenly spread load
pidstat 1           # which process is burning the CPU, by name
iostat -xz 1        # %util and aqu-sz per disk: storage saturation
free -m             # real free memory vs. page cache
sar -n DEV 1        # interface throughput, near link ceiling?
sar -n TCP,ETCP 1   # retransmits and connection errors
top                 # confirm the suspect, sort by RES or %CPU

The point is not the commands; it is the order. Load average and dmesg give context, vmstat and mpstat separate CPU saturation from a single hot thread, iostat and free rule storage and memory in or out, and the sar network lines catch the cases people forget. Sixty seconds in, you know which resource to chase.
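
For capturing the same ten commands to a file, for example as a baseline on a healthy host, a bounded version could look like this sketch. The function name checklist_snapshot, the output layout, and the default of five samples are assumptions; the interactive commands above run until interrupted, so each one is given an explicit sample count here.

```shell
# checklist_snapshot OUTFILE [SAMPLES]
# Runs the ten checklist commands in order, SAMPLES one-second samples
# each (default 5), and appends labelled output to OUTFILE.
checklist_snapshot() {
    local out=$1 n=${2:-5} cmd
    : > "$out"
    while IFS= read -r cmd; do
        printf '### %s\n' "$cmd" >> "$out"
        # Skip tools that are not installed instead of aborting the run.
        if command -v "${cmd%% *}" > /dev/null 2>&1; then
            # stdin comes from /dev/null so a tool cannot eat the command list.
            sh -c "$cmd" >> "$out" 2>&1 < /dev/null || true
        else
            echo "(not installed)" >> "$out"
        fi
    done <<EOF
uptime
dmesg | tail
vmstat 1 $n
mpstat -P ALL 1 $n
pidstat 1 $n
iostat -xz 1 $n
free -m
sar -n DEV 1 $n
sar -n TCP,ETCP 1 $n
top -b -n 1 | head -n 20
EOF
}

# Usage: checklist_snapshot "/tmp/baseline-$(hostname).txt"
```

The order is preserved deliberately, so a saved snapshot reads top to bottom the same way the live triage does.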

The USE Method vs Symptom-Chasing

The USE method — a fixed per-resource checklist: walk CPU, memory, disk, and network, and read utilization, saturation, and errors for each before forming a theory. It finds the constrained resource by elimination, so the time it costs scales with the system, not with luck. Choose it whenever the cause is not obvious — which is most real incidents.

Symptom-chasing — running whichever tool you remember, fixing the first high number, and flipping tunables until something moves. It feels faster for the first thirty seconds and then costs you the outage window, because you tune resources that were never the bottleneck and leave new misconfigurations behind. It only pays off when you already know the cause cold from a past identical incident.

Common Mistakes

Tuning before measuring. You change vm.swappiness or a thread-pool size on a hunch, the symptom moves, and you never learn whether you fixed the cause or shifted the bottleneck somewhere you are not watching.
Trusting load average as a CPU metric. On Linux, load average counts runnable tasks and also tasks in uninterruptible sleep (D state), which are usually waiting on I/O, so a load of 40 on a 16-core box can mean a slow disk, not a CPU shortage. Confirm with mpstat before you blame the CPU.
Ignoring saturation. You read 100% CPU utilization and stop, never checking the r column in vmstat, so you miss whether 2 or 200 threads are queued — the difference between healthy and drowning.
Reading averages that hide bursts. A 1-minute average of 30% disk util can contain ten seconds at 100%; the average looks fine while requests time out. Sample at 1-second intervals when latency complaints do not match the averages.
Changing multiple variables at once. You bump three sysctls and restart the service together, it gets better, and you have learned nothing reusable because you cannot attribute the win to any one change.
No baseline before the incident. With sar collection disabled you have no idea whether 4,000 IOPS is normal or a tenfold spike, so every number during the outage is uninterpretable.
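
For the load-average mistake in particular, a quick way to see what the number is made of is to count tasks by state. This is a sketch; the function name state_counts is hypothetical, and a single ps snapshot is only a point-in-time sample of a value the kernel averages over minutes.

```shell
# state_counts: read one process state letter per line on stdin (as
# printed by `ps -eo state=`) and report runnable (R) versus
# uninterruptible-sleep (D) tasks, the two states Linux load average counts.
state_counts() {
    awk '$1 == "R" { r++ }
         $1 == "D" { d++ }
         END { printf "runnable=%d uninterruptible=%d\n", r, d }'
}

# Live usage:
# ps -eo state= | state_counts
# List the D-state tasks themselves:
# ps -eo state=,pid=,comm= | awk '$1 == "D"'
```

A high uninterruptible count next to a low runnable count points at storage or another blocking resource rather than at the CPU.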

Best Practices

Establish a baseline on a healthy system. Run the 60-second checklist and capture sar output during normal load so incident numbers have something to compare against.
Use the USE method as a literal checklist. Walk CPU, memory, disk, and network, and write down utilization, saturation, and errors for each before you form a theory.
Measure one change at a time. Apply a single tunable, re-run the relevant metric, record the delta, then decide on the next change — never batch them.
Prefer p99 over averages. Track tail latency with bpftrace histograms or application metrics, because the average is the number that lies to you; sar reports interval averages only, so it cannot show the tail.
Capture sar history. Enable sysstat collection (on Debian and Ubuntu, ENABLED="true" in /etc/default/sysstat) so you can replay the minutes before an alert with sar -f.
Know the 60-second checklist cold. Memorize the ten commands and what each column means, so triage on an unfamiliar host costs you a minute, not a manual.
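
Recording the delta for a single change can be as small as the helper below. The name pct_change is an assumption, and the before and after values are whatever metric you chose to re-run, for example a p99 latency in milliseconds.

```shell
# pct_change BEFORE AFTER
# Prints the signed percentage change from BEFORE to AFTER.
pct_change() {
    awk -v b="$1" -v a="$2" 'BEGIN {
        if (b == 0) exit 1          # no meaningful percentage from zero
        printf "%+.1f\n", 100 * (a - b) / b
    }'
}

# Usage: measure, apply exactly one change, measure again.
# pct_change 200 150
```

Logging one such line per change gives you an attributable record, which batching three sysctls together cannot.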

Comparable Tools

Windows: PerfMon and Resource Monitor, the same utilization/saturation counters with a GUI front end
BSD: systat, the per-resource live monitor that predates most of the Linux equivalents
Observability: Prometheus and Grafana, which apply the same metrics continuously across a fleet instead of one host at a time

Knowledge Check

A 16-core host shows a load average of 40 but mpstat -P ALL 1 reports 70% idle on every CPU. What is the most likely cause?

All 16 cores are fully saturated with runnable threads and the scheduler simply cannot keep up with the backlog
Many processes are in uninterruptible sleep waiting on slow I/O, which Linux counts in the load average
The load average is reported on a per-core basis, so the figure of 40 must first be divided by the 16 cores
mpstat is sampling far too slowly at a one-second interval to catch the real CPU usage spikes

In the USE method, why is saturation the column that most often identifies the bottleneck?

It measures work that could not be serviced and had to queue, so it reveals a resource that is the constraint even when utilization looks acceptable
It is the only one of the three USE metrics that the kernel actually exposes through the files under /proc
It always begins to rise before utilization climbs at all, giving an earlier warning of the developing constraint than either the utilization or the errors column does
It counts the failed operations on a resource, and those errors are the root cause behind the majority of real-world slowdowns

You apply three sysctl changes and restart the service in one step, and latency improves. What is the methodological problem?

The sysctl changes will not actually take effect until the host goes through its next full reboot cycle
Restarting the service quietly invalidates any sysctl change you applied just before it, rolling the kernel values back to their defaults
You cannot attribute the improvement to any single change, so the result is not reproducible and may hide a regression
Three changes is well below the sample size needed for the result to count as a statistically valid test

Why enable sysstat collection on a host before any incident occurs?

It reduces the CPU overhead of running ad-hoc vmstat and iostat commands during an active outage, so the tools cost less when you need them
It records a historical baseline with sar, so numbers seen during an incident can be compared against normal load
It automatically tunes the relevant kernel parameters based on the load it has observed over time on the host
It is a hard prerequisite for perf and bpftrace to attach their probes to the running kernel during an investigation
