Topic 71

Disk and I/O Analysis

Disk is the bottleneck that hides behind other symptoms. "CPU is high" turns out to be iowait — the CPU sitting idle waiting for a read to come back. "The app is slow" turns out to be a database blocked on fsync to a saturated volume. The disk subsystem rarely announces itself; you have to go measure it, and the measurement has to name a device, a process, and a latency in milliseconds before it is worth acting on.

The cost of guessing here is buying the wrong resource. Add CPU to a box that was iowait-bound and nothing changes; add RAM to a write-bound database and the writes still stall on disk latency. iostat and pidstat from the sysstat package, together with the separately packaged iotop, answer the three questions that matter: is a device saturated, how much latency is it adding, and which process is generating the I/O. With those answers the fix lands on the constrained resource instead of the convenient one.

The Metrics That Matter

Four numbers describe a block device under load: throughput (MB/s moved), IOPS (operations completed per second), average request size (throughput ÷ IOPS), and service latency — how long each request takes from submission to completion. Latency is the one that maps to user pain, because an interactive request waits for its I/O to finish. A volume can show modest 30 MB/s throughput and still be the problem if each 4 KB operation takes 12 ms instead of 0.3 ms.

In iostat -x output the latency columns are r_await and w_await, measured in milliseconds and including time spent queued in the kernel, not just on the device. %util is the fraction of wall-clock time the device had at least one request in flight, and aqu-sz (average queue size) is the mean number of requests outstanding. On a single spinning disk, %util near 100 means saturated; on an SSD or a RAID array that serves many requests in parallel, %util at 100 only means "never idle," not "out of capacity."

iostat and iotop

On Debian and Ubuntu, iostat ships in the sysstat package (apt install sysstat); RHEL and Fedora install the same package with dnf install sysstat. Always run it with an interval, and discard the first sample — it is an average since boot, not the current state. The extended (-x) form is the one worth reading.

# extended per-device stats in MB/s, 2-second samples; ignore the first
iostat -dxm 2

Device   r/s    w/s    rMB/s  wMB/s  r_await  w_await  aqu-sz  %util
nvme0n1  12.0   840.0  0.19   42.1   0.31     8.74     7.42    98.6
sda      4.0    96.0   0.02   1.50   1.10     22.40    2.30    61.0
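
The average request size (throughput divided by IOPS) can be derived from rows like these. The following is a minimal sketch, not part of sysstat: avg_req_kb is a hypothetical helper, and it assumes the simplified column order of the sample above (device, r/s, w/s, rMB/s, wMB/s). Real iostat -x output has more columns in a different order, and recent sysstat versions print rareq-sz and wareq-sz directly.

```shell
# avg_req_kb (hypothetical helper): read rows of
#   device r/s w/s rMB/s wMB/s ...
# on stdin, skip the header, and print each device's average request
# size in KB (1 MB = 1024 KB). Devices with no operations are skipped.
avg_req_kb() {
  awk 'NR > 1 && ($2 + $3) > 0 {
    printf "%s %.1f\n", $1, ($4 + $5) * 1024 / ($2 + $3)
  }'
}

# Feed it the sample rows shown above.
avg_req_kb <<'EOF'
Device   r/s    w/s    rMB/s  wMB/s  r_await  w_await  aqu-sz  %util
nvme0n1  12.0   840.0  0.19   42.1   0.31     8.74     7.42    98.6
sda      4.0    96.0   0.02   1.50   1.10     22.40    2.30    61.0
EOF
```

For the sample rows this prints "nvme0n1 50.8" and "sda 15.6", so both devices are moving fairly small requests.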

iostat tells you which device hurts; it does not tell you which process. For that, iotop (apt install iotop) shows per-process read and write rates live. It needs root or the CAP_NET_ADMIN capability because it reads per-task I/O accounting over the kernel's netlink taskstats interface. The non-interactive form is what you put in a runbook.

# batch mode, only processes doing I/O, per process rather than per thread, 3 samples
# (add -a for accumulated totals instead of per-sample rates)
sudo iotop -boP --iter=3

# sysstat alternative: per-process disk stats, no curses UI
pidstat -d 2

Saturation Signs

A device is saturated when requests spend more time waiting than being served. The signature is rising await together with an aqu-sz well above 1 — requests are backing up in the queue faster than the device drains them. A queue depth of 30 with 8 ms w_await means each new write waits behind roughly 30 others; that is where a database with a 5 ms write SLO quietly starts missing it while the CPU graph stays flat.
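
The link between await and aqu-sz follows Little's law: the mean number of requests outstanding is roughly the request rate multiplied by the time each request spends in the system. A small sketch of that arithmetic, using a hypothetical helper named expected_queue and the two rows from the iostat sample earlier in this topic:

```shell
# expected_queue (hypothetical helper): estimate the mean number of
# outstanding requests from rates and awaits (Little's law).
# Args: r/s  w/s  r_await_ms  w_await_ms
expected_queue() {
  awk -v r="$1" -v w="$2" -v ra="$3" -v wa="$4" \
    'BEGIN { printf "%.2f\n", (r * ra + w * wa) / 1000 }'
}

expected_queue 12.0 840.0 0.31 8.74    # nvme0n1 row
expected_queue 4.0 96.0 1.10 22.40     # sda row
```

This gives 7.35 and 2.15, close to the aqu-sz values of 7.42 and 2.30 in the sample. When the estimate and the reported queue both climb while throughput stays flat, the extra latency is queueing time rather than device service time.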

System-wide, the cheap first signal is iowait in top or vmstat — the wa column and the b (blocked, uninterruptible-sleep) process count. High iowait with processes stuck in D state is the kernel telling you tasks are blocked on I/O completion. That points you at disk; iostat then tells you which device, and iotop tells you which process is filling its queue.

# vmstat: watch the 'b' (blocked) count and 'wa' (iowait) column
vmstat 2

 r  b   swpd    free  buff   cache  si  so  bi     bo  us  sy  id  wa
 1  6      0  210304  4096  920184   0   0  12  41200   3   4  19  74
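
For a runbook, the same check can be scripted. The sketch below is one possible approach: flag_iowait is a hypothetical helper, and the default threshold of 20 percent iowait is an illustrative assumption, not a standard. It finds the b and wa columns from the header line rather than by fixed position, since vmstat versions differ in which columns they print.

```shell
# flag_iowait (hypothetical helper): read vmstat output on stdin and print
# the samples where iowait is at or above the threshold AND at least one
# task is blocked. Arg: iowait threshold in percent (default 20, assumed).
flag_iowait() {
  awk -v limit="${1:-20}" '
    $1 == "r" && $2 == "b" {             # header row: map names to positions
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    ("wa" in col) && $1 ~ /^[0-9]+$/ {   # data rows start with a number
      wa = $(col["wa"]); b = $(col["b"])
      if (wa >= limit && b > 0)
        printf "iowait=%d%% blocked=%d\n", wa, b
    }'
}

# Feed it the sample shown above.
flag_iowait <<'EOF'
 r  b   swpd   free   buff  cache   si  so   bi    bo   us sy id wa
 1  6      0  210304  4096 920184    0   0   12  41200    3  4 19 74
EOF
```

On a live host the equivalent would be vmstat 2 5 | flag_iowait, which samples five times at two-second intervals. The first vmstat row is an average since boot, like the first iostat sample.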

Random versus Sequential and the Storage Type

The same numbers mean different things on different hardware, and the random-versus-sequential split is why. A 7200 RPM SATA drive streams roughly 150 MB/s sequentially but delivers only 80–150 random IOPS, because every seek costs milliseconds of head movement. An enterprise NVMe SSD delivers hundreds of thousands of random IOPS with sub-millisecond latency and barely cares about access pattern. Small request sizes with high %util and low throughput mean a random workload — and on a spinning disk that is close to its hard limit.

This decides the fix. If a workload is random-IOPS-bound on an HDD, more sequential bandwidth is useless; you move it to flash or cut the operation count. If it is throughput-bound on flash, the device is doing exactly what it should and the answer is a faster link or parallel devices, not lower latency. Read the request size first — it tells you which kind of limit you are against before you spend money on the wrong one.

Filesystem and Caching Effects

The page cache sits between applications and the device and distorts what you measure. Reads served from cache never reach the disk, so a read-heavy service can look I/O-free until the working set outgrows RAM and the cache hit rate collapses — then disk read load appears suddenly and the latency cliff is steep. Writes are buffered too: iostat shows the device flushing dirty pages, which lags the application's write() calls and makes write spikes look delayed.
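
One way to see how much the page cache is holding, and how much written data has not reached the device yet, is to read the Cached, Dirty, and Writeback fields of /proc/meminfo. A minimal sketch, assuming a Linux host; cache_summary is a hypothetical helper name:

```shell
# cache_summary (hypothetical helper): read /proc/meminfo-formatted text on
# stdin and report page cache size plus dirty and in-flight writeback data.
# /proc/meminfo reports kB; output is truncated to whole MB.
cache_summary() {
  awk '
    $1 == "Cached:"    { cached = $2 }
    $1 == "Dirty:"     { dirty = $2 }
    $1 == "Writeback:" { wb = $2 }
    END { printf "cached=%dMB dirty=%dMB writeback=%dMB\n",
                 cached / 1024, dirty / 1024, wb / 1024 }'
}

# On a Linux host:
#   cache_summary < /proc/meminfo
```

A large Dirty figure while iostat shows little write activity means the application's writes are still sitting in memory; the device load will show up when the kernel flushes them.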

The exception that bypasses all of this is fsync. When a database or message broker calls fsync to guarantee durability, the kernel must push data to stable storage before the call returns, so application latency equals device write latency directly. A workload that is "slow" only because it fsyncs every commit is disk-latency-bound by design: the fix is faster storage, a write-back cache with battery backing, or relaxing the durability requirement, never more CPU. Check the mount options too: noatime in /etc/fstab stops reads from triggering access-time metadata writes (the default relatime already limits these, so the gain is largest on mounts that use strictatime), and the wrong barrier or journaling mode can add synchronous writes you did not ask for.
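
To put a number on the fsync bound: a single writer that waits for each fsync before issuing the next cannot commit faster than one operation per fsync latency. The sketch below is only that arithmetic, with a hypothetical helper name. It ignores group commit, where a database batches several transactions into one fsync, so treat the result as a ceiling for one serialized stream rather than for the whole server.

```shell
# max_serial_commits (hypothetical helper): upper bound on commits per
# second for ONE writer that waits for each fsync to finish.
# Arg: fsync latency in milliseconds (must be > 0).
max_serial_commits() {
  awk -v ms="$1" 'BEGIN {
    if (ms <= 0) exit 1
    printf "%.0f\n", 1000 / ms
  }'
}

max_serial_commits 10     # slow volume: 10 ms per fsync
max_serial_commits 0.3    # fast flash: 0.3 ms per fsync
```

At 10 ms per fsync the ceiling is 100 commits per second; at 0.3 ms it is about 3333. No amount of extra bandwidth or CPU moves that ceiling, only lower write latency does.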

Throughput vs IOPS vs Latency

Throughput — bytes moved per second (MB/s). The metric for streaming, sequential work: backups, large file copies, log shipping, video. Bound by the device's bandwidth and the link.

IOPS — operations completed per second, independent of their size. The metric for small-random workloads: an OLTP database doing thousands of 4–8 KB reads and writes cares about IOPS, not MB/s.

Latency — time per operation (await, in ms). The metric that maps to user-visible response time, because an interactive request blocks until its I/O completes. Databases and any synchronous workload are usually latency-bound; optimize await first, not raw bandwidth.

Common Mistakes

Reading %util as percent-of-capacity on an SSD or RAID array. These devices serve many requests in parallel, so 100% %util means "never idle," not "saturated" — judge them by await and aqu-sz instead, or you will replace hardware that had headroom left.
Optimizing throughput when the workload is latency-bound. Striping for more MB/s does nothing for a database whose commits each wait on one 10 ms fsync — the operations are small and serialized, so only lower per-operation latency helps.
Blaming the application for fsync-bound latency. A broker or database that fsyncs every commit is reporting the device's true write latency back to you; the slowness lives in the storage, not the code.
Trusting a read-load measurement taken while the page cache is warm. Reads served from RAM never touch the disk, so the device looks idle until the working set exceeds memory and real read I/O appears all at once.
Using the first iostat sample. The opening interval is an average since boot and routinely hides a current spike — discard it and read the second and later samples.
Sending SIGKILL to a process stuck in D (uninterruptible) state to "free" the disk. The task is blocked in the kernel on I/O completion; the signal is not delivered until the I/O returns, so the kill does nothing until the device catches up.
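
Before reaching for a signal, it helps to confirm which tasks are actually in D state. A minimal sketch, assuming the procps ps found on most Linux systems; d_state is a hypothetical helper, and it prints only the first word of the command name:

```shell
# d_state (hypothetical helper): read "STAT PID COMMAND" rows on stdin, as
# printed by `ps -eo stat,pid,comm`, skip the header, and print the PID and
# command of every task whose state starts with D (uninterruptible sleep).
d_state() {
  awk 'NR > 1 && $1 ~ /^D/ { print $2, $3 }'
}

# On a Linux host:
#   ps -eo stat,pid,comm | d_state
```

Tasks listed here are waiting on I/O completion; the useful next step is finding the device they are waiting on with iostat, not signaling them.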

Best Practices

Lead with await for any interactive or database workload — measure r_await and w_await in iostat -dxm 2 and treat sustained double-digit milliseconds as the problem, regardless of what %util says.
Run sudo iotop -boP to pin the I/O on the process generating it before you tune anything; the saturated queue is usually one backup, one log writer, or one runaway query, not the whole box.
Classify the workload by request size first — divide throughput by IOPS in iostat output — so you know whether you are IOPS-bound or throughput-bound before choosing flash, more spindles, or a faster link.
Set baseline expectations from the storage type: 80–150 random IOPS for a 7200 RPM HDD, hundreds of thousands for enterprise NVMe. A number is only "bad" relative to what the device can do.
Account for the page cache when reading numbers — confirm whether reads are hitting cache or disk, and watch for the latency cliff when the working set outgrows RAM.
Add noatime to read-heavy mounts in /etc/fstab to stop file reads from triggering access-time metadata writes; the default relatime already limits these, so measure before expecting a large gain.
Reserve fio for reproducing and quantifying a workload before you change hardware.

Comparable tools

Windows: Resource Monitor disk view and diskspd for load testing; Performance Monitor counters for per-disk latency and queue length.
fio: the cross-platform benchmark for reproducing a workload's IOPS, throughput, and latency profile before changing storage.
eBPF: biolatency and biosnoop from bcc/bpftrace, block-layer latency histograms with near-zero overhead.

Knowledge Check

An NVMe volume shows %util pinned at 100% but w_await is 0.4 ms and aqu-sz is 1.2. What is the right read?

The device is busy but not saturated — it serves requests in parallel, so 100% %util only means never idle; the low await and queue show plenty of headroom
The device is saturated and must be replaced with a faster drive right away, because a %util reading pinned at 100% always means the volume has run out of IOPS capacity
The numbers are internally contradictory, so the iostat sample is unreliable and should be discarded before drawing any conclusion
The workload is throughput-bound and needs a wider PCIe bus, regardless of the low latency figure, to push more bytes per second

A PostgreSQL server has high commit latency. iostat shows low throughput, small request sizes, and high w_await. What is the most effective fix?

Reduce per-operation write latency — faster (lower-latency) storage or a battery-backed write cache — because each fsync'd commit blocks on device write latency
Stripe the volume across several disks to gain more sequential MB/s, on the assumption that the wider aggregate bandwidth lets each fsync'd commit flush proportionally faster
Add CPU cores to the database host, since high commit latency under small request sizes points to a compute bottleneck in the transaction path
Add RAM so the larger page cache can absorb the commit writes in memory and let the transactions return without touching the disk at all

Why can a read-heavy service look I/O-free in iostat and then suddenly develop heavy disk read load?

Reads served from the page cache never reach the device; once the working set outgrows RAM the hit rate collapses and real disk reads appear all at once
iostat cannot account for reads, only writes, until the per-device read counter crosses a built-in reporting threshold and starts surfacing them
The disk scheduler deliberately batches all incoming reads and releases them to the device in a single coordinated burst once per minute
Read latency stays pinned at exactly zero until the device firmware quietly switches into a slower power-saving mode and only then begins charging measurable time for each access

A backup process is stuck in D state and is hammering the disk. Why does kill -9 on it not free the device immediately?

A process in uninterruptible sleep is blocked in the kernel on I/O completion; the signal isn't delivered until that I/O returns, so it does nothing until the device catches up
SIGKILL is queued behind every pending disk write the backup process issued and is only delivered once the kernel has finished flushing the entire backlog of them out to the platter
Root is forbidden from signaling any process that still holds an open file descriptor against a mounted disk until that descriptor is released
D state means the process has already exited and become a zombie, so there is no live task left for the signal to terminate
