Topic 70

CPU and Memory Analysis

Performance, Troubleshooting

CPU and memory analysis is the work of deciding whether a slow server is starved of cores, blocked on I/O, or quietly being killed by the kernel's out-of-memory reaper — and proving it with numbers rather than guessing. The toolset is small and stable: uptime and vmstat for the system-wide pulse, top and htop for the live per-process view, mpstat for per-core breakdown, and free plus /proc/meminfo for where the RAM went. Every one of these reads from /proc, so the data is the same numbers the kernel sees; the skill is reading the columns correctly.

The two questions that trip people up live in this topic. "Is the CPU busy?" is not answered by load average — load counts everything waiting, including processes blocked on disk. "Is memory full?" is not answered by the first line of free — Linux deliberately spends almost all free RAM on page cache and hands it back on demand. Misreading either number sends you scaling the wrong resource, and the bill or the outage follows.

Load Average versus CPU Utilization

Load average is the number of processes in the run queue, averaged over 1, 5, and 15 minutes; on Linux specifically it also counts processes in uninterruptible sleep (state D), which is almost always disk or network I/O wait. A load of 8.0 on an 8-core box can mean every core is pegged, or it can mean two cores are busy, six are idle, and six processes are blocked on a slow NFS mount. Load alone cannot tell the two apart, which is why a high load average with low CPU usage is the classic signature of an I/O problem, not a CPU one.

To know whether the CPU itself is the bottleneck, divide load by core count and then confirm with actual utilization. nproc gives the core count; vmstat 1 and mpstat -P ALL 1 give the real split between user time, system time, I/O wait, and idle. The single most useful column is wa (iowait): if it is high while us and sy are low, the cores are sitting idle waiting for storage, and adding vCPUs buys you nothing.

# Run-queue and CPU breakdown, sampled every second
vmstat 1 5
# r  b   swpd   free   buff  cache   si so   bi bo  in cs  us sy id wa st
# 9  0      0 1.2G    88M  5.4G    0  0   12 40 900 2k 71  6  1 22  0

# us=71 sy=6  -> CPU-bound;  wa=22 -> a fifth of the time waiting on I/O
mpstat -P ALL 1   # per-core: find one saturated core hiding behind a low average
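
The divide-by-cores step can be sketched in a few lines of shell. This is a minimal illustration rather than a monitoring tool: load_per_core is a hypothetical helper name, and the live part assumes a Linux /proc/loadavg and the nproc command are available.

```shell
# Hypothetical helper: normalise a load figure by core count.
# Usage: load_per_core LOAD CORES  -> prints LOAD/CORES to two decimals
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# Live usage (assumes Linux): the first field of /proc/loadavg is the
# 1-minute load average.
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
    read -r one_min _ < /proc/loadavg
    echo "1-min load per core: $(load_per_core "$one_min" "$(nproc)")"
fi
```

A result near or above 1.00 only says the queue is about as long as the core count; the us, sy, and wa columns still decide whether those tasks are running or blocked.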

System Time, User Time, and Steal

CPU time splits into categories that point at different culprits. High us (user) time is your application's own code running, and the fix is in that code, not the OS. High sy (system) time means the kernel is doing the work, usually a syscall storm: a process hammering small reads and writes, opening and closing files in a tight loop, or contending on locks. When sy rivals us, reach for strace -c on the offending PID to see which syscall dominates before you touch the code.

On any VM — and that includes most production servers today — watch the st (steal) column. Steal is CPU time the hypervisor gave to another tenant while your vCPU was ready to run; it is the cloud-VM symptom of a noisy neighbor or an oversubscribed host. Sustained steal above a few percent means your guest is not getting the cores you are paying for, and the fix is a larger instance or a different host class, not anything inside the guest. A bare-metal server always shows st at 0.
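
To make the categories concrete, the sketch below derives the same split from two samples of the aggregate cpu line in /proc/stat. It is an illustration under assumptions: cpu_split is a hypothetical helper name, and the grouping (nice folded into us; irq and softirq folded into sy) is one reasonable choice for matching the vmstat columns, not a statement of how any particular tool is implemented.

```shell
# cpu_split "LINE_BEFORE" "LINE_AFTER"
# Each argument is an aggregate "cpu ..." line taken from /proc/stat.
# Fields after the label: user nice system idle iowait irq softirq steal
cpu_split() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        NR == 1 { for (i = 2; i <= 9; i++) before[i] = $i; next }
        {
            total = 0
            for (i = 2; i <= 9; i++) { d[i] = $i - before[i]; total += d[i] }
            if (total == 0) { print "no ticks elapsed"; exit }
            # Assumed grouping: us = user + nice, sy = system + irq + softirq
            printf "us=%.0f sy=%.0f id=%.0f wa=%.0f st=%.0f\n",
                100 * (d[2] + d[3]) / total,
                100 * (d[4] + d[7] + d[8]) / total,
                100 * d[5] / total,
                100 * d[6] / total,
                100 * d[9] / total
        }'
}

# Live usage (assumes Linux): sample one second apart.
if [ -r /proc/stat ]; then
    sample_a=$(head -n 1 /proc/stat)
    sleep 1
    sample_b=$(head -n 1 /proc/stat)
    cpu_split "$sample_a" "$sample_b"
fi
```

Because the figures are differences between two samples, a non-zero st here reflects steal during that interval only; sustained steal shows up as the same result across many intervals.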

Resident, Virtual, and Shared Memory

Every process reports several memory sizes and they measure different things. VSZ (virtual size) is the entire address space the process has mapped, including memory it reserved but never touched and shared libraries it barely uses — it routinely reads as gigabytes and is almost never what you want. RSS (resident set size) is the physical RAM the process actually occupies right now, which is closer to the truth but double-counts shared pages: a shared library mapped by 40 processes is added into all 40 RSS figures, so summing RSS across a process list overstates real consumption.

The honest per-process number is PSS (proportional set size), which divides each shared page by the number of processes sharing it, so the column does sum correctly to the machine's real usage. PSS lives in /proc/<pid>/smaps_rollup and is what smem reports. For a quick which-process-is-eating-RAM check, sort by RSS; when you need the figures to add up to the truth — capacity planning, a memory-leak hunt across a fleet of identical workers — use PSS.

Metric	What it counts	Trap
VSZ	Total virtual address space mapped	Includes untouched reservations; not real RAM
RSS	Physical pages resident now	Shared pages counted in every sharer; sum overstates
PSS	Private pages + fair share of shared	Sums correctly; only in smaps, slower to read
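
The RSS and PSS figures in the table can be read side by side from the same file. The sketch below is a minimal illustration: rss_kib and pss_kib are hypothetical helper names, and the live part assumes a kernel recent enough to provide /proc/<pid>/smaps_rollup.

```shell
# pss_kib FILE -> total of the "Pss:" lines, in the kB units the kernel reports.
# Works on smaps_rollup (one line) or smaps (one line per mapping).
pss_kib() {
    awk '$1 == "Pss:" { sum += $2 } END { print sum + 0 }' "$1"
}

# rss_kib FILE -> same for the "Rss:" lines.
rss_kib() {
    awk '$1 == "Rss:" { sum += $2 } END { print sum + 0 }' "$1"
}

# Live usage (assumes Linux with smaps_rollup): inspect the current shell.
if [ -r "/proc/$$/smaps_rollup" ]; then
    echo "shell $$: RSS=$(rss_kib "/proc/$$/smaps_rollup") kB PSS=$(pss_kib "/proc/$$/smaps_rollup") kB"
fi
```

The match on the exact field name matters: newer kernels also emit lines such as Pss_Anon and Pss_File, which would double-count if matched by prefix.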

Reading free and the Page Cache

Linux treats unused RAM as wasted RAM. It fills almost all otherwise-idle memory with page cache (copies of recently read files) and hands those pages back as soon as a process needs them; clean cache pages can be dropped immediately, while dirty ones must be written to disk first. This is why the free figure on a healthy server looks alarmingly low, and why older versions of free, which counted cache as used, showed an alarmingly high used figure. The number that actually matters is available, the kernel's own estimate of how much a new process could allocate without swapping, computed from reclaimable cache plus genuinely free pages.

On any kernel from the last decade, read the available column and ignore free for capacity decisions. The buff/cache figure is reclaimable, not lost. The signals of real memory pressure are different: rising si/so (swap in/out) in vmstat meaning the system is paging to disk, and entries in journalctl -k from the OOM killer naming a process it terminated to reclaim memory.

# Human-readable, with the column that matters
free -h
#               total   used   free  shared  buff/cache  available
# Mem:           7.8Gi  1.9Gi  220Mi   140Mi       5.7Gi      5.5Gi
# free=220Mi looks scary; available=5.5Gi is the real headroom

# Did the kernel kill something for memory?
journalctl -k --since "1 hour ago" | grep -i "out of memory\|oom-kill"
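
The available column comes from the MemAvailable field in /proc/meminfo, so the same headroom check can be scripted. This is a minimal sketch: mem_available_pct is a hypothetical helper name, and any alert threshold built on top of it would be a local policy choice rather than something the kernel defines.

```shell
# mem_available_pct [FILE] -> MemAvailable as a whole-number percentage of MemTotal.
# FILE defaults to /proc/meminfo; pass another path to analyse a saved copy.
mem_available_pct() {
    awk '$1 == "MemTotal:"     { total = $2 }
         $1 == "MemAvailable:" { avail = $2 }
         END { if (total > 0) printf "%.0f\n", 100 * avail / total }' "${1:-/proc/meminfo}"
}

# Live usage (assumes Linux).
if [ -r /proc/meminfo ]; then
    echo "available: $(mem_available_pct)% of RAM"
fi
```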

Load Average vs CPU% vs iowait

Load average — the run-queue length over 1/5/15 minutes, including D-state I/O waiters. Use it as a trend and a first alarm, never as a CPU-percentage; always divide by core count and compare against actual utilization.

CPU utilization — the us+sy percentage from vmstat/mpstat. This is the real answer to "are the cores busy?"; reach for it the moment load looks high, and check per-core to catch one saturated thread.

iowait — the wa percentage: cores idle while waiting on storage or network. High wa with low us/sy means the bottleneck is disk or the network, not the CPU — adding vCPUs will not help.
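
The three-way comparison above can be written down as a small decision rule. The sketch below is illustrative only: classify_bottleneck is a hypothetical helper name, and the cut-offs (5% steal, 20% iowait, 70% busy) are assumed round numbers for the example, not thresholds taken from any tool or from this topic.

```shell
# classify_bottleneck LOAD CORES US SY WA ST
# Prints one of: steal, io, cpu, inconclusive.
# All thresholds are illustrative assumptions; tune them for your fleet.
classify_bottleneck() {
    awk -v load="$1" -v cores="$2" -v us="$3" -v sy="$4" -v wa="$5" -v st="$6" 'BEGIN {
        busy = us + sy
        per_core = load / cores
        if (st >= 5) {
            print "steal"          # hypervisor is withholding CPU time
        } else if (wa >= 20 && busy < 50) {
            print "io"             # cores idle while tasks wait on I/O
        } else if (per_core >= 1 && busy >= 70) {
            print "cpu"            # queue at least as long as core count, cores busy
        } else {
            print "inconclusive"   # look per-core (mpstat) or over a longer window
        }
    }'
}

# Example: 8 cores, load 9.0, mostly idle CPUs but 40% iowait.
classify_bottleneck 9.0 8 4 2 40 0
```

Steal is checked first because stolen time distorts every other column; the order of the remaining checks follows the same reasoning as the definitions above.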

Common Mistakes

Reading load average as a CPU percentage and scaling up vCPUs — when the load is high because processes are stuck in D-state on a slow disk, more cores change nothing and the spend is wasted.
Panicking at a low free figure and provisioning a bigger instance — the RAM is page cache the kernel reclaims on demand; the only honest headroom number is available.
Summing RSS across processes to estimate total memory use — shared libraries are counted in every process's RSS, so the total overstates reality; use PSS from smaps_rollup when the numbers must add up.
Treating VSZ as "memory used" — virtual size includes reserved-but-untouched mappings and routinely shows gigabytes for a process holding tens of megabytes of real RAM.
Ignoring the st (steal) column on a cloud VM — sustained steal means the hypervisor is handing your CPU time to a neighbor, and no tuning inside the guest fixes a noisy-host problem.
Chasing a one-second CPU spike in top instead of sustained pressure — a single sample catches transient bursts; trends from vmstat 1 or the 5-minute load average tell you whether it is real.
Discovering an OOM kill only when the application restarts — without checking journalctl -k for the OOM killer's message, a memory leak looks like a random crash for days.

Best Practices

Divide load average by nproc before reacting, then confirm with vmstat 1 — load over core count near or above 1.0 with high us+sy is real CPU saturation; high load with high wa is an I/O problem.
Read the available column in free -h, not free or used, for every memory capacity decision; treat buff/cache as reclaimable, not consumed.
Run mpstat -P ALL 1 when the load is high but aggregate CPU looks moderate — it exposes a single pinned core that a system-wide average hides.
Use strace -c -p <pid> when system time rivals user time to find the dominant syscall before changing application code.
Watch the st column on every cloud VM and escalate to a larger or dedicated instance class when steal stays above a few percent — the fix is outside the guest.
Grep journalctl -k for OOM-killer messages as the first step in any "the service died for no reason" investigation; the kernel logs exactly which process it terminated and why.
Use PSS from smem or /proc/<pid>/smaps_rollup when tracking a leak or sizing identical workers, since PSS is the only per-process figure that sums to the machine's real usage.

Comparable tools

Windows: Performance Monitor (perfmon) and Task Manager; load average has no direct equivalent, and the OS reports committed vs cached memory differently
macOS: Activity Monitor and top; "memory pressure" is the analog of Linux's available, and there is no /proc
BSD: top, vmstat, and systat; similar metrics with different column names and no Linux-style page-cache accounting

Knowledge Check

An 8-core server shows a 1-minute load average of 9.0, but vmstat reports us=4, sy=2, wa=40, id=54. What is the bottleneck?

Disk or network I/O — the high load comes from processes in uninterruptible sleep (counted in load), while iowait is 40% and the CPUs are mostly idle
CPU saturation — a load of 9.0 against only 8 cores means every core is overcommitted by at least one extra runnable thread, so add vCPUs to drain the oversized run queue
A memory leak forcing constant swap-in and swap-out, which pins the cores at high utilization and always drives load above the core count
Hypervisor steal starving the guest of physical CPU time, which inflates the load even though the idle counter reads high

free -h shows used 1.9Gi, free 220Mi, buff/cache 5.7Gi, available 5.5Gi on an 8Gi box. How much memory can a new process safely allocate?

About 5.5Gi — available is the kernel's estimate of allocatable memory, since most of buff/cache is reclaimable page cache
About 220Mi — only the free column counts as genuinely usable, since buff/cache is already committed to other processes and cannot be reclaimed
About 1.9Gi — a new process can reuse only as much as the used column reports, because that is the live working set the kernel hands back
Zero — at 220Mi free the system is already in memory pressure and will start the OOM killer

Why does summing the RSS of every process overstate the machine's real memory usage?

Shared pages such as common libraries are counted in the RSS of every process that maps them, so PSS (which divides shared pages) is the figure that sums correctly
RSS folds in swapped-out pages that no longer occupy physical RAM, so every process's resident figure is inflated by anonymous memory the kernel has already paged out to the swap device
RSS reports the full virtual address space, including large reservations and mmap regions the process mapped but never actually touched or faulted in
RSS double-counts page cache, which the kernel charges to each process that read the underlying files rather than tracking it once system-wide

On a cloud VM, vmstat shows sustained st=15 while us and sy are low and the application is slow. What is the correct response?

The hypervisor is giving your vCPU time to other tenants — move to a larger or dedicated instance class, since no in-guest tuning recovers stolen time
Renice the application to a lower nice value so the in-guest scheduler favors it ahead of every other local process and the extra priority outruns the steal time
Add more vCPUs to the guest so there is spare capacity to absorb the steal and keep the application threads scheduled on the host
Disable the page cache so the freed memory bandwidth can offset the lost CPU cycles and the application stops stalling on steal
