Monitoring Processes
Process monitoring is the act of reading the live process table — each task's state, CPU share, memory footprint, and parent/child relationships — to answer one question: what is running and what is it costing. The kernel keeps this table in memory and projects it through the /proc filesystem, where every running task has a directory named after its PID. Every tool here, from ps to htop, is a different way of reading and formatting those same files.
The operational consequence is the difference between a tool that snapshots and one that refreshes, and between columns that look alike but mean very different things. ps reads the table once; top re-reads it every 3 seconds by default and htop every 1.5. Read the memory columns wrong (VSZ as "memory used", or RSS summed across processes) and you will overstate real usage by gigabytes and chase the wrong problem at 2 a.m.
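The /proc projection described above can be read without any monitoring tool at all. The following is a minimal sketch, assuming a Linux host with /proc mounted; the helper name proc_summary is hypothetical and only pulls a handful of fields from a task's status file.

```shell
#!/bin/sh
# proc_summary PID: print name, state, parent PID and resident memory
# straight from /proc/PID/status (Linux only; the helper name is illustrative).
proc_summary() {
    grep -E '^(Name|State|PPid|VmRSS):' "/proc/$1/status"
}

# $$ is the PID of the running shell, so this inspects the script itself.
proc_summary "$$"
```

ps, top, and htop gather the same kind of data for every PID directory and format it into columns.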
ps — the snapshot
ps prints one frozen view of the process table. It carries two historical syntaxes: BSD style, where options take no dash (ps aux), and UNIX style, where they do (ps -ef). Both list every process, but the column sets differ — aux shows %CPU, %MEM, VSZ, and RSS, while -ef shows the parent PID and start time. On Debian and Ubuntu both come from the procps package; the same binary behaves the same on RHEL-family hosts.
For scripts, the real power is -o, which names exactly the columns you want, and --sort, which has ps order the rows itself. A sorted, custom-column snapshot is something you can drop straight into a ticket or a cron job without piping through sort.
# top memory consumers, custom columns, sorted descending
ps -eo pid,ppid,stat,vsz,rss,comm --sort=-rss | head
# filter to one program for a script
ps -C nginx -o pid,rss,etime
top — the live view
top ships with procps-ng and is present on essentially every Debian and Ubuntu host. It refreshes every 3 seconds by default (change it with -d), and the header reports the three load averages plus a task and CPU-state summary. Press Shift+P to sort by CPU, Shift+M to sort by memory, and k to send a signal to a PID without leaving the tool.
The summary line is where most misreadings happen. The CPU-state row breaks time into us (user), sy (system/kernel), id (idle), and wa (waiting on I/O); a high wa means the box is blocked on disk, not short of CPU. By default %CPU is per-core in Irix mode, so a single-threaded process can show 100% while seven other cores sit idle; toggle Shift+I to express it as a fraction of the whole machine.
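The wa figure in the CPU-state row can be approximated by hand from the aggregate cpu line of /proc/stat. This is a sketch under stated assumptions: the helper name iowait_pct is hypothetical, the field order (user, nice, system, idle, iowait, irq, softirq, steal) is the Linux one, and the result will not match top digit for digit because the sampling windows differ.

```shell
#!/bin/sh
# iowait_pct BEFORE AFTER: given two aggregate "cpu" lines from /proc/stat,
# print the share of elapsed CPU time spent in iowait between them.
# Counters after the "cpu" label: user nice system idle iowait irq softirq steal.
iowait_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        { total[NR] = 0
          for (i = 2; i <= 9; i++) total[NR] += $i   # user through steal
          iow[NR] = $6 }                             # iowait is the 5th counter
        END { dt = total[2] - total[1]
              if (dt <= 0) { print "0.0"; exit }
              printf "%.1f\n", 100 * (iow[2] - iow[1]) / dt }'
}

# Live use on Linux: sample the first line of /proc/stat one second apart.
if [ -r /proc/stat ]; then
    before=$(head -n 1 /proc/stat)
    sleep 1
    after=$(head -n 1 /proc/stat)
    iowait_pct "$before" "$after"
fi
```

A persistently high value here, alongside low user and system time, points the same way as a high wa in top: tasks are waiting on I/O rather than competing for CPU.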
htop — the practical default
htop is a separate package (apt install htop) and is the better interactive tool. It draws per-core meters, scrolls horizontally and vertically through the full table, supports incremental search with F3 and filtering with F4, and sends signals through a menu with F9 instead of asking you to remember signal numbers.
Its best feature for debugging is the tree view (F5), which lays out the parent/child hierarchy so you can see that a runaway worker belongs to a specific systemd unit or container, not floating free. Recent versions can also show each task's cgroup as a column and sort on it, which makes "what is this container actually consuming" a glance rather than an investigation. btop and glances push the same idea further with graphs and historical sparklines.
Reading memory columns
Three memory figures appear across these tools, and they answer different questions. VSZ (VIRT in top) is the total virtual address space the process has mapped, including shared libraries it never faults in and memory it reserved but never touched. It routinely runs into gigabytes and almost never reflects real consumption.
RSS (RES) is the resident set — the pages actually present in physical RAM right now — and it is the number to read when asking how much memory a process occupies. The catch is shared pages: a process's RSS includes shared libraries like libc that are counted in every process's RSS, so summing RSS across processes double-counts that shared memory. For a true per-process figure that splits shared pages fairly, read PSS (proportional set size) from /proc/<pid>/smaps_rollup.
| Column | Meaning | Read it for |
|---|---|---|
| VSZ / VIRT | Total virtual address space mapped | Almost never; it overstates real use |
| RSS / RES | Pages currently resident in RAM | A process's actual footprint |
| SHR | Resident pages shared with others | Estimating private vs shared memory |
| PSS | RSS with shared pages divided fairly | Summing memory across many processes |
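Summing PSS across a group of processes might look like the sketch below. It assumes a Linux kernel that provides /proc/<pid>/smaps_rollup and a pgrep binary; the helper names pss_kb and pss_total_for are hypothetical, and nginx is only an example program name. Reading another user's smaps_rollup usually requires root.

```shell
#!/bin/sh
# pss_kb FILE...: sum the "Pss:" lines (reported in kB) of the given
# smaps_rollup files and print the total in kB. The exact match on "Pss:"
# skips related lines such as Pss_Anon: on newer kernels.
pss_kb() {
    awk '$1 == "Pss:" { sum += $2 } END { print sum + 0 }' "$@"
}

# pss_total_for NAME: total PSS in kB for every process with that exact name.
# Unreadable or vanished entries are skipped and count as zero.
pss_total_for() {
    total=0
    for pid in $(pgrep -x "$1"); do
        f="/proc/$pid/smaps_rollup"
        if [ -r "$f" ]; then
            kb=$(pss_kb "$f" 2>/dev/null)
            total=$((total + ${kb:-0}))
        fi
    done
    echo "$total"
}

pss_total_for nginx
```

Because each shared page is divided among the processes that map it, this total does not double-count libraries the way a summed RSS column does.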
CPU% and load average
In top and htop, the %CPU column is a share measured since the last refresh, not a cumulative total, which is why it jumps around. ps computes its %CPU differently, as total CPU time divided by the time the process has been running, so a single ps snapshot can miss a spike entirely. A momentary 90% during a refresh tick is meaningless; sustained 90% across a minute of top is the signal. Chase sustained usage, not a transient flicker.
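An interval-based %CPU figure of the kind a refreshing tool shows can be approximated from two readings of a process's CPU time. The sketch below assumes the Linux /proc/<pid>/stat layout, where utime and stime are fields 14 and 15 in clock ticks; the helper names cpu_ticks and interval_cpu_pct are hypothetical.

```shell
#!/bin/sh
# cpu_ticks PID: print utime + stime (in clock ticks) from /proc/PID/stat.
# The command name in field 2 may contain spaces, so everything up to the
# last ")" is stripped first; utime and stime then become fields 12 and 13.
cpu_ticks() {
    sed 's/^.*) //' "/proc/$1/stat" | awk '{ print $12 + $13 }'
}

# interval_cpu_pct TICKS_BEFORE TICKS_AFTER SECONDS HZ: percent of one core
# used over the interval, comparable to a per-core (Irix mode) reading.
interval_cpu_pct() {
    awk -v a="$1" -v b="$2" -v s="$3" -v hz="$4" \
        'BEGIN { printf "%.1f\n", 100 * (b - a) / (hz * s) }'
}

# Sample this shell over one second (Linux only).
if [ -r "/proc/$$/stat" ]; then
    hz=$(getconf CLK_TCK)
    t0=$(cpu_ticks "$$"); sleep 1; t1=$(cpu_ticks "$$")
    interval_cpu_pct "$t0" "$t1" 1 "$hz"
fi
```

Shortening the interval makes the number more responsive and noisier, which is the same trade-off as lowering the refresh delay in top.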
Load average is the most misread number on the system. The three figures are the 1, 5, and 15-minute exponentially-weighted counts of tasks that are runnable or in uninterruptible (D-state) sleep — not a percentage. The reference point is core count: a load of 4.0 on a 4-core box means it is fully busy with no queue, while 4.0 on a single core means tasks are waiting three-deep. Because D-state I/O waits count, a load can climb past core count while CPUs sit idle behind a stuck disk.
# load average and core count side by side
cat /proc/loadavg
nproc
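To turn those two numbers into a single comparison, the load can be divided by the core count. This is a small sketch with a hypothetical helper name, load_per_core; the thresholds in the comments restate the rule of thumb from the text and are not hard limits.

```shell
#!/bin/sh
# load_per_core LOAD CORES: print the load average divided by core count.
# Around 1.00 means fully busy; well above 1.00 means queued or I/O-blocked tasks.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# Live use on Linux: the first field of /proc/loadavg is the 1-minute figure.
if [ -r /proc/loadavg ]; then
    read -r one five fifteen rest < /proc/loadavg
    load_per_core "$one" "$(nproc)"
fi
```

A load of 4.0 on four cores gives 1.00, while the same load on one core gives 4.00. Check the CPU-state row before blaming the CPU, since D-state tasks also raise the figure.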
VSZ (virtual size) — every byte of address space the process has mapped, including unfaulted reservations and shared libraries. Read it only when you care about address-space limits or a mapping leak; it overstates real memory, often by an order of magnitude.
RSS (resident set size) — the pages actually in physical RAM. This is the column for "how much memory does this process use" — but it counts shared library pages in every process, so summing RSS across a fleet of workers double-counts the shared parts. Sum PSS instead when you need a true total.
- Reading VSZ (VIRT) as "memory used" — it is reserved address space, much of it never faulted into RAM; a process showing 12 GB VSZ may hold 200 MB resident.
- Summing the RSS column across processes and reporting the total as memory used — shared pages such as libc are counted once per process, so the sum overcounts, sometimes by gigabytes; sum PSS for a true figure.
- Reading load average as a percentage — a load of 4.0 is "four busy" and means a fully loaded 4-core box or a three-deep queue on one core, not "4% busy".
- Concluding a CPU spike is over because one ps aux showed nothing; the single snapshot simply landed between bursts, and only a refreshing tool catches transients.
- Reading %CPU in top as a fraction of the whole machine when Irix mode makes it per-core; 100% there means one saturated core, and a multithreaded process can exceed 100%.
- Blaming high CPU for slowness when the top summary shows high wa and the load is full of D-state tasks; the box is blocked on I/O, not short of CPU.
- Leaving top on its default %CPU sort and missing a memory leak that a Shift+M sort would have surfaced at once.
- Read RSS, never VSZ, when answering how much RAM a process occupies, and sum PSS from /proc/<pid>/smaps_rollup when you need a fleet total.
- Compare load average against nproc every time: a load equal to core count is fully busy, and a load well above it means a run queue or stuck I/O.
- Watch a refreshing tool for a full minute and act on sustained usage, not a single transient %CPU reading.
- Reach for htop for interactive triage: the tree view (F5), per-core meters, search (F3), and the F9 kill menu beat raw top for finding and stopping the offending task.
- Capture a stable snapshot for tickets with ps aux --sort=-%mem (or -%cpu), and use ps -C name -o ... to filter by program in scripts.
- Check the CPU-state row in top: high wa points at storage I/O, such as a slow disk or network filesystem, not a CPU shortage, and changes which subsystem you investigate.
- Toggle Irix mode off with Shift+I in top when you want %CPU as a fraction of the whole machine rather than per-core.
- Get-Process in PowerShell for the same snapshot-and-live view
- macOS: Activity Monitor for the GUI, and the BSD-derived top on the command line
- glances / btop: richer terminal monitors with graphs, history, and sensors over the same /proc data
Knowledge Check
You add up the RSS column for every process and report it as total RAM in use. Why is the figure too high?
- RSS includes shared pages like libc, which are counted in every process's RSS, so shared memory is added multiple times
- RSS reports the entire virtual address space, most of which is reserved mappings that are never faulted into resident RAM
- RSS double-counts the kernel's own slab and page-cache memory in every process row, inflating the per-process figure
- RSS is measured in 4 KB pages, so the raw column numbers must be halved before they can be summed into a RAM total
A 4-core server shows a 1-minute load average of 8.0. What does that indicate?
- Roughly twice as many tasks are runnable or in uninterruptible I/O wait as there are cores — a run queue is building
- The CPUs are running at 8% utilization, which is comfortable headroom for a quad-core box under a normal daytime workload
- Eight cores are fully saturated, so the box is short by four cores and needs hardware added
- Memory is overcommitted by a factor of eight and the box is swapping heavily, which is what drove the figure up
A short CPU spike is reported, but a single ps aux shows nothing unusual. What is the better instrument?
- A refreshing tool like top, htop, or pidstat 1 that samples repeatedly and reveals sustained versus transient usage
- Another ps run with more -o columns such as etime and stat to widen the single snapshot
- Reading /proc/<pid>/cmdline for each running process in turn to identify which command was responsible for the spike
- Sorting the existing ps output by %CPU in descending order to surface the heaviest consumer
In top's default Irix mode on an 8-core host, a single-threaded process shows 100% CPU. What does that mean?
- It is saturating one full core, which is one-eighth of total machine capacity
- It is using all eight cores at full load, fully saturating the whole machine's CPU capacity
- It is consuming exactly 100% of available system memory
- It has hit a kernel-imposed per-process CPU cap set at one core's worth