Topic 73

strace, lsof, and Tracing

When a process is stuck and the metrics tell you that it is stuck but not why, you stop guessing and watch what it actually does. strace intercepts every system call a process makes — every open(), read(), connect(), futex() — and prints the arguments, the return value, and the errno when it fails. lsof answers the complementary question: which files, sockets, pipes, and devices does this process currently hold open. Between the two you can see the exact syscall a hung daemon is blocked on and the exact path it could not read.

These tools answer questions logs and top cannot: why a service that "started fine" is permission-denied on a config it can see, why a port is already in use when no listener is obvious, why disk space stays full after you deleted the log. The catch is cost — strace stops the target on every syscall through ptrace, which can slow a busy process by an order of magnitude or more. That is acceptable for an ad-hoc investigation and dangerous on a hot production path, which is why the eBPF tools exist alongside it.

Tracing Syscalls with strace

strace works through the kernel's ptrace interface: it attaches to a process, and the kernel halts that process and notifies strace on entry to and exit from every system call. You either launch a command under it or attach to a running PID with -p. Attaching requires that the target run as your own user or that you are root, so in practice it usually means sudo. On Ubuntu the Yama LSM additionally restricts ptrace to descendants by default, so attaching to an unrelated PID requires root or a relaxed kernel.yama.ptrace_scope.
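
One way to see which Yama mode a host is in before attaching is to read the sysctl directly. This is a minimal sketch assuming a Linux host; the mode descriptions paraphrase the kernel's Yama documentation, and the variable names are illustrative.

```shell
# Report the Yama ptrace restriction level, if Yama is active on this kernel.
scope_file=/proc/sys/kernel/yama/ptrace_scope

if [ -r "$scope_file" ]; then
    scope=$(cat "$scope_file")
else
    scope="absent"   # Yama not built in or not enabled
fi

case "$scope" in
    0) msg="classic: any process running as your user can be attached" ;;
    1) msg="restricted: only descendants, unless you are root" ;;
    2) msg="admin-only: CAP_SYS_PTRACE required" ;;
    3) msg="no attach: ptrace attach is disabled" ;;
    *) msg="Yama not active; standard ptrace checks apply" ;;
esac

report="ptrace_scope=$scope ($msg)"
echo "$report"
```

A host in mode 1 is the case where strace -p on an unrelated process fails without sudo.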

Raw output is a firehose, so the filters earn their keep. -e trace= narrows to a class or to named calls — -e trace=openat,read for file access, -e trace=network for sockets. -f follows children across fork and clone, which matters because the process you launched is often a wrapper and the real work happens in a child. -c drops the per-call trace for a summary table — count, time, and errors per syscall — the fastest way to see that a process spent its time in futex or burned a million failing stat calls.

# Attach to a running PID, follow children, timestamp each call
sudo strace -f -tt -p 4821

# Trace only the file-open and permission-check syscalls of a command you launch
strace -e trace=openat,access myapp --config /etc/myapp.conf

# Count syscalls instead of printing them — the time/error summary
sudo strace -f -c -p 4821

Reading the errno

The most useful thing in a trace line is the failure code at the end. A line like openat(AT_FDCWD, "/etc/myapp/cert.pem", O_RDONLY) = -1 EACCES (Permission denied) names the syscall, the exact path, and the reason in one place — no log statement required, because the kernel told you. The errno is the diagnosis: ENOENT means the file is genuinely missing (often a typo or a relative path resolved from the wrong directory), EACCES means it exists but the process lacks permission, EADDRINUSE means the port is taken, ECONNREFUSED means nothing is listening on the other end.

That turns "the service won't start and the log just says failed" into a specific, actionable fact. A process looping on EACCES for a directory it can list is missing the x traverse bit, not r. A daemon blocked forever in read() on a socket is waiting on a peer that never replies — a network problem, not a CPU one. Skipping past the errno to "restart it and see" throws away the one piece of evidence the trace was run to collect.
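
When a trace is long, it can help to tally the failures before reading them one by one. The helper below is a sketch, not part of strace: errno_summary is a hypothetical name, and it assumes the usual one-call-per-line output format, so calls split into unfinished and resumed halves under -f are not attributed to a syscall name correctly.

```shell
# errno_summary: hypothetical helper, not part of strace.
# Reads strace output on stdin and counts failing calls per syscall and errno.
errno_summary() {
    awk '
        # Failing calls end in " = -1 ERRNO (description)".
        match($0, / = -1 E[A-Z0-9]+ /) {
            errno = substr($0, RSTART + 6, RLENGTH - 7)
            call = $0
            sub(/\(.*/, "", call)          # drop the argument list
            n = split(call, parts, " ")    # skip any PID or timestamp prefix
            count[parts[n] " " errno]++
        }
        END { for (key in count) print count[key], key }
    ' | sort -k1,1nr -k2
}

# Demonstration on the trace line quoted in the text.
printf '%s\n' \
    'openat(AT_FDCWD, "/etc/myapp/cert.pem", O_RDONLY) = -1 EACCES (Permission denied)' |
    errno_summary
# prints: 1 openat EACCES
```

In practice the input would be a file written with strace -o, for example errno_summary < /tmp/myapp.trace, where the path is a placeholder.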

Open Files and Sockets with lsof

lsof, short for list open files, leans on the Unix premise that almost everything is a file: regular files, directories, sockets, pipes, and devices all show up the same way. lsof -p 4821 lists everything one process holds open; lsof /var/log/app.log lists every process holding that specific file, which is how you find what is keeping a filesystem busy or blocking an unmount. For networking, lsof -i :443 names the process using a port, which answers "address already in use", and lsof -i TCP -nP lists TCP sockets, listening and connected alike, without the slow DNS and port-name lookups.

The case that surprises people is deleted-but-not-freed space. When a process holds a file open and someone deletes it, the directory entry is gone but the inode and its blocks survive until the last descriptor closes, so df shows the disk full while du finds nothing to blame. lsof is the most direct way to connect the two: it lists the descriptor with a (deleted) marker and names the PID still holding it. The fix is to restart that process, or signal it to reopen its files, not to hunt for files that no longer have names.

# Which process is bound to port 443 — the "address in use" answer
sudo lsof -i :443 -nP

# Find deleted-but-held files still eating disk space
sudo lsof -nP +L1          # files whose link count is below 1
sudo lsof -nP | grep '(deleted)'

# Everything a single process has open
sudo lsof -p 4821
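
The deleted-but-held case is easy to reproduce safely with a scratch file. The sketch below assumes a Linux host with /proc mounted; the descriptor number 9 and the variable names are arbitrary choices for illustration.

```shell
# Reproduce a deleted-but-held file using a temporary scratch file.
scratch=$(mktemp)            # throwaway file; its name does not matter
exec 9>"$scratch"            # hold it open on descriptor 9
echo "still on disk" >&9
rm "$scratch"                # the directory entry is gone, the inode is not

# The descriptor still points at the unlinked inode; the kernel marks the
# link target with a " (deleted)" suffix, the same marker lsof reports.
held=$(readlink /proc/self/fd/9)
echo "$held"

exec 9>&-                    # closing the last descriptor frees the blocks
```

The blocks are released only at the final exec line, which is the small-scale equivalent of restarting the process that lsof names.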

The Overhead Caveat

strace is not a passive observer. Every traced syscall stops the process twice, on entry and on exit, and each stop costs a context switch into strace and back, so a process that makes millions of small reads can slow to a fraction of its normal speed; syscall-heavy workloads routinely show 10x to over 100x slowdowns under strace. On a quiet process you will not notice; on a database under load, attaching strace can push latency past timeouts, trip health checks, and cause the cascading failure you were trying to debug.

Treat it as a scalpel, not a monitor. Attach briefly, capture what you need, and detach. Narrow with -e trace= so you are not stopping the process for syscalls you do not care about. When the target is a production service handling real traffic, reach for an eBPF tool instead — it reads the same events through in-kernel probes without halting the process on every call.
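
One way to enforce "attach briefly" is to put a time limit on the attach itself. The wrapper below is a sketch under assumptions: trace_briefly is a hypothetical helper name, the 10-second default and the /tmp output path are arbitrary, and the network filter stands in for whichever syscall class you actually care about.

```shell
# trace_briefly: hypothetical helper, not part of strace.
# Attaches to PID for a bounded window, then detaches and prints the trace path.
trace_briefly() {
    pid=$1
    secs=${2:-10}                    # default window: 10 seconds (arbitrary)
    if [ -z "$pid" ]; then
        echo "usage: trace_briefly PID [SECONDS]" >&2
        return 2
    fi
    out="/tmp/strace.$pid.out"       # assumed scratch location
    # timeout sends SIGINT when the window closes; strace responds by detaching
    # and leaving the target running. Exit status 124 means the window expired.
    sudo timeout -s INT "$secs" strace -f -tt -e trace=network -o "$out" -p "$pid"
    echo "$out"
}

# Example: trace_briefly 4821 5
```

Writing to a file with -o keeps the trace out of the terminal, and the hard limit means a forgotten session cannot keep the target slowed down indefinitely.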

Beyond strace — ltrace, perf, and eBPF

strace sees the kernel boundary; it cannot see calls that stay in user space. ltrace fills that gap by intercepting library calls — it shows the malloc, strcmp, or SSL_read a program makes against shared libraries, which is where the logic often lives. perf goes the other way: perf trace is a lower-overhead syscall tracer, and perf top and perf record sample the CPU to show which functions actually burn cycles, kernel and user space together.

The production answer is eBPF — small, verified programs the kernel runs at trace points and kprobes with minimal cost. bpftrace gives you awk-like one-liners over those events, and the bcc tool collection (opensnoop, execsnoop, tcpconnect, biolatency) packages the common investigations. The defining difference is safety and overhead: the in-kernel verifier rejects any program that could loop forever or touch bad memory, so a bad probe cannot crash the box, and events are aggregated in the kernel instead of stopping the process on each one. Debian and Ubuntu ship kernels recent enough that bpftrace installs from apt with no custom kernel; the same holds for current RHEL-family kernels.

# Library calls a program makes — user space, where strace cannot see
ltrace -e 'malloc+free' myapp

# Lower-overhead syscall trace, tolerable under more load than strace
sudo perf trace -p 4821

# eBPF one-liner: every file opened, system-wide, at low overhead
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }'

# bcc tool: trace new processes as they exec
sudo execsnoop-bpfcc

strace vs eBPF tools

strace — stops the target on every syscall through ptrace. Universal, zero setup, reads every argument and errno in full. The cost is overhead: 10x–100x slowdown on syscall-heavy processes. Reach for it ad hoc, on a single misbehaving process, on a box where you can absorb the slowdown.

eBPF tools (bpftrace, bcc) — run verified programs in the kernel at trace points, aggregating events with minimal cost and no per-call halt. Reach for them on production services under real load, for system-wide questions ("every process that opens this file"), and where the strace slowdown would itself cause an incident. The trade-off is a steeper learning curve and version-dependent probe availability.

Common Mistakes

Attaching strace to a busy production process — the per-syscall halt can slow it 10x–100x, pushing request latency past timeouts and tripping the health checks that then restart the very process you were debugging.
Reading the trace for the call name but ignoring the errno at the end of the line — EACCES versus ENOENT is the entire diagnosis (permission denied versus file missing), and skipping it wastes the run.
Hunting for a deleted file with find when df says full but du finds nothing: a process still holds the unlinked inode open, and lsof's (deleted) marker reveals which PID to restart.
Tracing without -f when the launched command forks — the parent exits cleanly while the child that actually hit EACCES is never traced, so the trace looks successful and tells you nothing.
Forgetting Ubuntu's Yama ptrace_scope default — strace -p on a process that is not your child returns "Operation not permitted" even as your own user, and the fix is sudo, not chasing a phantom permissions bug in the target.
Leaving strace attached and walking away — it keeps the target stopped-and-resumed on every call for the whole session; detach the moment you have the evidence.
Reaching for strace to find a user-space hotspot — it only sees syscalls, so a CPU-bound loop that makes no kernel calls shows nothing; that is a job for perf or ltrace.

Best Practices

Read the errno first on any failing syscall — it names the exact problem (EACCES, ENOENT, EADDRINUSE) and the path, turning a vague "failed to start" into a one-line fix.
Narrow every strace with -e trace= to the syscall class you care about, so you stop the target far less and the output stays readable.
Add -f whenever the target forks — wrappers, shells, and supervisors hand the real work to a child, and without it you trace the wrong process.
Run strace -c for a syscall summary before reading the full trace — the count-and-time table points you at the hot or failing call instead of scrolling thousands of lines.
Use lsof -nP +L1 to reclaim disk that du cannot find — locate the deleted-but-held inode and restart the holding process to free the blocks.
Prefer bpftrace or the bcc tools over strace on any production service under load — the in-kernel verifier keeps a bad probe from crashing the box, and the overhead stays low enough to leave running.
Install tracing tools before the incident — apt install strace lsof bpftrace on the base image, so you are not adding packages to a degraded production host mid-outage.

Comparable tools

Windows Process Monitor (ProcMon) and ETW: event tracing for file, registry, and network activity
macOS dtruss and DTrace: syscall and dynamic kernel tracing on the BSD/Mach base
sysdig: a unified capture tool combining strace-style syscall visibility with lsof-style open-file listing

Knowledge Check

A production database is slow, and you attach strace -p to its main process to see what it's doing. Latency immediately spikes and health checks start failing. Why?

strace uses ptrace to halt the process on entry and exit of every syscall, and a syscall-heavy process can slow 10x–100x — enough to blow past timeouts
strace takes an exclusive advisory lock on every one of the database's open files, blocking the engine's own read and write I/O completely until you finally detach
Attaching with -p sends a single SIGSTOP that pauses the process and holds it frozen for the entire duration of the trace session
strace rewrites the running binary in memory to insert its tracing probes, and that rewrite corrupts the database's latency-sensitive hot path

df reports a filesystem 100% full, but du -sh on every directory accounts for only a fraction of the space. What is the most likely cause and the tool that finds it?

A process holds a deleted file open, so its inode and blocks survive until the descriptor closes — lsof +L1 lists the (deleted) entry and the PID to restart
The filesystem journal has filled the volume and an offline fsck pass is required before the kernel will reclaim the trapped space
Inode exhaustion is what is hiding the space here, and running df -i will report and then free the consumed inodes automatically
Sparse files report far more allocated space than their actual contents occupy on disk, and running du --apparent-size is what reconciles the difference that df is showing

You trace a wrapper script that launches a service, but the trace exits cleanly while the service still fails. What did you most likely forget?

-f, to follow children — the wrapper forks the real process, and without it strace never traces the child that actually fails
-c, the summary mode — without that flag strace quietly drops the failing syscalls from its per-call output and you never see them
sudo — without root privilege strace silently traces only the parent wrapper's very first syscall and then stops
-e trace=all — the default filter hides the fork syscall along with every one of the child process's calls

When would you reach for bpftrace or a bcc tool instead of strace?

On a production service under real load, or for a system-wide question — eBPF reads events through in-kernel probes with low overhead and the verifier prevents a bad probe from crashing the box
Whenever you need the exact arguments, the full return value, and the precise errno of a single failing syscall on one process, which plain strace by itself is simply unable to surface for you at all
On older kernels only, since eBPF requires a specially custom-compiled kernel with extra options that ordinary strace tracing avoids needing
When you want to trace user-space library calls such as malloc that never cross into the kernel and stay entirely inside the process
