Topic 78

eBPF

Kernel Observability

eBPF runs small, verified programs inside the running kernel, attached to events — a syscall entry, a packet arriving on a NIC, a function call deep in the kernel, a tracepoint. You compile a program to eBPF bytecode, the kernel's in-kernel verifier checks it cannot loop forever, dereference a wild pointer, or read memory it shouldn't, then a JIT compiles it to native code and runs it on every hit of that event. No kernel module, no reboot, no patched kernel — and if your program is unsafe, the verifier rejects it at load time instead of panicking the box at runtime.

That safety property is the whole story for a server operator. Before eBPF, instrumenting the kernel in production meant a custom module that ran with full ring-0 privileges and could take the machine down with one bad pointer, so nobody did it on a box that mattered. eBPF turns "instrument the kernel" from an outage risk into something you run on a loaded production host with single-digit-percent overhead — which is why the modern observability, networking, and runtime-security stack (bcc, bpftrace, Cilium, Falco, Pixie) is built on it.

The Core Idea

eBPF is a tiny virtual machine in the kernel: eleven 64-bit registers, a 512-byte stack, a fixed instruction set, and, in early kernels, no loops at all (verifier-checked bounded loops arrived in kernel 5.3). A program is attached to a hook and runs in the context of whatever triggered it, reading the event's arguments and the kernel state around it. Because it runs in kernel space, it sees everything the kernel sees, but it cannot do everything the kernel can, and that constraint is enforced, not trusted.

The verifier is what makes this safe rather than reckless. At load time it walks every possible path through the program, proves it terminates, proves every memory access is in-bounds, and rejects anything it cannot prove. A program that passes should not be able to crash the kernel, short of a bug in the verifier itself; a program the verifier can't reason about is refused, even if it would have been fine. The result is a genuinely new capability: arbitrary, operator-supplied code running in ring 0 that is safe by construction, not by careful review.

What It Powers

Observability is the entry point most people meet first. Tools like bpftrace and the bcc collection attach to tracepoints, kprobes, and uprobes to answer questions that no /proc file exposes — per-process disk latency histograms, which files a process opens, off-CPU time, syscall error rates — all without changing the application or restarting anything. The data is gathered in the kernel and only summaries cross into user space, so the overhead stays low even under load.

Networking is the second large domain. eBPF programs at the XDP hook run in the network driver before the kernel even builds an sk_buff, which is how DDoS scrubbers and load balancers drop or redirect millions of packets per second per core. Cilium replaces iptables-based Kubernetes networking and policy with an eBPF dataplane. The third domain, runtime security, uses tracing and LSM hooks to observe syscalls and other kernel events in real time and, where the tool supports enforcement, to block them; Falco (detection) and Tetragon (detection and enforcement) are the usual examples.

Hooks and Maps

A hook is where a program attaches. The common ones are kprobes (any kernel function entry or return, fragile across kernel versions because internal symbols change), tracepoints (stable, kernel-maintained instrumentation points — prefer these), uprobes (user-space function entry, for tracing applications), XDP and tc (network ingress/egress), and LSM hooks (security decisions). The choice of hook decides both what you can see and how portable your program is across kernel versions.
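One way to act on the tracepoint-first advice is to check whether a tracepoint exists before falling back to a kprobe. The sketch below assumes tracefs is mounted at /sys/kernel/tracing (older systems expose it under /sys/kernel/debug/tracing) and that the shell can read it, which normally means root. The name tracepoint_exists is a hypothetical helper, not part of any tool.

```shell
#!/bin/sh
# Hypothetical helper: succeed if <category>/<event> is a known tracepoint.
# $1 = category, $2 = event name, $3 = optional tracefs mount point
# (the default below is an assumption about where tracefs is mounted).
tracepoint_exists() {
    root=${3:-/sys/kernel/tracing}
    [ -d "$root/events/$1/$2" ]
}

if tracepoint_exists syscalls sys_enter_openat; then
    echo "use tracepoint:syscalls:sys_enter_openat"
else
    echo "tracepoint not visible (absent, or tracefs unreadable without root)"
fi
```

Running bpftrace -l 'tracepoint:syscalls:*' as root lists the same tracepoints through bpftrace itself.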

Maps are the other half of the model: typed key/value stores that live in the kernel and are shared between the eBPF program and user space, and between separate eBPF programs. A program writes counts, histograms, or per-PID state into a map on every event; a user-space process reads the map periodically to render output. This split — heavy per-event work in the kernel, cheap aggregated reads from user space — is exactly why eBPF tools stay cheap where strace does not.

# count system calls by name across the whole machine for 10s
bpftrace -e 'tracepoint:syscalls:sys_enter_* { @[probe] = count(); } interval:s:10 { exit(); }'

# histogram of block I/O latency in microseconds (bcc tool; packaged as biolatency-bpfcc on Debian/Ubuntu)
biolatency

# which processes are opening which files, live (opensnoop-bpfcc on Debian/Ubuntu)
opensnoop

The Tooling

bpftrace is the awk of kernel tracing: a one-line language for ad-hoc questions, ideal when you don't yet know what you're looking for. The bcc project ships dozens of ready-made tools (execsnoop, opensnoop, biolatency, tcpconnect, runqlat) that answer specific production questions out of the box. On Debian and Ubuntu both arrive via apt install bpftrace bpfcc-tools; on the RHEL family it is dnf install bpftrace bcc-tools. Both need root, or CAP_BPF plus CAP_PERFMON on kernels 5.8 and newer.
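To see whether the current shell already holds the two capabilities, one option is to decode the effective capability mask that the kernel prints in /proc/<pid>/status. This is a minimal sketch: cap_is_set is a hypothetical helper, and the bit numbers (38 for CAP_PERFMON, 39 for CAP_BPF) are taken from linux/capability.h on kernels 5.8 and newer.

```shell
#!/bin/sh
# Hypothetical helper: succeed if bit $2 is set in hex capability mask $1.
# The mask format matches the CapEff line in /proc/<pid>/status.
cap_is_set() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# Effective capabilities of this shell; empty if /proc is unavailable.
capeff=$(awk '/^CapEff:/ { print $2 }' "/proc/$$/status" 2>/dev/null)

# Bit 39 = CAP_BPF, bit 38 = CAP_PERFMON (assumed from linux/capability.h).
if [ -n "$capeff" ] && cap_is_set "$capeff" 39 && cap_is_set "$capeff" 38; then
    echo "CAP_BPF and CAP_PERFMON are both effective"
else
    echo "missing CAP_BPF and/or CAP_PERFMON (or kernel older than 5.8)"
fi
```

On a pre-5.8 kernel neither bit exists, so even root reports them as missing; there the tools still need full root.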

The reason these tools beat strace under load is mechanical, not cosmetic. strace uses ptrace, which stops the target twice per syscall (once on entry, once on exit) to hand control to the tracer; that can slow a syscall-heavy process by 100x or more, enough to change its behavior or trip timeouts. An eBPF program runs in-kernel at the hook with no context switch to a tracer and aggregates in a map, so tracing every syscall on a busy server typically costs a few percent. For a one-off look at a single process, strace is still quicker to reach for; for anything running on a host under real traffic, eBPF is the safer choice.

Kernel Version and Portability

eBPF features are kernel-version-gated, and the gap between distributions is large. The verifier, the available hook types, bounded loops (5.3), BPF LSM (5.7), and the CO-RE / BTF machinery that lets one compiled program run across kernels all landed in specific releases. A Debian 12 kernel (6.1) or Ubuntu 22.04 (5.15) has a rich feature set; an old enterprise 4.x kernel has eBPF but a far narrower one. Check the running kernel before assuming a tool will load.
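A rough version gate can be scripted from uname -r. The sketch below compares only the leading major.minor of the release string against the releases named above; kernel_at_least and report are hypothetical helper names. Treat the result as a hint rather than proof, since enterprise kernels often backport eBPF features into an older version number.

```shell
#!/bin/sh
# Hypothetical helper: succeed if release string $1 is at least $2.$3.
kernel_at_least() {
    ver=${1%%-*}               # "5.15.0-91-generic" -> "5.15.0"
    major=${ver%%.*}           # -> "5"
    rest=${ver#*.}             # -> "15.0"
    minor=${rest%%.*}          # -> "15"
    minor=${minor%%[!0-9]*}    # drop trailing non-digits such as "+"
    [ "$major" -gt "$2" ] || { [ "$major" -eq "$2" ] && [ "${minor:-0}" -ge "$3" ]; }
}

# Print a one-line verdict for a feature and its minimum release.
report() {
    if kernel_at_least "$(uname -r)" "$2" "$3"; then
        echo "$1: running kernel is new enough"
    else
        echo "$1: running kernel looks too old"
    fi
}

report "bounded loops (5.3)" 5 3
report "BPF LSM (5.7)" 5 7
report "CAP_BPF/CAP_PERFMON (5.8)" 5 8
```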

Portability used to mean compiling the eBPF program on the target host against its exact kernel headers, because kprobe-attached programs read kernel structs whose layout shifts between versions. CO-RE (Compile Once, Run Everywhere), built on BTF type information and libbpf relocations, removed that: a single compiled object adapts to the running kernel's struct layout at load time. Tools built on libbpf with CO-RE ship as one binary; older bcc tools still carry a compiler and kernel headers and recompile on first run, which is why they pull in clang and the headers as dependencies.
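Whether a host can load CO-RE objects comes down to whether the kernel exposes its own BTF. A minimal check, assuming the standard sysfs location /sys/kernel/btf/vmlinux, might look like this; has_btf is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the kernel's BTF blob is present.
# $1 optionally overrides the path (default is the usual sysfs location).
has_btf() {
    [ -e "${1:-/sys/kernel/btf/vmlinux}" ]
}

if has_btf; then
    echo "BTF present: libbpf/CO-RE objects can relocate against this kernel"
else
    echo "no BTF: expect to need kernel headers and a compile on the host"
fi
```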

eBPF vs Kernel Modules vs strace

eBPF — verified, sandboxed bytecode attached to kernel hooks, JIT-compiled and run in ring 0 without a reboot. Use it to instrument or extend the kernel on a production host: low overhead, safe by construction, removable at runtime. This is the default for anything kernel-level that has to run on a box that matters.

Kernel modules — arbitrary code loaded into the kernel with full privileges and no verifier. Use them only when you genuinely need unrestricted kernel access (a real device driver, a filesystem) — a bad pointer panics the machine, and that risk is exactly what pushed instrumentation off modules and onto eBPF.

strace — a user-space tool using ptrace to intercept one process's syscalls, no kernel code at all. Use it for a quick, ad-hoc look at a single process; never on a process under production load, where its 100x slowdown changes behavior.

Common Mistakes

Assuming eBPF needs a custom or patched kernel. Every mainstream distro kernel — Debian, Ubuntu, RHEL — ships eBPF enabled; you install bpftrace/bpfcc-tools from the repos, you do not rebuild the kernel.
Expecting strace-level simplicity. eBPF has a real learning curve — hooks, maps, the verifier's rules — and a program that compiles can still be rejected at load, which surprises people coming from "just run the tool."
Writing kprobe-based tools against internal kernel functions and shipping them across a fleet, then watching them break on the next kernel because the symbol was renamed or inlined. Prefer stable tracepoints, or CO-RE programs on BTF kernels.
Treating eBPF programs as unrestricted kernel code. The verifier caps program complexity, bounds every loop and memory access, and limits stack to 512 bytes — a "valid C" program routinely fails to load, and the fix is rewriting to satisfy the verifier, not disabling it.
Running bcc tools on a host without the matching kernel headers and being puzzled when they fail to compile on first run. The older bcc tools recompile against the running kernel; libbpf/CO-RE tools avoid this, the bcc ones do not.
Reaching for strace on a busy production process to "see what it's doing" and slowing it 100x, tripping its timeouts or load-balancer health checks. Use bpftrace or a bcc tool there instead.
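The missing-headers failure above can be caught before a bcc tool is run. The sketch below assumes headers are installed at the conventional /lib/modules/<release>/build location, which is normally a symlink created by the distribution's headers package; has_kernel_headers is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: succeed if build headers exist for kernel release $1.
# $2 optionally overrides the modules root (default /lib/modules).
has_kernel_headers() {
    [ -d "${2:-/lib/modules}/$1/build" ]
}

release=$(uname -r)
if has_kernel_headers "$release"; then
    echo "headers found for $release: bcc tools should be able to compile"
else
    echo "no headers for $release: install the matching headers package first"
fi
```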

Best Practices

Reach for eBPF tools — bpftrace, bcc — for any tracing on a host under real traffic; reserve strace for a quick look at a single idle or low-traffic process.
Attach to tracepoints rather than kprobes whenever a tracepoint exists; tracepoints are a stable kernel-maintained API, kprobes break across versions.
Check the running kernel with uname -r before assuming a tool loads, and confirm BTF support (/sys/kernel/btf/vmlinux exists) before relying on CO-RE portability.
Install the curated bcc collection (apt install bpfcc-tools on Debian/Ubuntu, dnf install bcc-tools on RHEL) and learn execsnoop, opensnoop, biolatency, and tcpconnect before writing your own programs.
Grant CAP_BPF plus CAP_PERFMON instead of full root to a tracing agent on kernels 5.8 and newer; it is the least-privilege path to running eBPF.
Prefer libbpf/CO-RE tools for anything you deploy across a fleet, so one compiled binary runs on every kernel instead of shipping a compiler and headers to each host.

Comparable tools

DTrace: the Solaris/BSD/macOS tracing framework that eBPF's observability tools consciously reimplement; safe in-kernel instrumentation predates eBPF here
Windows: ETW for tracing, plus the eBPF for Windows project bringing the same model to NT
Cilium: eBPF-based Kubernetes networking and policy, replacing iptables-based dataplanes

Knowledge Check

Why can eBPF run operator-supplied code in the kernel without risking a panic, where a kernel module cannot?

An in-kernel verifier proves at load time that the program terminates and never makes an out-of-bounds memory access, rejecting anything it cannot prove
eBPF programs run entirely in user space rather than the kernel, so any fault is contained to the tracing process that loaded them
The kernel spins up each eBPF program inside its own separate hardware virtual machine, complete with isolated page tables and memory protection enforced directly by the CPU
eBPF programs are restricted to strictly read-only access and are never allowed to execute on real kernel events like kprobes

On a production server handling heavy syscall traffic, why is a bcc/bpftrace tool preferred over strace?

strace uses ptrace and stops the target twice per syscall, which can slow it 100x and change its behavior; eBPF aggregates in-kernel with single-digit-percent overhead
eBPF tools surface far more syscall detail, yet under heavy load both impose essentially the same per-syscall runtime cost on the traced process, so the choice is purely about output
strace physically cannot follow more than a single process at a time, while eBPF tools trace the whole machine at once
strace requires loading a custom kernel module and a reboot first, which is far too disruptive on a running production host

A kprobe-based tool that worked on one host fails to load after a kernel upgrade on another. What is the most likely cause and the better design?

kprobes attach to internal kernel functions whose names and struct layouts shift between versions; a stable tracepoint or a CO-RE/BTF program is portable across kernels
The newer kernel disabled eBPF support entirely during the upgrade, so any eBPF tool now needs the whole kernel rebuilt from source with the feature flag re-enabled before it loads
kprobes require a custom-patched kernel that the upgrade silently overwrote; switching to uprobes on the same functions avoids all patching
The tool exhausted the kernel's fixed hard limit of one loaded eBPF program per host and must now be the only program running

What does the eBPF verifier actually constrain in a program?

It bounds loops and program complexity, requires every memory access to be provably in-bounds, and caps the stack — valid C can still be rejected if it cannot prove safety
It limits the program to a small fixed whitelist of approved high-level helper functions and flatly forbids it from reading any kernel data structures whatsoever at runtime
It only checks that the program object is signed by a trusted vendor key, leaving the program's actual runtime behavior entirely unchecked
It enforces a strict maximum runtime measured in milliseconds and forcibly kills any program that exceeds that budget while it runs
