Topic 02

The Kernel and User Space

Foundations

Linux splits every running machine into two zones separated by a hardware privilege boundary. Kernel space runs in CPU ring 0 with unrestricted access to memory, devices, and the instruction set. User space — every daemon, shell, and application — runs in ring 3, sandboxed, unable to touch hardware or another process's memory without asking the kernel first. The CPU enforces the split; it is not a convention the kernel can wish away.

The operational consequence is the difference between an annoyance and an outage. A crash in user space kills one process: Nginx segfaults, systemd restarts it, the rest of the box keeps serving. A crash in kernel space, such as a null dereference in a driver or a bad loadable module, has nothing above it to catch the fault, so the kernel panics and takes the whole machine down. That is why you treat anything running in ring 0 with far more suspicion than anything in ring 3.

Protection Rings and Mode Switches

x86-64 CPUs define four privilege rings, 0 through 3, but Linux uses only two of them: ring 0 for the kernel and ring 3 for everything else. Rings 1 and 2 exist in the hardware but no mainstream operating system uses them, because a clean two-level split maps directly onto the kernel/user divide and is simpler to reason about. Ring 0 can execute privileged instructions: loading page tables, talking to I/O ports, disabling interrupts. If ring-3 code attempts any of these, the CPU raises a fault and hands control to the kernel.

Crossing from ring 3 to ring 0 is a mode switch, and it is not free. A modern syscall instruction plus the kernel entry and exit work costs on the order of a few hundred CPU cycles even before the kernel does anything useful, and mitigations for Spectre and Meltdown, such as kernel page-table isolation, have pushed that higher on affected CPUs. A function call inside one ring costs a handful of cycles. That gap is why batching work into fewer, larger system calls beats making many small ones.

You rarely think in rings day to day, but the cost shows up the moment a workload is syscall-bound. A program that reads a file one byte at a time pays the mode-switch toll on every byte; the same program with a 64 KB buffer pays it once per block. The boundary is real, and it has a price tag.
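A rough way to see the per-call toll from the shell is to let dd copy the same data with two block sizes. This is a minimal sketch rather than a benchmark: it assumes dd issues one read per input block on a regular file, and the scratch file and the count_reads helper are names made up for illustration.

```shell
#!/bin/sh
# Sketch: copy the same 64 KiB with 1-byte blocks and with one 64 KiB block.
export LC_ALL=C                      # keep dd's summary text in English for parsing
scratch=$(mktemp)                    # hypothetical scratch file
head -c 65536 /dev/zero > "$scratch"

# dd prints "N+M records in" on stderr: N full input blocks, M partial ones.
# Print N for the block size given as the first argument.
count_reads() {
    dd if="$scratch" of=/dev/null bs="$1" 2>&1 | awk -F'[+]' '/records in/ {print $1}'
}

small_reads=$(count_reads 1)         # one input block per byte
large_reads=$(count_reads 65536)     # one input block for the whole file

echo "bs=1:     $small_reads input blocks"
echo "bs=65536: $large_reads input blocks"
rm -f "$scratch"
```

Prefixing each dd run with time on a larger file shows the same gap as wall-clock time; most of the difference should appear as system time rather than user time.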

The System Call Interface

User space cannot open a file, send a packet, or fork a process on its own — it asks the kernel through a system call. Linux exposes roughly 350 syscalls on x86-64, each identified by a number, each a documented contract between user and kernel space. Almost no application invokes them directly; the C library wraps them in friendlier functions, so a call to fopen() bottoms out at the openat syscall, and a call to printf() eventually reaches write.

strace makes the boundary visible by tracing every syscall a process makes. It works through ptrace, which stops the target on entry and exit of each call — useful for debugging, but it adds two extra context switches per syscall, so a strace'd program can run an order of magnitude slower. That slowdown is the reason you reach for strace in development, not on a hot production path.

# Summarize syscalls a process makes, with time spent in each
strace -c -f curl -s https://example.com > /dev/null
# -c aggregates counts; -f follows forked children
# Output: openat, connect, sendto, recvfrom, read, write, close ...

In production, prefer eBPF-based tooling. bpftrace and the bcc tools attach probes inside the kernel and aggregate in place, so you observe syscalls with a fraction of strace's overhead and without stopping the process on every call.
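As an illustration of in-kernel aggregation, the sketch below counts syscalls per process name for five seconds with bpftrace. It assumes bpftrace is installed, that you are root, and that the kernel exposes the raw_syscalls:sys_enter tracepoint; the five-second window and the variable names are arbitrary choices for this example.

```shell
#!/bin/sh
# Sketch: count syscalls by process name for 5 seconds, aggregated inside the kernel.
# @[comm] is a bpftrace map keyed by process name; bpftrace prints it on exit.
probe='tracepoint:raw_syscalls:sys_enter { @[comm] = count(); } interval:s:5 { exit(); }'

if command -v bpftrace >/dev/null 2>&1 && [ "$(id -u)" -eq 0 ]; then
    bpftrace -e "$probe" || echo "bpftrace could not attach (kernel or sandbox restriction?)"
    trace_status=attempted
else
    echo "skipping: this sketch needs root and the bpftrace package"
    trace_status=skipped
fi
```

Only the final per-name totals cross back into user space, which is where the overhead advantage over a ptrace-based tracer comes from.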

Kernel Modules

Not all kernel code is compiled into the boot image. Loadable kernel modules — device drivers, filesystems, netfilter hooks — load and unload at runtime with modprobe and rmmod, and lsmod lists what is currently resident. A module runs in ring 0 with the same privileges as the core kernel, which is exactly why a buggy module is dangerous: there is no sandbox around it. A null pointer in a third-party driver panics the box just as surely as a bug in the scheduler would.

# Inspect and load modules
lsmod | grep nvme        # is the NVMe driver loaded?
modinfo nvme              # version, signer, parameters
modprobe dummy            # load a module by name (resolves deps)

On a machine with UEFI Secure Boot enabled, the kernel refuses to load modules that are not signed by a trusted key. (Where signature enforcement is off, an unsigned module loads but marks the kernel as tainted.) On Debian and Ubuntu this trips up out-of-tree drivers built locally, such as Nvidia, VirtualBox, and ZFS: their DKMS-rebuilt modules must be signed with a Machine Owner Key (MOK) enrolled through mokutil, or the load fails after a kernel upgrade with an error such as Operation not permitted. Red Hat behaves the same way; the signing tooling differs but the policy is identical.
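Before debugging a failed load, it helps to collect the three facts that decide the outcome: the Secure Boot state, the signer recorded in the module file, and the kernel's taint mask. The sketch below gathers them, assuming mokutil and modinfo may or may not be installed; the module name nvme is only an illustrative default.

```shell
#!/bin/sh
# Sketch: gather the facts that decide whether a module loads under Secure Boot.
module=nvme                          # illustrative module name; substitute your own

# Secure Boot state as reported by mokutil, if it is installed
if command -v mokutil >/dev/null 2>&1; then
    sb_state=$(mokutil --sb-state 2>/dev/null || echo "unknown")
else
    sb_state="unknown (mokutil not installed)"
fi

# Signer recorded in the module file; empty if unsigned, built in, or not found
signer=""
if command -v modinfo >/dev/null 2>&1; then
    signer=$(modinfo -F signer "$module" 2>/dev/null || true)
fi

# Taint bitmask: 0 means untainted; 4096 = out-of-tree module, 8192 = unsigned module
taint=$(cat /proc/sys/kernel/tainted 2>/dev/null || echo 0)

echo "Secure Boot:   $sb_state"
echo "$module signer: ${signer:-none reported}"
echo "taint mask:    $taint"
if [ $((taint & 8192)) -ne 0 ]; then
    echo "an unsigned module has been loaded since boot"
fi
```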

Virtual Memory and the Kernel/User Split

Each process gets its own virtual address space, and the kernel maps itself into the top of every one of them. On x86-64 with 48-bit virtual addresses, user space owns the lower canonical half and the kernel occupies the upper half. Mapping the kernel into every process is what makes a system call cheap to enter, since there is no full address-space switch on the way into ring 0. The kernel half of that mapping is marked supervisor-only in the page tables, so a ring-3 read of a kernel address faults immediately.

That protection is structural, not advisory. A user process literally cannot read kernel memory or another process's memory; the MMU rejects the access and the kernel delivers a SIGSEGV. The Meltdown vulnerability mattered precisely because it broke this guarantee through speculative execution, which is why current kernels on affected CPUs add page-table isolation that unmaps most of the kernel during ring-3 execution, at the cost of more expensive syscalls.

Process Context and Scheduling

The scheduler runs in ring 0 and decides which thread occupies each CPU core. It manages both ordinary user processes and kernel threads — tasks like kworker and ksoftirqd that do work on the kernel's behalf and show in brackets in ps output. Linux is preemptive: the kernel can interrupt a running task to give the CPU to a higher-priority one, so a CPU-bound loop in one process does not starve the rest of the system.
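Kernel threads can also be told apart without ps: every kernel thread other than kthreadd itself is a child of kthreadd. The sketch below counts them by walking /proc. It assumes PID 2 is kthreadd, which holds in the host PID namespace; inside a container the check fails and the count stays at 0.

```shell
#!/bin/sh
# Sketch: count kernel threads by walking /proc.
# Assumption: PID 2 is kthreadd (true in the host PID namespace only).
kthreads=0
if [ "$(cat /proc/2/comm 2>/dev/null)" = "kthreadd" ]; then
    for status in /proc/[0-9]*/status; do
        # PPid is the parent PID; tasks can exit mid-walk, so ignore read errors
        ppid=$(awk '/^PPid:/ {print $2}' "$status" 2>/dev/null || true)
        if [ "$ppid" = "2" ]; then
            kthreads=$((kthreads + 1))
        fi
    done
fi
echo "kernel threads parented by kthreadd: $kthreads"
```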

The kernel/user split is visible in any CPU breakdown. The us column in top is time spent executing user-space code; sy is time in the kernel on behalf of processes; wa is time idle waiting on I/O. A web server pinned at 40 percent us and 50 percent sy is spending most of its CPU inside the kernel — usually on network or filesystem syscalls — and no amount of application tuning fixes that until you cut the syscall volume.

# Watch the user/kernel/iowait split over 1-second samples
vmstat 1
# columns: us = user, sy = system (kernel), wa = iowait, id = idle
pidstat -u 1            # per-process %usr vs %system

When you profile, read sy first. High system time points at the kernel — syscalls, lock contention, interrupt load — and tells you the fix lives in how the program talks to the kernel, not in its arithmetic.
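The same split is available without any tooling, because vmstat and top derive it from /proc/stat. The sketch below computes the shares since boot from the aggregate cpu line. It folds nice time into us and interrupt time into sy, which is my understanding of how vmstat groups them; the variable names are chosen for this example.

```shell
#!/bin/sh
# Sketch: user/system/iowait/idle shares since boot, from the aggregate "cpu" line.
# Field order (clock ticks): user nice system idle iowait irq softirq steal ...
read -r label user nice system idle iowait irq softirq steal rest < /proc/stat
steal=${steal:-0}                    # tolerate a kernel that omits the field
total=$((user + nice + system + idle + iowait + irq + softirq + steal))

if [ "$total" -gt 0 ]; then
    us=$((100 * (user + nice) / total))             # user-space code
    sy=$((100 * (system + irq + softirq) / total))  # kernel code, incl. interrupts
    wa=$((100 * iowait / total))                    # idle while waiting on I/O
    id=$((100 * idle / total))
    echo "since boot: us=${us}% sy=${sy}% wa=${wa}% id=${id}%"
else
    echo "no CPU accounting available in /proc/stat"
fi
```

Figures since boot hide short spikes; sampling the line twice and subtracting gives the per-interval view that vmstat 1 prints.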

System call vs library call vs shell command

System call — the only sanctioned crossing into ring 0: openat, read, write, execve. It costs a mode switch and is identified by a number, not a name. This is the actual kernel/user boundary.

Library call — a C-library function like printf() or fopen() that runs in ring 3 and may wrap one or more syscalls underneath. printf() buffers in user space and eventually calls write; plenty of library calls make no syscall at all.

Shell command — what you type at the prompt, like ls. The shell forks and execs a separate program that in turn makes dozens of library and system calls. One command is many syscalls; never equate a command with a single trip into the kernel.
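To see how far one command is from one syscall, count the syscalls behind a single ls. The sketch below does that with strace when it is usable and reports unknown otherwise; it assumes strace's default one-line-per-call log format, and the log file is a throwaway temp file.

```shell
#!/bin/sh
# Sketch: count the syscalls behind one "ls" command, if strace is usable here.
log=$(mktemp)
if command -v strace >/dev/null 2>&1 && strace -o "$log" ls / >/dev/null 2>&1; then
    # Each traced syscall is logged as one line of the form name(args) = result
    syscall_count=$(grep -c '^[a-z_0-9]*(' "$log" || true)
else
    syscall_count="unknown (strace missing or ptrace not permitted)"
fi
echo "syscalls made by 'ls /': $syscall_count"
rm -f "$log"
```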

Common Mistakes

Assuming a shell command equals a system call — ls alone makes dozens of syscalls, so reasoning about kernel cost from command count is wrong by an order of magnitude.
Believing root bypasses the user/kernel boundary. Root is still ring-3 user space; it has more permissions but crosses into the kernel through the same syscalls and faults the same way on a bad memory access.
Loading unsigned out-of-tree modules on a Secure Boot system and being surprised when modprobe fails with Operation not permitted after a kernel upgrade.
Ignoring mode-switch cost in hot loops: reading a file one byte at a time pays the toll on every byte instead of once per large buffer.
Running strace on a hot production process without realizing ptrace adds two context switches per syscall and can slow it tenfold.
Treating kernel memory as readable from user space. The MMU faults the access; a ring-3 read of a ring-0 address earns a SIGSEGV, not data.
Building DKMS modules once and forgetting them — after a kernel upgrade the module must rebuild and re-sign, or the device silently stops working on next boot.

Best Practices

Read kernel messages with journalctl -k (or dmesg) before suspecting an application — panics, OOM kills, and taint flags surface there first.
Sign your out-of-tree modules with an enrolled MOK via mokutil, or disable Secure Boot as a deliberate, documented decision — never by accident.
Profile with the user/system split in mind: check us versus sy in vmstat or pidstat before deciding where the time goes.
Prefer bpftrace and the eBPF-based bcc tools over strace for production tracing; they probe in-kernel without stopping the process per syscall.
Keep DKMS modules in sync with kernel upgrades — verify dkms status after an upgrade and before rebooting a remote box.
Read kernel state from /proc and /sys instead of guessing — /proc/<pid>/status, /proc/interrupts, and /sys/module expose what the kernel actually sees.
Batch I/O into larger reads and writes to amortize the mode-switch cost rather than crossing the syscall boundary on every byte.
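Reading kernel state directly can be as simple as the sketch below, which pulls context-switch counters from /proc/<pid>/status and lists resident modules and interrupt counters when those files are readable. It inspects the current shell through $$ as a stand-in for whatever PID you care about; containers often hide /proc/modules or /proc/interrupts, so both reads are guarded.

```shell
#!/bin/sh
# Sketch: read a few facts straight from the kernel's own view.
pid=$$                               # this shell; substitute any PID you can read

# Voluntary switches = the task blocked; nonvoluntary = the scheduler preempted it
vol=$(awk '/^voluntary_ctxt_switches/ {print $2}' "/proc/$pid/status" 2>/dev/null || true)
nonvol=$(awk '/^nonvoluntary_ctxt_switches/ {print $2}' "/proc/$pid/status" 2>/dev/null || true)
vol=${vol:-0}
nonvol=${nonvol:-0}
echo "pid $pid context switches: voluntary=$vol nonvoluntary=$nonvol"

# Loadable modules currently resident, one per line in /proc/modules
if [ -r /proc/modules ]; then
    echo "loaded modules: $(wc -l < /proc/modules)"
fi

# Per-CPU interrupt counters; the first lines are enough for a quick look
if [ -r /proc/interrupts ]; then
    head -n 5 /proc/interrupts
fi
```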

Comparable tools

Windows: the NT executive runs in kernel mode (ring 0); user code traps into it through the ntdll syscall stubs
macOS: the XNU hybrid kernel mixes Mach traps with BSD syscalls below a user-space libSystem

Knowledge Check

A third-party storage driver dereferences a null pointer. What happens?

The kernel panics and the whole machine goes down — module code runs in ring 0 with no sandbox above it
The driver runs as an isolated process that segfaults, and systemd restarts it cleanly just like any other user-space crash
The CPU traps the fault, demotes the driver to ring 3, and lets the rest of execution continue along unaffected
The MMU delivers a SIGSEGV to the faulting driver and the rest of the system keeps running normally

A web server sits at 30% us and 55% sy in top. Where should you look first?

The kernel path — syscall volume, lock contention, or interrupt load — because most CPU is system time
The application's own arithmetic and business logic, since the CPU is clearly the bottleneck and that is where user code spends its cycles
The disk hardware, because a high sy figure in top always indicates a failing or saturated drive
The scheduler priority of the process, since sy measures the time the process loses to preemption

After a kernel upgrade on a Secure Boot box, modprobe for the Nvidia driver fails with Operation not permitted. Why?

The DKMS module rebuilt against the new kernel but was not signed by an enrolled trusted key, so the kernel refuses to load it
The new kernel package removed the modprobe command from the system during the upgrade
Ordinary ring 3 processes lack the permission needed to call modprobe at all after a kernel upgrade
The module actually loaded fine, and the error message originates from the GPU's own firmware rather than from the kernel itself

Why is strace a poor choice for tracing a busy production process?

It uses ptrace, which stops the target on every syscall entry and exit, adding context switches that can slow it tenfold
It can only trace one type of syscall per invocation, so on a busy process it silently misses most of the activity it should report
It requires Secure Boot to be disabled in firmware first, before it will attach to and trace a running process
It runs entirely in ring 3 and therefore cannot observe the kernel-side half of any syscall at all
