init and systemd
Topic 40

init is the first user-space process the kernel starts after it mounts the root filesystem, and it is always PID 1. The kernel hands it control and walks away; from that moment everything else on the machine — your shell, sshd, the database, every cron job — descends from PID 1. On every current Debian, Ubuntu, RHEL, Fedora, and SUSE release, the program that fills that slot is systemd, which is far more than a process launcher: it is a service manager that tracks dependencies, supervises long-running daemons, captures their logs, and confines them with cgroups and kernel namespaces.

That makes PID 1 the single most consequential process on the box. It reaps orphaned children so they do not pile up as zombies, it decides what starts in what order at boot, and it restarts a daemon that crashes if you told it to. When you write a unit file, set a restart policy, or run systemctl, you are configuring this one process, and a misconfiguration here can do more than crash one service: it can leave the machine unable to boot, or stuck after a reboot with no remote way in to fix it.

PID 1 and the init Contract

PID 1 has two non-negotiable duties that the kernel itself depends on. First, it must reap dead children: when any process exits, its parent is supposed to call wait() to collect the exit status, and if the parent has already died, the orphaned process is re-parented to PID 1, which must reap it or the system slowly fills with zombie entries that consume PIDs. Second, PID 1 cannot die — if it exits, the kernel panics with Attempted to kill init and the machine halts. Everything systemd does on top is optional; these two properties are the contract.
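
One way to check whether the reaping half of the contract is being honoured is to look for processes stuck in state Z. The following is a minimal sketch, assuming the procps `ps` found on most Linux systems; `list_zombies` is a hypothetical helper name, not a standard tool.

```shell
# list_zombies: read `ps -eo pid,ppid,stat,comm` output on stdin and print
# "PID PPID COMMAND" for every process whose state starts with Z (zombie).
list_zombies() {
    # NR > 1 skips the header line that ps prints.
    awk 'NR > 1 && $3 ~ /^Z/ { print $1, $2, $4 }'
}

# Typical use on a live system:
#   ps -eo pid,ppid,stat,comm | list_zombies
```

A steadily growing list whose PPID column is 1 suggests that whatever runs as PID 1 is not calling wait() for the children it inherits.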

This contract is exactly why a bare process makes a terrible PID 1, and why containers are a recurring trap. A shell script or an application started as PID 1 in a minimal container does not reap re-parented children, and because the kernel applies no default signal actions to PID 1, a SIGTERM for which the process installed no handler is silently discarded. The result is that docker stop appears to be ignored and zombies accumulate. The fix is a tiny init shim, either tini (what Docker's --init flag runs) or dumb-init, that does the reaping and signal-forwarding the contract demands without dragging full systemd into the image.
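
To make the signal-forwarding half of that job concrete, here is a rough POSIX shell sketch. `forward_and_wait` is a hypothetical helper, not code from tini or dumb-init, and it only waits for its one direct child; it does not reap re-parented orphans, so it illustrates the idea and is not a substitute for a real init shim.

```shell
# forward_and_wait CMD [ARGS...]
# Start CMD as a child, relay TERM/INT to it as SIGTERM, and return its exit status.
forward_and_wait() {
    "$@" &
    child=$!
    # Installing a handler is what makes the signal deliverable to PID 1 at all.
    trap 'kill -TERM "$child" 2>/dev/null' TERM INT
    wait "$child"
    status=$?
    # A trapped signal makes wait return early, so keep waiting while the child is alive.
    while kill -0 "$child" 2>/dev/null; do
        wait "$child"
        status=$?
    done
    trap - TERM INT
    return "$status"
}

# As a container entrypoint this might be called as:
#   forward_and_wait /usr/local/bin/widget --port 8080
```

In practice, passing --init to docker run is the simpler route, since the shim then also reaps orphans that this sketch ignores.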

SysV init and Runlevels

Before systemd, the dominant scheme was SysV init: PID 1 read /etc/inittab, picked a runlevel (a single digit, 0 through 6), and ran the numbered shell scripts in the matching /etc/rcN.d/ directory in lexical order. Runlevel 0 was halt, 6 was reboot, 1 was single-user, and 2 through 5 were multi-user variants — though the meaning of 2–5 differed between Debian and Red Hat, which was a perennial source of confusion. Each service shipped a script in /etc/init.d/ accepting start, stop, restart, and status.

The model had two structural weaknesses that systemd was built to fix. Startup was strictly sequential — S01 finished before S02 began, so a slow mount stalled everything behind it — and the scripts had no real supervision: once a script forked a daemon and returned, init lost track of it, so a crashed service stayed dead and status often just checked a stale PID file. systemd keeps a thin compatibility layer that generates units from leftover /etc/init.d/ scripts, but new services should never be written this way.

The systemd Model: Units, Targets, and the Journal

systemd organizes the system into units, each a small declarative file describing one manageable object. The common types are .service (a daemon or one-shot command), .socket (a listening socket that starts its service on first connection), .timer (a cron replacement), .mount and .automount (filesystems), and .target (a named group used as a synchronization point). Targets replace runlevels: multi-user.target is the headless server state, graphical.target adds a display, and the symlink default.target selects which one boot aims for.

# where unit files live, highest precedence first
/etc/systemd/system/       # your overrides + custom units (wins)
/run/systemd/system/       # volatile, runtime-generated units
/usr/lib/systemd/system/   # distro-shipped units (do not edit)

# inspect and control
systemctl status nginx.service
systemctl list-units --type=service --state=running
systemctl get-default              # prints the active default.target
systemctl cat nginx.service         # show the effective unit + drop-ins

Logging is unified through the journal: every line a service writes to stdout and stderr is captured by journald, tagged with the unit name, PID, timestamp, and boot ID, and stored as a structured binary log. You read it with journalctl -u nginx, scope it to the current boot with journalctl -b, or follow it live with journalctl -f. The default Storage=auto keeps the journal persistent in /var/log/journal when that directory exists and volatile in /run/log/journal otherwise. Whether the directory exists depends on the distribution and release: recent Debian and Ubuntu releases create it at install time, while older releases and many minimal or container images do not, in which case the journal is volatile and lost on reboot. Setting Storage=persistent in /etc/systemd/journald.conf creates /var/log/journal and makes the log survive, which you want on any server you expect to debug after a crash.
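
The persistent-storage setting can also be delivered as a drop-in file, which keeps the packaged journald.conf untouched. A minimal sketch follows; `persist_journal`, its optional ROOT argument, and the file name persistent.conf are all hypothetical choices made for illustration.

```shell
# persist_journal [ROOT]
# Write a journald drop-in that forces persistent storage.
# ROOT defaults to the real filesystem root; it exists so the sketch can be
# tried against a scratch directory first.
persist_journal() {
    dir="${1:-}/etc/systemd/journald.conf.d"
    mkdir -p "$dir" &&
    printf '[Journal]\nStorage=persistent\n' > "$dir/persistent.conf"
}

# On a real host, as root, write the file and restart journald to apply it:
#   persist_journal && systemctl restart systemd-journald
```

Afterwards, journalctl --list-boots should start showing more than the current boot once the machine has been restarted at least once.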

cgroups and Service Supervision

systemd's real advantage over SysV is that it knows exactly which processes belong to a service, because it places each service in its own control group. Every process a daemon forks (workers, helper scripts, runaway children) stays inside that cgroup, so systemctl stop signals the entire tree and, with the default KillMode=control-group, follows up with SIGKILL after TimeoutStopSec= so that nothing is left running, while systemctl kill delivers a signal of your choice to every process in the unit. This is also where resource limits live: MemoryMax=512M, CPUQuota=50%, and TasksMax= are enforced by the kernel cgroup, not by fragile per-process ulimit settings.
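
The cgroup membership is visible directly in the filesystem. The sketch below assumes the unified cgroup v2 hierarchy mounted at /sys/fs/cgroup and a system service living under system.slice; `unit_pids` and its optional second argument are hypothetical, and processes in nested sub-cgroups are not listed.

```shell
# unit_pids UNIT [CGROOT]
# Print the PIDs the kernel currently accounts to UNIT's control group.
# CGROOT defaults to /sys/fs/cgroup and is overridable for experimentation.
unit_pids() {
    cat "${2:-/sys/fs/cgroup}/system.slice/$1/cgroup.procs"
}

# Typical use on a live system:
#   unit_pids nginx.service
```

systemctl status nginx.service shows the same membership as a tree under its CGroup: line, which is usually the more convenient view.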

Supervision is declarative. Restart=on-failure brings a crashed daemon back; Restart=always brings it back even after a clean exit. The start rate limiter, set with StartLimitIntervalSec= and StartLimitBurst= (by default 5 starts within 10 seconds), stops a crash loop from hammering the machine: exceed the burst and systemd marks the unit failed and stops trying. The Type= directive tells systemd how to know the service is actually up: simple treats the service as started the moment its process has been forked, while notify waits for the daemon to send sd_notify(READY=1), which is the only honest readiness signal and what dependent units should wait on.

# a minimal, well-behaved service unit
[Unit]
Description=Widget API
After=network-online.target
Wants=network-online.target

[Service]
Type=notify
ExecStart=/usr/local/bin/widget --port 8080
Restart=on-failure
RestartSec=2
MemoryMax=512M

[Install]
WantedBy=multi-user.target
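
Whether the start limit ever fires depends on how RestartSec= relates to the burst and interval. The arithmetic can be sketched roughly as below, under the simplifying assumption that the daemon dies the instant it starts; `crash_loop_trips` is a hypothetical helper, and near the boundary real process run time and scheduling decide the outcome.

```shell
# crash_loop_trips RESTART_MS BURST INTERVAL_MS
# Exit 0 if an instantly failing daemon would hit the start limit, 1 otherwise.
crash_loop_trips() {
    # Start number BURST+1 happens roughly BURST * RESTART_MS after the first one;
    # it is refused only if that still falls inside the interval.
    [ $(( $2 * $1 )) -lt "$3" ]
}

# Default limit (5 starts per 10 s) with the default 100 ms restart delay:
crash_loop_trips 100 5 10000 && echo "100ms restarts: limit trips, unit ends up failed"
# Same limit with a 5 s restart delay:
crash_loop_trips 5000 5 10000 || echo "5s restarts: limit never trips, the loop continues"
```

If a longer RestartSec= is wanted, widening StartLimitIntervalSec= in the [Unit] section keeps the limiter effective.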

Boot Sequence and Dependency Ordering

At boot, systemd pulls in default.target and walks its dependency graph, starting everything it can in parallel. Two families of directives govern this, and conflating them is the classic systemd mistake. Wants= and Requires= express what must be pulled in: Requires= is the hard form (if the dependency fails to start, and the unit is also ordered After= it, this unit is not started either) and Wants= is the soft form. Before= and After= express ordering only: they say nothing about whether a unit is started, just the sequence if both are being started in the same transaction.

The practical consequence: a unit with After=postgresql.service but no Wants= or Requires= will start after Postgres if Postgres is being started anyway, and will start immediately otherwise. Network ordering is the most common trap of all. After=network.target only means the network management stack has been started, not that any interface has an address, so a service that binds to a specific IP must use After=network-online.target together with Wants=network-online.target. What actually delays network-online.target is a wait-online service: systemd-networkd-wait-online.service where systemd-networkd manages the links (including as Netplan's backend on Ubuntu), and NetworkManager-wait-online.service where NetworkManager does, as on RHEL. Inspect the whole chain with systemd-analyze critical-chain and find what is slowing boot with systemd-analyze blame.
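
The ordering-without-requirement mistake is easy to spot mechanically. The sketch below is a deliberately naive lint: `ordering_only` is a hypothetical helper that reads a single unit file, ignores drop-ins and line continuations, and simply lists units that appear in After= but in no Wants=, Requires=, or BindsTo= line.

```shell
# ordering_only FILE
# Print each unit named in After= that nothing in FILE pulls in.
ordering_only() {
    awk -F= '
        /^(Wants|Requires|BindsTo)=/ {
            n = split($2, u, " ")
            for (i = 1; i <= n; i++) pulled[u[i]] = 1
        }
        /^After=/ {
            n = split($2, u, " ")
            for (i = 1; i <= n; i++) after[u[i]] = 1
        }
        END {
            for (x in after) if (!(x in pulled)) print x
        }
    ' "$1" | sort
}

# Typical use:
#   ordering_only /etc/systemd/system/widget.service
```

Treat the output as a prompt to review, not as a list of errors: ordering after a target such as network.target without pulling it in is often intentional.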

SysV init vs systemd vs OpenRC/runit

SysV init — sequential shell scripts in /etc/init.d/ driven by runlevels, with no real supervision once a daemon forks. You will only meet it on legacy systems and inside the systemd compatibility layer; do not write new services against it.

systemd — declarative units, parallel boot, cgroup-based supervision, integrated logging and timers. The default on every mainstream distribution and the assumption behind the rest of this chapter; choose it unless an external constraint forbids it.

OpenRC / runit — lightweight alternatives kept by Alpine, Gentoo, Void, and Devuan. OpenRC adds dependency-based ordering on top of init scripts; runit is a minimal supervision suite with near-instant restarts. Reach for these on tiny container images or systems where a full systemd footprint is unwanted, accepting that you lose journald, timers, and unit semantics.

Common Mistakes
  • Confusing Wants=/Requires= with After=/Before= — adding only After=db.service and expecting the database to be pulled in. Ordering directives never start anything; the service races ahead and fails to connect because nothing requested the dependency.
  • Using After=network.target for a daemon that binds to a specific address. That target only means the network management stack has started, not that an interface has an IP, so the bind fails at boot; you need After=network-online.target with a matching Wants=.
  • Editing a distro-shipped unit in /usr/lib/systemd/system/ directly. The next package update silently overwrites it; use systemctl edit to create a drop-in under /etc/systemd/system/ that survives upgrades.
  • Running a raw application or shell script as PID 1 in a container. It does not reap re-parented children and usually ignores SIGTERM, so zombies accumulate and docker stop hangs for the full 10-second grace period before a SIGKILL.
  • Setting Restart=always on a daemon that fails fast without checking the start limit. The defaults allow 5 starts per 10 seconds, so a RestartSec= of a few seconds never reaches the limit and the service crash-loops indefinitely, flooding the journal instead of failing cleanly and alerting you.
  • Assuming systemctl enable also starts the service now. It only sets the boot-time WantedBy symlink; the daemon stays down until the next boot unless you also run systemctl start or use enable --now.
  • Forgetting systemctl daemon-reload after changing a unit file. systemd keeps using the copy it already loaded, so your edit appears to have no effect and you chase a phantom bug. A running service also needs a restart before a new ExecStart= applies.

Best Practices
  • Customize units with systemctl edit unit, which writes a drop-in to /etc/systemd/system/unit.d/override.conf instead of touching the packaged file — your change then survives every package upgrade.
  • Pair After=network-online.target with Wants=network-online.target for any service that binds to a fixed IP or dials out at startup; never rely on network.target alone.
  • Set Type=notify and emit sd_notify(READY=1) from daemons you control, so dependent units start only when the service is genuinely ready rather than the instant it forks.
  • Size StartLimitIntervalSec= and StartLimitBurst= against RestartSec= for every restarting service, so a crash loop actually trips into failed and surfaces instead of hammering the box forever.
  • Cap resources in the unit with MemoryMax=, CPUQuota=, and TasksMax=; the kernel cgroup enforces them across the whole process tree, which per-process ulimit cannot.
  • Enable persistent logging by setting Storage=persistent in journald.conf, which creates /var/log/journal if it is missing, so post-crash investigation has the lines that explain the crash rather than a volatile journal that vanished on reboot.
  • Run systemctl daemon-reload after every unit edit, then systemd-analyze verify unit to catch directive typos before the service refuses to start.
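
The drop-in practice above can also be scripted, which is handy in provisioning. This is a minimal sketch of doing by hand roughly what systemctl edit does interactively; `write_dropin` and its optional ROOT argument are hypothetical, and the Restart=on-failure content is only an example.

```shell
# write_dropin UNIT [ROOT]
# Read drop-in text on stdin and install it as override.conf for UNIT.
# ROOT defaults to the real filesystem root; pass a scratch directory to experiment.
write_dropin() {
    dir="${2:-}/etc/systemd/system/$1.d"
    mkdir -p "$dir" && cat > "$dir/override.conf"
}

# On a real host, as root:
#   printf '[Service]\nRestart=on-failure\n' | write_dropin nginx.service
#   systemctl daemon-reload && systemctl restart nginx.service
```

systemctl cat nginx.service then shows the packaged unit followed by the drop-in, which confirms the override is being picked up.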

Comparable Tools
  • macOS launchd: Apple's PID 1 and service manager, configured with property-list jobs; the closest single-tool analogue to systemd's unit-plus-supervisor role
  • Windows Service Control Manager: manages services, dependencies, and recovery actions, the SCM being the NT equivalent of the systemd service layer
  • BSD rc: FreeBSD's rc.d and OpenBSD's rc, ordered shell scripts much like SysV init, with no cgroup-style supervision

Knowledge Check

A unit declares only After=postgresql.service and nothing else about Postgres. What actually happens at boot?

  • If Postgres is being started anyway it goes first; if nothing else pulls Postgres in, the unit starts immediately without it, because After= orders but never requests a dependency
  • systemd blocks and refuses to start the unit until postgresql.service is fully running first, treating the bare After= line as a hard requirement that must be satisfied before launch
  • Postgres is automatically pulled into the boot transaction and started, since After= implies the dependency must exist
  • The unit fails immediately with a missing-dependency error at boot if postgresql.service is not already active

Why is running a plain application as PID 1 in a container a problem?

  • It does not reap re-parented orphans and often ignores SIGTERM, so zombies pile up and docker stop hangs until the grace period forces a SIGKILL
  • The kernel forbids any process other than systemd from ever holding PID 1 in a namespace, so the container refuses to start and exits with an error at once
  • A process running as PID 1 cannot open network sockets, so the application is left unreachable from outside the container
  • Logs written by PID 1 are never captured anywhere, because journald requires a separate init process to exist first

You change ExecStart= in a service unit and the new command does not take effect. What is the most likely cause?

  • You did not run systemctl daemon-reload, so systemd is still acting on the unit definition it loaded earlier
  • A unit file becomes read-only after the first boot and must be deleted and then recreated from scratch to change it
  • Changes to ExecStart= require a full reboot, because the command is compiled into the initramfs boot image
  • The change is silently ignored unless you also bump a version number directive in the [Unit] section

A daemon binds to a fixed IP and fails at boot but works when started by hand a minute later. What is the correct fix?

  • Order it after network-online.target and add Wants=network-online.target, because network.target only means the network management stack has started, not that an interface has an address
  • Add only After=network.target to the unit, which by itself already guarantees that every interface has been fully brought up and has its routable static IP address assigned
  • Set Restart=always with a short delay so the unit keeps retrying until the network happens to be up
  • Move the unit file into /usr/lib/systemd/system/ so that it is read earlier in the boot sequence

What does placing each service in its own cgroup give systemd that SysV init lacked?

  • Exact accounting of every process the service forked, so stop and kill act on the whole tree and resource limits like MemoryMax= are enforced by the kernel
  • Measurably faster process creation, because membership in a cgroup lets a service bypass the normal fork() path
  • Automatic encryption of each service's resident memory pages by the kernel, so that one process can never read another service's in-memory data
  • The ability to run services entirely without a PID of their own, removing the need to ever reap their exited orphan child processes at all under any real load
