Topic 62

Exit Codes and Error Handling

Shell

Every command returns an integer status when it finishes: 0 means success, and 1 through 255 mean failure. The shell stores that number in $?, and you read it to decide what happens next. A script that never checks it is a script that keeps running after a step has already failed.

That is how a backup "succeeds" while writing an empty archive, and how a deploy continues past a failed migration. Error handling in shell is the discipline of checking the status of every command that matters and of turning silent failures into loud ones with set -e, set -o pipefail, and trap.

Exit Status Conventions

The status of the last foreground command lives in $?. It is a single byte, so the range is 0 to 255. Reading $? consumes nothing, but the next command overwrites it — so $? after an echo reports the echo, not the command you cared about. Capture it into a variable the instant you need it later.

Several codes carry fixed meanings, and knowing them turns an opaque number into a diagnosis. 127 means the command was misspelled or absent from $PATH; 126 means the file exists but is not executable; and any code of the form 128 + N means the process was killed by signal N, so 130 is a Ctrl-C (SIGINT, signal 2) and 137 is a SIGKILL (signal 9).

Code	Meaning
0	Success
1	General error (catch-all)
2	Misuse of a shell builtin
126	Command found but not executable
127	Command not found
128+N	Killed by signal N (130 = SIGINT, 137 = SIGKILL)

cp backup.tar.gz /mnt/archive/
status=$?           # capture before the next command overwrites $?
if [ "$status" -ne 0 ]; then
  echo "copy failed (status $status)" >&2
fi
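
The reserved codes in the table can be reproduced on demand. The following is a minimal sketch, assuming bash is available as a child interpreter; the command name no_such_command_xyz and the variable names are made up for illustration. Each probe runs in a child bash so that a failure cannot affect the calling script.

```shell
#!/usr/bin/env bash
# Each probe runs in a child bash, and its status is captured immediately.

bash -c 'no_such_command_xyz' 2>/dev/null
not_found=$?                      # 127: command not found

probe=$(mktemp)                   # plain file, created without the execute bit
bash -c '"$1"' _ "$probe" 2>/dev/null
not_executable=$?                 # 126: file exists but cannot be executed
rm -f "$probe"

bash -c 'kill -KILL $$'           # the child sends SIGKILL to itself
killed=$?                         # 137: 128 + 9

echo "not found: $not_found, not executable: $not_executable, killed: $killed"
```

Note that every status is copied into a variable on the very next line, which is the same capture habit the cp example uses.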

The errexit Option

The set -e option, also spelled set -o errexit, tells the shell to exit the moment a command returns a non-zero status. It converts "keep going no matter what" into "stop at the first failure," which is the right default for almost every non-interactive script.

The option has sharp, surprising exceptions. In a pipeline only the last command's status counts, so false | true exits 0 even under errexit. A command tested in an if, while, or until condition, or anywhere in an && or || list except the final command, never triggers the exit. And a function called inside any of those conditional contexts loses errexit for its entire body. Anyone who assumes set -e catches everything eventually gets burned by one of these.

set -e
cat /var/log/app.log | grep ERROR | wc -l
echo done   # prints: only wc's status reaches $?, and wc succeeds
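
The function-in-a-condition exception is the hardest one to spot, so it may help to see it side by side with the normal case. This is a sketch under stated assumptions: build is a hypothetical function, and each call runs in a command-substitution subshell with set -e enabled so the calling script survives.

```shell
#!/usr/bin/env bash
# build is a hypothetical function: its first step fails, then it carries on.
build() {
  false                                   # a failing step
  echo "build continued past the failure"
}

# Called as a plain command: errexit stops the subshell at `false`.
plain=$(set -e; build)
plain_status=$?

# Called as an if condition: errexit is suspended for the whole function body.
guarded=$(set -e; if build; then echo "if saw success"; fi)
guarded_status=$?

echo "plain:   status $plain_status, output '$plain'"
echo "guarded: status $guarded_status, output '$guarded'"
```

In the guarded call the failing step is silently skipped over, the function's final echo sets its status to 0, and the if branch runs as though nothing went wrong.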

The pipefail and nounset Options

set -o pipefail closes the pipeline gap: a pipeline returns the status of the rightmost command that failed, not just the last stage. With it set, the grep pipeline above fails when grep finds nothing or cat cannot open the file, instead of being masked by a successful wc.

set -u, or set -o nounset, aborts the script when it expands a variable that was never set. That catches a typo like $DEST_DIRR before it expands to the empty string and turns rm -rf "$DEST_DIRR"/* into rm -rf /*. The three combine into the standard header set -euo pipefail, which most production scripts open with.

#!/usr/bin/env bash
set -euo pipefail   # errexit + nounset + pipefail

# Now a failed command, an unset variable, and a broken
# pipeline stage each stop the script immediately.
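
To see what each option changes, the same commands can be run with and without it. The sketch below assumes bash and uses subshells so the options do not leak into the calling script; DEST_DIRR is the misspelled name from the text, and /tmp/fallback is an arbitrary illustrative default.

```shell
#!/usr/bin/env bash
# Same pipeline, with and without pipefail, each in its own subshell.
( false | true )
without_pipefail=$?               # 0: only the last stage counts

( set -o pipefail; false | true )
with_pipefail=$?                  # 1: status of the rightmost failing stage

# nounset: expanding a never-set variable aborts the subshell.
unset DEST_DIRR                   # make sure the misspelled name really is unset
( set -u; echo "target: $DEST_DIRR" ) 2>/dev/null
unset_status=$?                   # non-zero: the expansion was refused

# ${VAR:-default} tolerates an unset variable without disabling nounset.
safe=$(set -u; echo "target: ${DEST_DIRR:-/tmp/fallback}")

echo "without pipefail: $without_pipefail, with pipefail: $with_pipefail"
echo "unset expansion status: $unset_status, with default: $safe"
```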

Traps and Cleanup

The trap builtin runs a command when the shell receives a signal or reaches a pseudo-signal. The most useful is EXIT, which fires when the script ends: on normal completion, on an errexit abort, and, in bash, on a trappable fatal signal such as SIGINT or SIGTERM. SIGKILL cannot be trapped, so no cleanup runs after a kill -9. A single trap ... EXIT ensures cleanup runs on the failure paths, not just the happy path.

Make the cleanup idempotent so it is safe regardless of how far the script got before it ran. Add trap ... ERR to report which line failed during debugging, and trap INT and TERM separately when a Ctrl-C or a kill needs handling distinct from a normal exit.

set -euo pipefail
tmp=$(mktemp -d)
cleanup() { rm -rf "$tmp"; }            # idempotent: safe even if $tmp is gone
trap cleanup EXIT                       # runs on success, error, or signal
trap 'echo "failed at line $LINENO" >&2' ERR

tar czf "$tmp/out.tar.gz" /etc
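
One way to confirm that the EXIT trap really fires on a failure path is to run a failing child script and inspect what its trap left behind. This is a sketch, not a definitive pattern: the marker file and the exit codes chosen for INT and TERM (128 plus the signal number) are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Run a child script that fails midway, then check that its EXIT trap ran.
workdir=$(mktemp -d)
marker="$workdir/cleaned"         # hypothetical file the child's trap writes

bash -c '
  set -euo pipefail
  marker=$1
  cleanup() {
    local status=$?               # status the script is exiting with
    echo "cleanup ran with status $status" > "$marker"
  }
  trap cleanup EXIT               # runs on every trappable exit path
  trap "exit 130" INT             # Ctrl-C: exit with 128 + 2, EXIT trap still runs
  trap "exit 143" TERM            # kill: exit with 128 + 15
  false                           # errexit aborts the child here
  echo "never reached"
' _ "$marker"
child_status=$?

report=$(cat "$marker")
rm -rf "$workdir"
echo "$report (child returned $child_status)"
```

Reading $? on the first line of the cleanup function lets the trap log or branch on the status the script is about to return, and the trap leaves that status unchanged unless it calls exit itself.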

Explicit Error Handling and Custom Codes

Sometimes you want to handle a failure rather than abort on it. The || operator runs its right side only when the left side fails, which gives the common command || die "message" pattern; && is the mirror image, running its right side only on success. Inside a function, return N sets the function's status without leaving the script, while exit N terminates the whole process, so an exit buried in a sourced library kills the shell that sourced it.

Give your own scripts meaningful exit codes and document them, so callers and monitoring can tell a config error from a network timeout instead of seeing a generic 1. Stay inside 1 to 125, since 126, 127, and 128 + N are reserved by the shell, and any code above 255 wraps modulo 256 — exit 256 becomes 0 and reads as success.

die() { echo "$*" >&2; exit 1; }

deploy() {
  systemctl restart app || return 3   # report a specific failure to the caller
}

deploy || die "deploy failed (status $?), aborting"   # $? still holds deploy's status here
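
A variant of die that takes the exit code as its first argument lets the specific status travel all the way to the caller. The following is a minimal sketch: E_CONFIG, E_NETWORK, check_config, and the path /nonexistent/app.conf are hypothetical names chosen for illustration, and the fatal path runs in a subshell only so the demo can inspect the result.

```shell
#!/usr/bin/env bash
# Hypothetical exit codes for an illustrative deploy script (kept within 1-125).
E_CONFIG=3       # configuration missing or unreadable
E_NETWORK=4      # remote host unreachable (unused here, listed as documentation)

# die CODE MESSAGE...: print the message to stderr, then exit with CODE.
die() {
  local code=$1
  shift
  echo "$*" >&2
  exit "$code"
}

# check_config returns a status and leaves the decision to the caller.
check_config() {
  [ -r "$1" ] || return "$E_CONFIG"
}

# "$?" on the right of || is still the status of check_config.
( check_config /nonexistent/app.conf || die "$?" "config unreadable" ) 2>/dev/null
result=$?

# Codes above 255 wrap modulo 256, so 256 is reported as success.
bash -c 'exit 256'
wrapped=$?

echo "script would exit with $result; exit 256 was reported as $wrapped"
```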

Common Mistakes

Relying on set -e to catch pipeline failures: without pipefail, only the last stage's status counts, so generate | tee out.log reports success even when generate crashed.
Reading $? after an intervening echo or [ test — the second command overwrites it, so you branch on the status of the wrong command. Capture status=$? first.
Assuming set -e catches everything: a function called inside an if condition, and every command in an && / || list except the last, runs with errexit suspended and silently swallows failures.
Calling exit in a script meant to be sourced — it terminates the interactive shell that sourced it instead of returning to the caller. Use return in sourced code.
Writing non-idempotent cleanup in trap ... EXIT; if the temp dir was never created, the trap itself errors and masks the original failure.
Returning a status above 255 or reusing reserved codes (126, 127, 128 + N) for your own errors: a value above 255 wraps modulo 256, and a reserved code collides with its shell-defined meaning.
Treating exit code 127 as a generic bug instead of "command not found," which almost always means a typo or a missing package, not faulty logic.

Best Practices

Open every non-trivial script with set -euo pipefail so failed commands, unset variables, and broken pipeline stages all stop execution.
Register cleanup with trap cleanup EXIT and write the cleanup function to be safe even when nothing has been created yet.
Capture status=$? on the line immediately after the command you care about, before any other command can overwrite it.
Use command || die "message" for fatal steps and send the error to stderr with >&2, never to stdout where it pollutes parseable output.
Return status from functions with return N and let the caller decide; reserve exit for the top-level script, especially in anything that may be sourced.
Add trap 'echo "error on line $LINENO" >&2' ERR while debugging to pinpoint exactly where a failure happened.
Assign your own exit codes in the 1–125 range, document what each one means, and keep them stable so monitoring can distinguish failure modes.

Comparable tools: PowerShell $LASTEXITCODE and $?; Windows cmd errorlevel

Knowledge Check

A script runs cat file | grep ERROR | wc -l under set -e, but file is missing. Why does the script keep going?

Without pipefail, only the last stage's status reaches the shell, and wc succeeds — so errexit sees a 0
set -e carries a built-in exception that quietly ignores the status of any command that happens to read its input from a pipe
cat returns a status of 0 even when the file it is given does not exist
grep always returns 0 whenever it produces no matching output lines

Why prefer trap cleanup EXIT over simply calling cleanup as the last line of the script?

EXIT fires on normal completion, an errexit abort, and signals, so cleanup also runs on the failure paths that never reach the last line
A trap dispatches the cleanup faster than calling the function directly as the final line
An EXIT trap can return a status greater than 255 to the parent shell, where a normally called function cannot
Without a trap, the cleanup function would run in its own scope and lose all access to the script's variables and temporary file paths when it finally runs

A helper is loaded with source lib.sh, and on an error it calls exit 1. What happens?

The shell that sourced the file terminates, because exit ends the current process rather than returning to the caller
Only the library's own subshell exits with status 1, and the interactive shell that sourced the file simply continues running unaffected
The shell silently converts the sourced file's exit into a return so the caller survives
The caller receives exit code 0 because sourcing suppresses failures

A command finishes and the shell reports exit code 130. What does that tell you?

The command was killed by SIGINT — the signal Ctrl-C sends — because 130 is 128 + 2
The named command could not be located anywhere on the current $PATH
The command file existed on disk but was not marked with the executable permission bit
A shell builtin was invoked with the wrong number or kind of arguments

Why should a script's custom error codes stay in the 1–125 range?

126, 127, and 128 + N are reserved by the shell, and any value above 255 wraps modulo 256 — so reusing them creates ambiguous or false-success statuses
The shell flatly rejects any exit argument greater than 125 with a syntax error before the script can even terminate, so codes in that range are simply not allowed to be written
Any code above 125 is automatically reset to a plain 1 by set -e before it reaches the caller
Only codes 0 through 125 can be read back from $?
