Topic 21

awk


awk is a small programming language built for column- and record-oriented text. It reads input one record at a time — by default one line — splits each record into fields on whitespace, and runs your program against every record in turn. The whole language is a list of pattern { action } rules: for each record, awk tests the pattern, and where it matches, runs the action. That model is what makes a one-liner like awk '$5 > 100 { print $3 }' — "print the third column of every line whose fifth column exceeds 100" — a single expression instead of a script.

Every Debian and Ubuntu install ships an awk at /usr/bin/awk; on those systems it is usually mawk, a fast, lean implementation, while the fuller-featured GNU version is gawk (installed with apt install gawk, and the default on RHEL/Fedora). For the field-splitting, arithmetic, and counting that this page covers, mawk and gawk behave the same; the divergence only shows up in gawk extensions like gensub(), true multidimensional arrays, and FIELDWIDTHS. The practical consequence: reach for awk the moment a problem is about columns of variable-width data, where cut breaks and a shell loop would be far slower.

The Pattern–Action Model

An awk program is a sequence of rules, each of the form pattern { action }. Both halves are optional. A rule with only a pattern and no action uses the default action, which is print $0 — print the whole record. A rule with only an action and no pattern runs that action on every record. So awk '/error/' prints lines containing "error" (pattern only, default print), and awk '{ print $1 }' prints the first field of every line (action only, no pattern). This is why awk so often replaces grep | cut chains: the pattern filters and the action projects, in one pass.

Patterns are not limited to regular expressions. A pattern can be any expression that evaluates to true or false — $3 == "DOWN", NF > 5, NR % 2 == 0 — or a range written as /start/,/stop/ that switches on at the first match and off at the second. The action is a block of statements in a C-like syntax: assignments, if/else, for, while, print, and printf. Awk runs the entire program once per record, top to bottom, so multiple rules compose naturally against the same input stream.
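The three rule shapes described above can be tried side by side. This is a minimal sketch on made-up input (the host names, states, and latencies are hypothetical), not output from a real system:

```shell
# Hypothetical sample input: name, state, latency in ms
data='web1 UP 12
web2 DOWN 0
db1 UP 48
db2 DOWN 0'

# Pattern only: the default action prints the whole record
pattern_only=$(printf '%s\n' "$data" | awk '$2 == "DOWN"')

# Action only: runs on every record
action_only=$(printf '%s\n' "$data" | awk '{ print $1 }')

# Range pattern: switches on at the web2 line and off after the db1 line
range=$(printf '%s\n' "$data" | awk '/^web2/,/^db1/ { print NR ": " $1 }')

printf '%s\n' "$pattern_only" "$action_only" "$range"
```

The range rule prints records 2 and 3 only, because the line that matches the stop pattern is still inside the range.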

Fields, Records, and Separators

Awk splits each record into fields and numbers them from $1; $0 is the entire record, $NF is the last field, and $(NF-1) is the second-to-last. Two built-in variables track position: NR is the running record number across all input, and NF is the field count of the current record. The default field separator is not a single space — it is "runs of whitespace, with leading and trailing whitespace ignored", which is exactly why awk handles columns aligned with multiple spaces or tabs that cut -d' ' mangles.

Four variables control splitting and output. FS (or the -F flag) sets the input field separator; RS sets the record separator; OFS sets the separator awk inserts between fields when you print them; and ORS sets what it appends after each record. For comma-delimited input you set -F,; to read paragraph-mode blocks you set RS="". The catch with OFS trips up nearly everyone: it always separates the comma-separated arguments of print, but $0 keeps its original separators until you assign to a field, so a bare print shows no change until something like $1=$1 forces awk to rebuild the record.

# Count active SSH sessions per source IP from `who`-style output
who | awk '{ print $5 }' | sort | uniq -c

# Print user (col 1) and RSS in MB (col 3, in KB) where RSS > 100 MB
ps -eo user,pid,rss,comm | awk 'NR>1 && $3 > 102400 { printf "%-12s %6.1f MB  %s\n", $1, $3/1024, $4 }'

# Force $0 to be rebuilt with OFS by touching a field ($1=$1); the bare print then emits tabs
awk -F: 'BEGIN { OFS = "\t" } { $1 = $1; print }' /etc/passwd
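To see exactly when OFS applies, it may help to run the three cases on one line. This is a sketch on a made-up passwd-style record; the variable names are only for illustration:

```shell
line='alice:x:1000:1000'   # hypothetical passwd-style record

# Comma-separated print arguments are always joined with OFS
args=$(printf '%s\n' "$line" | awk -F: 'BEGIN { OFS = "," } { print $1, $3 }')

# A bare print emits $0 untouched, so the original colons survive
untouched=$(printf '%s\n' "$line" | awk -F: 'BEGIN { OFS = "," } { print }')

# Assigning to any field makes awk rebuild $0 with OFS
rebuilt=$(printf '%s\n' "$line" | awk -F: 'BEGIN { OFS = "," } { $1 = $1; print }')

printf '%s\n' "$args" "$untouched" "$rebuilt"
```

Only the second case keeps the colons; the $1 = $1 assignment changes no value, it only triggers the rebuild.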

BEGIN and END Blocks

Two special patterns bracket the main loop. A BEGIN { } block runs once before any input is read; an END { } block runs once after the last record. BEGIN is where you set FS, OFS, and counters, or print a header row. END is where you emit totals, averages, or the contents of arrays you accumulated. A program can have any number of each, plus the ordinary per-record rules in between, and awk runs them in the right order automatically.

This three-phase shape — initialize, process each record, summarize — is the structural reason awk replaces a whole class of shell scripts. Summing a column is the canonical example: awk '{ s += $1 } END { print s }' needs no array, no temporary file, and no second pass. The accumulator lives across records because awk variables persist for the life of the program, not just one record.

# Sum and average a numeric column, with a header and a footer
awk 'BEGIN { print "bytes" }
     { total += $1; n++ }
     END   { if (n > 0) printf "sum=%d avg=%.1f over %d rows\n", total, total/n, n }' sizes.txt   # guard: empty input would divide by zero
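The same initialize, process, summarize shape extends to other running statistics. A minimal sketch, assuming one number per line (the sample values are made up); the + 0 stores a numeric copy so later comparisons stay numeric:

```shell
# Made-up sample: one number per line
stats=$(printf '%s\n' 40 10 25 | awk '
  NR == 1  { min = max = $1 + 0 }   # seed both from the first record
  $1 < min { min = $1 + 0 }
  $1 > max { max = $1 + 0 }
           { sum += $1 }
  END      { if (NR > 0) printf "min=%d max=%d avg=%.1f\n", min, max, sum / NR }')
echo "$stats"
```

Several rules run against each record in order, and the variables they update are still there when the END block reads them.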

Variables, Arithmetic, and Formatting

Awk variables are untyped: a value is a string, a number, or both, and awk decides which interpretation to use from context. In a numeric context a string like "100" becomes the number 100, a string with a numeric prefix like "9x" becomes 9, and a string with no numeric prefix becomes 0. That dynamic typing is convenient until it surprises you in comparisons. $1 > $2 compares numerically if both fields look like numbers, but lexically (string order, where "10" sorts before "9") if either side is non-numeric. Forcing a numeric comparison is a matter of adding zero ($1+0 > $2+0); forcing a string comparison is concatenating an empty string ($1"" == $2"").
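The comparison rules are easiest to trust after watching them flip. The sketch below feeds two tiny made-up records through the same test; the result is stored in a variable first because a bare > inside print would be read as output redirection:

```shell
# Both fields look numeric, so the comparison is numeric: 10 > 9
numeric=$(echo '10 9' | awk '{ r = ($1 > $2) ? "yes" : "no"; print r }')

# "9x" is not numeric, so both sides compare as strings: "10" sorts before "9x"
lexical=$(echo '10 9x' | awk '{ r = ($1 > $2) ? "yes" : "no"; print r }')

# Adding zero forces numbers; "9x" converts to 9 via its leading digit
forced=$(echo '10 9x' | awk '{ r = ($1 + 0 > $2 + 0) ? "yes" : "no"; print r }')

# Concatenating "" forces strings even when both fields look numeric
as_text=$(echo '10 9' | awk '{ r = (($1 "") > ($2 "")) ? "yes" : "no"; print r }')

echo "numeric=$numeric lexical=$lexical forced=$forced as_text=$as_text"
```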

For output, print is the blunt instrument and printf is the precise one. print joins its arguments with OFS and appends ORS; printf takes a C-style format string and emits exactly what you specify — field widths, decimal places, zero-padding, no automatic newline. When a report needs columns to line up or numbers to show one decimal place, use printf "%-20s %8.2f\n" and control every character, rather than fighting print and OFS.
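A short side-by-side shows the difference. This is a sketch on made-up rows (the service names and values are hypothetical), with a | in the format string only to make the column boundary visible:

```shell
rows='api 3.14159
frontend 12.5'

# print joins its arguments with OFS and appends ORS
plain=$(printf '%s\n' "$rows" | awk 'BEGIN { OFS = " = " } { print $1, $2 }')

# printf pads the name to 10 columns and fixes the number at two decimals
aligned=$(printf '%s\n' "$rows" | awk '{ printf "%-10s|%8.2f\n", $1, $2 }')

printf '%s\n' "$plain" "$aligned"
```

With print the second column starts wherever the first one ends; with printf both rows put the | in column 11 and right-align the numbers.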

Associative Arrays

The feature that elevates awk from a field-printer to a real data tool is the associative array: a hash map whose keys are arbitrary strings. You never declare or size one — referencing count[$1]++ creates the entry on first touch and increments it. This single idiom does grouping and counting in one pass over the data, with no prior sort, which is precisely the work that otherwise needs sort | uniq -c plus a second sort.

In an END block you iterate the array with for (key in arr) to print every group and its tally. The pattern generalizes far past counting: sum bytes per IP, track the max latency per endpoint, collect the set of users seen per host. The one gotcha is iteration order — for (key in arr) visits keys in an unspecified order in mawk and standard gawk, so pipe the output through sort when you need it ordered, or use gawk's PROCINFO["sorted_in"].

# Top talkers: count requests per client IP in an access log, show the top 10
awk '{ hits[$1]++ } END { for (ip in hits) print hits[ip], ip }' access.log \
  | sort -rn | head -10

# Total response bytes per HTTP status code (status in $9, bytes in $10)
awk '{ bytes[$9] += $10 } END { for (s in bytes) printf "%s\t%d\n", s, bytes[s] }' access.log
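Tracking a maximum per key follows the same pattern as counting. A minimal sketch, assuming a hypothetical two-column input of endpoint and latency in milliseconds; the in test avoids creating an empty entry before the first value for a key is stored:

```shell
# Hypothetical input: endpoint, latency in ms
peaks=$(printf '%s\n' '/login 120' '/search 340' '/login 95' '/search 410' '/login 180' |
  awk '!($1 in max) || $2 > max[$1] { max[$1] = $2 + 0 }
       END { for (ep in max) print ep, max[ep] }' | sort)
echo "$peaks"
```

The trailing sort is there because for (ep in max) gives no ordering guarantee.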

awk vs cut vs sed

cut: extracts fixed columns by a single-character delimiter (-d with -f) or by byte or character position (-b, -c). It is the fastest and simplest choice when the delimiter is exactly one character and consistent, such as a true CSV with no quoting or a colon-delimited file like /etc/passwd. It cannot collapse runs of whitespace, cannot reorder fields, and cannot do arithmetic; cut -d' ' on space-aligned columns is the classic failure.

awk: field-aware and programmable. Choose it when fields are separated by variable whitespace, when you need a field counted from the end ($NF), when you must filter on one column and print another, or when any arithmetic, conditional, or counting is involved. It is the right tool for the large middle ground that cut is too rigid for and a full script is overkill for.

sed: a line-oriented stream editor for substitution and deletion, with no concept of fields. Reach for it to rewrite text in place (s/old/new/ across a file), not to slice columns. When a job mixes both ("substitute in column 3 only"), awk usually wins because it can address the field directly while sed has to match the whole line.
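The cut failure and the column-scoped substitution can both be reproduced on one made-up line each. This is a sketch; the sample rows are hypothetical stand-ins for ps-style and interface-status output:

```shell
row='root        1   0.0'   # hypothetical space-padded columns

# cut treats every single space as a delimiter, so field 2 is an empty string
cut_field=$(printf '%s\n' "$row" | cut -d' ' -f2)

# awk collapses the run of spaces, so field 2 is the second visible column
awk_field=$(printf '%s\n' "$row" | awk '{ print $2 }')

# Substitute in column 3 only; column 2 holds the same text and is left alone
patched=$(echo 'eth0 up up' | awk '{ sub(/up/, "UP", $3); print }')

echo "cut=[$cut_field] awk=[$awk_field] patched=[$patched]"
```

Passing $3 as the third argument to sub() limits the replacement to that field, which is the part sed cannot express without a pattern that spells out the surrounding columns.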

Common Mistakes

Using cut -d' ' on whitespace-aligned output like ps, df, or ls -l — a column padded with multiple spaces makes cut see empty fields and return the wrong one, whereas awk's default separator collapses the run and gives you the field you meant.
Setting OFS and expecting a bare print (or print $0) to change without touching a field: OFS always separates the comma-separated arguments of print, but $0 keeps its original separators until you assign to some field, so the $1=$1 idiom is required to reformat the whole record.
Relying on default numeric comparison when a field can be non-numeric — $1 > $2 silently switches to string comparison if either side is non-numeric, so "10" can compare as less than "9"; add +0 to force arithmetic.
Writing a shell while read loop to sum or count a column: the shell interprets every line itself and typically forks an external command such as cut or expr per line, so it runs far slower than a single awk pass that does the same accumulation in one process.
Assuming for (key in arr) iterates in insertion or sorted order — the order is unspecified, so a report that must be ordered needs an explicit sort on the output or gawk's PROCINFO["sorted_in"].
Passing a shell variable into the program by string-interpolating it into the quoted script — this breaks on spaces and special characters and invites injection; use awk -v name="$value" to bind it as a proper awk variable.
Expecting gawk-only features — gensub(), FIELDWIDTHS, true 2-D arrays — to work on a stock Debian/Ubuntu box where awk is mawk; install gawk or write to the common subset if the script must be portable.
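The variable-passing advice above can be sketched as follows. The values and variable names (needle, want, path) are hypothetical; the second form uses the ENVIRON array because -v expands backslash escapes in the value, which would turn \t into a tab:

```shell
needle='a b'   # hypothetical value containing a space

# -v binds the shell value to an awk variable; the program text never changes
hit=$(printf '%s\n' 'x a b' 'y c' | awk -v want="$needle" 'index($0, want) { print $1 }')

# Values with backslashes can go through the environment instead, read via ENVIRON
raw=$(printf '%s\n' 'C:\temp ok' | path='C:\temp' awk 'index($0, ENVIRON["path"]) { print $2 }')

echo "$hit $raw"
```

In both forms the awk program stays inside single quotes, so nothing in the shell value is ever parsed as awk code.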

Best Practices

Reach for awk the moment columns are separated by variable whitespace or you need a field counted from the end ($NF) — that is exactly where cut's single-delimiter model fails.
Use associative arrays (count[$key]++) for grouping and counting in a single pass, instead of sort | uniq -c followed by a second sort, especially on large inputs where the sort dominates.
Put initialization and totals in BEGIN and END blocks — set FS/OFS and headers in BEGIN, emit sums and averages in END — so the main rules stay one job each.
Set the input separator with -F for delimited data — -F, for comma-separated, -F: for /etc/passwd, -F'\t' for TSV — rather than post-processing whitespace by hand.
Pass external values with awk -v var=value and read program text from a file with -f script.awk once a one-liner outgrows a single line; both keep quoting sane and the logic versionable.
Format columnar reports with printf and explicit field widths instead of print plus OFS — it is the only way to get aligned, fixed-precision output reliably.
Replace shell read-loops over structured text with awk — one process, one pass, predictable speed — and drop to Perl or Python only when the logic needs real data structures or libraries.

Comparable tools

Perl / Python: full languages for logic that outgrows awk, with rich data structures, regex engines, and libraries when a one-liner is no longer enough
cut: the faster, simpler choice when the delimiter is a single fixed character and no arithmetic or filtering is needed
PowerShell: the object pipeline on Windows, where fields are typed properties of objects rather than positions in a text line

Knowledge Check

Why does awk handle ps aux output correctly where cut -d' ' -f3 often returns the wrong column?

Awk's default separator collapses any run of whitespace into one split, so the multi-space padding never creates the empty fields that a fixed single-space delimiter would
Awk reads the columns by fixed byte offsets defined in a built-in per-column width map, so the exact amount of spacing in a line never affects which field it ends up returning
cut cannot read from a pipe and only accepts a file argument, so it never receives the full ps output stream in the first place
Awk automatically sorts the fields by width before splitting each record, which normalizes the irregular column spacing

You set BEGIN { OFS="," } and then run { print }, but the output is still space-separated. What is the cause?

Awk only rebuilds $0 with the new OFS once you assign to a field; without something like $1=$1, the original record is reused as-is
OFS must be set with the -v flag on the command line before the program text, and is silently ignored whenever it is instead assigned inside a BEGIN block
print ignores OFS entirely and always joins with a single space; only printf honors the output field separator
OFS actually changes the input field separator rather than the output, so it has no effect at all on what gets printed

For counting how many log lines came from each source IP, why is awk '{ c[$1]++ } END { for (i in c) print c[i], i }' preferable to sort | uniq -c on a large file?

It counts in a single pass using a hash map, avoiding the full sort that uniq requires to bring identical lines adjacent
It guarantees the output rows come back already sorted in descending order by count, so no further sort -rn step is ever needed
uniq can only tally lines that are byte-for-byte identical across their full length, so it cannot operate on a single isolated column at all
Awk loads the whole file into memory up front, and an in-memory scan is always faster than streaming it through a pipeline

A comparison $1 > $2 gives wrong results when one field is a numeric string and the other contains letters. What is happening, and how do you force a numeric test?

Awk falls back to lexical string comparison when either operand is non-numeric; adding +0 to each side ($1+0 > $2+0) forces numeric context
Awk cannot compare two raw fields directly at all; you must first copy each one into an explicitly declared integer-typed variable before running the test
The fields have to be wrapped in quotes as "$1" > "$2" so that awk recognizes them as numeric operands
Comparison in awk always works numerically by default; the real fix is to set FS to a strictly numeric separator

On a stock Ubuntu server, a script using gawk's gensub() fails with a syntax error. What is the most likely reason?

The default /usr/bin/awk is mawk, which lacks gawk-only functions; installing gawk or rewriting to the common subset resolves it
gensub() requires the -F field-separator flag to be set on the command line before it can be called at all
gensub() may only be invoked inside a BEGIN block and raises a syntax error in any per-record rule
Ubuntu disables awk's string-manipulation functions by default as a hardening measure, so each one must be explicitly re-enabled with a command-line flag
