awk
awk is a small programming language built for column- and record-oriented text. It reads input one record at a time — by default one line — splits each record into fields on whitespace, and runs your program against every record in turn. The whole language is a list of pattern { action } rules: for each record, awk tests the pattern, and where it matches, runs the action. That model is what makes a one-liner like awk '$5 > 100 { print $3 }' — "print the third column of every line whose fifth column exceeds 100" — a single expression instead of a script.
Every Debian and Ubuntu install ships an awk at /usr/bin/awk; on those systems it is usually mawk, a fast, lean implementation, while the fuller-featured GNU version is gawk (installed with apt install gawk, and the default on RHEL/Fedora). For the field-splitting, arithmetic, and counting that this page covers, mawk and gawk behave identically; the divergence only shows up in gawk extensions like gensub(), true multidimensional arrays, and FIELDWIDTHS. The practical consequence: reach for awk the moment a problem is about columns of variable-width data, where cut breaks and a shell read-loop is typically far slower.
The Pattern–Action Model
An awk program is a sequence of rules, each of the form pattern { action }. Both halves are optional. A rule with only a pattern and no action uses the default action, which is print $0 — print the whole record. A rule with only an action and no pattern runs that action on every record. So awk '/error/' prints lines containing "error" (pattern only, default print), and awk '{ print $1 }' prints the first field of every line (action only, no pattern). This is why awk so often replaces grep | cut chains: the pattern filters and the action projects, in one pass.
Patterns are not limited to regular expressions. A pattern can be any expression that evaluates to true or false — $3 == "DOWN", NF > 5, NR % 2 == 0 — or a range written as /start/,/stop/ that switches on at the first match and off at the second. The action is a block of statements in a C-like syntax: assignments, if/else, for, while, print, and printf. Awk runs the entire program once per record, top to bottom, so multiple rules compose naturally against the same input stream.
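The three kinds of pattern described above can be sketched side by side. This is a minimal illustration on made-up input; the service names, the BEGIN-MAINT/END-MAINT marker lines, and the shell variable names are all assumptions chosen for the example.

```shell
# Hypothetical sample input: service name, state, latency in ms,
# plus two marker lines that delimit a maintenance window.
sample='web UP 12
db DOWN 0
cache UP 3
BEGIN-MAINT
queue UP 40
END-MAINT
api UP 7'

# Regex pattern only: the default action prints the whole record.
regex_out=$(printf '%s\n' "$sample" | awk '/DOWN/')

# Expression pattern plus action: filter on two columns, print a third.
expr_out=$(printf '%s\n' "$sample" | awk '$2 == "UP" && $3 > 5 { print $1 }')

# Range pattern: switches on at the first match and off after the second,
# printing both boundary lines and everything between them.
range_out=$(printf '%s\n' "$sample" | awk '/^BEGIN-MAINT/,/^END-MAINT/')

printf '%s\n---\n%s\n---\n%s\n' "$regex_out" "$expr_out" "$range_out"
```

The marker line BEGIN-MAINT has no second field, so the expression rule simply does not match it; awk treats the missing field as an empty string rather than raising an error.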
Fields, Records, and Separators
Awk splits each record into fields and numbers them from $1; $0 is the entire record, $NF is the last field, and $(NF-1) is the second-to-last. Two built-in variables track position: NR is the running record number across all input, and NF is the field count of the current record. The default field separator is not a single space — it is "runs of whitespace, with leading and trailing whitespace ignored", which is exactly why awk handles columns aligned with multiple spaces or tabs that cut -d' ' mangles.
Four variables control splitting and output. FS (or the -F flag) sets the input field separator; RS sets the record separator; OFS sets the separator awk inserts between fields when you print them; and ORS sets what it appends after each record. For comma-delimited input, set -F,; to read paragraph-mode blocks separated by blank lines, set RS="". The catch with OFS trips up nearly everyone: print $1, $2 always joins its arguments with OFS, but a bare print (or print $0) emits the record as it was read. Awk only rebuilds $0 with the new separator after you assign to a field, which is what the $1=$1 idiom is for.
    # Count login sessions per source host from `who`-style output
    # (assumes the host is field 5, as in Linux who)
    who | awk '{ print $5 }' | sort | uniq -c

    # Print user (col 1) and RSS in MB (col 3, in KB) where RSS > 100 MB
    ps -eo user,pid,rss,comm | awk 'NR>1 && $3 > 102400 { printf "%-12s %6.1f MB %s\n", $1, $3/1024, $4 }'

    # Force OFS to take effect on the whole record by touching a field ($1=$1)
    awk -F: 'BEGIN{OFS="\t"} {$1=$1; print}' /etc/passwd
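To make the splitting rules concrete, here is a small sketch on a single made-up line shaped like ps or df output. The sample line and the variable names are assumptions for illustration; the behavior shown is the default field splitting and the OFS rebuild rule described above.

```shell
# Hypothetical space-aligned line, similar in shape to ps or df output.
line='root       1234   2048 /usr/sbin/sshd'

# cut treats every single space as a delimiter, so the padding after
# "root" produces empty fields and field 2 comes back empty.
cut_field=$(printf '%s\n' "$line" | cut -d' ' -f2)

# awk's default FS collapses each run of blanks into one split.
awk_field=$(printf '%s\n' "$line" | awk '{ print $2 }')

# NF is the field count; $(NF-1) and $NF address fields from the end.
tail_fields=$(printf '%s\n' "$line" | awk '{ print NF, $(NF-1), $NF }')

# A bare print emits the record as read, so OFS has no visible effect...
as_read=$(echo 'a b c' | awk 'BEGIN { OFS = "," } { print }')

# ...until a field assignment makes awk rebuild $0 with OFS.
rebuilt=$(echo 'a b c' | awk 'BEGIN { OFS = "," } { $1 = $1; print }')

printf '[%s] [%s] [%s] [%s] [%s]\n' \
  "$cut_field" "$awk_field" "$tail_fields" "$as_read" "$rebuilt"
```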
BEGIN and END Blocks
Two special patterns bracket the main loop. A BEGIN { } block runs once before any input is read; an END { } block runs once after the last record. BEGIN is where you set FS, OFS, and counters, or print a header row. END is where you emit totals, averages, or the contents of arrays you accumulated. A program can have any number of each, plus the ordinary per-record rules in between: awk runs all BEGIN blocks in source order before reading input, then the per-record rules, then all END blocks in source order.
This three-phase shape — initialize, process each record, summarize — is the structural reason awk replaces a whole class of shell scripts. Summing a column is the canonical example: awk '{ s += $1 } END { print s }' needs no array, no temporary file, and no second pass. The accumulator lives across records because awk variables persist for the life of the program, not just one record.
    # Sum and average a numeric column, with a header and a footer
    # (the n check avoids a division by zero on empty input)
    awk 'BEGIN { print "bytes" }
         { total += $1; n++ }
         END { if (n) printf "sum=%d avg=%.1f over %d rows\n", total, total/n, n }' sizes.txt
Variables, Arithmetic, and Formatting
Awk variables are untyped: a value is a string, a number, or both, and awk decides which interpretation to use from context. In a numeric context a string like "100" becomes the number 100, and a string with no leading digits becomes 0. That dynamic typing is convenient until it surprises you in comparisons. $1 > $2 compares numerically if both fields look like numbers, but lexically — string order, where "10" sorts before "9" — if either side is non-numeric. Forcing a numeric comparison is a matter of adding zero ($1+0 > $2+0); forcing a string comparison is concatenating an empty string ($1"" == $2"").
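The comparison rules above are easier to trust once you have seen them flip. The following sketch feeds awk a pair of fields and reports whether the first compares greater; the inputs and variable names are made up for the demonstration.

```shell
# Both fields look like numbers, so the comparison is numeric: 10 > 9.
numeric=$(echo '10 9' | awk '{ r = ($1 > $2) ? "yes" : "no"; print r }')

# One field is non-numeric, so awk compares as strings: "10" < "9a".
mixed=$(echo '10 9a' | awk '{ r = ($1 > $2) ? "yes" : "no"; print r }')

# Adding zero forces numeric context; "9a" converts to 9.
forced_num=$(echo '10 9a' | awk '{ r = ($1+0 > $2+0) ? "yes" : "no"; print r }')

# Concatenating "" forces string context even for numeric-looking fields.
forced_str=$(echo '10 9' | awk '{ r = ($1"" > $2"") ? "yes" : "no"; print r }')

echo "numeric=$numeric mixed=$mixed forced_num=$forced_num forced_str=$forced_str"
```

The result is assigned to a variable before printing because a bare > inside a print statement is parsed as output redirection.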
For output, print is the blunt instrument and printf is the precise one. print joins its arguments with OFS and appends ORS; printf takes a C-style format string and emits exactly what you specify — field widths, decimal places, zero-padding, no automatic newline. When a report needs columns to line up or numbers to show one decimal place, use printf "%-20s %8.2f\n" and control every character, rather than fighting print and OFS.
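As a rough illustration of the difference, the sketch below prints the same two made-up records once with print and OFS and once with printf. The pipe characters in the format string are only there to make the column edges visible.

```shell
# print joins its arguments with OFS and appends ORS.
joined=$(printf 'alpha 3.14159\nbeta 42\n' |
  awk 'BEGIN { OFS = "-" } { print $1, $2 }')

# printf controls width, alignment, and precision, and needs its own \n.
# %-8s  left-aligns the name in 8 columns
# %8.2f right-aligns the number in 8 columns with 2 decimals
report=$(printf 'alpha 3.14159\nbeta 42\n' |
  awk '{ printf "%-8s|%8.2f|\n", $1, $2 }')

printf '%s\n\n%s\n' "$joined" "$report"
```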
Associative Arrays
The feature that elevates awk from a field-printer to a real data tool is the associative array: a hash map whose keys are arbitrary strings. You never declare or size one — referencing count[$1]++ creates the entry on first touch and increments it. This single idiom does grouping and counting in one pass over the data, with no prior sort, which is precisely the work that otherwise needs sort | uniq -c plus a second sort.
In an END block you iterate the array with for (key in arr) to print every group and its tally. The pattern generalizes far past counting: sum bytes per IP, track the max latency per endpoint, collect the set of users seen per host. The one gotcha is iteration order — for (key in arr) visits keys in an unspecified order in mawk and standard gawk, so pipe the output through sort when you need it ordered, or use gawk's PROCINFO["sorted_in"].
    # Top talkers: count requests per client IP in an access log, show the top 10
    awk '{ hits[$1]++ } END { for (ip in hits) print hits[ip], ip }' access.log \
      | sort -rn | head -10

    # Total response bytes per HTTP status code (status in $9, bytes in $10)
    awk '{ bytes[$9] += $10 } END { for (s in bytes) printf "%s\t%d\n", s, bytes[s] }' access.log
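The same idiom extends past counting. As one possible sketch of the "max latency per endpoint" case mentioned above, the example below keeps the largest value seen per key; the two-column log format and the variable names are assumptions, not a real log layout.

```shell
# Hypothetical log: endpoint in column 1, latency in ms in column 2.
log='/login 120
/search 340
/login 95
/search 410
/login 180'

# Keep the largest latency per endpoint. ($1 in max) tests for the key
# without creating it; v = $2 + 0 forces a numeric comparison.
# The final sort makes the unspecified iteration order deterministic.
max_out=$(printf '%s\n' "$log" |
  awk '{ v = $2 + 0; if (!($1 in max) || v > max[$1]) max[$1] = v }
       END { for (ep in max) print ep, max[ep] }' | sort)

printf '%s\n' "$max_out"
```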
cut — extracts fixed columns by a single-character delimiter (-d) or by byte/character position (-c). It is the fastest and simplest choice when the delimiter is exactly one character and consistent — a true CSV with no quoting, or a colon-delimited file like /etc/passwd. It cannot collapse runs of whitespace, cannot reorder fields, and cannot do arithmetic; cut -d' ' on space-aligned columns is the classic failure.
awk — field-aware and programmable. Choose it when fields are separated by variable whitespace, when you need a field counted from the end ($NF), when you must filter on one column and print another, or when any arithmetic, conditional, or counting is involved. It is the right tool for the large middle ground that cut is too rigid for and a full script is overkill for.
sed — a line-oriented stream editor for substitution and deletion, with no concept of fields. Reach for it to rewrite text in place — s/old/new/ across a file — not to slice columns. When a job mixes both ("substitute in column 3 only"), awk usually wins because it can address the field directly while sed has to match the whole line.
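The "substitute in column 3 only" case can be sketched as follows. The three-column row is made up for the example; the point is that sub() accepts a field as its target, while a plain sed substitution hits the first match anywhere on the line.

```shell
# Hypothetical row where the same word appears in columns 1 and 3.
row='down web down'

# sed has no notion of fields: s/// rewrites the first match on the line.
sed_out=$(printf '%s\n' "$row" | sed 's/down/DOWN/')

# awk can target the field directly; the third argument of sub() is the
# string to modify, and assigning to $3 rebuilds $0 before the print.
awk_out=$(printf '%s\n' "$row" | awk '{ sub(/^down$/, "DOWN", $3); print }')

printf 'sed: %s\nawk: %s\n' "$sed_out" "$awk_out"
```

Because the field assignment rebuilds the record with OFS, any original multi-space alignment collapses to single spaces in the awk output.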
- Using cut -d' ' on whitespace-aligned output like ps, df, or ls -l: a column padded with multiple spaces makes cut see empty fields and return the wrong one, whereas awk's default separator collapses the run and gives you the field you meant.
- Setting OFS and expecting a bare print (or print $0) to change without touching a field: awk only rebuilds $0 with the new OFS after you assign to some field, so the $1=$1 idiom is required to force the reformat.
- Relying on default numeric comparison when a field can be non-numeric: $1 > $2 silently switches to string comparison if either side is non-numeric, so "10" compares as less than "9a"; add +0 to force arithmetic.
- Writing a shell while read loop to sum or count a column: it often forks a subprocess per line, and even without forks it typically runs far slower than a single awk pass that does the same accumulation in one process.
- Assuming for (key in arr) iterates in insertion or sorted order: the order is unspecified, so a report that must be ordered needs an explicit sort on the output or gawk's PROCINFO["sorted_in"].
- Passing a shell variable into the program by string-interpolating it into the quoted script: this breaks on spaces and special characters and invites injection; use awk -v name="$value" to bind it as a proper awk variable.
- Expecting gawk-only features (gensub(), FIELDWIDTHS, true 2-D arrays) to work on a stock Debian/Ubuntu box where awk is mawk: install gawk or write to the common subset if the script must be portable.
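The variable-passing pitfall is worth a concrete sketch. The lookup value below deliberately contains a double quote and a space, which would break a script built by string interpolation; the data and names are hypothetical.

```shell
# Hypothetical lookup value with a quote and a space in it.
user='O"Brien jr'

# Hypothetical colon-delimited data: name, then numeric id.
data='O"Brien jr:1001
alice:1002'

# -v binds the shell value to an awk variable before the program runs,
# so the program text stays a fixed, single-quoted string.
uid=$(printf '%s\n' "$data" |
  awk -F: -v name="$user" '$1 == name { print $2 }')

echo "uid=$uid"
```

One caveat: awk processes backslash escape sequences in -v values, so a value that may contain backslashes is safer exported and read through ENVIRON["name"] instead.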
- Reach for awk the moment columns are separated by variable whitespace or you need a field counted from the end ($NF); that is exactly where cut's single-delimiter model fails.
- Use associative arrays (count[$key]++) for grouping and counting in a single pass, instead of sort | uniq -c followed by a second sort, especially on large inputs where the sort dominates.
- Put initialization and totals in BEGIN and END blocks (set FS/OFS and headers in BEGIN, emit sums and averages in END) so the main rules stay one job each.
- Set the input separator with -F for delimited data (-F, for comma-separated, -F: for /etc/passwd, -F'\t' for TSV) rather than post-processing whitespace by hand.
- Pass external values with awk -v var=value and read program text from a file with -f script.awk once a one-liner outgrows a single line; both keep quoting sane and the logic versionable.
- Format columnar reports with printf and explicit field widths instead of print plus OFS; it is the only way to get aligned, fixed-precision output reliably.
- Replace shell read-loops over structured text with awk: one process, one pass, predictable speed. Drop to Perl or Python only when the logic needs real data structures or libraries.
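Once a one-liner grows, moving it into a file and combining -f with -v might look like the sketch below. The temporary file stands in for a versioned script.awk; the field layout, the min threshold variable, and the sample input are all assumptions for illustration.

```shell
# Write a small awk program to a temporary file (a stand-in for script.awk).
script=$(mktemp)
cat > "$script" <<'EOF'
# Count records per category (field 2) and report categories
# seen at least `min` times; `min` is supplied with -v.
BEGIN { FS = ":" }
{ count[$2]++ }
END {
    for (k in count)
        if (count[k] >= min)
            printf "%-4s %d\n", k, count[k]
}
EOF

# -f reads the program from the file; -v passes the threshold in.
# sort makes the unspecified array iteration order deterministic.
out=$(printf 'a:x\nb:y\nc:x\nd:z\ne:y\nf:x\n' |
  awk -v min=2 -f "$script" | sort)

rm -f "$script"
printf '%s\n' "$out"
```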
Knowledge Check
Why does awk handle ps aux output correctly where cut -d' ' -f3 often returns the wrong column?
- Awk's default separator collapses any run of whitespace into one split, so the multi-space padding never creates the empty fields that a fixed single-space delimiter would
- Awk reads the columns by fixed byte offsets defined in a built-in per-column width map, so the exact amount of spacing in a line never affects which field it ends up returning
- cut cannot read from a pipe and only accepts a file argument, so it never receives the full ps output stream in the first place
- Awk automatically sorts the fields by width before splitting each record, which normalizes the irregular column spacing
You set BEGIN { OFS="," } and then { print }, but the output is still space-separated for unchanged lines. What is the cause?
- Awk only rebuilds $0 with the new OFS once you assign to a field; without something like $1=$1, the original record is reused as-is
- OFS must be set with the -v flag on the command line before the program text, and is silently ignored whenever it is instead assigned inside a BEGIN block
- print ignores OFS entirely and always joins with a single space; only printf honors the output field separator
- OFS actually changes the input field separator rather than the output, so it has no effect at all on what gets printed
For counting how many log lines came from each source IP, why is awk '{ c[$1]++ } END { for (i in c) print c[i], i }' preferable to sort | uniq -c on a large file?
- It counts in a single pass using a hash map, avoiding the full sort that uniq requires to bring identical lines adjacent
- It guarantees the output rows come back already sorted in descending order by count, so no further sort -rn step is ever needed
- uniq can only tally lines that are byte-for-byte identical across their full length, so it cannot operate on a single isolated column at all
- Awk loads the whole file into memory up front, and an in-memory scan is always faster than streaming it through a pipeline
A comparison $1 > $2 gives wrong results when one field is a numeric string and the other contains letters. What is happening, and how do you force a numeric test?
- Awk falls back to lexical string comparison when either operand is non-numeric; adding +0 to each side ($1+0 > $2+0) forces numeric context
- Awk cannot compare two raw fields directly at all; you must first copy each one into an explicitly declared integer-typed variable before running the test
- The fields have to be wrapped in quotes as "$1" > "$2" so that awk recognizes them as numeric operands
- Comparison in awk always works numerically by default; the real fix is to set FS to a strictly numeric separator
On a stock Ubuntu server, a script using gawk's gensub() fails with a syntax error. What is the most likely reason?
- The default /usr/bin/awk is mawk, which lacks gawk-only functions; installing gawk or rewriting to the common subset resolves it
- gensub() requires the -F field-separator flag to be set on the command line before it can be called at all
- gensub() may only be invoked inside a BEGIN block and raises a syntax error in any per-record rule
- Ubuntu disables awk's string-manipulation functions by default as a hardening measure, so each one must be explicitly re-enabled with a command-line flag