Topic 23

Field Tools

Text Processing, Coreutils

Field tools are the small, single-purpose GNU coreutils filters that compose into data pipelines: sort orders lines, uniq collapses adjacent duplicates, cut slices columns out of each line, and tr translates or deletes characters. Each does exactly one job, and a pipe chains them into a transform — they turn a stream of log lines into counts and reports without an editor or a script. They ship in the same coreutils package on every Debian, Ubuntu, RHEL, and Fedora install, so a one-liner you write on Ubuntu 24.04 runs unchanged on RHEL 9.

The operational payoff is triage speed. The top ten client IPs in an Nginx access log, the unique exit codes across a journal dump, the difference between two installed-package lists — each is one line and a few milliseconds. The trade-off is that these tools are deliberately dumb: they split on a literal delimiter, compare only adjacent lines, and know nothing about quoted CSV or multibyte characters. Knowing where that dumbness bites — and when to graduate to awk — is the whole skill.

sort — Ordering Lines

sort orders lines, and its flags decide how. By default the comparison is lexical under the current locale, which is why 10 sorts before 2 until you add -n for numeric order. -h understands human sizes like 4K and 2G, -r reverses, and -u drops duplicates as it sorts. For keyed sorts, -t sets a field delimiter and -k selects the key, so sort -t: -k3,3 -n /etc/passwd orders accounts by numeric UID.

Always bound the key — write -k3,3, never bare -k3, or the comparison runs from field 3 to the end of the line and produces order you did not ask for. The other surprise is locale: under a UTF-8 locale, collation rules fold case and punctuation in ways that differ between machines and run slower on large files. Prefix the command with LC_ALL=C when you want deterministic byte ordering and maximum speed.

# Numeric sort by UID, colon-delimited, key bounded to field 3
sort -t':' -k3,3 -n /etc/passwd

# Deterministic, fast byte-order sort of a large log
LC_ALL=C sort -o big.sorted big.log
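The difference between lexical, numeric, and human-size ordering is easiest to see on a tiny input. The following is a minimal sketch using inline sample values (the numbers and sizes are made up for illustration) and assuming GNU sort for -h:

```shell
# Inline sample data so the sketch runs anywhere.
nums=$(printf '10\n9\n100\n2\n')

# Byte order compares character by character, so "10" and "100" land before "2".
lexical=$(printf '%s\n' "$nums" | LC_ALL=C sort | paste -s -d ' ' -)

# -n compares by numeric value instead.
numeric=$(printf '%s\n' "$nums" | LC_ALL=C sort -n | paste -s -d ' ' -)

# -h (GNU sort) ranks by size suffix first, then by the number.
sizes=$(printf '2G\n4K\n512M\n' | LC_ALL=C sort -h | paste -s -d ' ' -)

echo "lexical: $lexical"   # lexical: 10 100 2 9
echo "numeric: $numeric"   # numeric: 2 9 10 100
echo "sizes:   $sizes"     # sizes:   4K 512M 2G
```

paste -s is used here only to print each result on one line for comparison.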

uniq — Collapsing Adjacent Duplicates

uniq collapses runs of identical adjacent lines, so it is only correct on sorted input — it never re-orders and never looks past the previous line. uniq -c prepends a count, uniq -d shows only the duplicated lines, and uniq -u shows only the lines that appeared exactly once. The single most useful idiom in log work falls straight out of this: sort | uniq -c | sort -rn counts occurrences and ranks them highest-first.

Because uniq compares whole lines by default, three flags narrow the comparison: -f N skips the first N fields, -s N skips the first N characters, and -w N compares at most N characters. Skipping is what you want when a timestamp prefix would otherwise make every line unique; -w helps when only a leading portion of the line should count. The classic bug is running uniq on unsorted data and concluding it is broken; it is doing exactly what it promises, which is to merge only neighbours.

# Top 10 client IPs in an Nginx access log
cut -d' ' -f1 /var/log/nginx/access.log \
  | sort | uniq -c | sort -rn | head -n 10
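The adjacency rule and the field-skipping flag can both be checked on a few inline lines. This is a small sketch with hypothetical sample data, not output from a real log:

```shell
# Hypothetical sample: two values, interleaved so no equal lines are adjacent.
data=$(printf 'b\na\nb\na\nb\n')

# Without a sort nothing merges, so uniq -c emits one row per input line.
unsorted_rows=$(printf '%s\n' "$data" | uniq -c | wc -l | tr -d ' ')

# After a sort each value forms one run; awk only normalizes the count padding.
ranked=$(printf '%s\n' "$data" | LC_ALL=C sort | uniq -c | LC_ALL=C sort -rn \
  | awk '{print $1 ":" $2}' | paste -s -d ' ' -)

# -f 1 skips the leading timestamp field, so the two "disk full" lines merge.
log=$(printf '10:00 disk full\n10:01 disk full\n10:02 link down\n')
merged_rows=$(printf '%s\n' "$log" | uniq -f 1 | wc -l | tr -d ' ')

echo "$unsorted_rows"   # 5
echo "$ranked"          # 3:b 2:a
echo "$merged_rows"     # 2
```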

cut — Slicing Columns

cut pulls fields or character ranges out of each line. -d sets the delimiter and -f picks fields, so cut -d',' -f1,3 keeps the first and third comma-separated fields. The default delimiter is a single TAB, so a bare cut -f works cleanly on TSV exports; colon-delimited files such as /etc/passwd need an explicit -d':'. Use -c for fixed character columns, as in cut -c1-8 to grab an eight-character timestamp prefix.

The hard limit is that the delimiter is a single literal byte. cut cannot collapse runs of spaces, so it falls apart on the space-aligned output of ps aux or ls -l, and it has no notion of quoting, so a CSV field containing a comma inside quotes gets split in the wrong place. For variable whitespace or quoted data, awk — which splits on whitespace runs by default — is the correct tool, not cut.

# Usernames and login shells from /etc/passwd
cut -d':' -f1,7 /etc/passwd

# WRONG: cut cannot parse space-aligned ps output
ps aux | cut -d' ' -f2   # garbled — use: awk '{print $2}'
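Both failure modes can be reproduced without running ps. The sketch below uses hypothetical rows padded the way ps or ls -l pad their columns, plus one made-up quoted CSV record:

```shell
# Hypothetical space-aligned rows: the first has padding, the second a single space.
rows=$(printf 'root       1  0.0\nwww-data 812  1.4\n')

# cut treats every single space as a delimiter, so padding creates empty fields.
with_cut=$(printf '%s\n' "$rows" | cut -d' ' -f2 | paste -s -d ',' -)

# awk splits on runs of whitespace by default, so field 2 is the same column in both rows.
with_awk=$(printf '%s\n' "$rows" | awk '{print $2}' | paste -s -d ',' -)

# cut has no notion of quoting: the comma inside the quoted name is a delimiter too.
quoted=$(printf '%s\n' '"Smith, Jane",42' | cut -d',' -f2)

echo "cut: $with_cut"    # cut: ,812
echo "awk: $with_awk"    # awk: 1,812
echo "csv: [$quoted]"    # csv: [ Jane"]
```

The cut result is worse than a plain error: the second row happens to look right, which is how this mistake survives a quick spot check.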

tr — Translating and Deleting Characters

tr maps or removes characters from a stream, and it reads standard input only — it takes no filename arguments. tr 'a-z' 'A-Z' upcases, tr -d '\r' strips carriage returns when converting Windows CRLF line endings to Unix LF, tr -s ' ' squeezes runs of spaces down to one, and tr -c complements a set, so tr -cd '[:print:]\n' deletes everything that is not a printable character or a newline.

The limitation to keep in mind is that tr is byte-oriented. It has no real Unicode support, so under a UTF-8 locale a multibyte character is processed as its individual bytes, and case ranges like a-z only cover ASCII. For accented or non-Latin text, reach for sed or a locale-aware language, not tr. Within ASCII, though, it is the fastest way to do whitespace cleanup, newline conversion, and quick squeezing.

# Convert Windows CRLF to Unix LF
tr -d '\r' < dos.txt > unix.txt

# Squeeze repeated spaces, then cut the now single-space field
echo "a    b     c" | tr -s ' ' | cut -d' ' -f2   # prints: b
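The other tr forms mentioned above can be exercised the same way. This is a minimal sketch on inline strings; the sample text is made up, and the accented example assumes UTF-8 input written as octal escapes so the snippet itself stays ASCII:

```shell
# Upcase ASCII letters.
upper=$(printf 'error: disk full\n' | tr 'a-z' 'A-Z')

# Keep only printable characters and newlines; the \001 control byte and the tab are dropped.
clean=$(printf 'ok\001\tdone\n' | LC_ALL=C tr -cd '[:print:]\n')

# Deleting \r removes one byte per line: 14 bytes in, 12 bytes out.
lf_bytes=$(printf 'line1\r\nline2\r\n' | tr -d '\r' | wc -c | tr -d ' ')

# The a-z range only covers ASCII, so the accented letter (bytes \303\251) is left as-is.
accent=$(printf 'caf\303\251\n' | tr 'a-z' 'A-Z')

echo "$upper"      # ERROR: DISK FULL
echo "$clean"      # okdone
echo "$lf_bytes"   # 12
echo "$accent"
```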

The Top-N Counting Pipeline

The canonical use of these four tools together is frequency counting. Extract the field you care about with cut, normalize it with tr if needed, then run sort | uniq -c | sort -rn: the first sort brings identical values together, uniq -c counts each run, and the second numeric reverse sort ranks them. This pattern answers most "which is most common" questions about logs in a single line, and it is the reason sort and uniq are almost always seen as a pair.

When the logic grows past extraction and counting — picking a column by condition, summing a field, grouping on a key — stop stacking filters and switch to awk. A pipeline of four cut and tr stages that an awk one-liner would replace is harder to read and slower, because each stage is a separate process copying the whole stream.
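Put together, the pattern and its awk replacement might look like the sketch below. The log lines are hypothetical (client, method, path), and the awk program is one possible way to express "count only GET requests per client", not the only one:

```shell
# Hypothetical access-log-like lines: client, method, path.
log=$(printf '%s\n' \
  '10.0.0.1 GET /' \
  '10.0.0.2 get /a' \
  '10.0.0.1 POST /b' \
  '10.0.0.3 Get /' \
  '10.0.0.1 GET /c')

# Extract the method, normalize its case, then count and rank.
top=$(printf '%s\n' "$log" \
  | cut -d' ' -f2 \
  | tr 'A-Z' 'a-z' \
  | LC_ALL=C sort | uniq -c | LC_ALL=C sort -rn | head -n 2)

# Once a condition is involved, one awk stage replaces the cut and tr stages.
get_per_client=$(printf '%s\n' "$log" \
  | awk 'tolower($2) == "get" { n[$1]++ } END { for (ip in n) print n[ip], ip }' \
  | LC_ALL=C sort -rn)

echo "$top"
echo "$get_per_client" | head -n 1   # 2 10.0.0.1
```

In the first pipeline the method "get" is counted four times and "post" once; the exact padding in front of each count depends on the uniq implementation.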

sort -u vs uniq

sort -u — deduplicates globally in one pass, comparing across the entire input. It needs no prior sort because it is the sort, and with a key (sort -u -k2,2) it keeps one line per key value. Reach for it when you just want distinct lines or distinct keys.

uniq — only collapses adjacent equal lines, so it must follow a sort, and it compares whole lines by default. Choose it when you need its extras that sort -u has no equivalent for: -c to count occurrences, -d to show only duplicates, -u to show only singletons.
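The keyed case is where the two diverge, which a short count makes visible. A small sketch with hypothetical "user role" rows:

```shell
# Hypothetical rows: user name, then role. One row is an exact duplicate.
data=$(printf 'alice admin\nbob admin\ncarol dev\nalice admin\n')

# Whole-line dedup: three distinct lines remain.
whole=$(printf '%s\n' "$data" | LC_ALL=C sort | uniq | wc -l | tr -d ' ')

# Plain sort -u gives the same result as sort | uniq.
plain_u=$(printf '%s\n' "$data" | LC_ALL=C sort -u | wc -l | tr -d ' ')

# Keyed dedup: one line per distinct value of field 2 (admin, dev).
keyed=$(printf '%s\n' "$data" | LC_ALL=C sort -u -k2,2 | wc -l | tr -d ' ')

echo "$whole $plain_u $keyed"   # 3 3 2
```

Only the counts are shown because which of the equal-key lines survives is not something to rely on without reading the sort documentation for your version.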

Common Mistakes

Running uniq on unsorted input. It only merges adjacent lines, so identical entries scattered through the file are counted separately and the totals are silently wrong, with no error to warn you.
Lexically sorting numbers without -n, which puts 10 before 2 and 100 before 9. A "top by count" report comes out ranked alphabetically by digit, not numerically.
Using cut -d' ' on space-aligned output like ps aux or ls -l. A single-byte delimiter cannot collapse runs of spaces, so fields land in the wrong column — use awk, which splits on whitespace runs by default.
Sorting without LC_ALL=C when you need stable, reproducible output. A UTF-8 locale changes collation between machines and runs measurably slower on large files.
Expecting tr to handle multibyte UTF-8 by character. It works on bytes, so accented or non-Latin text is mangled and case ranges like a-z only touch ASCII.
Leaving -k unbounded, writing -k2 instead of -k2,2. The key then extends to end of line and a sort you expected to order on one field quietly orders on the whole tail.
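The unbounded-key mistake in particular is easy to miss because both commands succeed. A minimal sketch on two made-up rows that share the same second field:

```shell
# Two hypothetical rows with identical field 2 and differing field 3.
rows=$(printf 'x a 2\ny a 1\n')

# Bounded key: field 2 ties, so sort falls back to comparing the whole line.
bounded=$(printf '%s\n' "$rows" | LC_ALL=C sort -k2,2 | paste -s -d ',' -)

# Unbounded key: the comparison runs from field 2 to end of line, so field 3 decides.
unbounded=$(printf '%s\n' "$rows" | LC_ALL=C sort -k2 | paste -s -d ',' -)

echo "bounded:   $bounded"     # bounded:   x a 2,y a 1
echo "unbounded: $unbounded"   # unbounded: y a 1,x a 2
```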

Best Practices

Sort before every uniq, and sort on the same field uniq will compare, so the adjacency uniq assumes actually holds.
Pass -n whenever the key is numeric, and -h when it carries size suffixes like K/M/G, so the order is arithmetic rather than alphabetic.
Prefix heavy sorts with LC_ALL=C for deterministic byte ordering and faster runs on large logs.
Bound every sort key explicitly — write -k2,2, not -k2 — so the comparison stops at the intended field.
Reach for the sort | uniq -c | sort -rn idiom for ranked frequency counts; it is the fastest path to top-N answers from logs.
Switch to awk the moment field logic grows past extraction — conditionals, arithmetic, grouping — instead of stacking fragile cut and tr stages.

Comparable tools

awk: the field-aware language to graduate to when cut's fixed-delimiter model fails or the logic needs conditions and arithmetic
PowerShell: Sort-Object and Group-Object operate on object properties, so there is no delimiter or prior-sort precondition
datamash: a coreutils-style filter for grouped statistics (sum, mean, count) that goes beyond what uniq -c can express

Knowledge Check

You pipe an Nginx log straight into uniq -c without sorting first. What happens?

Only adjacent duplicates collapse, so identical lines spread through the file are counted separately and the totals are wrong
uniq quietly buffers and sorts the whole input stream internally before counting, so the per-line totals still come out completely correct
uniq detects the unsorted input and exits with a non-zero error rather than emitting any counts
The whole file is treated as a single duplicate group and reported just once with the full line count

When does sort -u behave differently from sort | uniq?

With a key like -k2,2: sort -u keeps one line per key value, while uniq dedups on the whole line
Never — the two pipelines are exactly equivalent and produce byte-identical output in every case
sort -u needs the input already pre-sorted to dedup correctly, whereas sort | uniq does not
uniq can deduplicate without any sort step, so it catches scattered duplicates that sort -u misses entirely

Why is awk '{print $2}' the right choice over cut -d' ' -f2 for ps aux output?

cut's single-byte delimiter cannot collapse the multiple spaces between columns, while awk splits on whitespace runs by default
cut can only read from a named file argument and refuses to take input from a pipe
awk is faster here because it is compiled ahead of time while cut runs through an interpreter
cut numbers its fields starting from zero rather than one, so -f2 ends up selecting the third column instead of the intended one

A "top counts" report sorted with plain sort -r shows 100 ranked below 99. What fixes it?

Add -n so the sort compares numerically instead of lexically
Add LC_ALL=C to force byte-order comparison
Pipe through uniq -c a second time to recount
Bound the key with -k1,1 so only the count field is compared
