Field Tools
Field tools are the small, single-purpose GNU coreutils filters that compose into data pipelines: sort orders lines, uniq collapses adjacent duplicates, cut slices columns out of each line, and tr translates or deletes characters. Each does exactly one job, and a pipe chains them into a transform — they turn a stream of log lines into counts and reports without an editor or a script. They ship in the same coreutils package on every Debian, Ubuntu, RHEL, and Fedora install, so a one-liner you write on Ubuntu 24.04 runs unchanged on RHEL 9.
The operational payoff is triage speed. The top ten client IPs in an Nginx access log, the unique exit codes across a journal dump, the difference between two installed-package lists — each is one line and a few milliseconds. The trade-off is that these tools are deliberately dumb: they split on a literal delimiter, compare only adjacent lines, and know nothing about quoted CSV or multibyte characters. Knowing where that dumbness bites — and when to graduate to awk — is the whole skill.
sort — Ordering Lines
sort orders lines, and its flags decide how. By default the comparison is lexical under the current locale, which is why 10 sorts before 2 until you add -n for numeric order. -h understands human sizes like 4K and 2G, -r reverses, and -u drops duplicates as it sorts. For keyed sorts, -t sets a field delimiter and -k selects the key, so sort -t: -k3,3 -n /etc/passwd orders accounts by numeric UID.
Always bound the key — write -k3,3, never bare -k3, or the comparison runs from field 3 to the end of the line and produces order you did not ask for. The other surprise is locale: under a UTF-8 locale, collation rules fold case and punctuation in ways that differ between machines and run slower on large files. Prefix the command with LC_ALL=C when you want deterministic byte ordering and maximum speed.
    # Numeric sort by UID, colon-delimited, key bounded to field 3
    sort -t':' -k3,3 -n /etc/passwd

    # Deterministic, fast byte-order sort of a large log
    LC_ALL=C sort big.log -o big.sorted
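The flag differences are easier to see on a tiny input. The sketch below uses invented values and three invented sample lines, and it assumes GNU sort (for -h). It is an illustration of the behaviour described above, not a recipe for any particular log format.

```shell
# Sample values are invented for illustration; GNU sort is assumed for -h.
nums=$(printf '%s\n' 10 2 9 100)

# Default comparison is lexical: "10" and "100" sort ahead of "2".
lexical=$(printf '%s\n' "$nums" | LC_ALL=C sort)

# -n compares the values as numbers.
numeric=$(printf '%s\n' "$nums" | LC_ALL=C sort -n)

# -h understands size suffixes such as K, M, and G.
human=$(printf '%s\n' 2G 4K 512M | LC_ALL=C sort -h)

# Bounded versus unbounded key on three whitespace-separated lines.
rows=$(printf '%s\n' 'b 2 a' 'a 2 z' 'c 1 m')
unbounded=$(printf '%s\n' "$rows" | LC_ALL=C sort -k2)    # key runs from field 2 to end of line
bounded=$(printf '%s\n' "$rows" | LC_ALL=C sort -k2,2)    # key is field 2 only

# Unquoted expansion joins the lines with spaces for compact display.
echo "lexical:" $lexical
echo "numeric:" $numeric
echo "human:  " $human
printf 'unbounded -k2:\n%s\n' "$unbounded"
printf 'bounded -k2,2:\n%s\n' "$bounded"
```

With the unbounded key, the two lines that share field 2 are ordered by the trailing third field. With the bounded key they tie on field 2, and sort falls back to comparing the whole line, so they come out in a different order.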
uniq — Collapsing Adjacent Duplicates
uniq collapses runs of identical adjacent lines, so it is only correct on sorted input — it never re-orders and never looks past the previous line. uniq -c prepends a count, uniq -d shows only the duplicated lines, and uniq -u shows only the lines that appeared exactly once. The single most useful idiom in log work falls straight out of this: sort | uniq -c | sort -rn counts occurrences and ranks them highest-first.
Because uniq compares whole lines by default, it offers three ways to narrow the comparison: -f N skips the first N fields, -s N skips the first N characters, and -w N limits the comparison to at most N characters. The skipping options are what help when a timestamp prefix would otherwise make every line unique. The classic bug is running uniq on unsorted data and concluding it is broken; it is doing exactly what it promises, which is to merge only neighbours.
    # Top 10 client IPs in an Nginx access log
    cut -d' ' -f1 /var/log/nginx/access.log \
      | sort | uniq -c | sort -rn | head -n 10
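The adjacency rule can be checked directly on a three-line input. This is a minimal sketch with made-up data; the event lines and their timestamps are hypothetical, and the trailing awk call only strips the padding that uniq -c puts in front of each count.

```shell
# Made-up input: "a" appears twice but not on neighbouring lines.
input=$(printf '%s\n' a b a)

# Unsorted: uniq sees three separate runs, so "a" is reported twice with count 1.
unsorted_groups=$(printf '%s\n' "$input" | uniq -c | wc -l | tr -d ' ')

# Sorted first: the two "a" lines become adjacent and collapse into one run.
sorted_counts=$(printf '%s\n' "$input" | LC_ALL=C sort | uniq -c | awk '{print $1, $2}')

# -f1 skips the first field, so a differing timestamp no longer blocks a match.
events=$(printf '%s\n' '10:00 disk full' '10:01 disk full' '10:02 net down')
by_message=$(printf '%s\n' "$events" | uniq -f1 -c | awk '{print $1, $3, $4}')

echo "groups without sort: $unsorted_groups"
printf 'counts after sort:\n%s\n' "$sorted_counts"
printf 'counts ignoring timestamp:\n%s\n' "$by_message"
```

The events are already grouped by message here, so -f1 works without a sort. On real logs where the same message recurs later, the lines would first need sorting on the message fields.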
cut — Slicing Columns
cut pulls fields or character ranges out of each line. -d sets the delimiter and -f picks fields, so cut -d',' -f1,3 keeps the first and third comma-separated fields. The default delimiter is a single TAB, which is why a bare cut -f works cleanly on TSV exports; for colon-separated files such as /etc/passwd, pass -d':' instead. Use -c for fixed character columns, as in cut -c1-8 to grab an eight-character timestamp prefix.
The hard limit is that the delimiter is a single literal byte. cut cannot collapse runs of spaces, so it falls apart on the space-aligned output of ps aux or ls -l, and it has no notion of quoting, so a CSV field containing a comma inside quotes gets split in the wrong place. For variable whitespace or quoted data, awk — which splits on whitespace runs by default — is the correct tool, not cut.
    # Usernames and login shells from /etc/passwd
    cut -d':' -f1,7 /etc/passwd

    # WRONG: cut cannot parse space-aligned ps output
    ps aux | cut -d' ' -f2   # garbled; use: awk '{print $2}'
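Both failure modes can be reproduced without a live process table. The sketch below uses one invented line in the style of space-aligned ps output and one invented CSV record; neither is real command output.

```shell
# One made-up line in the style of space-aligned ps output.
row='root         1  0.0  0.1 /sbin/init'

# cut treats every single space as a delimiter, so field 2 is the empty
# string between the first and second space.
cut_pid=$(printf '%s\n' "$row" | cut -d' ' -f2)

# awk splits on runs of whitespace, so $2 is the PID column.
awk_pid=$(printf '%s\n' "$row" | awk '{print $2}')

# cut also ignores CSV quoting: the comma inside the quotes still splits.
csv_field=$(printf '%s\n' '"Smith, J",42' | cut -d',' -f2)

echo "cut field 2: [$cut_pid]"
echo "awk field 2: [$awk_pid]"
echo "csv field 2: [$csv_field]"
```

The brackets in the output make the empty cut result visible, and the CSV case shows the second half of the quoted name where the number 42 was intended.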
tr — Translating and Deleting Characters
tr maps or removes characters from a stream, and it reads standard input only — it takes no filename arguments. tr 'a-z' 'A-Z' upcases, tr -d '\r' strips carriage returns when converting Windows CRLF line endings to Unix LF, tr -s ' ' squeezes runs of spaces down to one, and tr -c complements a set, so tr -cd '[:print:]\n' deletes everything that is not a printable character or a newline.
The limitation to keep in mind is that tr is byte-oriented. It has no real Unicode support, so under a UTF-8 locale a multibyte character is processed as its individual bytes, and case ranges like a-z only cover ASCII. For accented or non-Latin text, reach for sed or a locale-aware language, not tr. Within ASCII, though, it is the fastest way to do whitespace cleanup, newline conversion, and quick squeezing.
    # Convert Windows CRLF to Unix LF
    tr -d '\r' < dos.txt > unix.txt

    # Squeeze repeated spaces, then cut the now single-space field
    echo "a   b   c" | tr -s ' ' | cut -d' ' -f2   # prints: b
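The four tr modes described above can be exercised on inline strings. This is a small sketch with made-up inputs; LC_ALL=C is set where a character class is involved so that the class is limited to ASCII.

```shell
# Inputs are made up for illustration.

# Translate: map lowercase ASCII letters to uppercase.
upper=$(printf 'nginx\n' | LC_ALL=C tr 'a-z' 'A-Z')

# Delete: strip carriage returns, then count the remaining bytes.
lf_bytes=$(printf 'one\r\ntwo\r\n' | tr -d '\r' | wc -c | tr -d ' ')

# Squeeze: collapse each run of spaces to a single space.
squeezed=$(printf 'a   b    c\n' | tr -s ' ')

# Complement + delete: drop everything that is not printable or a newline
# (the control byte \001 and the TAB are both removed).
printable=$(printf 'ok\001\tdone\n' | LC_ALL=C tr -cd '[:print:]\n')

echo "upper:     $upper"
echo "lf bytes:  $lf_bytes"
echo "squeezed:  $squeezed"
echo "printable: $printable"
```

The byte count is a quick way to confirm the conversion: the CRLF input is 10 bytes, and removing the two carriage returns leaves 8.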
The Top-N Counting Pipeline
The canonical use of these four tools together is frequency counting. Extract the field you care about with cut, normalize it with tr if needed, then run sort | uniq -c | sort -rn: the first sort brings identical values together, uniq -c counts each run, and the second numeric reverse sort ranks them. This pattern answers most "which is most common" questions about logs in a single line, and it is the reason sort and uniq are almost always seen as a pair.
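As a sketch of the full extract, normalize, count, and rank sequence, the example below counts HTTP methods in a hypothetical request log whose method field has inconsistent case. The log lines and the three-column layout are assumptions made for the example, and the final sed only strips the leading padding from uniq -c.

```shell
# Hypothetical request lines: client, method, path. Method case is inconsistent.
log=$(mktemp)
printf '%s\n' \
  '10.0.0.1 GET /' \
  '10.0.0.2 get /a' \
  '10.0.0.1 POST /login' \
  '10.0.0.3 Get /b' \
  '10.0.0.1 post /login' \
  '10.0.0.2 HEAD /' > "$log"

# cut extracts the method, tr normalizes its case, the first sort groups
# identical values, uniq -c counts each run, and sort -rn ranks the counts.
ranked=$(cut -d' ' -f2 "$log" \
  | tr 'a-z' 'A-Z' \
  | LC_ALL=C sort \
  | uniq -c \
  | LC_ALL=C sort -rn \
  | sed 's/^ *//')
rm -f "$log"

printf '%s\n' "$ranked"
```

Without the tr stage, GET, get, and Get would be counted as three different values.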
When the logic grows past extraction and counting — picking a column by condition, summing a field, grouping on a key — stop stacking filters and switch to awk. A pipeline of four cut and tr stages that an awk one-liner would replace is harder to read and slower, because each stage is a separate process copying the whole stream.
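For comparison, here is a minimal awk sketch that combines a condition, arithmetic, and grouping in one process. The client and byte values are invented, and the 100-byte threshold is an arbitrary choice for the example.

```shell
# Hypothetical "client bytes" pairs.
data=$(printf '%s\n' \
  '10.0.0.1 500' \
  '10.0.0.2 120' \
  '10.0.0.1 250' \
  '10.0.0.2 30' \
  '10.0.0.3 999')

# Keep rows over 100 bytes, then total the bytes per client.
# The trailing sort only makes the output order predictable, because
# awk's "for (key in array)" order is unspecified.
totals=$(printf '%s\n' "$data" \
  | awk '$2 > 100 { sum[$1] += $2 } END { for (ip in sum) print ip, sum[ip] }' \
  | LC_ALL=C sort)

printf '%s\n' "$totals"
```

Expressing the same filter and per-key sum with cut, tr, sort, and uniq alone is not practical, which is the point at which the switch to awk pays off.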
sort -u — deduplicates globally in one pass, comparing across the entire input. It needs no prior sort because it is the sort, and with a key (sort -u -k2,2) it keeps one line per key value. Reach for it when you just want distinct lines or distinct keys.
uniq — only collapses adjacent equal lines, so it must follow a sort, and it compares whole lines by default. Choose it when you need its extras that sort -u has no equivalent for: -c to count occurrences, -d to show only duplicates, -u to show only singletons.
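The difference between the two shows up as soon as a key is involved. The sketch below uses four invented two-field lines and assumes GNU sort; it counts output lines rather than asserting which line of a key group survives.

```shell
# Made-up rows: two distinct lines share key "1", and one line is repeated.
rows=$(printf '%s\n' 'x 1' 'y 1' 'z 2' 'z 2')

# sort | uniq compares whole lines: only the repeated "z 2" collapses.
whole_line=$(printf '%s\n' "$rows" | LC_ALL=C sort | uniq | wc -l | tr -d ' ')

# sort -u with a bounded key keeps one line per distinct value of field 2.
per_key=$(printf '%s\n' "$rows" | LC_ALL=C sort -u -k2,2 | wc -l | tr -d ' ')

# uniq -c adds the per-line counts that sort -u cannot produce.
counted=$(printf '%s\n' "$rows" | LC_ALL=C sort | uniq -c | sed 's/^ *//')

echo "distinct lines: $whole_line"
echo "distinct keys:  $per_key"
printf 'line counts:\n%s\n' "$counted"
```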
- Running uniq on unsorted input. It only merges adjacent lines, so identical entries scattered through the file are counted separately and the totals are silently wrong, with no error to warn you.
- Lexically sorting numbers without -n, which puts 10 before 2 and 100 before 9. A "top by count" report comes out ranked alphabetically by digit, not numerically.
- Using cut -d' ' on space-aligned output like ps aux or ls -l. A single-byte delimiter cannot collapse runs of spaces, so fields land in the wrong column; use awk, which splits on whitespace runs by default.
- Sorting without LC_ALL=C when you need stable, reproducible output. A UTF-8 locale changes collation between machines and runs measurably slower on large files.
- Expecting tr to handle multibyte UTF-8 by character. It works on bytes, so accented or non-Latin text is mangled and case ranges like a-z only touch ASCII.
- Leaving -k unbounded, writing -k2 instead of -k2,2. The key then extends to end of line and a sort you expected to order on one field quietly orders on the whole tail.
- Sort before every uniq, and sort on the same field uniq will compare, so the adjacency uniq assumes actually holds.
- Pass -n whenever the key is numeric, and -h when it carries size suffixes like K/M/G, so the order is arithmetic rather than alphabetic.
- Prefix heavy sorts with LC_ALL=C for deterministic byte ordering and faster runs on large logs.
- Bound every sort key explicitly (write -k2,2, not -k2) so the comparison stops at the intended field.
- Reach for the sort | uniq -c | sort -rn idiom for ranked frequency counts; it is the fastest path to top-N answers from logs.
- Switch to awk the moment field logic grows past extraction (conditionals, arithmetic, grouping) instead of stacking fragile cut and tr stages.
- awk: the step up when cut's fixed-delimiter model fails or the logic needs conditions and arithmetic
- PowerShell: Sort-Object and Group-Object operate on object properties, so there is no delimiter or prior-sort precondition
- datamash: a coreutils-style filter for grouped statistics (sum, mean, count) that goes beyond what uniq -c can express

Knowledge Check
You pipe an Nginx log straight into uniq -c without sorting first. What happens?
- Only adjacent duplicates collapse, so identical lines spread through the file are counted separately and the totals are wrong
- uniq quietly buffers and sorts the whole input stream internally before counting, so the per-line totals still come out completely correct
- uniq detects the unsorted input and exits with a non-zero error rather than emitting any counts
- The whole file is treated as a single duplicate group and reported just once with the full line count
When does sort -u behave differently from sort | uniq?
- With a key like -k2,2: sort -u keeps one line per key value, while uniq dedups on the whole line
- Never: the two pipelines are exactly equivalent and produce byte-identical output in every case
- sort -u needs the input already pre-sorted to dedup correctly, whereas sort | uniq does not
- uniq can deduplicate without any sort step, so it catches scattered duplicates that sort -u misses entirely
Why is awk '{print $2}' the right choice over cut -d' ' -f2 for ps aux output?
- cut's single-byte delimiter cannot collapse the multiple spaces between columns, while awk splits on whitespace runs by default
- cut can only read from a named file argument and refuses to take input from a pipe
- awk is faster here because it is compiled ahead of time while cut runs through an interpreter
- cut numbers its fields starting from zero rather than one, so -f2 ends up selecting the third column instead of the intended one
A "top counts" report sorted with plain sort -r shows 100 ranked below 99. What fixes it?
- Add -n so the sort compares numerically instead of lexically
- Add LC_ALL=C to force byte-order comparison
- Pipe through uniq -c a second time to recount
- Bound the key with -k1,1 so only the count field is compared