Topic 19

grep and Regular Expressions

Text Processing

grep reads its input line by line and prints the lines that match a pattern. The pattern is a regular expression — a small language for describing sets of strings — and the same language drives sed, awk, and the search in every editor you will touch. Learn the regex model once and it pays off across the entire toolset, which is why this topic is worth more than the size of a single command suggests.

The trap is that there is no single regex. grep defaults to Basic Regular Expressions (BRE), where +, ?, |, and () are literal characters until you backslash-escape them; grep -E switches to Extended Regular Expressions (ERE), where those metacharacters are active without escaping; and grep -P hands the pattern to a Perl-compatible engine with lookarounds and non-greedy quantifiers. A pattern that works in one dialect silently matches the wrong thing — or nothing — in another. Knowing which dialect you are in is half of using grep correctly.

Basic grep

At its simplest grep PATTERN FILE prints matching lines, and with no file argument it reads standard input, which is how it lives in pipelines. The flags that matter day to day are few. -i makes the match case-insensitive, -v inverts it to print non-matching lines, -n prefixes each hit with its line number, -c prints only the count of matching lines, and -o prints just the matched text rather than the whole line — useful when you want to extract tokens, not read context.

# lines mentioning the word, case-insensitive, with line numbers
grep -in error /var/log/syslog

# everything that is NOT a comment or a blank line
grep -v '^#\|^$' /etc/ssh/sshd_config

# count failed SSH logins; -o pulls out just the matched IPs
grep -c 'Failed password' /var/log/auth.log
grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' /var/log/auth.log
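As a rough sketch of how -o fits into a pipeline, the snippet below filters failed logins, extracts the addresses, and tallies them per IP. The log lines are made-up sample data written to a temporary file, not real auth.log output, and the variable names are only illustrative.

```shell
# Hypothetical sample data standing in for /var/log/auth.log
log=$(mktemp)
cat > "$log" <<'EOF'
Jan 10 03:14:07 host sshd[811]: Failed password for root from 203.0.113.5 port 4242 ssh2
Jan 10 03:14:09 host sshd[811]: Failed password for admin from 203.0.113.5 port 4243 ssh2
Jan 10 03:15:30 host sshd[902]: Failed password for root from 198.51.100.7 port 5151 ssh2
Jan 10 03:16:02 host sshd[950]: Accepted password for alice from 192.0.2.10 port 6000 ssh2
EOF

# Keep only failed logins, extract just the IPs, then count occurrences per IP
top_offenders=$(grep 'Failed password' "$log" \
  | grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' \
  | sort | uniq -c | sort -rn)

printf '%s\n' "$top_offenders"
rm -f "$log"
```

Filtering first and extracting second keeps the accepted login's address out of the tally, which a single grep -o over the whole file would not.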

grep is line-oriented to its core: the unit of matching is a single line, and a pattern can never span a newline. Searching for foo.*bar finds foo and bar only when they sit on the same line. This is the most common surprise for people coming from editor search, and it shapes how you build pipelines — you filter lines first, then reach for sed or awk when the logic crosses line boundaries.
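The line-at-a-time behavior is easy to confirm with a small experiment. This is a minimal sketch using a throwaway file; the tr step is one possible workaround for small inputs, not the only one.

```shell
sample=$(mktemp)
printf 'foo starts here\nand bar ends here\n' > "$sample"

# foo and bar sit on different lines, so no single line matches.
# grep exits non-zero on zero matches; "|| true" keeps that from aborting a strict script.
same_line=$(grep -c 'foo.*bar' "$sample" || true)

# Workaround for small inputs: turn newlines into spaces, then match.
# The whole file becomes one long line, so avoid this on large logs.
joined=$(tr '\n' ' ' < "$sample" | grep -c 'foo.*bar')

echo "same-line matches: $same_line, after joining: $joined"
rm -f "$sample"
```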

Regex Building Blocks

A regular expression is built from a handful of pieces. Anchors tie a match to a position: ^ is the start of the line, $ is the end, so ^root: matches only lines that begin with root:. Character classes match one character from a set: [0-9] is any digit, [a-zA-Z] any letter, [^0-9] any non-digit, and POSIX names like [[:space:]] and [[:alnum:]] stay correct across locales. A bare . matches any single character except newline.

Quantifiers control repetition. * means zero or more of the preceding atom, + one or more, ? zero or one, and {m,n} a bounded count — [0-9]{1,3} matches one to three digits. Grouping with () applies a quantifier to a whole subexpression and captures it for back-references, while alternation with | matches one branch or another: (GET|POST|PUT) matches any of the three HTTP methods. These atoms compose: ^([0-9]{1,3}\.){3}[0-9]{1,3}$ is a rough IPv4 matcher.

Construct	Meaning	Example
`^ $`	start / end of line	`^Jan`, `denied$`
`. [...]`	any char / one from a set	`[0-9]`, `[^/]`
`* + ?`	0+, 1+, 0-or-1 repetitions	`ab*`, `colou?r`
`{m,n}`	bounded repetition	`[0-9]{1,3}`
`( ) |`	group / alternation	`(cat|dog)s?`
`\b \w`	word boundary / word char (GNU extension; also PCRE)	`\bsudo\b`
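To see anchors, classes, bounded repetition, and grouping working together, here is a small sketch that runs the rough IPv4 pattern against a handful of invented candidate strings. It also shows why the matcher is only rough: the pattern checks shape, not numeric range.

```shell
# Candidate strings, one per line; the values are made up for illustration
candidates='192.168.1.10
10.0.0.1
999.999.999.999
1.2.3
10.0.0.1 extra
abc.def.ghi.jkl'

# ^ and $ pin the match to the whole line; the group ([0-9]{1,3}\.) must repeat exactly 3 times
matches=$(printf '%s\n' "$candidates" | grep -E '^([0-9]{1,3}\.){3}[0-9]{1,3}$')

printf '%s\n' "$matches"
```

The first three candidates pass, including 999.999.999.999, because [0-9]{1,3} accepts any one to three digits. The trailing text and the three-part string are rejected by the anchors and the repetition count.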

BRE versus ERE versus PCRE

The dialect decides what needs escaping. In BRE (plain grep) the characters +, ?, {, |, (, and ) are ordinary literals, and you must write \+, \?, \{, \|, \(, and \) to make them act as metacharacters. ERE (grep -E) reverses this: those characters are active by default, and you backslash them only when you want the literal. ERE is what most people picture when they think "regex," and it is the readable default for anything beyond a fixed string.

# BRE: groups and alternation need backslashes
grep '\(GET\|POST\)' access.log

# ERE: the same pattern, no backslashes
grep -E '(GET|POST)' access.log

# PCRE: lookbehind and non-greedy, unavailable in BRE/ERE
grep -P '(?<=user=)\w+' auth.log

grep -P is a different engine entirely: PCRE, the Perl-compatible library. It adds lookahead and lookbehind ((?=...), (?<=...)), non-greedy quantifiers (*?, +?), and the convenience shorthands \d, \w, \s, and \b. The cost is portability: -P is a GNU extension (the manual flags it as experimental when combined with -z), and minimal images that ship BusyBox grep lack it. On macOS and the BSDs the default grep is the BSD implementation, where -P is unsupported unless you install GNU grep. Reach for -P only when ERE genuinely cannot express what you need.
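One way to cope with the portability gap is to probe for -P once and fall back to ERE when it is missing. The sketch below assumes a log line containing a user= field; the sample line and the variable names (have_pcre, user) are illustrative, not taken from any real system.

```shell
# Probe once: does this grep accept -P? Errors are discarded, only the exit status matters.
if printf 'x\n' | grep -qP 'x' 2>/dev/null; then
  have_pcre=yes
else
  have_pcre=no
fi

# Hypothetical log line for illustration
line='Jan 10 sshd[811]: session opened user=alice uid=1000'

if [ "$have_pcre" = yes ]; then
  # Lookbehind keeps "user=" out of the printed match
  user=$(printf '%s\n' "$line" | grep -oP '(?<=user=)\w+')
else
  # ERE fallback: match the prefix too, then strip it with sed
  user=$(printf '%s\n' "$line" | grep -oE 'user=[[:alnum:]_]+' | sed 's/^user=//')
fi

echo "$user"
```

Both branches should print the same name for this input; the fallback simply trades the lookbehind for a second, cheap processing step.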

Context and Recursive Search

A matching line is often useless without the lines around it. -A N prints N lines after each match, -B N prints N lines before it, and -C N prints N lines on both sides, which is indispensable for reading a stack trace or a log event that spreads across several lines. For locating files rather than lines, -l prints only the names of files that contain a match and -L the names of files that do not.

# show the matching line plus 3 lines of trailing context
grep -A3 'Traceback' app.log

# recursive code search, only in Python files, case-insensitive
grep -rn --include='*.py' -i 'todo' ./src

# which config files reference the old hostname
grep -rl 'db-old.internal' /etc

grep -r walks a directory tree, and --include and --exclude globs narrow it to the file types you care about, while --exclude-dir=.git keeps you out of version-control noise. For searching a whole codebase this is workable, but it has no concept of a .gitignore and scans every byte of every matched file, which is exactly where the purpose-built code searchers pull ahead.
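The effect of scoping can be seen on a throwaway directory tree. This sketch assumes a grep that supports --include and --exclude-dir (GNU grep does); every file name and file body below is hypothetical.

```shell
# Throwaway tree; all names and contents are invented for the demo
root=$(mktemp -d)
mkdir -p "$root/src" "$root/.git" "$root/node_modules/pkg"
printf '# TODO: handle timeouts\n' > "$root/src/app.py"
printf 'TODO: write docs\n'        > "$root/src/notes.txt"
printf 'TODO inside git data\n'    > "$root/.git/packed.py"
printf '# TODO vendored\n'         > "$root/node_modules/pkg/lib.py"

# Unscoped: counts every file under the tree that mentions TODO
all_hits=$(grep -rl 'TODO' "$root" | wc -l | tr -d ' ')

# Scoped: Python files only, skipping .git and node_modules
scoped=$(grep -rl --include='*.py' --exclude-dir=.git --exclude-dir=node_modules 'TODO' "$root")

echo "unscoped files: $all_hits"
echo "scoped: ${scoped#"$root"/}"
rm -rf "$root"
```

The unscoped run reports all four files, while the scoped run is left with the one source file that matters.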

Fixed Strings and Performance

grep -F (historically the fgrep command) turns off regex entirely and treats the pattern as a literal string. Use it whenever your needle contains regex metacharacters that you mean literally: searching for an IP address 10.0.0.1, a version string 1.2.3, or a path /usr/lib/x86_64. Without -F, every . in those is a wildcard, so grep 10.0.0.1 also matches 10x0y0z1. -F makes the match both correct and faster, because the engine can use a plain string search instead of compiling an automaton.
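The false positive is simple to reproduce. In this minimal sketch the two input lines are invented: one carries the real address and one a lookalike that only a wildcard dot would accept.

```shell
# Two made-up log lines: one real address, one lookalike
lines='client 10.0.0.1 connected
build 10x0y0z1 finished'

# Regex mode: each . matches any character, so both lines hit
regex_hits=$(printf '%s\n' "$lines" | grep -c '10.0.0.1')

# Fixed-string mode: dots are literal, so only the real address hits
fixed_hits=$(printf '%s\n' "$lines" | grep -cF '10.0.0.1')

echo "regex: $regex_hits, fixed: $fixed_hits"
```

Note that -F still matches substrings, so a line containing 10.0.0.10 would also hit; adding -w narrows the match to whole words when that matters.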

Performance also turns on the regex itself. GNU grep compiles most BRE and ERE patterns to an automaton that does not backtrack, but back-references and the PCRE engine behind -P do, and there a poorly bounded pattern with repeated .* can take seconds on long lines where a tight one takes milliseconds. Anchoring with ^ lets the engine reject most lines after the first character, and a fixed-string prefix lets it skip whole lines outright. Setting LC_ALL=C before a byte-oriented search drops multibyte decoding and locale-aware collation work and can speed up large scans measurably, at the cost of locale-correct case folding, so reserve it for ASCII data.
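The LC_ALL=C prefix is a per-command environment assignment, so it affects one grep invocation and nothing else in the script. The sketch below generates a small ASCII-only file (the log format is hypothetical) and shows that an anchored search returns the same count under either locale; it makes no timing claim, since that depends on the machine and the data.

```shell
# Generate an ASCII-only sample; the log format is invented for the demo
big=$(mktemp)
i=0
while [ "$i" -lt 2000 ]; do
  echo "INFO request $i ok"
  # Every 100th request also logs an error line
  [ $((i % 100)) -eq 0 ] && echo "ERROR request $i failed"
  i=$((i + 1))
done > "$big"

# Anchored pattern; the LC_ALL=C assignment applies to that one command only
default_count=$(grep -c '^ERROR' "$big")
c_count=$(LC_ALL=C grep -c '^ERROR' "$big")

echo "default locale: $default_count, C locale: $c_count"
rm -f "$big"
```

To measure the difference on real data, wrap each grep in the shell's time keyword and compare runs on a file large enough to take a noticeable amount of time.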

BRE vs ERE vs PCRE

BRE — the default for plain grep and sed. +, ?, {}, |, and () are literal until escaped with a backslash. Fine for fixed-shape patterns; reach for it when you are stuck with portable, POSIX-only tooling.

ERE — grep -E (and sed -E, awk). Those same metacharacters are active without backslashes, so patterns read the way most people expect. Make this your default for any pattern with grouping or alternation.

PCRE — grep -P, a separate Perl-compatible engine. Adds lookaround, non-greedy *?, and \d/\w/\b shorthands. Use it only when ERE cannot express the match — and only where it is installed, since BSD and macOS grep lack it.

Common Mistakes

Writing a multiline pattern like foo.*bar and expecting it to match across lines — grep matches one line at a time, so a match spanning a newline never fires, and you silently get zero hits.
Using +, ?, or | in plain grep and wondering why nothing matches — in BRE they are literal characters; either escape them (\+) or switch to grep -E.
Searching for a literal string full of dots, such as grep 10.0.0.1, without -F — each . is a wildcard, so the pattern also matches 10x0y0z1 and produces false positives.
Leaning on grep -P in a script that runs on Alpine, BSD, or macOS — the default grep there has no PCRE support, and the script fails with an unrecognized-option error in production.
Catastrophic greedy patterns: with grep -P or back-references, repeated .* against very long lines can force heavy backtracking, turning a one-second search into a hang on large logs.
Recursing with grep -r straight through .git, node_modules, and binary blobs — results drown in noise and the scan crawls, when --exclude-dir and --include would have scoped it.
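The dialect mistake from the list above, an unescaped + in plain grep, can be reproduced in a few lines. This is a small sketch on invented input that compares the same pattern under BRE and ERE.

```shell
input='a+b
aab
ab'

# BRE: + is a literal plus sign, so only the line "a+b" matches
bre=$(printf '%s\n' "$input" | grep -c 'a+b')

# ERE: + means "one or more a", so "aab" and "ab" match.
# The line "a+b" no longer matches, because a literal + sits between the a and the b.
ere=$(printf '%s\n' "$input" | grep -cE 'a+b')

echo "BRE matches: $bre, ERE matches: $ere"
```

The same pattern text selects a different set of lines in each dialect, which is why a silent dialect mix-up is harder to spot than an outright error.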

Best Practices

Use grep -F whenever the pattern is a literal string with dots, slashes, or brackets — it is both correct and faster than letting them act as metacharacters.
Default to grep -E for any pattern with grouping or alternation; the absence of backslash-escaping makes the intent readable and reduces dialect mistakes.
Anchor patterns with ^ or $ when you know the position — it kills false matches and lets the engine reject most lines after the first character.
Scope code searches with grep -rn --include='*.ext' --exclude-dir=.git rather than a bare -r, so you read signal instead of binaries and vendored trees.
Pull context with -A, -B, or -C when reading logs and stack traces — a bare matching line rarely carries enough to act on.
Reserve grep -P for lookaround or non-greedy needs, and confirm GNU grep is present before depending on it in a portable script.
Prefix heavy ASCII scans with LC_ALL=C to skip Unicode collation, but keep the locale default when correct case folding on non-ASCII text matters.

Comparable Tools

PowerShell Select-String: the object-pipeline equivalent on Windows, matching on .NET regex
ripgrep / ag: faster code searchers that respect .gitignore and skip binaries by default
Windows findstr: the built-in line filter, with a limited, non-POSIX pattern syntax

Knowledge Check

Why does grep '(GET|POST)' access.log fail to match either method, while grep -E '(GET|POST)' works?

Plain grep uses BRE, where (, ), and | are literal characters; -E switches to ERE, where they act as grouping and alternation
The first form is case-sensitive by default, so the method names will only ever match once you also pass the -i flag to ignore case
-E quietly enables recursive directory search, which grep requires before alternation will work at all
BRE is unable to match uppercase letters at all unless they are wrapped in an explicit character class

You want to find the literal string 10.0.0.1 in a log. Why is grep -F '10.0.0.1' the better choice than grep '10.0.0.1'?

Without -F each . is a wildcard, so the pattern also matches strings like 10x0y0z1; -F treats it as a literal and avoids false positives
-F switches the whole search to case-insensitive matching, which is exactly what you need to match dotted IP addresses reliably across a log
Without the -F flag, grep only ever scans the very first line of the file and stops there
-F is the flag that is required before grep will print the line number alongside each match

A script using grep -P works on your Ubuntu workstation but errors out on an Alpine container. What is the most likely reason?

-P needs the PCRE-backed GNU grep; Alpine's BusyBox grep lacks PCRE support, so the option is unrecognized
PCRE patterns are only ever treated as line-oriented on Debian-based systems, which is precisely why Alpine rejects them
-P requires root privileges that the unprivileged Alpine container does not grant by default
Alpine disables regular expressions entirely to save image space, so every grep pattern fails there

Why does grep 'foo.*bar' app.log miss a case where foo appears on one line and bar on the next?

grep matches one line at a time, so .* never crosses a newline; a match spanning two lines cannot fire
.* is greedy and overshoots past bar whenever the two words sit too far apart on the line
The pattern silently requires the -z flag by default and fails without it on essentially every input file
grep stops scanning at the very first match it finds and so never reads as far as the second line
