find and xargs
Topic 22

find walks a directory tree from one or more starting points, tests each entry it meets against an expression — name, type, size, age, permission bits, owner — and acts on the ones that match. It does not glob the way the shell does and it does not stop at the first level; it descends the whole tree, one entry at a time, until the expression has been evaluated against every file, directory, symlink, and device node it finds. The expression is the program: find /var/log -type f -name '*.log' -mtime +30 reads as "regular files named *.log last modified more than 30 days ago," and nothing happens to them until you add an action.

The operational weight of find is that it is usually paired with a destructive action — -delete, -exec rm, a pipe into xargs — and run over thousands of files at once. That is exactly where it bites: a filename with a space or a newline, an off-by-one in -mtime, or a missing dry-run turns a cleanup job into an outage. Get null-termination and batching right and find | xargs is the safest way to operate on a large fileset; get them wrong and it is the fastest way to delete the wrong thing.

The Expression Model

Everything after the starting paths is an expression of tests, actions, and operators. Tests return true or false for the current entry: -name '*.conf' (shell-glob match on the basename), -type f (regular file; d directory, l symlink, s socket), -size +100M, -perm -4000 (all of these mode bits set; /4000 matches any), -user www-data, and the time tests -mtime, -atime, -ctime in days plus -mmin in minutes. Quote glob patterns — -name '*.log', not -name *.log — or the shell expands the * against the current directory before find ever sees it.
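The quoting rule is easy to see in a throwaway directory. The sketch below is an illustration only: the sandbox created with mktemp and the file names a.log and sub/b.log are made up for the demonstration.

```shell
set -eu

# Hypothetical sandbox: one .log file at the top level, one in a subdirectory.
sandbox=$(mktemp -d)
cd "$sandbox"
mkdir sub
touch a.log sub/b.log

# Unquoted: the shell rewrites *.log to a.log before find starts,
# so find looks for the literal name a.log and never sees sub/b.log.
unquoted=$(find . -name *.log | wc -l)

# Quoted: find receives the pattern itself and applies it at every level.
quoted=$(find . -name '*.log' | wc -l)

echo "unquoted pattern matched: $unquoted"
echo "quoted pattern matched:   $quoted"

cd /
rm -rf "$sandbox"
```

With two or more .log files in the current directory the unquoted form would not even run, because find would receive several bare names where it expects one pattern.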

Tests combine with operators, and the defaults matter. Two tests side by side are an implicit AND; -o is OR; ! or -not negates; parentheses group, but must be escaped as \( and \) because the shell would otherwise eat them. The time tests count in 24-hour periods measured back from now, not calendar days: -mtime +30 means "more than 30 full 24-hour periods ago" (31 days and older), -mtime 0 means "within the last 24 hours," and -mtime -7 means "less than 7 periods ago." That period-versus-calendar-day distinction is the single most common source of "why did it match a file from today" surprises.
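The 24-hour-period rule can be checked with files whose timestamps sit just past a boundary. This is a minimal sketch that assumes GNU touch (for relative -d dates) and GNU find; the sandbox and file names are hypothetical.

```shell
set -eu

sandbox=$(mktemp -d)
cd "$sandbox"

# Ages expressed in hours so the arithmetic is explicit (GNU touch assumed).
touch -d '721 hours ago' thirty-days-one-hour      # 30 days + 1 hour old
touch -d '745 hours ago' thirty-one-days-one-hour  # 31 days + 1 hour old
touch just-now

# The fractional part of the age is discarded before comparing:
# 30 days + 1 hour counts as 30 periods, which is not "more than 30".
older=$(find . -type f -mtime +30)
exactly30=$(find . -type f -mtime 30)
today=$(find . -type f -mtime 0)

echo "-mtime +30 matched: $older"
echo "-mtime 30  matched: $exactly30"
echo "-mtime 0   matched: $today"

cd /
rm -rf "$sandbox"
```

A file that is 30 days and one hour old survives a -mtime +30 cleanup, which is the off-by-one described above.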

# regular files over 100 MB, OR any file owned by an ex-employee's UID
# (both branches are grouped so that -print applies to either match)
find /srv \( -type f -size +100M -o -uid 1009 \) -print

# config files modified in the last 15 minutes (incident triage)
find /etc -type f -mmin -15
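Because the implicit AND binds tighter than -o, an explicit action attaches only to the branch it sits in. The following sketch shows the difference grouping makes; the sandbox and the names app.log, notes.txt, and logs are invented for the example.

```shell
set -eu

sandbox=$(mktemp -d)
cd "$sandbox"
touch notes.txt app.log
mkdir logs

# Parsed as: -type d  -o  ( -name '*.log' -print )
# Directories satisfy the left branch, but nothing there prints them.
ungrouped=$(find . -type d -o -name '*.log' -print | LC_ALL=C sort)

# Grouping makes -print apply to whichever branch matched.
grouped=$(find . \( -type d -o -name '*.log' \) -print | LC_ALL=C sort)

echo "ungrouped:"
echo "$ungrouped"
echo "grouped:"
echo "$grouped"

cd /
rm -rf "$sandbox"
```

The ungrouped form lists only ./app.log, while the grouped form also lists . and ./logs. Swap -print for -delete and the same binding decides what gets removed.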

Acting on Results

A test-only find prints matches (the implicit -print); to do something with them you have three mechanisms, and choosing between them is a performance decision. -exec cmd {} \; runs the command once per matched file, substituting the path for {} — correct but slow, because a million matches fork a million processes. -exec cmd {} + batches as many paths as fit onto one command line and runs the command a handful of times, the way xargs does, which is dramatically faster for large sets. The third option pipes paths into xargs, which reads them from stdin and packs them into command lines for you.
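One way to see the cost difference between the two -exec terminators is to count how many times the command is launched. In this sketch the inline sh command is a stand-in for a real tool and prints one line per launch; the 50 file names are hypothetical.

```shell
set -eu

sandbox=$(mktemp -d)
cd "$sandbox"

# Create 50 hypothetical temp files.
i=1
while [ "$i" -le 50 ]; do
  touch "file$i.tmp"
  i=$((i + 1))
done

# Each launch of the inline sh prints exactly one line,
# so counting lines counts process launches.
per_file=$(find . -name '*.tmp' -exec sh -c 'echo "$# args"' sh {} \; | wc -l)
batched=$(find . -name '*.tmp' -exec sh -c 'echo "$# args"' sh {} + | wc -l)

echo "launches with \\; : $per_file"
echo "launches with +  : $batched"

cd /
rm -rf "$sandbox"
```

Fifty short paths fit on one command line, so the + form launches the command once where the \; form launches it fifty times.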

xargs earns its place when you want to feed the output of any command, not just find, into another, or when you want its extras: -P for parallelism (-P 4 runs four jobs at once), -n to cap arguments per invocation, -I{} to place each item somewhere other than the end (at the price of one command per item). Without find in the picture, xargs is still how you avoid the "Argument list too long" error: the kernel caps the combined size of a command's arguments and environment at ARG_MAX, typically about 2 MB on Linux, so rm *.log on a directory of a million files fails outright, while piping the list through xargs splits it into runs that fit.
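The limit and the splitting can both be observed directly. The sketch below reads the limit with getconf and then pushes 200,000 generated names (hypothetical, produced by seq) through xargs; the inline sh command only reports how many arguments each run received.

```shell
set -eu

# Limit, in bytes, on arguments plus environment for a single exec.
arg_max=$(getconf ARG_MAX)
echo "ARG_MAX on this system: $arg_max bytes"

# 200000 hypothetical names, far more than one command line holds.
# Each xargs run prints the number of arguments it was given.
counts=$(seq 1 200000 | sed 's/$/.log/' | xargs sh -c 'echo "$#"' sh)

runs=$(printf '%s\n' "$counts" | wc -l)
total=$(printf '%s\n' "$counts" | awk '{ t += $1 } END { print t }')

echo "xargs used $runs runs to pass $total arguments"
```

The number of runs depends on the xargs implementation and its buffer size, so it varies between systems; what stays constant is that every name is delivered and no single run exceeds the limit.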

# one rm process per file — slow on large sets
find . -name '*.tmp' -exec rm {} \;

# batched — a handful of rm processes total
find . -name '*.tmp' -exec rm {} +

# four parallel converts, each item placed via -I
find . -name '*.png' -print0 | xargs -0 -P 4 -I{} convert {} {}.webp
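The three xargs options can be tried without touching any files by feeding NUL-separated items from printf and using echo as a dry-run command. The item names here are made up, and the convert command line is only echoed, never executed.

```shell
set -eu

# -n 2: at most two arguments per run, so five items take three runs.
batches=$(printf '%s\0' a b c d e | xargs -0 -n 2 echo)
echo "$batches"

# -I{}: the item is substituted wherever {} appears, one command per item.
placed=$(printf '%s\0' 'holiday photo.png' | xargs -0 -I{} echo convert {} {}.webp)
echo "$placed"

# -P 4: up to four jobs at once; completion order is not guaranteed,
# so the output is sorted before use.
parallel=$(printf '%s\0' 1 2 3 4 | xargs -0 -P 4 -I{} sh -c 'echo "job $1"' sh {} | LC_ALL=C sort)
echo "$parallel"
```

Passing the item to sh as a positional parameter, rather than pasting {} into the sh -c string, keeps unusual file names from being interpreted as shell code.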

Null-Safety

By default find -print separates paths with newlines and xargs splits on any whitespace — spaces, tabs, and newlines. A file named quarterly report.csv arrives at xargs as two arguments, quarterly and report.csv, and a filename that actually contains a newline (legal on Linux) splits into two paths. The fix is to make the separator a NUL byte, the one character a Unix path can never contain: find ... -print0 emits NUL-terminated paths and xargs -0 reads them, so whitespace inside names is preserved exactly.

find ... -print0 | xargs -0 is the form to commit to muscle memory; treat any find | xargs without it as a latent bug, because it works in testing on tidy names and fails the day a real filename has a space. When you stay inside find, -exec sidesteps the problem entirely — it passes paths as separate arguments with no shell word-splitting, so -exec rm {} + needs no -print0 at all.

# safe across spaces and newlines in filenames
find /data -type f -name '*.bak' -print0 | xargs -0 rm

# equivalent, no piping, no null-handling needed
find /data -type f -name '*.bak' -exec rm {} +
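The splitting problem is easy to reproduce with a single awkward name. This sketch counts how many arguments reach the downstream command in each form; the sandbox is hypothetical and echo stands in for rm so nothing is deleted.

```shell
set -eu

sandbox=$(mktemp -d)
cd "$sandbox"
touch 'quarterly report.csv'

# Newline-separated output, whitespace splitting: one file becomes two arguments.
unsafe=$(find . -type f | xargs -n 1 echo | wc -l)

# NUL-separated output read with -0: the name arrives intact.
safe=$(find . -type f -print0 | xargs -0 -n 1 echo | wc -l)

echo "arguments seen without -print0/-0: $unsafe"
echo "arguments seen with -print0/-0:    $safe"

cd /
rm -rf "$sandbox"
```

With rm in place of echo, the unsafe form would try to delete ./quarterly and report.csv, neither of which is the file that exists.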

Pruning and Depth

On a large tree, the cost is the descent itself: find calls stat() on every entry, and a scan that crosses /proc, a 200 GB node_modules, or a network mount can hammer I/O for minutes. -maxdepth N stops the recursion N levels below the starting point (-maxdepth 1 covers the starting directory and its immediate entries, nothing deeper); -mindepth N skips everything shallower than N levels. Place -maxdepth before other tests: GNU find warns if a global option appears after a test.
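Depth counting starts at zero for the starting point itself, which a three-level sandbox makes concrete. The directory layout and file names below are hypothetical, and -maxdepth and -mindepth are assumed to be available (they are in GNU and BSD find).

```shell
set -eu

sandbox=$(mktemp -d)
cd "$sandbox"
mkdir -p a/b
touch top.txt a/mid.txt a/b/deep.txt

# Depth 0 is ".", depth 1 is its entries, depth 2 is their entries.
top_only=$(find . -maxdepth 1 -type f)
exactly_two=$(find . -mindepth 2 -maxdepth 2 -type f)
two_and_deeper=$(find . -mindepth 2 -type f | wc -l)

echo "-maxdepth 1:              $top_only"
echo "-mindepth 2 -maxdepth 2:  $exactly_two"
echo "-mindepth 2 (file count): $two_and_deeper"

cd /
rm -rf "$sandbox"
```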

-prune is the surgical tool: it tells find not to descend into a matched directory at all, which is how you exclude .git or a mount point without paying to walk its contents. The idiom reads backwards at first: find . -path ./.git -prune -o -type f -print means "if the path is ./.git, prune it and stop; otherwise, if it is a file, print it." The trailing -print matters because an explicit action switches off find's implicit default print, so only the right-hand branch produces output. Leave it off and the implicit -print wraps the whole expression; since -prune itself evaluates to true, the pruned directory's own path (./.git) then appears in the output alongside the files, even though nothing beneath it does.

# Python files anywhere under the current directory, skipping the whole .git subtree
find . -path ./.git -prune -o -type f -name '*.py' -print

# don't cross into other filesystems (skip mounted volumes)
find / -xdev -type f -size +1G
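The effect of -prune, and of the explicit -print that usually follows it, can be checked in a small fake repository. Everything in this sketch is hypothetical: the sandbox, the .git contents, and the two Python files.

```shell
set -eu

sandbox=$(mktemp -d)
cd "$sandbox"
mkdir -p .git/objects src
touch .git/config .git/objects/blob src/app.py setup.py

# Explicit -print on the right-hand branch: only the files outside .git.
with_print=$(find . -path ./.git -prune -o -type f -print | LC_ALL=C sort)

# No explicit action: the implicit -print covers the whole expression,
# and -prune returns true, so ./.git itself is listed as well.
without_print=$(find . -path ./.git -prune -o -type f | LC_ALL=C sort)

echo "with -print:"
echo "$with_print"
echo "without -print:"
echo "$without_print"

cd /
rm -rf "$sandbox"
```

In both runs nothing beneath .git is visited; the only difference is whether the pruned directory's own name ends up in the output.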

Common Patterns

A handful of forms cover most real work. Cleanup by age: find /var/cache/app -type f -mtime +14 -delete removes files older than 14 days, but run it once with -print in place of -delete first to see exactly what matches, because -delete has no undo and applies to whatever the expression actually selected, operators and grouping included. Bulk permission fixes split files from directories, since 755 on a file makes it needlessly executable: find . -type d -exec chmod 755 {} + and find . -type f -exec chmod 644 {} + as two passes.

Find-and-grep is the everyday code search: find . -type f -name '*.conf' -exec grep -l max_connections {} + lists config files containing a setting, batched so grep runs a few times rather than once per file. Note that -delete implies -depth (a directory's contents are processed before the directory itself) and can remove a directory only once it is empty, so combining it with directory matches behaves differently from -exec rm -r; when you mean to remove whole subtrees, be explicit about which one you want.
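The preview-then-delete habit looks like this in practice. The sketch assumes GNU touch for the relative date and a find that supports -delete; the cache file names are invented, and everything happens inside a temporary sandbox.

```shell
set -eu

sandbox=$(mktemp -d)
cd "$sandbox"
touch -d '20 days ago' old.cache   # GNU touch assumed for relative dates
touch fresh.cache

# Step 1: dry run with -print to see the exact match set.
preview=$(find . -type f -mtime +14 -print)
echo "would delete: $preview"

# Step 2: the identical expression, with -print swapped for -delete.
find . -type f -mtime +14 -delete

remaining=$(find . -type f)
echo "still present: $remaining"

cd /
rm -rf "$sandbox"
```

Keeping the two commands identical apart from the final action is the point: any edit between the preview and the real run invalidates the preview.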

-exec {} \; vs -exec {} + vs xargs

-exec cmd {} \; — runs the command once per matched file. Correct and readable, and the only form when the command genuinely takes one file at a time, but it forks a process per match and crawls on large sets. Reach for it when the file count is small or the command can't accept multiple arguments.

-exec cmd {} + — packs as many paths as fit onto each command line and runs the command a few times total. The fast default for big sets, with no piping and no whitespace handling to get wrong. Choose it whenever the command accepts a list of files (rm, chmod, grep).

find ... -print0 | xargs -0 — the same batching as {} +, but with xargs's extras: parallelism (-P), argument caps (-n), and placement (-I{}). Choose it when you need parallel execution, when the producer isn't find, or when a long argument list would otherwise hit ARG_MAX.

Common Mistakes
  • Piping find ... | xargs without -print0 and -0 — a filename with a space splits into two arguments and one with a newline splits into two paths, so xargs rm deletes the wrong files or fails. It passes every test on tidy names and breaks on the first real one.
  • Running -delete or -exec rm before previewing with -print — the expression's match set is whatever the operators actually selected, and a misplaced -o or unparenthesized test can widen it far beyond what you intended, with no undo.
  • Misreading -mtime as calendar days — it counts 24-hour periods back from now, so -mtime +30 matches files 31 days and older, and -mtime 0 means the last 24 hours, not "today." The off-by-one quietly keeps or kills a day's worth of files.
  • Unquoted glob patterns like -name *.log: the shell expands the * against the current directory first, so find receives already-expanded filenames and either errors out (several matches) or searches the tree for that one literal name only. It works purely by accident when nothing in the current directory matches. Always quote: -name '*.log'.
  • Scanning huge trees with no -maxdepth, -prune, or -xdev — descending into /proc, a network mount, or a giant node_modules calls stat() on millions of entries and saturates I/O, sometimes for minutes, on a production box.
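  • Forgetting the trailing -print after a -prune ... -o expression: with no explicit action, the implicit -print applies to the whole expression, and because -prune returns true, the pruned directory itself (for example ./.git) lands in the output. That is harmless on screen and dangerous when the output feeds a destructive command.

Best Practices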
  • Use -print0 | xargs -0 for any pipeline, or stay inside find with -exec, which passes paths as separate arguments and needs no null-handling. Treat a bare find | xargs as a defect.
  • Run any destructive expression with -print first, confirm the match set, then swap in -delete or -exec rm on the identical expression. The dry-run is the only check between you and an irreversible mistake.
  • Prefer -exec cmd {} + over -exec cmd {} \; whenever the command accepts a list — it replaces one fork per file with a handful of invocations and runs orders of magnitude faster on large sets.
  • Bound large scans with -maxdepth, exclude subtrees with -prune, and add -xdev to stay on one filesystem, so a routine search never wanders into /proc or a mounted volume.
  • Quote every glob pattern (-name '*.conf') and escape grouping parentheses (\( ... \)) so the shell hands them to find intact rather than expanding or eating them first.
  • Use xargs -P for embarrassingly parallel work — image conversion, checksums, per-file processing — to spread the job across cores instead of running it one file at a time.

Comparable tools
  • fd: a modern find rewrite with sane defaults: regex by default, gitignore-aware, parallel, and null-safe without ceremony
  • PowerShell: Get-ChildItem -Recurse piped to ForEach-Object, an object pipeline rather than text and NUL bytes
  • locate / mlocate: an indexed name search, instant but only as fresh as its last updatedb, for "where is this file" rather than tests and actions

Knowledge Check

Why is find ... -print0 | xargs -0 preferred over find ... | xargs?

  • Default output and xargs split on whitespace, so a filename with a space or newline breaks into multiple arguments; NUL-termination uses the one byte a path can't contain
  • It automatically runs the downstream command in parallel across multiple worker processes for higher throughput, something the plain newline-separated form is simply unable to do
  • It is the only one of the two forms that manages to avoid the argument list too long error when run on very large filesets
  • Newline-separated output is silently truncated at a 4096-byte line limit, while NUL-terminated output has no such length cap at all

What is the practical difference between -exec cmd {} \; and -exec cmd {} +?

  • \; runs the command once per matched file; + batches many paths per invocation, running it far fewer times and much faster on large sets
  • \; handles filenames containing spaces safely on its own, while + must be paired with -print0 to stay safe
  • + runs the command once per matched file but spreads that work across several parallel workers, whereas \; runs each one strictly in sequence
  • \; is restricted to working only with rm and chmod, whereas + can be used with any arbitrary command

A cleanup job uses -mtime +30 but keeps a file modified exactly 30 days ago. Why?

  • -mtime counts whole 24-hour periods, so +30 matches files older than 30 full periods — 31 days and up — not "30 calendar days or more"
  • -mtime reads the inode ctime internally instead of the modification time, so a recent metadata change reset the file's apparent age back below the threshold
  • The + prefix actually means "newer than," so +30 matched only files that were modified within the last 30 days
  • Daylight-saving-time transitions shift the -mtime boundary by one full day twice a year, skewing which files match

Why does rm *.log fail in a directory of a million files while a find ... | xargs rm pipeline succeeds?

  • The shell expands *.log into one command line that exceeds ARG_MAX (about 2 MB); xargs splits the list into runs that each fit
  • rm refuses to delete more than 65,535 files in a single invocation as a built-in safety limit, a hard cap that xargs quietly bypasses
  • The glob silently skips dotfiles during expansion, so rm aborts partway through on an inconsistent file count
  • rm is single-threaded, so it simply times out on very large sets while xargs parallelizes the deletion across cores

A scan of a large server tree stalls disk I/O for minutes. Which change most directly bounds the cost?

  • Limit the descent with -maxdepth, skip subtrees with -prune, and add -xdev so it never crosses into other filesystems
  • Add -print0 so that the output stream is buffered far more efficiently while the tree walk is in progress
  • Replace -exec {} \; with -exec {} +, which is the change that actually makes the directory traversal itself run faster
  • Wrap the whole command in xargs -P to parallelize the directory descent itself across all available cores
