Git Internals
Topic 24

Packfiles and Garbage Collection

A new object starts as a standalone "loose" file: one zlib-compressed file per object under .git/objects/. That is simple but wasteful at scale, so git gc packs loose objects into delta-compressed packfiles and prunes objects that nothing references. This is how a repo with millions of objects can stay at a few hundred megabytes instead of growing to tens of gigabytes.

The two ideas to hold onto are delta compression, which makes storing many versions of a file cheap, and reachability, which is the single rule that decides what survives garbage collection. Get those and the storage behavior — including why git rm of a big file does not shrink the repo — stops being mysterious.

Loose Objects and Packfiles

A loose object is an individual file: convenient to write, but each one is compressed on its own, so nothing is shared between near-identical versions. A packfile is the opposite: a .pack file storing many objects together, paired with a .idx index that lets Git look up any object inside it by object ID (its SHA hash) without scanning the whole pack.

You can see the split at any time. git count-objects -vH reports how many objects are loose versus packed and how much disk each consumes, which is the quickest health check on a repo.
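The loose-versus-packed split can be observed in a throwaway repository. The sketch below is illustrative only: the temporary directory, the file name notes.txt, and the placeholder identity are assumptions made for the demo, and gc.auto is set to 0 only to keep automatic GC from interfering with the counts.

```shell
set -eu

# Throwaway repository so nothing real is touched.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example"
git config user.email "example@example.invalid"
git config gc.auto 0   # keep automatic gc out of the demo

for i in 1 2 3; do
    echo "version $i" > notes.txt
    git add notes.txt
    git commit -q -m "notes v$i"
done

# Each commit wrote a blob, a tree, and a commit object: 9 loose objects.
loose_before=$(git count-objects -v | awk '/^count:/ {print $2}')

# A manual gc packs them all.
git gc -q

loose_after=$(git count-objects -v | awk '/^count:/ {print $2}')
packed_after=$(git count-objects -v | awk '/^in-pack:/ {print $2}')

echo "loose before gc: $loose_before"
echo "loose after gc:  $loose_after"
echo "packed after gc: $packed_after"
```

The count and in-pack fields read here are the same ones git count-objects -vH prints; -H only changes how the size fields are formatted.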

Delta Compression

Inside a packfile, Git stores some objects not in full but as a delta against a similar object, often another version of the same file. The base need not be the older version, since Git chooses bases by similarity rather than by commit order. Reconstructing a deltified object means starting from a full base object and applying each delta in the chain. Because successive versions of a file usually differ only slightly, this shrinks storage dramatically, which is exactly why keeping every version of a source file forever is affordable.

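To see the chains, git verify-pack -v .git/objects/pack/*.idx lists each object with its type, its size, its packed size, and its offset in the pack. Deltified objects additionally show their delta depth and the ID of their base object, so you can tell which objects are bases and which are deltas.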
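As a rough illustration, the following sketch commits three nearly identical versions of one file, packs them, and counts how many blobs ended up as deltas. The file name data.txt and its generated contents are assumptions made for the demo. The exact split between bases and deltas is up to Git's heuristics, so the script only reports what it finds.

```shell
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example"
git config user.email "example@example.invalid"

# A file large enough that a delta is clearly smaller than a full copy.
awk 'BEGIN { for (i = 1; i <= 2000; i++) print "line", i }' > data.txt
git add data.txt
git commit -q -m "data v1"

# Two more versions, each differing by one appended line.
for i in 2 3; do
    echo "extra line $i" >> data.txt
    git commit -q -am "data v$i"
done

git gc -q

# verify-pack -v prints one row per object. Rows for deltified objects carry
# two extra columns (delta depth and base object ID), so they have 7 fields
# instead of 5. The type check skips the trailing summary lines.
deltas=$(git verify-pack -v .git/objects/pack/*.idx |
    awk '$2 == "blob" && NF == 7 { n++ } END { print n + 0 }')
full=$(git verify-pack -v .git/objects/pack/*.idx |
    awk '$2 == "blob" && NF == 5 { n++ } END { print n + 0 }')

echo "blobs stored in full:   $full"
echo "blobs stored as deltas: $deltas"
```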

Reachability and Pruning

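An object is alive if it is reachable from a ref or from the reflog; everything else is a garbage-collection candidate. git gc runs git prune to delete unreachable objects, but only after a grace period (two weeks by default, set by gc.pruneExpire) that protects recently created objects. Passing --prune=now skips that grace period and deletes unreachable objects immediately. Commits that a reflog entry still records remain reachable and survive, but anything outside the reflog, such as a dropped stash or a commit whose reflog entry has expired, is deleted with no window for recovery.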

This reachability rule is the whole reason deleting a branch does not free disk space on its own: commits you had checked out typically stay reachable through HEAD's reflog until those entries expire, and they survive until they are both unreachable and pruned.
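The effect of the reflog on reachability can be checked with a small experiment. This is a sketch under assumed names (the branch experiment and the file file.txt are hypothetical): it deletes a branch, prunes immediately, and tests whether the orphaned commit still exists before and after the reflog is expired.

```shell
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example"
git config user.email "example@example.invalid"

echo base > file.txt
git add file.txt
git commit -q -m "base"
main_branch=$(git symbolic-ref --short HEAD)

# Commit on a side branch, then delete the branch.
git checkout -q -b experiment
echo experiment > file.txt
git commit -q -am "experimental work"
orphan=$(git rev-parse HEAD)
git checkout -q "$main_branch"
git branch -q -D experiment

# No branch points at the commit, but HEAD's reflog still does,
# so even an immediate prune leaves it in place.
git gc -q --prune=now
if git cat-file -e "$orphan" 2>/dev/null; then after_gc=present; else after_gc=gone; fi

# Expiring the reflog removes the last reference; now the prune deletes it.
git reflog expire --expire=now --all
git gc -q --prune=now
if git cat-file -e "$orphan" 2>/dev/null; then after_expire=present; else after_expire=gone; fi

echo "after gc --prune=now:     $after_gc"
echo "after reflog expire + gc: $after_expire"
```

Run this only in a scratch repository: expiring every reflog entry discards the usual route for recovering lost commits.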

gc and repack

Most of the time you never run gc by hand: many everyday commands invoke git gc --auto, which does real work once the number of loose objects exceeds gc.auto (6700 by default), and it repacks, prunes, and packs refs in one pass. When you do want control, git repack -ad consolidates every reachable object into a single pack and deletes the packs that this makes redundant. Reserve a forced full repack for when you actually need it; running an aggressive repack on every CI push burns minutes of CPU for negligible gain.

$ git count-objects -vH
count: 14
size: 56.00 KiB
in-pack: 9320
packs: 1
size-pack: 38.21 MiB
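To see what git repack -ad does compared with an incremental repack, the sketch below first builds up two separate packs and then consolidates them. The file name file.txt and the helper count_packs are hypothetical, and gc.auto is disabled here only so the pack counts are deterministic, not as a recommendation.

```shell
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example"
git config user.email "example@example.invalid"
git config gc.auto 0   # demo only; leave automatic gc on in real repos

# Hypothetical helper: read the "packs:" field from count-objects.
count_packs() {
    git count-objects -v | awk '/^packs:/ {print $2}'
}

# Build up two separate packs: `git repack -d` without -a packs only the
# currently loose objects into a new pack and leaves existing packs alone.
for i in 1 2; do
    echo "round $i" > file.txt
    git add file.txt
    git commit -q -m "round $i"
    git repack -q -d
done
packs_before=$(count_packs)

# -a puts every reachable object into one new pack; -d deletes the packs
# that this makes redundant.
git repack -q -a -d
packs_after=$(count_packs)

echo "packs before: $packs_before"
echo "packs after:  $packs_after"
```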

Common Mistakes

  • Running git gc --prune=now while hoping to recover an orphaned commit: it deletes unreachable objects immediately, so anything that no ref or reflog entry still points at cannot be recovered.
  • Assuming a git rm of a large file shrinks the repo — the blob stays packed and reachable through history until the history is rewritten.
  • Disabling gc.auto on a busy repo and accumulating millions of loose objects that cripple every operation.
  • Confusing "deleted the branch" with "freed the disk" — objects persist until they are both unreachable and pruned.
  • Forcing an aggressive repack on every push in CI, spending minutes of CPU for a negligible size improvement.
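
The git rm case from the list above can be verified directly. In this sketch (the file name big.dat and its generated contents are assumptions), a file is committed, removed in a later commit, and the repository is then pruned as aggressively as possible; the blob is still present afterwards because the first commit remains part of the branch history.

```shell
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example"
git config user.email "example@example.invalid"

# Commit a file, then remove it in a later commit.
awk 'BEGIN { for (i = 1; i <= 5000; i++) print "payload line", i }' > big.dat
git add big.dat
git commit -q -m "add big.dat"
blob=$(git rev-parse HEAD:big.dat)

git rm -q big.dat
git commit -q -m "remove big.dat"

# Expire reflogs and prune with no grace period.
git reflog expire --expire=now --all
git gc -q --prune=now

# The blob survives: the first commit is still in the branch history
# and its tree still references the blob.
if git cat-file -e "$blob" 2>/dev/null; then blob_state=present; else blob_state=gone; fi
echo "blob after rm + gc: $blob_state"
```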

Best Practices

  • Let automatic GC run via gc.auto rather than disabling it on active repos.
  • Check storage health with git count-objects -vH before assuming a repo is bloated.
  • Force a full repack only when you need it, with git repack -ad.
  • Inspect delta chains and pack contents with git verify-pack -v .git/objects/pack/*.idx.
  • Reclaim space after a history rewrite with git reflog expire --expire=now --all followed by git gc --prune=now.

Comparable Tools

  • Mercurial: revlog stores deltas inline, so no separate pack/gc step
  • Subversion: server-side revision storage with its own compaction
  • Fossil: SQLite store handles compaction internally

Knowledge Check

Why is storing many versions of a source file cheap in a packfile?

  • Delta compression stores most versions as small deltas against a similar base object
  • Only the newest version is retained in the pack; every older revision is discarded during repack
  • Each distinct version is stored just once and shared across every branch
  • The packfile records only changed line numbers, never the file content

Which objects survive garbage collection?

  • Those reachable from a ref or the reflog; everything else is a prune candidate
  • Only objects created within the last 24 hours of wall-clock time
  • Every object ever written, since Git never deletes anything once it has been packed
  • Only the objects backing files currently in the working tree

Why does git rm of a large file not shrink the repository?

  • The blob stays reachable through history and remains packed until the history is rewritten
  • The file is merely hidden from directory listings; git gc has no way to ever remove it
  • git rm deletes the working-tree copy but quietly doubles the packed blob
  • The removal only takes effect after collaborators run a fresh clone

What is the tradeoff of git gc --prune=now?

  • It reclaims space immediately but removes the grace period, so unreachable objects that no reflog entry protects are deleted with no chance of recovery
  • It is always safe to run because it only ever touches loose, unpacked objects
  • It rewrites the existing history in place, changing the SHA of every reachable commit it encounters
  • It pushes the cleanup to the remote so collaborators reclaim space too
