Foundations
Topic 03

How Git Thinks

Concept

Git stores each commit as a full snapshot of the project tree, not as a delta from the previous version. This one design choice explains why nearly every Git operation is fast and local: Git is moving pointers to immutable snapshots, not replaying a chain of diffs against a server.

If you come from Subversion or Perforce, this is the mental model to reset. Those systems think in per-file changes against a central server. Git thinks in whole-tree snapshots stored by content hash on your own disk. Internalize that and the rest of Git — cheap branches, instant log, offline everything — stops being surprising.

Figure: A commit is a snapshot of objects. A commit (snapshot: tree + parents + metadata) references a tree (directory listing), which references blobs (file bytes); identical content is stored once.

Snapshots, Not Diffs

Each commit references a tree object that names the exact content of every file at that moment. Files that did not change between commits are not copied — the new tree simply points at the same content as before. So a commit is a complete picture of the project, but storing it costs only the bytes that actually changed, because unchanged content is shared by reference.
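
A toy model can make the sharing concrete. The sketch below is not Git's real object format: `ObjectStore`, `put`, and `snapshot` are hypothetical names, and a tree is simplified to a flat mapping of path to content hash.

```python
import hashlib


class ObjectStore:
    """Toy content-addressed store: a dict keyed by the SHA-1 of the bytes."""

    def __init__(self):
        self.objects = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha1(data).hexdigest()
        self.objects[key] = data  # identical bytes always land on the same key
        return key


def snapshot(store: ObjectStore, files: dict[str, bytes]) -> dict[str, str]:
    """Record a whole-tree snapshot as a mapping of path -> content hash."""
    return {path: store.put(data) for path, data in files.items()}


store = ObjectStore()
v1 = snapshot(store, {"README": b"hello\n", "main.py": b"print(1)\n"})
v2 = snapshot(store, {"README": b"hello\n", "main.py": b"print(2)\n"})

# Both snapshots list every file, yet only three distinct contents are stored.
print(len(store.objects))            # 3
print(v1["README"] == v2["README"])  # True
```

Each snapshot is complete on its own, but the unchanged README costs nothing extra in the second one.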

The payoff is that checking out any commit is just reading a snapshot, not unwinding a stack of diffs. "Go to the state from three weeks ago" is a direct lookup, which is why it is instant even in a large repository.

Content-Addressable Storage

Git names every object — blob, tree, commit — by the hash of its content. Historically that hash is SHA-1; Git now also supports SHA-256 repositories. Because the name is a digest of the content, identical content always produces the same name and is therefore stored only once. Rename a file and the file's content (the blob) is untouched; only the tree entry that points to it changes.
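
For a SHA-1 repository, a blob's name is the SHA-1 of a short header (`blob`, a space, the content length in bytes, and a NUL byte) followed by the content itself. The sketch below reproduces that calculation; the function name `git_blob_id` is made up for illustration.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object name Git gives a blob with this content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# The file name is not an input, so renaming a file cannot change its blob id.
print(git_blob_id(b"hello world\n"))
```

The printed value should match what `git hash-object` reports for a file holding the same bytes.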

Content addressing also gives integrity for free. Change a single byte anywhere in history and that object's hash changes, which changes every commit downstream of it. Silent corruption and tampering cannot hide, because the names would no longer match the contents.
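
The ripple effect can be sketched with a simplified hash chain. This is a toy, not Git's commit format: `object_id` and `build_history` are hypothetical helpers, and each toy commit records only a content hash and its parent's id.

```python
import hashlib


def object_id(payload: bytes) -> str:
    return hashlib.sha1(payload).hexdigest()


def build_history(file_versions: list[bytes]) -> list[str]:
    """Return one toy commit id per version, each chained to its parent."""
    commit_ids = []
    parent = ""
    for content in file_versions:
        blob = object_id(content)
        # A toy commit names its content hash and its parent's id.
        commit = object_id(f"tree {blob}\nparent {parent}\n".encode())
        commit_ids.append(commit)
        parent = commit
    return commit_ids


original = build_history([b"v1\n", b"v2\n", b"v3\n"])
tampered = build_history([b"v1\n", b"v2!\n", b"v3\n"])

# The first commit is untouched; the altered one and everything after it differ.
print([a == b for a, b in zip(original, tampered)])  # [True, False, False]
```

The third version's content is identical in both histories, yet its commit id still changes because it names a parent whose id changed.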

The Commit Graph

Every commit points to its parent: a merge commit has two (or occasionally more), and the first commit has none. That chain of pointers forms a directed acyclic graph, and the history of a project is exactly that graph. Branches and tags are not heavyweight things; they are named pointers into the graph. In a SHA-1 repository, a branch stored as a loose ref is a 41-byte file: the 40 hexadecimal characters of one commit hash plus a newline.

This is why fear of "too many branches" is misplaced. Creating a branch writes a tiny pointer; it does not copy your files. The same operation that takes a server-side directory copy in Subversion is, in Git, instantaneous and local.
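
A minimal sketch of the idea, using a made-up five-commit graph with short placeholder ids instead of real hashes. The names `parents`, `refs`, `create_branch`, and `ancestors` are illustrative only.

```python
# Toy commit graph: each commit id maps to the list of its parent ids.
parents = {
    "a1": [],            # root commit, no parent
    "b2": ["a1"],
    "c3": ["b2"],
    "d4": ["b2"],
    "e5": ["c3", "d4"],  # merge commit, two parents
}

# Branches are just names pointing at commit ids.
refs = {"main": "e5"}


def create_branch(name: str, start: str) -> None:
    """Creating a branch records one pointer; no commit or file is copied."""
    refs[name] = refs[start]


def ancestors(commit: str) -> set[str]:
    """Collect every commit reachable from `commit` by following parents."""
    seen = set()
    stack = [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


create_branch("feature", "main")
print(refs["feature"])                     # e5
print(sorted(ancestors(refs["feature"])))  # ['a1', 'b2', 'c3', 'd4', 'e5']
```

The new branch sees the entire history immediately, because history belongs to the graph and not to the branch.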

Why Operations Are Local and Fast

Because the full object database lives in your .git directory, git log, git diff, git branch, and git checkout all read local files and typically complete in milliseconds with no network. The network is involved only when you deliberately synchronize with a remote: fetch, pull, or push. Everything else is yours, offline.

When you want to see what Git actually stored, git cat-file -p <hash> prints the raw object — a commit's tree and parent, a tree's entries, or a blob's bytes. It is the quickest way to turn the abstract model into something concrete.
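
To see roughly what sits behind that command: a loose object on disk is the zlib-compressed form of a header (type, space, size, NUL byte) followed by the body. The sketch below round-trips one in memory rather than reading a real repository; `encode_loose_object` and `decode_loose_object` are hypothetical helper names.

```python
import zlib


def encode_loose_object(kind: str, body: bytes) -> bytes:
    """Build loose-object bytes: zlib-compressed header, NUL byte, then body."""
    header = f"{kind} {len(body)}".encode("ascii") + b"\0"
    return zlib.compress(header + body)


def decode_loose_object(raw: bytes) -> tuple[str, bytes]:
    """Inflate a loose object and split it into (type, body)."""
    data = zlib.decompress(raw)
    header, _, body = data.partition(b"\0")
    kind, size = header.decode("ascii").split(" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match body length")
    return kind, body


raw = encode_loose_object("blob", b"hello world\n")
print(decode_loose_object(raw))  # ('blob', b'hello world\n')
```

This applies only to loose objects; objects that have been packed live in packfiles with a different layout, which is one reason git cat-file is the more reliable tool.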

Snapshot Model vs Delta Model

Snapshot (Git) — each commit is a whole-tree snapshot, with identical content deduplicated by hash. Branch, checkout, and diff are cheap and local because they read snapshots, not diff chains.

Delta (SVN, Perforce) — the server stores per-file deltas, so branching is a server-side copy and browsing history is a network round-trip. Simpler central control, but every common operation pays a network cost.
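
The difference in access pattern can be sketched with a deliberately simplified model. It does not reproduce how any real system stores data: a "delta" here is just a hypothetical (line index, new text) pair, and the function names are made up.

```python
# Delta model: keep the first version plus one edit per revision.
# Reaching revision n means replaying the first n edits in order.
def checkout_from_deltas(
    base: list[str], deltas: list[tuple[int, str]], rev: int
) -> list[str]:
    lines = list(base)
    for index, text in deltas[:rev]:
        lines[index] = text
    return lines


# Snapshot model: every revision is stored whole, so checkout is one lookup.
def checkout_from_snapshots(snapshots: list[list[str]], rev: int) -> list[str]:
    return snapshots[rev]


base = ["a", "b", "c"]
deltas = [(0, "A"), (2, "C"), (1, "B")]
snapshots = [checkout_from_deltas(base, deltas, r) for r in range(len(deltas) + 1)]

print(checkout_from_deltas(base, deltas, 3))   # ['A', 'B', 'C']
print(checkout_from_snapshots(snapshots, 3))   # ['A', 'B', 'C']
```

Both calls return the same content; the delta version does work proportional to the number of revisions replayed, while the snapshot version does a single lookup.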

Common Mistakes
  • Believing a branch is a heavyweight copy of all files, then avoiding branches — a branch is one small pointer, so creating one per task costs nothing.
  • Reasoning about storage as if Git kept a diff per commit, and worrying that every commit re-stores a large file; in fact a commit that leaves the file unchanged reuses the existing blob, and only a changed version adds a new one.
  • Trying to "edit" a commit in place and expecting the hash to stay the same — objects are immutable, so any change produces a new object with a new hash.
  • Expecting git log or git diff to hit the network like Subversion, and over-engineering around a latency that does not exist — they are entirely local.

Best Practices
  • Treat branches as cheap, disposable pointers and create one per task without hesitation.
  • Use commit hashes as stable identifiers in links, bug reports, and scripts, rather than branch names that move.
  • Reach for git cat-file -p <hash> when you need to see exactly what Git stored for a commit, tree, or blob.
  • Git already runs git gc --auto after some commands; run git gc by hand on long-lived repos only when loose objects have piled up, so they get packed and compressed.

Comparable tools
  • Mercurial: immutable hash-identified graph, revlog storage
  • Fossil: content-addressed objects in SQLite
  • Subversion: server-side delta model, no local object store

Knowledge Check

What does a Git commit actually reference for its contents?

  • A tree object naming the full content of every file at that moment, with unchanged files shared by reference
  • A stored diff against its parent commit, replayed in order on demand whenever you check the commit out or browse it
  • A list of the per-line edits accumulated since the last push to the remote branch was completed and acknowledged
  • A pointer to the sequential server-side revision number the central server assigns to it at commit time

One file in a 10,000-file repository is edited and committed. What changes in storage?

  • A new blob for that file and new tree objects along its path; the other 9,999 files are reused by reference
  • All 10,000 files are physically re-stored as a fresh, fully independent snapshot written out on disk every commit
  • A single line-level diff is appended to that one file's running per-file delta chain kept since the file was first added
  • Nothing at all until you push, when the remote server finally computes and records the change for you on its side

Why can't you alter a commit's contents without changing its hash?

  • The hash is computed from the object's content, so any change produces a different name — a new object
  • Git locks every committed object against editing by setting read-only file permissions on it in the .git directory on disk
  • The hash is a random value that Git generates afresh and reassigns to the object on each and every single edit you make
  • Only the remote server is permitted to change an object's hash, and it always refuses the request when asked to

Why is creating a branch nearly free in Git but costly in Subversion?

  • A Git branch is a tiny pointer to a commit; an SVN branch is a server-side copy under a branches path
  • Git keeps every branch purely in memory and never writes any of them to disk at all, so creating one is essentially free
  • SVN compresses and indexes each new branch as it is created on the server, a slow step that Git deliberately skips
  • Git caps every repository at a small fixed number of branches, which is the reason each individual one stays so cheap
