Topic 30

Large Files and Partial Clones

Git keeps every version of every committed file, which is exactly what you want for source code and exactly what you do not want for a 200 MB binary that changes every week. The same design that makes history cheap for text makes it ruinous for large or numerous binaries, and once such a file is committed the bloat stays in history unless that history is rewritten.

There are two separate problems and two separate families of fixes. Git LFS keeps large files out of the packfiles entirely by storing pointers. Shallow clone and partial clone fetch less of the history or of the file contents, and sparse-checkout populates less of the working tree, so you do not download or check out data you will never use.

Why Big Files Hurt

Git's storage relies on delta compression: similar versions of a text file are stored as small diffs against each other. Binaries (images, compiled artifacts, video) are usually already compressed, so a small edit changes most of the bytes and successive versions delta poorly; each one is stored nearly in full. Because Git keeps every version, a binary edited fifty times bakes roughly fifty full copies into history, and every full clone downloads all of them.

The only real fix is to keep those blobs out of the packfiles in the first place; removing them after the fact requires rewriting history, which is disruptive for everyone.
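The growth described above can be observed in a scratch repository. The following is a minimal sketch, assuming a POSIX shell with Git installed; the file name asset.bin, the temporary directory, and the Demo identity are hypothetical, and random bytes stand in for an incompressible binary.

```shell
#!/bin/sh
# Sketch: commit five versions of a 1 MiB random file and measure the pack.
set -e
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)              # hypothetical scratch repository
cd "$repo"
git init -q

for i in 1 2 3 4 5; do
    # Random bytes stand in for a binary asset that does not delta-compress.
    head -c 1048576 /dev/urandom > asset.bin
    git add asset.bin
    git commit -q -m "asset version $i"
done

git gc -q                      # repack so delta compression gets its chance
# size-pack is reported in KiB; the working tree holds a single 1 MiB copy.
pack_kib=$(git count-objects -v | awk '/^size-pack:/ {print $2}')
echo "working tree: 1024 KiB, packed history: ${pack_kib} KiB"
```

The packed history should come out close to five times the size of the working-tree file, because none of the versions can be stored as a small diff against another.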

Git LFS

git lfs track "*.psd" writes a filter rule into .gitattributes. From then on, committing a matching file stores a small text pointer in Git and sets the real content aside in a local LFS cache; git push uploads that content to a separate LFS store, and on checkout LFS fetches it back. Your history stays small because it carries pointers, not bytes.

The catch is that the filter lives in .gitattributes: if you do not commit that file, other clones never see the rule and store the raw binaries instead of pointers, defeating the whole arrangement.
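One way to confirm that the rule is both committed and effective is to ask Git which filter applies to a path. The sketch below writes the .gitattributes line by hand so that it runs without git-lfs installed; the path design.psd, the temporary directory, and the Demo identity are hypothetical.

```shell
#!/bin/sh
# Sketch: commit the LFS rule first, then verify Git sees it for a given path.
set -e
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)              # hypothetical scratch repository
cd "$repo"
git init -q

# `git lfs track "*.psd"` appends this line to .gitattributes; it is written
# by hand here so the sketch also runs on a machine without git-lfs.
echo '*.psd filter=lfs diff=lfs merge=lfs -text' >> .gitattributes

# Commit the rule before any matching file is added.
git add .gitattributes
git commit -q -m "Track PSD files with LFS"

# Ask which filter applies to a hypothetical path; the file need not exist.
rule=$(git check-attr filter -- design.psd)
echo "$rule"

# Exits non-zero if .gitattributes is not tracked in this repository.
git ls-files --error-unmatch .gitattributes > /dev/null
```

If git check-attr reports the filter as unspecified in a fresh clone, the rule never reached that clone and matching files will be committed as raw binaries.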

Shallow Clone

git clone --depth 1 fetches only the most recent commit (or the last N with a larger depth), truncating history. The clone is fast and small, which is ideal for a throwaway checkout. The cost is that the missing history is genuinely absent: git log stops at the cut, git blame attributes every older line to the boundary commit, and a merge whose common ancestor lies beyond the boundary can fail (for example with "refusing to merge unrelated histories") or produce spurious conflicts.

That makes shallow clone a tool for ephemeral environments, not for a working copy where you will investigate history or integrate branches.
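The truncation can be seen in a local experiment. This is a minimal sketch, assuming Git is installed; the origin and shallow directories, the file notes.txt, and the Demo identity are hypothetical. It also shows git fetch --unshallow, which retrieves the missing history if a shallow clone turns out to need it.

```shell
#!/bin/sh
# Sketch: compare visible history in a --depth 1 clone before and after unshallow.
set -e
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)              # hypothetical scratch area
git init -q "$work/origin"
for i in 1 2 3; do
    echo "revision $i" > "$work/origin/notes.txt"
    git -C "$work/origin" add notes.txt
    git -C "$work/origin" commit -q -m "revision $i"
done

# A file:// URL is required: --depth is ignored for plain local-path clones.
git clone -q --depth 1 "file://$work/origin" "$work/shallow"

shallow_count=$(git -C "$work/shallow" rev-list --count HEAD)
is_shallow=$(git -C "$work/shallow" rev-parse --is-shallow-repository)
echo "commits visible: $shallow_count (shallow: $is_shallow)"

# Fetch the rest of the history and count again.
git -C "$work/shallow" fetch -q --unshallow
full_count=$(git -C "$work/shallow" rev-list --count HEAD)
echo "commits after unshallow: $full_count"
```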

Partial Clone

git clone --filter=blob:none keeps the full commit history but skips downloading file contents until something actually needs them, fetching blobs on demand. A variant, --filter=blob:limit=1m, downloads small blobs up front and defers only the large ones.

This is the right choice when you need real history, for blame and cross-history merges, but do not want to pull down every blob of every version up front, as in a large repo with heavy assets.
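A local experiment makes the difference visible: the commit count is complete, while older blobs are recorded as missing until something reads them. This is a sketch under stated assumptions: the directories, the file asset.txt, and the Demo identity are hypothetical, and the two uploadpack settings are only there so that a local file:// repository agrees to serve filtered and on-demand fetches.

```shell
#!/bin/sh
# Sketch: a blob:none clone keeps all commits but defers old file contents.
set -e
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)              # hypothetical scratch area
git init -q "$work/origin"
# Assumption for this local demo: the serving side must allow filters.
git -C "$work/origin" config uploadpack.allowFilter true
git -C "$work/origin" config uploadpack.allowAnySHA1InWant true
for i in 1 2 3; do
    echo "asset version $i" > "$work/origin/asset.txt"
    git -C "$work/origin" add asset.txt
    git -C "$work/origin" commit -q -m "asset version $i"
done

git clone -q --filter=blob:none "file://$work/origin" "$work/partial"

commits=$(git -C "$work/partial" rev-list --count HEAD)
# --missing=print lists objects that are referenced but not downloaded,
# each prefixed with "?"; here those are older versions of asset.txt.
missing=$(git -C "$work/partial" rev-list --objects --all --missing=print |
    grep -c '^?')
echo "commits: $commits, blobs not yet downloaded: $missing"

# Reading an old version triggers an on-demand fetch of just that blob.
old=$(git -C "$work/partial" show HEAD~2:asset.txt)
echo "oldest content: $old"
```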

Sparse-Checkout

git sparse-checkout set --cone <dirs> populates only the directories you name from a monorepo (plus the files in the repository root), leaving everything else out of the working tree while history stays intact. Cone mode restricts patterns to whole directories, which lets Git match them quickly.

Without cone mode, sparse-checkout falls back to full pattern matching against every path, which is noticeably slow on a huge repository.

Shallow Clone vs Partial Clone

Shallow clone (--depth 1) truncates history: old commits are absent, so blame and cross-history merges cannot see past the cut. Fast and tiny, but anything that reaches back needs the missing history fetched first, for example with git fetch --unshallow.

Partial clone (--filter=blob:none) keeps full history but defers downloading file contents until accessed. Use shallow for throwaway CI checkouts, partial clone when you need history but not every blob up front.

Common Mistakes

Committing large binaries before setting up LFS, baking the blobs into history so removing them needs a full history rewrite.
Using a --depth 1 shallow clone in CI, then running git blame or merging across the cut and getting misleading blame output or merge failures.
Setting up LFS but forgetting to commit .gitattributes, so other clones store the raw files instead of pointers.
Running out of LFS storage or bandwidth quota mid-release because the budget was never planned.
Using sparse-checkout without cone mode on a huge repo and suffering slow non-cone pattern matching.

Best Practices

Track binaries with git lfs track and commit the resulting .gitattributes before adding the files.
Use git clone --filter=blob:none for large repos where you need history but not all content up front.
Reserve --depth 1 for ephemeral CI checkouts that never blame or merge across history.
Scope a monorepo working tree with git sparse-checkout set --cone <dirs>.
Audit bloat with git lfs ls-files and git count-objects -vH before it becomes unmanageable.
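As a sketch of such an audit, the commands below report overall repository size and then find the largest blob reachable from any ref. The scratch repository, the files small.txt and big.bin, and the Demo identity are hypothetical; the final awk step assumes paths without spaces.

```shell
#!/bin/sh
# Sketch: report repository size and locate the largest blob in history.
set -e
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)              # hypothetical scratch repository
cd "$repo"
git init -q
echo "small" > small.txt
head -c 102400 /dev/urandom > big.bin
git add .
git commit -q -m "one small file, one large file"

git count-objects -vH          # object counts and sizes in human-readable units
# git lfs ls-files             # lists LFS-tracked files (requires git-lfs)

# Every object reachable from any ref, with type, size, and path;
# keep blobs only, sort by size descending, take the path of the biggest.
largest=$(git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    awk '$1 == "blob"' | sort -k2,2nr | head -n 1 | awk '{print $3}')
echo "largest blob path: $largest"
```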

Comparable tools

Mercurial: largefiles and lfs extensions plus narrow/shallow clone
Perforce: handles huge binary assets natively, the game-studio norm
Subversion: checks out a single subtree by default, close to sparse-checkout
Fossil: no large-file offloading, stores everything in SQLite

Knowledge Check

Why do binary files defeat Git's delta compression?

They do not delta well, so each version is stored in full, and Git keeps every version forever
Git refuses to store binaries at all and silently drops them on commit, recording only their file paths while discarding the actual bytes during the staging step
Binaries are kept only on the server and never copied to local clones, which fetch lightweight placeholders instead and stream the real bytes on demand
Git compresses binaries so well that the repository history shrinks instead

What does a shallow clone omit that a partial clone does not?

Shallow omits old history; partial keeps full history but defers downloading blob contents
Shallow omits the blob contents and fetches them lazily; partial omits the older history beyond a depth boundary
They both omit exactly the same commits and objects, applying one identical truncation rule so the two clone types end up with the same trimmed-down history
Shallow omits the working tree; partial omits the staging index

Why does LFS require committing .gitattributes to work for everyone?

The track rule lives in .gitattributes; without it, other clones store raw files instead of pointers
It holds the LFS server password and authentication tokens, so without committing it other clones cannot log in to the large-file store and download the real binaries
Git rejects any commit made in a repo lacking the file, aborting the commit with an error until a valid .gitattributes is present at the repository root
It is only needed on the central server, never in clones

When does a shallow clone break blame and merge?

When the operation reaches across the truncated history that the --depth cut left out
Only when the repository also stores files in LFS, since the missing pointer objects are what trip up blame and merge on a truncated clone
Never; shallow clones fully support every history operation, reconstructing any missing commits on the fly whenever blame or merge needs them
Only on the very first commit you make after cloning

What does cone mode optimize in sparse-checkout?

It restricts patterns to whole directories so Git can match them quickly instead of scanning every path
It compresses the working tree into a single cone-shaped pack file on disk, replacing the loose checked-out files with one bundled archive that Git unpacks on access
It downloads file blobs lazily on demand like a partial clone
It removes all commit history older than the cone depth, truncating the log at a boundary so the repository keeps only the most recent slice of commits
