Remotes and Large Repositories
Topic 29

Submodules and Subtrees

Sooner or later you need one repository's code inside another — a shared library, a vendored dependency, a component you maintain elsewhere. Git offers two mechanisms for this, and they make opposite tradeoffs, so picking the wrong one is a decision you live with for years.

A submodule keeps the other repo separate and pins your parent to one of its commits, so the two histories never mix. A subtree merges the other repo's files and history directly into yours, so consumers clone one repository and never think about the boundary. Most submodule pain comes from expecting a submodule to behave like a subtree, and most subtree pain from expecting the reverse.

Submodules

git submodule add <url> <path> records the dependency's URL in a .gitmodules file and pins a specific commit SHA in the parent's tree. The submodule directory is genuinely a separate repository checked out at that SHA; the parent stores only the pointer, not the submodule's files or history.

This is why a submodule keeps histories cleanly separate: the parent commit says "use this dependency at exactly this commit," and the dependency keeps its own log, branches, and releases entirely apart from yours.
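The pointer-only storage described above can be inspected directly. The following is a minimal sketch, assuming a throwaway directory and two hypothetical repositories named dep and parent, with vendor/dep as an illustrative submodule path. The protocol.file.allow=always setting is needed only because the demo clones from a local path, which recent Git versions refuse for submodules by default; a normal https or ssh URL would not need it.

```shell
set -e
# Demo identity so commits succeed without any git config (hypothetical values).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
cd "$work"

# A stand-in dependency repository with a single commit.
git init -q dep
echo 'hello' > dep/lib.txt
git -C dep add lib.txt
git -C dep commit -q -m 'dep: initial commit'
dep_sha=$(git -C dep rev-parse HEAD)

# The parent repository that consumes it.
git init -q parent
cd parent

# Local-path demo only: allow the file transport for the submodule clone.
git -c protocol.file.allow=always submodule add "$work/dep" vendor/dep
git commit -q -m 'Add dep as a submodule'

# .gitmodules records the submodule's path and URL.
cat .gitmodules

# The parent's tree holds a single "gitlink" entry (mode 160000) that names
# the pinned commit; the dependency's files are not stored in the parent.
entry=$(git ls-tree HEAD vendor/dep)
echo "$entry"
entry_mode=$(echo "$entry" | awk '{print $1}')
pinned_sha=$(echo "$entry" | awk '{print $3}')
```

The SHA in the gitlink entry is the dependency's own commit ID, which is why the parent can name an exact version without carrying any of the dependency's history.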

Working With Submodules

The trap that catches everyone: a plain git clone of a submodule-using repo leaves the submodule directories empty, because the clone fetches the parent's pointer but not the separate repo it names. The fix is to clone with git clone --recurse-submodules, or run git submodule update --init --recursive afterward to fetch and check out each submodule at its pinned SHA.
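The empty-directory trap and both fixes can be reproduced locally. This is a sketch under assumptions: the repository names (dep, parent, plain, full) and the vendor/dep path are hypothetical, and protocol.file.allow=always appears only because the demo uses local paths instead of a real remote URL.

```shell
set -e
# Demo identity so commits succeed without any git config (hypothetical values).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
cd "$work"

# Setup: a dependency and a parent that pins it at vendor/dep.
git init -q dep
echo 'hello' > dep/lib.txt
git -C dep add lib.txt
git -C dep commit -q -m 'dep: initial commit'
git init -q parent
git -C parent -c protocol.file.allow=always submodule add "$work/dep" vendor/dep
git -C parent commit -q -m 'Add dep as a submodule'

# 1. A plain clone brings the pointer along but leaves the directory empty.
git clone -q parent plain
files_before=$(ls -A plain/vendor/dep)
git -C plain submodule status   # a leading "-" marks an uninitialized submodule

# 2. Repair an existing clone: fetch each submodule and check out its pinned SHA.
git -C plain -c protocol.file.allow=always submodule update --init --recursive

# 3. Or do both in one step at clone time.
git -c protocol.file.allow=always clone -q --recurse-submodules parent full
```

After step 2 or step 3, vendor/dep contains lib.txt checked out at the commit the parent pins.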

Submodules check out in detached HEAD by design — they sit on a commit, not a branch. Commits you make inside a submodule without first creating a branch are easy to lose, since nothing names them once you move off.
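A short sketch of the safe habit: check whether the submodule's HEAD is detached, then create a branch before committing. The repository names, the vendor/dep path, and the branch name fix-greeting are all hypothetical, and protocol.file.allow=always is present only because the demo clones from local paths.

```shell
set -e
# Demo identity so commits succeed without any git config (hypothetical values).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
cd "$work"

# Setup: a dependency, a parent that pins it, and a fresh clone with submodules.
git init -q dep
echo 'hello' > dep/lib.txt
git -C dep add lib.txt
git -C dep commit -q -m 'dep: initial commit'
git init -q parent
git -C parent -c protocol.file.allow=always submodule add "$work/dep" vendor/dep
git -C parent commit -q -m 'Add dep as a submodule'
git -c protocol.file.allow=always clone -q --recurse-submodules parent checkout

cd checkout/vendor/dep

# symbolic-ref succeeds only when HEAD points at a branch.
if git symbolic-ref -q HEAD >/dev/null; then
  head_state=branch
else
  head_state=detached
fi
echo "submodule HEAD is: $head_state"

# Name the work BEFORE committing, so the new commit stays reachable.
git checkout -q -b fix-greeting
echo 'hello, world' > lib.txt
git commit -q -am 'Improve greeting'
current_branch=$(git symbolic-ref --short HEAD)

# From the parent's point of view, the submodule now sits at a different commit.
cd "$work/checkout"
git status --short
```

If the commit had been made without the branch, it would still exist in the submodule's object store for a while, but nothing would name it once HEAD moved elsewhere.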

Subtrees

git subtree add --prefix=lib/foo <url> <ref> vendors the other repo's files and history into a subdirectory of yours. There is no separate repo and no pointer — the content lives in your tree like any other directory. Consumers clone your repository normally, with no special flag and no initialization step.

You sync changes with git subtree pull --prefix=lib/foo <url> <ref> to bring in upstream updates and git subtree push --prefix=lib/foo <url> <ref> to send changes back, but for anyone who only consumes the code, there is nothing special to know.
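The subtree flow can be sketched end to end with two local repositories. The names upstream and app are hypothetical, lib/foo follows the prefix used above, and the example assumes the git subtree command is installed (some distributions package it separately from Git).

```shell
set -e
# Demo identity so commits succeed without any git config (hypothetical values).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
cd "$work"

# A stand-in upstream repository.
git init -q upstream
echo 'v1' > upstream/foo.txt
git -C upstream add foo.txt
git -C upstream commit -q -m 'foo: v1'
branch=$(git -C upstream symbolic-ref --short HEAD)

# The consuming repository; subtree add needs an existing commit to build on.
git init -q app
cd app
echo 'app' > README
git add README
git commit -q -m 'app: initial commit'

# Vendor upstream's files and history under lib/foo.
git subtree add --prefix=lib/foo "$work/upstream" "$branch"

# The vendored files are ordinary tracked files; there is no .gitmodules.
git ls-tree -r --name-only HEAD

# Upstream moves on, and the consumer pulls the update into the same prefix.
echo 'v2' > "$work/upstream/foo.txt"
git -C "$work/upstream" commit -q -am 'foo: v2'
git subtree pull --prefix=lib/foo -m 'Merge upstream foo' "$work/upstream" "$branch"
```

A plain git clone of app now yields lib/foo/foo.txt with no extra flags, and git log in app includes upstream's commits alongside its own.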

Picking One

Reach for submodules when the dependency is a genuinely independent component with its own release cycle, owned and versioned separately. Reach for subtrees when downstream simplicity matters most — when consumers should clone once and never manage a second repository.

The costs are mirror images: submodules ask every contributor to understand the two-repo dance, while subtrees bloat your history with the vendored repo's commits and make contributing fixes back upstream awkward.

Submodule vs Subtree

Submodule: keeps the dependency as a separate repo pinned to a SHA, so clones need --recurse-submodules and contributors must understand the two-repo workflow. History stays separate and clean.

Subtree: copies the files and history into your repo, so it clones normally but bloats history and complicates upstream contribution.

Choose submodules for independently versioned components, subtrees when downstream simplicity matters most.

Common Mistakes

  • Cloning a submodule-using repo without --recurse-submodules and getting empty directories that break the build.
  • Committing in the parent without staging the updated submodule pointer (git add <submodule-path>), so teammates check out a different submodule SHA than the one you tested.
  • Making commits inside a submodule's detached HEAD without creating a branch, then losing them when you move off the commit.
  • Choosing subtree and later fighting git subtree split complexity when you try to contribute fixes back upstream.
  • Editing the .gitmodules URL without running git submodule sync, leaving clones pointed at the old URL.

Best Practices

  • Always clone with git clone --recurse-submodules, or run git submodule update --init --recursive right after.
  • Update a pinned SHA deliberately: cd sub && git checkout <sha> && cd .. && git add sub, then commit in the parent so the new pointer is recorded.
  • Propagate URL changes to everyone with git submodule sync --recursive.
  • Reserve submodules for components with independent release cycles and subtrees for vendored code consumers should not manage.
  • Pin to specific commits, never a moving branch, so builds stay reproducible.
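
Two of the practices above, bumping a pin deliberately and propagating a URL change, can be sketched as follows. The repository names, the vendor/dep path, and the dep-moved location are hypothetical, and protocol.file.allow=always is used only because the demo clones from a local path.

```shell
set -e
# Demo identity so commits succeed without any git config (hypothetical values).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
cd "$work"

# Setup: a dependency and a parent that pins its first commit.
git init -q dep
echo 'hello' > dep/lib.txt
git -C dep add lib.txt
git -C dep commit -q -m 'dep: initial commit'
git init -q parent
git -C parent -c protocol.file.allow=always submodule add "$work/dep" vendor/dep
git -C parent commit -q -m 'Add dep as a submodule'

# Upstream publishes a new commit.
echo 'hello again' >> dep/lib.txt
git -C dep commit -q -am 'dep: second commit'
new_sha=$(git -C dep rev-parse HEAD)

cd parent

# Bump the pin: fetch in the submodule, check out the exact SHA,
# then stage and commit the new pointer in the parent.
git -C vendor/dep fetch -q origin
git -C vendor/dep checkout -q "$new_sha"
git add vendor/dep
git commit -q -m "Bump vendor/dep to $new_sha"
pinned_sha=$(git ls-tree HEAD vendor/dep | awk '{print $3}')

# The dependency moves: record the new URL in .gitmodules and commit it.
mv "$work/dep" "$work/dep-moved"
git config -f .gitmodules submodule.vendor/dep.url "$work/dep-moved"
git add .gitmodules
git commit -q -m 'Point vendor/dep at its new location'

# Existing clones keep the old URL in their local config until they sync.
git submodule sync --recursive
synced_url=$(git config submodule.vendor/dep.url)
```

Every other clone needs to run git submodule sync --recursive after pulling the .gitmodules change, because the URL that Git actually uses is cached in each clone's local configuration.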

Comparable Tools

  • Mercurial: subrepositories via .hgsub/.hgsubstate, the direct submodule analog
  • Subversion: svn:externals embeds other repo paths
  • Perforce: composes code via streams and client mappings
  • Fossil: no built-in submodule mechanism

Knowledge Check

What is the core tradeoff between a submodule and a subtree?

  • A submodule keeps the dependency's history separate and pinned by SHA; a subtree merges its files and history into yours
  • A submodule copies the files into your tree; a subtree keeps them in a separate pinned repo
  • They are functionally identical and differ only in the command name you type, storing the dependency the same way and producing byte-for-byte the same layout in your repository
  • A subtree pins the dependency by SHA so histories stay separate; a submodule merges its files and history into yours

Why does a plain clone leave submodules empty, and how do you fix it?

  • The clone fetches the pointer but not the separate repo; fix with --recurse-submodules or git submodule update --init --recursive
  • The submodule was deleted from the upstream repo, leaving a dangling pointer behind; re-add it from scratch with git submodule add and the original URL to repopulate the directory
  • The directories are merely hidden by a config flag you can toggle on, since the files were fetched during the clone but stay collapsed until you reveal them
  • Submodules need a paid hosting feature enabled before they populate

Why does a submodule live in detached HEAD, and what is the risk?

  • It is checked out at a pinned commit, not a branch, so commits made without first creating a branch can be lost
  • It is checked out read-only and cannot accept any commits at all, since the parent locks the nested repo and rejects every attempt to stage or record changes inside it
  • Detached HEAD pushes every commit straight to the upstream repo, syncing each new commit you record back to the submodule's origin automatically
  • There is no risk; a detached HEAD behaves just like a normal branch

Which approach simplifies the consumer's clone, and which simplifies upstream contribution?

  • Subtree simplifies the consumer's clone; submodule simplifies contributing back to the independent upstream repo
  • Submodule simplifies the clone; subtree simplifies contributing upstream
  • Both simplify the consumer's clone equally with no extra steps, since each one populates its files automatically on a plain git clone without any initialization flags
  • Neither approach affects the clone or upstream contribution, so the two behave the same whether you are cloning the consumer repo or sending fixes back

What does the parent repo store for a submodule versus a subtree?

  • A submodule stores a URL and a pinned SHA pointer; a subtree stores the actual files and history
  • Both store only a URL and a pinned SHA pointer, recording just a reference to the dependency while leaving its real files and history outside your tree
  • Both embed the full vendored files and their complete history, copying every commit of the dependency directly into the parent repository's object store
  • A submodule stores the files; a subtree stores just a SHA pointer
