How Git Works

August 06, 2026 · 25 min read

Most developers use git every day and understand about 15% of it. They know add, commit, push, pull, and merge. They’ve memorised a few incantations for when things go wrong. And they’re vaguely terrified of rebase. This is because git is taught as a set of commands rather than as a data structure. Once you understand the data structure, the commands stop being mysterious. They’re just operations on a graph.

Before git: the long road to distributed version control

The history of version control is the history of people trying to collaborate on code without destroying each other’s work.

SCCS (Source Code Control System, 1972) and RCS (Revision Control System, 1982) were the first generation. They tracked changes to individual files, one file at a time, using a lock-edit-unlock model. If you wanted to edit a file, you locked it. Nobody else could edit it until you unlocked it. This prevented merge conflicts by preventing concurrent editing entirely, which also prevented concurrent work.

CVS (Concurrent Versions System, 1990) was the first widely-used system that allowed multiple people to edit the same file simultaneously. It tracked file-level history on a central server. Developers checked out a working copy, made changes, and committed them back. If two people changed the same file, CVS would attempt to merge the changes automatically. If the changes overlapped, it flagged a conflict for manual resolution.

CVS had serious limitations. It tracked files individually, not as a group: there was no concept of an atomic commit spanning multiple files. Renaming a file lost its history. Moving directories was dangerous. Branching existed but was so painful that people avoided it. Despite these flaws, CVS was the standard for open-source development through the 1990s. The FreeBSD project, the Apache Foundation, and thousands of other projects used it.

Subversion (SVN, 2000) was designed as “CVS done right.” Created by CollabNet (with significant contributions from Karl Fogel and Ben Collins-Sussman), it addressed CVS’s biggest problems: atomic commits (all changes in a commit succeed or fail together), directory versioning, efficient branching (branches and tags were cheap copy operations), and better binary file handling. Subversion used a simple revision numbering system (revision 1, revision 2, revision 3), which made it easy to reference specific points in history.

But Subversion was still centralised. The repository lived on a single server. If the server was down, you couldn’t commit. If the server was slow (and your team was on the other side of the world), every operation was slow. And branching, while cheaper than CVS, was still a heavyweight operation that required network access.

Then came the event that changed everything.

The BitKeeper controversy

The Linux kernel, the largest collaborative software project in history, needed a version control system that could handle its scale. By the early 2000s, kernel development involved thousands of developers, thousands of patches per release, and a workflow based on Linus Torvalds reviewing and applying patches sent by email.

In 2002, the kernel project adopted BitKeeper, a proprietary distributed version control system created by Larry McVoy. BitKeeper was technically excellent: it was designed for exactly the kind of large-scale, distributed development that the kernel required. It offered a free license for open-source projects, and the kernel team used it for three years.

In 2005, Andrew Tridgell (creator of Samba and co-creator of rsync) reverse-engineered parts of BitKeeper’s protocol, which McVoy considered a violation of the terms of the free license. Larry McVoy revoked the kernel project’s access. Suddenly, the world’s most important open-source project had no version control system.

Linus Torvalds, characteristically, decided to write his own. He started work on git in April 2005, and the first version was managing the kernel’s source code within weeks. His design goals were explicit: speed, support for distributed development, strong safeguards against data corruption, and the ability to handle a project the size of the Linux kernel.

Git was not the only distributed VCS to emerge from this period. Mercurial (hg) was started by Matt Mackall in the same month, April 2005, with similar motivations. Both were responses to the same crisis. Mercurial was generally considered more user-friendly, with a more intuitive command set and a cleaner interface. Git was faster and more flexible, but also more complex and initially quite hostile to newcomers. (Early git documentation was famously written for kernel developers, not for the general public.)

Git won the adoption war, largely because the Linux kernel used it and because of what happened three years later.

GitHub: where social met source control

In 2008, Tom Preston-Werner, Chris Wanstrath, PJ Hyett, and Scott Chacon launched GitHub, a web-based hosting service for git repositories that added a social layer on top of version control.

GitHub didn’t change git. It changed how people used git. The key innovations were:

Pull requests: a structured way to propose changes, review code, and discuss modifications before merging. Pull requests aren’t a git feature; git has no concept of them. They’re a GitHub workflow (and later GitLab’s “merge requests” and Bitbucket’s “pull requests”). But they changed how open-source contributions work. Instead of emailing patches to a mailing list, you fork a repository, make changes on a branch, and open a pull request. The maintainer reviews, discusses, requests changes, and eventually merges.

The fork model: anyone can fork any public repository, creating their own copy to experiment with. This lowered the barrier to contributing to open source from “convince the maintainer to give you commit access” to “click a button and start coding.”

The social graph: following developers, starring repositories, contribution activity graphs. GitHub made open-source development visible and discoverable in a way that mailing lists and SourceForge never did.

README-driven development: GitHub renders the README.md file on a repository’s main page, making it the first thing a visitor sees. This simple feature changed how projects presented themselves. A good README became a form of marketing, explaining what the project does, how to install it, how to use it, and how to contribute. Open-source projects that would have been invisible on a mailing list archive became discoverable and approachable.

GitHub became the default platform for open-source development, and eventually for much private development as well. Microsoft acquired it in 2018 for $7.5 billion. By then, it hosted over 100 million repositories.

GitHub shaped git’s perception more than git itself. Many developers conflate the two. “Push it to git” means “push it to GitHub.” Pull requests, issues, Actions (CI/CD), code review: none of these are git features. They’re GitHub features built on top of git. GitLab and Bitbucket offer similar features with different implementations. Understanding the boundary between git (the version control system) and GitHub (the hosting platform with social and collaboration features) is important because it tells you what’s portable and what’s vendor-specific.

Git’s object model: the four types

Underneath the commands and workflows, git is a content-addressable object store. Everything git stores is an object, identified by the SHA-1 hash of its content. There are four types:

Blobs store file content. Not file names, not permissions, just the raw content. A blob’s ID is the SHA-1 hash of a short header (the word “blob”, a space, the content length in bytes, and a NUL byte) followed by the file’s bytes. If two files in your repository have identical content, they’re stored as a single blob. The file name is stored elsewhere, in a tree.

Trees represent directories. A tree object contains a list of entries, each with a file mode (permissions), a name, and a reference (SHA hash) to either a blob (for a file) or another tree (for a subdirectory). A tree is a snapshot of a directory’s contents at a point in time.

Commits are the backbone. A commit object contains:

  • A reference to a tree (the snapshot of the entire project at this point)
  • References to zero or more parent commits (zero for the initial commit, one for a normal commit, two or more for a merge)
  • The author’s name, email, and timestamp
  • The committer’s name, email, and timestamp (usually the same as the author, but different for cherry-picked or rebased commits)
  • A commit message

Tags (annotated tags, specifically) are named references to other objects (usually commits), with an optional message and signature. They’re typically used for release markers: “this commit is version 2.1.0.”

That’s it. Four object types. Everything git does, every branch, every merge, every diff, every log entry, is an operation on these four types of objects.

You can see these objects yourself. git cat-file -t <hash> tells you the type. git cat-file -p <hash> prints the content. Try it on a commit hash from git log and you’ll see the tree reference, parent references, author, committer, and message. Follow the tree reference and you’ll see the directory listing. Follow a blob reference and you’ll see the file content. The entire repository is just these objects, linked by their hashes.
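To make the formats behind git cat-file -p more concrete, here is a minimal Python sketch of how a tree and a commit could be serialised and named. The file name, identity string, and timestamp are invented for illustration, and the entry sorting is simplified (real git orders tree entries by raw bytes and treats directory names as if they ended in “/”).

```python
import hashlib


def object_id(kind: str, body: bytes) -> str:
    """Name an object by hashing '<kind> <size>' + NUL + body."""
    header = f"{kind} {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def tree_body(entries):
    """Serialise (mode, name, hex_id) entries as 'mode name' + NUL + 20 raw id bytes."""
    body = b""
    for mode, name, hex_id in sorted(entries, key=lambda entry: entry[1]):
        body += f"{mode} {name}".encode() + b"\x00" + bytes.fromhex(hex_id)
    return body


def commit_body(tree_id, parent_ids, identity, message):
    """Serialise the commit fields in the order git cat-file -p prints them."""
    lines = [f"tree {tree_id}"]
    lines += [f"parent {parent}" for parent in parent_ids]
    lines += [f"author {identity}", f"committer {identity}", "", message]
    return ("\n".join(lines) + "\n").encode()


# Hypothetical content and identity, chosen only for this example.
blob = object_id("blob", b"hello world\n")
tree = object_id("tree", tree_body([("100644", "hello.txt", blob)]))
who = "Example Dev <dev@example.com> 1700000000 +0000"
root = object_id("commit", commit_body(tree, [], who, "Initial commit"))
child = object_id("commit", commit_body(tree, [root], who, "Second commit"))

print("blob  ", blob)
print("tree  ", tree)
print("commit", root, "(no parents)")
print("commit", child, "(parent: " + root + ")")
```

The root commit has zero parent lines and the second commit has one, which is the only structural difference between them. Everything else is hashes pointing at other hashes.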

Content-addressable storage: the same content, the same hash

The term “content-addressable” means that the address (identity) of an object is determined by its content. The SHA-1 hash of a blob’s header and content is its name in the object store. If you have a file containing “hello world\n” and I have a file containing “hello world\n”, they produce the same hash: 3b18e512dba79e4c8300dd08aeb37f8e728b8dad. Git stores it once.
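As a rough illustration, the blob naming scheme can be reproduced in a few lines of Python. The store dictionary and the put helper are hypothetical stand-ins for git’s object database, not its real implementation (which also compresses objects with zlib).

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Hash 'blob <size>' + NUL + content, the way git names a blob."""
    header = b"blob " + str(len(content)).encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


store = {}  # hypothetical object store: id -> content


def put(content: bytes) -> str:
    """Store content under its own hash and return that hash."""
    oid = blob_id(content)
    store[oid] = content  # identical content maps to the same key
    return oid


mine = put(b"hello world\n")
yours = put(b"hello world\n")

print(mine)           # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
print(mine == yours)  # True
print(len(store))     # 1
```

If git is installed, piping the same bytes into git hash-object --stdin should print the same ID, since that command hashes its input as a blob.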

This has profound consequences:

Integrity is built in. If any byte of a stored object changes (due to disk corruption, a bug, or tampering), the hash no longer matches the content. Git can detect this by rehashing the object and comparing the result with its name, so corruption doesn’t go unnoticed once the object is checked. This is why git fsck (file system check) exists and why it works.

Deduplication is automatic. Identical content is stored once, regardless of how many files or commits reference it. If you have a 10 MB library file that’s the same across 50 branches, it’s stored once.

History is tamper-evident. A commit’s hash depends on its tree, its parents, its message, and the author information. The tree’s hash depends on its entries. Each entry’s hash depends on its content. Changing anything in history changes the hash of that object, which changes the hash of everything that references it, all the way to the branch tip. You can’t alter history without the hashes changing. This is why force-pushing a rewritten branch is visible to everyone: the commit hashes are different.
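Both properties can be sketched in Python. The snippet below builds a simplified blob, tree, and commit chain (the commit body omits the author and committer lines, and app.py is a made-up file name), shows that a one-byte change ripples up through all three IDs, and then runs a toy fsck-style check over a dictionary standing in for the object store.

```python
import hashlib


def oid(kind: str, body: bytes) -> str:
    """Git-style object id: SHA-1 over '<kind> <size>' + NUL + body."""
    return hashlib.sha1(f"{kind} {len(body)}".encode() + b"\x00" + body).hexdigest()


def snapshot(content: bytes):
    """Build a blob -> tree -> commit chain and return the three ids."""
    blob = oid("blob", content)
    tree = oid("tree", b"100644 app.py\x00" + bytes.fromhex(blob))
    # Simplified commit body: real commits also carry author and committer lines.
    commit = oid("commit", f"tree {tree}\n\nInitial commit\n".encode())
    return blob, tree, commit


def corrupted(store):
    """Return the ids whose stored bytes no longer hash to their own key."""
    return [key for key, (kind, body) in store.items() if oid(kind, body) != key]


before = snapshot(b"print('v1')\n")
after = snapshot(b"print('v2')\n")
# One changed byte in the file changes the blob id, the tree id, and the commit id.
print([old != new for old, new in zip(before, after)])  # [True, True, True]

body = b"print('v1')\n"
store = {oid("blob", body): ("blob", body)}
print(corrupted(store))  # []
key = next(iter(store))
store[key] = ("blob", b"print('v9')\n")  # simulate a flipped byte on disk
print(corrupted(store) == [key])  # True
```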

A note on SHA-1: yes, SHA-1 is cryptographically broken in the sense that it’s possible to construct two different inputs with the same hash. Google and CWI Amsterdam demonstrated the first practical SHA-1 collision in 2017, producing two different PDF files with the same SHA-1 hash at an estimated cost of about $110,000 in GPU compute time. Git is transitioning to SHA-256 (the work has been underway since 2018, and experimental SHA-256 repositories can be created with git init --object-format=sha256 since Git 2.29). But for git’s core purposes (detecting accidental corruption and enabling deduplication), SHA-1 remains practical. A deliberate collision attack against a git repository would require an attacker to construct a malicious object with the same hash as a legitimate one, which is significantly harder than finding any two colliding inputs. Git also added collision detection hardening after the SHAttered attack, rejecting objects that exhibit known collision patterns.

The directed acyclic graph (DAG)

Commits in git form a directed acyclic graph. Each commit points to its parent(s), creating a graph that flows in one direction (from newer to older) and never forms cycles (a commit can’t be its own ancestor).

A simple linear history looks like this:

A ← B ← C ← D  (main)

Each arrow points from a commit to its parent: D’s parent is C, C’s parent is B, B’s parent is A. The arrows point backwards: each commit knows its parent(s), but parents don’t know their children.

When you create a branch and make commits on it:

A ← B ← C ← D  (main)
         ↑
         E ← F  (feature)

Commit E’s parent is C. Commits D and F are on different branches, both descended from C. The graph has diverged.

When you merge:

A ← B ← C ← D ←── G  (main)
         ↑         ↗
         E ← F ──╯    (feature)

Commit G is a merge commit with two parents: D and F. The branches have converged. The DAG now records that G incorporates the history of both branches.

This structure is why git is fast at operations that other systems find expensive. Finding the common ancestor of two branches? Walk the graph backwards from both until you find a shared node. Determining whether one commit is an ancestor of another? Graph traversal. These are fundamental graph algorithms, and git’s entire model is built on them.
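Those two traversals are short enough to sketch in Python against the merged graph from the diagrams above. This is a naive version for illustration only; git’s own implementation uses commit dates and other optimisations to avoid walking the whole history.

```python
# The merged graph from the diagrams above: commit -> list of parents.
PARENTS = {
    "A": [], "B": ["A"], "C": ["B"], "D": ["C"],
    "E": ["C"], "F": ["E"], "G": ["D", "F"],
}


def ancestors(commit, parents=PARENTS):
    """Every commit reachable by following parent links, including the start."""
    seen, stack = set(), [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def is_ancestor(older, newer, parents=PARENTS):
    """True if 'older' can be reached by walking back from 'newer'."""
    return older in ancestors(newer, parents)


def merge_bases(left, right, parents=PARENTS):
    """Common ancestors that are not themselves ancestors of another common ancestor."""
    common = ancestors(left, parents) & ancestors(right, parents)
    return {
        c for c in common
        if not any(c != other and c in ancestors(other, parents) for other in common)
    }


print(merge_bases("D", "F"))   # {'C'}
print(is_ancestor("C", "F"))   # True
print(is_ancestor("D", "F"))   # False
print(sorted(ancestors("G")))  # ['A', 'B', 'C', 'D', 'E', 'F', 'G']
```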

Refs: branches are just pointers

Here’s the thing that demystifies branching: a branch is a file containing a 40-character SHA-1 hash.

Look inside your .git/refs/heads/ directory. Each file is named after a branch. Each file contains the SHA hash of the commit that branch points to. That’s it. (Git may also consolidate refs into a single .git/packed-refs file, but the idea is the same: a name mapped to a hash.) Creating a branch is writing a 40-character string to a file. Deleting a branch is deleting that file. This is why branching in git is instantaneous: there’s nothing to copy, nothing to compute.

HEAD is a special ref that tells git which branch (or commit) you’re currently on. It’s usually a symbolic reference: the file .git/HEAD contains something like ref: refs/heads/main, meaning “HEAD points to whatever main points to.” When you switch branches with git checkout feature, git updates .git/HEAD to ref: refs/heads/feature.

Tags are similar to branches (they’re refs in .git/refs/tags/), but they don’t move. When you commit on a branch, the branch ref advances to the new commit. A tag stays where it is. For a lightweight tag, that’s the only difference; an annotated tag’s ref points to a tag object instead of directly to a commit.

Remote-tracking branches (like origin/main) live in .git/refs/remotes/. They’re updated when you fetch from the remote, but you can’t commit to them directly. They’re your local record of where the remote’s branches were the last time you checked.

Understanding that branches are just movable pointers to commits eliminates most of the fear around branching. Creating a branch costs nothing. Deleting a branch that’s been merged costs nothing (the commits are still in the graph, referenced by the merge). The only thing a branch does is give a human-readable name to a commit hash.
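A small Python sketch shows how little machinery is involved. It builds a throwaway directory that mimics the layout described above and resolves HEAD by following the “ref:” indirection. The resolve_ref helper and the placeholder hash are assumptions made for this example; real git handles more cases (worktrees, alternative ref storage backends) than this does.

```python
import tempfile
from pathlib import Path
from typing import Optional


def resolve_ref(git_dir: Path, name: str = "HEAD") -> Optional[str]:
    """Follow 'ref: ...' indirections until a commit hash is found."""
    path = git_dir / name
    if path.is_file():
        value = path.read_text().strip()
        if value.startswith("ref: "):
            return resolve_ref(git_dir, value[len("ref: "):])
        return value
    # Fall back to packed-refs, where git may consolidate refs into one file.
    packed = git_dir / "packed-refs"
    if packed.is_file():
        for line in packed.read_text().splitlines():
            if not line or line[0] in "#^":
                continue
            sha, ref = line.split(" ", 1)
            if ref == name:
                return sha
    return None


# Build a throwaway directory that mimics the layout described above.
fake_git = Path(tempfile.mkdtemp()) / ".git"
(fake_git / "refs" / "heads").mkdir(parents=True)
commit = "3f" * 20  # placeholder 40-character hash, not a real commit
(fake_git / "HEAD").write_text("ref: refs/heads/main\n")
(fake_git / "refs" / "heads" / "main").write_text(commit + "\n")

# "Creating a branch" is just writing the same 40 characters to a new file.
(fake_git / "refs" / "heads" / "feature").write_text(commit + "\n")

print(resolve_ref(fake_git) == commit)                        # True
print(resolve_ref(fake_git, "refs/heads/feature") == commit)  # True
print(resolve_ref(fake_git, "refs/heads/missing"))            # None
```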

What git merge actually does

When you run git merge feature while on main, git performs a three-way merge:

  1. Find the merge base: the most recent common ancestor of main and feature. In our earlier example, that’s commit C.
  2. Compute the diff from the merge base to the tip of main (C → D): what changed on main?
  3. Compute the diff from the merge base to the tip of feature (C → F): what changed on feature?
  4. Combine the two diffs:
    • Changes that appear in only one diff are applied cleanly
    • Changes that affect different files are applied cleanly
    • Changes that affect different parts of the same file are applied cleanly
    • Changes that affect the same part of the same file are a merge conflict: git can’t decide which change wins, so it asks you

If there are no conflicts, git creates a merge commit with two parents. If there are conflicts, git pauses and presents the conflicting regions (the familiar <<<<<<< HEAD / ======= / >>>>>>> feature markers) for you to resolve manually.

The three-way merge is what makes this work. A two-way diff (just comparing main and feature) can’t tell you what changed, only that they’re different. The three-way merge, by including the common ancestor, can identify what each side changed and merge those changes intelligently.

There’s a special case: the fast-forward merge. If main hasn’t moved since feature branched off (i.e., main is an ancestor of feature), there’s no divergence and no merge needed. Git simply moves the main pointer forward to where feature is. No merge commit is created. The history stays linear.
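The decision rule behind a three-way merge can be sketched in Python at whole-file granularity. This is a deliberate simplification: each snapshot is a dictionary from path to content ID, and the file names and IDs are invented. Git applies the same base/ours/theirs comparison again inside a file, region by region, which this sketch does not attempt.

```python
def three_way_merge(base, ours, theirs):
    """Merge two snapshots (path -> content id) using their common ancestor."""
    merged, conflicts = {}, []
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:      # both sides agree (unchanged, or the same change)
            result = o
        elif o == b:    # only theirs changed this path
            result = t
        elif t == b:    # only ours changed this path
            result = o
        else:           # both changed it, differently
            conflicts.append(path)
            continue
        if result is not None:  # None means the path was deleted
            merged[path] = result
    return merged, conflicts


# Hypothetical snapshots: C is the merge base, D is main, F is feature.
base = {"README": "r1", "app.py": "a1", "config.yml": "c1"}
ours = {"README": "r2", "app.py": "a1", "config.yml": "c3"}
theirs = {"README": "r1", "app.py": "a2", "config.yml": "c2"}

merged, conflicts = three_way_merge(base, ours, theirs)
print(merged)     # {'README': 'r2', 'app.py': 'a2'}
print(conflicts)  # ['config.yml']
```

Without base, the function could only see that README differs between the two sides. With it, the function can tell that only main touched README and take that change without asking.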

What git rebase actually does

Rebasing is replay. When you run git rebase main while on feature, git:

  1. Finds the common ancestor of feature and main
  2. Saves all the commits on feature that aren’t on main (the “feature-only” commits)
  3. Resets feature to point to the tip of main
  4. Replays each saved commit, one by one, on top of the new base

The result is the same changes, but with different commit hashes (because the parent has changed, so the hash changes) and a linear history. Instead of:

A ← B ← C ← D  (main)
         ↑
         E ← F  (feature)

You get:

A ← B ← C ← D  (main)
               ↑
               E' ← F'  (feature)

E' and F' have the same changes as E and F, but they’re new commits with new hashes. The old E and F still exist in the object store (until garbage collection), but no branch references them anymore; only the reflog still remembers them.

This is why rebasing rewrites history, and why you should never rebase commits that other people have based work on. If you rebase a shared branch, everyone else has the old commits, and you have new commits with the same changes but different hashes. Git sees these as completely different commits, and the next merge will be a mess.

The golden rule: rebase your own branches freely. Never rebase shared branches. If you’re the only one working on a feature branch, rebasing onto main before merging gives you a clean, linear history. If others have pulled your branch, merge instead.
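Why the hashes change can be shown with a toy model in Python. Here a commit ID is just a hash of the parent ID plus a description of the change, which is an assumption made for brevity (real commits hash a tree, parents, identities, and a message, and a real replay re-applies diffs that may conflict). The change descriptions are invented.

```python
import hashlib


def commit_id(parent_id: str, change: str) -> str:
    """Toy commit id: depends on the parent id and the change, as a real commit hash does."""
    return hashlib.sha1(f"{parent_id}\n{change}".encode()).hexdigest()


def replay(base_id: str, changes):
    """Apply each change on top of base_id, returning the new commit ids in order."""
    ids, parent = [], base_id
    for change in changes:
        parent = commit_id(parent, change)
        ids.append(parent)
    return ids


# main: A <- B <- C <- D; feature branched from C with changes E and F.
a, b, c, d = replay("", ["A", "B", "C", "D"])
feature_changes = ["E: add parser", "F: add tests"]  # hypothetical changes
old_e, old_f = replay(c, feature_changes)

# git rebase main: the same changes, replayed on top of D instead of C.
new_e, new_f = replay(d, feature_changes)

print(old_e[:8], "->", new_e[:8])
print(old_f[:8], "->", new_f[:8])
print(old_e != new_e, old_f != new_f)  # True True
```

F' differs from F even though its own change is identical, because its parent E' already has a different ID. That cascade is the reason a rebased branch looks like entirely new history to anyone who still has the old commits.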

The index: git’s staging area

Between your working directory and the repository sits the index (also called the staging area or the cache). It’s one of git’s most misunderstood features, and it’s the reason git add exists as a separate step from git commit.

The index is a binary file (.git/index) that represents the next commit. When you run git add file.txt, you’re copying the current state of file.txt into the index. When you run git commit, git creates a tree from the index and wraps it in a commit. The working directory is not directly involved in the commit; only the index matters.

This design allows partial commits. You’ve changed five files, but only two of those changes are related to the feature you’re committing. You git add the two relevant files, commit, then continue working on the other three. The index is the mechanism that makes this possible.

It also explains git diff versus git diff --staged. git diff shows the difference between your working directory and the index (what you haven’t staged yet). git diff --staged shows the difference between the index and the last commit (what you’re about to commit). They’re answering different questions.

The index is also why git reset has three modes that confuse everyone:

  • git reset --soft HEAD~1 moves the branch pointer back one commit but leaves the index and working directory unchanged. The changes from the undone commit are still staged, ready to be committed again.
  • git reset --mixed HEAD~1 (the default) moves the branch pointer and resets the index, but leaves the working directory unchanged. The changes are in your files but not staged.
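  • git reset --hard HEAD~1 moves the branch pointer, resets the index, and resets the working directory. The changes are gone from your files (the undone commit is recoverable via the reflog, by default for 30 days, but uncommitted changes that were overwritten are not).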

Each mode resets to a different boundary: soft stops at the branch pointer, mixed stops at the index, hard goes all the way to the working directory. Once you know the three layers (repository, index, working directory), the three reset modes make perfect sense.
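The three modes can be modelled in a few lines of Python. The Repo class below is a hypothetical stand-in with one field per layer, and snapshots are plain dictionaries; it illustrates which layers each mode touches, not how git implements reset.

```python
from dataclasses import dataclass


@dataclass
class Repo:
    commits: dict   # commit name -> snapshot (path -> content)
    branch: str     # commit the current branch points to
    index: dict     # staged snapshot
    worktree: dict  # files on disk


def reset(repo: Repo, target: str, mode: str = "mixed") -> None:
    if mode not in ("soft", "mixed", "hard"):
        raise ValueError(f"unknown mode: {mode}")
    repo.branch = target                            # all three modes move the branch
    if mode in ("mixed", "hard"):
        repo.index = dict(repo.commits[target])     # mixed and hard also reset the index
    if mode == "hard":
        repo.worktree = dict(repo.commits[target])  # only hard touches the files


def fresh() -> Repo:
    """A clean repository sitting on c2, with index and files matching it."""
    commits = {"c1": {"app.py": "v1"}, "c2": {"app.py": "v2"}}
    return Repo(commits, "c2", dict(commits["c2"]), dict(commits["c2"]))


for mode in ("soft", "mixed", "hard"):
    repo = fresh()
    reset(repo, "c1", mode)
    print(mode, repo.branch, repo.index["app.py"], repo.worktree["app.py"])
# soft c1 v2 v2
# mixed c1 v1 v2
# hard c1 v1 v1
```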

Remotes and the distributed model

The word “distributed” in “distributed version control” means something specific: every clone of a git repository is a complete repository. Not a working copy. Not a checkout. A full copy of every commit, every tree, every blob, every ref. You can work entirely offline (committing, branching, merging, viewing history) because everything is local.

Remotes are named references to other copies of the repository. When you git clone https://github.com/someone/repo.git, git creates a remote called origin that points to the URL you cloned from. You can have multiple remotes: origin for your fork, upstream for the original repository, staging for a deployment target.

git fetch origin downloads all new objects and refs from the remote without modifying your local branches. It updates the remote-tracking branches (origin/main, origin/feature) to reflect the remote’s current state. Your local branches are untouched. This is why fetch is always safe: it only adds information.

git push origin main uploads your local main branch’s objects and refs to the remote. If the remote’s main has moved since you last fetched (someone else pushed), the push is rejected. Git won’t let you overwrite someone else’s work without explicit force.
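The fetch and push rules can be sketched with a toy model in Python. The Repo class, the single-letter commit names, and the scenario are all invented for illustration; the point is only that fetch adds objects and moves remote-tracking refs, while push is refused unless the remote’s current tip is an ancestor of what you are sending.

```python
class Repo:
    def __init__(self):
        self.parents = {}  # commit -> list of parent commits (the object store)
        self.refs = {}     # ref name -> commit

    def commit(self, branch, name, extra_parents=()):
        """Add a commit on top of the branch tip and advance the branch."""
        tip = self.refs.get(branch)
        self.parents[name] = ([tip] if tip else []) + list(extra_parents)
        self.refs[branch] = name


def reachable(repo, tip):
    """Commits reachable from tip among the objects this repo actually has."""
    seen, stack = set(), [tip]
    while stack:
        current = stack.pop()
        if current in repo.parents and current not in seen:
            seen.add(current)
            stack.extend(repo.parents[current])
    return seen


def fetch(local, remote, name="origin"):
    """Copy objects and update remote-tracking refs; local branches are untouched."""
    local.parents.update(remote.parents)
    for branch, tip in remote.refs.items():
        local.refs[f"{name}/{branch}"] = tip


def push(local, remote, branch, force=False):
    """Accept only fast-forwards unless forced; return True if the remote was updated."""
    new_tip, old_tip = local.refs[branch], remote.refs.get(branch)
    if old_tip is not None and not force and old_tip not in reachable(local, new_tip):
        return False
    for commit in reachable(local, new_tip):
        remote.parents[commit] = local.parents[commit]
    remote.refs[branch] = new_tip
    return True


server = Repo()
server.commit("main", "A")
server.commit("main", "B")

alice = Repo()
fetch(alice, server)
alice.refs["main"] = alice.refs["origin/main"]  # roughly what a clone sets up

server.commit("main", "C")           # someone else pushes C
alice.commit("main", "D")            # Alice commits D on top of B
print(push(alice, server, "main"))   # False: the remote has moved

fetch(alice, server)
print(alice.refs["main"], alice.refs["origin/main"])  # D C

alice.commit("main", "M", extra_parents=[alice.refs["origin/main"]])  # merge commit
print(push(alice, server, "main"))   # True
```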

The forking model, popularised by GitHub, uses this distributed nature. You fork a repository (creating your own remote copy), clone your fork, create a branch, push to your fork, and open a pull request back to the original. The original repository’s maintainers can pull your changes without giving you write access. This is the model that scaled open source from “email patches to a mailing list” to “millions of contributors across millions of projects.”

The shared repository model is simpler: everyone has push access to the same remote. You create feature branches, push them, open pull requests, and merge. This is how most teams work on private repositories. The tradeoff is less access control but simpler workflow.

Both models work because git’s distributed nature means there’s no privileged copy. Your clone is as complete as the “central” repository on GitHub. GitHub is convenient (hosting, pull requests, CI integration), but it’s not special from git’s perspective. It’s just another remote.

Packfiles: how git stays efficient

If every object is stored as a separate file on disk, a repository with millions of objects would have millions of small files. Filesystems don’t handle this well. Git solves this with packfiles.

Periodically (and always during git gc, git push, and git clone), git compresses objects into packfiles. A packfile stores multiple objects in a single file, using delta compression: instead of storing each version of a file in full, it typically stores the most recent version in full and each previous version as a delta (a set of changes) from the next version.

This is counterintuitive: git’s conceptual model is snapshots (each commit references complete trees and blobs), but its storage model uses deltas for efficiency. The abstraction layer means you never need to think about deltas: every operation works as if every version is a complete snapshot. But on disk, a repository that contains hundreds of versions of a large file doesn’t store hundreds of copies.
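The snapshot-on-top, delta-underneath idea can be illustrated with a rough Python sketch. It uses line-level copy and insert instructions computed with the standard difflib module, which is a stand-in chosen for readability and not git’s actual packfile encoding (that works on bytes and picks delta bases heuristically). The three file versions are invented.

```python
from difflib import SequenceMatcher


def make_delta(base, target):
    """Encode target as copy/insert instructions against base."""
    ops = []
    matcher = SequenceMatcher(a=base, b=target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))             # reuse lines from base
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))    # new lines stored literally
        # "delete" needs no instruction: those base lines are simply not copied
    return ops


def apply_delta(base, ops):
    """Rebuild the target from base plus the instructions."""
    out = []
    for op in ops:
        if op[0] == "copy":
            out.extend(base[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out


# Three hypothetical versions of one file, oldest first.
v1 = ["import os", "print('hi')", "# end"]
v2 = ["import os", "print('hello')", "# end"]
v3 = ["import os", "print('hello')", "# end", "# v3 footer"]

# Store the newest version in full and each older one as a delta from the next.
full = v3
delta_v2 = make_delta(v3, v2)
delta_v1 = make_delta(v2, v1)

rebuilt_v2 = apply_delta(full, delta_v2)
rebuilt_v1 = apply_delta(rebuilt_v2, delta_v1)
print(rebuilt_v2 == v2, rebuilt_v1 == v1)  # True True
```

Callers only ever see v1, v2, and v3 as complete files; the chain of deltas is an internal detail of how they are kept on disk.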

The combination of content-addressable storage, deduplication, and delta compression in packfiles is why git repositories are surprisingly compact. The Linux kernel repository contains over a million commits spanning nearly two decades, and the packfile is about 4 GB. That’s the complete history of one of the world’s largest software projects, fully traversable, on a USB stick.

Garbage collection (git gc) is the process that creates packfiles and cleans up unreferenced objects. Git runs it automatically when the number of loose objects exceeds a threshold (about 6,700 by default). It compresses loose objects into packfiles, removes objects that are no longer reachable from any ref or the reflog (after the reflog’s expiry period, typically 30-90 days), and optimises the repository’s storage.

This is why “deleting” a branch doesn’t immediately free space: the commits are still in the object store, and often still referenced by HEAD’s reflog. They’ll be cleaned up by garbage collection eventually, but not immediately. It’s also why you can recover from most mistakes within the reflog expiry window: the data is still there, you just need to find its hash.

git stash: the shelf for unfinished work

git stash is a convenience feature, but understanding how it works reinforces the object model. When you run git stash, git creates two (or three) commit objects: one for the current state of the index, one for the current state of the working directory, and optionally one for untracked files. These commits are stored on a special ref (refs/stash) and don’t appear in your branch history.

git stash pop applies the stashed changes back to your working directory and removes them from the stash. git stash apply applies them but keeps them in the stash. Under the hood, it’s all commits and refs, the same machinery as everything else in git.

The practical use: you’re halfway through a feature, something urgent comes up on another branch, and you need to switch. Stash your changes, switch branches, fix the urgent thing, switch back, pop the stash. Without stash, you’d either need to commit half-finished work (polluting the history) or risk losing changes when you switch branches.

git bisect: binary search through history

git bisect is one of git’s most powerful and least-used features. It performs a binary search through your commit history to find the commit that introduced a bug.

You tell git a “bad” commit (where the bug exists, usually HEAD) and a “good” commit (where the bug doesn’t exist, maybe last week’s release tag). Git checks out the commit halfway between them and asks: is this good or bad? You test, answer, and git narrows the range by half. For 1,000 commits between good and bad, bisect finds the guilty commit in about 10 steps.

This only works well if each commit is a testable, coherent change. If your commits are “WIP” or “stuff” or combine unrelated changes, bisecting is pointless because individual commits don’t correspond to meaningful states. This is the practical reason for making small, focused commits: not just tidiness, but debuggability.

You can even automate it: git bisect run ./test.sh will run your test script at each step and determine good/bad automatically. Give it a script that exits with 0 for “good” and 1 for “bad” (any code from 1 to 127 except 125 counts as bad, and 125 means “skip this commit”), and git will find the offending commit without any manual intervention.
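The search itself is ordinary binary search. Here is a minimal Python sketch over a linear history, assuming the first commit is known good and the last is known bad; the commit numbers and the position of the bug are made up, and real git bisect also has to cope with branching history.

```python
import math


def bisect(commits, is_bad):
    """Find the first bad commit in a linear history, oldest first.

    Assumes commits[0] is known good and commits[-1] is known bad.
    """
    good, bad, steps = 0, len(commits) - 1, 0
    while bad - good > 1:
        middle = (good + bad) // 2
        steps += 1
        if is_bad(commits[middle]):
            bad = middle    # the bug was introduced at or before the middle
        else:
            good = middle   # the bug was introduced after the middle
    return commits[bad], steps


history = list(range(1001))  # commit 0 is good, commit 1000 is bad
first_bad = 637              # hypothetical commit that introduced the bug
culprit, steps = bisect(history, lambda commit: commit >= first_bad)

print(culprit)                               # 637
print(steps <= math.ceil(math.log2(1000)))   # True
```

The is_bad callback plays the role of the test script passed to git bisect run: it answers one question per step, and each answer halves the remaining range.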

Practical wisdom

Understanding git’s internals changes how you use it:

Commit messages matter. A commit message is stored in the commit object and is part of the permanent record. “fix bug” tells you nothing six months later. “Fix null pointer in payment processing when customer has no default card” tells you exactly what happened and why. The first line should be a concise summary (50 characters is the convention). If more context is needed, leave a blank line and write a longer description.

Make small, focused commits. Each commit should represent a single logical change. This makes git bisect (binary search through history to find when a bug was introduced) effective, makes reverts safe (reverting a small commit is unlikely to have side effects), and makes code review possible.

Don’t fear rebasing on your own branches. If you’re working on a feature branch and main has moved, rebasing your branch onto the current main gives you a clean, linear history that’s easier to review and bisect. The commits are yours, nobody else has them, and rewriting them is harmless.

Never force-push shared branches. If you rebase a branch that other people have pulled, their local history diverges from the remote. They’ll need to do a complicated recovery, and they’ll be annoyed. Force-pushing to main or any shared development branch is a cardinal sin of collaborative development.

Use git reflog when things go wrong. The reflog records every time a ref (branch or HEAD) changes. Even if you accidentally delete a branch or reset to the wrong commit, the old commits are still in the object store, and the reflog tells you their hashes. git reflog is your time machine. Commits don’t actually disappear until git gc runs (by default, unreferenced commits are kept for 30 days).

Understand that git pull is git fetch + git merge. If you want to see what changed on the remote before incorporating it, use git fetch first, inspect the changes, and then merge or rebase. git pull --rebase does a fetch followed by a rebase instead of a merge, which keeps your local history linear.

Use .gitignore before you commit secrets. Once a file is in git history, removing it is painful (you’d need git filter-repo, the older git filter-branch, or the BFG Repo-Cleaner, all of which rewrite history). Preventing the problem is far easier than fixing it. Add patterns for build artifacts, dependency directories (node_modules/, vendor/), environment files (.env), and IDE configuration before your first commit.

Cherry-pick for surgical precision. git cherry-pick <commit> applies the changes from a single commit to your current branch, creating a new commit with the same changes but a different hash (different parent, different hash). It’s useful for backporting a specific fix to a release branch without merging the entire development branch.

Interactive rebase for polishing history. git rebase -i main lets you reorder, squash, edit, or drop commits on your feature branch before merging. You can combine five “WIP” commits into two clean, logical commits. You can reword a commit message. You can split a commit that changed too many things. This is the tool that turns messy development history into a clean, readable record.

Git is a content-addressable object store with a DAG on top. Branches are pointers. Commits are snapshots. Merges are graph operations. Rebases are replays.

The fear that most people feel around git comes from not seeing the data structure. When you only know the commands, git feels like an incantation system: type the magic words and hope for the best. When you understand that every operation is just manipulating a graph of immutable, content-addressed objects (adding nodes, moving pointers, replaying diffs), the commands become intuitive. And when something goes wrong, you know exactly where to look: the reflog, the object store, and the graph.

The graph is the truth. Everything else is just a convenient way to look at it.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.