Git and Mutable History

Posted on 2011-07-16 by Curt Sampson

There are quite a number of blog posts out there that posit variations of the idea that Git has mutable history. For example, Patrick Thompson, in a post about Git versus Mercurial, says, “git provides you with tools to go back in time and edit your commit history.” Dustin Sallings, in a similar post, says, “The culture of mercurial is one of immutability…. git is all about manipulating history.”

Well, this isn’t quite true. It may look like this at first glance, because Git has the unfortunate property of hiding its clean interior behind a rather messy and inconsistent user interface, but a closer look will give us a better understanding of what’s really going on.

As a quick read of Git for Computer Scientists will reveal, Git (like many other distributed version control systems) stores the history of commits as a directed acyclic graph (DAG). Each commit is a snapshot of a directory tree and the files within it, along with some metadata such as the parent(s) of the commit, committer, date, comment, and so on. (Actually, the DAG has separate blob and tree objects on which each commit depends, but we’ll ignore that implementation detail for the purposes of this article.) Commits are identified by an SHA-1 hash of the contents of the commit, such as 4f4d43d71971751f6c1df6a2acbfdf4edee1aa29. (This is often abbreviated to just the first seven characters of the hash, e.g. 4f4d43d.)

From the way these commit identifiers are generated, it should be clear that any two commits that have the same identifier have the same contents (barring a hash collision), which include a list of parent commit identifiers, and thus are the same commit. A corollary of this is that commits are immutable: they can never be changed once created, because changing anything about a commit would change its hash and so produce a different commit.
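As a rough illustration of that argument, here is a minimal Python sketch. The commit_id function and its text layout are hypothetical simplifications, not Git's real object format; the only thing it shares with Git is that the identifier is an SHA-1 hash over everything in the commit, including the parent identifiers.

```python
import hashlib

def commit_id(tree, parents, message):
    """Toy commit hash: SHA-1 over the snapshot name, parent ids, and message.

    The layout is a simplification for illustration, not Git's real format.
    """
    lines = ["tree " + tree]
    lines += ["parent " + p for p in parents]
    lines += ["", message]
    return hashlib.sha1("\n".join(lines).encode("utf-8")).hexdigest()

a = commit_id("snapshot-1", [], "first commit")
b = commit_id("snapshot-2", [a], "second commit")

# Identical contents always produce the identical identifier.
same = commit_id("snapshot-2", [a], "second commit")

# Changing anything, even just the message, yields a different identifier,
# i.e. a different commit rather than a modified one.
reworded = commit_id("snapshot-2", [a], "second commit (reworded)")

print(b == same)      # True
print(b == reworded)  # False
```

Because the parent identifiers are part of what gets hashed, a commit's identifier also pins down its entire ancestry.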

So, when you do something that looks like it’s “changing” a commit, such as an amend, you are in fact changing nothing about that commit. What really happens here? Let’s look at an example.

We’ll start with a simple commit tree of parent commit a and child commit b, or a-b when viewed as a path from the root node to an end node (what Git calls a “branch”). If our current HEAD is b, and we execute a git commit --amend that “changes” something about it, we actually create a new commit, which I’ll call b’, that also has a as the parent. This leaves us with a DAG that now has three nodes, a with its two children b and b’, or the two branches a-b and a-b’.
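The amend example can be modelled in a few lines of Python. This is only a sketch: the commits dictionary and the add_commit and children helpers are made up for illustration, and the hashing is a stand-in for what Git does. The point is that an amend only ever adds an entry to the store.

```python
import hashlib

commits = {}  # commit id -> (parent ids, message); entries are only ever added

def add_commit(parents, message):
    """Store a new commit under a hash of its contents and return that id."""
    parents = tuple(parents)
    cid = hashlib.sha1(repr((parents, message)).encode("utf-8")).hexdigest()
    commits[cid] = (parents, message)
    return cid

def children(parent):
    """Return the ids of all stored commits that list `parent` as a parent."""
    return sorted(c for c, (ps, _) in commits.items() if parent in ps)

a = add_commit((), "a")
b = add_commit((a,), "b")

# "Amending" b: build a new commit on the same parent; b itself is untouched.
b_prime = add_commit((a,), "b, amended")

print(len(commits))                         # 3
print(children(a) == sorted([b, b_prime]))  # True
```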

So why does it look to some people as if we changed a commit here, when in reality we simply added a new one? The key insight we need to understand this is that a Git repository includes, as well as the commit DAG, a set of name to commit mappings. Just as symlinks in a Unix filesystem make one filename point to another filename, the name to commit mappings make a name point to a commit. Unlike commits, however, these name to commit mappings are mutable: while any commit in the world with hash 4f4d43d71971751f6c1df6a2acbfdf4edee1aa29 will be the same as any other with that hash, in a repo a particular name such as “HEAD” may point to commit 4f4d43d at one moment but commit 84db254 at another, and the same name in two different repos may point to two different commits.

(The set of name to commit mappings is generated from several different sources in the repository. These include the sets of branches and tags and the reflog that records the changes to both. So, as it happens, there usually is another name pointing to the otherwise-abandoned commit b, the name “HEAD@{1}” generated from the reflog.)

So what really happened when we did the amend in the example above was that we had the name HEAD pointing to commit b, and when we created commit b’ we changed the name HEAD to point to that instead. But commit b is still in the repository, and can still be accessed directly, via its hash, whether or not there are still any other names pointing to it.
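Sketched in Python, with made-up commit ids and a deliberately simplified model (in a real repository HEAD normally points to a branch name, which in turn points to a commit), the name HEAD is just a mutable entry in a table, and the reflog is a record of its earlier values:

```python
# Hypothetical ids standing in for the real hashes of b and b'.
commits = {"b-id": "commit b", "b2-id": "commit b'"}
refs = {"HEAD": "b-id"}
reflog = []  # previous values of HEAD, most recent last

def move_head(new_id):
    """Point HEAD at another commit, remembering where it pointed before."""
    reflog.append(refs["HEAD"])
    refs["HEAD"] = new_id

move_head("b2-id")  # what the amend does to the name HEAD

print(refs["HEAD"])       # b2-id
print(reflog[-1])         # b-id  (roughly what Git calls HEAD@{1})
print("b-id" in commits)  # True: the old commit is still stored
```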

A similar thing happens when we rebase one set of commits onto another; this doesn’t actually change or move commits, but instead generates a new series of commits, starting from a different parent, whose changes have a similar effect to those of the original series. The original commits remain in the repository, just as b did after the amend.
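A toy version of a rebase makes this concrete. The following Python sketch assumes a strictly linear series of commits and represents each commit's change as a plain string; the add_commit and rebase helpers are hypothetical, and real Git replays actual diffs (and may hit conflicts), which this ignores.

```python
import hashlib

commits = {}  # commit id -> (parent id or None, change)

def add_commit(parent, change):
    """Store a commit under a hash of its contents and return that id."""
    cid = hashlib.sha1(repr((parent, change)).encode("utf-8")).hexdigest()
    commits[cid] = (parent, change)
    return cid

def rebase(tip, old_base, new_base):
    """Replay the changes from old_base (exclusive) up to tip onto new_base."""
    changes = []
    cur = tip
    while cur != old_base:
        parent, change = commits[cur]
        changes.append(change)
        cur = parent
    new_tip = new_base
    for change in reversed(changes):  # oldest change first
        new_tip = add_commit(new_tip, change)
    return new_tip

root = add_commit(None, "root")
m1 = add_commit(root, "mainline work")
f1 = add_commit(root, "feature 1")
f2 = add_commit(f1, "feature 2")

before = dict(commits)
new_f2 = rebase(f2, root, m1)

# Every original commit is still present and unchanged; two new ones were added.
print(all(commits[c] == before[c] for c in before))  # True
print(len(commits) - len(before))                    # 2
print(new_f2 != f2)                                  # True
```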

One way to prove to yourself that commits really are immutable is also an easy way to implement “undo” functionality when you’re doing an amend, rebase, or similar operation. Start by creating a new branch name referring to the current commit (we’ll call it “saved”) by using the command git branch saved. After you commit your changes, the current branch name you’re working on will point to the latest new commit, but saved will still point to the head commit of your previous state. To revert to that, all you need to do is change the branch pointer “master” (if that’s what you were working on) to be the same as “saved”, delete the name “saved”, and change your working copy to match:

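git branch -M saved master # also deletes the name "saved"
git reset --hard

(More recent versions of Git refuse to force-rename a branch over the one that is currently checked out. With those, git reset --hard saved followed by git branch -d saved reaches the same state.)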

I find that if you always think of Git operations in terms of the commit DAG and what names point to what commits at any particular time, Git becomes easier to understand. I hope that this article has made the operation of Git clearer and put to rest the misleading idea that in Git we “change history,” as opposed to simply creating new commits and changing name to commit mappings.

One more note: over time, as a repository is used, there will be an accumulation of commits that are neither pointed to by a ref (branch or tag) nor an ancestor of a commit pointed to by a ref; in other words, commits that are basically unused. This is why we have the git gc command: it trims old entries from the reflog and removes commits that can no longer be reached through any name to commit mapping, only by their commit IDs (by default, only once they are older than a grace period).
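The idea behind that cleanup is a reachability walk from the names. Here is a minimal Python sketch of it, using the a, b, b’ example with the commit names standing in for hashes; the reachable function is hypothetical, and the sketch assumes the reflog entry for b has already expired and ignores the grace period.

```python
def reachable(commits, refs):
    """Return every commit id reachable from some ref by following parents."""
    seen = set()
    stack = list(refs.values())
    while stack:
        cid = stack.pop()
        if cid not in seen:
            seen.add(cid)
            stack.extend(commits[cid])
    return seen

# Commit id -> parent ids, for the amend example: a with children b and b'.
commits = {"a": [], "b": ["a"], "b'": ["a"]}
refs = {"master": "b'"}  # the reflog entry for b is assumed to have expired

keep = reachable(commits, refs)
garbage = sorted(set(commits) - keep)
print(garbage)  # ['b']
```

Adding any name that points at b, such as a branch or a surviving reflog entry, would put it back in the reachable set and keep it from being collected.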
