What is Git?
Created in 2005 by Linus Torvalds, Git was designed to support the development of the Linux kernel. It was built on five pillars: full distribution, speed, simple design, the ability to handle large projects, and strong support for non-linear development. A significant amount of lore surrounds the history of Git, including how it got its peculiar name and why it was created in the first place.
Let’s take a look at how Git works under the hood in order to better understand this powerful version control system (VCS). We’ll cover what the Git folder is, how it’s utilized internally, what data structures are used, what really happens in Git when performing common operations, and how to recover from specific data loss that may occur within your repositories.
Understanding the Git Folder
The .git folder lives inside every local Git repository on your machine. It contains the files Git uses to track your code base.
The hooks folder is created automatically inside the .git folder when you initialize a new repository. Git hooks are executable scripts, typically shell scripts, that run before or after an event such as a commit or push. Developers use Git hooks to inject custom functionality into common Git operations. There are at least 17 different types of Git hooks that make this possible; the exact number depends on your Git version.
For example, if you wanted to use a Git hook to assist you in formatting a commit message to match a standard pattern enforced by your team, you would perform the following steps:
- Navigate to the
- Remove the .sample extension from the commit-msg.sample hook file
- Make sure the commit-msg hook is executable by running chmod +x commit-msg
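Hooks do not have to be shell scripts; any executable file works. As a minimal sketch of what a commit-msg hook could look like, the Python script below checks the first line of the message against a team pattern. The pattern itself (a ticket key such as ABC-123 followed by a colon) and the name check_message are assumptions made for illustration, not a convention taken from this article.

```python
#!/usr/bin/env python3
"""Sketch of a commit-msg hook: save as .git/hooks/commit-msg and make it executable."""
import re
import sys

# Hypothetical team convention: "ABC-123: short summary"
PATTERN = re.compile(r"^[A-Z]+-\d+: \S.*")


def check_message(message: str) -> bool:
    """Return True when the first non-comment, non-blank line matches PATTERN."""
    for line in message.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip comment lines and blank lines
        return bool(PATTERN.match(line))
    return False  # an empty message never passes


if __name__ == "__main__" and len(sys.argv) > 1:
    # Git passes the path of the file holding the proposed message as the first argument.
    with open(sys.argv[1], encoding="utf-8") as fh:
        if not check_message(fh.read()):
            print("Commit message must look like 'ABC-123: summary'", file=sys.stderr)
            sys.exit(1)  # a non-zero exit status aborts the commit
```

With this in place, a message such as "ABC-123: fix login" would be accepted, while "fix login" would abort the commit.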
In Git, the index file defines the staging area. You can see it change whenever you run git add or git rm. If you try to print the file to your terminal, you’ll see that much of the output is garbled and unclear. This is because the information stored in the index file is represented in binary, which makes it easier for Git and other programs to read, but a little harder for people to understand.
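Although the index is binary, its first 12 bytes follow a simple layout: a 4-byte signature (DIRC), a 4-byte version number, and a 4-byte entry count, all big-endian. The sketch below decodes just that header. It runs on a synthetic header built in the example rather than on a real repository, and the function name parse_index_header is only illustrative.

```python
import struct


def parse_index_header(data: bytes) -> dict:
    """Decode the 12-byte header at the start of a Git index file."""
    signature, version, entry_count = struct.unpack(">4sII", data[:12])
    if signature != b"DIRC":
        raise ValueError("not a Git index file")
    return {"version": version, "entries": entry_count}


# Synthetic header: index format version 2 with 3 staged entries.
sample = struct.pack(">4sII", b"DIRC", 2, 3)
print(parse_index_header(sample))  # {'version': 2, 'entries': 3}
```

To try it on a real repository, you could pass in the first 12 bytes read from .git/index in binary mode.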
The index holds metadata on each of the files it tracks, like what type of file it is and other associated information. However, the index tracks files rather than directories, so it cannot record an empty folder on its own. This is why you usually have to add a placeholder file, by convention named .gitkeep, to a folder that would otherwise be empty.
The HEAD file is another file found in the .git folder, and it complements the index file. The HEAD file tracks what is currently checked out.
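HEAD is a small plain-text file. When a branch is checked out it holds a line starting with "ref: " followed by the ref name; when a specific commit is checked out (a detached HEAD) it holds the commit ID instead. A minimal sketch of telling the two apart follows; the function name describe_head and the branch name main are assumptions for the example.

```python
def describe_head(head_text: str) -> str:
    """Interpret the contents of a .git/HEAD file."""
    head_text = head_text.strip()
    if head_text.startswith("ref: "):
        # Symbolic reference: HEAD points at another ref, usually a branch.
        return "on " + head_text[len("ref: "):]
    # Otherwise HEAD holds an object ID directly (detached HEAD).
    return "detached at " + head_text


print(describe_head("ref: refs/heads/main\n"))  # on refs/heads/main
```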
What is a Ref?
A Ref can be one of three different things.
- HEAD – a symbolic reference to the branch you’re currently on. Contains a pointer to another reference.
- Remote – similar to a branch, but it tracks the state of a branch on your remote Git provider, such as GitLab or GitHub.
- Tag – similar to a commit object but generally points to a commit rather than a tree.
A commit object is created when you run git commit. If you run git cat-file -p on a commit, you can see the information it contains: the file tree associated with the commit (that is, which files were present), the parent commit it follows, the author and committer, and the commit message.
All Git objects share some common properties: each lives in the objects folder inside the .git folder, each path is derived from a SHA hash of the object’s contents, and each object is compressed with zlib.
There are four types of Git objects:
- Tree – stores information about a directory tree
- Commit – a snapshot of your changes
- Blob – a file’s contents
- Tag – a ref that points to a specific point in Git history
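To make the storage scheme concrete, here is a minimal sketch of how a loose object’s ID and path can be derived. It assumes the default SHA-1 object format (repositories created with SHA-256 differ), and the function name git_object is illustrative. Git hashes a short header (the type, a space, the byte length, and a NUL byte) followed by the contents, then stores the zlib-compressed result under a folder named after the first two hex characters of the ID.

```python
import hashlib
import zlib


def git_object(kind: str, body: bytes):
    """Return (object_id, loose_path, compressed_bytes) for a Git object."""
    store = f"{kind} {len(body)}".encode() + b"\x00" + body
    object_id = hashlib.sha1(store).hexdigest()  # assumes the SHA-1 object format
    path = f".git/objects/{object_id[:2]}/{object_id[2:]}"
    return object_id, path, zlib.compress(store)


object_id, path, compressed = git_object("blob", b"")
print(object_id)  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 (the empty blob)
print(path)       # .git/objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Decompressing the stored bytes with zlib.decompress gives back the header and contents, which is essentially what git cat-file reads.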
The easy-to-read commit graph offered by the GitKraken Client helps you visualize all types of commit objects, along with branch structure and complete commit history, giving you more control over and understanding of your projects.
A packfile is a collection of many objects within one file. While the packfile itself is not compressed with zlib, the objects stored inside are. The packfile has a .pack extension as well as a complementary index file with the .idx extension. This index file makes locating and unpacking individual objects from the packfile much faster.
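A packfile also starts with a 12-byte header that is easy to inspect: the signature PACK, a version number, and the number of objects it holds, all big-endian. The sketch below reads that header from a synthetic example; the function name parse_pack_header and the object count are made up for illustration.

```python
import struct


def parse_pack_header(data: bytes):
    """Return (version, object_count) from the first 12 bytes of a .pack file."""
    signature, version, object_count = struct.unpack(">4sII", data[:12])
    if signature != b"PACK":
        raise ValueError("not a Git packfile")
    return version, object_count


# Synthetic header: a version 2 pack holding 42 objects.
sample_pack = struct.pack(">4sII", b"PACK", 2, 42)
print(parse_pack_header(sample_pack))  # (2, 42)
```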
If you visit the Git objects directory, you might find packfiles living alongside individual objects. Objects that are not part of a packfile are called loose objects. Loose objects can be added to a packfile later, which happens when git gc runs. Git triggers this automatically only once the repository has accumulated a substantial number of loose objects.
Packfiles also come into play during network operations. When you run git push or git pull, the objects being transferred are compressed into a packfile before they are sent over the network. The same is true for fetching and cloning, because under the hood git clone uses fetch.
Unlike Git objects, the config file is not zlib compressed, meaning you can view it as plain text. The config file contains repository settings, information about remotes, and information about tracked branches.
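Because the config file is plain text in an INI-like layout, a few lines of code are enough to read the common cases. The parser below is a deliberately simplified sketch: it ignores quoting rules, repeated keys, and include directives, so the git config command remains the reliable way to query settings. The sample contents, including the URL and the branch name main, are invented for the example.

```python
def parse_git_config(text: str) -> dict:
    """Parse simple '[section]' / 'key = value' config text into nested dicts."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            current = sections.setdefault(line[1:-1], {})
        elif current is not None and "=" in line:
            key, value = line.split("=", 1)
            current[key.strip()] = value.strip()
    return sections


SAMPLE_CONFIG = """\
[core]
\trepositoryformatversion = 0
\tbare = false
[remote "origin"]
\turl = https://example.com/repo.git
\tfetch = +refs/heads/*:refs/remotes/origin/*
[branch "main"]
\tremote = origin
\tmerge = refs/heads/main
"""

config = parse_git_config(SAMPLE_CONFIG)
print(config['remote "origin"']["url"])  # https://example.com/repo.git
print(config['branch "main"']["merge"])  # refs/heads/main
```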
Using Log to Recover From Mistakes
We have all been in a position where we accidentally made a commit that we shouldn’t have. Let’s say you’ve just run git log and discovered that you’ve committed something you didn’t intend to. You then run git reset --hard to remove that commit, but realize the commit also included an important file you still needed. Is that content lost forever? Are you back to square one? 😅
Luckily, it isn’t. This is where the logs folder comes into play. The logs folder is updated every time a ref is updated, so every time you make a commit or change a reference in some way, Git records that change there. Using the log file, you can still see the object ID of the deleted commit. Even though the reset moved HEAD, it did not remove the local objects. You can now run git checkout on that commit and grab the important files you need.
You can also access this log functionality using Git’s git reflog command. By default, the command shows the log for HEAD, but you can inspect other refs simply by providing the name of the desired ref. So if you want to see how the ref of another branch has changed, simply run git reflog and point it to that reference.
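Each line in a log file records one ref update: the old object ID, the new object ID, who made the change, a timestamp with timezone, and, after a tab, a message. As a rough sketch, the code below parses such lines and picks out the commits that a reset moved away from. The sample entry, including its IDs and identity, is fabricated for the example, and the function names are illustrative.

```python
def parse_reflog_line(line: str) -> dict:
    """Split one reflog line into its fields."""
    meta, _, message = line.rstrip("\n").partition("\t")
    old_id, new_id, identity = meta.split(" ", 2)
    who, timestamp, tz = identity.rsplit(" ", 2)
    return {"old": old_id, "new": new_id, "who": who,
            "time": int(timestamp), "tz": tz, "message": message}


def commits_left_by_reset(lines):
    """Return the IDs a ref pointed to just before each reset."""
    entries = (parse_reflog_line(line) for line in lines)
    return [e["old"] for e in entries if e["message"].startswith("reset:")]


# Fabricated entry: a reset moved the ref from aaaa... back to bbbb...
sample_log = [
    "a" * 40 + " " + "b" * 40
    + " Jane Doe <jane@example.com> 1700000000 +0000\treset: moving to HEAD~1\n",
]
print(commits_left_by_reset(sample_log) == ["a" * 40])  # True
```

The ID returned here is the one you would pass to git checkout to get the discarded commit back.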
Recovering Unreachable Objects
While git reset is a useful tool for updating a ref, it’s important to note that it doesn’t delete local objects. This means that even after a git rebase or a force push, you can still find your objects in the objects folder.
Using Git reset makes it so objects are no longer referenced by a ref. Those objects are called unreachable objects.
git gc by default removes these objects after two weeks, so there is a brief window where the data may be recoverable.
You can find unreachable objects using git fsck, which stands for Git file system check.
git fsck does not report commit objects as unreachable while they are still tracked within the logs folder, so you’ll need to temporarily move that folder aside, or pass the --no-reflogs option, before running the command. This should reveal any outstanding unreachable objects and allow you to access them.
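The idea of reachability can be illustrated with a toy model. The sketch below walks a made-up commit graph from every ref and reports the commits it never visits. It is a simplification: real Git also follows trees, blobs, tags, the index, and reflog entries, and the commit names c1 to c3 and the branch main are invented for the example.

```python
def unreachable_commits(parents: dict, refs: dict) -> set:
    """Return the commits that cannot be reached from any ref.

    parents maps each commit to the list of its parent commits;
    refs maps each ref name to the commit it points to.
    """
    seen = set()
    stack = list(refs.values())
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents.get(commit, []))
    return set(parents) - seen


# Toy history: c1 <- c2 <- c3
history = {"c1": [], "c2": ["c1"], "c3": ["c2"]}

print(sorted(unreachable_commits(history, {"main": "c3"})))  # []
# After a reset moves main back to c2, c3 is no longer reachable.
print(sorted(unreachable_commits(history, {"main": "c2"})))  # ['c3']
```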
The More You Know
Knowing the ins and outs of Git makes you a valuable asset to your team and employers. If you want to continue to grow your Git knowledge and add more tools that will help you succeed, look no further than the GitKraken Client, now with a GUI and CLI. The GitKraken Client makes coding easy and visual with its beautiful commit graph, safe with its ability to undo mistakes, and powerful with its integration compatibility.