The fastest way to scan your whole git repo


TLDR:

git cat-file --unordered --batch-all-objects --buffer --batch | grep 'GITHUB_API_KEY='

Replace grep with whatever command you want to use; it will be piped the entire contents of the repo: every version of every file (plus the commit, tree, and tag objects, each preceded by a short header line).
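One practical wrinkle: the stream also contains binary data (tree objects, for instance), so grep may report only "Binary file (standard input) matches" instead of the matching line. Here is a minimal sketch that adds grep's -a flag to treat the stream as text. The throwaway repo and the fake key are assumptions made up for illustration.

```shell
# Throwaway repo (hypothetical) with a secret that was committed, then deleted.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"
echo 'GITHUB_API_KEY=not-a-real-key' > .env
git add .env
git commit -q -m "add config"
git rm -q .env
git commit -q -m "remove secret"

# The file is gone from the working tree, but its blob is still in the object store.
# -a makes grep treat the mixed text/binary stream as text.
found=$(git cat-file --unordered --batch-all-objects --buffer --batch | grep -a 'GITHUB_API_KEY=')
echo "$found"
```

The key still shows up even though the latest commit no longer contains the file, which is exactly why scanning the whole object store is useful for this kind of audit.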

What is this?

Imagine you wanted to go through all of the files that ever existed in a git repo. Maybe you are scanning for old passwords/API keys in the history, or you are collecting stats about the repo to plot.

How does it work?

Important fact about git: it operates on whole files, not on diffs. All of its vocabulary is built around whole files, which it stores in "blobs": if you have a 100MB generated_api_schema.graphql file and you make a small tweak to it, git will store a second copy (in another blob) and give each a different hash. You see diffs when using most git commands, but conceptually git stores an entire copy of each version of each file.

Under the hood, as a storage optimization, it packs objects into packfiles and stores some of them as deltas against similar objects. When that happens (and for most large repos you will see at least one .pack file in your .git/objects/pack folder), you need to iterate over the packfile in order to get the best performance. This is what the command above does. Let's break it down:

  • git cat-file normally takes a sha1 hash (an object id) and looks up that object (a blob, tree, commit, or tag) in the repo

  • --batch tells it to work with more than one object: it normally reads object ids from stdin and prints each object's header and contents

  • --batch-all-objects will make it go through all of the objects in the repo, instead of looking for the ones given on stdin

  • --buffer is a performance optimization to tell it not to flush stdout after every object, which takes time and isn't worth it when printing out the entire repo

  • --unordered is the magic: the default order is sha1 order, whereas this visits objects in whatever order is cheapest to read, which in practice follows how they are laid out in the packfiles. The exact order is unspecified, but the docs say: "if you do not require a specific order, this should generally result in faster output, especially with --batch."
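To see the storage model from earlier in this section in action, here is a small sketch on a throwaway repo. The repo, file name, and file contents are made up for illustration. It commits two versions of a file, shows that each one gets its own blob id, looks one up with git cat-file, and then repacks so that a .pack file appears.

```shell
# Throwaway repo (hypothetical) so nothing here touches a real project.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"

# Two versions of one file: each commit stores a complete blob with its own hash.
printf 'type Query { a: Int }\n' > schema.graphql
git add schema.graphql && git commit -q -m "v1"
v1=$(git rev-parse HEAD:schema.graphql)
printf 'type Query { a: Int, b: Int }\n' > schema.graphql
git add schema.graphql && git commit -q -m "v2"
v2=$(git rev-parse HEAD:schema.graphql)
echo "v1 blob: $v1"
echo "v2 blob: $v2"

# Single-object lookup: cat-file takes an object id and prints its type or contents.
git cat-file -t "$v1"
git cat-file -p "$v1"

# Repack: loose objects are moved into a packfile under .git/objects/pack.
git gc -q
packs=$(ls .git/objects/pack/*.pack | wc -l | tr -d ' ')
echo "packfiles: $packs"
git count-objects -v
```

After the repack, git count-objects -v reports the objects under in-pack rather than as loose objects; those packed objects are what --unordered lets cat-file read efficiently.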

Here is the SO question where I learned of the command: https://stackoverflow.com/questions/7348698/how-to-list-all-git-objects-in-the-database/51956653#51956653
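If you only need to list the objects rather than read their contents (for example, to collect stats), the same flags combine with --batch-check, which prints just an "<object id> <type> <size>" line per object. Here is a rough sketch that counts objects by type; the throwaway repo is an assumption to keep the example self-contained, and in practice you would run only the final pipeline inside your own repo.

```shell
# Throwaway repo (hypothetical) with two commits touching one file.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"
echo one > a.txt
git add a.txt && git commit -q -m "first"
echo two > a.txt
git add a.txt && git commit -q -m "second"

# --batch-check prints "<object id> <type> <size>" for each object, without contents.
# awk tallies the second column (the object type).
stats=$(git cat-file --unordered --batch-all-objects --buffer --batch-check |
  awk '{ count[$2]++ } END { for (t in count) print t, count[t] }' | sort)
echo "$stats"
```

Swapping the awk script for one that sums the third column per type would give total bytes per object type instead of counts.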

The downside is that you don't get the names of the files, because it just goes through the objects, and the filenames live in tree objects rather than in the blobs themselves. To get both filenames and good performance, you have to write some custom code (I'm using git2 and looking at gitoxide).
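If filenames matter more than raw speed, one plain-git alternative (not the git2 approach mentioned above) is to let git rev-list --objects walk the history and hand the paths to cat-file. This is only a sketch under a few assumptions: the throwaway repo and file names are made up, only objects reachable from some ref are visited, and the walk follows the commit graph rather than pack order, so expect it to be slower on large repos.

```shell
# Throwaway repo (hypothetical) with two versions of one file in a subdirectory.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"
mkdir src
echo 'v1' > src/main.txt
git add . && git commit -q -m "v1"
echo 'v2' > src/main.txt
git add . && git commit -q -m "v2"

# rev-list --objects prints "<object id> <path>" for every reachable object.
# cat-file echoes the path back via %(rest), next to the type, id, and size.
# The awk filter keeps blobs only; paths containing spaces would need more careful parsing.
named=$(git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  awk '$1 == "blob" { print $4 }')
echo "$named"
```

Each version of the file appears as its own line, so the same path shows up once per distinct blob.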