Recovering Code from a Damaged Git Repository
Background: a VirtualBox virtual machine on a Windows host, running Ubuntu 14.04.1 LTS with the 3.13 kernel on an ext4 file system.
Mistake: a few days ago I was developing a website on this virtual machine, made N commits, and thought I had pushed them; in fact, the push had failed.
Mystery: today a colleague ran git pull, found no updates, and concluded I hadn't been working these past few days.
Tragedy: I logged into the virtual machine and saw that several newly written files in the project directory had become 0-byte empty files. (ext4 is famously stable, so surely it was the evil NTFS and VirtualBox on the host that caused the trouble.)
Many files under the .git directory had also become 0-byte empty files, and git reported that the repository was corrupt.
$ git status
$ git fsck
Has the code written over the past few days simply vanished? As we all know, deleting something only removes the reference to it in the current three-dimensional space; the thing itself still exists somewhere in four-dimensional space-time. Let's time travel!
Looking for the Victims
First, let's see which files were truncated to 0 bytes in this disaster.
$ find . -size -1b
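Just as a side note of my own: GNU find also has a more self-documenting spelling for roughly the same query, in case -size -1b looks cryptic:
$ find . -type f -empty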
The .git directory is git's file-system-based database. Git takes the content you commit, compresses it, and stores it in a key-value store of its own format, one that can be looked up by file content (that is essentially how git status detects changes) and by commit hash or tag (that is how git checkout works). The long SHA-1 values under .git/objects are the keys of that store.
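As a tiny illustration of that key-value scheme (my own example, run in a throwaway repository rather than taken from the recovery session): the key is the SHA-1 of the object, and its first two hex characters become the subdirectory name under .git/objects.
$ echo 'hello world' | git hash-object -w --stdin    # -w actually writes the blob into the database
3b18e512dba79e4c8300dd08aeb37f8e728b8dad
$ ls .git/objects/3b/
18e512dba79e4c8300dd08aeb37f8e728b8dad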
Comparing with a Normal Repository
Let's git clone a fresh copy of the code and see whether those missing git objects exist in the upstream repository.
After cloning the new repository, I was surprised to find no SHA-1-named git objects in it at all, only one large pack file.
$ ls -l objects/pack/
You can use git unpack-objects to expand these files. Note that the original pack file has to be moved out of the .git directory rather than copied out; otherwise the ever-clever git will notice that the pack under objects/ already contains the same objects and will refuse to unpack them.
$ mv .git/objects/pack/pack-d4da3e51cfa0c0650e2b3b663d71bb1f8ce4d825.pack .
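The unpack step itself is not shown above, so here is a rough sketch of it, reusing the same pack name (moving the .idx sibling out as well is just housekeeping on my part, so no orphaned index is left behind):
$ mv .git/objects/pack/pack-d4da3e51cfa0c0650e2b3b663d71bb1f8ce4d825.idx .    # keep the index next to its pack
$ git unpack-objects < pack-d4da3e51cfa0c0650e2b3b663d71bb1f8ce4d825.pack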
Now those long SHA-1 values are back:
./.git/objects/8d
Now we can go back to the damaged repository and compare it against the upstream one:
$ find . -size -1b -exec ls ../newrepo/{} \;
Each "No such file or directory" is a git object that does not exist in the upstream repository, that is, content newly added or committed after the last push. Short of file-system-level recovery techniques, those objects are very hard to get back.
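To turn that output into a clean list of exactly which objects have no upstream counterpart, a small loop over the same find output works; this is my own sketch, assuming the fresh clone sits in ../newrepo as in the command above:
$ find . -size -1b | while read f; do
>   [ -e "../newrepo/$f" ] || echo "not upstream: $f"
> done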
Extracting Files from the Git Database
A file that has been committed or added to the git repository has one copy in the working directory and another in the git database (the .git directory). As long as either copy survives, the data can be recovered. The main question now is whether the files lost from the working directory can be rebuilt from their copies in the git database.
First attempt: opening a git object directly shows garbage, and running file on it is not much more enlightening.
$ file .git/objects/4f/1f12f7a41593de4fc4131df05fb05e517e717a
RTFM always pays off. The Git object format documentation tells us that a git object is the file content stored after deflate compression. gzip uses zlib's deflate as well, but a .gz file has its own special header and trailer. Rather than writing code against the zlib library, it is easier to prepend a gzip file header and let gunzip do the decompression. (How did I know this trick? HTTP responses captured in the browser are often deflate-compressed too, and decompressing them out of packet captures is an essential skill.)
$ printf "\x1f\x8b\x08\x00\x00\x00\x00\x00" | cat - .git/objects/4f/1f12f7a41593de4fc4131df05fb05e517e717a | gunzip | head -n 5
We can see plaintext! But wait, what is this "gzip: stdin: unexpected end of file"? Is the file corrupted? Not really: a gz file ends with 8 trailing bytes holding a CRC32 and the uncompressed size for verification, and we never supplied them. As long as the content comes out, that is good enough.
How do we check that the file is complete? The git object header records the size of the original file. A hexdump of the decompressed data shows that the first string is the git object type, here blob, meaning stored file content; the second field is the original file size as a decimal integer; then a \0 ends the header, and everything after it is the original file content.
$ printf "\x1f\x8b\x08\x00\x00\x00\x00\x00" | cat - .git/objects/4f/1f12f7a41593de4fc4131df05fb05e517e717a | gunzip | hexdump -C | head
The header is 10 bytes and the content is 1176 bytes, which together are exactly the 1186 bytes of decompressed output, so nothing is missing.
$ printf "\x1f\x8b\x08\x00\x00\x00\x00\x00" | cat - .git/objects/4f/1f12f7a41593de4fc4131df05fb05e517e717a | gunzip | wc
To get rid of that annoying git object header, we can use sed to delete the first \0 and everything before it:
$ printf "\x1f\x8b\x08\x00\x00\x00\x00\x00" | cat - .git/objects/4f/1f12f7a41593de4fc4131df05fb05e517e717a | gunzip 2>/dev/null | sed -z 1d | head -n 5
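As a sanity check of my own (not part of the original session): for a blob, piping the stripped content back into git hash-object should reproduce the object's own SHA-1, confirming that the round trip lost nothing.
$ printf "\x1f\x8b\x08\x00\x00\x00\x00\x00" | cat - .git/objects/4f/1f12f7a41593de4fc4131df05fb05e517e717a | gunzip 2>/dev/null | sed -z 1d | git hash-object --stdin
4f1f12f7a41593de4fc4131df05fb05e517e717a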
Now we can decompress all the git objects in the damaged repository into a recovery directory, salvaging as many as will decompress; a sketch of the batch run follows the mkdir below.
$ mkdir -p ../recovery
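The batch run itself was not preserved, so this is only my reconstruction in bash, assuming the loose-object layout of a two-character directory plus a 38-character file name, and flattening each object's path into a single file name under ../recovery:
$ for f in .git/objects/??/*; do
>   [ -s "$f" ] || continue                        # skip the truncated 0-byte victims
>   sha=${f#.git/objects/}; sha=${sha/\//}         # "4f/1f12..." -> "4f1f12..."
>   printf "\x1f\x8b\x08\x00\x00\x00\x00\x00" | cat - "$f" \
>     | gunzip 2>/dev/null | sed -z 1d > "../recovery/$sha"
> done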
The resulting recovery directory will look like this:
$ ls -l ../recovery/ | head -n 5
Finding a Needle in a Haystack from the Code
At this point, anyone who has done file-system recovery will feel a sense of déjà vu. When a file system's directory tree is damaged and the only way to find files is to scan the disk for their magic numbers, we lose the path information and metadata; we do not even know the file names (those live in the directories), only the file contents. We could indeed scan all these scattered objects and try to rebuild part of the directory tree, but for recovering a handful of code files that would be overkill.
Recall a few distinctive fragments from the newly written code and grep for them in the decompressed files; with a bit of luck, the code swallowed by the file system will turn up.
$ grep 'review-comment-' -r ../recovery/
Fortunately, the objects for those newly written files were still intact in the git database.
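From there, restoring a file is just a matter of copying the recovered content back under its original name, which has to come from memory (or from a recovered tree object), since a blob stores only content; something along these lines, with a purely hypothetical target path:
$ cp ../recovery/4f1f12f7a41593de4fc4131df05fb05e517e717a src/review-comment.js    # target path is hypothetical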
Conclusion
The code was recovered, and our PM was naturally delighted. I dusted off this blog, untouched for nearly half a year, and wrote up this summary.
Two lessons:
- Commit and push frequently, so that the rest of the team can see your progress in time and nobody accuses you of slacking off.
- Whether the culprit was VirtualBox, ext4, or a mysterious hacker playing a prank, don't put too much trust in the stability of a file system inside a virtual machine.