Recovering Code from a Damaged Git Repository
Background: a VirtualBox virtual machine on a Windows host, running Ubuntu 14.04.1 LTS with the 3.13 kernel on an ext4 file system.
Mistake: a few days ago I was developing a website on this virtual machine, made N commits, and thought I had pushed them; in fact, the push had failed.
Mystery: today a colleague ran git pull, found no updates, and concluded I hadn't been working these past few days.
Tragedy: I logged into the virtual machine and saw that several newly written files in the project directory had become 0-byte empty files. (ext4 is famously stable, so surely it was the evil NTFS and VirtualBox on the host that caused the trouble.)
Many files under the .git directory had also become 0-byte empty files, and git reported that the repository was corrupt.
$ git status
$ git fsck
Has the code written over the past few days simply vanished? As we all know, deleting something only removes the reference to it in the current three-dimensional space; the thing itself still exists somewhere in four-dimensional space-time. Let's time travel!
Looking for the Victims
First, let's see which files were truncated to 0 bytes in this disaster.
$ find . -size -1b
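Just as a side note of my own: GNU find also has a more self-documenting spelling for roughly the same query, in case -size -1b looks cryptic:
$ find . -type f -empty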
The .git directory is git's file-system-based database. Git takes the content you commit, compresses it, and stores it in a key-value store of its own format, one that can be looked up by file content (that is essentially how git status detects changes) and by commit hash or tag (that is how git checkout works). The long SHA-1 values under .git/objects are the keys of that store.
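As a tiny illustration of that key-value scheme (my own example, run in a throwaway repository rather than taken from the recovery session): the key is the SHA-1 of the object, and its first two hex characters become the subdirectory name under .git/objects.
$ echo 'hello world' | git hash-object -w --stdin    # -w actually writes the blob into the database
3b18e512dba79e4c8300dd08aeb37f8e728b8dad
$ ls .git/objects/3b/
18e512dba79e4c8300dd08aeb37f8e728b8dad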
Comparing with a Normal Repository
Let's git clone a fresh copy of the code and see whether those missing git objects exist in the upstream repository.
After cloning the new repository, I was surprised to find no SHA-1-named git objects in it at all, only one large pack file.
$ ls -l objects/pack/
You can use git unpack-objects to expand these files. Note that the original pack file has to be moved out of the .git directory rather than copied out; otherwise the ever-clever git will notice that the pack under objects/ already contains the same objects and will refuse to unpack them.
$ mv .git/objects/pack/pack-d4da3e51cfa0c0650e2b3b663d71bb1f8ce4d825.pack .
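The unpack step itself is not shown above, so here is a rough sketch of it, reusing the same pack name (moving the .idx sibling out as well is just housekeeping on my part, so no orphaned index is left behind):
$ mv .git/objects/pack/pack-d4da3e51cfa0c0650e2b3b663d71bb1f8ce4d825.idx .    # keep the index next to its pack
$ git unpack-objects < pack-d4da3e51cfa0c0650e2b3b663d71bb1f8ce4d825.pack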
Now those long SHA-1 values are back:
./.git/objects/8d
Now we can go back to the damaged repository and compare it against the upstream one:
$ find . -size -1b -exec ls ../newrepo/{} \;
Each "No such file or directory" is a git object that does not exist in the upstream repository, that is, content newly added or committed after the last push. Short of file-system-level recovery techniques, those objects are very hard to get back.
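To turn that output into a clean list of exactly which objects have no upstream counterpart, a small loop over the same find output works; this is my own sketch, assuming the fresh clone sits in ../newrepo as in the command above:
$ find . -size -1b | while read f; do
>   [ -e "../newrepo/$f" ] || echo "not upstream: $f"
> done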
Extracting Files from the Git Database
A file that has been committed or added to the git repository has one copy in the working directory and another in the git database (the .git directory). As long as either copy survives, the data can be recovered. The main question now is whether the files lost from the working directory can be rebuilt from their copies in the git database.
First attempt: opening a git object directly shows garbage, and running file on it is not much more enlightening.
$ file .git/objects/4f/1f12f7a41593de4fc4131df05fb05e517e717a
RTFM always pays off. The Git object format documentation tells us that a git object is the file content stored after deflate compression. gzip uses zlib's deflate as well, but a .gz file has its own special header and trailer. Rather than writing code against the zlib library, it is easier to prepend a gzip file header and let gunzip do the decompression. (How did I know this trick? HTTP responses captured in the browser are often deflate-compressed too, and decompressing them out of packet captures is an essential skill.)
$ printf "\x1f\x8b\x08\x00\x00\x00\x00\x00" | cat - .git/objects/4f/1f12f7a41593de4fc4131df05fb05e517e717a | gunzip | head -n 5
We can see plaintext! But wait, what is this "gzip: stdin: unexpected end of file"? Is the file corrupted? Not really: a gz file ends with 8 trailing bytes holding a CRC32 and the uncompressed size for verification, and we never supplied them. As long as the content comes out, that is good enough.
How do we check that the file is complete? The git object header records the size of the original file. A hexdump of the decompressed data shows that the first string is the git object type, here blob, meaning stored file content; the second field is the original file size as a decimal integer; then a \0 ends the header, and everything after it is the original file content.
$ printf "\x1f\x8b\x08\x00\x00\x00\x00\x00" | cat - .git/objects/4f/1f12f7a41593de4fc4131df05fb05e517e717a | gunzip | hexdump -C | head
The header is 10 bytes and the content is 1176 bytes, which together are exactly the 1186 bytes of decompressed output, so nothing is missing.
$ printf "\x1f\x8b\x08\x00\x00\x00\x00\x00" | cat - .git/objects/4f/1f12f7a41593de4fc4131df05fb05e517e717a | gunzip | wc
To get rid of that annoying git object header, we can use sed to delete the first \0 and everything before it:
$ printf "\x1f\x8b\x08\x00\x00\x00\x00\x00" | cat - .git/objects/4f/1f12f7a41593de4fc4131df05fb05e517e717a | gunzip 2>/dev/null | sed -z 1d | head -n 5
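As a sanity check of my own (not part of the original session): for a blob, piping the stripped content back into git hash-object should reproduce the object's own SHA-1, confirming that the round trip lost nothing.
$ printf "\x1f\x8b\x08\x00\x00\x00\x00\x00" | cat - .git/objects/4f/1f12f7a41593de4fc4131df05fb05e517e717a | gunzip 2>/dev/null | sed -z 1d | git hash-object --stdin
4f1f12f7a41593de4fc4131df05fb05e517e717a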
Now we can decompress all the git objects in the damaged repository into a recovery directory, salvaging as many as will decompress; a sketch of the batch run follows the mkdir below.
$ mkdir -p ../recovery
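The batch run itself was not preserved, so this is only my reconstruction in bash, assuming the loose-object layout of a two-character directory plus a 38-character file name, and flattening each object's path into a single file name under ../recovery:
$ for f in .git/objects/??/*; do
>   [ -s "$f" ] || continue                        # skip the truncated 0-byte victims
>   sha=${f#.git/objects/}; sha=${sha/\//}         # "4f/1f12..." -> "4f1f12..."
>   printf "\x1f\x8b\x08\x00\x00\x00\x00\x00" | cat - "$f" \
>     | gunzip 2>/dev/null | sed -z 1d > "../recovery/$sha"
> done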
The resulting recovery directory will look like this:
$ ls -l ../recovery/ | head -n 5
Finding a Needle in a Haystack from the Code
At this point, anyone who has done file-system recovery will feel a sense of déjà vu. When a file system's directory tree is damaged and the only way to find files is to scan the disk for their magic numbers, we lose the path information and metadata; we do not even know the file names (those live in the directories), only the file contents. We could indeed scan all these scattered objects and try to rebuild part of the directory tree, but for recovering a handful of code files that would be overkill.
Recall a few distinctive fragments from the newly written code and grep for them in the decompressed files; with a bit of luck, the code swallowed by the file system will turn up.
$ grep 'review-comment-' -r ../recovery/
Fortunately, the objects for those newly written files were still intact in the git database.
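From there, restoring a file is just a matter of copying the recovered content back under its original name, which has to come from memory (or from a recovered tree object), since a blob stores only content; something along these lines, with a purely hypothetical target path:
$ cp ../recovery/4f1f12f7a41593de4fc4131df05fb05e517e717a src/review-comment.js    # target path is hypothetical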
Conclusion
The code was recovered, and our PM was naturally delighted. I dusted off this blog, untouched for nearly half a year, and wrote up this summary.
Two lessons:
- Commit and push frequently, so that the rest of the team can see your progress in time and nobody accuses you of slacking off.
- Whether the culprit was VirtualBox, ext4, or a mysterious hacker playing a prank, don't put too much trust in the stability of a file system inside a virtual machine.