Inserting Backdoors into Open Source Compilers

When we mention Ken Thompson, the first thing that comes to mind is the UNIX operating system he invented. For this, he won the Turing Award in 1984. In his Turing Award speech, Ken Thompson raised a profound question: Does seeing the source code of software mean there are no backdoors? Could there be self-replicating backdoors in compilers?

KenThompson

This paper published in “ACM Communications” is only three pages long and omits many details. The principle is a bit like a C program that outputs its own code, but it’s much more difficult. Following in the footsteps of the sages, I inserted a self-replicating backdoor into an open-source C compiler—tcc. This compiler, with the backdoor, will automatically insert a backdoor when compiling the source code of the Linux login program sulogin.

What is sulogin

When the file system fails to mount or other faults occur during the startup process, Linux often enters a screen as shown below, asking for the root password to enter the recovery console. The program that requests the root password is /sbin/sulogin.

sulogin interface sulogin interface

This program runs as root, checks the user-entered root password against the system user database, and if correct, enters a root shell. I used sulogin as an example simply because it has fewer lines of code than the standard command-line login program /bin/login. If the sulogin program could quietly accept the password “bojieli” in addition to the correct password…

Let’s start with the source code of sulogin. (sulogin is in the util-linux package, Debian series can use dpkg -S to search for it) The part responsible for password verification is shown below.

Untitled

Just add a condition check, and it’s all clear!

Untitled

Of course, inserting such an obvious backdoor code into sulogin is too crude. Why not let the compiler accomplish this glorious and great mission?

Let the compiler insert a backdoor into sulogin

Simply put, the input of the compiler is the program source code, and the output is binary machine code. As long as the compiler discovers that it is compiling sulogin, it replaces specific parts of the source code and inserts a backdoor.

Untitled

How to “discover” that what is being compiled is sulogin? Compilers are complex. Replacing at the AST (Abstract Syntax Tree) level is relatively hidden and robust to modifications of sulogin code, but it’s more difficult. Since this is a demonstration, we’ll just do simple source code text matching and replacement.

Since gcc (GNU C compiler) is too complex and time-consuming to compile, let’s use the small and simple tcc (tiny C compiler). We start with the buffer that reads the source code. Once the part read matches the password comparison part of sulogin, it is replaced with a source code string with a backdoor.

Untitled

The above code grabs the “throat” of tcc reading the source code. When it matches the login_pattern (red arrow), it adds login_append (green arrow) after it. It’s simple and crude. There are obvious bugs in this code. When the code to be matched crosses the buffer boundary, it can’t match, but don’t mind these details…

The C compiler with a backdoor has a clear backdoor code. If it is released as open-source code, it will obviously be discovered. We need to let the compiler “hide” the backdoor.

Self-replication of compiler backdoor

During the compilation process of gcc, to avoid potential problems, gcc needs to compile its own source code to get an executable file A, and then use A to compile the gcc source code to get an executable file B. Only when A and B are the same is the compilation considered successful. That is to say, the compiler must be able to compile itself.

Our backdoor must also have the ability to self-replicate. The tcc executable with a backdoor, when compiling the normal tcc source code, should generate a tcc executable that also contains the same backdoor.

At first glance, this process is not complicated. As the pseudocode below describes, when matching the sulogin source code, insert the login backdoor; when matching the tcc’s own source code, insert the tcc backdoor.

Untitled

However, when actually implementing it, you will find a difficulty: the backdoor code itself is a string, and it has to appear in the “tcc-backdoor” part… it feels a bit self-referential.

Self-outputting C program

Many of you may have heard of C programs that can output their own code. You can find them by Googling, but the authors often write the programs very short and concise, so they are not easy to understand. In fact, this is not rocket science.

How to output itself? The source code must be placed in the data segment of the binary file. The simplest self-output code snippet looks like this:

Untitled

The printf code above is repeated twice, the first time as part of a string constant, and the second time as source code being compiled. And this string is also output twice, because there are two %s in printf.

To explain to friends who are not familiar with C language: char *s defines a string, and takes the string constant in the following purple and red parts as the initial value. The \ is an escape character in the string constant s, indicating that the following quote or \ is not a quote indicating the end of the string, but a normal character in the string. The string s is exactly the same as the complete code that follows. And the code that follows copies the code before s and outputs s twice.

The power of this code is that it can contain any other code, so any program can be packaged into a self-output form. For example, if we add a statement to output Hello World at the end of the program, just copy it verbatim in the string s (except for paying attention to escape characters).

Untitled

Careful readers may have noticed the subtle differences in newline characters and escape characters, so the real self-output code cannot simply output the string twice. The first time it is output, escape characters and the string line continuation character \ at the end of each line need to be added, and the second time it is output as is. Don’t worry about OCR, there is a code download link at the end of the article.

Untitled

Inserting a self-replicating backdoor into the compiler

With the theoretical basis of “outputting its own code”, it can be applied to tcc. The code string s that is output twice, here is named tcc_replace. After the compiler reads a piece of code, once it finds that it is tcc’s code, it replaces this piece of code with the backdoor code: the escaped tcc_replace connected to tcc_replace. The pseudocode is as follows:

Untitled

Of course, C is a relatively low-level language with low expressiveness, so the implemented code is very long.

full

This is how the compiler with a backdoor works:

Compile the already inserted backdoor tcc-new, any compiler will do;
Use the tcc-new with the backdoor to compile the genuine tcc source code tcc-orig, to generate the still backdoor executable file tcc-orig. This is called the bootstrap process;
Use tcc-orig to compile sulogin, to get the sulogin with the backdoor.
If you use tcc-orig to compile the clean tcc source code again, the compiler you get still has a backdoor. This generated compiler will be used as the release version.
Users who download the malicious version of tcc, compile the seemingly normal tcc source code, and what they get is still the malicious version of tcc, and the binary files are exactly the same. That is, unless you disassemble the binary file, you cannot discover the malicious behavior of this tcc compilation version.

Of course, in the data segment (.data section) of the compiler with the inserted backdoor, you can see a large section of source code, which is definitely suspicious, and can be found with the strings command. You should use a method similar to software protection to encrypt this data and decrypt it at runtime. In addition, you can write a general framework to automatically insert backdoors, so you don’t have to manually construct the tcc_replace string. This article only gives a proof of concept. If you want to leave the backdoor without a trace, you still have to put a lot of thought into it.

Some people will say, use other compilers (such as gcc) to compile the clean tcc, isn’t the result a tcc without a backdoor? Unfortunately, compilers are becoming more and more complex, adding various unique extension syntaxes, so many compilers can only bootstrap themselves, for example, gcc source code can only be compiled with gcc. Who knows if there is a self-replicating backdoor hidden in the millions of lines of gcc code?

The lower-level the vulnerability, the more dangerous it is

Ken Thompson said, what if the backdoor is not inserted into the compiler, but the assembler, linker, or even the hardware microcode? The lower the level, the harder it is to detect the backdoor.

Ken Thompson’s prophecy has come true. The Intel x86_64 SYSRET local privilege escalation vulnerability is a notorious example. Strictly speaking, this should be considered as Intel’s manual not being clear enough. SYSRET was first implemented by AMD on 64-bit systems, and if RIP triggered a general protection fault upon return, this fault was triggered in ring 3. When Intel x86_64 later implemented the SYSRET instruction, RIP triggered a general protection fault in ring 0, but the manual did not point out this difference. Then the problem arose.

BsEKniQCAAA994L

As early as 2006, the Linux community discovered this vulnerability (CVE-2006-0744) and patched it, but this issue did not attract the attention of other operating systems. Until 2012, Xen discovered this issue again (CVE-2012-0217), and incidentally found that FreeBSD and Windows 7 also had this vulnerability, but mistakenly thought that the problem with Linux no longer existed.

In 2014, the Linux community found that ptrace could still trigger this vulnerability (CVE-2014-4699), which means that unless ptrace is disabled, almost every Linux machine has this vulnerability. But CVE just lightly stated that it was a “linux ptrace bug”, and Debian oldstable has not been fixed to this day (the stable mainline has been fixed).

Backdoors in security algorithms are very low-level and very scary. In 2013, the NSA was suspected of choosing the flawed Dual_EC_DRBG algorithm as the default algorithm for an RSA company’s encryption library, and promoted this algorithm to become an ANSI standard (Ding Laoguai even used this as a final exam question). The SSL 3.0 POODLE vulnerability that just broke out in September 2014 is a problem with the design of the SSL 3.0 protocol (of course, this may not necessarily be a backdoor). This is different from the Heartbleed and other vulnerabilities in the first half of this year, which are not software implementation issues, but the protocol itself is insecure, so there is no good solution other than disabling the protocol.

poodle

Postscript

In the complex world of software, which source code can we trust? As Ken Thompson said, we may only be able to trust the people who write the code.

This was my junior year assignment for the “Introduction to Hacker Disassembly” course, and I also talked about it at the LUG Weekly Gathering.

The code has been uploaded to Github, click here. For the presentation slides click here to download.

Inserting Backdoors into Open Source Compilers

What is sulogin

Let the compiler insert a backdoor into sulogin

Self-replication of compiler backdoor

Self-outputting C program

Inserting a self-replicating backdoor into the compiler

The lower-level the vulnerability, the more dangerous it is

Postscript

Comments