Long article warning: The first in the “Five Years of PhD at MSRA” series, about 12,000 words, to be continued…

On July 31, 2021, at the ACM Turing Conference in China, I was standing on the podium waiting for the ACM China Outstanding Doctoral Dissertation Award. I didn’t expect that the person who came up to present the award to me was President Bao, and my legs involuntarily trembled a bit. This was the only time I had seen President Bao up close. President Bao happily said that seeing one of us from USTC among the award winners shows that USTC can also cultivate masters. He hoped that we could become masters in the future, serve our motherland, and return to our alma mater.

The host of the award ceremony, Professor Liu Yunhao, asked each of us to state the title of our doctoral dissertation and name our advisors. I blurted out, “High-Performance Data Center Systems Based on Programmable Network Cards”; my advisors are Professor Chen Enhong from USTC and Dr. Zhang Lintao from Microsoft, and I would like to give special thanks to Dr. Tan Kun from Huawei. I remember the title of my dissertation clearly because it is posted on my homepage. At the company, people often send me private messages asking whether I am the author of a certain paper. I shyly say, yes…

Many people may think that I am the kind of PhD student who is solely focused on studying, but my PhD life was actually much more interesting than many people imagine, truly embodying the MSRA (Microsoft Research Asia) motto “Work hard, play harder”.

Research Novice

Joint Training

MSRA (Microsoft Research Asia) has joint PhD training programs with many universities in China. Among them, the joint training program with USTC has been ongoing for many years. In the second semester of my junior year, MSRA interviewed dozens of candidates at our school, selected about a dozen students for summer internships and a year-long internship in their senior year, and after the summer internship, about 7 students were confirmed to become joint training PhDs. These joint training PhDs will complete their first year of master’s and doctoral courses at USTC, and the next four years will be spent on academic research at MSRA in Beijing, finally obtaining a PhD degree from USTC.

The requirements for MSRA to select joint training PhDs are the so-called “three good” students: good at math, good at programming, and good attitude. This rule is said to have been set by the former dean, Dr. Shen Xiangyang. Because I spent all day tinkering with various Linux network services in the Youth Class College computer room and LUG activity room during my undergraduate studies, I didn’t study very well, and naturally my grades weren’t very good. My GPA was only 3.4 (out of 4.3), and I even failed Calculus II. The interviewer asked me at the time why my math grades were so poor. Probably because I had won awards in programming competitions (NOI) in high school, and my resume had many network service projects I worked on at LUG, I was surprisingly admitted to the joint training PhD program. The GPAs of other students admitted to the joint training program were at least 3.7, and most of them were top students with 3.8 or above.

First Entry into MSRA

I still remember the summer vacation between my junior and senior years (July 2013). On my first day at MSRA, I was shocked by the superior office environment. The pantry had free drinks, yogurt, and fruit; the office space was spacious and bright; the internet was fast and could access Google; and the servers were top-notch, already equipped with 128 GB of memory at that time. Although I had never been to the school’s laboratories, I had worked on Freeshell (a container hosting service based on OpenVZ) in the Youth College computer room. There, thousands of containers were squeezed onto a small cluster of 7 nodes, and each physical node had only 16 GB of memory. The total memory of that entire cluster was less than that of a single server at MSRA, and our group had dozens of such servers.

Above: The servers in the Youth College computer room at school, with a layer of dirt on top

Above: MSRA's servers, obviously much more high-end

The head of the Network Systems Group, Professor Zhang Yongguang, assigned me to Dr. Tan Kun. Dr. Tan called us for a meeting to discuss how to do research. What impressed me the most was that he said we needed to slowly learn some management. I was a bit puzzled, why do we need to learn management to do research? I later understood that doing research requires collaboration with other members of the team; once you cross the threshold of a junior researcher, you need to lead other researchers to work together. Senior researchers like Dr. Tan need to plan for the entire team, fight for resources, recruit talents, choose and retain them, which is very difficult. Of course, this is hindsight. At that time, I didn’t even know what research was, let alone what system research was.

In my senior year, I began to participate in the research of high-performance data center networks, and realized that the system performance I tinkered with at school was nothing. I still remember when I upgraded the network in the Youth College computer room from 100 Mbps to 1 Gbps, I pulled a lot of Cat5e and Cat6 cables, and replaced several 100 Mbps switches with 1 Gbps ones; while in the Network Systems Group at MSRA in 2013, the network of the servers was 40 Gbps, a full 40 times faster. Under such a high-speed network, the performance of the host TCP/IP protocol stack became a bottleneck, and technologies like DPDK user-mode protocol stack were needed to accelerate it. That year, we gathered the strength of the entire group to create a high-performance network processing framework based on DPDK, which we submitted to SIGCOMM in January 2014, but unfortunately it was rejected. I was mainly responsible for conducting experiments in this project, and the gnuplot template I inherited from the group at that time was used until I graduated from my PhD.

Although this project did not result in an academic paper, it gave me my first taste of the sweetness of doing research. During my undergraduate studies, I thought research was just about deriving formulas, which was boring and seemed useless. After half a year of internship, I realized that research in networking and systems is about solving problems that exist in the real world. Although the results may not be directly applied to products, they can guide others in academia and industry. Therefore, I gave up on starting a business, going straight into industry, or studying abroad, and chose to do a joint PhD at USTC and MSRA.

It is said that deadlines are the primary productive force, and the weekly group meeting was everyone’s biggest motivation. At each group meeting, one student gave a more formal talk, with everyone taking turns; the rest of the students also gave a status update. At MSRA, written emails and slides are required to be in English. Discussions are naturally in Chinese, but formal talks are also required to be in English, and academic papers and technical materials are of course all in English. For this reason, the English level of students at MSRA generally improves on its own. For example, I took the TOEFL purely for fun in 2018 and scored 103, with reading and listening as my strengths and speaking and writing as my weaknesses. Most MSRA students who go abroad prepare seriously and score higher than I did.

Above: MSRA's servers, all using 40G networks

Later I found out that such a good experimental platform was not available in every group at MSRA, and it was even harder for other domestic university laboratories to match. Dr. Tan followed Microsoft’s “Cloud + Client” strategy and shifted the research focus of the wireless and network group to data center networks, establishing an FPGA-based research direction with unique platform advantages. He also cooperated closely with the Azure Networking and Catapult product teams at headquarters, obtained servers, switches, network cards, FPGAs, and other hardware resources, and built a world-leading experimental platform for data center networks and programmable network cards. In it, 32 servers and 10 switches form a typical three-layer data center network topology, consisting of 2 Spine, 4 Leaf, and 4 ToR switches, with 8 servers in each cabinet.

Above: Sister Meng Meng and Brother He Tong borrowed the large conference room to debug the SORA software radio system, wireless and network were the two main directions of our group at that time

My First Research Project

My first research project was programming network switches. At that time, we had the Broadcom switch chip spec and SDK, which could be used to configure switch table entries. Guohan also built an operating system image and burned it into the data center switch, turning the switch into a standard Linux system. With PXE network booting, configuring the switch became very convenient. I still admire Guohan to this day; I have no idea how he built that operating system image.

Dr. Tan found that switches restart because of upgrades or software failures, and that the restart usually takes a long time. Once a top-of-rack switch restarts, all servers in the rack suffer a network interruption lasting several minutes, which is a serious threat to the high availability of services. At that time, most other research on switch high availability required adding redundant links and devices, which brings higher costs. The “Warm Reboot” solution proposed by switch manufacturers only shortened the operating system restart time, and the router chip still needed to be reinitialized.

We further explored and found that software failures often occur when forwarding rules issued by different types of routing protocols conflict. Therefore, our idea is to virtualize the forwarding capabilities of the switch to a certain extent, logically isolate different routing protocols, and use a centralized synchronization service to resolve configuration conflicts between routing protocols, reducing logical errors. On the other hand, the need to restart the entire switch in the event of a switch failure or upgrade is because the switch software is a whole, and once the process restarts, the routing information is lost. We adopt a decoupling approach, splitting the functions of the switch software into a simple and stable lightweight routing information library and a relatively complex synchronization service, thereby avoiding single-point failures and not affecting other components when one component is upgraded.

This became my undergraduate thesis, “Fault-Tolerant Software Architecture of SDN Routers”. At that time, my good friend Guo Jiahua and I were both interning at MSRA’s Wireless and Network Group, and Dr. Tan asked us to collaborate on this project. Coincidentally, Guo Jiahua was the first new student I met at USTC. I took an overnight train to the dormitory building of Shao Yuan, and after a short wait, Guo Jiahua also arrived with his suitcase. I had hardly any exposure to Linux in high school, but he was already proficient in it. Therefore, during my time at USTC, I always asked Guo Jiahua for advice on Linux, from the Shao Yuan computer room to LUG, and we often did course assignments together.

In this router fault-tolerant software architecture project, I was responsible for the synchronization service, and Guo Jiahua was responsible for the lightweight routing information library. Our undergraduate graduation designs were two aspects of this project. The synchronization service I built mainly resolves configuration conflicts between multiple clients (each responsible for a different type of routing protocol). It also has to be deterministic: after fault recovery, the same set of client configurations must resolve to the same final result, even if the configurations arrive in a different order. In addition, it needs to call the interface of the switch chip SDK for incremental synchronization.
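The determinism requirement can be illustrated with a minimal sketch. It is not the thesis implementation: the protocol priorities, the `(protocol, prefix, next_hop)` update format, and the function names `resolve` and `diff` are all hypothetical. The idea is only that if the winner for each prefix is picked by a total order, the merged table is the same for every arrival order, and the chip only needs to receive the difference from what is already installed.

```python
from itertools import permutations

# Hypothetical protocol preference (lower wins); the values are illustrative
# assumptions, not taken from the original system.
PROTOCOL_PRIORITY = {"static": 1, "ospf": 2, "bgp": 3}


def resolve(updates):
    """Merge rule updates from several clients into one forwarding table.

    Each update is (protocol, prefix, next_hop). The winner for a prefix is
    the smallest (priority, protocol, next_hop) tuple, which is a total
    order, so the result does not depend on arrival order.
    """
    table = {}
    for protocol, prefix, next_hop in updates:
        candidate = (PROTOCOL_PRIORITY[protocol], protocol, next_hop)
        if prefix not in table or candidate < table[prefix]:
            table[prefix] = candidate
    return {prefix: (proto, hop) for prefix, (_, proto, hop) in table.items()}


def diff(installed, desired):
    """Return (entries to add or update, prefixes to delete) for the chip."""
    upserts = {p: v for p, v in desired.items() if installed.get(p) != v}
    deletes = sorted(p for p in installed if p not in desired)
    return upserts, deletes


updates = [
    ("bgp", "10.0.0.0/24", "port3"),
    ("ospf", "10.0.0.0/24", "port1"),
    ("static", "10.0.1.0/24", "port2"),
]
# Every arrival order yields the same table.
outcomes = {tuple(sorted(resolve(p).items())) for p in permutations(updates)}
print(len(outcomes))
```

Replaying the same client configurations after a crash therefore rebuilds the same table, and `diff` against the entries still present in the chip gives the incremental update.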

At that time, we also considered why we should build a customized routing information library instead of directly using MySQL, Redis, etc. The reason is that the entries in a fuzzy-matching (wildcard) table are ordered, and the order represents priority. Tables in relational databases are unordered, so a relational database cannot be used directly. If the order of the fuzzy-matching rules is simply stored as an additional field in a relational database, then inserting a rule at a specified position requires modifying the order field of every subsequent entry, which is costly. The abstraction provided by key-value databases like Redis is even less suitable for router lookup tables.
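The renumbering cost described above can be seen in a small sketch using Python's built-in `sqlite3` module. The table layout (`pos`, `pattern`, `action`) and the helper names are assumptions made for illustration; the point is only that keeping priority in a column forces an `UPDATE` over every later row on each positional insert.

```python
import sqlite3


def make_table(n):
    """Create an in-memory table whose `pos` column encodes rule priority."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE rules (pos INTEGER, pattern TEXT, action TEXT)")
    db.executemany(
        "INSERT INTO rules VALUES (?, ?, ?)",
        [(i, f"10.0.{i}.0/24", f"port{i}") for i in range(n)],
    )
    return db


def insert_at(db, pos, pattern, action):
    """Insert a rule at `pos`; return how many existing rows were renumbered."""
    cur = db.execute("UPDATE rules SET pos = pos + 1 WHERE pos >= ?", (pos,))
    shifted = cur.rowcount
    db.execute("INSERT INTO rules VALUES (?, ?, ?)", (pos, pattern, action))
    return shifted


db = make_table(200)
print(insert_at(db, 0, "10.1.0.0/16", "drop"))   # 200: every existing row shifts
print(insert_at(db, 201, "0.0.0.0/0", "port0"))  # 0: appending shifts nothing
```

A high-priority rule inserted at the front touches all existing rows, which is the cost a purpose-built ordered structure is meant to avoid.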

First Attempt at Writing a Paper

After graduating from my undergraduate studies, I received a letter of consent from my school, allowing me to do another summer internship during the period from undergraduate graduation to graduate school enrollment.

At that time, Dr. Tan had a bigger idea that followed on from configuring switch chip table entries. The tables of a switch chip are very complex: a switch has many tables, the packet header fields that each table can match differ, and the table configurations of different types of switch chips are vastly different. The concept of SDN (Software Defined Networking) was just starting to catch on, and everyone wanted to use a unified description language to configure routing, forwarding, firewall, packet modification, and other policies in the data center. Dr. Tan naturally wondered: can we use a unified description language to abstract different types of switch chips?

We spent two months designing an algorithm that maps the general table structure of OpenFlow to the specific table structure of a switch, but it still had significant limitations in practice. On September 17, 2014, the boss returned to Beijing after a month of international conferences, and I reported the progress of the past month. Although I was not satisfied with my results, he said that my model had made great progress compared to a month earlier. September 18 was the deadline for NSDI abstract registration. That night, Dr. Tan and I stayed in the office until three or four in the morning. I organized the algorithm and proof into a few pages of LaTeX. Dr. Tan looked at it and said that we would definitely not make the paper deadline 7 days later. Having been through the earlier SIGCOMM submission, I knew how much work it takes to write a systems paper, from algorithm to implementation to writing. On the 19th, I boarded the train back to school and started my first year of graduate courses.

After returning to school, I was still in contact with the entrepreneurial team there, and I even called Dr. Tan for advice on wireless issues. Dr. Tan said that I should focus on my own research. However, during this year at school, I completely forgot about the research project on switch chip table configuration, and I did not take the initiative to contact my advisor for two months after returning, so the project was considered dead. Dr. Tan said, “This year, you just need to attend classes at school, enjoy life, and read more classic papers.”

Above: The progress report email I sent to Dr. Tan during my first year of graduate school

In the following one or two years, as expected, other papers on this issue were indeed published at top conferences like SIGCOMM. The theoretical problems were indeed as difficult to solve as I had imagined, but the actual routing configuration might not be so complicated. If I had been more invested at the time, could the paper have been published by me? However, I feel that because I lack a background in programming language theory, compared to other people’s papers, mine is slightly inferior in theoretical depth. The acceptance rate of SIGCOMM is only 10%~20%, and my paper may not survive in the fierce competition.

I later discovered that what new graduate students like to do most is “improvement” work: they see various problems in other people’s work (such as poor performance) and believe they can design a better algorithm to solve them. In terms of algorithm design, such work often lacks any “ingenious” feeling; it just considers a few more factors than others did, or combines two or more related works in an A+B form. Such articles are a shortcut to increasing the number of papers, but they are very difficult to get into top conferences, which often require new scenarios, new problems, and new methods. As a research newbie, I almost fell into this pit at the beginning, but fortunately, Dr. Tan pulled me out in time.

In the first year of graduate school, according to the plan of the joint training doctoral program, I took all the courses required for the master’s and doctoral stages at school. I continued to maintain and improve the network services of USTC LUG (Linux User Group), and also developed the course evaluation community of USTC with my then girlfriend and my roommate.

Although my technical skills were okay, my software engineering concepts were poor. Almost every day someone came to report bugs or propose new requirements, and I often modified the code directly on the production server, breaking other features when changing one. At that time, Professor James Zhang said that my work lacked the concept of versioning. Looking back, it also lacked the concept of automated testing. Every time I finished changing something and ran it and felt there was no problem, I would go online, which could easily affect other functions. The Flask framework used by the course evaluation community was chosen by my roommate, and this framework has not decayed after 7 years and easily supports the development of new features; while the code I wrote based on the framework during my undergraduate studies quickly became difficult to maintain.

Above: Professor James Zhang explaining the network architecture of USTC to us, the stable operation and continuous evolution of the network require good architectural support

With the efforts of generations of LUG buddies, more and more excellent buddies have joined LUG, making LUG a club where computer technology elites gather in the school, and many students have subsequently joined the MSRA joint training program. The network services developed by LUG and the course evaluation community are the things I am most proud of during my time at USTC.

The most direct effect of my experience in maintaining servers in the Technology Department of the Youth Institute and LUG is that it made my voice very loud, and people in the office often asked me to lower my voice during meetings. This is because the noise in the machine room is as high as 80 decibels, and you can’t hear clearly if your voice is low. Secondly, it made me understand Linux and networks better, and I felt more comfortable doing system research.

Above: A corner of the USTC LUG activity room, this is a display board made by Li Miaomiao

My First SIGCOMM Paper

My first paper was written under the guidance of Dr. Tan. This was not just a paper, but also the flagship project of the entire network group that year, with many researchers and interns involved. In terms of contribution, the first author of this project should be Dr. Tan, not me.

The Challenge of FPGA Programming

As early as the beginning of 2015, Dr. Tan collaborated with Microsoft’s FPGA team to explore how to use FPGA to accelerate data center networks. Dr. Tan proposed the idea of using FPGA to accelerate network functions. Network functions, simply put, are various middleboxes in the network, such as firewalls, load balancers, encryption, etc. These network functions are traditionally implemented in software, with relatively high latency and low throughput. FPGA is a type of programmable hardware that can process network packets, achieving microsecond-level latency and tens of Gbps bandwidth.

But the problem is that FPGA programming is complex, hard to read, write, debug, and modify. During my undergraduate studies in computer architecture, the lab required us to implement a pipeline CPU in Verilog and verify it on FPGA. At that time, I was in a group with Guo Jiahua. Relying on his strong programming ability, we spent a month writing 3000 lines of Verilog code, but the CPU we implemented still had many bugs. The teacher of the computer architecture lab was a retired old man, we called him “the grandfather of Loongson”, because the father of Loongson, Hu Weiwu, was his student. The old man told us that Hu Weiwu was very good, he built a decent CPU with a breadboard and digital circuits. The old man said our pipeline CPU was not bad and could be kept as a reference for future students. Unfortunately, this hellish lab course has reportedly been cancelled.

Our paper is a framework to simplify FPGA programming in the network field. At that time, there were many HLS (High-Level Synthesis) tools in the industry that could translate high-level languages like C into Verilog. The most famous ones were Vivado HLS and Altera OpenCL. Since we were using Altera’s FPGA card (Altera has now been acquired by Intel), we used OpenCL. At that time, the maturity of Altera OpenCL was not high. A small piece of C code could sometimes cause the compiler to crash; sometimes the compiler would hang and not produce results for a long time; sometimes the compiled Verilog would produce incorrect results; sometimes it could execute, but the performance was poor because there was no pipelining, everything was executed serially.

We couldn’t modify the source code of Altera OpenCL, so what could we do? Dr. Tan guided me to figure out which operations would cause problems and which could run normally, and then to find a usable subset of OpenCL primitives. OpenCLMicrobench is a small part of the examples I tested at that time. This further highlighted the significance of the programming framework we planned to build. The user’s code is compiled into Altera OpenCL by our framework, then compiled into Verilog by Altera OpenCL’s compiler, and finally processed by the traditional FPGA synthesis tool. The situation resembles web development in the past: browsers like IE, Firefox, and Chrome had significant differences and many pitfalls in accessing DOM elements, and frameworks like jQuery provided a unified programming abstraction to shield developers from those differences.

In FPGA programming, suppose a piece of code has a memory read-write dependency, that is, it writes to an address and then reads it. If the hardware is generated naively, the read cannot start until the write has completed. Because FPGA memory access cannot be completed in one cycle, the entire program then cannot be pipelined. There is a classic method in computer architecture to solve this problem: introduce a fast register that caches the most recently written data. If the data to be read turns out to be exactly the data being written, the latest result is taken from the register. There are many similar optimizations, and those interested can read our paper. In fact, the optimizations listed in the paper are only a part of them.
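The effect of such a forwarding register can be sketched with a toy cycle-level simulation. This is a software model under stated assumptions, not the ClickNP compiler's output: the write latency of 3 cycles is made up, reads are treated as instantaneous, and the workload is a per-key counter increment with one key arriving per cycle.

```python
from collections import deque

WRITE_LATENCY = 3  # cycles until a write is visible in memory (assumed value)


def _retire(memory, in_flight, cycle):
    """Apply every in-flight write that has completed by `cycle`."""
    while in_flight and in_flight[0][0] <= cycle:
        _, key, value = in_flight.popleft()
        memory[key] = value


def run(keys, forward):
    """Increment a counter per key; return (final memory, last issue cycle).

    Without forwarding, a read of a key with a write still in flight stalls
    until that write lands. With forwarding, the newest in-flight value is
    used directly, so one key is issued every cycle.
    """
    memory, in_flight, cycle = {}, deque(), 0
    for key in keys:
        cycle += 1
        _retire(memory, in_flight, cycle)
        pending = [(ready, value) for ready, k, value in in_flight if k == key]
        if pending and forward:
            current = pending[-1][1]   # bypass: take the newest pending write
        else:
            if pending:                # stall until the last write completes
                cycle = pending[-1][0]
                _retire(memory, in_flight, cycle)
            current = memory.get(key, 0)
        in_flight.append((cycle + WRITE_LATENCY, key, current + 1))
    for _, key, value in in_flight:    # drain the remaining writes
        memory[key] = value
    return memory, cycle


print(run(list("aaab"), forward=False))  # ({'a': 3, 'b': 1}, 8)
print(run(list("aaab"), forward=True))   # ({'a': 3, 'b': 1}, 4)
```

Both variants compute the same counters, but only the forwarding version issues one operation per cycle when consecutive operations hit the same address.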

Essentially, our framework defines a simple subset of the C language that does not support loops with an unbounded number of iterations, recursive function calls, or pointers. After I started working and developed a deep learning compiler, I learned that the compiler research community has a technical term for such code: SCoP (Static Control Part). The polyhedral compilation technology used in our MindSpore AKG also requires the code to be a static control region. In FPGA high-level synthesis, such code can be fully unrolled and inlined into a block of code without any loops, which can then be converted into a block of combinational logic. After registers are inserted at the appropriate places, it becomes fully pipelined digital logic. This logic can accept one input data block every clock cycle, processing one fragment of a data packet (we call it a flit), so the throughput of the entire pipeline is the clock frequency multiplied by the flit size.
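The throughput relation in the last sentence can be written out as a short calculation. The clock frequency and flit size below (200 MHz, 32 bytes) are illustrative assumptions chosen for the arithmetic, not figures stated in this article, and the function names are hypothetical.

```python
import math


def throughput_gbps(clock_hz, flit_bytes):
    """Peak throughput of a fully pipelined design: one flit per clock cycle."""
    return clock_hz * flit_bytes * 8 / 1e9


def packets_per_second(clock_hz, flit_bytes, packet_bytes):
    """Packet rate when each packet occupies ceil(size / flit) pipeline slots."""
    flits_per_packet = math.ceil(packet_bytes / flit_bytes)
    return clock_hz / flits_per_packet


# Assumed example: 200 MHz clock, 32-byte flits, 64-byte packets.
print(throughput_gbps(200e6, 32))          # 51.2 (Gbps)
print(packets_per_second(200e6, 32, 64))   # 100000000.0 (packets per second)
```

Under these assumed numbers the pipeline would sustain 51.2 Gbps regardless of what the logic computes, as long as it really accepts one flit every cycle.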

Not long after the publication of the ClickNP paper, the new version of Altera OpenCL was able to automatically perform some of the optimizations we made. This is because many of these optimizations are well-known in the compiler and architecture fields, but as an HLS product, OpenCL also needs to iterate continuously and add various optimizations.

From idea to system

Initially, we planned to make this programming framework in the form of a C++ library. That’s because we didn’t think about doing these compiler optimizations ourselves, we just wanted to wrap a network interface and host communication interface. But without compiler optimization, the performance of OpenCL is so poor that it can’t be used. It’s hard for a function call in a C++ library to perform complex code transformations, so it’s impossible to implement compiler optimization, so we had to go down the road of writing our own compiler.

In fact, even before I finished my first year of graduate courses and returned to MSRA, Dr. Tan had already written the first 1000+ lines of ClickNP code by hand. This included a parser for Click-like network function configuration files, written in lex and yacc; communication code between the host and the FPGA; and several basic network function elements. We developed the entire system on top of this framework code. Two months before returning to MSRA, I asked Dr. Tan what research preparations I needed to make, and he told me that nothing needed to be prepared, just come directly.

As the head of the network group, Dr. Tan personally wrote this much code, which really shocked me. Today, leading a team of hundreds, he still cares about technical details when discussing projects with us, and has even grabbed the keyboard to write some APIs by hand. Both during ClickNP and now, changes to an API must be reviewed and cannot be made casually. Sometimes I wanted to add a piece of syntax to ClickNP to solve the problem at hand; Dr. Tan would think more about whether the syntax was consistent with the overall abstraction, and whether a better abstraction could solve a whole class of problems. Now, as a junior manager, I have largely left hands-on coding and only write demo code that is not merged into the main repository. Stepping away from the front line of coding lets the detailed design of a project drift off track, and API definitions and changes become too casual. It is like constructing a building when I am only responsible for the scale model: the construction drawings end up far from the model, and both functionality and performance targets shrink.

Four interns were mainly involved in developing the ClickNP system. I was mainly responsible for the compiler. Senior interns Peng Yanqing and Luo Renqian were responsible for developing network function elements and applications, and they also helped me a lot with the compiler. He Tong was responsible for developing the high-performance communication pipeline between the FPGA and the CPU in Verilog. Researcher Dr. Luo Layong, a senior FPGA expert, did the early exploration of Altera OpenCL for us. There were also many professional hardware engineers from the Catapult team at Microsoft headquarters. They encapsulated network, PCIe, and other hardware capabilities (Hard IP) into easy-to-use interfaces and exposed them to Altera OpenCL as data streams, which is the key to ClickNP being able to process network packets and communicate with the host CPU via PCIe.

From idea to system is not an easy thing. From the official start of system implementation in July to the official submission at the end of January next year, it’s only half a year. As of the submission, the code repository had more than 10,000 lines of code and more than 2000 commits. Although MSRA values work-life balance, in the face of the deadline, we all put all our energy into system implementation. My long-distance girlfriend complained to me at the time that the time I spent chatting with her every day was less than 15 minutes. She was not in a good state at the time, often needed to vent, but I did not invest much time and energy to accompany her. Coincidentally, later when I was working on KV-Direct with Ruan Zhenyuan, his girlfriend joked that I was her love rival, because Ruan Zhenyuan was so invested in our KV-Direct that he forgot about her.

Since ancient times, it has been difficult to be both loyal and filial, and work-life balance is likewise easier said than done. I will never forget what Du Zide told us at the NOI 2009 winter camp: you two hundred are among the best programmers of all middle school students, and you are the hope of China’s computing future. Ten years from now, wherever you are, don’t forget your responsibilities and mission.

In the process of implementing ClickNP, Dr. Tan always reminded me of a few points:

  1. Focus. I tend to think divergently: while solving problem A, I think of problems B and C. At such times, you need to focus, solve the problem at hand first, and only then turn to the others, especially those that may not be within the scope of the current research work.
  2. Convergence. Another manifestation of my divergent thinking was that I would propose solutions A, B, and C for the same problem and then suffer from choice paralysis, unable to decide because none of the solutions felt perfect and fixing one weakness exposed another. At such times, you need to converge and not be a perfectionist. First systematically analyze the pros and cons of the candidate solutions, then choose one according to the scenario and requirements and start implementing it.
  3. Analytical thinking. For example, it is not enough to just list experimental results; you need to think about the reasons behind them and run additional experiments to confirm or falsify your guess. Whether at MSRA or in my current work, I find that many people lack the habit of analytical thinking. At every group meeting, some new experimental results are presented as if completing a chore, with no analysis of how they compare with the results discussed the previous week.
  4. Distinguish between scientific problems and engineering problems. For example, the FPGA could not automatically respond to ARP messages, so the switch could not automatically establish layer 3 routing rules. The early version of ClickNP could therefore only run experiments with two directly connected machines; if traffic passed through a switch, the switch’s layer 2 forwarding table had to be configured manually. Dr. Tan told me not to waste precious research energy on such engineering problems: if manual switch configuration solves the problem, don’t spend time implementing ARP responses on the FPGA, because that does not help our main research work.

Above: A testbed with two directly connected hosts on the workstation, directly connected between two FPGA network cards

Writing, completely overhauled

We probably started writing the first draft of the paper about 10 days before the deadline. At that point, the draft had only a few chapters, such as related work and background, and the experiments were not yet finished. We not only needed to decide which experiments to add based on the results so far, but also had to write the design and experiment sections.

The first draft I wrote was almost completely overturned and rewritten by Tan Bo. As a beginner, I wrote the paper as a flat narrative, describing the design of each system component one after another. This does not work. It is said that a reviewer spends only about 45 minutes on a paper on average, with no time to look at every detail, let alone to help a logically chaotic paper distill its core point and straighten out its logic.

A paper must be summarizable in one sentence that conveys a key message. It should have a few eye-catching key innovations, with the system design developed around them. A paper is a highly logical piece of writing and must not be confusing. Reviewers are good at logic too, so you cannot expect to bluff your way past them on key points. Some authors cite papers that are not directly related as evidence for the challenges they claim, which readily leads reviewers to question whether the motivation is sufficient.

A paper needs to be a logically connected story: first derive the design goals and key challenges from the current problems, then present the system design that addresses these challenges, followed by the implementation corresponding to that design, and finally the experiments that validate these technical innovations. Challenges, designs, implementations, and experiments need to correspond one to one. Avoid raising a challenge without solving it, designing an exquisite mechanism without saying what problem it solves, or leaving a design without experimental validation. Challenges that remain unresolved belong in the future work section. It is also inadvisable to give away the solution in the challenge section, or to introduce a new challenge in the middle of the system design.

Therefore, many papers read like “eight-part essays”. Experienced readers can grasp the core argument at a glance and summarize the key content in less than an hour, because there are conventions about what content appears where, so there is no need to read the whole paper word by word. If you really want to understand every detail, a systems paper probably takes a day.

Some groups split the writing by assigning one chapter to each person and submit once the pieces are merged. This easily leads to muddled logic, because everyone wants to highlight different key points. Therefore, our network group generally writes the challenges, design, and experiments first, then distills the core innovations to write the introduction and abstract. The challenge, design, and experiment sections are then adjusted to fit the overall logic.

How do you highlight the key innovations in the design? The first step is a general-to-specific structure. For example, open the chapter with a short section that summarizes the entire design and states which innovation in which section solves which challenge. There should also be an architecture diagram to go with it. When we survey the literature, we often want an architecture diagram for each paper; if the paper has none, it is very annoying.

Secondly, remember to revisit the challenges while describing the design and explain the purpose of each design choice; don’t assume the reader is too smart. Readers often have questions such as: What problem does this design solve? How were the advantages and disadvantages of the scheme balanced? Many readers hope to see this kind of reasoning in the text, not just a list of the final design decisions. When a reader wants to design a similar system, the business requirements and hardware conditions may have changed, and the right design choices may no longer be the same as when the paper was written. This was exactly the review feedback on KV-Direct. Some of our parameter settings depend on the hardware configuration, and the reviewer asked us to discuss whether our method would still work under different hardware conditions, so we added Section 6.3 and Table 5 in the final draft of KV-Direct.

It is worth noting that explaining the design does not mean turning the paper into a narrative of the research journey. Some students first describe the immature solutions proposed early in the research and their shortcomings, and only then introduce the real solution, hoping to draw the reader in. But this can easily confuse the reviewer, because not every reviewer has time to read the paper word by word and follow its winding logic. In my humble view, a paper is not a novel, and its logic should be simple and direct.

When writing the related work, I found a work from ten years earlier that compiled Click to FPGA, and I panicked as if facing a formidable enemy: isn’t this the same as what we did? Tan Bo calmly taught me: there is nothing new under the sun. If you find a paper particularly novel, it probably only means that you don’t know the related work well enough. Since that work, network performance has increased by an order of magnitude, and compilation poses many challenges, so our programming framework is deeply optimized for FPGA HLS rather than throwing C code directly at the compiler regardless of performance; we also support communication between the FPGA and the host.

From submission to acceptance

Starting three or four days before the deadline, we stayed up several nights together, working for 44 hours over the last two days, and Tan Bo stayed up with us. This seems to have become a habit of our group. Every paper only gets written in the last two weeks, which makes things very tense; however well the paper is prepared, and whether or not it can make the deadline, everyone always ends up staying up a few nights together.

Tan Bo was full of confidence in this paper: “If this can’t be accepted, I don’t know what kind of paper can be accepted by SIGCOMM”. The application of FPGA in data centers is a hot research field, and this paper was the first one published by Microsoft on the application of FPGA in the network field, so it was very timely. It was also the first programming framework to implement network functions on FPGA in a high-level language while achieving a high performance of 40 Gbps.

Figure above: Review scores of the ClickNP paper

On April 26, 2016, our paper was accepted, but my girlfriend asked to break up, so it was a mix of joy and sorrow at the time. Usually, when someone in our group gets a paper accepted, they treat everyone to a meal. But because I was in a bad mood, I didn’t, and I’m very sorry for that.

At that time, the conference chair contacted us and said that our paper had been submitted to the experience track but looked like a research track paper, so it had been reviewed under the research track process; we were asked to change the track, or it would not be accepted. Tan Bo was surprised: how could the wrong track have been selected? I didn’t understand the difference between the two tracks at the time. When I submitted the paper, I had noticed that the experience track option was checked, but I didn’t bring it up, assuming that Tan Bo had chosen it. Tan Bo quickly wrote to the conference chair to clarify. He told me that if I notice a possible problem, I must bring it up rather than decide on my own; had the conference chair not kindly switched the track, we would have had to suffer in silence.

From receiving the acceptance notice to submitting the camera-ready (final) version, we had only a few weeks. We needed to sort out the reviewers’ comments and respond to them, revise the paper accordingly and submit a revised draft, receive a further round of requested changes from the reviewers, and finally submit the final draft. During the few days when the revised draft was due, I was at school dealing with the breakup; fortunately, revising the camera-ready version doesn’t require much creativity. When I first started revising, I followed the reviewers’ comments obediently. One comment held that we were not the “first” programming framework to implement network functions on FPGA with a high-level language. But Tan Bo believed that we should still retain some form of “the first”, and so added the qualifier “achieving 40 Gbps performance” in the final version.

First time on a plane, first time abroad

That year’s SIGCOMM conference happened to coincide with the Olympics, both in Brazil. Just a few months before the conference, the Zika virus broke out in Brazil, and everyone was afraid to go, especially many researchers from Microsoft, who decided not to attend SIGCOMM that year. The SIGCOMM organizers shared this concern and moved the conference from the big city of Salvador to the island city of Florianopolis, where the population density is lower, the temperature is cooler, and there are fewer mosquitoes (the Zika virus is mainly transmitted by mosquitoes).

That year, although not many people in China could publish long papers at SIGCOMM, many universities and companies still went to observe and to take part in the poster and other sessions. For example, Tsinghua and Huawei each sent more than 10 people. Because no one from Microsoft was going with me, I went with Zhang Yuchao from Tsinghua and asked her to help me book the flight through the school; the round trip cost more than 18,000 yuan.

Interestingly, that was both my first time abroad and my first time on a plane, and that first flight was a journey of more than 40 hours, with a transfer in Europe on the way to South America. When we transferred at Rio de Janeiro airport, the Olympic closing ceremony was about to be held, and armed gendarmes could be seen patrolling everywhere, because there had been shootings a few days earlier.

On that trip I also found that I have a high tolerance for jet lag and cabin pressure changes: after one long sleep on arrival I had adjusted to the time difference, and I hardly felt anything in my ears when the plane took off and landed. Being insensitive to jet lag may be related to my irregular schedule during my PhD, when I often got up at noon or even in the afternoon but could also get up in the morning for meetings if needed. Even now, an alarm clock wakes me at eight o’clock on weekday mornings to go to work, but on weekends I still sleep until eleven or twelve.

Figure above: Time imprint on the beach
Figure above: Me on the beach (Thanks to Zhang Yuchao for the photo)
Figure above: Graffiti on the beach path
Figure above: SIGCOMM 2016 demo display
Figure above: SIGCOMM 2016 demo of a four-layer load balancer, with a simple GUI made with NW.js
Figure above: Welcome dinner

To guard against Zika virus infection, we brought various mosquito repellent measures, including repellent bracelets and liquid repellent, and the first thing we did on entering the room was to hunt for mosquitoes like detectives. However, over the week-long conference we saw hardly any mosquitoes.

As it was my first academic conference, I took it very seriously. After the conference, I wrote “The Weathervane of Network Technology - SIGCOMM 2016”, which was published by the official WeChat account of MSRA. After my second paper, KV-Direct, was published at SOSP 2017, I also wrote “SOSP: The Weathervane of Computer System Research”, likewise published by the official WeChat account of MSRA. Later, more and more domestic universities began to write up conferences systematically, such as the IPADS Institute of Shanghai Jiaotong University, which always sends students and teachers to top systems conferences and produces good summaries of each work.

ClickNP also had a significant impact within the company. I demonstrated ClickNP, accepted but not yet officially published, at the MSRA Student Techfest. Dean Hong came over and stayed at the booth for two minutes with interest, asking me several good questions. He said that this was the first time he had seen a demonstration of ClickNP and that he was impressed with the result. ClickNP also won the best presentation award that year.

Above: Demonstrating ClickNP at MSRA Student Techfest

Above: Demonstrating ClickNP to Dean Hong at MSRA Student Techfest

The ClickNP project has not been open-sourced and has only ever been used within our network research group. Although it incubated subsequent works such as MP-RDMA, KV-Direct, and MELO, outsiders know it only as a paper. Currently, this paper has 279 citations (Google Scholar data as of January 23, 2023), and it has inspired many researchers in the field of programmable network cards, but its code remains sealed on the server. This is my biggest regret at MSRA. I hope that what I build can be used by thousands of people, just like the network service I did in LUG.

Gathered, we are a fire; scattered, we are stars across the sky

Luo Renqian is a student jointly trained by MSRA and USTC, two years junior to me. After completing the ClickNP project, Renqian said he didn’t want to do FPGA anymore. It seems that the difficulty of writing, debugging, and modifying FPGA code takes quite a toll. By the time he finished his first year at school and returned to MSRA, the direction of the joint training program had shifted from networking to AI. Renqian stayed at MSRA after his Ph.D. graduation and became a researcher. His total citation count on Google Scholar is even higher than mine.

Peng Yanqing, also two years junior to me, came to MSRA for a one-year internship in his senior year from the ACM class of Shanghai Jiaotong University. After finishing his undergraduate degree, he went to Utah State University for his Ph.D., no longer doing network research but studying databases. He has also produced several excellent papers and currently works at Meta.

He Tong, after interning in our project, was admitted to UCLA for a master’s degree and currently works at Google.

Dr. Luo Layong, after completing our project, has continued to work on using FPGA to accelerate networking and is currently a senior technical expert at a domestic Internet company.

Dr. Xu Ningyi, as a senior FPGA expert and the head of the hardware research group, guided us in designing the overall hardware architecture in the ClickNP project. He has since been dedicated to the research of AI hardware architecture and is currently a professor at the Qingyuan Institute of Shanghai Jiaotong University.

Including Professor Xiong Yongqiang and Dr. Cheng Peng, who remain at MSRA, Tan Bo and me at Huawei, and my academic advisor Professor Chen Enhong, the 10 main participants at that time came from 5 different institutions (including 4 schools) and now belong to 7 institutions. It is true that gathered, we are a fire; scattered, we are stars across the sky. In the following articles, we will meet more teachers, friends, and experts. As the saying goes: friends kept in mind seem near, however far away they are.

This article is the first in the series “Five Years of Ph.D. at MSRA” (From Novice to the First SIGCOMM Paper), to be continued…
