Ruan Zhenyuan: Undergraduate Achieves USTC's Breakthrough in Top-tier Papers
(Reprinted from USTC Innovation Foundation)
SOSP 2017 (Symposium on Operating Systems Principles), the top international academic conference in the field of computer systems, was recently held in Shanghai. Since the first SOSP in 1967, most of the content in textbooks on operating systems and distributed systems has come from papers presented at the conference, so systems researchers generally regard publishing at SOSP as an honor. Of the 39 papers accepted by SOSP this year, only two have first authors from mainland China. One of them is the KV-Direct system, co-authored by Li Bojie, a third-year doctoral student, and Ruan Zhenyuan, a senior undergraduate at the University of Science and Technology of China (USTC). It is also the first paper USTC has ever published at SOSP. How did Ruan Zhenyuan, still an undergraduate, achieve USTC’s “breakthrough from zero” at SOSP step by step?
Supercomputing Hongyan Team: Pursuing Ultimate Performance
Drawing on the experience he had built up in programming competitions in middle school, Ruan Zhenyuan joined the High Performance Computing Laboratory of Professor An Hong at USTC’s School of Computer Science and Technology early in his undergraduate years. Since the Supercomputing Hongyan Team was established in 2012, it has sent 17 teams with 102 participants to high-performance computing competitions at home and abroad, all of which achieved excellent results. With its advanced experimental infrastructure and the opportunity to attend the International Conference on High Performance Computing (SC) in the United States, the team has attracted some of the best undergraduates in computer science and technology.
A supercomputing competition is a comprehensive test of students’ ability to design and implement systems. Working across application optimization, software frameworks, and hardware platforms, the participating students design and assemble a small supercomputing system together with the manufacturer, then push the performance of a set of specified applications as high as possible within a power limit of 3000W. The applications span multiple disciplines, and the hardware involved, including the InfiniBand network, GPU accelerator cards, and high-performance servers, is top-end equipment that students rarely get to see. In just three days of competition, they have to assemble and debug the system, divide up the work, and run the applications, which puts heavy demands on their ability to learn new material, analyze performance bottlenecks, diagnose and fix faults, and work as a team.
For five consecutive years, undergraduates trained by the Supercomputing Hongyan Team, namely Wang Yuanrong (0911), Lan Wuwei (1011), He Songtao (1111), Zhang Zhishuai (1211), and Ruan Zhenyuan (1311), have won the highest honor for USTC undergraduates: the Guo Moruo Scholarship. As the technical backbone and system administrator of the 2015 Supercomputing Hongyan Team, Ruan Zhenyuan won third place in the 2015 International College Student Supercomputing Competition. In 2016 he served as the team’s coach, passing on technical know-how to the 2016 team, which won both the overall championship and the Linpack performance championship. Ruan Zhenyuan himself also won the championship in the 2014 International Student RDMA Programming Competition and, in 2015, the championship in the National Parallel Computing Competition jointly organized by the China Computer Federation (CCF) and Intel. With these achievements, Ruan Zhenyuan won not only the Guo Moruo Scholarship but also the rarer CCF Outstanding Undergraduate Award, the Outstanding Graduating Undergraduate Award of Anhui Province, and other honors. His undergraduate thesis, supervised by Professor An Hong, was also rated an excellent graduation thesis.
After this success in student high-performance computing competitions, Ruan Zhenyuan set himself the further challenge of professional academic research. For this, he chose the UCLA CSST summer research program, funded by the USTC Innovation Foundation.
UCLA Overseas Exchange: The Beginning of Academic Research
Ruan Zhenyuan studied under Professor Jason Cong (Cong Jingsheng), a leading figure in computer architecture at UCLA, and his research topic was a performance prediction model for Apache Spark. During the short two-month exchange, Ruan Zhenyuan not only had to take courses, but also had to work with a postdoctoral fellow through the entire process of producing a paper, from topic selection, research, design, and implementation to submission. Completing a paper in two months while giving a progress report at the group meeting every week is challenging even for an experienced researcher. Relying on his solid foundation in computer systems and his strong “coding power” (the ability to quickly write code to implement algorithms), Ruan Zhenyuan spent several weeks with the postdoctoral fellow sorting out the related work and background knowledge, and made great progress before the exchange ended. After returning to China, he continued to refine the work and submitted it to IPDPS’17, an important conference in the field of computer architecture.
Through the exchange at UCLA, Ruan Zhenyuan learned the entire process of doing scientific research. Combined with the experience in hardware architecture and application optimization he had gained in supercomputing competitions at USTC, this made him look like a professional researcher rather than an inexperienced undergraduate. Although his mentor, Professor Jason Cong, is Chinese, he required academic discussions to be conducted in English, which was good practice for Ruan Zhenyuan’s English. The first-hand experience of an American university also strengthened his resolve to study abroad and pursue a doctoral degree. The scholarship and funding provided by the USTC Innovation Foundation allowed him to focus on this summer project. The USTC Innovation Foundation once reported on his experience: “Overseas Exchange Scholarship” Exchange Experience - Ruan Zhenyuan.
Ruan Zhenyuan formed a deep friendship with Professor Jason Cong and, almost without hesitation, chose to continue his doctoral studies with him. Professor Cong, for his part, praised the professional abilities of USTC students. A year later, Ruan Zhenyuan recommended Cui Tianyi (1400) from the USTC Innovation Pilot Class to participate in the same UCLA CSST exchange program. Zhang Chen, an excellent doctoral graduate jointly trained by Professor Jason Cong and Peking University, also returned to China to join Microsoft Research Asia.
After the exchange at UCLA, Ruan Zhenyuan embarked on a new journey at Microsoft Research Asia, which has a long-standing partnership with the University of Science and Technology of China.
Microsoft Internship: Relay of USTC Alumni
The Wireless and Networking Group at Microsoft Research Asia is regarded as a standard-bearer of China’s networking research community. For five consecutive years, USTC undergraduates who won the Guo Moruo Scholarship, namely Lan Wuwei (1011), He Songtao (1111), Li Yishuai (1200), Ruan Zhenyuan (1311), and Cui Tianyi (1400), came to the group for internships. During his internship, He Songtao published a first-authored paper at MOBICOM 2015, the top conference in mobile computing, and won the Best Demonstration Award. Two years later, Ruan Zhenyuan also published a first-authored paper at SOSP 2017, the top conference in the systems field.
The story of how Ruan Zhenyuan came to intern at Microsoft is full of USTC alumni helping one another. In 2013, Li Bojie (1000), then a junior, was recommended by his class teacher, Professor Huang Songyun, and after several rounds of selection was recruited as a jointly trained doctoral student by Dr. Zhang Yongguang, the chief researcher of the Wireless and Networking Group at Microsoft Research Asia. During his senior year, under the guidance of senior researcher Tan Kun at Microsoft Research Asia, Li Bojie conducted research on high-performance network packet processing and submitted it to SIGCOMM’14. Although the paper was not accepted, he gained experience in academic research. In the second half of the year, Li Bojie studied the software architecture of fault-tolerant routers and used this work as his undergraduate thesis. Through this research he came into contact with valuable material on programmable switches, which laid the foundation for his doctoral research on programmable data center networks.
In 2015, Luo Renqian (1211) was also admitted to the joint training program run by the University of Science and Technology of China and Microsoft. Under the guidance of Dr. Tan Kun, Li Bojie collaborated with Luo Renqian, Peng Yanqing (an undergraduate in the ACM class of Shanghai Jiao Tong University), and others to publish the ClickNP paper at SIGCOMM 2016, the top conference in computer networks, achieving a “breakthrough from zero” for USTC at SIGCOMM. ClickNP is a framework for programming reconfigurable hardware (FPGAs) in high-level languages. It addresses the difficulty of FPGA programming, which had troubled the networking research community for decades, and enables software engineers to quickly develop efficient FPGA hardware accelerators for network packet processing, with 10 times the performance of the most efficient network processing software at the time.
After ClickNP was submitted to SIGCOMM, Luo Renqian recommended his classmate, an expert in programming languages, Li Yishuai (1200), to Microsoft for a three-month internship. Luo Renqian and Li Yishuai cooperated to implement the RDMA protocol with a programmable network card. Luo Renqian was responsible for the hardware implementation, and Li Yishuai was responsible for the programming interface, which was used as the graduation thesis. Before the end of Li Yishuai’s internship, he recommended his junior, Ruan Zhenyuan (1311), to Microsoft for an internship, planning to continue to advance this RDMA project.
Ruan Zhenyuan had not joined the joint doctoral training program, but because he had experience using RDMA and optimizing system performance from the USTC supercomputing competitions, Dr. Tan Kun made an exception and approved a one-year internship. At his first group meeting, Ruan Zhenyuan was described as “very professional” by Dr. Xiong Yongqiang, a senior and sharp-eyed supervisor. Li Bojie (1000), Xiao Wencong (Beihang 2010), Lu Yuanwei (0902), and other Microsoft joint doctoral students who collaborated with Ruan Zhenyuan on the SOSP paper all consider him one of the strongest undergraduates they have ever seen. In fact, in addition to the KV-Direct system published at SOSP, Ruan Zhenyuan also helped Lu Yuanwei and other students publish a paper at the first Asia-Pacific Network Symposium (APNet’17), on which he is the third author.
Tenfold Acceleration: A Milestone in Key-Value Storage
Following a company reorganization, from October 2016 Ruan Zhenyuan worked under Dr. Zhang Lintao, the chief researcher of the Systems Group at Microsoft Research Asia. His research focus shifted from networks to systems, and he began to study key-value storage accelerated by programmable hardware. Although key-value storage is unfamiliar to people outside the computer industry, it is an indispensable part of the cloud server systems behind services like Taobao, WeChat, and Weibo. When the whole nation starts shopping at once on “Double Eleven”, or when traffic to Lu Han’s Weibo explodes, the entire system comes under huge pressure, with demand for key-value storage access reaching tens of billions of requests per second. What Ruan Zhenyuan and his collaborators built is KV-Direct, a high-performance piece of distributed system infrastructure that can achieve up to 1.22 billion in-memory key-value accesses per second with only 357W of power, improving system performance by 10 times. It has been hailed as a “milestone in the performance of general key-value storage”. To give a vivid example, if grabbing a train ticket during the Spring Festival travel rush counts as one key-value access, then a single KV-Direct server could let almost everyone in the country grab one ticket every second.
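The figures quoted above can be checked with some quick arithmetic. The sketch below is only illustrative: the throughput and power numbers come from this article, while the population of 1.4 billion is a rough assumption introduced here, and the function names are hypothetical.

```python
# Back-of-the-envelope check of the KV-Direct figures quoted in the article.

THROUGHPUT_OPS_PER_S = 1.22e9  # key-value accesses per second (from the article)
POWER_W = 357                  # power draw in watts (from the article)
POPULATION = 1.4e9             # ASSUMPTION: rough population of China


def ops_per_joule(throughput_ops_per_s, power_w):
    """Key-value accesses completed per joule of energy (ops/s divided by J/s)."""
    return throughput_ops_per_s / power_w


def ops_per_person_per_second(throughput_ops_per_s, population):
    """How many accesses per second each person would get if load were spread evenly."""
    return throughput_ops_per_s / population


if __name__ == "__main__":
    print(f"{ops_per_joule(THROUGHPUT_OPS_PER_S, POWER_W):,.0f} accesses per joule")
    print(f"{ops_per_person_per_second(THROUGHPUT_OPS_PER_S, POPULATION):.2f} "
          "accesses per person per second")
```

Under these assumptions the server handles a little under one access per person per second, which is consistent with the “almost” in the train ticket example.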
While designing and implementing the KV-Direct system, Ruan Zhenyuan, Li Bojie, and Dr. Zhang Lintao shared a common doubt: what exactly were the challenges and innovations in the system? Challenges and innovations are essential elements of a high-quality paper. In the summer of 2016, Cui Tianyi (1400) had interned at the Wireless and Networking Group at Microsoft Research Asia, developing a very high-performance accelerator for encrypted network connections (HTTPS). Although it took second place in the global Microsoft Hackathon, it was not innovative enough to be worth publishing at a top conference. Fortunately, the KV-Direct system turned out to be less simple than it first appeared.
In classic key-value processing systems (as shown in Figure a above), network requests are processed by the server-side CPU, which is relatively slow. Some newer systems (as shown in Figure b above) have therefore moved request processing from the server to the client. This relieves the server-side CPU bottleneck but increases the cost of network transmission, and when multiple clients compete for the same resource, they have to coordinate and synchronize with each other. KV-Direct instead proposes to replace the CPU with a field-programmable gate array (FPGA) for network request processing (as shown in Figure c above). Requests are still processed on the server side, but on a better-suited hardware architecture.
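The trade-off between client-side and server-side processing can be illustrated with a toy model. This is a minimal sketch, not KV-Direct’s actual interface: the `ToyServer` class and its `atomic_add` method are hypothetical, and each method call simply stands in for one network round trip.

```python
class ToyServer:
    """In-memory key-value store; every method call models one network round trip."""

    def __init__(self):
        self.data = {}
        self.round_trips = 0

    def get(self, key):
        self.round_trips += 1
        return self.data.get(key, 0)

    def put(self, key, value):
        self.round_trips += 1
        self.data[key] = value

    def atomic_add(self, key, delta):
        # Hypothetical server-side operation: read, modify, and write in one request.
        self.round_trips += 1
        self.data[key] = self.data.get(key, 0) + delta
        return self.data[key]


def client_side_increments(server, key):
    """Two clients each do read-modify-write themselves, with no coordination."""
    a = server.get(key)     # client A reads the counter
    b = server.get(key)     # client B reads the same old value
    server.put(key, a + 1)  # client A writes back
    server.put(key, b + 1)  # client B overwrites A's update (lost update)
    return server.data[key], server.round_trips


def server_side_increments(server, key):
    """Two clients each send a single request that the server applies in order."""
    server.atomic_add(key, 1)
    server.atomic_add(key, 1)
    return server.data[key], server.round_trips


if __name__ == "__main__":
    print(client_side_increments(ToyServer(), "counter"))  # (value, round trips)
    print(server_side_increments(ToyServer(), "counter"))
```

In this model the uncoordinated client-side version needs twice as many round trips and still loses one update, which is why such designs need extra synchronization between clients, while the server-side version stays correct with one request per update.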
However, key-value pairs are stored in memory. The CPU accesses memory with high bandwidth and low latency, whereas the FPGA accesses memory over the PCIe bus, with roughly one-tenth of the bandwidth and about 10 times the latency. The designs of classic key-value storage systems therefore no longer apply. To save access bandwidth and hide memory latency, Ruan Zhenyuan and his collaborators proposed a series of optimizations that bring key-value storage performance close to the hardware limit. Because the interface bandwidth of a single FPGA accelerator card is limited, they built a custom server with 10 FPGA accelerator cards plugged in, which is the system shown in the earlier photo of a server full of cards.
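Why a tenfold increase in latency forces a redesign can be seen from Little’s law: the number of requests that must be in flight equals throughput multiplied by latency. The sketch below is a rough illustration under stated assumptions. The total throughput and card count come from this article, but the 1 microsecond PCIe latency and the 0.1 microsecond local memory latency are hypothetical values chosen only to show the scaling, and one memory access per key-value operation is assumed.

```python
# Little's law: in-flight requests = throughput (requests/s) * latency (s).

TOTAL_OPS_PER_S = 1.22e9  # from the article
NUM_CARDS = 10            # from the article
PER_CARD_OPS_PER_S = TOTAL_OPS_PER_S / NUM_CARDS  # average share per card

ASSUMED_PCIE_LATENCY_S = 1e-6    # ASSUMPTION: hypothetical memory latency over PCIe
ASSUMED_LOCAL_LATENCY_S = 1e-7   # ASSUMPTION: hypothetical 10x lower local latency


def required_in_flight(throughput_ops_per_s, latency_s):
    """Concurrent requests needed to sustain the throughput at the given latency."""
    return throughput_ops_per_s * latency_s


if __name__ == "__main__":
    local = required_in_flight(PER_CARD_OPS_PER_S, ASSUMED_LOCAL_LATENCY_S)
    pcie = required_in_flight(PER_CARD_OPS_PER_S, ASSUMED_PCIE_LATENCY_S)
    print(f"per-card throughput: {PER_CARD_OPS_PER_S:,.0f} ops/s")
    print(f"in-flight requests at local latency: {local:.1f}")
    print(f"in-flight requests at PCIe latency:  {pcie:.1f}")
```

Whatever the real latencies are, the ratio is what matters: ten times the latency means ten times as many requests must be kept in flight to reach the same throughput, so the hardware has to issue many memory accesses concurrently rather than wait for each one in turn.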
During the half-year implementation, Ruan Zhenyuan overcame various difficulties in the FPGA programming framework, fell into countless pitfalls, fixed countless bugs, and burned out six FPGA boards before finally reaching the expected performance. Two weeks before the SOSP submission deadline, only the basic performance of the system had been measured, and there was not enough time to integrate it with real-world applications. Ruan Zhenyuan, Li Bojie, and the others had to work overtime on the paper under the guidance of their mentor Zhang Lintao and with the help of classmates Xiao Wencong and Lu Yuanwei, and they finished a presentable version only on the eve of the deadline. As might be expected, many reviewers regarded the lack of evaluation under real application scenarios as a major flaw in the paper. Even so, the outstanding performance and ingenious design of the KV-Direct system won over the reviewers, and the SOSP program committee finally decided to accept the paper.
The More You Walk, the Wider the Road
In October 2017, the world-renowned SOSP conference was held in Shanghai. To make question sessions more efficient, SOSP’17 pioneered online questioning: participants submit questions online and vote on them during each presentation, and the speaker answers the question with the most votes afterwards. The 39 papers presented over the three-day conference covered all areas of systems research, and each paper had only one slot for an online question. Among more than 850 participants, questions submitted by Ruan Zhenyuan were selected at least 5 times, reflecting his broad knowledge and his keen grasp of what each paper was really about.
In September 2017, Ruan Zhenyuan began his doctoral studies at the University of California, Los Angeles (UCLA). While many doctoral students worry about whether they can graduate, Ruan Zhenyuan had already published two papers before enrolling, one of which, KV-Direct, achieved a “breakthrough from zero” for USTC at SOSP. By comparison, the Guo Moruo Scholarship, which he won as the student with the top GPA in the School of Computer Science, is a small honor: the scholarship is awarded every year, but a “breakthrough from zero” is rare, and one created by an undergraduate is rarer still. From the USTC Supercomputing Hongyan Team, to the UCLA CSST overseas exchange program funded by the newly established Alumni Foundation, to the joint training program at Microsoft Research Asia, Ruan Zhenyuan’s research path keeps getting wider.