Bojie Li
2017-12-21
USTC Blog has been shut down for good, after a five-year run.
WordPress is really too bloated, and I had long wanted to migrate to a static blog, so I took this opportunity to move to Hexo.
The hexo-migrator-wordpress
migration tool is not perfect; formatting issues will need to be fixed gradually. The comments have also been lost.
2017-11-13
(Reprinted from USTC Innovation Foundation)
The top international academic conference in the field of computer systems, SOSP 2017 (Symposium on Operating Systems Principles), was recently held in Shanghai. Since the first SOSP in 1967, most of the content of textbooks on operating systems and distributed systems has come from the SOSP conference, so researchers in the systems field generally regard publishing a paper at SOSP as an honor. Among the 39 papers accepted at SOSP this year, only two have first authors from mainland China; one of them is the KV-Direct system, co-authored by Li Bojie, a third-year doctoral student, and Ruan Zhenyuan, a senior undergraduate student at the University of Science and Technology of China (USTC). This is also the first paper USTC has published at SOSP. As an undergraduate, how did Ruan Zhenyuan, step by step, achieve USTC's "breakthrough from zero" at SOSP?
2017-11-10
(Reprinted from Microsoft Research Asia)
Since its inception in 1967, SOSP (Symposium on Operating Systems Principles) has been held every two years for 50 years. From the UNIX system in 1969 to MapReduce, BigTable, and GFS in the early 21st century, a series of the most influential works in the systems field have been published at SOSP and its biennial sibling conference, OSDI. A compilation of the most influential papers (Hall of Fame Award winners) from SOSP and OSDI over the years would cover more than half of a textbook on operating systems and distributed systems. As the top academic conferences in the systems field, SOSP and OSDI accept only 30 to 40 high-quality papers each year, so publishing a paper at SOSP is an honor for systems researchers.
2017-11-02
(This is the closing speech of my lecture at the University of Science and Technology of China in November 2017, adapted from Einstein's "Motivation for Exploration")
Most research in the systems field falls into two categories. One follows new hardware: programmable NICs, RDMA NICs, NVMe and NVM in high-speed storage, and CPU instruction extensions such as SGX and TSX, all of which open new possibilities for system design. The other follows new application scenarios, such as the deep learning everyone talks about today, which poses many new challenges for system design. However, if these were the only two kinds of systems research, it would not be a respected research field, just as there can be no forest with only grass. For any new hardware or new application scenario, even without scientists who specialize in systems research, engineers would find ways to exploit these possibilities and meet these challenges.
So what attracts so many smart people into the field of systems research? I think Einstein's answer fits here:
In addition to this negative motivation, there is a positive one. People always want to draw, in whatever way suits them best, a simplified and intelligible picture of the world; they then try to substitute this world system of their own for the world of experience, and thus to conquer it. This is what the painter, the poet, the speculative philosopher, and the natural scientist each do, in their own way. System researchers must strictly limit the subject of their research: to describing the most common modules in real systems. Attempting to reproduce the complex systems of the real world with the precision and completeness a systems researcher demands is beyond human intellect. The basic abstractions that form the foundation of a system, such as IP for networks, SQL for databases, and files for operating systems, should hold universally across a wide range of hardware architectures and application scenarios. From these basic abstractions, a complete system can be constructed by pure deduction. In this construction, engineers may add the complexity of the real world, which can cost some of the good properties of the basic abstraction, but we can still understand the behavior of the entire system through a chain of deduction that does not exceed human reason.
The highest mission of system researchers is to arrive at these universal basic abstractions, from which high-performance, scalable, highly available, and easy-to-program systems can be built by deduction. There is no logical path to these basic abstractions; they can be reached only by intuition, resting on an understanding born of experience. This means that a good systems researcher must first be an experienced systems engineer. Because of this methodological uncertainty, one might suppose there would be many equally valid system abstractions. This view is sound both in theory and in practice. Yet the development of the systems field shows that at any given time, under the same hardware constraints and application scenarios, one abstraction always proves far superior to the others. This is what Leibniz so aptly called "pre-established harmony". The desire to see this pre-established harmony is the source of the systems researcher's infinite perseverance and patience.
2017-10-29
Performance of in-memory key-value store (KVS) continues to be of great importance as modern KVS goes beyond the traditional object-caching workload and becomes a key infrastructure to support distributed main-memory computation in data centers. Recent years have witnessed a rapid increase of network bandwidth in data centers, shifting the bottleneck of most KVS from the network to the CPU. RDMA-capable NIC partly alleviates the problem, but the primitives provided by RDMA abstraction are rather limited. Meanwhile, programmable NICs become available in data centers, enabling in-network processing. In this paper, we present KV-Direct, a high performance KVS that leverages programmable NIC to extend RDMA primitives and enable remote direct key-value access to the main host memory.
We develop several novel techniques to maximize the throughput and hide the latency of the PCIe connection between the NIC and the host memory, which becomes the new bottleneck. Combined, these mechanisms allow a single NIC KV-Direct to achieve up to 180 M key-value operations per second, equivalent to the throughput of tens of CPU cores. Compared with CPU-based KVS implementations, KV-Direct improves power efficiency by 3x, while keeping tail latency below 10 µs. Moreover, KV-Direct achieves near-linear scalability with multiple NICs. With 10 programmable NIC cards in a commodity server, we achieve 1.22 billion KV operations per second, almost an order-of-magnitude improvement over existing systems, setting a new milestone for a general-purpose in-memory key-value store.
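One core reason batching over PCIe pays off is that each DMA transaction carries a fixed overhead that dwarfs the cost of a single small key-value operation. The toy model below illustrates this amortization effect; the latency numbers are illustrative assumptions, not measurements from the KV-Direct paper.

```python
# Toy model of why batching small KV operations over PCIe matters:
# each DMA round trip pays a fixed overhead, so packing several
# operations into one transfer amortizes it.
# (Numbers below are assumed for illustration, not from the paper.)

PCIE_OVERHEAD_NS = 500   # assumed fixed cost per DMA round trip
PER_OP_NS = 50           # assumed processing cost per KV operation

def ops_per_second(batch_size: int) -> float:
    """Effective KV op throughput when batch_size ops share one DMA."""
    total_ns = PCIE_OVERHEAD_NS + batch_size * PER_OP_NS
    return batch_size / (total_ns * 1e-9)

single = ops_per_second(1)    # one op per DMA: overhead dominates
batched = ops_per_second(16)  # 16 ops per DMA: overhead amortized
```

Under these assumptions, batching 16 operations per transfer yields several times the throughput of issuing one operation per DMA, which is why hiding PCIe latency is central to the design.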
2017-10-28
(Reprinted from Microsoft Research Asia)
The international academic conference on computer systems, SOSP 2017 (Symposium on Operating Systems Principles), is currently being held in Shanghai. As one of the top academic conferences in the field of computer systems, a paper fortunate enough to be accepted there carries self-evident influence. Not long ago, a paper on in-memory key-value stores by Bojie Li, a doctoral student jointly trained by Microsoft Research Asia and the University of Science and Technology of China, was accepted to the conference. For most people outside the computer industry, "in-memory key-value store" is a blind spot in their knowledge, a deep sea for the ship of curiosity. But for Bojie Li, born in 1992, it has become a part of his life. His growth story can be told starting from this term, strange to us but deeply familiar to him.
2017-09-02
Driven by explosive demand on computing power and slowdown of Moore’s law, cloud providers have started to deploy FPGAs into datacenters for workload offloading and acceleration. In this paper, we propose an operating system for FPGA, called Feniks, to facilitate large scale FPGA deployment in datacenters.
Feniks provides an abstracted interface for FPGA accelerators, so that FPGA developers are shielded from underlying hardware details. In addition, Feniks provides (1) a development and runtime environment that lets accelerators share an FPGA chip efficiently; (2) direct access to server resources such as storage and coprocessors over the PCIe bus; and (3) a datacenter-wide FPGA resource allocation framework.
We implemented an initial prototype of Feniks on the Catapult Shell and an Altera Stratix V FPGA. Our experiments show that device-to-device communication over PCIe is feasible and efficient. A case study shows that multiple accelerators can share an FPGA chip independently and efficiently.
2017-08-04
(Reposted from Microsoft Research Asia)
From June 19 to 20, 2017, the open source technology event LinuxCon + ContainerCon + CloudOpen (LC3) was held in China for the first time. The two-day agenda was packed: 17 keynote speeches, 88 technical talks across 8 tracks, and technical exhibitions and hands-on labs from companies including Microsoft. LinuxCon drew international and domestic internet and telecom giants and thousands of industry professionals, including Linux founder Linus Torvalds, to gather and focus on industry trends.
2017-08-03
The First Asia-Pacific Workshop on Networking (APNet’17) Invited Talk:
Implementing ClickNP: Highly Flexible and High-Performance Network Processing with FPGA + CPU
Abstract: ClickNP is a highly flexible and high-performance network processing platform on reconfigurable hardware, published at SIGCOMM'16. This talk shares our experience implementing the ClickNP system, both before and after paper submission. Over 8 months, we developed 100 elements and 5 network functions for the SIGCOMM paper, amounting to 1K commits and 20K lines of code. Since the paper submission, ClickNP has continued to evolve into a general-purpose FPGA programming framework within our research team, now comprising 300 elements, 86 application projects, and 80K lines of code.
(1) Even with high-level languages, programming FPGAs is still much more challenging than programming CPUs. We had a hard time understanding the behavior and pitfalls of black-box compilers; we share our findings, which shaped the coding-style rules in the ClickNP language design and the optimizations in the ClickNP compiler.
(2) The OpenCL host-to-kernel communication model is a poor fit for network processing. This talk elaborates on the internals of the high-performance communication channel between CPU and FPGA.
(3) FPGA compilation takes hours, run-time debugging is hard, and simulation is inaccurate. As a case study, we show how we identified and resolved a deadlock bug in the L4 load balancer by leveraging ClickNP's debugging functionality.
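High-performance CPU-to-device channels like the one mentioned in point (2) are commonly built as polled ring buffers in shared memory, so neither side blocks or takes interrupts. The sketch below shows that general single-producer/single-consumer pattern in Python; it is an assumed illustration of the technique, not ClickNP's actual channel implementation.

```python
# Minimal single-producer/single-consumer ring buffer, the common
# pattern behind polled CPU<->device communication channels.
# (Illustrative sketch of the general technique, not ClickNP's code.)

class RingChannel:
    def __init__(self, capacity: int):
        self.buf = [None] * capacity
        self.capacity = capacity
        self.head = 0  # advanced only by the producer (e.g. CPU)
        self.tail = 0  # advanced only by the consumer (e.g. device)

    def send(self, msg) -> bool:
        """Producer side: enqueue unless full; never blocks."""
        nxt = (self.head + 1) % self.capacity
        if nxt == self.tail:      # full (one slot kept empty on purpose)
            return False
        self.buf[self.head] = msg
        self.head = nxt
        return True

    def recv(self):
        """Consumer side: poll for a message; None when empty."""
        if self.tail == self.head:
            return None
        msg = self.buf[self.tail]
        self.tail = (self.tail + 1) % self.capacity
        return msg
```

Because each index is written by exactly one side, this structure needs no locks, which is what makes it attractive when one endpoint is hardware polling host memory over PCIe.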
2017-08-03
Limited by small on-chip memory, hardware-based transports typically implement a go-back-N loss recovery mechanism, which costs very little memory but is well known to perform poorly even under small packet loss ratios. We present MELO, an efficient selective-retransmission mechanism for hardware-based transports that consumes only a constant, small amount of memory regardless of the number of concurrent connections. Specifically, MELO employs an architectural separation between data and metadata storage and uses a shared bits pool allocation mechanism to reduce the on-chip memory footprint of metadata. By adding on average only 23 B of extra on-chip state per connection, MELO achieves up to 14.02x the throughput of go-back-N and reduces the 99th-percentile FCT by 3.11x under certain loss ratios.
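The key contrast in the abstract is between go-back-N, which resends everything after the first loss, and selective retransmission, which tracks out-of-order arrivals so only the actual holes are resent. The sketch below shows that bookkeeping with a cumulative ACK plus a small bitmap; it illustrates the general selective-retransmission idea under assumed details, not MELO's hardware design.

```python
# Selective-retransmission bookkeeping: a cumulative ACK plus a bitmap
# of out-of-order arrivals, so only lost packets (holes) are resent.
# Go-back-N, by contrast, would resend every packet after the first loss.
# (Illustrative sketch of the idea, not MELO's actual design.)

class SelectiveConn:
    def __init__(self):
        self.cum_ack = 0    # every packet with seq < cum_ack is received
        self.bitmap = 0     # bit i set => packet (cum_ack + i) received

    def on_receive(self, seq: int) -> None:
        if seq < self.cum_ack:
            return                      # duplicate, already acknowledged
        self.bitmap |= 1 << (seq - self.cum_ack)
        while self.bitmap & 1:          # slide over contiguous arrivals
            self.bitmap >>= 1
            self.cum_ack += 1

    def holes(self, sent_up_to: int) -> list:
        """Sequence numbers that need retransmission."""
        return [s for s in range(self.cum_ack, sent_up_to)
                if not (self.bitmap >> (s - self.cum_ack)) & 1]
```

With packets 0-4 sent and packet 2 lost, this tracker reports only [2] to retransmit, while go-back-N would resend 2, 3, and 4; the per-connection cost is just the bitmap, which is what a shared bits pool can keep small.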