Bojie Li
2018-01-01
Fault tolerance is critical for distributed applications. Many request serving and batch processing frameworks have been proposed to simplify programming of fault tolerant distributed systems, which basically ask the programmers to separate states from computation and store states in a fault-tolerant system. However, many existing applications (e.g. Node.js, Memcached and Python in Tensorflow) do not support fault tolerance, and fault tolerant systems are often slower than their non-fault-tolerant counterparts. In this work, we take up the challenge of achieving transparent and efficient fault tolerance for general distributed applications. Challenges include process migration, deterministic replay and distributed snapshot.
2018-01-01
To improve performance and reduce CPU overhead for network applications, programmable switches and NICs have been introduced in data centers to offload virtualized network functions, transport protocols, key-value stores, distributed consensus and resource disaggregation. Compared to general-purpose processors, programmable switches and NICs have more limited resources and only support a more constrained programming model. To this end, developers typically split a network function into a data plane to process common-case packets and a control plane to handle the remaining cases. The data plane function is then implemented in a packet processing language (e.g. P4) and offloaded into hardware.
Writing packet programs for network application offloading could be hard labor. First, even if the protocol specification (or source code) is available, the developer needs to read the thousand-page book (or code) and figure out which part are the common cases. Second, many implementations have subtle variations from the specification, so the developer often needs to examine packet traces and reverse-engineer the implementationspecific behaviors manually. Third, the offloaded function needs rewrite when the application is updated (e.g. regular expressions in a firewall).
We design P4Coder, a system to automatically synthesis the data plane by learning the behavior of a reference network application. No formal specification or source code is required. The developer only needs to design a few data-plane test cases and run the reference application. P4Coder captures the input and output packets, and searches for a packet program to produce identical output packets for the sequence of input packets. Obviously, passing the test cases does not imply that the program will generalize correctly for other inputs.
2018-01-01
Servers in data centers host increasing varieties of PCIe devices, e.g. GPUs, NVMe SSDs, NICs, accelerator cards and FPGAs. For high throughput and low latency, CPU-bypass direct communication among PCIe devices (e.g. GPU-Direct, NVMe-OF) is flourishing. However, many PCIe devices are designed to only talk to drivers on CPU, while the PCIe register and DMA interface is intricate and potentially undocumented. In order to capture PCIe packets and debug PCIe protocol implementations, developers need PCIe protocol analyzers which are expensive (~$250K), hard to deploy in production environment and unable to modify PCIe TLP packets that pass through.
In this work, we design and implement a transparent PCIe debugger and gateway with a commodity FPGA-based PCIe board. PCIe gateway captures packets bump-in-the-wire between a target PCIe device (e.g. NIC) and CPU. Because PCIe has fixed routing, it is impossible to perform ARP-spoofing-like attack on PCIe fabric. However, we can spoof the device driver to redirect the PCIe traffic to go through our PCIe gateway. The communication between a PCIe device and CPU falls in two categories according to the initiator.
2018-01-01
Analytical database queries are critical to support business decisions. Because these queries involve complicated computation over a large corpus of data, their execution typically takes minutes to hours. When information in the database is updated, the user needs to re-execute the query on the current snapshot of database, which again takes a long time and the result reflects a stale snapshot. In this rapidly changing world, business intelligence should react to information updates in real-time.
To this end, we design ReactDB, a new database with fast analytical queries and reactive to database updates.
ReactDB is reactive in two ways. First, cached analytical queries are reactive to updates in the database. We observe that many analytical queries are repetitive. So we cache intermediate results of frequent analytical queries. When data updates, the cached results and ongoing transactions are updated incrementally in real-time. This enables cached queries to complete immediately. The user may even subscribe to an analytical query and receive an updated query result whenever the database updates.
Second, in ReactDB, physical data layout and indexes are reactive to data access pattern. Different queries need different physical data layouts and indexes for efficient access. Traditionally, they need to be manually tuned by the DBA, which may be suboptimal for certain workloads.
2018-01-01
Accelerators such as GPUs, TPUs and FPGAs are deployed at scale in data centers to accelerate many online serving applications, e.g. machine learning inference, image processing, encryption and compression. These applications typically receive requests from network, do pre-processing, call a computationally intensive routine, do post-processing and finally send response to network. With accelerators, the computationally intensive routine is replaced by an RPC to the accelerator device. Here the challenge arises: what the CPU should do while waiting for the accelerator?
The traditional approach is to relinquish the CPU after sending the offloading request and the OS scheduler will switch to another thread. However, context switch in Linux takes several microseconds. A fine-grained offloaded task also ranges from several to tens of microseconds, which would soon complete and wake up the thread again. The context switch overhead not only wastes CPU, but also adds thread wake up latency to request processing latency. A second approach is to busy wait until the offloaded task completes, which obviously wastes CPU. A third approach is to rewrite the application to do other jobs within the thread while waiting for the accelerator. In this work, we build a library to transparently find and execute non-conflict jobs within the thread, without modifying application code.
2017-12-21
USTC Blog has been completely shut down, with a lifespan of five years.
WordPress is really too bloated, I’ve been wanting to migrate to a static blog, so I took this opportunity to migrate to Hexo.
The hexo-migrator-wordpress
migration tool is not perfect, issues with formatting need to be slowly fixed. Also, the comments have been lost.
2017-11-13
(Reprinted from USTC Innovation Foundation)
The top international academic conference in the field of computer systems, SOSP 2017 (Symposium on Operating Systems Principles), was recently held in Shanghai. Since the first SOSP in 1967, most of the content in textbooks on operating systems and distributed systems has come from the SOSP conference. Therefore, researchers in the system field generally regard publishing papers at SOSP as an honor. Among the 39 papers accepted by SOSP this year, only two first authors are from mainland China, including the KV-Direct system co-authored by Li Bojie, a third-year doctoral student, and Ruan Zhenyuan, a senior undergraduate student from the University of Science and Technology of China (USTC). This is also the first time that USTC has published a paper at SOSP. As an undergraduate, how did Ruan Zhenyuan step by step achieve USTC’s “breakthrough from zero” at the SOSP conference?
2017-11-10
(Reprinted from Microsoft Research Asia)
Since its inception in 1967, SOSP (Symposium on Operating Systems Principles) has been held every two years for 50 years. From the UNIX system in 1969 to MapReduce, BigTable, and GFS in the early 21st century, a series of the most influential works in the systems field have been published at SOSP and its biennial sibling conference OSDI. If you compile the most influential papers (Hall of Fame Award) from SOSP and OSDI over the years, it would be more than half of the textbooks on operating systems and distributed systems. As the highest academic conference in the systems field, SOSP and OSDI only accept 30 to 40 high-quality papers each year, so being able to publish a paper at SOSP is an honor for system researchers.
2017-11-02
(This is the closing speech of my lecture at the University of Science and Technology of China in November 2017, copied from Einstein’s “Motivation for Exploration”)
Most research in the field of systems can be divided into two categories: one is the emergence of new hardware, such as our programmable network cards, RDMA network cards, NVMe and NVM in the field of high-speed storage, SGX, TSX instruction extensions in CPUs, which can bring many new possibilities to system design; the other is the emergence of new application scenarios, such as deep learning that we are all talking about today, which brings many new challenges to system design. However, if there are only these two types of research in the field of systems, it will not become a respected research field, just as there can be no forest with only grass. Because for new hardware or new application scenarios, even if there are no scientists who specialize in system research, engineers will come up with ways to utilize these possibilities and meet these challenges.
So what attracts so many smart people into the field of system research? I think Einstein’s
In addition to this negative motivation, there is also a positive one. People always want to draw a simplified and easy-to-understand picture of the world in the most appropriate way; so he tries to replace the world of experience with his own world system and conquer it. This is what painters, poets, speculative philosophers, and natural scientists do, each in their own way. System researchers must strictly control the subject of their research, that is, to describe the most common modules in real systems. Attempting to reproduce complex systems in the real world with the precision and completeness of system researchers is beyond human intelligence. The basic abstractions that form the foundation of the system, such as IP for networks, SQL for databases, and files for operating systems, should be universally valid for a wide range of hardware architectures and application scenarios. With these basic abstractions, it is possible to construct a complete system through pure deduction. In this construction process, engineers can add the complexity of the real world, which may lose some of the good properties of the basic abstraction, but we can still understand the behavior of the entire system through a deductive process that does not exceed human reason.
The highest mission of system researchers is to obtain these universal basic abstractions, from which high-performance, scalable, highly available, and easy-to-program systems can be established by deductive methods. There is no logical path to these basic abstractions, only through intuition based on understanding of experience. This means that a good system researcher must first be an experienced system engineer. Because of this methodological uncertainty, it can be assumed that there will be many equally valid system abstractions. This view is valid both theoretically and practically. However, the development of the system field shows that at the same time, under the same hardware constraints and application scenarios, there is always one that seems much better than the others. This is what Leibniz very aptly described as “pre-established harmony”. The desire to see this pre-established harmony is the source of the infinite perseverance and patience of system researchers.
2017-10-29
Performance of in-memory key-value store (KVS) continues to be of great importance as modern KVS goes beyond the traditional object-caching workload and becomes a key infrastructure to support distributed main-memory computation in data centers. Recent years have witnessed a rapid increase of network bandwidth in data centers, shifting the bottleneck of most KVS from the network to the CPU. RDMA-capable NIC partly alleviates the problem, but the primitives provided by RDMA abstraction are rather limited. Meanwhile, programmable NICs become available in data centers, enabling in-network processing. In this paper, we present KV-Direct, a high performance KVS that leverages programmable NIC to extend RDMA primitives and enable remote direct key-value access to the main host memory.
We develop several novel techniques to maximize the throughput and hide the latency of the PCIe connection between the NIC and the host memory, which becomes the new bottleneck. Combined, these mechanisms allow a single NIC KV-Direct to achieve up to 180 M key-value operations per second, equivalent to the throughput of tens of CPU cores. Compared with CPU based KVS implementation, KV-Direct improves power efficiency by 3x, while keeping tail latency below 10 µs. Moreover, KV-Direct can achieve near linear scalability with multiple NICs. With 10 programmable NIC cards in a commodity server, we achieve 1.22 billion KV operations per second, which is almost an order-of-magnitude improvement over existing systems, setting a new milestone for a general-purpose in-memory key-value store.