Bojie Li (李博杰)
2018-01-01
Fault tolerance is critical for distributed applications. Many request-serving and batch-processing frameworks have been proposed to simplify the programming of fault-tolerant distributed systems; they generally ask programmers to separate state from computation and store the state in a fault-tolerant storage system. However, many existing applications (e.g. Node.js, Memcached and the Python runtime in TensorFlow) do not support fault tolerance, and fault-tolerant systems are often slower than their non-fault-tolerant counterparts. In this work, we take up the challenge of achieving transparent and efficient fault tolerance for general distributed applications. Key challenges include process migration, deterministic replay and distributed snapshots.
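To make the state-separation requirement concrete, here is a minimal Python sketch of the pattern these frameworks ask for. `FaultTolerantStore` is a hypothetical stand-in for a replicated, persistent store, not a real API: once all mutable state lives outside the process, the process itself can crash, migrate or restart without losing anything.

```python
class FaultTolerantStore:
    """Stand-in for a replicated, persistent key-value store (illustrative)."""
    def __init__(self):
        self._data = {}

    def get(self, key, default=0):
        return self._data.get(key, default)

    def put(self, key, value):
        self._data[key] = value  # a real store would replicate/persist here


store = FaultTolerantStore()

def handle_request(user_id, amount):
    # Stateless computation: all state lives in `store`, so if this process
    # crashes, a fresh replica can serve the next request with no recovery.
    balance = store.get(user_id)
    store.put(user_id, balance + amount)
    return balance + amount

print(handle_request("alice", 5))  # 5
print(handle_request("alice", 3))  # 8
```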
2018-01-01
To improve performance and reduce CPU overhead for network applications, programmable switches and NICs have been introduced in data centers to offload virtualized network functions, transport protocols, key-value stores, distributed consensus and resource disaggregation. Compared to general-purpose processors, programmable switches and NICs have more limited resources and support only a more constrained programming model. To cope with these constraints, developers typically split a network function into a data plane that processes common-case packets and a control plane that handles the remaining cases. The data plane function is then implemented in a packet processing language (e.g. P4) and offloaded to hardware.
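As an illustration of this split (in Python rather than P4, and with hypothetical packet fields), the data plane below is restricted to a simple match-action lookup of the kind that offloads well, while everything else is punted to a software control plane:

```python
def data_plane(pkt, flow_table):
    """Fast path: fixed match-action lookup, the kind offloadable to hardware."""
    key = (pkt["src"], pkt["dst"])
    if key in flow_table and not pkt.get("tcp_flags"):  # common case only
        return flow_table[key]             # forward per the cached decision
    return control_plane(pkt, flow_table)  # uncommon case: punt to software

def control_plane(pkt, flow_table):
    """Slow path: full protocol logic; installs a rule for future packets."""
    action = "allow" if pkt["dst_port"] != 23 else "drop"
    flow_table[(pkt["src"], pkt["dst"])] = action
    return action

table = {}
syn = {"src": "10.0.0.1", "dst": "10.0.0.2", "dst_port": 80, "tcp_flags": "SYN"}
print(data_plane(syn, table))   # punted to the control plane, installs a rule
data = dict(syn, tcp_flags=None)
print(data_plane(data, table))  # handled entirely on the fast path
```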
Writing packet programs for network application offloading is hard labor. First, even if the protocol specification (or source code) is available, the developer needs to read a thousand-page book (or the code) and figure out which parts are the common cases. Second, many implementations deviate subtly from the specification, so the developer often needs to examine packet traces and reverse-engineer the implementation-specific behaviors manually. Third, the offloaded function needs to be rewritten whenever the application is updated (e.g. the regular expressions in a firewall).
We design P4Coder, a system that automatically synthesizes the data plane by learning the behavior of a reference network application. No formal specification or source code is required. The developer only needs to design a few data-plane test cases and run the reference application. P4Coder captures the input and output packets, and searches for a packet program that produces identical output packets for the given sequence of input packets. Obviously, passing the test cases does not imply that the program will generalize correctly to other inputs.
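The search can be pictured as enumerate-and-check. The sketch below is a deliberately simplified illustration under strong assumptions: candidate "packet programs" are short compositions of a few hypothetical header-rewrite primitives, whereas the real system would search a far richer program space.

```python
from itertools import product

# Hypothetical primitive actions a data-plane program may apply.
PRIMITIVES = {
    "id":         lambda p: dict(p),
    "dec_ttl":    lambda p: dict(p, ttl=p["ttl"] - 1),
    "swap_ports": lambda p: dict(p, sport=p["dport"], dport=p["sport"]),
}

def synthesize(test_cases, max_len=2):
    """Search programs (sequences of primitives) consistent with all traces."""
    for length in range(1, max_len + 1):
        for prog in product(PRIMITIVES, repeat=length):
            def run(pkt, prog=prog):
                for name in prog:
                    pkt = PRIMITIVES[name](pkt)
                return pkt
            if all(run(i) == o for i, o in test_cases):
                return prog  # passes all test cases (may not generalize!)
    return None

cases = [({"ttl": 64, "sport": 1234, "dport": 80},
          {"ttl": 63, "sport": 80,   "dport": 1234})]
print(synthesize(cases))  # ('dec_ttl', 'swap_ports')
```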
2018-01-01
Servers in data centers host an increasing variety of PCIe devices, e.g. GPUs, NVMe SSDs, NICs, accelerator cards and FPGAs. For high throughput and low latency, CPU-bypass direct communication among PCIe devices (e.g. GPU-Direct, NVMe-oF) is flourishing. However, many PCIe devices are designed to talk only to drivers running on the CPU, while the PCIe register and DMA interface is intricate and potentially undocumented. To capture PCIe packets and debug PCIe protocol implementations, developers need PCIe protocol analyzers, which are expensive (~$250K), hard to deploy in production environments and unable to modify the PCIe TLP packets that pass through.
In this work, we design and implement a transparent PCIe debugger and gateway with a commodity FPGA-based PCIe board. The PCIe gateway captures packets bump-in-the-wire between a target PCIe device (e.g. a NIC) and the CPU. Because PCIe has fixed routing, it is impossible to mount an ARP-spoofing-like attack on the PCIe fabric. However, we can spoof the device driver to redirect the PCIe traffic through our PCIe gateway. The communication between a PCIe device and the CPU falls into two categories according to the initiator.
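The sketch below illustrates the spoofing idea for both categories. Everything here (addresses, TLP encoding, function names) is a hypothetical illustration of the mechanism rather than the actual implementation: CPU-initiated MMIO is redirected by handing the driver the gateway's BAR address, and device-initiated DMA would analogously be given DMA addresses that point at the gateway.

```python
GATEWAY_BAR = 0xF000_0000   # hypothetical MMIO window of the FPGA gateway
DEVICE_BAR  = 0xE000_0000   # hypothetical MMIO base of the real device

def spoof_driver_bar(real_bar):
    # The driver is handed the gateway's address instead of the device's,
    # so every register access (CPU-initiated TLP) traverses the gateway.
    return GATEWAY_BAR

def gateway_forward(tlp):
    """Log a TLP, optionally modify it, then forward it to the real target."""
    print("captured:", tlp)
    if tlp["type"] in ("MRd", "MWr") and tlp["addr"] >= GATEWAY_BAR:
        # CPU-initiated access: rewrite the spoofed address back to the
        # device's real BAR before forwarding.
        tlp = dict(tlp, addr=DEVICE_BAR + (tlp["addr"] - GATEWAY_BAR))
    return tlp  # device-initiated DMA would be rewritten analogously

bar = spoof_driver_bar(DEVICE_BAR)  # what the driver believes the BAR is
reg_write = {"type": "MWr", "addr": bar + 0x10, "data": 0x1}
print("forwarded:", gateway_forward(reg_write))
```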
2018-01-01
Analytical database queries are critical to support business decisions. Because these queries involve complicated computation over a large corpus of data, their execution typically takes minutes to hours. When information in the database is updated, the user needs to re-execute the query on the current snapshot of the database, which again takes a long time, and by the time it completes the result already reflects a stale snapshot. In this rapidly changing world, business intelligence should react to information updates in real time.
To this end, we design ReactDB, a new database that executes analytical queries fast and reacts to database updates.
ReactDB is reactive in two ways. First, cached analytical queries are reactive to updates in the database. We observe that many analytical queries are repetitive, so we cache the intermediate results of frequent analytical queries. When the data is updated, the cached results and ongoing transactions are updated incrementally in real time. This enables cached queries to complete immediately. The user may even subscribe to an analytical query and receive an updated query result whenever the database updates.
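A minimal sketch of this incremental maintenance idea, with illustrative names (this shows the general view-maintenance pattern, not ReactDB's actual engine): a cached aggregate is patched with the delta of each update and pushes the new result to subscribers, so the "query" completes immediately.

```python
class CachedSum:
    """Cached result of `SELECT SUM(amount) FROM orders`, maintained
    incrementally; subscribers are notified on every change."""
    def __init__(self):
        self.value = 0
        self.subscribers = []

    def on_insert(self, row):
        self.value += row["amount"]  # apply only the delta, no re-scan
        self._notify()

    def on_delete(self, row):
        self.value -= row["amount"]
        self._notify()

    def _notify(self):
        for cb in self.subscribers:
            cb(self.value)           # push the updated result to users

view = CachedSum()
view.subscribers.append(lambda v: print("sum is now", v))
view.on_insert({"amount": 100})   # sum is now 100
view.on_insert({"amount": 50})    # sum is now 150
view.on_delete({"amount": 100})   # sum is now 50
```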
Second, in ReactDB, the physical data layout and indexes are reactive to the data access pattern. Different queries need different physical data layouts and indexes for efficient access. Traditionally, these are tuned manually by the DBA, which may be suboptimal for certain workloads; a simple sketch of access-pattern-driven index selection follows.
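The sketch below illustrates one plausible reactive policy: count per-column lookups and build an index once a threshold is crossed. The threshold, names and policy are assumptions for illustration, not ReactDB's actual mechanism.

```python
from collections import Counter

class AdaptiveIndexer:
    THRESHOLD = 100  # lookups before an index pays for itself (assumed)

    def __init__(self):
        self.lookups = Counter()
        self.indexes = set()

    def record_lookup(self, column):
        self.lookups[column] += 1
        if column not in self.indexes and self.lookups[column] > self.THRESHOLD:
            self.indexes.add(column)  # a real system builds it in the background
            print(f"building index on {column}")

idx = AdaptiveIndexer()
for _ in range(101):
    idx.record_lookup("customer_id")  # prints: building index on customer_id
```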
2018-01-01
Accelerators such as GPUs, TPUs and FPGAs are deployed at scale in data centers to accelerate many online serving applications, e.g. machine learning inference, image processing, encryption and compression. These applications typically receive requests from the network, do pre-processing, call a computationally intensive routine, do post-processing and finally send a response to the network. With accelerators, the computationally intensive routine is replaced by an RPC to the accelerator device. Here a challenge arises: what should the CPU do while waiting for the accelerator?
The traditional approach is to relinquish the CPU after sending the offload request, letting the OS scheduler switch to another thread. However, a context switch in Linux takes several microseconds, while a fine-grained offloaded task also takes only several to tens of microseconds, so the task soon completes and wakes the thread up again. The context-switch overhead not only wastes CPU, but also adds thread wake-up latency to the request processing latency. A second approach is to busy-wait until the offloaded task completes, which obviously wastes CPU. A third approach is to rewrite the application to do other jobs within the thread while waiting for the accelerator. In this work, we build a library to transparently find and execute non-conflicting jobs within the thread, without modifying application code.
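The effect the library aims for can be approximated by hand with coroutines. The asyncio sketch below simulates the accelerator RPC with a sleep (a real offload would complete via a doorbell or completion queue); while one request awaits its offload, the same thread runs other requests' pre- and post-processing instead of context-switching or busy-waiting.

```python
import asyncio

async def accelerator_rpc(payload):
    await asyncio.sleep(10e-6)           # stand-in for a ~10 us offload
    return payload.upper()

async def handle_request(i):
    pre = f"req{i}"                      # pre-processing on the CPU
    result = await accelerator_rpc(pre)  # yields the thread, no OS switch
    print("post-processed", result)      # post-processing after completion

async def main():
    # Many requests interleave on one thread; awaiting an offload simply
    # lets another request's pre/post-processing run in the meantime.
    await asyncio.gather(*(handle_request(i) for i in range(4)))

asyncio.run(main())
```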
2017-12-21
USTC Blog has been shut down for good, having lived to the age of five.
WordPress is simply too bloated. I had long wanted to migrate to a static blog, and took this opportunity to migrate to Hexo.
hexo-migrator-wordpress
The migration tool is not perfect; the formatting problems will have to be fixed bit by bit. The comments were also lost.
2017-11-13
(Reprinted from 科大新创公益基金会, the USTC Initiative Foundation)
SOSP 2017 (the Symposium on Operating Systems Principles), a top international academic conference in computer systems, was held in Shanghai not long ago. Since the first SOSP in 1967, more than half of the content of operating systems and distributed systems textbooks has come out of SOSP. Researchers in the systems field therefore generally regard publishing at SOSP as an honor. Of the 39 papers accepted to SOSP this year, only two have first authors from mainland China, and one of them is the KV-Direct system co-authored by Bojie Li, a third-year PhD student at the University of Science and Technology of China (USTC), and Zhenyuan Ruan, a fourth-year undergraduate. This is also the first time USTC has published a paper at SOSP. How did Zhenyuan Ruan, as an undergraduate, achieve this "breakthrough from zero" for USTC at SOSP, step by step?
2017-11-10
(Reprinted from Microsoft Research Asia)
SOSP (the Symposium on Operating Systems Principles) has been held every two years since its founding in 1967, a history of 50 years. From the UNIX system in 1969 to MapReduce, BigTable and GFS in the early 2000s, the most influential works in the systems field have been published at SOSP and at OSDI, its sibling conference held in alternating years. If the most influential (Hall of Fame Award) papers from SOSP and OSDI over the years were collected into one volume, it would make up the better half of a textbook on operating systems and distributed systems. As the top academic venues in systems, SOSP and OSDI accept only 30 to 40 high-quality papers each year, so publishing a paper at SOSP is an honor for a systems researcher.
2017-11-02
(Closing remarks of a talk given at the University of Science and Technology of China in November 2017, adapted from Einstein's "Motives of Research" 《探索的动机》)
Most research in the systems field falls into two categories. One starts from a new kind of hardware, such as our programmable NICs, RDMA NICs, NVMe and NVM in high-speed storage, or the SGX and TSX instruction extensions inside CPUs, which brings many new possibilities to system design. The other starts from a new application scenario, such as the deep learning everyone talks about today, which brings many new challenges to system design. But if these were the only two kinds of research in the systems field, it would not be a respected research field, just as weeds alone do not make a forest. For a new piece of hardware or a new application scenario, even without scientists who specialize in systems, engineers would find ways to exploit these possibilities and meet these challenges.
So what is it that draws so many clever people into systems research? I think Einstein's "Motives of Research" puts it well, so I adapt it here. First there is a negative motive: as Schopenhauer said, one of the strongest motives that lead people to art and science is the desire to escape everyday life with its painful crudity and hopeless dreariness, and to break the fetters of one's own ever-shifting desires. In an engineering project there are always many non-technical factors, many legacy issues, and many bugs in tools and hardware. Most of the time in an engineering project is spent on such work, which demands little creativity and which today's AI still cannot do well for us; so the finely tempered systems engineer longs to escape from this complexity into a world of objective perception.
Besides this negative motive there is also a positive one. Man tries to make for himself, in the fashion that suits him best, a simplified and intelligible picture of the world; he then tries to substitute this world-system of his for the world of experience, and thus to conquer it. This is what the painter, the poet, the speculative philosopher and the natural scientist do, each in his own way. The systems researcher must keep the subject of study under strict control, that is, describe the most general modules of real systems. Reproducing the complexity of real-world systems with the researcher's precision and completeness is beyond the reach of human intellect. The basic abstractions underlying a system, such as IP for networks, SQL for databases, and files for operating systems, should hold universally for a large class of hardware architectures and a large class of application scenarios. With these basic abstractions in hand, it becomes possible to build a complete system by pure deduction. In the course of this construction, engineers may add the complexity of the real world back in, perhaps losing some of the elegant properties of the basic abstractions, yet we can still understand the behavior of the whole system through a chain of deduction that does not exceed human reason.
The supreme task of the systems researcher is to arrive at those universal basic abstractions from which high-performance, scalable, highly available and easily programmable systems can be built up by deduction. There is no logical path to these abstractions; only intuition, resting on an understanding born of experience, can reach them. This means a good systems researcher must first be an experienced systems engineer. Because of this methodological uncertainty, one might suppose that there are many equally valid system abstractions; this view is tenable both in theory and in practice. But the development of the systems field has shown that, at any one time, under the same hardware constraints and application scenarios, one always proves decidedly superior to all the rest. This is the "pre-established harmony" that Leibniz described so aptly. The longing to behold this pre-established harmony is the source of the systems researcher's inexhaustible perseverance and patience.
2017-10-29
Performance of in-memory key-value stores (KVS) continues to be of great importance as modern KVS goes beyond the traditional object-caching workload and becomes a key infrastructure to support distributed main-memory computation in data centers. Recent years have witnessed a rapid increase of network bandwidth in data centers, shifting the bottleneck of most KVS from the network to the CPU. RDMA-capable NICs partly alleviate the problem, but the primitives provided by the RDMA abstraction are rather limited. Meanwhile, programmable NICs have become available in data centers, enabling in-network processing. In this paper, we present KV-Direct, a high-performance KVS that leverages programmable NICs to extend RDMA primitives and enable remote direct key-value access to the main host memory.
We develop several novel techniques to maximize the throughput and hide the latency of the PCIe connection between the NIC and the host memory, which becomes the new bottleneck. Combined, these mechanisms allow KV-Direct to achieve up to 180 M key-value operations per second with a single NIC, equivalent to the throughput of tens of CPU cores. Compared with CPU-based KVS implementations, KV-Direct improves power efficiency by 3x, while keeping tail latency below 10 µs. Moreover, KV-Direct achieves near-linear scalability with multiple NICs. With 10 programmable NIC cards in a commodity server, we achieve 1.22 billion KV operations per second, almost an order of magnitude improvement over existing systems, setting a new milestone for general-purpose in-memory key-value stores.
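One of the latency-hiding ideas can be sketched abstractly: keep many independent requests in flight over PCIe instead of waiting out each round trip. The Python model below is purely illustrative; the interface, depth and numbers are assumptions, not KV-Direct's actual hardware pipeline.

```python
import collections

class InFlightPipeline:
    """Model of a pipeline that keeps up to `depth` PCIe requests
    outstanding, so throughput is bound by bandwidth rather than by the
    round-trip latency of each individual request."""
    def __init__(self, depth=64):
        self.depth = depth
        self.outstanding = collections.deque()

    def issue(self, op, dma_read):
        # If the pipeline is full, retire the oldest request first.
        if len(self.outstanding) >= self.depth:
            self.retire()
        # Issue the DMA and keep the request in flight.
        self.outstanding.append((op, dma_read(op)))

    def retire(self):
        op, result = self.outstanding.popleft()
        print("completed", op, "->", result)

pipe = InFlightPipeline(depth=2)
for key in ["a", "b", "c", "d"]:
    pipe.issue(("GET", key), lambda op: f"value_of_{op[1]}")
while pipe.outstanding:
    pipe.retire()
```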