Bojie Li (李博杰)
2018-02-14
2018-01-03
2018-01-01
Fault tolerance is critical for distributed applications. Many request-serving and batch-processing frameworks have been proposed to simplify the programming of fault-tolerant distributed systems; they essentially ask programmers to separate state from computation and store the state in a fault-tolerant system. However, many existing applications (e.g. Node.js, Memcached and Python in TensorFlow) do not support fault tolerance, and fault-tolerant systems are often slower than their non-fault-tolerant counterparts. In this work, we take up the challenge of achieving transparent and efficient fault tolerance for general distributed applications. Challenges include process migration, deterministic replay and distributed snapshots.
2018-01-01
To improve performance and reduce CPU overhead for network applications, programmable switches and NICs have been introduced in data centers to offload virtualized network functions, transport protocols, key-value stores, distributed consensus and resource disaggregation. Compared to general-purpose processors, programmable switches and NICs have more limited resources and only support a more constrained programming model. To this end, developers typically split a network function into a data plane to process common-case packets and a control plane to handle the remaining cases. The data plane function is then implemented in a packet processing language (e.g. P4) and offloaded into hardware.
Writing packet programs for network application offloading can be hard labor. First, even if the protocol specification (or source code) is available, the developer needs to read the thousand-page book (or code) and figure out which parts are the common cases. Second, many implementations deviate subtly from the specification, so the developer often needs to examine packet traces and reverse-engineer the implementation-specific behaviors manually. Third, the offloaded function needs to be rewritten whenever the application is updated (e.g. regular expressions in a firewall).
We design P4Coder, a system that automatically synthesizes the data plane by learning the behavior of a reference network application. No formal specification or source code is required. The developer only needs to design a few data-plane test cases and run the reference application. P4Coder captures the input and output packets, and searches for a packet program that produces identical output packets for the sequence of input packets. Obviously, passing the test cases does not imply that the program will generalize correctly to other inputs.
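To make the idea of searching for a packet program from captured input/output pairs concrete, here is a minimal sketch in Python. It is not P4Coder's algorithm: the packet model (a dict of integer header fields), the four candidate actions and the function names `candidate_actions` and `synthesize` are all assumptions made up for illustration, and the search simply tries each action per output field.

```python
# Toy example only: packets are dicts of integer header fields, and the
# "program" assigns one action from a small, made-up action set to each field.

def candidate_actions(field, fields, constants):
    """Enumerate (name, function) candidates that compute one output field."""
    yield ("keep", lambda pkt: pkt[field])
    yield ("decrement", lambda pkt: pkt[field] - 1)
    for other in fields:
        if other != field:
            yield (f"copy {other}", lambda pkt, o=other: pkt[o])
    for c in constants:
        yield (f"const {c}", lambda pkt, c=c: c)


def synthesize(examples):
    """Return {field: action name} consistent with all (input, output) pairs.

    Returns None if some field cannot be explained by any candidate action.
    """
    fields = sorted(examples[0][0])
    constants = sorted({v for _, out in examples for v in out.values()})
    program = {}
    for field in fields:
        for name, fn in candidate_actions(field, fields, constants):
            if all(fn(inp) == out[field] for inp, out in examples):
                program[field] = name      # first consistent action wins
                break
        else:
            return None
    return program


# Captured behavior of a hypothetical reference application that swaps
# source and destination and decrements the TTL.
examples = [
    ({"src": 1, "dst": 2, "ttl": 64}, {"src": 2, "dst": 1, "ttl": 63}),
    ({"src": 7, "dst": 3, "ttl": 10}, {"src": 3, "dst": 7, "ttl": 9}),
]
program = synthesize(examples)
```

With both examples the search finds `copy src` for `dst`, `copy dst` for `src` and `decrement` for `ttl`. With only the first example, `dst` is explained by `decrement` (2 - 1 = 1), which passes the test case but is the wrong program, a small instance of the generalization caveat noted above.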
2018-01-01
Servers in data centers host an increasing variety of PCIe devices, e.g. GPUs, NVMe SSDs, NICs, accelerator cards and FPGAs. For high throughput and low latency, CPU-bypass direct communication among PCIe devices (e.g. GPU-Direct, NVMe-OF) is flourishing. However, many PCIe devices are designed to talk only to drivers on the CPU, while their PCIe register and DMA interfaces are intricate and potentially undocumented. To capture PCIe packets and debug PCIe protocol implementations, developers need PCIe protocol analyzers, which are expensive (~$250K), hard to deploy in production environments and unable to modify the PCIe TLP packets that pass through.
In this work, we design and implement a transparent PCIe debugger and gateway with a commodity FPGA-based PCIe board. The PCIe gateway captures packets as a bump in the wire between a target PCIe device (e.g. a NIC) and the CPU. Because PCIe has fixed routing, it is impossible to perform an ARP-spoofing-like attack on the PCIe fabric. However, we can spoof the device driver to redirect the PCIe traffic through our PCIe gateway. The communication between a PCIe device and the CPU falls into two categories according to the initiator.
2018-01-01
Analytical database queries are critical for supporting business decisions. Because these queries involve complicated computation over a large corpus of data, their execution typically takes minutes to hours. When information in the database is updated, the user needs to re-execute the query on the current snapshot of the database, which again takes a long time, so the result reflects a stale snapshot by the time it arrives. In this rapidly changing world, business intelligence should react to information updates in real time.
To this end, we design ReactDB, a new database that offers fast analytical queries and reacts to database updates.
ReactDB is reactive in two ways. First, cached analytical queries are reactive to updates in the database. We observe that many analytical queries are repetitive, so we cache the intermediate results of frequent analytical queries. When data is updated, the cached results and ongoing transactions are updated incrementally in real time. This enables cached queries to complete immediately. The user may even subscribe to an analytical query and receive an updated query result whenever the database is updated.
Second, in ReactDB, the physical data layout and indexes are reactive to data access patterns. Different queries need different physical data layouts and indexes for efficient access. Traditionally, these need to be tuned manually by the DBA, which may be suboptimal for certain workloads.
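The first kind of reactivity, incrementally updating a cached query result and notifying subscribers, can be illustrated with a minimal Python sketch. This is not ReactDB's implementation: the class `CachedGroupSum`, its methods and the single grouped-sum query it models are hypothetical, chosen only to show how an update can be applied to a cached result without re-running the query.

```python
from collections import defaultdict


class CachedGroupSum:
    """Cached result of: SELECT key, SUM(value) FROM t GROUP BY key."""

    def __init__(self, rows=()):
        self.sums = defaultdict(int)
        self.counts = defaultdict(int)   # rows per group, to detect empty groups
        self.subscribers = []
        for key, value in rows:
            self._apply(key, value, +1)

    def _apply(self, key, value, sign):
        # O(1) work per changed row instead of rescanning the table.
        self.sums[key] += sign * value
        self.counts[key] += sign
        if self.counts[key] == 0:        # last row of the group was deleted
            del self.sums[key]
            del self.counts[key]

    def subscribe(self, callback):
        """Register callback(key, new_sum_or_None) to run after every update."""
        self.subscribers.append(callback)

    def on_insert(self, key, value):
        self._apply(key, value, +1)
        self._notify(key)

    def on_delete(self, key, value):
        self._apply(key, value, -1)
        self._notify(key)

    def _notify(self, key):
        for callback in self.subscribers:
            callback(key, self.sums.get(key))

    def result(self):
        """The cached query answer, available immediately."""
        return dict(self.sums)


view = CachedGroupSum([("eu", 10), ("us", 5), ("eu", 7)])
seen = []
view.subscribe(lambda key, total: seen.append((key, total)))
view.on_insert("us", 20)
view.on_delete("eu", 10)
view.on_delete("eu", 7)
```

After these three updates, `view.result()` is `{"us": 25}` and the subscriber has seen `("us", 25)`, `("eu", 7)` and `("eu", None)`. Sums and counts are easy to maintain this way; aggregates such as MIN or MAX need extra state to handle deletions, which this sketch does not attempt.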
2018-01-01
Accelerators such as GPUs, TPUs and FPGAs are deployed at scale in data centers to accelerate many online serving applications, e.g. machine learning inference, image processing, encryption and compression. These applications typically receive requests from the network, do pre-processing, call a computationally intensive routine, do post-processing and finally send the response to the network. With accelerators, the computationally intensive routine is replaced by an RPC to the accelerator device. Here the challenge arises: what should the CPU do while waiting for the accelerator?
The traditional approach is to relinquish the CPU after sending the offloading request, so that the OS scheduler switches to another thread. However, a context switch in Linux takes several microseconds, and a fine-grained offloaded task itself takes only several to tens of microseconds, so it soon completes and wakes up the thread again. The context switch overhead not only wastes CPU, but also adds thread wake-up latency to the request processing latency. A second approach is to busy-wait until the offloaded task completes, which obviously wastes CPU. A third approach is to rewrite the application to do other jobs within the thread while waiting for the accelerator. In this work, we build a library that transparently finds and executes non-conflicting jobs within the thread, without modifying application code.
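The third approach, doing other jobs within the same thread while an offloaded task is in flight, can be sketched with Python generators. This is the hand-written rewrite, not the transparent library described above, and `FakeAccelerator`, `handle_request` and `run` are hypothetical names; the fake device simply completes a task after a fixed number of polls.

```python
from collections import deque


class FakeAccelerator:
    """Stand-in for a real device: a task completes after a fixed number of polls."""

    def __init__(self, polls_needed=2):
        self.polls_needed = polls_needed

    def submit(self, payload):
        return {"payload": payload, "polls_left": self.polls_needed}

    def poll(self, task):
        """Return the result if the task has finished, else None."""
        if task["polls_left"] > 0:
            task["polls_left"] -= 1
            return None
        return task["payload"].upper()


def handle_request(accel, request, log):
    """One request, written as a generator that yields while the device is busy."""
    log.append(f"pre {request}")          # pre-processing on the CPU
    task = accel.submit(request)          # offload the heavy routine
    result = yield task                   # hand the thread to other jobs
    log.append(f"post {result}")          # post-processing on the CPU


def run(jobs, accel):
    """Round-robin over jobs in one thread, with no OS context switch."""
    ready = deque((job, None) for job in jobs)   # (generator, pending task)
    while ready:
        job, task = ready.popleft()
        try:
            if task is None:
                task = next(job)                 # run until the first offload
            else:
                result = accel.poll(task)
                if result is None:
                    ready.append((job, task))    # still busy: try another job
                    continue
                task = job.send(result)          # resume post-processing
            ready.append((job, task))
        except StopIteration:
            pass                                 # this request is finished


log = []
accel = FakeAccelerator(polls_needed=2)
run([handle_request(accel, "a", log), handle_request(accel, "b", log)], accel)
```

Here `log` ends up as `["pre a", "pre b", "post A", "post B"]`: the second request's pre-processing runs while the first request's offloaded task is still pending. The cost is that the application must be restructured around explicit yield points, which is the burden a transparent library aims to remove.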
2017-12-21
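USTC Blog has shut down for good, at the age of five.
WordPress is simply too bloated. I had long wanted to move to a static blog, so I took this opportunity to migrate to Hexo.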
hexo-migrator-wordpress
The migration tool is not perfect, and the formatting problems will have to be fixed gradually. The comments were also lost.
2017-11-13
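(Reposted from 科大新创公益基金会, the USTC Xinchuang charity foundation)
SOSP 2017 (the Symposium on Operating Systems Principles), a top international academic conference in computer systems, was recently held in Shanghai. Since the first SOSP in 1967, more than half of the material in operating systems and distributed systems textbooks has come out of SOSP. Systems researchers therefore widely regard publishing a paper at SOSP as an honor. Of the 39 papers accepted to SOSP this year, only two have first authors from mainland China, and one of them is the KV-Direct system, co-authored by Bojie Li (李博杰), a third-year PhD student at the University of Science and Technology of China (USTC), and Zhenyuan Ruan (阮震元), a fourth-year undergraduate. This is also the first paper USTC has published at SOSP. How did Zhenyuan Ruan, as an undergraduate, work step by step toward USTC's first-ever paper at SOSP?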