Doctoral Thesis: High Performance Data Center Systems Based on Programmable Network Cards
Doctoral Thesis from University of Science and Technology of China, Author: Bojie Li
Chinese Version: High Performance Data Center Systems Based on Programmable Network Cards (PDF, 8 MB)
AI Translated Unofficial English Version: High Performance Data Center Systems with Programmable Network Interface Cards (PDF, 8 MB)
Publication Date: 2019-05-26.
Abstract
Data centers are the infrastructure supporting today's Internet services, and they face challenges from both hardware and applications. On the hardware side, the performance improvement of general-purpose processors is gradually slowing down; on the application side, the demand for computing power from big data and machine learning keeps growing. Unlike easily parallelizable Web services, big data and machine learning require more communication between computing nodes, which has driven rapid improvement of data center network performance and also put higher demands on the performance of shared data storage. However, the network and storage infrastructure of data centers mainly uses software processing on general-purpose processors, whose performance lags behind the rapidly growing performance of network, storage, and customized computing hardware, and increasingly becomes the bottleneck of the system. At the same time, flexibility is also an important requirement in cloud data centers. To provide both high performance and flexibility, programmable network cards have been widely deployed in data centers in recent years, using customizable hardware such as Field Programmable Gate Arrays (FPGAs) to accelerate virtual networks.
This paper explores high-performance data center systems based on programmable network cards. Programmable network cards can accelerate not only virtual networks but also network functions, data structures, and operating systems. To this end, this paper implements full-stack acceleration of compute, network, and in-memory storage nodes in cloud computing data centers with FPGA-based programmable network cards.
First, this paper proposes to accelerate virtual network functions in cloud computing with programmable network cards, and designs and implements ClickNP, the first FPGA-accelerated network function processing platform on commodity servers that offers both high flexibility and high performance. To simplify FPGA programming, this paper designs a C-like ClickNP language and a modular programming model, and develops a series of optimization techniques to fully exploit the massive parallelism of FPGAs. The ClickNP development toolchain integrates with multiple commercial high-level synthesis tools. Based on ClickNP, this paper designs and implements more than 200 network elements and builds various network functions from these elements. Compared with CPU-based software network functions, ClickNP improves throughput by 10 times and reduces latency to 1/10.
Second, this paper proposes to accelerate remote data structure access with programmable network cards. Based on the ClickNP programming framework, this paper designs and implements KV-Direct, a high-performance in-memory key-value storage system that bypasses the server-side CPU and uses the programmable network card to directly access data structures in remote host memory through PCIe. By extending the memory semantics of one-sided RDMA to key-value semantics, KV-Direct avoids the high communication and synchronization overhead of operating data structures with one-sided RDMA. Leveraging the reconfigurability of FPGAs, KV-Direct further allows users to implement more complex data structures. To overcome the limited PCIe bandwidth and high latency between the network card and host memory, KV-Direct applies a series of performance optimizations, including hash table design, memory allocation, an out-of-order execution engine, load balancing and caching, and vector operations. KV-Direct achieves 10 times the energy efficiency of a CPU with microsecond-level latency, and is the first general-purpose key-value storage system to reach 1 billion operations per second on a single machine.
Finally, this paper proposes to combine programmable network cards with a user-space runtime library to provide socket communication primitives for applications, bypassing the operating system kernel. This paper designs and implements SocksDirect, a user-space socket system that is fully compatible with existing applications, achieves throughput and latency close to hardware limits, scales across multiple cores, and maintains high performance under highly concurrent loads. Intra-host and inter-host communication are implemented with shared memory and RDMA, respectively. To support a large number of concurrent connections, this paper implements an RDMA programmable network card based on KV-Direct. By eliminating overheads such as inter-thread synchronization, buffer management, large payload copying, and process wakeup, SocksDirect improves throughput by 7 to 20 times compared to Linux, reduces latency to 1/17 to 1/35, and reduces the HTTP latency of a Web server to 1/5.5.
Keywords
Data center; Programmable network card; Field Programmable Gate Array; Network function virtualization; Key-value storage; Network protocol stack
Abstract
Data centers are the infrastructure that hosts Internet services around the world. They face challenges on both the hardware and the application side. On the hardware side, performance improvement of general-purpose processors is slowing down. On the application side, big data and machine learning impose ever-increasing computational power requirements. Different from Web services, which are easy to parallelize, big data and machine learning require more communication among compute nodes, which pushes data center network performance to improve rapidly and places higher requirements on shared data storage performance. However, networking and storage infrastructure services in data centers still mainly use software processing on general-purpose processors, whose performance lags behind the rapidly increasing performance of hardware in networking, storage and customized computing. As a result, software processing becomes a bottleneck in data center systems. In the meantime, flexibility is also of great importance in cloud data centers. To provide high performance and flexibility at the same time, recent years have witnessed large-scale deployment of programmable NICs (Network Interface Cards) in data centers, which use customized hardware such as FPGAs to accelerate network virtualization services.

This thesis explores high performance data center systems with programmable NICs. Besides accelerating network virtualization, programmable NICs can also accelerate network functions, data structures and operating systems. For this purpose, this thesis uses FPGA-based programmable NICs for full-stack acceleration of compute, network and in-memory storage nodes in cloud data centers.

First, this thesis proposes to accelerate virtualized network functions in the cloud with programmable NICs. This thesis proposes ClickNP, the first FPGA-accelerated network function processing platform on commodity servers with high flexibility and high performance.
To simplify FPGA programming, this thesis designs a C-like ClickNP language and a modular programming model, and also develops optimization techniques to fully exploit the massive parallelism inside FPGAs. The ClickNP toolchain integrates with multiple commercial high-level synthesis tools. Based on ClickNP, this thesis designs and implements more than 200 network elements, and constructs various network functions from these elements. Compared to CPU-based software network functions, ClickNP improves throughput by 10 times and reduces latency to 1/10.
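The modular model described above can be illustrated with a small sketch. This is not ClickNP code: real elements are written in the thesis's C-like language and compiled to FPGA logic by HLS tools, and all names below (`Element`, `DecTTL`, `build_pipeline`) are invented for illustration. The sketch shows the core idea of composing a network function from small elements connected by channels.

```python
# Hypothetical Python sketch of a ClickNP-style element pipeline.
# Each element processes a packet and either forwards it downstream
# (a FIFO channel on the FPGA) or drops it.
from dataclasses import dataclass


@dataclass
class Packet:
    ttl: int
    length: int


class Element:
    """Base class: an element transforms a packet or drops it (None)."""

    def __init__(self):
        self.count = 0    # per-element statistics
        self.next = None  # downstream element

    def process(self, pkt):
        raise NotImplementedError

    def push(self, pkt):
        self.count += 1
        out = self.process(pkt)
        if out is not None and self.next is not None:
            self.next.push(out)


class DecTTL(Element):
    """Decrement TTL; drop packets whose TTL has expired."""

    def process(self, pkt):
        if pkt.ttl == 0:
            return None
        pkt.ttl -= 1
        return pkt


class Counter(Element):
    """Pass-through element that only counts packets."""

    def process(self, pkt):
        return pkt


def build_pipeline(*elements):
    """Wire elements into a chain, mimicking channel connections."""
    for a, b in zip(elements, elements[1:]):
        a.next = b
    return elements[0]


dec, sink = DecTTL(), Counter()
head = build_pipeline(dec, sink)
for pkt in [Packet(64, 100), Packet(0, 60), Packet(1, 40)]:
    head.push(pkt)
print(dec.count, sink.count)  # prints: 3 2
```

On an FPGA, each element becomes its own always-running hardware block and the channels are FIFOs, so all elements process different packets concurrently; the sequential `push` chain here only models the dataflow, not the parallelism.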
Second, this thesis proposes to accelerate remote data structure access with programmable NICs. This thesis designs and implements KV-Direct, a high performance in-memory key-value storage system based on the ClickNP programming framework. KV-Direct bypasses the CPU on the server side and uses the programmable NIC to directly access data structures in remote host memory via PCIe. KV-Direct extends the memory semantics of one-sided RDMA to key-value semantics and therefore avoids the communication and synchronization overheads in data structure operations. KV-Direct further leverages the reconfigurability of FPGAs to enable users to implement more complicated data structures. To tackle the performance challenge of limited PCIe bandwidth and high latency between the NIC and host memory, this thesis designs a series of optimizations including hash table design, memory allocation, an out-of-order execution engine, load balancing, caching and vector operations. KV-Direct achieves 10 times the power efficiency of a CPU and microsecond-scale latency. KV-Direct is the first general key-value storage system that achieves 1 billion operations per second on a single server.
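A back-of-envelope sketch can show why extending one-sided RDMA memory semantics to key-value semantics pays off. This is not KV-Direct's actual design; the hash-table layout and function names below are invented. The point is the round-trip count: with one-sided RDMA READs, every pointer dereference in the remote hash chain costs a network round trip, whereas an in-NIC key-value engine walks the chain next to host memory and answers in one round trip.

```python
# Hypothetical comparison of GET cost: client-side one-sided RDMA READs
# versus a key-value operation executed on the server's programmable NIC.
HASH_BUCKETS = 8


def build_table(pairs):
    """Chained hash table laid out as 'remote host memory'."""
    table = [[] for _ in range(HASH_BUCKETS)]
    for k, v in pairs:
        table[hash(k) % HASH_BUCKETS].append((k, v))
    return table


def get_via_rdma_reads(table, key):
    """Client-side GET: one RDMA READ (round trip) per chain node."""
    round_trips = 0
    for k, v in table[hash(key) % HASH_BUCKETS]:
        round_trips += 1  # READ one chain node over the network
        if k == key:
            return v, round_trips
    return None, round_trips or 1  # at least one READ of the bucket


def get_via_kv_nic(table, key):
    """GET offloaded to the NIC: the chain walk happens beside memory,
    so the client pays exactly one request/response round trip."""
    for k, v in table[hash(key) % HASH_BUCKETS]:
        if k == key:
            return v, 1
    return None, 1


# Integer keys 0, 8, 16 all collide into bucket 0, forming a chain of 3.
table = build_table([(0, "a"), (8, "b"), (16, "c")])
print(get_via_rdma_reads(table, 16))  # prints: ('c', 3)
print(get_via_kv_nic(table, 16))      # prints: ('c', 1)
```

The real system faces the inverse problem on the server side: the NIC reaches host memory over PCIe, whose bandwidth and latency are worse than local DRAM access, which is what the hash table, allocator, out-of-order execution, caching and vector-operation optimizations address.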
Lastly, this thesis proposes to co-design programmable NICs and user-space libraries to provide kernel-bypass socket communication primitives for applications. This thesis designs and implements SocksDirect, a user-space socket system that is fully compatible with existing applications, achieves throughput and latency close to hardware limits, scales across multiple cores, and preserves high performance with many concurrent connections. SocksDirect uses shared memory and RDMA for intra-host and inter-host communication, respectively. To support many concurrent connections, SocksDirect implements an RDMA programmable NIC based on KV-Direct. SocksDirect further removes overheads such as thread synchronization, buffer management, large payload copying and process wakeup. Compared to Linux, SocksDirect improves throughput by 7 to 20 times, reduces latency to 1/17 to 1/35, and reduces the HTTP latency of Web servers to 1/5.5.
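A minimal sketch, not SocksDirect's actual implementation, of the kind of single-producer/single-consumer shared-memory ring a kernel-bypass socket layer can use for intra-host communication: the sender writes only the tail index and the receiver writes only the head index, so a message is passed with no locks and no system calls. The class and method names below are invented for illustration.

```python
# Hypothetical SPSC (single-producer, single-consumer) ring buffer,
# modeling a per-connection shared-memory queue between two processes.
class SpscRing:
    def __init__(self, capacity):
        self.buf = [None] * capacity
        self.cap = capacity
        self.head = 0  # advanced only by the receiver
        self.tail = 0  # advanced only by the sender

    def send(self, msg):
        if self.tail - self.head == self.cap:
            return False  # ring full; a real stack would back off/retry
        self.buf[self.tail % self.cap] = msg
        self.tail += 1    # publish the slot after it is written
        return True

    def recv(self):
        if self.head == self.tail:
            return None   # ring empty
        msg = self.buf[self.head % self.cap]
        self.head += 1
        return msg


ring = SpscRing(2)
ring.send("GET /")
print(ring.recv())  # prints: GET /
```

In a real kernel-bypass stack the buffer lives in a shared memory segment mapped by both processes and the index updates need atomic/memory-barrier semantics; SocksDirect additionally avoids copies for large payloads and uses RDMA-based queues of the same spirit for inter-host traffic.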
Keywords
Data Center; Programmable NIC; FPGA; Network Function Virtualization; Key-Value Store; Networking Stack
Download Thesis
Chinese Version: High Performance Data Center Systems Based on Programmable Network Cards (PDF, 8 MB)
AI Translated Unofficial English Version: High Performance Data Center Systems with Programmable Network Interface Cards (PDF, 8 MB)
Download Slides
Click here to download defense PPT (Chinese) (PPTX, 3 MB)