Bojie Li
2023-06-20
KV-Direct: High-Performance In-Memory Key-Value Store with Programmable NIC
Bojie Li, Zhenyuan Ruan, Wencong Xiao, Yuanwei Lu, Yongqiang Xiong, Andrew Putnam, Enhong Chen and Lintao Zhang.
Proceedings of the 26th Symposium on Operating Systems Principles (SOSP ’17). [PDF] [Slides]
Transcription with Whisper.
2023-06-19
ClickNP: Highly Flexible and High-Performance Network Processing with Reconfigurable Hardware
Bojie Li, Kun Tan, Layong (Larry) Luo, Yanqing Peng, Renqian Luo, Ningyi Xu, Yongqiang Xiong, Peng Cheng and Enhong Chen.
Proceedings of the 2016 ACM SIGCOMM Conference (SIGCOMM ’16). [PDF] [Slides]
Transcription with Whisper.
2023-06-14
Polling and interrupts have long been a trade-off in RDMA systems. Polling has lower latency, but each CPU core can only run one thread. Interrupts enable time sharing among multiple threads but have higher latency. Many applications such as databases have hundreds of threads, far more than the number of cores, so they have to use interrupt mode to share cores among threads, and the resulting RDMA latency is much higher than the hardware limits. In this paper, we analyze the root cause of the high cost of RDMA interrupt delivery and present FastWake, a practical redesign of the interrupt-mode RDMA host network stack using commodity RDMA hardware, the Linux OS, and unmodified applications. Our first approach to fast thread wake-up removes interrupts entirely. We design a per-core dispatcher thread to poll all the completion queues of the application threads on the same core, and utilize a kernel fast path to context switch to the thread with an incoming completion event. Because this approach keeps CPUs running at 100% utilization, we also design an interrupt-based approach for scenarios with power constraints. Observing that waking up a thread on the same core as the interrupt is much faster than waking threads on other cores, we dynamically adjust RDMA event queue mappings to improve interrupt core affinity. In addition, we revisit the kernel path of thread wake-up and remove the overheads in the virtual file system (VFS), locking, and process scheduling. Experiments show that FastWake can reduce RDMA latency by 80% on x86 and 77% on ARM at the cost of <30% higher power utilization than traditional interrupts, and the latency is only 0.3~0.4 μs higher than the limits of the underlying hardware. When power saving is desired, our interrupt-based approach can still reduce interrupt-mode RDMA latency by 59% on x86 and 52% on ARM.
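A minimal sketch of the first idea, not code from the paper: one dispatcher per core polls the completion queues of all application threads pinned to that core. FastWake switches to the woken thread through a kernel fast path; the eventfd write below is only a userspace stand-in for that step, used here to keep the sketch self-contained.

```c
/*
 * Sketch of a per-core polling dispatcher (illustration only, not the
 * paper's implementation). Each worker entry describes one application
 * thread pinned to this core: its RDMA completion queue and an eventfd
 * it sleeps on (a stand-in for FastWake's kernel fast path).
 */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <unistd.h>

struct worker {
    struct ibv_cq *cq;  /* completion queue owned by one application thread */
    int wake_fd;        /* eventfd the thread blocks on */
};

void dispatcher_loop(struct worker *workers, int n)
{
    struct ibv_wc wc;
    for (;;) {
        for (int i = 0; i < n; i++) {
            /* A positive return value means a completion arrived for this thread. */
            if (ibv_poll_cq(workers[i].cq, 1, &wc) > 0) {
                uint64_t one = 1;
                write(workers[i].wake_fd, &one, sizeof(one)); /* wake the owning thread */
            }
        }
    }
}
```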
Publication
Bojie Li, Zihao Xiang, Xiaoliang Wang, Han Ruan, Jingbin Zhou, and Kun Tan. FastWake: Revisiting Host Network Stack for Interrupt-mode RDMA. In 7th Asia-Pacific Workshop on Networking (APNET 2023), June 29–30, 2023, Hong Kong, China. [Paper PDF] [Slides PPTX] [Slides PDF] [Video] [Talk Transcript]
People
- Bojie Li, Technical Expert at Computer Networking and Protocol Lab, Huawei.
- Zihao Xiang, Senior Developer at Computer Networking and Protocol Lab, Huawei.
- Xiaoliang Wang, Associate Professor, Nanjing University.
- Han Ruan, Senior Technical Planning Expert at Computer Networking and Protocol Lab, Huawei.
- Jingbin Zhou, Director of Computer Networking and Protocol Lab, Huawei.
- Kun Tan, Director of Distributed and Parallel Software Lab, Huawei.
2023-06-11
(This article is a compilation of the author’s talk at Peking University on December 12, 2022: the conference recording was first converted into a draft using iFlytek speech recognition, then polished with GPT-4 to correct recognition errors, and finally supplemented with some new thoughts added by hand)
- Part 1: The New Golden Age of Computer Networks (Part 1): Data Centers
- Part 2: The New Golden Age of Computer Networks (Part 2): Wide Area Networks
Wireless networks are a very broad field, corresponding to two of Huawei’s major business units: one is the Wireless product line, and the other is the Consumer BG. Wireless mainly refers to the familiar 5G and Wi-Fi, while the Consumer BG covers all kinds of smart terminals, including mobile phones.
At the beginning of the last chapter on wide area networks, we mentioned that current transport protocols do not fully utilize the bandwidth of wireless networks and wide area networks, so many applications never actually experience the hundreds of Mbps claimed by 5G and Wi-Fi. This is what we often call the “last mile” problem. As the performance of wireless networks gets closer and closer to that of wired networks, some optimizations originally designed for data centers will also apply to wireless networks. Previously, when we talked about distributed systems we thought of data centers, but the many terminal devices and smart home devices in a home now also form a distributed system. In the future, a family may be a mini data center.
2023-05-28
(This article is a compilation of my speech at Peking University on December 12, 2022, first converted into a draft using iFlytek’s voice recognition technology, then polished and corrected using GPT-4, and finally supplemented with some new thoughts manually)
Wide area network (WAN) communication mainly falls into two categories: end-cloud communication and inter-cloud communication. Let’s start with end-cloud.
End-Cloud Networks
When we mention WANs, we generally assume they are uncontrollable: the operators’ network equipment is not under our control, and a large number of other users access it concurrently, making determinism hard to achieve. Yet many of today’s applications require a certain degree of determinism, such as video conferencing and online games, where users feel lag if the delay is too high. How do we reconcile this contradiction? That is our topic today.
As we mentioned in the previous chapter on data center networks, the bandwidth actually perceived by applications is much less than the physical bandwidth, so there is room for optimization. The theoretical bandwidth of both 5G and Wi-Fi is hundreds of Mbps or even Gbps, and many home broadband connections also reach hundreds of Mbps or even gigabit speeds. In theory, 100 MB of data can be transmitted in one or two seconds. But how often does downloading a 100 MB app from the app store actually finish in one or two seconds? As another example, a compressed 4K video needs a transmission rate of only 15~40 Mbps, which sounds far below the theoretical bandwidth limit, yet how many network environments can play 4K video smoothly? This is partly a problem of the end-side wireless network and partly a problem of the WAN. There is still a long way to go before we make good use of the theoretical bandwidth.
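A quick back-of-the-envelope check of that figure (a calculation added here for illustration, not part of the original talk):

$$
100\ \text{MB} \times 8\ \tfrac{\text{bit}}{\text{byte}} = 800\ \text{Mbit}, \qquad
\frac{800\ \text{Mbit}}{400\ \text{Mbit/s}} = 2\ \text{s}, \qquad
\frac{800\ \text{Mbit}}{1\ \text{Gbit/s}} = 0.8\ \text{s},
$$

so at a few hundred Mbps to a gigabit, a 100 MB download should indeed finish in about one to two seconds if the full bandwidth were usable.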
When I was interning at Microsoft, the Chinese restaurant on the second floor of the Microsoft building was called “Cloud + Client”, and the backdrop of the sky garden on the 12th floor read “cloud first, mobile first”. Data centers and smart terminals were indeed the two hottest fields from 2010 to 2020. Unfortunately, Microsoft’s mobile business never took off. Huawei, on the other hand, has strong capabilities on both the end side and the cloud side, giving it a unique advantage in end-cloud collaborative optimization.
2023-05-27
(This article is organized based on the speech I delivered at Peking University on December 12, 2022, first converting the conference recording into a draft using iFlytek’s speech recognition, then polishing it with GPT-4 to correct errors from voice recognition, and finally manually adding some new thoughts)
I am very grateful to Professor Huang Qun and Professor Xu Chenren for the invitation, and it is an honor to come to Peking University to give a guest lecture for their computer networking course. I heard that you are all the best students at Peking University, and I could only dream of attending Peking University in my days. It is truly an honor to have the opportunity to share with you some of the latest developments in the academic and industrial fields of computer networking today.
Turing Award winner David Patterson gave a very famous speech in 2019 called “A New Golden Age for Computer Architecture”, which talked about the end of Moore’s Law for general-purpose processors and the historic opportunity for the rise of Domain-Specific Architectures (DSA). What I am going to talk about today is that computer networking has also entered a new golden age.
The computer networks we come into contact with daily mainly consist of three parts: wireless networks, wide area networks, and data center networks. They provide the communication foundation for a smart world of interconnected things.
Among them, the terminal devices of wireless networks include mobile phones, PCs, watches, smart home appliances, smart cars, and various other devices. These devices usually access the network wirelessly (for example, over Wi-Fi or 5G). After passing through 5G base stations or Wi-Fi access points, their traffic enters the wide area network, which also contains CDN servers belonging to edge data centers. The traffic then reaches the data center network, where there are many different types of devices, such as gateways and servers.
Today, I will introduce you to data center networks, wide area networks, and terminal wireless networks. First, let’s look at data center networks. The biggest change in data center networks is the evolution from simple networks designed for simple web services to networks designed for large-scale heterogeneous parallel computing, performing tasks traditionally handled by supercomputers, such as AI, big data, and high-performance computing.
2023-05-27
**(Article from the WeChat public account of the Intelligent Manufacturing Society, original link; many thanks to the Intelligent Manufacturing Society for their excellent questions and editing)**
What impact will AI ultimately have on the technology and life of human society?
With the release of GPT-4, the performance of large-model AI has once again stretched the public’s imagination. The content produced by AIGC is becoming more realistic and refined. As data cleaning and training continue to deepen, AI’s understanding of natural language has also made great progress. From passively accepting data “feeding” to actively asking the world questions, perhaps the “artificial intelligent life” of science fiction is no longer far away from us.
Anxiety is inevitable, and “AI unemployment” seems to be really happening in some industries. On May 18, 2023, local time, BT, the largest telecom operator in the UK, said it would cut 40,000 to 55,000 jobs by 2028-2030. The layoffs will include both BT’s direct employees and third-party contractors, reducing the company’s total headcount by 31-42%. BT currently has about 130,000 employees.
BT’s CEO Philip Jansen said publicly that after completing fiber deployment, digitizing its ways of working, adopting artificial intelligence (AI), and simplifying its structure, the company will rely on a smaller workforce and a significantly reduced cost base: “the new BT Group will be a leaner enterprise with a brighter future”. Back home, some Internet technology companies have shown similar trends, with the art outsourcing positions at game companies hit especially hard.
On this issue, Li Bojie, an assistant scientist at Huawei’s 2012 Laboratories, said that some of the public’s anxiety has been magnified by the media. AI technology is not some monstrous force that replaces humans; rather, it liberates productivity and creates new kinds of jobs. “For example, look at the past industrial revolutions: people who used to farm now have to use machines. The education they need, and the changes to society, the economy, and people’s way of life, are all enormous.”
Li Bojie believes that after AI technology becomes widespread and becomes a new production tool, more industries and occupations will emerge in response. “For example, after computers appeared, there was no longer any need for copyists to copy documents by hand, right? AI is the same. Some industries that directly involve people, like the service industry, it cannot replace, right? But for work that follows rules and fixed patterns, AI can save a lot of labor.”
As a researcher of data center network technology closely related to AI, Li Bojie has offered many views and reflections on AI. Below is a record of the conversation between Xiao Zhi, lead writer of the Intelligent Manufacturing Society, and Li Bojie:
2023-05-25
This is an old article of mine from five years ago. On a winter night in early 2018, I set up a large satellite dish by myself at the Ming Tombs and sent a small part of human knowledge toward Sirius, 8.6 light-years away. The story behind this can be found here. Today, what concerns us is that sending messages to potential extraterrestrial civilizations obviously requires making them recognize that the message was sent by an intelligent life form, and also making them understand it.
A very basic question: how do you prove your level of intelligence in the message? In other words, if I were an intelligent life form monitoring cosmic signals, how would I determine whether a batch of signals contains intelligence? And since intelligence is not a binary matter of presence or absence but a matter of degree, how do we measure how much intelligence these signals contain? I think my thoughts from five years ago are still interesting, so I am organizing and posting them here.
The message is just a string. Imagine we could intercept all communications from aliens and concatenate them into a long message. How much intelligence does it contain? This is not an easy question to answer.
The usual approach is to try to decode the message and see whether it expresses basic knowledge from mathematics, physics, astronomy, logic, and so on. The Arecibo message of 1974 was encoded in this way, in the hope of attracting the attention of extraterrestrial civilizations. I tried instead to find a purely computational way to measure the level of intelligence contained in a message.
2023-04-20
This WeChat Moment from yesterday sparked a lot of discussion both inside and outside the company, and many people reached out to me.
I considered it from two perspectives: innovation and commercialization.
2023-01-29
Long read warning: Part two of the “Five Years of PhD at MSRA” series, about 13,000 words, to be continued…
KV-Direct, published at SOSP 2017, was my second first-author paper. Since my first SIGCOMM paper, ClickNP, was done with Kun Tan guiding me step by step, KV-Direct was the first paper I led on my own.
What to Do After SIGCOMM
After submitting the SIGCOMM paper, Kun Tan said that for the next project I needed to come up with the direction on my own.
Compiler or Application?
We were well aware that ClickNP still had many issues, with the current support for compilation optimization being too simple. We hoped to enhance the compiler’s reliability from the perspective of programming languages. At the same time, we used ClickNP as a common platform for network research within our group to incubate more research ideas.
Naturally, I explored along two directions, one was to extend ClickNP to make it easier to program and more efficient; the other was to use the ClickNP platform to develop new types of network functions to accelerate various middlewares in the network. At that time, we were exploring many middlewares in parallel, such as encryption/decryption, machine learning, message queues, layer 7 (HTTP) load balancers, key-value stores, all of which could be accelerated with FPGAs.
To improve the programmability of ClickNP, I started looking for talented students to join MSRA as interns. Yi Li had been interested in programming languages and formal methods since his undergraduate studies, and he was the first student I recommended for an MSRA internship. At the start of the spring semester, Yi Li came to MSRA for his internship, which coincided with finishing his undergraduate thesis. He proposed several key optimizations for the ClickNP system, added syntax that simplified programming, and corrected some awkward syntax.
However, due to the workload involved, we did not do a major overhaul of the compilation framework: it still used simple syntax-directed translation, without a professional compiler framework such as Clang or an intermediate representation. As a result, each new compilation optimization we added felt rather ad hoc.
After running into various strange issues with OpenCL, I had the idea of building a high-level synthesis (HLS) tool myself that would generate Verilog directly from OpenCL. My idea was simple: for application scenarios in the networking domain, what we do is unroll all the loops in a piece of C code into one large block of combinational logic. By inserting registers at appropriate positions, this block becomes a pipeline with extremely high throughput, capable of accepting one input every clock cycle. If the code accesses global state, the resulting loop-carried dependency path cannot be broken up with registers, so the logic delay along that path sets the upper limit on the clock frequency.
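As an illustration of the kind of code this approach targets (a toy example of my own, not actual ClickNP or HLS-tool code; the function, pragma, and constants are made up for illustration), consider a fixed-trip-count loop over a key. Fully unrolled, the additions become a tree of adders that registers can pipeline; a read-modify-write on a persistent global counter, by contrast, forms a loop-carried dependency that bounds the clock frequency.

```c
#define KEY_LEN 16  /* fixed trip count, so the loop can be fully unrolled */

/* Sum the bytes of a key. With the loop unrolled, the 16 additions become
 * a tree of adders; inserting registers between adder stages turns it into
 * a pipeline that accepts a new key every clock cycle. */
unsigned int key_checksum(const unsigned char key[KEY_LEN])
{
    unsigned int sum = 0;
#pragma unroll
    for (int i = 0; i < KEY_LEN; i++)
        sum += key[i];
    return sum;
}

/* A global accumulator updated on every input creates a loop-carried
 * dependency: the add feeding back into `total` cannot be pipelined away,
 * so its combinational delay bounds the achievable clock frequency. */
static unsigned long long total;

void account(const unsigned char key[KEY_LEN])
{
    total += key_checksum(key);
}
```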
However, Kun Tan disagreed with the idea of building an HLS tool ourselves, because we were not professional FPGA researchers. Such work lacked novelty: it was more about filling the gaps of existing HLS tools, an engineering problem, and would be hard to publish as a top-tier paper in either the FPGA or the networking community.
Because of frequent issues with FPGA card programming, I ended up plugging and unplugging FPGA cards in the server room every day, sometimes debugging on site. Thus, just like my undergraduate days in the server room of the School of the Gifted Young, I often spent hours in the machine room, enduring the cold air and noise of over 80 decibels.