Bojie Li
2023-08-05
(This article was first published on Zhihu)
Most major internet companies are deploying RDMA, mainly for storage and AI/HPC workloads, along two technical routes: RoCEv2 and InfiniBand.
RoCEv2 is RDMA over Ethernet: it runs the RDMA protocol on top of a traditional data center Ethernet network. InfiniBand (IB) has a longer history, and HPC (high-performance computing) clusters have relied on IB since the early 2000s.
The current leader in RDMA network cards is Mellanox, now part of NVIDIA. You could say that RoCEv2 is the community edition of RDMA and InfiniBand is the enterprise edition. The advantage of the community edition is openness, with many knobs to configure; that is also its disadvantage, because only network experts can handle it. Moreover, a large-scale RoCEv2 cluster is not something a single network expert can manage; it takes a team to deal with PFC storms and all sorts of strange problems with NICs and switches. Of course, if you only have a few machines behind one switch and all the NICs are the same model, such a small RoCEv2 cluster will rarely run into problems.
The RDMA circle is very small, and almost everyone in it has some academic background. If you have never heard of the problems above, you are better off just using IB: spend a bit more money and keep things simple. I have heard of AI companies that think buying A100/H100 GPUs is enough, cannot tell the SXM version from the PCIe version, and do not realize that large-scale training requires IB NICs and switches, assuming an ordinary 10G network will do. Such companies are best served by an AI cluster solution vendor who specifies the IB NICs, switches, and network topology for them. Don't show off, and don't try to save money by dabbling in RoCEv2.
Most of OpenAI's GPU clusters currently use InfiniBand, and some small and medium-sized AI companies use IB as well. Most newly built GPU clusters at the big tech companies, however, use RoCEv2, because they need to scale to tens of thousands of GPUs, which IB cannot reach, and cost matters a great deal at that scale. Some of them have even started building their own NICs. Another reason is that these companies have professional networking teams, and a closed system like IB leaves little room for optimization; otherwise, how would those network experts tune performance and write slide decks?
2023-08-05
(This article was first published on Zhihu)
Cache Coherency (CC) can be divided into two scenarios:
- CC between the CPU and device within the host
- CC across hosts
CC between the CPU and device within the host
I believe CC between the CPU and a device within the host is very necessary. When I was interning at Microsoft in 2017, I used an FPGA to expose a block of memory through the PCIe BAR space. I was able to boot a Linux system from this BAR-backed memory, but a startup process that should have taken only 3 seconds took 30 minutes, 600 times slower than host memory. This is because PCIe does not support CC: the CPU can only access device memory as uncacheable, so every memory access has to cross PCIe to reach the FPGA, which is extremely inefficient.
As a result, today's PCIe BAR space is only used by the CPU to issue MMIO commands to the device, and data transfer must go through device-initiated DMA. That is why both NVMe drives and RDMA NICs follow the complex doorbell-WQE/command-DMA process shown in the figure below.
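To make the doorbell-WQE-DMA flow concrete on the RDMA side, here is a minimal sketch using the standard libibverbs API (queue pair setup, memory registration, and error handling are omitted; the function name is just for illustration):

```c
#include <infiniband/verbs.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal sketch: post one send WQE and busy-poll its completion.
 * ibv_post_send() writes the WQE into host memory and rings the NIC
 * doorbell with a single MMIO write; the NIC then DMAs the WQE and the
 * payload, and finally DMAs a completion entry back into the CQ. */
int post_and_wait(struct ibv_qp *qp, struct ibv_cq *cq,
                  struct ibv_mr *mr, void *buf, size_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,        /* payload in registered memory */
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,
        .send_flags = IBV_SEND_SIGNALED, /* request a completion entry */
    };
    struct ibv_send_wr *bad_wr = NULL;

    if (ibv_post_send(qp, &wr, &bad_wr))  /* WQE write + doorbell MMIO */
        return -1;

    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)  /* wait for the DMA'd completion */
        ;                                  /* busy poll for simplicity */
    return wc.status == IBV_WC_SUCCESS ? 0 : -1;
}
```

Note how the CPU never moves data through the BAR itself: its only MMIO is the doorbell write, and all bulk data moves by DMA.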
2023-07-04
In November 2012, my blog was born on USTC Blog. In May 2013, it got its own domain, bojieli.com. In January 2015, the blog switched to a new domain, ring0.me; ring 0 is the highest privilege level in the x86 architecture, signifying my relentless pursuit of low-level systems technology.
Today I registered the premium domain 01.me. 0 and 1 are the only two digits in binary; I chose this domain in the hope of devoting myself to AGI (Artificial General Intelligence) and making a small contribution to silicon-based life built from 0s and 1s.
The domain 01.me also has some investment value: 01.org is Intel's open source website, 01.ai is the website of Kai-Fu Lee's AI startup 01.AI (Lingyi Wanwu), and 01.com sold for $1,820,000 in 2017 (of course, .me and .com are not really comparable in value).
For the convenience of sharing articles on WeChat and other Chinese platforms, this site also keeps two ICP-filed domains, bojieli.com and boj.life. Once the registry's 60-day lock on newly registered domains expires, I may consider transferring 01.me to a domestic registrar and filing it as well.
2023-06-20
KV-Direct: High-Performance In-Memory Key-Value Store with Programmable NIC
Bojie Li, Zhenyuan Ruan, Wencong Xiao, Yuanwei Lu, Yongqiang Xiong, Andrew Putnam, Enhong Chen and Lintao Zhang.
Proceedings of the 26th Symposium on Operating Systems Principles (SOSP '17). [PDF] [Slides]
Transcription with Whisper.
2023-06-19
ClickNP: Highly Flexible and High-Performance Network Processing with Reconfigurable Hardware
Bojie Li, Kun Tan, Layong (Larry) Luo, Yanqing Peng, Renqian Luo, Ningyi Xu, Yongqiang Xiong, Peng Cheng and Enhong Chen.
Proceedings of the 2016 ACM SIGCOMM Conference (SIGCOMM '16). [PDF] [Slides]
Transcription with Whisper.
2023-06-14
Polling and interrupts have long been a trade-off in RDMA systems. Polling has lower latency, but each CPU core can only run one thread. Interrupts enable time sharing among multiple threads but have higher latency. Many applications such as databases have hundreds of threads, far more than the number of cores, so they have to use interrupt mode to share cores among threads, and the resulting RDMA latency is much higher than the hardware limits. In this paper, we analyze the root cause of high costs in RDMA interrupt delivery, and present FastWake, a practical redesign of the interrupt-mode RDMA host network stack using commodity RDMA hardware, the Linux OS, and unmodified applications. Our first approach to fast thread wake-up completely removes interrupts. We design a per-core dispatcher thread to poll all the completion queues of the application threads on the same core, and utilize a kernel fast path to context switch to the thread with an incoming completion event. Because this approach keeps CPUs running at 100% utilization, we also design an interrupt-based approach for scenarios with power constraints. Observing that waking up a thread on the same core as the interrupt is much faster than waking threads on other cores, we dynamically adjust RDMA event queue mappings to improve interrupt core affinity. In addition, we revisit the kernel path of thread wake-up and remove the overheads in the virtual file system (VFS), locking, and process scheduling. Experiments show that FastWake can reduce RDMA latency by 80% on x86 and 77% on ARM at the cost of less than 30% higher power consumption than traditional interrupts, and the latency is only 0.3~0.4 μs higher than the limits of the underlying hardware. When power saving is desired, our interrupt-based approach can still reduce interrupt-mode RDMA latency by 59% on x86 and 52% on ARM.
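As a rough illustration of the per-core dispatcher idea described above, here is a simplified user-space sketch. It is not the paper's implementation: FastWake wakes the target thread through a kernel fast path rather than pthread primitives, and the `worker`/`core_ctx` structures below are hypothetical.

```c
#include <infiniband/verbs.h>
#include <pthread.h>

/* Hypothetical bookkeeping: each application thread pinned to this core
 * registers its completion queue and a condition variable to sleep on. */
struct worker {
    struct ibv_cq  *cq;
    pthread_mutex_t mu;
    pthread_cond_t  cv;
    int             ready;      /* set when a completion has arrived */
};

struct core_ctx {
    struct worker *workers;     /* all workers pinned to this core */
    int            nworkers;
};

/* Dispatcher loop for one core: poll every worker's CQ and wake the owner
 * when a completion shows up. FastWake replaces the pthread wake-up below
 * with a kernel fast path that context-switches directly to the target
 * thread; this sketch only shows the polling/dispatch structure. */
void *dispatcher(void *arg)
{
    struct core_ctx *core = arg;
    struct ibv_wc wc;

    for (;;) {
        for (int i = 0; i < core->nworkers; i++) {
            struct worker *w = &core->workers[i];
            if (ibv_poll_cq(w->cq, 1, &wc) > 0) {
                pthread_mutex_lock(&w->mu);
                w->ready = 1;
                pthread_cond_signal(&w->cv);   /* wake the owning thread */
                pthread_mutex_unlock(&w->mu);
            }
        }
    }
    return NULL;
}
```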
Publication
Bojie Li, Zihao Xiang, Xiaoliang Wang, Han Ruan, Jingbin Zhou, and Kun Tan. FastWake: Revisiting Host Network Stack for Interrupt-mode RDMA. In 7th Asia-Pacific Workshop on Networking (APNET 2023), June 29–30, 2023, Hong Kong, China. [Paper PDF] [Slides PPTX] [Slides PDF] [Video] [Talk Transcript]
People
- Bojie Li, Technical Expert at Computer Networking and Protocol Lab, Huawei.
- Zihao Xiang, Senior Developer at Computer Networking and Protocol Lab, Huawei.
- Xiaoliang Wang, Associate Professor, Nanjing University.
- Han Ruan, Senior Technical Planning Expert at Computer Networking and Protocol Lab, Huawei.
- Jingbin Zhou, Director of Computer Networking and Protocol Lab, Huawei.
- Kun Tan, Director of Distributed and Parallel Software Lab, Huawei.
2023-06-11
(This article is a compilation of the author's speech at Peking University on December 12, 2022: the recording was first transcribed into a draft with iFlytek's speech recognition, then polished with GPT-4 to correct transcription errors, and finally supplemented with some new thoughts by hand)
- Part 1: The New Golden Age of Computer Networks (Part 1): Data Centers
- Part 2: The New Golden Age of Computer Networks (Part 2): Wide Area Networks
Wireless networking is a very broad field, corresponding to two major Huawei product lines: wireless and the Consumer BG. Wireless mainly refers to the familiar 5G and Wi-Fi, while the Consumer BG covers various smart terminals, including mobile phones.
At the beginning of the previous chapter on wide area networks, we mentioned that today's transport protocols do not fully utilize the bandwidth of wireless and wide area networks, so many applications never actually experience the hundreds of Mbps claimed by 5G and Wi-Fi. This is the familiar "last mile" problem. As wireless networks get closer and closer to wired networks in performance, some optimizations originally designed for data centers will apply to wireless networks as well. When we used to talk about distributed systems, we thought of data centers, but a home now has so many terminal devices and smart home devices that it also forms a distributed system. In the future, a family may be a mini data center.
2023-05-28
(This article is a compilation of my speech at Peking University on December 12, 2022, first converted into a draft using iFlytek’s voice recognition technology, then polished and corrected using GPT-4, and finally supplemented with some new thoughts manually)
Communication over Wide Area Networks (WANs) mainly falls into two categories: end-cloud communication and inter-cloud communication. Let's start with end-cloud.
End-Cloud Networks
When we talk about WANs, we generally assume they are uncontrollable: the operators' network equipment is not under our control, and a large number of other users are accessing it concurrently, making determinism hard to achieve. Yet many of today's applications require a certain degree of determinism, such as video conferencing and online gaming, where users feel lag if the delay is too high. How do we reconcile this contradiction? That is our topic today.
As we mentioned in the previous chapter on data center networks, the bandwidth applications actually perceive is much less than the physical bandwidth, so there is room for optimization. We know that the theoretical bandwidth of both 5G and Wi-Fi is hundreds of Mbps or even Gbps, and many home broadband plans also reach hundreds of Mbps or even gigabit speeds. In theory, 100 MB of data can be transmitted in one or two seconds. But how often does a 100 MB app actually finish downloading from the app store in one or two seconds? As another example, a compressed 4K video needs only 15~40 Mbps, far below the theoretical bandwidth limit, yet how many network environments can play 4K video smoothly? This is partly a problem of the device-side wireless network and partly a problem of the WAN. There is still a long way to go before we make good use of the theoretical bandwidth.
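As a quick sanity check on these numbers (assuming an ideal link with no protocol overhead, and picking 500 Mbps and 1 Gbps as example rates):

```latex
\[
  t_{100\,\mathrm{MB}}
    = \frac{100 \times 8\ \mathrm{Mbit}}{500\ \mathrm{Mbit/s}} = 1.6\ \mathrm{s},
  \qquad
  t_{100\,\mathrm{MB}}
    = \frac{800\ \mathrm{Mbit}}{1000\ \mathrm{Mbit/s}} = 0.8\ \mathrm{s},
  \qquad
  \frac{40\ \mathrm{Mbit/s}}{1000\ \mathrm{Mbit/s}} = 4\%.
\]
```

So even a 4K stream at its upper bound uses only a few percent of a gigabit link, yet in practice neither target is reliably met.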
When I was interning at Microsoft, the Chinese restaurant on the second floor of the Microsoft building was called "Cloud + Client", and the backdrop of the sky garden on the 12th floor read "cloud first, mobile first". Data centers and smart terminals were indeed the two hottest fields from 2010 to 2020. Unfortunately, Microsoft's mobile efforts never took off. Huawei, on the other hand, is strong on both the device and cloud sides, giving it a unique advantage in end-cloud collaborative optimization.
2023-05-27
(This article is organized based on the speech I delivered at Peking University on December 12, 2022, first converting the conference recording into a draft using iFlytek’s speech recognition, then polishing it with GPT-4 to correct errors from voice recognition, and finally manually adding some new thoughts)
I am very grateful to Professor Huang Qun and Professor Xu Chenren for the invitation; it is an honor to come to Peking University and give a guest lecture for their computer networking course. I heard that you are all top students at Peking University; back in my day, I could only dream of attending Peking University. It is truly an honor to have the opportunity to share with you some of the latest developments in computer networking, both academic and industrial.
Turing Award winner David Patterson gave a very famous speech in 2019 called “A New Golden Age for Computer Architecture”, which talked about the end of Moore’s Law for general-purpose processors and the historic opportunity for the rise of Domain-Specific Architectures (DSA). What I am going to talk about today is that computer networking has also entered a new golden age.
The computer networks we come into contact with daily mainly consist of three parts: wireless networks, wide area networks, and data center networks. They provide the communication foundation for a smart world of interconnected things.
Among them, the terminal devices of wireless networks include mobile phones, PCs, watches, smart home devices, smart cars, and more. These devices usually access the network wirelessly (via Wi-Fi or 5G). After passing through 5G base stations or Wi-Fi access points, traffic enters the wide area network. The WAN also contains CDN servers, which belong to edge data centers. From there, traffic enters the data center network, which contains many different types of devices, such as gateways and servers.
Today, I will introduce you to data center networks, wide area networks, and terminal wireless networks. First, let’s look at data center networks. The biggest change in data center networks is the evolution from simple networks designed for simple web services to networks designed for large-scale heterogeneous parallel computing, performing tasks traditionally handled by supercomputers, such as AI, big data, and high-performance computing.
2023-05-27
**(Article from the WeChat public account of the Intelligent Manufacturing Society, original link; many thanks to the Intelligent Manufacturing Society for their excellent questions and editing)**
What impact will AI ultimately have on the technology and life of human society?
With the release of GPT-4, the performance of large-model AI has once again expanded the public's imagination. Content produced by AIGC is becoming more realistic and refined, and with ever deeper data cleaning and training, AI's understanding of natural language has made great progress. From passively accepting data "feeding" to actively asking the world questions, perhaps the "artificial intelligence life" of science fiction is no longer far away.
Anxiety is inevitable, and "AI unemployment" does seem to be happening in some industries. On May 18, 2023, local time, BT, the largest telecom operator in the UK, said it would cut 40,000 to 55,000 jobs between 2028 and 2030. The cuts will include both BT's direct employees and third-party staff, reducing the company's total headcount by 31-42%. BT currently has about 130,000 employees.
BT CEO Philip Jansen stated publicly that after completing its fiber rollout, digitizing its ways of working, adopting artificial intelligence (AI), and simplifying its structure, the company will rely on a smaller workforce and a significantly reduced cost base: "the new BT Group will be a leaner enterprise with a brighter future". Back in China, some internet and technology companies have shown similar trends, with art outsourcing positions at game companies hit especially hard.
On this issue, Li Bojie, an assistant scientist at Huawei's 2012 Labs, said that some of the public's anxiety has been amplified by the media. AI is not a monster that will replace humans; rather, it liberates productivity and creates new kinds of jobs. "For example, look at past industrial revolutions: people who used to farm by hand now have to use machines. The education they need, and the changes to society, the economy, and people's way of life, are all enormous."
Li Bojie believes that once AI becomes a widespread new production tool, more industries and occupations will emerge around it. "For example, after computers appeared, there was no longer any need for copyists to transcribe documents by hand, right? AI is the same. Some industries that directly involve people, like the service industry, it cannot replace, right? But for work that follows rules and fixed patterns, AI can save a great deal of labor."
As a researcher of data center networking, a technology closely tied to AI, Li Bojie has offered many views and reflections on AI. Below is a transcript of the conversation between Xiao Zhi, lead writer at the Intelligent Manufacturing Society, and Li Bojie: