Should AI Clusters Use RoCEv2 or InfiniBand?
(This article was first published on Zhihu)
Most major internet companies are deploying RDMA, mainly for storage and AI/HPC. There are two competing technical routes: RoCEv2 and InfiniBand.
RoCEv2 is RDMA over Ethernet: it runs the RDMA protocol on top of a traditional data center Ethernet network. InfiniBand (IB) has a longer history in this space; ever since it appeared around 2000, HPC (high-performance computing) clusters have predominantly been built on IB.
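Despite the different link layers, RoCEv2 and InfiniBand NICs are programmed through the same verbs API (libibverbs), so application code is largely identical on the two fabrics. As a rough illustration, the minimal sketch below, assuming a Linux host with libibverbs installed and at least one RDMA-capable NIC, lists the local RDMA devices and reports whether each one's first port runs over Ethernet (RoCE) or native InfiniBand.

```c
/* A minimal sketch, assuming libibverbs is installed and at least one
 * RDMA-capable NIC is present (link with -libverbs). It lists the local RDMA
 * devices and reports whether each one's first port runs over Ethernet (RoCE)
 * or native InfiniBand -- the rest of the verbs API is the same for both. */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num = 0;
    struct ibv_device **list = ibv_get_device_list(&num);
    if (!list) {
        perror("ibv_get_device_list");
        return 1;
    }

    for (int i = 0; i < num; i++) {
        struct ibv_context *ctx = ibv_open_device(list[i]);
        if (!ctx)
            continue;

        struct ibv_port_attr port;
        if (ibv_query_port(ctx, 1, &port) == 0) {  /* port numbering starts at 1 */
            const char *ll =
                port.link_layer == IBV_LINK_LAYER_ETHERNET   ? "Ethernet (RoCE)" :
                port.link_layer == IBV_LINK_LAYER_INFINIBAND ? "InfiniBand"      :
                                                               "unspecified";
            printf("%s: port 1 link layer = %s\n",
                   ibv_get_device_name(list[i]), ll);
        }
        ibv_close_device(ctx);
    }

    ibv_free_device_list(list);
    return 0;
}
```

On a RoCE deployment the output reports "Ethernet (RoCE)" per device; on an IB fabric it reports "InfiniBand". Device names and port numbering will of course vary from machine to machine.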
The current leader in RDMA NICs is Mellanox, now part of NVIDIA. Roughly speaking, RoCEv2 is the community edition of RDMA and InfiniBand is the enterprise edition. The strength of the community edition is openness: there are plenty of knobs to turn. That is also its weakness, because only network experts can operate it well. In fact, a large-scale RoCEv2 cluster is more than one expert can handle alone; it takes a whole team to deal with PFC storms and the assorted strange problems that NICs and switches throw up. Of course, with just a handful of machines behind a single switch, all using the same NIC model, a small RoCEv2 cluster will run with essentially no trouble.
The RDMA circle is small, and almost everyone in it has some academic background. If you have never heard of the problems described above, it is better to just use IB: spend a bit more money and keep life simple. I have heard of AI companies that think buying A100/H100s is all it takes; they cannot even tell the SXM version from the PCIe version, and they do not realize that large-scale training needs IB NICs and switches, assuming an ordinary 10G network will do. Such companies are best off letting an AI-cluster solution vendor spec the IB NICs, switches, and network topology for them. Don't show off, and don't try to save money by dabbling in RoCEv2.
Most of OpenAI's GPU clusters currently use InfiniBand, and some small and medium-sized AI companies use IB as well. Most newly built GPU clusters at the big internet companies, however, use RoCEv2. These companies need to support tens of thousands of GPUs, a scale IB cannot reach, and at that size cost matters a great deal; some of them have even started building their own NICs. Another reason is that the big companies have professional networking teams, and a closed system like IB leaves little room for optimization; how else would those network experts tune performance and write their slide decks?
A point-by-point comparison of RoCEv2 and InfiniBand:
| | RoCEv2 | InfiniBand |
|---|---|---|
| Bandwidth | Currently up to 200 Gbps | Currently up to 400 Gbps |
| Direct-connect latency (RTT) | 1.6 µs | 1.6 µs |
| Single-hop switch latency (one-way) | 500 ns | 100-150 ns |
| End-to-end latency across 3 switch hops (RTT) | 4.6 µs | 2.5 µs |
| Top device vendor | NVIDIA (Mellanox) | NVIDIA (Mellanox) |
| Device cost | Lower | Higher |
| Shares hardware with standard Ethernet | Yes | No |
| Scale | Can cover an entire data center with hundreds of thousands of servers | Up to roughly ten thousand servers |
| Flow control | Generally requires PFC (administrator must configure PFC thresholds) | Credit-based flow control (essentially no configuration needed) |
| Congestion control | DCQCN (note: old RoCE NICs have poor congestion control) | An ECN-based protocol similar to DCQCN |
| Multi-path routing | ECMP hashed on the source UDP port, so each flow sticks to a single path (see the sketch after this table) | Same as RoCEv2 by default; adaptive routing can switch paths automatically on failure |
| Packet retransmission | Go-back-N (note: very old RoCE NICs use go-back-0, which is very poor) | Go-back-N |
| Operations difficulty | High: you must configure PFC thresholds yourself, chase down PFC storms caused by faulty NICs and switches, and resolve interoperability issues when NIC models differ | Low: essentially maintenance-free |
| Suitable scenarios | Large internet companies; large-scale AI/HPC clusters | Small and medium-scale AI/HPC clusters |
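To make the multi-path row above concrete: a RoCEv2 packet is RDMA carried inside UDP (destination port 4791) over IP, and a queue pair keeps a single UDP source port, so a switch's 5-tuple ECMP hash pins each flow to one uplink. The sketch below is a simplified stand-in for that path-selection logic, not any vendor's actual implementation; the FNV-1a hash, the addresses, and the port numbers are illustrative assumptions.

```c
/* A simplified stand-in (not any vendor's actual implementation) for the ECMP
 * path selection an Ethernet switch applies to RoCEv2 traffic: hash the
 * 5-tuple and take it modulo the number of equal-cost uplinks. The RoCEv2
 * destination UDP port is fixed at 4791 and a queue pair keeps one source UDP
 * port, so every packet of a given QP hashes to the same uplink. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define ROCEV2_UDP_DST_PORT 4791

struct flow_key {
    uint32_t src_ip, dst_ip;     /* IPv4 addresses */
    uint16_t src_port, dst_port; /* UDP ports */
    uint8_t  proto;              /* 17 = UDP */
};

/* FNV-1a, standing in for whatever hash the switch ASIC really uses. */
static uint32_t fnv1a(const uint8_t *p, size_t len)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++) { h ^= p[i]; h *= 16777619u; }
    return h;
}

static unsigned ecmp_uplink(const struct flow_key *k, unsigned num_uplinks)
{
    uint8_t buf[13];             /* serialize fields to avoid struct padding */
    memcpy(buf,      &k->src_ip,   4);
    memcpy(buf + 4,  &k->dst_ip,   4);
    memcpy(buf + 8,  &k->src_port, 2);
    memcpy(buf + 10, &k->dst_port, 2);
    buf[12] = k->proto;
    return fnv1a(buf, sizeof buf) % num_uplinks;
}

int main(void)
{
    /* Two QPs between the same pair of hosts differ only in UDP source port,
     * so they may land on different uplinks -- but each QP stays on one path. */
    struct flow_key qp1 = { 0x0a000001, 0x0a000002, 49152, ROCEV2_UDP_DST_PORT, 17 };
    struct flow_key qp2 = { 0x0a000001, 0x0a000002, 49153, ROCEV2_UDP_DST_PORT, 17 };
    printf("QP1 -> uplink %u\n", ecmp_uplink(&qp1, 8));
    printf("QP2 -> uplink %u\n", ecmp_uplink(&qp2, 8));
    return 0;
}
```

This flow-level pinning is why the table says each RoCEv2 flow "sticks to a single path": an unlucky hash keeps a large flow on a congested uplink, whereas IB's adaptive routing can at least move traffic off a failed path.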
Deployment of RDMA
RDMA is widely deployed at the big internet companies, and a large body of academic papers has come out of those deployments. The main scenarios are storage and AI/HPC.
Microsoft
Microsoft is at the forefront: RDMA NICs are installed in every server in Microsoft's data centers, forming a RoCEv2 network used mainly for access to cloud storage. RDMA is even used for cross-AZ communication between data centers in the same region, over distances on the order of 100 km. RDMA traffic now exceeds traditional Ethernet traffic, accounting for about 70% of total data center traffic.
For details of Microsoft's RDMA deployment, see the NSDI 2023 paper co-authored by Wei Bai and dozens of others, which received full marks from every reviewer.
The APNet 2023 presentation slides cover the NSDI 2023 paper, some of Microsoft's other large-scale RDMA deployment work, and a few directions for future research.
Microsoft has also done a great deal of foundational research on RoCEv2, including:
DCQCN (2015), now the standard congestion control protocol on NVIDIA (Mellanox) NICs (a simplified sketch of its rate-update loop appears after this list): Congestion Control for Large-Scale RDMA Deployments
Large-scale RDMA deployment and the PFC deadlock problem (2016): RDMA over Commodity Ethernet at Scale
An in-house NIC combining an FPGA with an RDMA chip (2016), forming a data center acceleration plane: A Cloud-Scale Acceleration Architecture
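As promised above, here is a simplified sketch of the sender-side ("reaction point") rate update described in the DCQCN paper. It is illustrative only: real NICs implement this in hardware with separate timers, byte counters, and a hyper-increase phase, which are merged or omitted here, and the constants are placeholders rather than the tuned production defaults.

```c
/* A simplified sketch of the DCQCN sender-side ("reaction point") rate update
 * from the SIGCOMM 2015 paper. Illustrative only: real NICs run this in
 * hardware with separate alpha/rate timers, byte counters, and a
 * hyper-increase phase, all merged or omitted here; the constants below are
 * not the tuned production defaults. */
#include <stdio.h>

struct dcqcn_rp {
    double rc;    /* current sending rate (Gbps) */
    double rt;    /* target rate remembered at the last cut */
    double alpha; /* running estimate of how often packets are ECN-marked */
};

static const double G    = 1.0 / 16.0; /* gain for the alpha moving average */
static const double R_AI = 0.5;        /* additive-increase step (Gbps) */

/* Called when a Congestion Notification Packet (CNP) arrives from the receiver. */
static void on_cnp(struct dcqcn_rp *rp)
{
    rp->alpha = (1.0 - G) * rp->alpha + G;    /* marks are arriving: raise alpha */
    rp->rt = rp->rc;                          /* remember the pre-cut rate */
    rp->rc = rp->rc * (1.0 - rp->alpha / 2);  /* multiplicative decrease */
}

/* Called periodically while no CNPs are arriving. */
static void on_quiet_period(struct dcqcn_rp *rp)
{
    rp->alpha = (1.0 - G) * rp->alpha;        /* no marks: decay alpha */
    rp->rt += R_AI;                           /* additive increase of the target */
    rp->rc = (rp->rt + rp->rc) / 2.0;         /* recover toward the target rate */
}

int main(void)
{
    struct dcqcn_rp rp = { .rc = 100.0, .rt = 100.0, .alpha = 1.0 };

    on_cnp(&rp);                              /* congestion signalled once */
    printf("after CNP:   rc = %.1f Gbps, alpha = %.3f\n", rp.rc, rp.alpha);

    for (int i = 0; i < 5; i++)
        on_quiet_period(&rp);                 /* a few quiet periods */
    printf("after quiet: rc = %.1f Gbps, alpha = %.3f\n", rp.rc, rp.alpha);
    return 0;
}
```

The key idea is that the rate cut is proportional to alpha, the NIC's running estimate of how often its packets are being ECN-marked, so light congestion causes gentle back-off while persistent marking cuts the rate roughly in half.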
It is fair to say that most of the Chinese internet companies working on RDMA have ties to Microsoft in one way or another.
Alibaba
Alibaba also has deep experience deploying RDMA and has published many papers at top conferences.
For example, RDMA deployment in Alibaba Cloud storage: When Cloud Storage Meets RDMA
The overall architecture of Alibaba Cloud Storage: From Luna to Solar: The Evolutions of the Compute-to-Storage Networks in Alibaba Cloud
Congestion control: HPCC: High Precision Congestion Control
Alibaba also offers an eRDMA product, elastic RDMA in the cloud whose underlying transport reuses the VPC network: What is eRDMA - Cloud Server ECS - Alibaba Cloud Help Center
Huawei
Huawei likewise has deep experience deploying RDMA, builds its own RDMA SmartNICs, and has many results in the academic world:
For example, MasQ, RDMA virtualization built on Mellanox NICs
ByteDance
ByteDance also has substantial experience deploying RDMA and has published a number of papers recently:
SRNIC: A Scalable Architecture for RDMA NICs
Collie: Finding Performance Anomalies in RDMA Subsystems
Hostping: Diagnosing Intra-host Network Bottlenecks in RDMA Servers