(This article was first published on Zhihu)

Most major internet companies are deploying RDMA, mainly for storage and AI/HPC, along two technical routes: RoCEv2 and Infiniband.

RoCEv2 is RDMA over Converged Ethernet: it runs the RDMA protocol on top of an ordinary data center Ethernet network. Infiniband (IB) has the longer history, with HPC (high-performance computing) clusters built on IB since the early 2000s.

The current leader in RDMA network cards is Mellanox, now part of NVIDIA. You could say that RoCEv2 is the community version of RDMA and Infiniband is the enterprise version. The strength of the community version is openness: there is a great deal you can configure. That is also its weakness, because only network experts can operate it. And a large-scale RoCEv2 cluster is not something a single network expert can handle; it takes a team to deal with PFC storms and all the strange problems that NICs and switches throw up. Of course, a small cluster with just a few machines, one switch, and NICs of the same model will basically run RoCEv2 without trouble.

The RDMA community is small, and nearly everyone in it has some academic background. If you have never heard of the problems above, it is better to just use IB: spend a little more money and keep things simple. I have heard of AI companies that think buying A100/H100s is enough; they cannot even tell the SXM version from the PCIe version, and they do not realize that large-scale training requires IB network cards and switches, assuming an ordinary 10G network will do. A company like that is best served by having an AI cluster solution vendor spec the IB NICs, switches, and network topology for it. Do not show off, and do not try to save money by dabbling in RoCEv2.

Most of OpenAI's GPU clusters currently run on Infiniband, and some small and medium-sized AI companies now use IB as well. The newly built GPU clusters at the big players, however, mostly use RoCEv2: these companies need to scale to tens of thousands of GPUs, which IB cannot reach, and at that scale cost matters a great deal. Some of them have even started building their own network cards. Another reason is that the big players have professional networking teams, and a closed system like IB is hard to optimize; those network experts would have no performance to tune and no results to put on slides.

The specific comparison of the two technologies, RoCEv2 and Infiniband, is as follows:

| | RoCEv2 | Infiniband |
| --- | --- | --- |
| Bandwidth | Currently up to 200 Gbps | Currently up to 400 Gbps |
| Direct-connection latency (RTT) | 1.6 us | 1.6 us |
| Single-hop switch latency (one-way) | 500 ns | 100-150 ns |
| End-to-end latency across 3 switch hops (RTT) | 4.6 us | 2.5 us |
| Top device manufacturer | NVIDIA (Mellanox) | NVIDIA (Mellanox) |
| Device cost | Lower | Higher |
| Can it share hardware with Ethernet | Yes | No |
| Scale | Can cover an entire data center with hundreds of thousands of servers | Up to about ten thousand servers |
| Flow control | Generally requires PFC (administrator must configure PFC thresholds) | Credit-based flow control (basically no configuration required) |
| Congestion control | DCQCN (note: old RoCE cards have poor congestion control) | ECN-based protocol similar to DCQCN |
| Multi-path routing | ECMP with random route selection hashed on the source port number; each flow takes a single path | Normally the same as RoCEv2; adaptive routing can automatically switch paths on failure |
| Packet retransmission | Go-back-N (note: very old RoCE cards use go-back-0, which is very poor) | Go-back-N |
| Operation and maintenance difficulty | High: you must configure PFC thresholds yourself, handle PFC storms caused by faulty NICs and switches, and solve interoperability problems if NIC models differ | Low: basically maintenance-free |
| Applicable scenarios | Large internet companies, large-scale AI/HPC clusters | Small and medium-scale AI/HPC clusters |
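
Whichever route you choose, applications program against the same verbs API; at that level the visible difference is just the port's link layer. As a rough illustration (my own sketch, not from the original article), a few lines of libibverbs are enough to tell a RoCE port from an IB port:

```c
/* Minimal sketch: list the local RDMA devices and report whether each one's
 * port 1 is Ethernet (RoCE) or Infiniband. Assumes libibverbs is installed.
 * Build with: gcc probe_rdma.c -libverbs */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }

    for (int i = 0; i < num; i++) {
        struct ibv_context *ctx = ibv_open_device(devs[i]);
        if (!ctx)
            continue;

        struct ibv_port_attr attr;
        if (ibv_query_port(ctx, 1, &attr) == 0) {   /* port numbering starts at 1 */
            const char *ll =
                attr.link_layer == IBV_LINK_LAYER_ETHERNET   ? "Ethernet (RoCE)" :
                attr.link_layer == IBV_LINK_LAYER_INFINIBAND ? "Infiniband"      :
                                                               "unspecified";
            printf("%s: link layer = %s\n", ibv_get_device_name(devs[i]), ll);
        }
        ibv_close_device(ctx);
    }
    ibv_free_device_list(devs);
    return 0;
}
```

Everything above the verbs layer (QPs, MRs, CQs) looks the same on both fabrics; the differences in the table live in the NICs, the switches, and their configuration.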

Deployment of RDMA

RDMA is widely deployed at the big internet companies, and there is a large body of published papers on it. The main scenarios are storage and AI/HPC.

Microsoft

Microsoft is at the forefront: RDMA network cards are deployed on every server in its data centers, forming a RoCEv2 network that is used mainly for accessing cloud storage, and RDMA is even used for cross-AZ communication between data centers in the same region, at distances on the order of 100 km. RDMA traffic now exceeds traditional Ethernet traffic, accounting for about 70% of total data center traffic.

For Microsoft's RDMA deployment, see the NSDI 2023 paper Empowering Azure Storage with RDMA, co-authored by Wei Bai and dozens of others, which received full marks from every reviewer.

NSDI 2023 presentation video

APNet 2023 presentation PPT
(Covers the NSDI 2023 paper, other large-scale RDMA deployment work at Microsoft, and some directions for future research)

APNet 2023 presentation video

Microsoft has a lot of basic research on RoCEv2 technology, including:

DCQCN in 2015, which has since become the standard congestion control protocol on NVIDIA (Mellanox) network cards (a simplified sketch of its rate-update loop appears below): Congestion Control for Large-Scale RDMA Deployments

Large-scale RDMA deployment and PFC deadlock issues in 2016: RDMA over Commodity Ethernet at Scale

A self-developed network card combining an FPGA with an RDMA chip in 2016, forming a data center acceleration plane: A Cloud-Scale Acceleration Architecture
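
To give a feel for what DCQCN actually does: the sending NIC keeps a current rate, a target rate, and a congestion estimate alpha. Each CNP from the receiver (triggered by ECN marks) cuts the rate multiplicatively, and quiet periods let the rate climb back toward the target. Below is a heavily simplified sketch of that control loop (my paraphrase, not code from the paper; the fast recovery, additive increase, and hyper increase phases are collapsed into one timer step and the constants are illustrative):

```c
/* Heavily simplified DCQCN sender-side rate control (illustrative only;
 * real NICs implement this in hardware/firmware with more state). */
#include <stdio.h>

struct dcqcn_state {
    double rc;     /* current sending rate (Gbps) */
    double rt;     /* target rate (Gbps) */
    double alpha;  /* congestion estimate, 0..1 */
};

static const double G    = 1.0 / 16;  /* alpha gain */
static const double R_AI = 0.05;      /* additive-increase step (Gbps), illustrative */

/* The receiver saw ECN-marked packets and sent back a CNP: cut the rate. */
void on_cnp(struct dcqcn_state *s)
{
    s->rt = s->rc;                        /* remember the rate before the cut */
    s->rc = s->rc * (1.0 - s->alpha / 2); /* multiplicative decrease */
    s->alpha = (1.0 - G) * s->alpha + G;  /* congestion estimate rises */
}

/* No CNP for a while: decay the congestion estimate and recover the rate. */
void on_increase_timer(struct dcqcn_state *s)
{
    s->alpha = (1.0 - G) * s->alpha;
    s->rt += R_AI;
    s->rc = (s->rt + s->rc) / 2;          /* move back toward the target */
}

int main(void)
{
    struct dcqcn_state s = { .rc = 100.0, .rt = 100.0, .alpha = 1.0 };
    on_cnp(&s);             /* congestion signalled: rate drops */
    on_increase_timer(&s);  /* quiet period: rate recovers */
    printf("rc=%.1f rt=%.1f alpha=%.3f\n", s.rc, s.rt, s.alpha);
    return 0;
}
```

PFC then only has to cover the gap before this loop reacts, which is why DCQCN plus properly tuned PFC thresholds is the usual RoCEv2 recipe.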

It is fair to say that most of the Chinese internet companies working on RDMA trace some lineage back to Microsoft.

Alibaba

Alibaba also has deep experience deploying RDMA and has published many papers at top conferences.

For example, Alibaba Cloud Storage deploys RDMA: When Cloud Storage Meets RDMA

The overall architecture of Alibaba Cloud Storage: From Luna to Solar: The Evolutions of the Compute-to-Storage Networks in Alibaba Cloud

Congestion control: HPCC: High Precision Congestion Control

Alibaba also offers eRDMA, an elastic RDMA service on the cloud whose underlying layer reuses the VPC network: What is eRDMA_Cloud Server ECS-Alibaba Cloud Help Center

Huawei

Huawei likewise has deep experience deploying RDMA, builds its own RDMA smart NICs, and has plenty of academic results:

For example, MasQ, RDMA virtualization built on Mellanox network cards

ByteDance

ByteDance has also built up substantial RDMA deployment experience and has published a number of papers recently:

SRNIC: A Scalable Architecture for RDMA NICs

Collie: Finding Performance Anomalies in RDMA Subsystems

Hostping: Diagnosing Intra-host Network Bottlenecks in RDMA Servers

FreeFlow: Software-based Virtual RDMA Networking for Containerized Clouds
