(This article was first published on Zhihu, and then reposted on Microsoft Research Asia)

We are not using FPGAs to replace CPUs. Rather, we use FPGAs to accelerate the computing tasks they are suited for, while the remaining tasks still run on the CPU, so that FPGA and CPU work together.

This answer will cover three questions:

  1. Why use FPGAs? How do they compare with CPUs, GPUs, and ASICs (application-specific chips)?
  2. Where are Microsoft’s FPGAs deployed? How do FPGAs communicate with each other and with CPUs?
  3. What role should FPGAs play in future cloud computing platforms? Are they just compute accelerator cards like GPUs?

Why use FPGA?

As we all know, Moore’s Law for general-purpose processors (CPUs) is in its twilight years, while the scale of machine learning and web services is growing exponentially. People use custom hardware to accelerate common computing tasks, but the rapidly changing industry requires this custom hardware to be reprogrammable for new types of computing tasks. The FPGA (Field Programmable Gate Array) is a hardware-reconfigurable architecture. For many years it served as a small-volume substitute for application-specific chips (ASICs), but in recent years it has been deployed at scale in the data centers of companies like Microsoft and Baidu to provide both powerful computing capability and sufficient flexibility.

Comparison of the computing performance and flexibility of different architectures

Why is the FPGA fast? Largely because its peers make it look good by comparison. CPUs and GPUs both belong to the von Neumann architecture: instructions are fetched, decoded, and executed, and memory is shared. The reason FPGAs are more energy efficient than CPUs, and even GPUs, is fundamentally the benefit of an architecture that has no instructions and no need for shared memory.

In the von Neumann architecture, because an execution unit (such as a CPU core) may execute any instruction, it needs instruction storage, a decoder, arithmetic units for the various instructions, and branch/jump handling logic. Because the control logic for an instruction stream is complex, there cannot be too many independent instruction streams; therefore the GPU uses SIMD (Single Instruction, Multiple Data) to let multiple execution units process different data in lockstep, and the CPU also supports SIMD instructions. In an FPGA, the function of each logic unit is determined at reprogramming (burning) time, so no instructions are needed.
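To make the SIMD idea concrete, here is a minimal C sketch (my own illustration, assuming an x86 CPU with AVX2; compile with gcc -mavx2): a single instruction adds eight pairs of 32-bit integers in lockstep, exactly the "same pace, different data" pattern described above.

```c
#include <immintrin.h>   /* AVX2 intrinsics */
#include <stdio.h>

int main(void) {
    int a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    int b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    int c[8];

    /* One SIMD instruction operates on eight data lanes at once:
       the same "add" is applied to every lane in lockstep. */
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    __m256i vc = _mm256_add_epi32(va, vb);
    _mm256_storeu_si256((__m256i *)c, vc);

    for (int i = 0; i < 8; i++)
        printf("%d ", c[i]);   /* 11 22 33 44 55 66 77 88 */
    printf("\n");
    return 0;
}
```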

There are two purposes for memory in the von Neumann architecture: one is to save state, the other is to communicate between execution units. Since the memory is shared, access arbitration is needed; and to exploit locality of access, each execution unit has a private cache, which requires keeping the caches of the execution units coherent. For saving state, the registers and on-chip memory (BRAM) in an FPGA belong to their own control logic, with no unnecessary arbitration or caching. For communication, the connections between each logic unit of the FPGA and its neighbors are determined at reprogramming (burning) time, so there is no need to communicate through shared memory.

After so much talk from three thousand feet up, how does the FPGA actually perform? Let’s look at compute-intensive tasks and communication-intensive tasks separately.

Examples of compute-intensive tasks include matrix operations, image processing, machine learning, compression, asymmetric encryption, Bing search sorting, etc. These tasks are generally offloaded by the CPU to the FPGA for execution. For these tasks, the integer multiplication performance of the Altera (it seems to be called Intel now, I’m still used to calling it Altera…) Stratix V FPGA we are currently using is basically equivalent to that of a 20-core CPU, and the floating-point multiplication performance is basically equivalent to that of an 8-core CPU, which is an order of magnitude lower than that of a GPU. The next-generation FPGA we are about to use, Stratix 10, will be equipped with more multipliers and hardware floating-point computing components, so theoretically it can reach the computing power equivalent to the current top GPU computing card.

FPGA integer multiplication capability (estimated values; DSPs not used, estimated from logic resource usage)

FPGA floating-point multiplication capability (estimated values; float16 uses soft cores, float32 uses hard cores)

In the data center, the core advantage of FPGA compared to GPU is latency. For tasks like Bing search sorting, in order to return search results as quickly as possible, it is necessary to minimize the latency of each step. If you use a GPU for acceleration, in order to fully utilize the computing power of the GPU, the batch size cannot be too small, and the latency will be at the millisecond level. If you use FPGA for acceleration, you only need microsecond-level PCIe latency (our current FPGA is used as a PCIe accelerator card). In the future, after Intel launches Xeon + FPGA connected via QPI, the latency between CPU and FPGA can be reduced to less than 100 nanoseconds, which is no different from accessing main memory.

Why is the latency of FPGA so much lower than that of GPU? This is essentially a difference in architecture. FPGA has both pipeline parallelism and data parallelism, while GPU has almost only data parallelism (pipeline depth is limited). For example, if there are 10 steps to process a data packet, FPGA can build a 10-stage pipeline, different stages of the pipeline are processing different data packets, and each data packet is processed after passing through 10 stages. As soon as a data packet is processed, it can be output immediately. The data parallel method of GPU is to make 10 computing units, each computing unit is also processing different data packets, but all computing units must follow a unified pace and do the same thing (SIMD, Single Instruction Multiple Data). This requires 10 data packets to be input and output together, increasing the latency of input and output. When tasks arrive one by one instead of in batches, pipeline parallelism can achieve lower latency than data parallelism. Therefore, for stream computing tasks, FPGA has a natural advantage in latency over GPU.
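To make this latency argument concrete, here is a toy model in C (all numbers are illustrative assumptions, not measurements): ten processing steps of 1 microsecond each, packets arriving 1 microsecond apart, and a GPU-style batch of 10.

```c
#include <stdio.h>

/* Toy latency model (all numbers are assumptions, for illustration only):
 * - each packet needs STEPS processing steps of STEP_US microseconds each
 * - packets arrive one by one, ARRIVAL_US microseconds apart
 * Pipeline (FPGA-style): each packet flows through the stages and is
 * emitted as soon as its own processing is done.
 * Batch (GPU/SIMD-style): packets first accumulate into a batch of BATCH,
 * then the whole batch marches through the steps in lockstep and is
 * emitted together. */
#define STEPS      10
#define STEP_US    1.0
#define ARRIVAL_US 1.0
#define BATCH      10

int main(void) {
    double pipeline_latency = STEPS * STEP_US;

    /* The first packet of a batch must wait for the other (BATCH - 1)
     * packets to arrive before any processing starts. */
    double batch_wait    = (BATCH - 1) * ARRIVAL_US;
    double batch_latency = batch_wait + STEPS * STEP_US;

    printf("pipeline: per-packet latency            = %.1f us\n",
           pipeline_latency);
    printf("batch:    worst-case per-packet latency = %.1f us "
           "(of which %.1f us is spent waiting for the batch to fill)\n",
           batch_latency, batch_wait);
    return 0;
}
```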

Order-of-magnitude comparison of CPU, GPU, FPGA, and ASIC on compute-intensive tasks (using 16-bit integer multiplication as an example; the numbers are only order-of-magnitude estimates)

ASICs are impeccable in throughput, latency, and power consumption, but Microsoft did not adopt them, for two reasons I think:

  1. The computing tasks in a data center are flexible and changeable, while ASIC development is costly and slow. By the time a large batch of accelerator cards for one neural network has finally been deployed, another neural network may have become more popular, and the money is wasted. An FPGA can update its logic functions in a few hundred milliseconds; this flexibility protects the investment. In fact, how Microsoft uses its FPGAs today is very different from the original plan.
  2. The data center is rented out to different tenants. If some machines had neural-network accelerator cards, some had Bing search accelerator cards, and some had network-virtualization accelerator cards, task scheduling and server operations and maintenance would become troublesome. Using FPGAs preserves the homogeneity of the data center.

Next, let’s look at communication-intensive tasks. Compared with compute-intensive tasks, communication-intensive tasks do little processing on each piece of input data, mostly some simple computation followed by output, so communication itself often becomes the bottleneck. Symmetric encryption, firewalls, and network virtualization are examples of communication-intensive tasks.

Order-of-magnitude comparison of CPU, GPU, FPGA, and ASIC on communication-intensive tasks (using 64-byte network packet processing as an example; the numbers are only order-of-magnitude estimates)

For communication-intensive tasks, the FPGA’s advantage over the CPU and GPU is even greater. In terms of throughput, the transceivers on an FPGA can connect directly to 40 Gbps or even 100 Gbps network cables and process packets of any size at line rate; a CPU, by contrast, must receive packets from the network card before it can process them, and many network cards cannot handle 64-byte small packets at line rate. Although high performance can be achieved by plugging in multiple network cards, the number of PCIe slots supported by the CPU and motherboard is often limited, and network cards and switches are themselves expensive.
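A quick back-of-the-envelope calculation in C shows why 64-byte packets are so demanding (the 20 bytes of per-frame overhead are the standard Ethernet preamble, start-of-frame delimiter, and inter-frame gap):

```c
#include <stdio.h>

int main(void) {
    /* Line-rate arithmetic for minimum-size Ethernet frames:
     * each 64-byte frame occupies an extra 20 bytes on the wire
     * (7-byte preamble + 1-byte SFD + 12-byte inter-frame gap). */
    const double link_bps   = 40e9;         /* 40 Gbps link            */
    const double wire_bytes = 64.0 + 20.0;  /* payload + wire overhead */

    double pps        = link_bps / (wire_bytes * 8.0);
    double ns_per_pkt = 1e9 / pps;

    printf("64-byte frames at 40 Gbps: %.1f Mpps\n", pps / 1e6);
    printf("time budget per packet:    %.1f ns\n", ns_per_pkt);
    /* ~59.5 Mpps, i.e. roughly 17 ns per packet, which is only a few
     * dozen CPU cycles per packet. */
    return 0;
}
```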

In terms of latency, for the path where the network card hands a packet to the CPU and the CPU sends it back to the network card, even a high-performance packet-processing framework like DPDK gives a latency of 4 to 5 microseconds. A more serious problem is that the latency of a general-purpose CPU is not stable: for example, under high load the forwarding latency may rise to tens of microseconds or even higher (as shown in the figure below); the clock interrupts and task scheduling of modern operating systems also add to the latency uncertainty.

Comparison of the forwarding latency of ClickNP (FPGA), a Dell S6000 switch (commercial switch chip), Click+DPDK (CPU), and Linux (CPU); error bars represent the 5th and 95th percentiles. Source: [5]

Although GPU can also process data packets with high performance, GPU does not have a network port, which means that data packets need to be received by the network card first, and then GPU can do the processing. In this way, the throughput is limited by the CPU and/or network card. Not to mention the latency of the GPU itself.

So why not integrate these network functions into the network card, or use a programmable switch? Flexibility is still the ASIC’s fatal weakness. Although there are increasingly powerful programmable switch chips, such as Tofino with its support for the P4 language, ASICs still cannot do complex stateful processing, such as some custom encryption algorithm.

In summary, the main advantage of FPGA in the data center is its stable and extremely low latency, suitable for stream-based compute-intensive tasks and communication-intensive tasks.

Microsoft’s Practice of Deploying FPGA

In September 2016, Wired published an article about Microsoft betting its future on FPGAs [3], recounting the history of the Catapult project. Immediately afterwards, Doug Burger, the head of the Catapult project, demonstrated FPGA-accelerated machine translation with Microsoft CEO Satya Nadella at the Ignite 2016 conference. The total computing power of the demonstration was 1.03 million tera-ops, i.e. 1.03 exa-ops, equivalent to 100,000 top GPU computing cards. The power consumption of one FPGA (including on-board memory, network interfaces, etc.) is about 30 W, adding only one tenth to the total server power consumption. As long as the scale is large enough, concerns about the high price of FPGAs become unnecessary.

Microsoft’s deployment of FPGA was not smooth sailing. The question of where to deploy the FPGA has roughly gone through three stages:

  1. Dedicated FPGA clusters, filled with FPGAs
  2. One FPGA per machine, connected by a dedicated network
  3. One FPGA per machine, placed between the network card and the switch, sharing the server network

The first stage is a dedicated cluster, filled with FPGA accelerators, like a supercomputer composed of FPGAs. The picture below is the earliest BFB experimental board, with 6 FPGAs on a PCIe card, and 4 PCIe cards plugged into each 1U server.

Deploying like a supercomputer means having a dedicated cabinet full of servers each carrying 24 FPGAs (left in the picture below). This approach has several problems:

  1. FPGAs on different machines cannot communicate with each other, so the scale of problems an FPGA can handle is limited to the FPGAs within a single server;
  2. Other machines in the data center have to send their tasks to this cabinet, creating an in-cast traffic pattern, and it is hard to keep the network latency stable;
  3. The dedicated FPGA cabinet is a single point of failure: if it breaks, nobody gets acceleration;
  4. The servers carrying the FPGAs are custom-built, which adds trouble to cooling and maintenance.

A less aggressive approach is to deploy one server full of FPGAs on one side of each cabinet (in the picture above). This avoids problems (2) and (3) above, but (1) and (4) remain unresolved.

In the second stage, in order to ensure the homogeneity of the servers in the data center (this is also an important reason for not using ASIC), an FPGA is inserted on each server (right in the picture above), and the FPGAs are connected through a dedicated network. This is also the deployment method used in the paper published by Microsoft at ISCA’14.

The FPGA used is a Stratix V D5, with 172K ALMs, 2,014 M20K on-chip memory blocks, and 1,590 DSPs. The board carries 8 GB of DDR3-1333 memory, a PCIe Gen3 x8 interface, and two 10 Gbps network interfaces. The FPGAs within a cabinet are connected by a dedicated network: one group of 10G ports is connected in rings of 8, and another group of 10G ports in rings of 6, with no switch involved.

Such a cluster of 1632 servers and 1632 FPGAs has doubled the overall performance of Bing’s search result sorting (in other words, it has saved half of the servers). As shown in the figure below, every 8 FPGAs are threaded into a chain, and the 10 Gbps dedicated network cable mentioned above is used for communication. These 8 FPGAs each have their own duties, some are responsible for extracting features from documents (yellow), some are responsible for calculating feature expressions (green), and some are responsible for calculating document scores (red).

Both local and remote FPGAs can reduce search latency, and the communication latency of remote FPGAs can be ignored compared to search latency. Source: [4]

FPGA has been successfully deployed in Bing, and the Catapult project continues to expand within the company. The department with the most servers inside Microsoft is the cloud computing Azure department. The urgent problem that the Azure department needs to solve is the overhead brought by network and storage virtualization. Azure sells virtual machines to customers and needs to provide network functions such as firewalls, load balancing, tunnels, NAT, etc. for the virtual machine’s network. Since the physical storage of cloud storage is separated from the computing node, it is necessary to move the data from the storage node through the network, and also to compress and encrypt it.

In the era of 1 Gbps networks and mechanical hard drives, the CPU overhead of network and storage virtualization was negligible. As networks and storage get faster and faster, with networks reaching 40 Gbps and a single SSD reaching 1 GB/s of throughput, the CPU is gradually overwhelmed. For example, the Hyper-V virtual switch can only handle about 25 Gbps of traffic and cannot reach 40 Gbps line rate, and its performance is even worse with small packets; for AES-256 encryption and SHA-1 signing, each CPU core can handle only about 100 MB/s, just one tenth of an SSD’s throughput.
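As a rough illustration using only the figures quoted above, the following C snippet estimates how many CPU cores it would take just to encrypt and sign a 40 Gbps stream:

```c
#include <stdio.h>

int main(void) {
    /* Back-of-the-envelope estimate: cores needed for AES-256 + SHA-1
     * at 40 Gbps, assuming ~100 MB/s per core as quoted in the text. */
    const double line_rate_MBps = 40e9 / 8.0 / 1e6;  /* 40 Gbps = 5000 MB/s */
    const double per_core_MBps  = 100.0;

    printf("cores needed for crypto at line rate: %.0f\n",
           line_rate_MBps / per_core_MBps);          /* about 50 cores */
    return 0;
}
```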

To accelerate network functions and storage virtualization, Microsoft deploys the FPGA between the network card and the switch. As shown in the figure below, each FPGA board has 4 GB of DDR3-1333 DRAM and is connected to one CPU socket through two PCIe Gen3 x8 interfaces (physically it is a PCIe Gen3 x16 interface; since the FPGA has no x16 hard core, it is used logically as two x8). The physical network card (NIC) is just an ordinary 40 Gbps card, used only for communication between the host and the network.

The FPGA (SmartNIC) virtualizes a network card for each virtual machine, and the virtual machine accesses this virtual network card directly via SR-IOV. The data-plane functions that used to live in the virtual switch have been moved into the FPGA, so sending and receiving the virtual machine’s network packets no longer requires the CPU, nor do the packets pass through the physical network card (NIC). This not only frees up CPU resources that can be sold, but also improves the virtual machine’s network throughput (to 25 Gbps) and reduces the network latency between virtual machines in the same data center by a factor of 10.

This is the third-generation architecture of Microsoft’s FPGA deployment, and also the architecture currently deployed at scale as “one FPGA per server”. The original intention of having the FPGA reuse the host network was to accelerate networking and storage, but the more far-reaching impact is that it extends the network connections between FPGAs to the scale of the entire data center, making a truly cloud-scale “supercomputer” possible. In the second-generation architecture, the network connections between FPGAs were confined to a single rack; it is hard to scale a dedicated FPGA interconnect further, and forwarding through the CPU would cost too much.

In the third-generation architecture, FPGAs communicate with each other through LTL (Lightweight Transport Layer). The latency within the same rack is within 3 microseconds; within 8 microseconds, it can reach 1000 FPGAs; within 20 microseconds, it can reach all FPGAs in the same data center. Although the second-generation architecture has lower latency within 8 machines, it can only access 48 FPGAs through the network. To support a wide range of FPGA communication, the LTL in the third-generation architecture also supports the PFC flow control protocol and the DCQCN congestion control protocol.

The relationship of the logic modules inside the FPGA, where each Role is user logic (such as DNN acceleration, network function acceleration, or encryption), and the surrounding shell is responsible for communication among the Roles and between the Roles and peripherals. Source: [4]

The data center acceleration plane composed of FPGAs, between the network switching layer (TOR, L1, L2) and traditional server software (software running on the CPU). Source: [4]

The data center acceleration plane between the network switching layer and traditional server software is formed by FPGAs interconnected by high-bandwidth, low-latency networks. In addition to the network and storage virtualization acceleration required by each server providing cloud services, the remaining resources on the FPGA can also be used to accelerate tasks such as Bing search and deep neural networks (DNN).

For many types of applications, as the scale of the distributed FPGA accelerator grows, the performance improvement is super-linear. Take CNN inference as an example: when only one FPGA is used, the on-chip memory is not large enough to hold the whole model, so the model weights must constantly be fetched from DRAM, and the performance bottleneck is DRAM; if there are enough FPGAs, each FPGA is responsible for one layer of the model, or several features within one layer, so that the model weights are loaded entirely into on-chip memory, removing the DRAM bottleneck and letting the FPGA’s computing units run at full performance. Of course, splitting things too finely also increases communication overhead. The key to splitting a task across a distributed FPGA cluster is to balance computation and communication.
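Here is a rough sizing sketch in C of that argument; the on-chip memory figure is derived from the M20K count quoted earlier, and the model size is a hypothetical example:

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    /* Toy model for splitting a CNN across FPGAs so that all weights fit
     * in on-chip memory. Assumption: Stratix V D5 has 2014 M20K blocks of
     * 20 Kbit each, i.e. roughly 5 MB of BRAM per FPGA. */
    const double bram_mb_per_fpga = 2014 * 20e3 / 8.0 / 1e6;
    const double model_weights_mb = 50.0;   /* hypothetical model size */

    int fpgas_needed = (int)ceil(model_weights_mb / bram_mb_per_fpga);

    printf("on-chip memory per FPGA: %.1f MB\n", bram_mb_per_fpga);
    printf("FPGAs needed to keep a %.0f MB model entirely on chip: %d\n",
           model_weights_mb, fpgas_needed);
    /* With one FPGA the weights spill to DRAM and DRAM bandwidth becomes
     * the bottleneck; with enough FPGAs the weights stay on chip, at the
     * cost of inter-FPGA communication for the activations. */
    return 0;
}
```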

From the neural network model to the FPGA on HaaS. Utilizing the parallelism within the model, different layers and different features of the model are mapped to different FPGAs. Source: [4]

At the MICRO’16 conference, Microsoft proposed the concept of Hardware as a Service (HaaS), that is, treating hardware as a schedulable cloud service, making the centralized scheduling, management, and large-scale deployment of FPGA services possible.

From the first-generation dedicated server cluster filled with FPGAs, to the second-generation FPGA accelerator card cluster connected by a dedicated network, to the current large-scale FPGA cloud that reuses the data center network, three ideas guide our route:

  1. Hardware and software are not mutually exclusive, but cooperative;
  2. Must have flexibility, that is, be software-defined;
  3. Must have scalability.

The role of FPGA in cloud computing

Finally, I would like to talk about my personal thoughts on the role of FPGA in cloud computing. As a third-year doctoral student, my research at Microsoft Research Asia tries to answer two questions:

  1. What role should the FPGA play in a cloud-scale network interconnection system?
  2. How can we program FPGA + CPU heterogeneous systems efficiently and scalably?

My main regret about the FPGA industry is that the mainstream use of FPGA in data centers, from internet giants other than Microsoft, to the two major FPGA manufacturers, and then to academia, is mostly to treat FPGA as a compute-intensive task accelerator card like GPU. But is FPGA really suitable for doing GPU things? As mentioned earlier, the biggest difference between FPGA and GPU lies in the architecture. FPGA is more suitable for low-latency stream processing, and GPU is more suitable for processing large batches of homogeneous data.

Because many people intend to use FPGAs as compute accelerator cards, the high-level programming models launched by the two major FPGA manufacturers are also based on OpenCL, imitating the GPU’s shared-memory-based batch processing mode. For the CPU to hand a task to the FPGA, it must first put the data into the DRAM on the FPGA board, then tell the FPGA to start executing; the FPGA writes the result back into DRAM and then notifies the CPU to fetch it. The CPU and FPGA could communicate efficiently over PCIe directly, so why take the detour through the on-board DRAM? Perhaps it is a matter of engineering implementation. We found that one round trip through OpenCL (write DRAM, start the kernel, read DRAM) takes 1.8 milliseconds, while communicating through PCIe DMA takes only 1 to 2 microseconds.
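For reference, the round trip being measured is the classic OpenCL host-side sequence sketched below. This is my own minimal example against the generic OpenCL 1.2 C API, so it runs on any OpenCL device and omits error handling; the FPGA flow differs mainly in that the program is loaded from a precompiled bitstream rather than built from source.

```c
/* Minimal OpenCL "copy in, run kernel, copy out" round trip.
 * Build (Linux): gcc roundtrip.c -lOpenCL */
#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>
#include <stdio.h>
#include <time.h>

static const char *src =
    "__kernel void copy(__global const int *in, __global int *out) {"
    "    size_t i = get_global_id(0);"
    "    out[i] = in[i];"
    "}";

int main(void) {
    enum { N = 1024 };
    int in[N], out[N];
    for (int i = 0; i < N; i++) in[i] = i;

    cl_platform_id plat; cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "copy", NULL);

    cl_mem din  = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  sizeof in,  NULL, NULL);
    cl_mem dout = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof out, NULL, NULL);
    clSetKernelArg(k, 0, sizeof din,  &din);
    clSetKernelArg(k, 1, sizeof dout, &dout);

    struct timespec t0, t1;
    size_t gsz = N;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    /* The three steps whose combined latency the text is discussing: */
    clEnqueueWriteBuffer(q, din, CL_FALSE, 0, sizeof in, in, 0, NULL, NULL);
    clEnqueueNDRangeKernel(q, k, 1, NULL, &gsz, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, dout, CL_TRUE, 0, sizeof out, out, 0, NULL, NULL);

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) / 1e3;
    printf("write + kernel + read round trip: %.1f us\n", us);
    return 0;
}
```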

The communication between multiple kernels in OpenCL is even more exaggerated. The default method is also through shared memory. As mentioned at the beginning of this article, FPGA has higher energy efficiency than CPU and GPU. The fundamental advantage in architecture is that there are no instructions and no need for shared memory. Using shared memory for communication between multiple kernels is unnecessary in the case of sequential communication (FIFO). Moreover, the DRAM on the FPGA is generally much slower than the DRAM on the GPU.

Therefore, we proposed the ClickNP network programming framework [5], which uses pipes (channels) instead of shared memory for communication between execution units (elements/kernels) and between execution units and host software. Applications that need shared memory can also be implemented on top of pipes; after all, CSP (Communicating Sequential Processes) and shared memory are theoretically equivalent. ClickNP is currently a framework built on OpenCL, constrained by describing hardware in the C language (although HLS is indeed far more productive than Verilog). The ideal hardware description language is probably not C.
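To give a flavor of pipe-based communication between kernels, here is a minimal kernel-side sketch in the style of the Intel/Altera FPGA OpenCL channel extension (the kernel names and the transformation are hypothetical; ClickNP elements communicate through channels in a conceptually similar way):

```c
/* Two kernels connected by an on-chip FIFO (channel) instead of sharing
 * on-board DRAM. Requires the vendor channel extension of the FPGA
 * OpenCL SDK. */
#pragma OPENCL EXTENSION cl_intel_channels : enable

channel uint pkt_ch;   /* on-chip FIFO between the two kernels */

/* Stage 1: e.g. parse/filter, pushing each word downstream immediately. */
__kernel void stage_parse(__global const uint *in, uint n) {
    for (uint i = 0; i < n; i++)
        write_channel_intel(pkt_ch, in[i]);
}

/* Stage 2: e.g. rewrite/encrypt, consuming words as they arrive, with no
 * round trip through on-board DRAM. */
__kernel void stage_rewrite(__global uint *out, uint n) {
    for (uint i = 0; i < n; i++)
        out[i] = read_channel_intel(pkt_ch) ^ 0xdeadbeefu;  /* placeholder */
}
```

The two kernels are launched on separate command queues and run concurrently; data streams through the on-chip FIFO element by element, which matches the FIFO-style sequential communication described above.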

Where low-latency stream processing is needed most is in communication. However, because of limited parallelism and operating-system scheduling, the CPU is inefficient at communication and its latency is unstable. In addition, communication inevitably involves scheduling and arbitration; limited by single-core performance and inefficient inter-core communication, the CPU’s scheduling and arbitration performance is limited, whereas hardware is very well suited to this kind of repetitive work. My doctoral research therefore positions the FPGA as the “housekeeper” of communication: whether it is communication between servers, between virtual machines, between processes, or between the CPU and storage devices, all of it can be accelerated by the FPGA.

As the Chinese saying goes, “success is due to Xiao He, and so is failure”: having no instructions is both the FPGA’s strength and its weakness. Every different thing you want to do occupies a certain amount of FPGA logic resources. If the tasks are complex and not repetitive, they will occupy a large amount of logic, most of it sitting idle; in that case a von Neumann processor is the better choice. Many tasks in the data center, however, have strong locality and repetitiveness: one part is the networking and storage required by the virtualization platform, which is all communication; the other part lies in customers’ computing tasks, such as machine learning and encryption/decryption. We first use the FPGA for what it does best, communication, and perhaps in the future, like AWS, we will also rent FPGAs to customers as compute accelerator cards.

Whether it’s communication, machine learning, or encryption and decryption, the algorithms are very complex. If you try to completely replace the CPU with FPGA, it will inevitably lead to a great waste of FPGA logic resources and increase the development cost of FPGA programs. A more practical approach is for FPGA and CPU to work together, with FPGA handling tasks with strong locality and repetitiveness, and CPU handling complex tasks.

When we use FPGAs to accelerate more and more services such as Bing search and deep learning; when the data planes of fundamental components such as network virtualization and storage virtualization are handled by FPGAs; when the “data center acceleration plane” formed by FPGAs becomes a natural barrier between the network and the servers… it starts to feel as if the FPGAs will take charge of the big picture, and the computing tasks on the CPU will become fragmented and driven by the FPGAs. In the past we were CPU-centric and offloaded repetitive computing tasks to the FPGA; in the future, will we become FPGA-centric and offload complex computing tasks to the CPU? With the advent of Xeon + FPGA, will the venerable SoC be reborn in the data center?

“Across the memory wall and reach a fully programmable world.”

References

[1] Large-Scale Reconfigurable Computing in a Microsoft Datacenter

https://www.microsoft.com/en-us/research/wp-content/uploads/2014/06/HC26.12.520-Recon-Fabric-Pulnam-Microsoft-Catapult.pdf

[2] A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services, ISCA’14

https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/Catapult_ISCA_2014.pdf

[3] Microsoft Has a Whole New Kind of Computer Chip—and It’ll Change Everything

[4] A Cloud-Scale Acceleration Architecture, MICRO’16

https://www.microsoft.com/en-us/research/wp-content/uploads/2016/10/Cloud-Scale-Acceleration-Architecture.pdf

[5] ClickNP: Highly Flexible and High-performance Network Processing with Reconfigurable Hardware

[6] Daniel Firestone, SmartNIC: Accelerating Azure’s Network with FPGAs on OCS servers
