(Reprinted from Microsoft Research Asia)

As the oldest top-tier academic conference in computer networking, ACM SIGCOMM has been held 37 times since 1977. The ACM Special Interest Group on Data Communication (SIGCOMM) proudly calls it its annual flagship conference on its homepage. Over the past 40 years, from the TCP congestion control protocol found in networking textbooks to Software Defined Networking (SDN) and Network Function Virtualization (NFV) in cloud data centers, SIGCOMM has witnessed the birth and development of many key technologies in computer networking.

SIGCOMM papers are known for their high quality: only about 40 are accepted each year, an acceptance rate of around 15%, and network researchers around the world regard publishing at SIGCOMM as an honor. Each paper undergoes rigorous double-blind review. This year, for example, there were three rounds: the first round selected 99 of 225 submissions, the second round narrowed these to 66, and the third round selected 60 for discussion by the Program Committee (PC), which decided on the final 39 accepted papers after a day and a half of meetings. Each accepted paper received an average of 8 reviews, spanning dozens of pages. Even for papers that are ultimately rejected, these expert reviews are very helpful for subsequent improvement.

SIGCOMM Agenda

This year was the first time SIGCOMM was held in South America, in Brazil, the host country of the Olympics; the conference opened on August 22, the day after the Olympic closing ceremony. Unfortunately, the Zika epidemic worried participants enough that the venue was moved from the northeastern city of Salvador to the southeastern island of Florianópolis. The conference lasted five days: the first and last days were workshops and tutorials, and the middle three days were the single-track main conference, with a 20-minute talk plus 5 minutes of Q&A for each paper. Also because of the epidemic, SIGCOMM exceptionally allowed remote presentations this year: pre-recorded talk videos were played on site, with Q&A conducted over Skype. The 21 posters, 18 demos, 12 main-conference papers with accompanying posters, and 8 industry demos were divided into three batches, interspersed among the coffee breaks of the three main-conference days.

SIGCOMM venue’s beach and the Atlantic Ocean

Of the 39 papers accepted at this year's SIGCOMM, Microsoft was involved in 11. Three had first authors from Microsoft Research (ClickNP, ProjecToR, Via), two had first authors from Microsoft's engineering teams (Dynamic Pricing, RDMA), and six were collaborations with universities (Domino, 2DFQ, Control Plane Analysis, Don't Mind the Gap, WebPerf, NetPoirot).

As at previous SIGCOMMs, Microsoft was the undisputed leader among industry players in the networking academic community. On the one hand, Microsoft publishes papers to share cutting-edge technology from Microsoft Research (such as ClickNP, which implements network functions on FPGAs) and operational experience from its data centers (such as the problems encountered in deploying RDMA at scale); on the other hand, it shares real problems and data from large-scale network services with universities, making it easier for academics to find genuinely important problems and produce impactful work.

Google, Facebook, Cisco, and other network giants also published papers at SIGCOMM this year. Continuing its tradition, Huawei sent a strong delegation of more than a dozen employees. Cisco set up a recruiting booth at the venue, and Facebook invited some paper authors to salon events.

Chinese Faces at SIGCOMM

This year, two main-conference papers came from mainland China (Microsoft Research Asia's ClickNP and the Chinese Academy of Sciences' CS2P), and two came from Hong Kong, China (CODA and Karuna, both from Professor Chen Kai's group at the Hong Kong University of Science and Technology). Even more encouraging, 14 of the 39 main-conference talks were given by Chinese speakers (12 first authors and 2 second authors), so the voices at the podium often sounded familiar.

The mainland Chinese academic community also performed well in the poster and demo sessions. Of the 21 posters, 8 came from mainland China: Tsinghua University had 3 (PieBridge, SLA-NFV, and FAST); Xi'an Jiaotong University had 3 (an SDN compiler, an OpenFlow counter, and the flow-table overflow problem); and there were also Conan from Nanjing University and a task-scheduling poster from the National University of Defense Technology. Of the 18 demos, 4 came from mainland China: Tsinghua's SDN source address validation, BUPT's EasyApp, the National University of Defense Technology's FPGA deep packet inspection, and Huawei Future Network Lab's ADN (Application Driven Network).

Group photo of some Chinese participating in the poster session

Talk to professors!

Before the conference, I hesitated over whether it was worth traveling such a long distance, at some risk to my health, to attend. The papers are all available online, so why bother making the trip? My advisor told me that the purpose of attending a conference is to talk face to face, make friends, and look for opportunities to collaborate. Both before I left and after I finished my own talk, he repeatedly reminded me to talk with professors and make more friends.

The Welcome Reception on the first evening had no seats: everyone stood in the hall, ate from the buffet, and chatted with whoever was nearby. At the Conference Banquet on the second evening, the first half was likewise spent eating oysters, drinking wine, and finding people to talk to, with a sit-down dinner only in the second half. Professor Marco Canini of King Abdullah University of Science and Technology, one of the organizers, told us what matters most at a conference: talk to professors! Unfortunately my spoken English is not great and I am not used to the culture, so most of my chatting happened among the Chinese attendees. Although I could occasionally switch into Chinese mode or foraging mode for a rest, giving the elevator pitch for my own paper a dozen times in one evening was still exhausting.

Welcome Dinner

Breakfast each day was a hotel buffet, and enthusiastic attendees spent it circulating among tables, learning about each other's research directions and discussing academic problems. It was there that I realized how frighteningly skewed the gender ratio in the networking research community is: there was often only one woman in the restaurant (not counting the waitresses). For the roughly 20 female students among the 381 attendees, this SIGCOMM also organized a special N2Women dinner.

Every year after the conference banquet there is a joke session. This year's jokes came from Professor Marco Canini, who "launched" several apps for academic research: Instapaper, which imitates Instagram's interface, where the figures in your paper have to look good; Snaptract, where the abstract is the paper, and if nobody shows interest in your abstract within 30 seconds the paper disappears; Trustnami, a trust system for researchers' reputations and the real impact of papers, where each citation can be positive or negative and is weighted by the citer's h-index; and Menta, an AI trained on big names in the networking field that can tell you whether an idea is new, generate a related-work list, comment on experimental results, and help name your paper while avoiding name collisions.

Conference Dinner

The student dinner on the third day was held at the Ataliba restaurant in the city center, 30 kilometers from the venue. I finally got to eat the Brazilian barbecue I had been looking forward to! But I was inexperienced: I filled up on the buffet at the start, and my stomach was nearly overflowing by the time I was halfway through the barbecue... To be honest, it is not that different from Brazilian barbecue in China.

Student Dinner

SIGCOMM Research Hotspots

Next is academic time, summarizing three hot research directions I saw at the conference: high-performance network processing, traffic scheduling, and wireless backscatter.

Hotspot 1: High-performance network processing

At the National Science and Technology Innovation Conference on May 30 this year, speaking after General Secretary Xi, Huawei CEO Ren Zhengfei said: "As we gradually approach the limits of Shannon's theorem and Moore's Law, and with the theory for high-traffic, low-latency networks yet to be created, Huawei feels lost and cannot find a direction... Major innovation is the law of survival in no man's land."

At this year's SIGCOMM, several papers were devoted to network functions with exactly these properties: high traffic volume and low latency. The needs of data center networks change by the day, so the programmability of network functions is becoming ever more important, and finding the right balance between performance and programmability in network processing has become a research hotspot.

Hardware architectures for packet processing fall roughly into three camps: hardware accelerators, network processors, and general-purpose processors (multi-core CPUs, many-core CPUs, GPUs). The hot architectures at this SIGCOMM were programmable switch chips and reconfigurable hardware (FPGAs) on the accelerator side, plus multi-core general-purpose CPUs. On the programming language side, P4 looks set to conquer the world.

Architecture One: Programmable Switch Chip

Nick McKeown, a giant of the networking field and a professor at Stanford University, founded Barefoot Networks, which this year released a programmable switch chip that processes packets at 6.5 Tbps and received $130 million in investment from Google and other companies. The chip is programmed in the open-source P4 language and supports flexible network protocols and forwarding rules.

Anirudh Sivaraman’s programmable switch chip architecture (Image source: Domino slides)

At this year's SIGCOMM, Anirudh Sivaraman of MIT, advised by rising star Mohammad Alizadeh, presented two related papers in the same session, probably a SIGCOMM record. The first uses a subset of C (called Domino) to write "packet transactions" that describe packet-processing behavior and compiles them to an instruction set of "atoms" implemented in a programmable switch chip; the chip can process a billion packets per second, and each atomic instruction completes in under a nanosecond. The second paper builds on the first, proposing a push-in first-out queue (PIFO) that supports enqueueing at an arbitrary position while dequeuing from the head, and uses it to implement a programmable packet scheduler on the switch chip.

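To make this concrete, here is roughly what a Domino-style packet transaction looks like, modeled on the flowlet-switching example the authors use; this is illustrative C rather than exact Domino syntax (Domino writes fields as pkt.field, with the whole function body executing atomically per packet).

```c
#include <stdlib.h>

#define NUM_FLOWLETS 8000
#define THRESHOLD    5      /* flowlet idle gap, in ticks */
#define NUM_HOPS     10

struct Packet { int id, sport, dport, arrival, next_hop; };

static int last_time[NUM_FLOWLETS];   /* per-flow state kept on the switch */
static int saved_hop[NUM_FLOWLETS];

static int hash3(int a, int b, int c) { return abs(a * 31 + b * 17 + c); }

/* The body below is what Domino would treat as one atomic per-packet
 * transaction: the compiler maps the read-modify-write on last_time[] and
 * saved_hop[] onto hardware "atoms". pkt->id is assumed to be a flow index
 * smaller than NUM_FLOWLETS. */
void flowlet(struct Packet *pkt) {
    int new_hop = hash3(pkt->sport, pkt->dport, pkt->arrival) % NUM_HOPS;
    if (pkt->arrival - last_time[pkt->id] > THRESHOLD)
        saved_hop[pkt->id] = new_hop;     /* idle gap passed: re-pick path */
    last_time[pkt->id] = pkt->arrival;
    pkt->next_hop = saved_hop[pkt->id];
}
```
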
An example application of programmable switch chips is UnivMon, by Zaoxing Liu of Johns Hopkins University, which proposes a universal probabilistic data structure (a universal sketch) for monitoring network traffic: flow size distribution statistics, anomaly detection, intrusion detection, and so on. It is finer-grained than sampling-based traffic monitoring and can be implemented on programmable switch chips in P4.

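UnivMon's universal sketch is built from simpler primitives in the same family. For a flavor of how sketch-based monitoring trades a small, fixed memory for approximate counts, here is a textbook Count-Min sketch in C; this is not UnivMon itself, just the general idea.

```c
/* Textbook Count-Min sketch (not UnivMon itself): approximate per-flow
 * packet counts in O(ROWS * COLS) memory, independent of the flow count. */
#include <stdint.h>

#define ROWS 4
#define COLS 1024

static uint32_t cms[ROWS][COLS];

/* Simple per-row hash; a real implementation uses pairwise-independent hashes. */
static uint32_t h(uint32_t key, uint32_t row) {
    return (key * 2654435761u + row * 40503u) % COLS;
}

void cms_update(uint32_t flow_id) {
    for (int r = 0; r < ROWS; r++)
        cms[r][h(flow_id, r)]++;          /* bump one counter per row */
}

uint32_t cms_query(uint32_t flow_id) {
    uint32_t est = UINT32_MAX;
    for (int r = 0; r < ROWS; r++) {      /* taking the min over rows bounds
                                             the overestimate from collisions */
        uint32_t v = cms[r][h(flow_id, r)];
        if (v < est) est = v;
    }
    return est;
}
```
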
Architecture Two: FPGA

Microsoft has led the trend of using FPGAs in data centers. Since Microsoft Research published its paper on accelerating Bing search with FPGAs at ISCA 2014 (a top conference in computer architecture), internet giants such as Microsoft and Baidu have deployed FPGAs at scale in their data centers to accelerate deep learning, storage, networking, and other workloads. Intel acquired FPGA giant Altera for $16.7 billion, hoping to integrate FPGAs into its CPUs and preserve its advantage in data centers and beyond.

Microsoft uses FPGA to accelerate network functions (Image source: ClickNP slides)

At SIGCOMM 2015, Albert Greenberg, director of Microsoft's Azure networking division, received the SIGCOMM Lifetime Achievement Award and in his acceptance speech announced the SmartNIC, which pairs a network card with an FPGA. With SmartNICs, the data plane and part of the control plane of various network virtualization applications can be offloaded onto the FPGA, greatly reducing the load on the CPU. This year's ClickNP is a framework built on this SmartNIC platform: it uses a C-like high-level language to program the FPGA modularly and divides work between CPU and FPGA at fine granularity, so that software developers can easily build network functions with high throughput (100 Gbps, or 200 million packets per second) and low latency (microseconds).

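The paper expresses network functions as "elements" connected by channels, written in a C-like dialect. The sketch below paraphrases that style; the exact ClickNP syntax differs, so treat it as illustrative pseudocode for a pass-through element that counts packets.

```c
/* Paraphrased ClickNP-style element (illustrative, not the exact syntax):
 * an element declares its ports and local state, and its handler moves data
 * between channels; the compiler turns each element into an FPGA pipeline. */
.element Counter <1, 1>     /* 1 input channel, 1 output channel */
{
    ulong count;            /* element-local state, held in FPGA registers */

    .handler {
        if (get_input_port(0)) {
            flit f = read_channel(0);      /* one data flit per clock cycle */
            if (f.sop) count = count + 1;  /* count packets at start-of-packet */
            write_channel(0, f);           /* forward the flit downstream */
        }
    }
}
```
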
FPGAs have a long history in networking: the NetFPGA platform from Stanford University appeared more than a decade ago. But FPGA hardware description languages are hard to write and debug, and have long been out of reach for most software developers. ClickNP uses high-level synthesis (HLS) technology, which has matured in recent years, to let software developers program FPGAs in a high-level language, much as they would a multicore processor.

ClickNP programming model (Image source: ClickNP slides)

Attentive readers may have noticed that the MIT papers mentioned earlier also compile a C-like high-level language to hardware. They compile to the instruction set of a programmable switch chip; we compile to an FPGA. The former offers higher performance but places more restrictions on the program, which suits switches. The latter can express more complex network functions, although FPGA clock frequencies cannot match dedicated chips; FPGAs are suited to accelerating virtualized network functions on servers, such as firewalls, encryption, load balancing, and traffic scheduling, and can also accelerate other workloads such as machine learning.

Coincidentally, SLA-NFV from Professor Bi Jun's laboratory at Tsinghua University also combines FPGA and CPU for network function virtualization, but takes the opposite approach from ClickNP. ClickNP, prizing the FPGA's high throughput and low latency, places network functions on the FPGA first and moves to the CPU the tasks the FPGA handles poorly; SLA-NFV, mindful of the FPGA's limited on-chip resources, places network functions on the CPU first and uses the FPGA for acceleration when the CPU cannot reach the expected performance.

P4FPGA architecture (Image source: P4FPGA slides)

Several other papers at this SIGCOMM used FPGAs. Cornell University's DTP (Datacenter Time Protocol) exploits the physical layer of the data center network to achieve extremely high-precision clock synchronization. To implement the protocol, Han Wang wrote thousands of lines of Bluespec to modify NetFPGA's physical layer. After that warm-up project, he implemented a compiler from P4 to Bluespec, presented as P4FPGA at the NetPL workshop on the first day of the conference. The combination of P4 and FPGAs has drawn industry attention as well: in the industrial demo session, Xilinx showed a compiler from P4 to NetFPGA SUME.

In the poster and demo sessions, the National University of Defense Technology showed a 60 Gbps FPGA-based deep packet inspection system that performs string matching with the Aho-Corasick algorithm, storing its deterministic finite automata (DFA) in off-chip DRAM with an on-chip cache. Xi'an Jiaotong University used an FPGA as a fast path for OpenFlow counters, caching only the counter increments on the FPGA and flushing them in batches to the CPU, where they are stored compressed, thus reducing the FPGA's memory overhead.

Architecture Three: Multicore CPU

As the most general-purpose architecture of all, the CPU refuses to concede on performance. In recent years, high-performance packet processing frameworks such as netmap and DPDK have applied a battery of engineering best practices, bringing the average cost of processing a packet down to a few tens of CPU clock cycles and letting a single core process over ten million packets per second. These practices include polling instead of interrupts; user-space drivers that avoid system calls and memory copies; huge pages and NUMA-aware memory allocation to reduce memory-access cost; cache-line-aligned data structures; lock-free queues; pinning threads to dedicated cores to avoid context switches; and offloading packet fragmentation and checksum computation to the NIC hardware.

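For a flavor of what these frameworks look like in practice, here is a minimal sketch of a DPDK-style poll-mode receive loop, assuming the port and queue were already configured during initialization; it shows two of the practices above, busy-polling and batched I/O.

```c
/* Minimal DPDK-style receive loop (sketch): assumes rte_eal_init(),
 * rte_eth_dev_configure(), and rte_eth_rx_queue_setup() already ran. */
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

static void rx_loop(uint16_t port_id) {
    struct rte_mbuf *bufs[BURST_SIZE];
    for (;;) {   /* busy-poll: no interrupts, no system calls on the fast path */
        uint16_t n = rte_eth_rx_burst(port_id, 0, bufs, BURST_SIZE);
        for (uint16_t i = 0; i < n; i++) {
            /* ... process bufs[i] in place, touching only hot cache lines ... */
            rte_pktmbuf_free(bufs[i]);   /* return the buffer to the pool */
        }
    }
}
```
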
At this year's SIGCOMM, several papers, posters, and demos used DPDK to process packets efficiently on the CPU.

PISCES compiles P4 programs into Open vSwitch (Image source: PISCES slides)

PISCES, from Nick Feamster's group at Princeton University, compiles P4 programs into C code inside Open vSwitch (the most popular open-source virtual switch), removing much of the pain of extending Open vSwitch. Previously, adding support for a TCP flag to Open vSwitch meant modifying 20 files and 370 lines of code; with PISCES, it takes 4 lines of P4.

ESwitch, from Ericsson Research, proposes a virtual switch architecture that compiles OpenFlow (the most popular network control-plane protocol) down to x86 machine code. Open vSwitch's approach is to cache established connections, but packets that miss the cache suffer slow flow-table lookups, which also opens the door to denial-of-service attacks. ESwitch instead specializes the switch's internal flow tables automatically: even though the user-specified OpenFlow rules live in one large table with complex functionality, ESwitch splits it into several small, single-purpose tables, each responsible for one kind of matching, improving the forwarding performance of the whole data plane by several times to several hundred times.

In the demo session, Eötvös Loránd University of Hungary demonstrated a compiler from P4 to DPDK and to a Freescale network processor, processing 10 million packets per second per core.

dpdkr poster

Politecnico di Torino of Italy proposed the dpdkr network processing framework, which adds direct communication pipes between virtual machines to DPDK. The pipes are completely transparent to applications and to the OpenFlow controller, letting virtual machines on the same physical host communicate while bypassing Open vSwitch for speed.

Hotspot 2: Traffic Scheduling

Ever since Van Jacobson's TCP congestion control protocol at SIGCOMM 1988, congestion control and traffic scheduling have been perennial topics in networking. One notable invention of my advisor Tan Kun is the CTCP congestion control protocol, shipped in Windows from Vista onward. Over the past decade, the focus of congestion control and traffic scheduling research has shifted from wide area networks to data centers.

Data Center Congestion Control and Traffic Scheduling

Heterogeneous applications in virtualized data centers (Image source: 2DFQ slides)

Different applications in a data center have different bandwidth and latency requirements; answering a search query is clearly more urgent than backing up logs in the background. To satisfy these diverse needs with limited bandwidth, servers must decide how fast to send, and switches must decide the order in which packets from different connections are queued and, when multiple paths exist, which route each packet takes. These correspond to three research areas, congestion control, traffic scheduling, and load balancing, in which thousands of papers have been published; in recent years more and more papers combine all three.

A dazzling array of congestion control and traffic scheduling protocols (Image source: NUMFabric slides)

Congestion control protocols do not always coexist fairly: when DCTCP shares a link with traditional TCP Cubic, for example, DCTCP "rudely" grabs most of the bandwidth. Virtual machines belonging to different customers in a data center may run different operating systems and network stacks, so how can fairness between them be guaranteed? And if a customer's virtual machine runs an old congestion control protocol, can it be made to use a newer protocol better suited to data center networks without asking the customer to upgrade?

Translation of congestion control protocols (Image source: Virtualized Congestion Control paper)

Two similar papers at this year's SIGCOMM independently posed and solved this problem: Virtualized Congestion Control (VCC), from Stanford University and VMware, and AC/DC TCP, from the University of Wisconsin-Madison and IBM Research. Their shared idea is to translate congestion control in the virtualization layer (the virtual switch), converting whatever protocol a virtual machine uses into one unified protocol. Candidate translation mechanisms include directly reading and modifying guest memory, rewriting TCP headers, buffering, generating fake TCP ACKs, and TCP proxies; the VCC paper compares their pros and cons.

The primary goal of congestion control is fairness among connections, while the primary goal of traffic scheduling is to maximize overall network utility. In traditional congestion control, the sender adjusts its rate from congestion feedback and needs multiple round trips to converge to the optimal rate, so bandwidth is underutilized until convergence. If each connection carries only a few packets (a web page visit, say), network bandwidth utilization stays low.

NUMFabric, from a Stanford team working with Mohammad Alizadeh, is a fair bandwidth allocation protocol that converges quickly. In NUMFabric the sender specifies each flow's weight rather than its rate, and switches schedule with Weighted Fair Queueing (WFQ), which guarantees weighted max-min fairness across the network (satisfy smaller demands first, then share the remaining bandwidth among the unsatisfied in proportion to their weights). On top of this, NUMFabric dynamically adjusts flow weights, converging quickly to the allocation that maximizes network utility.

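Weighted max-min itself is easy to state in code. The routine below is an illustrative single-link water-filling computation, not NUMFabric's actual distributed algorithm: it satisfies flows whose demand is below their weighted share, then re-divides the leftover among the rest.

```c
/* Illustrative weighted max-min (water-filling) on one link; n <= 16. */
#include <stdbool.h>

void weighted_max_min(double cap, const double *demand, const double *weight,
                      double *alloc, int n) {
    bool done[16] = {false};
    int remaining = n;
    while (remaining > 0) {
        double wsum = 0;
        for (int i = 0; i < n; i++) if (!done[i]) wsum += weight[i];
        double share = cap / wsum;            /* capacity per unit of weight */
        bool progress = false;
        for (int i = 0; i < n; i++) {
            if (!done[i] && demand[i] <= share * weight[i]) {
                alloc[i] = demand[i];         /* small demand: fully satisfied */
                cap -= demand[i];
                done[i] = true; remaining--; progress = true;
            }
        }
        if (!progress) {                      /* nobody satisfiable: split the
                                                 leftover in weight proportion */
            for (int i = 0; i < n; i++)
                if (!done[i]) { alloc[i] = share * weight[i]; done[i] = true; }
            remaining = 0;
        }
    }
}
```
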
Ideal scheduling and scheduling generated by 2DFQ, WFQ, WF2Q (Image source: 2DFQ paper)

Although Weighted Fair Queueing guarantees fairness in traffic, from the application's point of view it can make service bursty. As panels (c) and (d) of the figure above show, two large requests (database scans, say) can occupy two CPU cores while small requests (such as primary-key lookups) are temporarily starved, and small-request latency rises sharply. If request latency is predictable, the ideal schedule is panel (a): one core handles the small requests and the other handles the large ones. 2DFQ, from Microsoft in collaboration with Brown University, is exactly such a scheduling policy. With 2DFQ, as long as most request latencies are predictable (unpredictable ones are treated as large), the response time of Microsoft Azure cloud storage becomes much more stable.

Request response time after using WFQ, WF2Q, 2DFQ (Image source: 2DFQ slides)

The talk for this paper visualized WFQ beautifully; I personally consider it one of the best presentations at this year's SIGCOMM and recommend that interested readers download the slides from the conference homepage (the video, once released, will be better still).

Coflow concept (Image source: CODA slides)

In 2012, Ion Stoica's group at the University of California, Berkeley proposed the concept of the coflow: a distributed job consists of several parallel data flows, and the next stage of computation cannot begin until all of them finish. What matters is therefore not the completion time of each individual flow but that of the coflow as a whole. Specifying coflow information, however, requires modifying existing software. At this year's SIGCOMM, Zhang Hong of the Hong Kong University of Science and Technology proposed CODA (COflows in the DArk), which identifies coflows automatically from network traffic and whose scheduler tolerates a degree of misidentification, so existing software needs no modification at all.

Another main-conference paper on traffic scheduling also came from Professor Chen Kai's group at the Hong Kong University of Science and Technology. Some flows in a data center have completion deadlines and others do not; naively giving deadline flows high priority starves everyone else. Chen Li's Karuna paper observes that a deadline flow does not need all the bandwidth, only enough to finish on time; the rest can go to the deadline-free flows, minimizing their completion times.

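The arithmetic at the heart of that observation is one line; the sketch below is just the intuition, not Karuna's actual mechanism (which splits flows by type and schedules them across priority queues).

```c
/* Illustrative: the minimal rate that still meets a flow's deadline.
 * Everything above this rate can safely be given to deadline-free flows. */
double minimal_deadline_rate(double remaining_bytes, double seconds_left) {
    return remaining_bytes / seconds_left;   /* e.g. 10 MB with 2 s left
                                                => 5 MB/s suffices */
}
```
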
Wide Area Network Traffic Engineering

The papers above all concern traffic scheduling inside a data center, but wide area networks spanning data centers need traffic engineering too. At SIGCOMM 2013, Google's B4 and Microsoft's SWAN shared their experience with logically centralized traffic engineering for large-scale wide area networks.

PieBridge (Magpie Bridge) poster

At this year's SIGCOMM, the PieBridge (Magpie Bridge) system, a collaboration between Zhang Yuchao of Tsinghua University and Baidu, uses centrally scheduled P2P transfers to synchronize massive datasets efficiently across data centers.

Huawei ADN demo

ADN (Application Driven Network), from Huawei's Future Network Lab, was the only demo at this SIGCOMM to bring three workstations on site; they packed them in suitcases and were exhausted from hauling them around. A carrier network hosts many kinds of applications with different quality-of-service requirements (high bandwidth, low latency, guaranteed bandwidth and latency) that draw on several kinds of network resources (wireless, wide area network, data center network). ADN partitions the physical network into several virtual networks, maps applications onto them, and uses customized topology, routing protocols, and traffic scheduling within each virtual network to meet the quality-of-service needs of heterogeneous applications.

Traffic Scheduling and Economics

Traffic scheduling is not only a technical problem, but also an economic one. Several papers at this SIGCOMM conference explored new directions for traffic scheduling from an economic perspective.

In cross-data-center communication, what incentive do customers have to declare their traffic's true priority, bandwidth guarantee, and deadline? Microsoft found that 81% of wide-area-network customers are willing to delay transfers in exchange for lower prices, and that customers will accept dynamic pricing if tariffs, bandwidth, and deadlines are guaranteed when a transfer starts. Building on this, Pretium proposes traffic engineering based on dynamic pricing, giving users an economic incentive to declare their traffic's true requirements.

Video makes up a large share of Internet traffic, and some Internet Service Providers (ISPs) have violated network neutrality by quietly policing video traffic. At this year's SIGCOMM, a Google paper surveyed the prevalence and impact of traffic policing worldwide. Analyzing 270 TB of traffic and 800 million HTTP requests served by Google CDN servers over 7 days, they found that about 7% of connections were policed, and that policed connections suffered a packet loss rate 6 times that of normal connections, significantly degrading video playback quality. Google recommends that ISPs adopt traffic shaping instead of policing, and that content servers rate-limit and pace their own sending.

A SIGCOMM paper this year from the Chinese Academy of Sciences, Carnegie Mellon University, and iQIYI dovetails with Google's advice to content servers. CS2P selects the best video bitrate using machine-learned end-to-end bandwidth prediction. In the training phase it clusters user sessions by their features, then trains a Hidden Markov Model (HMM) within each cluster to predict end-to-end bandwidth. Online, it picks the initial bitrate according to the cluster the session's features fall into, then adapts the bitrate using feedback and the HMM.

Network Cookies working principle (Image source: Network Cookies slides)

Stanford University argues that rather than requiring ISPs to treat all traffic identically in the name of network neutrality, it is better to give users a choice of differentiated services, since only users know what quality of service each application needs. Cloud sync and software updates, for example, are normally background traffic, but an urgently needed file deserves high priority; video chat needs guaranteed bandwidth and latency as low as possible. The paper proposes a cookie-based design in which users tag each application's network requests and network devices provide differentiated quality of service according to the tags.

Hotspot 3: Wireless Backscatter

Although only 5 wireless papers appeared at this year's SIGCOMM, two of them won best-paper awards (three papers won in total).

Wireless communication requires much more energy than sensing (Image source: Interscatter slides)

Power consumption is often the biggest problem for wireless devices, and we want to reduce it while keeping data rates high. The dominant power cost in a wireless device is generating radio-frequency signals, so researchers in recent years have looked to harvest the energy of the electromagnetic waves already all around us, communicating by reflecting them.

Reflective TV signal communication demo (Image source: YouTube Ambient Backscatter demo)

In 2013, researchers at the University of Washington achieved low-rate communication between battery-free devices by harvesting and reflecting ambient TV signals, winning that year's SIGCOMM best paper award. The devices harvest hundreds of microwatts from the signal to power their chips, encode information onto the reflected signal, and transmit it; a dedicated gateway device receives and decodes. At SIGCOMM 2014, the same team increased the data rate of reflected TV signals 100-fold and the range 8-fold.

Also at SIGCOMM 2014, the team developed a technique for communicating by reflecting WiFi signals, modulating WiFi channels to communicate, slowly, with commodity WiFi access points, with no dedicated gateway required. At NSDI 2016, they went further with Passive WiFi, which reflects a continuously transmitted signal in the environment and can talk to commodity access points using the standard 802.11b protocol at 10,000 times lower power than ordinary WiFi chips. Passive WiFi was first prototyped on an FPGA, then turned into a chip, and is now being commercialized.

Applications of Interscatter (Image source: Interscatter paper)

At this year's SIGCOMM, the University of Washington's latest feat, Interscatter, achieves backscatter communication between different wireless protocols using only commodity devices, eliminating the dedicated signal generator that Passive WiFi still required. Implanted devices can thereby reflect Bluetooth signals to produce WiFi signals, enabling the three science-fiction scenarios in the figure: (a) a smart contact lens that measures medical indicators, (b) a brain-computer interface, (c) a credit card that communicates by reflecting a phone's Bluetooth signal. A Bluetooth device transmits a single-tone Bluetooth signal; the Interscatter device reflects it to one side of the carrier frequency, producing an 802.11b WiFi carrier, and modulates data onto it. As in the earlier work, the prototype was implemented on an FPGA.

Pengyu Zhang and Pan Hu of the University of Massachusetts also presented FS-Backscatter at this year's SIGCOMM, a practical backscatter technique for low-power sensors. To keep the reflected signal from interfering with the original, Interscatter shifts Bluetooth reflections into the WiFi band, whereas FS-Backscatter shifts the signal into an adjacent idle band. FS-Backscatter demonstrates backscatter for both WiFi and Bluetooth and, like Interscatter, needs no extra hardware.

Comparison of active, backscatter, and passive wireless communication (Image source: Braidio slides)

Pan Hu, Pengyu Zhang, and colleagues had another paper at this year's SIGCOMM, Braidio, which saves energy by switching dynamically among traditional active radio, backscatter, and passive reception according to the huge differences in battery capacity across devices. Active radio is power-hungry for both sender and receiver but carries far; backscatter is very cheap for the sender but expensive for the receiver and only works at short range; passive reception has the opposite power profile to backscatter and carries relatively far. Braidio picks the radio's operating mode based on each device's remaining energy and the communication distance.

Before closing the research-hotspots section, I want to share one last piece of cutting-edge work from Microsoft Research: ProjecToR. In a traditional data center the links between racks are fixed: most inter-rack bandwidth sits idle while a few rack pairs run short. SIGCOMM has therefore seen a string of papers on reconfigurable inter-rack interconnects: Helios (2010) and Mordia (2013) used optical switches, Flyways (2011) and 3D beamforming (2012) used 60 GHz radio, and FireFly (2014) and this year's ProjecToR use free-space laser communication.

ProjecToR principle diagram (Image source: ProjecToR slides)

The science-fiction part of ProjecToR is its use of a Digital Micromirror Device (DMD) to steer lasers. A DMD is an array of hundreds of thousands of mirrors, each about 10 micrometers across. Each mirror's orientation is fixed, and whether it reflects light is controlled by a memory cell, so rewriting the memory changes the DMD's overall reflection direction, as if a mirror had rotated. Each rack top carries several laser transmitters and receivers, and DMDs suspended above the data center can redirect reflections to establish an optical channel between any two racks. ProjecToR dedicates some lasers to a fixed topology and uses the rest for dynamically adjusted opportunistic links, designing routing and traffic scheduling algorithms for this ever-changing topology.

How SIGCOMM papers are made

Although I am the first author of the ClickNP paper and gave the first talk of the conference, more of the honor belongs to my advisor, Senior Researcher Tan Kun.

In May 2013, I interviewed for the joint training program between USTC and Microsoft Research Asia, which each year recruits about 20 third-year USTC undergraduates for a one-year internship, during which they also complete their bachelor's theses at Microsoft. After two months, about 7 interns are selected as joint-training PhD students, who spend their first year taking courses at USTC and the next four doing research at Microsoft Research Asia. In my senior year, advised by Dr. Tan Kun, a senior researcher in the Wireless and Networking Group, I worked on a virtualized-network-functions project, proposed a fault-tolerant software framework for programmable routers, and explored flow-table translation for programmable routers. None of these three projects produced a paper, but they laid my foundations in data center networking, network function virtualization, and programmable routers.

Fate is always unpredictable: when I joined the program I certainly did not know I would spend my PhD on FPGA programming. In July 2015, after finishing my first year of graduate courses and returning to Microsoft Research Asia to continue the joint PhD program, Dr. Tan Kun told me: you are in charge of the ClickNP project. He had already designed ClickNP's basic elements and channels, and had personally written a thousand lines of compiler code and several elements. Dr. Luo Layong, our group's FPGA expert, had figured out the quirks of Altera OpenCL, a high-level synthesis (HLS) tool, and written a packet generator. My advisor assigned two fellow students to the project with me, Luo Renqian from the USTC computer science department and Peng Yanqing from Shanghai Jiao Tong University's ACM class; nearly half the interns in our group were put on ClickNP. A fellow student from Beihang University was responsible for integrating the PCIe channel into the OpenCL framework.

The success of ClickNP owes as much to the tide of history as to the team's efforts. With general-purpose processors hitting the limits of Moore's Law while data centers keep growing in scale and user needs keep growing in flexibility, Microsoft's answer is programmable hardware: the FPGA. Microsoft developed the Catapult Shell as an operating system for FPGAs, used it to accelerate Bing search, networking, and storage, and shared the results with academia in several papers. The Catapult team also worked with Altera on an OpenCL board support package (BSP) for the Catapult Shell, so that the Altera OpenCL framework can program these FPGAs; this is the foundation ClickNP builds on.

Microsoft Catapult Project Homepage

Tan Kun's guidance proceeded step by step. First he had us port Dr. Luo Layong's packet generator to ClickNP as its first application. It sounded simple but proved troublesome in practice: the OpenCL tools were immature and we kept hitting bugs, so we recorded every pitfall in a shared document.

In the second stage, he guided us to implement various network functions individually: Luo Renqian took the hash table, Peng Yanqing took a more powerful packet generator, and I took adding new syntactic sugar to the compiler and porting some basic elements from Click.

The hash table was our first stateful network function. At first, not understanding how the OpenCL compiler worked, we found that a slight change to a piece of code could wreck its performance. I ran a series of microbenchmarks and distilled guidelines for writing high-performance OpenCL code; some of these were later built into the compiler as automatic optimizations.

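As one illustrative reconstruction of such a guideline (not code from the paper): a read-after-write dependency on the same on-chip memory prevents the HLS tool from pipelining at one input per cycle, and forwarding the in-flight write through registers, the germ of the "delayed write" optimization mentioned later, removes the stall.

```c
/* Illustrative only: forward the most recent write through registers so the
 * pipeline need not wait for the block-RAM write to land before the next read. */
typedef unsigned int uint;

static uint state[1024];       /* on-chip block RAM */
static uint last_idx = ~0u;    /* registers holding the one in-flight write */
static uint last_val;

uint increment(uint idx) {
    /* Read with forwarding: if the previous call wrote this slot, take the
     * value from the register instead of re-reading the RAM. */
    uint v = (idx == last_idx) ? last_val : state[idx];
    if (last_idx != ~0u)
        state[last_idx] = last_val;   /* commit the delayed write */
    last_val = v + 1;                 /* stage the new write in registers */
    last_idx = idx;
    return v + 1;
}
```
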
Implementing the TCP checksum posed a problem: the checksum can only be computed after the whole packet has been read, yet it must be written into the packet header, so the entire packet has to be buffered inside the element. While we were at a loss, Tan Kun came up with the design of connecting two elements with two channels and buffering the packet inside a channel, which solved the problem elegantly. To reach the expected performance, the two channels must be readable and writable in the same cycle, so we made the biggest change to the ClickNP language so far: we abandoned direct function calls and wrote a simple C-syntax parser that generates intermediate C code.

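A hedged sketch of that design, in the same paraphrased ClickNP style as above (names like fold16 and finalize are illustrative): the first element folds the checksum while streaming the packet into a channel deep enough to hold it, and emits the final sum on a second channel, so the downstream element can patch the header before forwarding.

```c
/* Illustrative two-channel checksum design (not the paper's exact code).
 * Channel out(0) carries the packet, buffered in a deep channel;
 * channel out(1) carries the 16-bit checksum, once per packet. */
.element CsumCompute <1, 2>
{
    uint sum;

    .handler {
        flit f = read_channel(0);
        sum += fold16(f.data);        /* accumulate one's-complement words */
        write_channel(0, f);          /* out(0) is deep enough to hold a
                                         whole packet in flight */
        if (f.eop) {                  /* end of packet: emit the final sum */
            write_channel(1, finalize(sum));
            sum = 0;
        }
    }
}

/* Downstream, a CsumFill element reads channel 1 first, then rewrites the
 * checksum field in the header flit as the packet streams out of channel 0. */
```
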
The second stage ended in September, and we each moved on to new network functions. Luo Renqian took lookup tables, including IP prefix matching and TCAM, to build an OpenFlow firewall; TCAMs consume enormous FPGA resources, so, inspired by the ServerSwitch I had studied during my senior-year internship at Microsoft, I designed HashTCAM. Peng Yanqing took network virtualization acceleration, namely encapsulation and decapsulation for the NVGRE tunneling protocol. I kept adding syntactic sugar to the compiler as the team needed it, worked out how to use off-chip DRAM from OpenCL, and implemented rate limiting and packet capture. When these functions were done in mid-October, we felt the implementation was nearly complete and we could start writing the paper; our advisor said we were still far from it.

Around this time we learned Vivado HLS through a Xilinx training session, which sparked a debate over OpenCL versus Vivado HLS. In the end we concluded that OpenCL's programming model is ill-suited to ClickNP's streaming packet processing, so we made ClickNP a cross-platform framework, independent of OpenCL's programming model, with either Altera OpenCL or Vivado HLS as the backend.

ClickNP Architecture

The fourth stage was implementing still more network functions. With more convenient ClickNP syntax and a richer element library, we grew increasingly fluent, and the time to develop a network function shrank from a month to a week. Luo Renqian kept optimizing the firewall, Peng Yanqing implemented sketch-based port-scan detection (not included in the paper), and I implemented the AES and SHA-1 algorithms needed by the IPSec data plane, as well as the pFabric packet scheduler.

During this stage, high-performance CPU-FPGA communication came as an unexpected bonus. The PCIe channel was originally a stop-and-wait protocol, meant only for the CPU to send control signals to the FPGA, but I noticed its send and receive paths were actually full-duplex, and a small modification turned it into a pipelined mode, raising throughput and also letting the FPGA initiate messages to the CPU. I implemented the full-duplex PCIe channel with the Beihang student, and added automatic batching to the runtime libraries on both CPU and FPGA: large batches and high throughput under heavy load, small batches and low latency under light load.

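The batching logic itself is simple; the following is a hypothetical sketch, not ClickNP's actual runtime code, with dma_busy and dma_start standing in for the real PCIe primitives.

```c
/* Load-adaptive batching over a PCIe channel (hypothetical sketch): while
 * the DMA engine is busy, messages accumulate; when it frees up, everything
 * queued departs as one batch. Light load => batches of ~1 (low latency);
 * heavy load => large batches (high throughput), with no tuning knob. */
#define MAX_BATCH 256

struct msg { char data[64]; };

/* Assumed primitives for the underlying DMA engine. */
extern int  dma_busy(void);
extern void dma_start(const struct msg *batch, int n);

static struct msg queue[MAX_BATCH];
static int pending = 0;

void channel_send(struct msg m) {
    queue[pending++] = m;
    if (!dma_busy() || pending == MAX_BATCH) {
        dma_start(queue, pending);   /* one PCIe transaction for the batch */
        pending = 0;
    }
    /* A completion handler (omitted) flushes anything still pending when
     * the DMA engine finishes. */
}
```
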
Tan Kun elevated this communication mechanism into a principle and proposed the concept of the CPU element: an element can be compiled to either the CPU or the FPGA, and elements on the two sides communicate efficiently, giving ClickNP a feature that clearly distinguishes it from other FPGA packet-processing frameworks. Several SIGCOMM reviews praised the joint FPGA/CPU processing:

  • I like the ability to communicate rapidly between SW and HW, and to move functionality across that boundary.
  • Very nice modularization design using elements and host <-> FPGA PCIe channels.
  • Of particular note is the support for partitioning work between the CPU and the FPGA such that each can be utilized for the tasks best allocated to them.

The fourth stage ended in December, one month before the deadline, with the evaluation still undone. Our advisor wanted an application that would showcase high-performance CPU-FPGA communication, and we settled on a layer-4 load balancer that runs the backend-server allocation logic for each new connection on the CPU.

Performance comparison of the ClickNP layer-4 load balancer and Linux Virtual Server (log-scale vertical axis)

The fifth stage was the evaluation: first measure the performance of the network functions on the CPU, then the performance of ClickNP on the FPGA, and finally each element's speedup over the CPU and its resource overhead relative to native Verilog on NetFPGA. We badly underestimated this stage. First, each FPGA compilation takes hours, and the packet-generation and measurement programs needed frequent changes. Second, we had no DPDK experience, and it took a long time to find its performance bottlenecks and to make DPDK Click scale across cores. Third, we hit difficulties configuring StrongSwan. Finally, implementing network functions on NetFPGA and compiling the NetFPGA SUME code was its own ordeal.

Because we had budgeted too little time, the result graphs for most applications were produced only in the final week. Unable at the time to find the cause of an occasional deadlock with large packets in the load balancer, we ran all experiments with small packets and rushed out the result graphs on the last night. Our advisor admonished us to produce experimental graphs as early as possible, and to think through what each graph is meant to show before running the experiment; some experiments are simply unnecessary, and it is no disaster if some cannot be completed.

In the final two weeks, Tan Kun weighed every word as he wrote the paper; I am ashamed to say I still cannot write SIGCOMM-level prose. The "Optimizations" section, two pages of the paper, he pondered and revised for five days, during which I also came up with a new optimization, delayed write. That section drew the most comments in the SIGCOMM reviews, a sign that it genuinely interested the reviewers.

At the moment of the SIGCOMM deadline, the ClickNP repository passed a thousand commits, about 20,000 lines of code. In eight months, the team had done roughly three things: first, designed and implemented the ClickNP language and toolchain; second, designed and implemented nearly a hundred high-performance network-function elements on the FPGA; third, built and evaluated five FPGA-accelerated network applications. If there is any real innovation, it is the fine-grained division of labor and high-performance communication between FPGA and CPU. Although ClickNP is the first framework to implement general high-performance network functions on FPGAs in a high-level language, I remain embarrassed that we only added a small layer on top of existing HLS tools; high-level-language FPGA programming still faces many difficulties.

While preparing the SIGCOMM talk, my advisor organized three rehearsals with the whole group and gave constructive suggestions on almost every slide. No one besides Tan Kun has ever coached my presentations in such detail.

I demonstrated ClickNP to Dean Hong Xiaowen at the Microsoft Student Technology Festival

In July this year, ClickNP was praised by Dr. Hong Xiaowen, Microsoft Senior Vice President and head of Microsoft Research Asia, at the Microsoft Student Technology Festival, where it won the Best Demo Award. In August, an HTTPS accelerator built on the ClickNP platform took second place worldwide in the "Cloud & Enterprise" category of the Microsoft Hackathon. At SIGCOMM, ClickNP was not only scheduled as the first talk but was also cited by two other talks at the conference and by another paper.

This year's SIGCOMM Lifetime Achievement Award winner, Jim Kurose, said in his acceptance speech that when choosing a research topic you should think hard about what fundamental problem you are solving, how many people will care about it in five to ten years, and where your own advantage lies. I think using programmable hardware to accelerate data centers fits Kurose's criteria well: the performance bottleneck of general-purpose processors calls for new architectures, and new architectures call for innovation in programming languages; Microsoft, as a pioneer of FPGA-accelerated data centers, has opened a blue ocean for academia and industry alike. FPGAs are no panacea and are still hard to use, but I am full of hope for this highly parallel, highly flexible architecture. I ended my SIGCOMM talk with a slogan: cross the memory wall and reach a fully programmable world.
