2017-08-04
Open Source Technology Event LinuxCon Debuts in China, Gathering Industry Giants to Focus on Industry Trends

(Reposted from Microsoft Research Asia)

From June 19 to 20, 2017, the open source technology event LinuxCon + ContainerCon + CloudOpen (LC3) was held in China for the first time. The two-day agenda was packed, with 17 keynote speeches, 88 technical talks across 8 tracks, and technical exhibitions and hands-on labs from companies including Microsoft. LinuxCon drew international and domestic internet and telecom giants and thousands of industry professionals, including Linux founder Linus Torvalds, to discuss industry trends.

Read More

2017-08-03
Implementing ClickNP: Highly Flexible and High-Performance Network Processing with FPGA + CPU

The First Asia-Pacific Workshop on Networking (APNet’17) Invited Talk:

Implementing ClickNP: Highly Flexible and High-Performance Network Processing with FPGA + CPU

Abstract: ClickNP is a highly flexible and high-performance network processing platform on reconfigurable hardware, published at SIGCOMM'16. This talk shares our experience implementing the ClickNP system, both before and after the paper submission. Over 8 months, we developed 100 elements and 5 network functions for the SIGCOMM paper, producing 1K commits and 20K lines of code. Since the submission, ClickNP has continued to grow into a general-purpose FPGA programming framework within our research team, now comprising 300 elements, 86 application projects, and 80K lines of code.

(1) Even with high-level languages, programming FPGAs is still much more challenging than programming CPUs. We struggled to understand the behavior and pitfalls of black-box compilers, and we share our findings through the coding style enforced in the ClickNP language design and the optimizations provided in the ClickNP compiler.

(2) The OpenCL host-to-kernel communication model is a poor fit for network processing. This talk elaborates on the internals of the high-performance communication channel between CPU and FPGA.

(3) FPGA compilation takes hours, run-time debugging is hard, and simulation is inaccurate. As a case study, we show how we identified and resolved a deadlock bug in the L4 load balancer using ClickNP's debugging facilities.

[Conference Website] [Slides]

Read More

2017-08-03
MELO: Memory Efficient Loss Recovery for Hardware-based Transport in Datacenter

Constrained by small on-chip memory, hardware-based transports typically implement go-back-N loss recovery, which consumes very little memory but is well known to perform poorly even at small packet loss ratios. We present MELO, an efficient selective-retransmission mechanism for hardware-based transport that consumes only a small constant amount of memory regardless of the number of concurrent connections. Specifically, MELO employs an architectural separation between data and metadata storage and uses a shared bits pool allocation mechanism to reduce the on-chip memory footprint of metadata. By adding on average only 23 B of extra on-chip state per connection, MELO achieves up to 14.02x higher throughput and reduces the 99th-percentile flow completion time (FCT) by 3.11x compared with go-back-N under certain loss ratios.
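As a toy illustration of the shared-bits-pool idea (a hypothetical, much simplified Python sketch, not the paper's actual hardware design): selective-repeat bitmaps are drawn from one shared pool and assigned to a connection only while it is actually recovering from loss, so idle connections consume no bitmap memory.

class SharedBitsPool:
    """Toy sketch of a shared bits pool (hypothetical, simplified).

    Instead of reserving a full out-of-order bitmap per connection,
    bitmap blocks are allocated from a shared pool only for connections
    that are currently recovering from loss."""

    def __init__(self, num_blocks: int, block_bits: int = 64):
        self.free_blocks = list(range(num_blocks))
        self.block_bits = block_bits
        self.state = {}  # connection id -> [block id, bitmap]

    def mark_out_of_order(self, conn: int, offset: int) -> bool:
        """Record an out-of-order arrival at bit `offset` past the hole."""
        if conn not in self.state:
            if not self.free_blocks:  # pool exhausted:
                return False          # fall back to go-back-N
            self.state[conn] = [self.free_blocks.pop(), 0]
        self.state[conn][1] |= 1 << (offset % self.block_bits)
        return True

    def release(self, conn: int) -> None:
        """Return the block to the pool once the hole is repaired."""
        block, _ = self.state.pop(conn)
        self.free_blocks.append(block)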

Read More

2017-01-10
A Sea Change in Chip Architecture: How to Evaluate Microsoft's Use of FPGAs in Data Centers?

(This article was first published on Zhihu, and then reposted on Microsoft Research Asia)

We are not using FPGAs to replace CPUs; rather, we use FPGAs to accelerate the computing tasks they are suited for, while other tasks still run on the CPU, letting FPGAs and CPUs work together.

This answer will cover three questions:

  1. Why use FPGAs? What are their characteristics compared with CPUs, GPUs, and ASICs (application-specific chips)?
  2. Where are Microsoft's FPGAs deployed? How do FPGAs communicate with each other and with CPUs?
  3. What role should FPGAs play in future cloud computing platforms? Are they just compute accelerator cards like GPUs?
Read More

2016-12-28
What advantages does the BBR algorithm in Linux kernel 4.9 have over previous TCP congestion control algorithms?

(This article was first published on Zhihu)

@Gao Yifan of the USTC LUG has deployed the TCP BBR congestion control algorithm from Linux 4.9 on the LUG HTTP proxy server. The measured download speed from USTC's China Mobile egress to a DigitalOcean server in Singapore increased from 647 KB/s to 22.1 MB/s (screenshot below).

(At the request of the experts in the comment section, here is the test environment: BBR was enabled on the server in Singapore, which is the data sender. This server is an HTTP proxy for accessing resources outside the firewall. The path between USTC's China Mobile egress and DigitalOcean is the public internet, not a dedicated line. The USTC Mobile egress is 1 Gbps with no rate limit (but shared with others), and DigitalOcean is capped at 200 Mbps. The RTT is 66 ms. The results are this good because most people use TCP CUBIC (Linux) / Compound TCP (Windows), and under a given packet loss rate TCP BBR is more aggressive and grabs more of the shared public bandwidth. It therefore feels somewhat unethical.)

The TCP BBR congestion control algorithm, submitted by Google to the Linux mainline and published in ACM Queue, continues Google's research tradition of deploying in production first, then open-sourcing and publishing. TCP BBR has already been deployed on YouTube servers and on Google's internal wide area network (B4).

TCP BBR aims to solve two problems:

  1. Fully utilize bandwidth on network links with a certain packet loss rate.
  2. Reduce the buffer occupancy rate on the network link to reduce latency.

The goal of TCP congestion control is to make full use of the bottleneck link's bandwidth. A network link is like a water pipe, and the best way to use it is to keep it full of water, that is:
The amount of water in the pipe = the volume of the pipe = the cross-sectional area of the pipe × the length of the pipe
In network terms:
The number of unacknowledged packets in the network = the number of packets the link can hold = link bandwidth × round-trip delay
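As a quick worked example, this product (the bandwidth-delay product) can be computed directly; here is a minimal Python sketch using the WAN numbers discussed below:

def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: link bandwidth x round-trip delay."""
    return bandwidth_bps * rtt_s / 8  # /8 converts bits to bytes

# A 100 Mbps link with a 100 ms RTT holds about 1.25 MB in flight:
print(bdp_bytes(100e6, 0.100))  # -> 1250000.0 bytes = 1.25 MB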

TCP maintains a send window to estimate how many packets the network link can currently hold. As long as it has data to send, it transmits one new packet for each acknowledgment that comes back, keeping that many packets in flight at all times.

TCP and water pipe analogy (Image source: Van Jacobson, Congestion Avoidance and Control, 1988)
How to estimate the volume of the water pipe? One method that everyone can think of is to keep pouring water in until it overflows. The congestion control algorithm in standard TCP is similar: keep increasing the send window until packets start to be lost. This is the so-called “additive increase, multiplicative decrease”, that is, slowly increase the send window when an acknowledgment message is received, and quickly decrease the send window when a packet is lost.
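A minimal Python sketch of this rule, with the classic illustrative constants (grow by one packet per round trip, halve on loss):

def aimd_update(cwnd: float, packet_lost: bool) -> float:
    """One round trip of additive increase, multiplicative decrease."""
    if packet_lost:
        return max(cwnd / 2, 1.0)  # multiplicative decrease: halve on loss
    return cwnd + 1.0              # additive increase: +1 packet per RTT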

There are two problems with the standard TCP approach:

First, it assumes that all packet loss in the network is due to congestion (the buffer of the network device is full, so some packets have to be dropped). In fact, there may be transmission errors causing packet loss in the network, and congestion control algorithms based on packet loss cannot distinguish between congestion loss and error loss. Inside the data center, the error packet loss rate is on the order of one in a hundred thousand (1e-5); on the wide area network, the error packet loss rate is generally much higher.

More importantly, for "additive increase, multiplicative decrease" to work properly, the error loss rate must be inversely proportional to the square of the send window. Delay inside a data center is typically 10-100 microseconds and bandwidth 10-40 Gbps; multiplying the two gives a steady-state send window of 12.5 KB to 500 KB. On the wide area network, the bandwidth may be 100 Mbps and the delay 100 milliseconds, giving a steady-state send window of about 1.25 MB. Since the WAN send window is 1-2 orders of magnitude larger than in the data center, the error loss rate must be 2-4 orders of magnitude lower for TCP to work properly. Standard TCP therefore converges to a very small send window on long fat pipes (links with high delay and high bandwidth) that have a non-negligible error loss rate. This is one reason why a download can be very slow, or even stall partway through, even when both client and server have plenty of bandwidth and the carrier's core network is not saturated.
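The inverse-square relation above is the well-known approximation of Mathis et al. (1997), under which a loss-based TCP sustains a window of roughly 1.22/sqrt(p) packets at loss rate p. A small Python sketch makes the data-center vs. WAN contrast concrete (the 1460-byte MSS is an assumption):

def tolerable_loss_rate(window_bytes: float, mss: int = 1460) -> float:
    """Loss rate p at which a window of this size is sustainable,
    from W ~= 1.22 / sqrt(p) packets (Mathis et al. approximation)."""
    window_packets = window_bytes / mss
    return (1.22 / window_packets) ** 2

print(tolerable_loss_rate(12.5e3))  # small data-center window: ~2e-2
print(tolerable_loss_rate(1.25e6))  # WAN long fat pipe:        ~2e-6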

Second, there are buffers in the network, like the bulging part in the middle of an IV drip tube, which absorb fluctuations in network traffic. Because standard TCP estimates the send window by "filling the pipe", it tends to fill the buffer at the start of a connection. The buffer occupancy gradually decreases afterwards but never disappears entirely, so the sender's estimated pipe volume (send window size) is always slightly larger than the pipe's volume excluding the bulge. This problem is called bufferbloat.

Bufferbloat phenomenon illustration

Bufferbloat has two harms:

  1. It increases network latency: the more data queued in the buffer, the longer the wait.
  2. When many connections share the network bottleneck, the buffer may fill up and drop packets. Many people take this kind of packet loss for network congestion, but that is not what it is.

Round-trip delay over time. Red line: standard TCP (note the periodic delay variation, with the buffer almost always full); green line: TCP BBR (image from Google's paper in the September-October 2016 issue of ACM Queue [1], same below)

Many papers propose having network devices feed information about current buffer occupancy back to the endpoints, for example ECN (Explicit Congestion Notification), which is widely used in data centers. On the wide area network, however, there are many network devices that are hard to upgrade or replace, so solutions requiring network device cooperation are difficult to deploy at scale.

How does TCP BBR solve the above two problems?

  1. Since congestion loss and error loss are hard to tell apart, TCP BBR simply ignores packet loss.
  2. Since filling the pipe easily causes bufferbloat, TCP BBR estimates bandwidth and delay separately rather than estimating the pipe's volume directly.

The product of bandwidth and delay is exactly the size the send window should be. The TCP Westwood congestion control algorithm, invented in 2002 and since merged into the Linux kernel, already estimated bandwidth and delay separately and used their product as the send window. However, bandwidth and delay are like a particle's position and momentum: they cannot both be measured accurately at the same time. To measure the maximum bandwidth, the pipe must be filled, which leaves packets in the buffer and inflates the delay; to measure the lowest delay, the buffer must be empty, and the less traffic in the network the better, but then the measured bandwidth is low.

TCP BBR deals with the impossibility of measuring bandwidth and delay accurately at the same time by measuring them alternately, using the maximum bandwidth and the minimum delay observed over a period of time as the estimates.
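A minimal Python sketch of the two filters. The window lengths follow the published BBR design (about 10 round trips for the bandwidth max-filter, 10 seconds for the delay min-filter); expressing the bandwidth window as 1 second here is a simplifying assumption:

class BBREstimator:
    """Sketch: max-filter for bandwidth, min-filter for delay."""

    def __init__(self, bw_window_s: float = 1.0, rtt_window_s: float = 10.0):
        # bw_window_s approximates "~10 round trips" in seconds (assumption).
        self.bw_window_s, self.rtt_window_s = bw_window_s, rtt_window_s
        self.bw_samples, self.rtt_samples = [], []  # (time, value) pairs

    def on_ack(self, now: float, bw: float, rtt: float) -> None:
        """Record one sample and expire samples outside the windows."""
        self.bw_samples = [(t, b) for t, b in self.bw_samples
                           if now - t <= self.bw_window_s] + [(now, bw)]
        self.rtt_samples = [(t, r) for t, r in self.rtt_samples
                            if now - t <= self.rtt_window_s] + [(now, rtt)]

    def send_window_bytes(self) -> float:
        """Bandwidth-delay product (assumes at least one sample recorded)."""
        max_bw = max(b for _, b in self.bw_samples)    # bytes per second
        min_rtt = min(r for _, r in self.rtt_samples)  # seconds
        return max_bw * min_rtt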

When a connection is first established, TCP BBR uses a slow start similar to standard TCP's, increasing the sending rate exponentially. Standard TCP, however, enters congestion avoidance as soon as it encounters any packet loss. Its intent is to fill the pipe before entering congestion avoidance, but (1) if the link's error loss rate is high, it gives up before the pipe is full; (2) if the network has buffers, it always fills the buffer before giving up.

TCP BBR, on the other hand, enters congestion avoidance when the effective bandwidth inferred from received acknowledgments stops growing. (1) As long as the link's error loss rate is not too high, it has no effect on BBR; (2) when the sending rate grows to the point of occupying the buffer, the effective bandwidth stops growing and BBR backs off in time (in fact it backs off after occupying 3 times bandwidth × delay; the extra 2 times sitting in the buffer is drained later), so it never fills the buffer.
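A sketch of this exit check; the "less than 25% growth over 3 consecutive round trips" constants follow the published BBR design, while the bookkeeping is simplified:

def pipe_filled(bw_per_round: list) -> bool:
    """Exit slow start when measured bandwidth grew by less than ~25%
    over the last 3 round trips, i.e. the pipe has stopped filling."""
    if len(bw_per_round) < 4:
        return False  # not enough rounds observed yet
    return bw_per_round[-1] < 1.25 * bw_per_round[-4]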

The send window's relationship to round-trip delay and effective bandwidth. BBR settles between the left and right turning points; loss-based standard TCP settles at the right turning point (image from the TCP BBR paper, same below)
During slow start, the buffer is barely occupied early on, so the minimum delay observed then is the initial delay estimate; the maximum effective bandwidth at the end of slow start is the initial bandwidth estimate.

After slow start ends, to drain the extra 2 times bandwidth × delay worth of packets, BBR enters the drain phase, exponentially reducing its sending rate; the packets queued in the buffer slowly drain until the round-trip delay stops decreasing, as shown by the green line below.

Comparison of effective bandwidth and round-trip delay of TCP BBR (green line) and standard TCP (red line)
After the drain phase, BBR enters steady state, alternately probing bandwidth and delay. Because network bandwidth changes more often than delay, BBR spends most of its steady state in the bandwidth-probing phase. Bandwidth probing is a positive-feedback loop: periodically try a higher sending rate, and if the rate at which acknowledgments arrive also rises, raise the sending rate further.

Specifically, BBR cycles through 8 round-trip times. In the first round trip, it tries sending 1/4 faster than the estimated bandwidth (i.e., at 5/4 of it). In the second, to drain the extra packets sent during the first, it sends at 1/4 below the estimated bandwidth. For the remaining 6 round trips, it sends at the estimated bandwidth.
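Written out, this eight-phase pacing-gain cycle is just a table (a direct transcription of the description above):

# Phase gains: probe up for one RTT, drain for one RTT, cruise for six RTTs.
PACING_GAIN_CYCLE = [5/4, 3/4, 1, 1, 1, 1, 1, 1]

def pacing_rate(estimated_bw: float, round_index: int) -> float:
    """Sending rate for a given round trip within the probing cycle."""
    return estimated_bw * PACING_GAIN_CYCLE[round_index % 8]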

When the network bandwidth doubles, the estimated bandwidth rises by 1/4 per cycle, and each cycle takes 8 round trips. The upward spikes in the figure are the attempts to raise the sending rate by 1/4; the downward spikes drain at 1/4 below the estimate; the remaining 6 round trips use the updated estimated bandwidth. After 3 cycles, i.e. 24 round-trip times, the estimated bandwidth reaches the doubled network bandwidth.

Behavior when network bandwidth doubles. The green line is the number of packets in the network, and the blue line is the delay
When the network bandwidth is halved, the excess packets occupy the buffer, markedly increasing the delay of packets in the network (blue line below), and the effective bandwidth halves. Because the delay estimate takes the minimum, the actual increase in delay is not reflected in it (except during the delay-probing phase, discussed later). The bandwidth estimate takes the maximum within a sliding time window, so when the earlier estimate expires (slides out of the window), the halved effective bandwidth becomes the new estimate. The send window then halves, the sender has no window left to send new packets, and the buffer gradually drains.

Behavior when network bandwidth is halved. The green line is the number of packets in the network, and the blue line is the delay
When the bandwidth doubles, BBR converges in only 1.5 seconds; when it halves, BBR needs 4 seconds. The former is fast because the bandwidth estimate grows exponentially; the latter is slower mainly because the bandwidth estimate takes the maximum within a sliding window, so it takes time for a drop in effective bandwidth to show up in the estimate.

When the network bandwidth stays constant, TCP BBR's steady state looks as follows (we saw this figure earlier). Note the subtle delay oscillations with a period of 8 round trips.

Round-trip delay over time. Red line: Standard TCP; Green line: TCP BBR
The above covers BBR's steady-state bandwidth probing; so when is delay probed? During bandwidth probing, the delay estimate always takes the minimum seen so far. What if the actual delay really increases? Every 10 seconds, if the estimated delay has not changed (i.e., no lower delay has been observed), TCP BBR enters the delay-probing phase. This phase lasts only 200 milliseconds (or one round trip, whichever is longer), during which the send window is fixed at 4 packets, i.e., almost nothing is sent. The minimum delay measured during this time becomes the new delay estimate. In other words, BBR spends about 2% of its time sending at a very low rate to measure delay.
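The delay-probing trigger and its parameters, as described above, in sketch form:

PROBE_RTT_CWND_PACKETS = 4        # send window during delay probing
PROBE_RTT_MIN_DURATION_S = 0.200  # or one round trip, whichever is larger

def should_probe_rtt(now_s: float, last_min_rtt_update_s: float) -> bool:
    """Enter the delay-probing phase if no lower delay has been
    observed for 10 seconds."""
    return now_s - last_min_rtt_update_s >= 10.0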

TCP BBR also paces outgoing packets to reduce burstiness, avoiding sudden trains of packets that cause buffer bloat (a minimal sketch of pacing follows the list below). Bursty sending can have two causes:

  1. The data receiver, to save bandwidth, accumulates several acknowledgments (ACKs) into one; this is called ACK compression. On receiving such an accumulated acknowledgment, a sender without pacing responds with a burst of data packets.
  2. The data sender, lacking data to transmit for a while, accumulates idle send window. When the application layer suddenly has a lot of data to send, a sender without pacing emits a burst as large as the whole idle send window.
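A minimal sketch of what pacing does: instead of emitting a burst back-to-back, the sender spaces packets by packet size divided by pacing rate.

def inter_packet_gap_s(packet_bytes: int, pacing_rate_bps: float) -> float:
    """Gap between consecutive packet transmissions under pacing."""
    return packet_bytes * 8 / pacing_rate_bps

# e.g. 1500-byte packets paced at 100 Mbps go out 120 microseconds apart:
print(inter_packet_gap_s(1500, 100e6))  # -> 0.00012 s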

Now let's see how TCP BBR performs. First, the first problem BBR tries to solve: throughput under random packet loss. As the figure below shows, with a loss rate of just one in ten thousand, standard TCP keeps only 30% of the bandwidth; at one in a thousand, only 10%; at one percent, it nearly stalls. TCP BBR, by contrast, loses almost no bandwidth at loss rates below 5% and still retains 75% of the bandwidth at a 15% loss rate.

Packet loss rate vs. effective bandwidth on a 100 Mbps, 100 ms link (red line: standard TCP; green line: TCP BBR)

Transfers between remote data centers across the WAN typically have high bandwidth, high latency, and a non-negligible packet loss rate, and TCP BBR can significantly improve their speed. This is the main reason the USTC LUG HTTP proxy server and Google's WAN (B4) deployed TCP BBR.
Next, the second problem BBR tries to solve: reducing latency and buffer bloat. As the figure below shows, standard TCP tends to fill the buffer, and the larger the buffer, the higher the latency. For users on very slow links, this delay can exceed the operating system's connection-establishment timeout, causing connections to fail outright. TCP BBR avoids this problem.

Buffer size and latency relationship (Red line: Standard TCP, Green line: TCP BBR)
After YouTube deployed TCP BBR, median latency worldwide dropped by 53% (i.e., roughly twice as fast), and in developing countries by 80% (roughly five times as fast). As the figure below shows, the higher a user's latency, the larger the proportional reduction after adopting TCP BBR, from 10 seconds down to just 2 seconds. If your website needs to stay usable for visitors on GPRS or slow Wi-Fi, TCP BBR is worth a try.

Ratio of median round-trip latency between standard TCP and TCP BBR
In summary, TCP BBR neither treats packet loss as a congestion signal nor uses "additive increase, multiplicative decrease" to maintain its send window. Instead, it estimates the maximum bandwidth and minimum delay separately and uses their product as the send window size.

BBR's connection-start phase consists of slow start and drain. To cope with bandwidth and delay not being accurately measurable at the same time, BBR probes them alternately once the connection stabilizes: the bandwidth-probing phase occupies the vast majority of the time, responding quickly to changes in available bandwidth through positive feedback and periodic bandwidth gain; the occasional delay-probing phase sends packets very slowly in order to measure delay accurately.

BBR solves two problems:

  1. Fully utilize the bandwidth on a network link with a certain packet loss rate. Very suitable for high latency, high bandwidth network links.
  2. Reduce the buffer occupancy rate on the network link, thereby reducing latency. Very suitable for users with slow network access.

Many comments asked whether TCP BBR takes effect on the client or the server, so a reminder: the TCP congestion control algorithm is applied by the data sender to decide its send window, so it takes effect for data sent out from whichever side deploys it. For downloads, deploy it on the server; for uploads, deploy it on the client.

If you want to speed up access to foreign websites, and your download traffic far exceeds your upload traffic, deploying TCP BBR (or any acceleration based on TCP congestion control) on the client side has no effect. You need to deploy TCP BBR on the VPN's overseas exit server and perform TCP termination and TCP proxying. That is, the client actually establishes its connection with the VPN's overseas exit server, which in turn connects to the target server, so that on the high-loss, high-latency segment (client to overseas exit) the sender is the BBR-enabled exit server. Deploying BBR on the overseas exit as an HTTP(S) proxy works on the same principle.

Probably due to ACM Queue's length limits and target readership, the paper does not discuss fairness between TCP BBR and standard TCP (except under congestion loss). Nor does it compare BBR with existing congestion control algorithms, such as delay-based ones (e.g., TCP Vegas), ones combining loss and delay signals (e.g., Compound TCP, TCP Westwood+), ones relying on congestion information from network devices (e.g., ECN), and ones where network devices adopt new scheduling strategies (e.g., CoDel). I look forward to more detailed papers from Google, and to reports from others on TCP BBR's performance in experimental and production environments.

I am not an expert in TCP congestion control; if there are any errors or omissions, please correct me.

[1] Cardwell, Neal, et al. "BBR: Congestion-Based Congestion Control." ACM Queue 14.5 (2016): 50.

Read More

2016-12-24
How to Make the Motherboard Beeper Sound on Win10 Without Writing a Driver

The Windows API has a Beep function that produces a beep sound. This function has a long history; the BIOS alarm beep comes from the motherboard beeper. Its original mechanism was to program the Programmable Interval Timer (PIT) present in almost every machine. Unfortunately, starting with Windows Vista, the function's behavior changed to play the sound through the speakers instead of the motherboard beeper.

How can you drive the motherboard beeper on Windows Vista and later? Do you have to write a Windows driver? In fact, using the WinDbg kernel debugger, one line of code suffices. The following line makes the motherboard beeper sound at 800 Hz for 1000 milliseconds.

n 10; r $t0=800; r $t1=1000; ob 0x43 0xb6; ob 0x42 (1193180/$t0)&0xff; ob 0x42 (1193180/$t0)>>8; ob 0x61 3; .sleep $t1; ob 0x61 0

How to use:

  1. Download and install WinDbg.
  2. Enable kernel debugging: run "bcdedit /debug on" with administrator privileges, then reboot.
  3. Open WinDbg with administrator privileges, choose File -> Kernel Debug, select the "Local" tab, and confirm. If all goes well, you will enter a local kernel debugging session.
  4. Enter the code above. If your motherboard beeper works, you should hear the beep. (Unfortunately, the screenshot carries no sound.)

Principle:

n 10;                        Switch to decimal; WinDbg defaults to hexadecimal
r $t0=800;                   Set WinDbg pseudo-register $t0 to 800, the tone frequency (Hz)
r $t1=1000;                  Set WinDbg pseudo-register $t1 to 1000, the tone duration (ms)
ob 0x43 0xb6;                Configure the PIT to output a square wave to the motherboard beeper (ob here works like Linux's outb)
ob 0x42 (1193180/$t0)&0xff;  Low byte of the wave period (divisor of the 1,193,180 Hz PIT clock)
ob 0x42 (1193180/$t0)>>8;    High byte of the wave period
ob 0x61 3;                   Start the sound
.sleep $t1;                  Keep sounding for $t1 milliseconds
ob 0x61 0;                   Stop the sound

Thanks to The Square Root of Negative One (zzh1996) for the question.

Reference: http://wiki.osdev.org/PC_Speaker

Read More

2016-09-22
The Weathervane of Network Technology: SIGCOMM 2016

(Reprinted from Microsoft Research Asia)

As the oldest top academic conference in computer networking, ACM SIGCOMM has been held 37 times since 1977. The ACM Special Interest Group on Data Communication (SIGCOMM) proudly calls SIGCOMM its annual flagship conference on its homepage. Over the past 40 years, from the TCP congestion control of computer networking textbooks to the Software-Defined Networking (SDN) and Network Function Virtualization (NFV) of cloud data centers, SIGCOMM has witnessed the birth and growth of many key networking technologies.

SIGCOMM papers are known for their high quality: only about 40 are accepted each year, an acceptance rate of roughly 15%, and network researchers worldwide regard publishing at SIGCOMM as an honor. Each paper undergoes rigorous double-blind review. This year, for example, there were three rounds: 99 of 225 submissions survived the first round, 66 the second, and 60 went to the Program Committee (PC), which settled on the final 39 accepted papers after a day and a half of meetings. Each accepted paper received on average 8 reviews spanning dozens of pages. Even for papers not ultimately accepted, these expert reviews are very helpful for subsequent improvement.

Read More

2016-08-22
ClickNP FAQ

Read More

2016-08-22
ClickNP: Highly Flexible and High-Performance Network Processing with Reconfigurable Hardware

Highly flexible software network functions (NFs) are crucial components to enable multi-tenancy in the clouds. However, software packet processing on a commodity server has limited capacity and induces high latency. While software NFs could scale out using more servers, doing so adds significant cost. This paper focuses on accelerating NFs with programmable hardware, i.e., FPGA, which is now a mature technology and inexpensive for datacenters. However, FPGA is predominantly programmed using low-level hardware description languages (HDLs), which are hard to code and difficult to debug. More importantly, HDLs are almost inaccessible for most software programmers.

This paper presents ClickNP, an FPGA-accelerated platform for highly flexible and high-performance NFs with commodity servers. ClickNP is highly flexible as it is completely programmable using high-level C-like languages, and exposes a modular programming abstraction that resembles Click Modular Router. ClickNP is also high performance. Our prototype NFs show that they can process traffic at up to 200 million packets per second with ultra-low latency (< 2µs). Compared to existing software counterparts, with FPGA, ClickNP improves throughput by 10x, while reducing latency by 10x. To the best of our knowledge, ClickNP is the first FPGA-accelerated platform for NFs, written completely in high-level language and achieving 40 Gbps line rate at any packet size.

Read More

2016-08-01
A Scalable and Efficient Architecture for FPGA-based HTTPS Accelerator

Joint project with Tianyi Cui for Microsoft Hackathon 2016.

Read More