(This article was first published on Zhihu)

@Gao Yifan of USTC LUG has deployed the TCP BBR congestion control algorithm from Linux 4.9 on the LUG HTTP proxy server. The measured download speed from a DigitalOcean server in Singapore to USTC's China Mobile egress increased from 647 KB/s to 22.1 MB/s (screenshot below).

(At the request of experts in the comment section, here is the test environment: BBR was enabled on the Singapore server, which is the data sender. That server is an HTTP proxy for accessing resources outside the firewall. The link between USTC's China Mobile egress and DigitalOcean is the public Internet, not a dedicated line. The USTC egress is 1 Gbps with no rate cap (but shared with other users), and DigitalOcean is capped at 200 Mbps. RTT is 66 ms. The results are this good because most people use TCP CUBIC (Linux) / Compound TCP (Windows); under a given packet loss rate TCP BBR is more aggressive and grabs more of the shared public bandwidth. So this does feel a bit unfair.)

The TCP BBR congestion control algorithm, submitted by Google to the Linux mainline and described in ACM Queue, continues Google's research tradition of "deploy in production first, then open-source and publish". TCP BBR is already deployed on YouTube servers and on Google's internal wide-area network (B4).

TCP BBR is committed to solving two problems:

  1. Fully utilize bandwidth on network links with a certain packet loss rate.
  2. Reduce the buffer occupancy rate on the network link to reduce latency.

The goal of TCP congestion control is to make full use of the bottleneck link bandwidth of the network path. A network link is like a water pipe, and the best way to use the pipe is to keep it full of water, that is:

amount of water in the pipe = volume of the pipe = cross-sectional area × length

In network terms:

packets in flight (sent but unacknowledged) = capacity of the link = bandwidth × round-trip time
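As a quick sanity check, this bandwidth-delay product (BDP) can be computed for the links mentioned in this article; a small Python sketch (the helper name bdp_bytes is mine):

    def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
        """Bandwidth-delay product: how many bytes 'fit' on the link."""
        return bandwidth_bps * rtt_s / 8  # bits -> bytes

    # The Singapore link from the test above: 200 Mbps cap, 66 ms RTT.
    print(bdp_bytes(200e6, 0.066))  # ~1.65e6 bytes in flight to fill the pipe

    # The 100 Mbps, 100 ms link used in the paper's tests.
    print(bdp_bytes(100e6, 0.100))  # ~1.25e6 bytes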

TCP maintains a send window as its estimate of how many packets the link can currently hold; when it has data to send, it sends one new packet for each acknowledgment that returns, keeping that many packets in flight at all times.

TCP and water pipe analogy (Image source: Van Jacobson, Congestion Avoidance and Control, 1988)
How do you estimate the volume of the pipe? An obvious method is to keep pouring water in until it overflows. The congestion control in standard TCP works the same way: keep growing the send window until packets are lost. This is the well-known "additive increase, multiplicative decrease": grow the send window slowly as acknowledgments arrive, and shrink it quickly on packet loss.
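A minimal sketch of this rule with the classic Reno constants (halve on loss, grow about one MSS per round trip otherwise); an illustration, not kernel code:

    def aimd_update(cwnd_bytes: float, mss: int, loss_detected: bool) -> float:
        """One AIMD step: multiplicative decrease on loss, additive increase per ACK."""
        if loss_detected:
            return cwnd_bytes / 2                     # multiplicative decrease
        return cwnd_bytes + mss * mss / cwnd_bytes    # ~ +1 MSS per RTT overall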

There are two problems with the standard TCP approach:

First, it assumes that all packet loss is caused by congestion (a network device's buffer is full, forcing it to drop packets). In reality, packets are also lost to transmission errors, and loss-based congestion control cannot tell congestion loss from error loss. Inside data centers, the error loss rate is on the order of one in a hundred thousand (1e-5); on the wide-area network it is generally much higher.

More importantly, for "additive increase, multiplicative decrease" to work properly, the error loss rate must be inversely proportional to the square of the send window. Inside a data center the RTT is typically 10-100 microseconds at 10-40 Gbps, giving a steady-state send window of 12.5 KB to 500 KB. On the wide-area network, 100 Mbps at 100 ms gives a steady-state send window of about 1.25 MB. The WAN send window is one to two orders of magnitude larger than in the data center, so the error loss rate has to be two to four orders of magnitude lower for the algorithm to work. On a long fat pipe (a high-delay, high-bandwidth link) with a non-negligible error loss rate, standard TCP therefore converges to a very small send window. This is one reason downloads can be very slow, or stall partway through, even when both endpoints have plenty of bandwidth and the carrier's core network is not saturated.
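The window-loss relation above is the well-known TCP throughput model: roughly, a steady window of W packets tolerates a random loss rate of about 1.5/W². A quick check with the numbers from this paragraph (the 1460-byte MSS is my assumption):

    def tolerable_loss(window_bytes: float, mss: int = 1460) -> float:
        """Approximate max random loss rate AIMD can sustain at this window."""
        w_packets = window_bytes / mss
        return 1.5 / w_packets ** 2

    print(tolerable_loss(12.5e3))  # small data-center window: ~2e-2
    print(tolerable_loss(500e3))   # large data-center window: ~1.3e-5
    print(tolerable_loss(1.25e6))  # WAN window: ~2e-6, four orders below the first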

Second, there are buffers in the network, like the bulging drip chamber in the middle of an infusion tube, which absorb fluctuations in traffic. Because standard TCP estimates the send window by "filling the pipe", the buffer tends to be filled at the start of a connection. The buffer occupancy decreases afterwards but never fully drains: the sender's estimate of the pipe volume (the send window) stays slightly larger than the volume of the pipe without the bulge. This problem is called bufferbloat.

Bufferbloat phenomenon illustration

Bufferbloat has two harms:

  1. It increases network latency: the more data sitting in the buffer, the longer the wait.
  2. When many connections share a bottleneck, the buffer may fill up and drop packets. Such drops are often taken for congestion on the link itself, when they are really an artifact of the oversized buffer.

Round-trip delay over time. Red line: standard TCP (note the periodic delay oscillation; the buffer stays almost full). Green line: TCP BBR. (Figure from Google's paper in the ACM Queue September-October 2016 issue [1]; same below.)

Many papers propose having network devices feed buffer occupancy information back to the endpoints, for example ECN (Explicit Congestion Notification), which is widely used inside data centers. On the wide-area network, however, there are far too many devices to upgrade or replace, so solutions that require cooperation from network devices are hard to deploy at scale.

How does TCP BBR solve the above two problems?

  1. Since congestion loss and error loss are hard to tell apart, TCP BBR simply ignores packet loss as a congestion signal.
  2. Since "filling the pipe" invites bufferbloat, TCP BBR estimates bandwidth and delay separately rather than estimating the pipe volume directly.

The product of bandwidth and delay is exactly the size the send window should have. TCP Westwood, proposed in the early 2000s and long part of the Linux kernel, already estimated bandwidth and delay separately and used their product as the send window. But bandwidth and delay are like a particle's position and momentum: they cannot be measured accurately at the same time. To measure the maximum bandwidth, the pipe must be kept full, which means packets queue in the buffer and the delay reads high; to measure the minimum delay, the buffer must be empty, and the less traffic in the network the better, but then the measured bandwidth is low.

TCP BBR's answer to this measurement problem: measure bandwidth and delay alternately, and use the maximum bandwidth and the minimum delay observed over a window of time as the estimates.
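A sketch of the two estimators as time-windowed filters (the class and parameter names are mine; in the Linux implementation the bandwidth maximum is kept over roughly the last 10 round trips and the minimum RTT over the last 10 seconds):

    from collections import deque

    class WindowedFilter:
        """Track the best (max or min) sample seen within the last window_s seconds."""
        def __init__(self, window_s: float, best):
            self.window_s, self.best = window_s, best
            self.samples = deque()  # (timestamp, value) pairs

        def update(self, value: float, now_s: float) -> float:
            self.samples.append((now_s, value))
            while now_s - self.samples[0][0] > self.window_s:
                self.samples.popleft()  # expire samples outside the window
            return self.best(v for _, v in self.samples)

    max_bandwidth = WindowedFilter(window_s=1.0, best=max)   # ~10 RTTs at 100 ms
    min_rtt = WindowedFilter(window_s=10.0, best=min)        # 10 s, matching the delay-probing interval described later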

When a connection is first established, TCP BBR uses a slow start similar to standard TCP's, growing the sending rate exponentially. Standard TCP, however, drops into congestion avoidance at the first packet loss. The intent of slow start is to fill the pipe before entering congestion avoidance, but (1) on a link with a high error loss rate it gives up before the pipe is full, and (2) if the network has buffers, it does not give up until it has filled the buffer as well.

TCP BBR instead leaves slow start when the effective bandwidth inferred from returning acknowledgments stops growing. (1) As long as the link's error loss rate is not extreme, this detection is unaffected; (2) as soon as the sending rate starts eating into the buffer, the effective bandwidth stops growing and BBR backs off in time (in fact it stops once the data in flight reaches 3 times the bandwidth-delay product; the extra 2 BDP sitting in the buffer is drained afterwards), so it never fills the buffer.
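A sketch of this "bandwidth stopped growing" test; in the Linux implementation, slow start (STARTUP) is left once the measured bandwidth grows by less than 25% over three consecutive round trips. The constants below match that; the function itself is my simplification:

    def pipe_filled(bw_per_round, growth=1.25, stall_rounds=3) -> bool:
        """True once bandwidth has stopped growing >=25% for 3 straight rounds."""
        best, stalls = 0.0, 0
        for bw in bw_per_round:        # one max-bandwidth sample per round trip
            if bw >= best * growth:
                best, stalls = bw, 0   # still growing: remember it, reset count
            else:
                stalls += 1
                if stalls >= stall_rounds:
                    return True        # pipe is full: exit slow start, start draining
        return False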

The relationship between the send window and the round-trip delay and effective bandwidth. BBR will stop between the turning points on the left and right, and the standard TCP based on packet loss will stop at the turning point on the right (image from TCP BBR paper, same below)
During slow start, since the buffer is almost empty early on, the minimum delay observed is the initial delay estimate; the maximum effective bandwidth at the end of slow start is the initial bandwidth estimate.

After slow start ends, to get rid of the extra 2 BDP of queued data, BBR enters a drain phase, reducing the sending rate exponentially; the packets in the buffer slowly drain out until the round-trip delay stops decreasing, as the green line below shows.

Comparison of effective bandwidth and round-trip delay of TCP BBR (green line) and standard TCP (red line)
After the drain phase, BBR enters steady-state operation, alternately probing bandwidth and delay. Because network bandwidth changes more often than delay, BBR spends most of its steady state in the bandwidth-probing phase. Bandwidth probing is a positive-feedback loop: periodically try a higher sending rate, and if acknowledgments also come back faster, raise the rate further.

Concretely, BBR cycles over 8 round-trip times. In the first round trip it raises the sending rate by 1/4 (i.e., sends at 5/4 of the estimated bandwidth). In the second, to drain the extra packets injected in the previous round trip, it lowers the rate by 1/4 below the estimate. For the remaining 6 round trips it sends at the estimated bandwidth.
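The eight-phase cycle just described, written out (the gain values 5/4, 3/4, and six rounds of 1 are from the BBR design; the function name is mine):

    PROBE_GAINS = [5/4, 3/4, 1, 1, 1, 1, 1, 1]  # one gain per round trip

    def pacing_rate(estimated_bw_bps: float, round_index: int) -> float:
        """Sending rate for the given round trip of the 8-RTT probing cycle."""
        return estimated_bw_bps * PROBE_GAINS[round_index % 8]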

When the network bandwidth doubles, the estimated bandwidth grows by 1/4 per cycle, each cycle being 8 round trips. The upward spikes are the attempts to send 1/4 faster; the downward spikes drain at 1/4 below the estimate; the remaining 6 round trips use the updated estimate. After 3 cycles, i.e., 24 round trips, the estimate reaches the new, doubled bandwidth.

Behavior when network bandwidth doubles. The green line is the number of packets in the network, and the blue line is the delay
When the network bandwidth is halved, the excess packets pile up in the buffer, markedly increasing the delay of packets in the network (blue line below), while the effective bandwidth halves. Since the delay estimate is a running minimum, the real increase in delay is not reflected in it (except during the delay-probing phase, discussed below). The bandwidth estimate is the maximum within a sliding time window; once the old, higher samples age out of the window, the halved effective bandwidth becomes the estimate. With the estimate halved, the send window halves, the sender runs out of window, and the buffer gradually drains.

Behavior when network bandwidth is halved. The green line is the number of packets in the network, and the blue line is the delay
When the bandwidth doubles, BBR converges in only 1.5 seconds; while when the bandwidth is halved, BBR needs 4 seconds to converge. The former is because the bandwidth growth is exponential; the latter is mainly because the bandwidth estimate uses the maximum value within a sliding window, and it takes a certain time for the effective bandwidth to decrease to be reflected in the bandwidth estimate.

When the network bandwidth remains unchanged, the stable state of TCP BBR is as follows: (We have seen this picture before) You can see the subtle changes in delay with a period of 8 round-trip delays.

Round-trip delay over time. Red line: Standard TCP; Green line: TCP BBR
The above covers BBR's steady-state bandwidth probing; when is delay probed? During bandwidth probing the delay estimate is a running minimum, so what if the actual delay really has increased? If the estimated delay has not been refreshed for 10 seconds (no equal-or-lower delay observed), TCP BBR enters a delay-probing phase lasting only 200 milliseconds (or one round trip, whichever is longer), during which the send window is clamped to 4 packets, i.e., almost nothing is sent. The minimum delay measured in that interval becomes the new delay estimate. In other words, BBR spends roughly 2% of its time sending at a very low rate in order to measure delay.
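A sketch of that trigger and clamp (constants per the description above; names and the 1460-byte MSS are mine):

    PROBE_RTT_INTERVAL_S = 10.0   # probe delay if no new minimum seen for 10 s
    PROBE_RTT_MIN_TIME_S = 0.2    # hold for 200 ms (or one RTT, if larger)

    def should_probe_rtt(now_s: float, last_min_rtt_update_s: float) -> bool:
        """Enter the delay-probing phase when the RTT estimate may be stale."""
        return now_s - last_min_rtt_update_s >= PROBE_RTT_INTERVAL_S

    def probe_rtt_window(mss: int = 1460) -> int:
        """Clamp the send window to 4 packets so queues drain and RTT ~= propagation delay."""
        return 4 * mss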

TCP BBR also uses pacing to smooth out bursts of packets, since a sudden string of packets can cause bufferbloat. Bursts can arise for two reasons:

  1. The data receiver, to save bandwidth, accumulates several acknowledgments (ACKs) into one, known as ACK compression. On receiving such a cumulative acknowledgment, a sender without pacing fires off a string of packets at once.
  2. The data sender temporarily has no data to transmit and accumulates idle send window. When the application layer suddenly has a lot of data to send, a sender without pacing sends out as much data as the whole idle window allows at once.
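Pacing simply spaces packets evenly at the target rate instead of letting a whole window's worth out at once. A minimal sketch of the arithmetic (the function name and 1460-byte MSS are my assumptions for illustration):

    def inter_packet_gap(pacing_rate_bps: float, mss: int = 1460) -> float:
        """Seconds to wait between packets to sustain the pacing rate."""
        return mss * 8 / pacing_rate_bps

    # e.g. at 100 Mbps, send one 1460-byte packet roughly every 117 microseconds
    print(inter_packet_gap(100e6))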

Let's see how TCP BBR performs.

First, the first problem BBR sets out to solve: throughput under random packet loss. As the figure below shows, with a loss rate of just one in ten thousand (0.01%), standard TCP keeps only 30% of the bandwidth; at one in a thousand (0.1%), only 10%; at one percent it is practically stalled. TCP BBR, by contrast, loses almost no bandwidth at loss rates below 5% and still delivers 75% of the bandwidth at 15% loss.

Packet loss rate vs. effective bandwidth on a 100 Mbps, 100 ms link (red line: standard TCP; green line: TCP BBR)

Transfers between remote data centers across the WAN typically face high bandwidth, high latency, and a non-negligible loss rate, and TCP BBR can markedly improve their throughput. This is the main reason both the USTC LUG HTTP proxy server and Google's WAN (B4) deploy TCP BBR.

Next, the second problem BBR tries to solve: reducing latency and bufferbloat. As the figure below shows, standard TCP tends to fill the buffer, and the larger the buffer, the higher the latency. For users on very slow access links, this delay can exceed the operating system's connection timeout, so connections fail to establish at all. TCP BBR avoids this problem.

Buffer size and latency relationship (Red line: Standard TCP, Green line: TCP BBR)
After YouTube deployed TCP BBR, worldwide median latency dropped by 53% (i.e., more than halved), and in developing countries median latency dropped by 80% (i.e., to about one fifth). As the figure below shows, the higher a user's latency, the larger the proportional reduction after adopting TCP BBR, in some cases from 10 seconds down to 2 seconds. If your site needs to stay usable for users on GPRS or slow Wi-Fi, TCP BBR is worth a try.

Ratio of median round-trip latency between standard TCP and TCP BBR
In summary, TCP BBR no longer treats packet loss as a congestion signal, nor does it use "additive increase, multiplicative decrease" to maintain the send window; instead it estimates the maximum bandwidth and minimum delay separately and takes their product as the send window size.

The connection start phase of BBR consists of slow start and drain phases. To solve the problem that bandwidth and delay are not easy to measure accurately at the same time, BBR alternately probes bandwidth and delay after the connection is stable, with the bandwidth probing phase occupying the vast majority of the time, trying to respond quickly to changes in available bandwidth through positive feedback and periodic bandwidth gain; the occasional delay probing phase sends packets very slowly, used to measure delay accurately.

BBR solves two problems:

  1. Fully utilize the bandwidth on a network link with a certain packet loss rate. Very suitable for high latency, high bandwidth network links.
  2. Reduce the buffer occupancy rate on the network link, thereby reducing latency. Very suitable for users with slow network access.

Many comments ask whether deploying TCP BBR on the client or on the server is what matters. A reminder: a TCP congestion control algorithm runs at the data sender, which decides the send window, so it takes effect for data sent out from whichever side deploys it. For downloads, deploy it on the server; for uploads, on the client.
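For reference, on Linux 4.9 and later, enabling BBR on the sending machine is typically just two sysctl settings (on these early kernels the fq queueing discipline supplies the pacing BBR relies on):

    net.core.default_qdisc = fq
    net.ipv4.tcp_congestion_control = bbr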

If you want faster access to foreign websites and your download traffic far exceeds your upload traffic, deploying TCP BBR (or any congestion-control-based acceleration) on the client side achieves nothing. You need to deploy TCP BBR on the VPN's overseas egress and perform TCP termination and TCP proxying: the client actually establishes its connection with the overseas egress server, and the egress server opens a separate connection to the target server, so that on the lossy, high-latency segment (client to overseas egress) the data sender is the BBR-enabled egress server. Deploying BBR on the overseas egress and running an HTTP(S) proxy works on the same principle.

Probably owing to ACM Queue's length limits and target audience, the paper does not discuss fairness between TCP BBR and standard TCP (except under congestion loss), nor does it compare BBR with existing congestion control algorithms: delay-based ones (e.g., TCP Vegas), ones combining loss and delay signals (e.g., Compound TCP, TCP Westwood+), ones relying on congestion information from network devices (e.g., ECN), and new scheduling strategies in network devices (e.g., CoDel). Looking forward to more detailed papers from Google, and to reports from others on TCP BBR's behavior in experimental and production environments.

I am not an expert in the field of TCP congestion control, if there are any errors or omissions, please correct me.

[1] Cardwell, Neal, et al. "BBR: Congestion-Based Congestion Control." ACM Queue 14.5 (2016): 50.
