Network virtualization is the creation of a virtual network whose topology differs from that of the physical network. For example, a company with offices around the world may want its internal network to behave as a single network, which requires network virtualization technology.

Starting from NAT

Suppose a machine in the Beijing office has the IP 10.0.0.1 (a private address, which cannot be used on the Internet) and a machine in the Shanghai office has the IP 10.0.0.2, and the two need to communicate over the Internet. The public (Internet) IP of the Beijing office is 1.1.1.1, and the public IP of the Shanghai office is 2.2.2.2.

A simple approach is to rewrite addresses at the edge router of the Beijing office. For outgoing packets, change the source IP 10.0.0.1 to 1.1.1.1 and the destination IP 10.0.0.2 to 2.2.2.2. For incoming packets, change the destination IP 1.1.1.1 to 10.0.0.1 and the source IP 2.2.2.2 to 10.0.0.2. The edge router of the Shanghai office does the same kind of translation. In this way 10.0.0.1 and 10.0.0.2 can communicate, completely unaware of the Internet and of the address translation in between. This is basic NAT (Network Address Translation).
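
The translation just described can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: packets are modeled as dictionaries, and the names PRIVATE_TO_PUBLIC, translate_outgoing and translate_incoming are made up for this sketch rather than taken from any real router.

```python
# One-to-one NAT mapping from the example: private IP -> public IP.
PRIVATE_TO_PUBLIC = {"10.0.0.1": "1.1.1.1", "10.0.0.2": "2.2.2.2"}
PUBLIC_TO_PRIVATE = {pub: priv for priv, pub in PRIVATE_TO_PUBLIC.items()}


def translate_outgoing(packet):
    """Sender's edge router: rewrite both private addresses to public ones."""
    return {**packet,
            "src": PRIVATE_TO_PUBLIC[packet["src"]],
            "dst": PRIVATE_TO_PUBLIC[packet["dst"]]}


def translate_incoming(packet):
    """Receiver's edge router: rewrite both public addresses back to private ones."""
    return {**packet,
            "src": PUBLIC_TO_PRIVATE[packet["src"]],
            "dst": PUBLIC_TO_PRIVATE[packet["dst"]]}


request = {"src": "10.0.0.1", "dst": "10.0.0.2", "payload": "hello"}
on_the_internet = translate_outgoing(request)    # leaves the Beijing office
delivered = translate_incoming(on_the_internet)  # enters the Shanghai office
```

The mapping is strictly one private IP per public IP, which is exactly the limitation discussed next.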

However, this method has a serious problem. Imagine that the Shanghai office adds a machine with the internal IP 10.0.0.3. No matter what the Beijing office does, when the edge router of the Shanghai office receives a packet with destination IP 2.2.2.2, should it go to 10.0.0.2 or to 10.0.0.3? This bug looks simple, but designers overlook it easily. When designing a network topology or a network protocol, you cannot think only about how packets go out; you also have to think about how reply packets come back in. With simple NAT, every additional internal machine needs an additional public IP on the edge router.

Public IP addresses are scarce, so NAPT (Network Address and Port Translation) was born. What Linux calls NAT is actually NAPT. Outgoing and incoming connections need to be considered separately. For incoming connections, the basic assumption of NAPT is that two machines sharing the same public IP do not provide the same service. For example, if 10.0.0.2 provides HTTP and 10.0.0.3 provides HTTPS, the edge router of the Shanghai office can be configured as “if the destination IP is 2.2.2.2 and the destination port is 80 (HTTP), send to 10.0.0.2; if the destination port is 443 (HTTPS), send to 10.0.0.3”. This is DNAT (Destination NAT).
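
As a rough sketch, the DNAT rule above is just a lookup keyed on the public destination IP and port. The names DNAT_RULES and dnat are hypothetical and only serve this illustration.

```python
# Static DNAT rules on the Shanghai edge router (public IP 2.2.2.2).
DNAT_RULES = {
    ("2.2.2.2", 80): "10.0.0.2",   # HTTP
    ("2.2.2.2", 443): "10.0.0.3",  # HTTPS
}


def dnat(dst_ip, dst_port):
    """Return the internal destination IP, or None if no rule matches."""
    return DNAT_RULES.get((dst_ip, dst_port))
```

A port with no rule (for example 22) has no internal destination, and two machines cannot both claim the same port.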

For outgoing connections, things are a bit more complicated. Suppose 10.0.0.2 initiates a connection to 10.0.0.1 with source port 20000 and destination port 80, and 10.0.0.3 also initiates a connection to 10.0.0.1, with source port 30000 and destination port 80. When a reply packet from the Beijing office arrives at the edge router of the Shanghai office, its source port is 80 and its destination port is 20000. If the edge router does not keep connection state, it has no way of knowing which machine to forward this packet to. That is, the edge router needs to maintain a table:

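[Table: the edge router's connection table, one record per outgoing connection, including a “new source port” column]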

When a reply packet arrives, the router checks its source port (80) and destination port (20000), matches the first record, and knows that the packet should go to 10.0.0.2. Why do we need the “new source port” column? If 10.0.0.2 and 10.0.0.3 initiate TCP connections to the same destination IP and destination port using the same source port, the reply packets of the two connections cannot be told apart. In that case the edge router must allocate different source ports, and the source port of the packet actually sent out is the “new” one. Network address translation for outgoing connections is called SNAT (Source NAT).
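
The connection table can be sketched as follows. This is a toy model under stated assumptions: the class SnatTable, its method names and the starting port 40000 are invented for illustration, and the sketch always allocates a fresh source port, whereas a real implementation would usually keep the original port when there is no conflict. The destination is written as 1.1.1.1, the public IP of the Beijing office.

```python
import itertools


class SnatTable:
    """Toy connection table for outgoing connections behind one public IP."""

    def __init__(self, public_ip, first_port=40000):
        self.public_ip = public_ip
        self._ports = itertools.count(first_port)  # hypothetical port pool
        # (remote IP, remote port, new source port) -> (private IP, private port)
        self._connections = {}

    def outgoing(self, src_ip, src_port, dst_ip, dst_port):
        """Record the connection and return the rewritten (source IP, source port)."""
        new_port = next(self._ports)
        self._connections[(dst_ip, dst_port, new_port)] = (src_ip, src_port)
        return self.public_ip, new_port

    def incoming(self, src_ip, src_port, dst_port):
        """Map a reply packet back to the internal (IP, port), or None if unknown."""
        return self._connections.get((src_ip, src_port, dst_port))


table = SnatTable("2.2.2.2")
# Both internal machines pick the same source port for the same destination.
_, port_a = table.outgoing("10.0.0.2", 20000, "1.1.1.1", 80)
_, port_b = table.outgoing("10.0.0.3", 20000, "1.1.1.1", 80)
```

Because the two connections get different new source ports, the replies stay distinguishable even though the original source ports collide.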

IP-in-IP Tunnel

NAPT requires that two machines sharing the same public IP do not provide the same service, which is often unacceptable. For example, we often need to SSH or remote desktop into every machine. Tunneling was developed to solve this. The simplest layer 3 tunneling technology is IP-in-IP.

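[Figure: IP-in-IP packet format. The original IP packet (Private SIP, Private DIP) is preceded by an added link layer header and an added IP header (Public SIP, Public DIP).]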

As shown in the figure above, the parts with white background and black text are the original IP packet, and the parts with blue background and white text are the added headers. These headers are generally added (encap) on the sender's edge router: first the layer 2 (link layer) header, then the layer 3 (network layer) header. The whole packet is a legal IP packet that can be routed on the Internet. When the receiver's edge router gets this packet, it sees the IP-in-IP flag in the added header (IP protocol number = 0x04, not shown in the figure) and knows that this is an IP-in-IP tunnel packet; it also sees that the Public DIP is its own address, so it knows it should unpack (decap) the packet. Unpacking exposes the original packet (Private SIP, Private DIP), which is then routed to the corresponding machine in the internal network.
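
The encap and decap steps can be sketched with the standard struct module. This is a minimal sketch under several assumptions: it builds IPv4 headers without options, ignores the link layer header, fragmentation and TTL handling, and the helper names (checksum, ipv4_header, encap, decap) are made up for this example.

```python
import socket
import struct

IPPROTO_IPIP = 4  # IP protocol number that marks an IP-in-IP packet


def checksum(data: bytes) -> int:
    """Internet checksum: one's complement of the one's complement sum."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def ipv4_header(src: str, dst: str, proto: int, payload_len: int) -> bytes:
    """Build a 20-byte IPv4 header (no options, TTL 64)."""
    fields = [0x45, 0, 20 + payload_len, 0, 0, 64, proto, 0,
              socket.inet_aton(src), socket.inet_aton(dst)]
    fields[7] = checksum(struct.pack("!BBHHHBBH4s4s", *fields))
    return struct.pack("!BBHHHBBH4s4s", *fields)


def encap(inner: bytes, public_sip: str, public_dip: str) -> bytes:
    """Sender's edge router: prepend the outer IP header."""
    return ipv4_header(public_sip, public_dip, IPPROTO_IPIP, len(inner)) + inner


def decap(outer: bytes, my_public_ip: str):
    """Receiver's edge router: return the inner packet, or None if not for us."""
    if outer[9] != IPPROTO_IPIP or socket.inet_ntoa(outer[16:20]) != my_public_ip:
        return None
    return outer[(outer[0] & 0x0F) * 4:]


# Original packet; the 5 payload bytes are a placeholder, not a real TCP segment.
inner = ipv4_header("10.0.0.1", "10.0.0.2", 6, 5) + b"hello"
outer = encap(inner, "1.1.1.1", "2.2.2.2")
```

Calling decap(outer, "2.2.2.2") returns the untouched inner packet, with its private addresses intact.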

IP-in-IP tunneling is not enough, for three reasons.

  1. If you use an IP-in-IP tunnel to build a LAN with the same network address and subnet mask at both ends, and do not configure the ARP table on the clients, you will find that the clients (note: not the routers at the two ends of the tunnel) cannot ping each other. Before sending a ping packet (ICMP echo request), the system needs to learn the other party's MAC address through ARP in order to fill in the link layer header correctly. An IP-in-IP tunnel can only carry IPv4 packets, not ARP packets (ARP is a separate protocol that is not carried inside IPv4). So either the ARP table has to be configured manually on the clients, or the router has to answer ARP on their behalf, which makes the network harder to configure.
  2. A data center usually hosts more than one customer. Suppose two customers have each created a virtual network and both use the internal IP 10.0.0.1. If they share the same Public IP when sending to the Internet, it is impossible to tell which customer an incoming IP-in-IP packet belongs to.
  3. Load balancing usually hashes the five-tuple in the packet header (source IP, destination IP, layer 4 protocol, source port, destination port) and selects the target machine by the hash value, so that packets of the same connection always go to the same machine. If an ordinary load-balancing device receives an IP-in-IP packet and does not recognize the IP-in-IP protocol, it cannot parse the layer 4 protocol and port numbers, and can only hash on the Public SIP and Public DIP. The Public DIP is generally the same, so the source IP is the only variable, and it is hard to get a uniform hash.

The first problem shows that the encapsulated packet is not necessarily an IP packet. The second shows that additional identification information may need to be added. The third shows that the added header is not necessarily just an IP header. This is why network virtualization technologies have flourished in so many forms rather than converging on one size that fits all.
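
The third problem can be illustrated with a toy hash-based load balancer. This is only a sketch: pick_backend and the use of CRC32 are assumptions made for the example, not how any particular device works.

```python
import zlib


def pick_backend(fields, n_backends):
    """Hash the given header fields and pick a backend index."""
    key = "|".join(str(f) for f in fields).encode()
    return zlib.crc32(key) % n_backends


# 100 inner connections between the same pair of tunnel endpoints.
inner_flows = [("10.0.0.1", "10.0.0.2", "tcp", 20000 + i, 80) for i in range(100)]
outer_ips = ("1.1.1.1", "2.2.2.2")  # all that a tunnel-unaware device can see

# Hashing on the outer IPs sends every connection to the same backend.
by_outer = {pick_backend(outer_ips, 4) for _ in inner_flows}
# Hashing on the inner five-tuple can use the varying source port.
by_inner = {pick_backend(flow, 4) for flow in inner_flows}
```

Here by_outer always contains exactly one backend, because every packet between the two offices presents the same pair of outer addresses.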

Classification of Network Virtualization Technology

To understand a network virtualization technology, look mainly at the format of the packets in the tunnel.

  • The outermost layer is the encapsulation layer. Since the packet has to travel through the network, it must be a legal layer 2 packet, so the outermost header is always MAC. When we say that the encapsulation layer is layer N, we mean that encapsulation headers for layers 2 through N have been added.
  • In the middle is the optional shim layer, which carries additional information and flags, such as the Tenant ID that identifies different customers' virtual networks, and the entropy used to improve hash uniformity.
  • The inner layer is the packet actually sent by the client. This layer determines what the virtual network looks like to the client. For example, the inner layer of an IP-in-IP tunnel is an IPv4 packet, so the virtual network looks like an IPv4 network to the client, and it can run TCP, UDP, ICMP or any other layer 4 protocol. When we say that the virtual network is layer N, we mean that layers 2 through N-1 of the packet sent by the client are not transmitted (these layers may still affect the encapsulation layer, that is, which tunnel the packet enters). A lower-layer virtual network (closer to the physical layer) is not necessarily better, because lower-layer protocols are harder to optimize, as we will see later.

According to the format of the packets in the tunnel, common network virtualization technologies (I also count some tunneling technologies as network virtualization technologies here) can be roughly classified as follows. (Below, PPP and MAC are layer 2 protocols, IP is a layer 3 protocol, and TCP and UDP are layer 4 protocols.)

[Table: common network virtualization technologies classified by encapsulation layer (Encap Layer) and Payload]

As this shows, almost every reasonable combination of Encap Layer and Payload has a corresponding protocol. Statements such as “adding a layer 2 header to GRE can…” are therefore meaningless: once the Encap Layer or the Payload changes, it is a different protocol. The following sections use some of these protocols as examples to show why the different layers matter, that is, which problems they solve.

GRE vs. IP-in-IP

Compared with IP-in-IP, the GRE (Generic Routing Encapsulation) protocol adds a shim layer (middle layer) that includes a 32-bit GRE Key (used as Tenant ID or entropy) and a sequence number. The GRE Key solves the second problem of the IP-in-IP tunnel mentioned earlier: it lets different customers share the same physical network and the same group of physical machines, which is important in the data center.
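
A GRE header carrying a key can be sketched as below. The flag values follow the GRE key and sequence number extension as I understand it (K bit 0x2000, S bit 0x1000, C bit 0x8000 in the first 16 bits); the function names gre_header and gre_key are hypothetical, and the checksum option is parsed but never generated.

```python
import struct

GRE_FLAG_CHECKSUM = 0x8000  # C bit: checksum field present
GRE_FLAG_KEY = 0x2000       # K bit: key field present
GRE_FLAG_SEQ = 0x1000       # S bit: sequence number present
ETHERTYPE_IPV4 = 0x0800     # Protocol Type for an IPv4 payload


def gre_header(key, seq=None, proto=ETHERTYPE_IPV4):
    """Build a GRE header with a 32-bit key and an optional sequence number."""
    flags = GRE_FLAG_KEY | (GRE_FLAG_SEQ if seq is not None else 0)
    header = struct.pack("!HHI", flags, proto, key)
    if seq is not None:
        header += struct.pack("!I", seq)
    return header


def gre_key(header):
    """Extract the key (e.g. a Tenant ID) from a GRE header, or None if absent."""
    flags, _proto = struct.unpack("!HH", header[:4])
    if not flags & GRE_FLAG_KEY:
        return None
    offset = 4 + (4 if flags & GRE_FLAG_CHECKSUM else 0)  # skip checksum word
    return struct.unpack("!I", header[offset:offset + 4])[0]
```

A receiving router could use gre_key to decide which customer's virtual network a decapsulated packet belongs to, even when two customers use the same internal IP.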

NVGRE vs. GRE

The network virtualized by GRE as described above is an IPv4 network, which means that IPv6 and ARP packets cannot be transmitted in the GRE tunnel. The IPv6 problem is relatively easy to solve: just change the Protocol Type in the GRE header. The ARP problem is not so simple. An ARP request is a broadcast packet (“Who has 192.168.0.1? Tell 00:00:00:00:00:01”), which reflects a fundamental difference between layer 2 and layer 3 networks: layer 2 networks support broadcast domains. A broadcast domain is the set of hosts that should receive a given broadcast packet. VLAN is a common way to implement broadcast domains.

Of course, IP also supports broadcasting, but packets sent to a layer 3 broadcast address (such as 192.168.0.255) are still sent to the layer 2 broadcast address (ff:ff:ff:ff:ff:ff), so they rely on the layer 2 broadcast mechanism. Making ARP work inside a GRE tunnel is not impossible, but people generally do not do it.

To support all existing and possible future layer 3 protocols, and to support broadcast domains, the client's virtual network needs to be a layer 2 network. NVGRE and VXLAN are the two best-known layer 2 network virtualization protocols.

NVGRE (Network Virtualization GRE) differs from GRE in only two essential ways:

  • The inner Payload is a layer 2 Ethernet frame rather than a layer 3 IP packet. Note that the FCS (Frame Check Sequence) at the end of the inner Ethernet frame is removed, because the encapsulation layer already has a checksum, and computing another one would add system load (if the CPU has to compute it).

  • The GRE key in the shim layer is split into two parts: the first 24 bits are used as the Tenant ID, and the last 8 bits are used as Entropy.
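
The 24/8 split of the key is plain bit arithmetic. A minimal sketch, with hypothetical function names:

```python
def nvgre_key(tenant_id, entropy):
    """Pack a 24-bit Tenant ID and 8 bits of entropy into a 32-bit GRE key."""
    if not 0 <= tenant_id < 1 << 24:
        raise ValueError("tenant_id must fit in 24 bits")
    if not 0 <= entropy < 1 << 8:
        raise ValueError("entropy must fit in 8 bits")
    return (tenant_id << 8) | entropy


def split_nvgre_key(key):
    """Return (tenant_id, entropy) from a 32-bit NVGRE key."""
    return key >> 8, key & 0xFF
```

For example, nvgre_key(0x123456, 0xAB) yields 0x123456AB, and split_nvgre_key reverses it.
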
Given NVGRE, why would anyone still use GRE? Apart from historical and political reasons, the lower the layer of the virtual network, the harder it is to optimize:

  • If the virtual network is layer 2, MAC addresses are generally scattered, so one forwarding rule has to be inserted per host, which becomes a problem when the network is large. If the virtual network is layer 3, IP addresses can be allocated according to the network topology, so that hosts close to each other in the network are in the same subnet (this is how the Internet does it). The router then only needs to match the subnet prefix (network address and subnet mask), which removes a large number of forwarding rules.

  • If the virtual network is layer 2, packets such as ARP broadcasts are broadcast to the entire virtual network, so a layer 2 network (commonly known as a LAN) generally cannot be too large. If the virtual network is layer 3, IP addresses are allocated hierarchically and this problem does not exist.

  • If the virtual network is layer 2, switches have to rely on the spanning tree protocol to avoid loops. If the virtual network is layer 3, the multiple paths between routers can be fully used to increase bandwidth and redundancy. The network topology of a data center is generally as shown in the figure below (Image source).
    [Figure: data center network design]
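
The difference in forwarding-table size from the first point can be sketched with the standard ipaddress module. The subnets, MAC addresses and port names below are invented for illustration; the point is only that a layer 3 table holds one rule per subnet while a layer 2 table holds one rule per host.

```python
import ipaddress

# Layer 3: one rule per subnet, assuming addresses are allocated by topology.
L3_RULES = {
    ipaddress.ip_network("10.1.0.0/16"): "port-to-rack-1",
    ipaddress.ip_network("10.2.0.0/16"): "port-to-rack-2",
    ipaddress.ip_network("10.1.5.0/24"): "port-to-special-hosts",
}

# Layer 2: one exact-match rule per host, since MAC addresses carry no topology.
L2_RULES = {
    "00:00:00:00:00:01": "port-to-rack-1",
    "00:00:00:00:00:02": "port-to-rack-2",
}


def l3_lookup(dst_ip):
    """Longest-prefix match: the most specific subnet containing dst_ip wins."""
    addr = ipaddress.ip_address(dst_ip)
    matches = [net for net in L3_RULES if addr in net]
    if not matches:
        return None
    return L3_RULES[max(matches, key=lambda net: net.prefixlen)]


def l2_lookup(dst_mac):
    """Exact match on the destination MAC address."""
    return L2_RULES.get(dst_mac)
```

Three layer 3 rules here cover tens of thousands of possible hosts, whereas the layer 2 table needs a new entry for every machine.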

If the layer of the virtual network is higher still and the payload does not include the network layer, it generally can no longer be called a “virtual network”, but it still belongs to tunneling technology. SOCKS5 is such a protocol, whose payload is TCP or UDP. It is more flexible to configure than IP-based tunnels; for example, port 80 (HTTP) can go through one tunnel and port 443 (HTTPS) through another. The -D (dynamic forwarding) option of ssh makes ssh act as a SOCKS server; the -L (local forwarding) option, by contrast, forwards a single fixed port and does not use SOCKS. The disadvantage of SOCKS5 is that it does not support layer 3 protocols such as ICMP (SOCKS4 does not even support UDP, so DNS handling is more troublesome).

VXLAN vs. NVGRE

Although NVGRE has an 8-bit Entropy field, a load-balancing device that does not recognize the NVGRE protocol still hashes on the five-tuple “source IP, destination IP, layer 4 protocol, source port, destination port”, so this entropy is useless to it.

The solution of VXLAN (Virtual Extensible LAN) is to add a UDP layer to the encapsulation layer, on top of the MAC and IP layers. The UDP source port number serves as entropy, and the UDP destination port number identifies the VXLAN protocol. The load-balancing device then does not need to recognize VXLAN; it simply hashes the packet on the normal UDP five-tuple.

Above: VXLAN encapsulated packet format (Image source)

The shim layer of VXLAN is slightly simpler than that of GRE: it still uses 24 bits as the Tenant ID, but has no Entropy bits. The network device or operating system virtualization layer that adds the headers generally derives the UDP source port of the encapsulation layer from the inner payload, for example by copying the inner source port or by hashing the inner header fields. Since the operating system that initiates a connection generally increments or randomizes the source port it selects, and the hash algorithm inside network devices is generally XOR-based, the resulting hash is usually well distributed.
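
The VXLAN shim header and the entropy-carrying UDP header can be sketched as below. Assumptions to note: the 8-byte header layout (flags byte 0x08, 24-bit Tenant ID followed by a reserved byte) and destination port 4789 follow the published VXLAN specification as I recall it; hashing the inner flow with CRC32 into the range 49152 to 65535 is just one possible way to pick the source port; the function names are hypothetical; and the UDP checksum is left at zero.

```python
import struct
import zlib

VXLAN_UDP_PORT = 4789        # UDP destination port that identifies VXLAN
VXLAN_FLAG_VNI_VALID = 0x08  # flag bit saying the Tenant ID field is valid


def vxlan_header(tenant_id):
    """8-byte VXLAN header: flags, reserved bits, 24-bit Tenant ID, reserved byte."""
    if not 0 <= tenant_id < 1 << 24:
        raise ValueError("tenant_id must fit in 24 bits")
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, tenant_id << 8)


def entropy_source_port(inner_flow):
    """Derive the outer UDP source port from the inner flow (one possible scheme)."""
    return 49152 + zlib.crc32(repr(inner_flow).encode()) % 16384


def udp_header(src_port, payload_len):
    """Outer UDP header: source port, VXLAN port, length, checksum left at 0."""
    return struct.pack("!HHHH", src_port, VXLAN_UDP_PORT, 8 + payload_len, 0)


flow = ("10.0.0.1", "10.0.0.2", "tcp", 20000, 80)
sport = entropy_source_port(flow)
```

Packets of the same inner connection always get the same outer source port, so a device hashing on the outer UDP five-tuple keeps them on one path while different connections can spread out.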

STT vs. VXLAN

STT (Stateless Transport Tunneling) is a network virtualization protocol that was proposed in 2012 and, at the time of writing, is still in draft status. At first glance STT seems merely to replace the UDP of VXLAN with TCP, but if you capture STT and VXLAN packets on the network, they look quite different.

Why does STT use TCP? In fact, STT only borrows the shell of TCP. It does not use the TCP state machine at all, let alone acknowledgment, retransmission, or congestion control. What STT wants to borrow are the LSO (Large Send Offload) and LRO (Large Receive Offload) mechanisms of modern network cards. LSO lets the sender generate TCP packets of up to 64KB (or even longer); the network card hardware splits the TCP payload of the large packet, copies the MAC, IP and TCP headers, and forms small packets that can be sent at layer 2 (such as Ethernet's 1518 bytes, or 9K bytes with Jumbo Frames enabled). LRO lets the receiver merge several small packets of the same TCP connection into one large packet before raising a network card interrupt to hand it to the operating system.

As shown in the figure below, before the Payload is sent out, the STT Frame Header (shim layer) and the MAC header, IP header and TCP-like header (encapsulation layer) are added. The LSO mechanism of the network card then breaks this TCP packet into small pieces, copies the encapsulation layer to the front of each piece, and sends them out.

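[Figure: STT encapsulation. The Payload is prefixed with the STT Frame Header; LSO splits the result into small pieces, each preceded by a copy of the MAC, IP and TCP-like headers.]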

User-kernel mode switches and network card interrupts consume a lot of CPU time, and the performance of network programs (such as firewalls and intrusion detection systems) is often measured in pps (packets per second) rather than bps (bits per second). Therefore, when transmitting a large amount of data, larger packets reduce the system load.

The biggest problem with STT is that it is hard to enforce policies for a particular customer (Tenant ID) on network devices. As shown in the figure above, for a large packet only the first small packet carries the STT Frame Header; the subsequent small packets contain nothing that identifies the customer. If I want to limit a customer's traffic from the Hong Kong data center to the Chicago data center to at most 1Gbps, a device that looks at individual packets cannot do it. With other network virtualization protocols, every packet contains information identifying the customer, so this policy can be configured on the border router (provided, of course, that the border router recognizes the protocol).
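
Both the segmentation and the resulting policy problem can be seen in a small sketch. This is a deliberately simplified model: lso_segment only slices the payload and copies the header, whereas real hardware also adjusts sequence numbers, lengths and checksums; the 54-byte ENCAP stand-in and the toy STT_FRAME_HEADER (a marker plus tenant ID 42) are made up and do not follow the draft's real layout.

```python
def lso_segment(encap_header: bytes, tcp_payload: bytes, mss: int):
    """Split one large TCP payload into MSS-sized pieces, each with a copied header."""
    return [encap_header + tcp_payload[i:i + mss]
            for i in range(0, len(tcp_payload), mss)]


ENCAP = b"\x00" * 54  # stand-in for MAC (14) + IP (20) + TCP-like (20) headers
STT_FRAME_HEADER = b"STT" + (42).to_bytes(8, "big")  # toy shim with tenant ID 42

# What the sender hands to the network card: shim header followed by the payload.
large_payload = STT_FRAME_HEADER + b"x" * 3000
segments = lso_segment(ENCAP, large_payload, mss=1400)

# Which of the small packets still carry the tenant information?
has_tenant_info = [STT_FRAME_HEADER in seg for seg in segments]
```

Here has_tenant_info is True only for the first of the three segments, which is why a border router inspecting the packets one by one cannot attribute the rest to a customer.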

Conclusion

This article used IP-in-IP, GRE, NVGRE, VXLAN and STT as examples to introduce the many flavors of network virtualization technology. When learning a network virtualization technology, first understand its encapsulation layer, the layer of the virtual network, and the information carried in the shim layer, and compare it with similar protocols; only then look at details such as flag bits, QoS, and encryption. When choosing a network virtualization technology, also consider how well operating systems and network equipment support it, and the scale, topology, and traffic characteristics of the customer network.
