Yesterday, something bizarre happened on mirrors.ustc.edu.cn. The mirrors host has three uplinks, with IPs 202.38.95.110, 202.141.160.110, and 202.141.176.110 respectively. Mirrors-lab is an LXC virtual machine on mirrors, with three IPs: 10.8.95.2, 10.8.140.2, 10.8.10.2.

On the mirrors host, iptables is configured to map ports 50000~50100 on the host directly to the virtual machine:

-A PREROUTING -p tcp -d 202.38.95.110 -m multiport --dports 50000:50100 -j DNAT --to 10.8.95.2
-A PREROUTING -p tcp -d 202.141.160.110 -m multiport --dports 50000:50100 -j DNAT --to 10.8.10.2
-A PREROUTING -p tcp -d 202.141.176.110 -m multiport --dports 50000:50100 -j DNAT --to 10.8.140.2

The rsync daemon listens on port 50000 in the virtual machine, but only rsync://202.38.95.110:50000 is reachable; connections to the other two IPs time out. The bizarre part is that when we captured packets with tcpdump on both the virtual machine and the mirrors host, the SYN had clearly been received and the reply (SYN+ACK) had been sent out. Tcpdump captures inbound packets before netfilter and outbound packets after it. We captured on the physical network card eth0, so the inbound packets had not yet reached iptables when they were captured, and the outbound packets had already passed through iptables. Even more bizarre, from the blog server, which is in the same network segment as mirrors, all three IPs were accessible normally. Why couldn't the reply packets get out of the local network?

After careful debugging by Guo Jiahua (10,000 words omitted here), the problem turned out to lie in ip rule. Below is the original ip rule output on the mirrors host:

$ ip rule
0: from all lookup local
32761: from all to 221.224.40.18 lookup 101
32762: from 202.141.176.110 lookup 102
32764: from 202.141.160.110 lookup 101
32765: from 202.38.95.110 lookup 100
32766: from all lookup main
32767: from all lookup default

Linux uses policy routing, which means there can be several routing tables, selected by a list of rules ordered by priority. If a packet does not match a rule, or the routing table that the rule points to has no route for it, the lookup moves on to the rule with the next lower priority (the next larger number). Taking the rules above as an example, the rule with priority 0 is tried first. Its table, local, holds the routes for this machine's own addresses, so it matches all packets sent to this machine. If that does not match, the rule with priority 32761 is tried, then the one with priority 32762, and so on.

Priority 32766 points to main, the main routing table. Routes added with the ip route command go into this table by default. main generally contains a default route, so everything that has not been matched earlier ends up here. The default route on mirrors is default via 202.38.95.126 dev vlan95. Priority 32767 points to the final table, default. Mirrors has not configured it, and lookups never actually get that far.

Why is it configured like this? Imagine a packet coming in on the vlan10 line of mirrors. Naturally, the reply packet should also go out through vlan10; otherwise, what is the use of three-line access if all outgoing packets go through one line? A Linux routing table is keyed on the destination address, so how can we ensure that the reply to a vlan10 packet, that is, a packet sent from this machine with the source address 202.141.160.110, definitely goes out through vlan10? This is where policy routing comes into play. With the following configuration, when the reply packet reaches the rule with priority 32764, its source address matches, so the lookup enters routing table 101. That table contains a default route, so the packet goes out through vlan10.

$ ip route show table 101
default via 202.141.160.126 dev vlan10
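
The rule walk described above can be modelled in a few lines of Python. This is only a sketch of the idea, not how the kernel implements it: the rules and the two tables are copied from the listings in this post, tables 100 and 102 are not shown here and are therefore left empty, the local table is omitted, and the function name `lookup` is made up for illustration.

```python
import ipaddress

# (priority, "from" selector, "to" selector, table), copied from `ip rule`.
RULES = [
    (32761, "0.0.0.0/0", "221.224.40.18/32", "101"),
    (32762, "202.141.176.110/32", "0.0.0.0/0", "102"),
    (32764, "202.141.160.110/32", "0.0.0.0/0", "101"),
    (32765, "202.38.95.110/32", "0.0.0.0/0", "100"),
    (32766, "0.0.0.0/0", "0.0.0.0/0", "main"),
]

# Only the tables shown in the post; tables 100 and 102 are not listed there.
# Each route is (destination prefix, gateway, device).
TABLES = {
    "101": [("0.0.0.0/0", "202.141.160.126", "vlan10")],
    "main": [("0.0.0.0/0", "202.38.95.126", "vlan95")],
}


def lookup(src, dst, rules=RULES, tables=TABLES):
    """Walk the rules in priority order and return (table, gateway, device)."""
    src_ip = ipaddress.ip_address(src)
    dst_ip = ipaddress.ip_address(dst)
    for _prio, from_sel, to_sel, table in sorted(rules):
        if src_ip not in ipaddress.ip_network(from_sel):
            continue  # rule selector does not match the source
        if dst_ip not in ipaddress.ip_network(to_sel):
            continue  # rule selector does not match the destination
        routes = [r for r in tables.get(table, [])
                  if dst_ip in ipaddress.ip_network(r[0])]
        if routes:
            # Longest prefix wins inside a table.
            best = max(routes, key=lambda r: ipaddress.ip_network(r[0]).prefixlen)
            return table, best[1], best[2]
        # No route in this table: fall through to the next rule.
    return None


print(lookup("202.141.160.110", "202.38.70.7"))  # ('101', '202.141.160.126', 'vlan10')
print(lookup("192.0.2.1", "202.38.70.7"))        # ('main', '202.38.95.126', 'vlan95')
```

A packet whose source is 202.141.160.110 is caught by the rule with priority 32764 and leaves through vlan10, while a source that no rule mentions (192.0.2.1 here is just a placeholder address) falls all the way through to main.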

After all this, it seems to have nothing to do with the weird problem at the beginning. Hold on, we still need to look at the kernel implementation of iptables, netfilter.

--->PRE------>[ROUTE]--->FWD---------->POST------>
    Conntrack    |       Mangle   ^    Mangle
    Mangle       |       Filter   |    NAT (Src)
    NAT (Dst)    |                |    Conntrack
    (QDisc)      |             [ROUTE]
                 v                |
                 IN Filter       OUT Conntrack
                 |  Conntrack     ^  Mangle
                 |  Mangle        |  NAT (Dst)
                 v                |  Filter

As the figure above shows (source), NAT in netfilter is actually split into SNAT and DNAT, and the ip rule lookup we just talked about happens in the [ROUTE] parts of the figure. In other words, the destination address is modified first, then the packet is routed, and only then is the source address modified. (By the way, this is also why DNAT lives in PREROUTING and SNAT in POSTROUTING.)

An rsync request (TCP SYN) from an external machine (say 202.38.70.7) arrives at 202.141.160.110:50000 through the network card vlan10. In the PREROUTING stage, the destination address is rewritten to 10.8.10.2 and the packet is sent into the virtual machine. The virtual machine receives the TCP SYN from 202.38.70.7 and replies with a TCP SYN+ACK from 10.8.10.2 to 202.38.70.7. Now this reply packet reaches the host's netfilter. Since the source address is only rewritten in the POSTROUTING stage, the source address is still 10.8.10.2 when the packet passes through the routing step. Looking back at the ip rule output above, none of the rules before main applies: the local table has no route to 202.38.70.7, and the source 10.8.10.2 matches none of the "from" selectors. So the lookup falls through to main, where the packet takes the default route and goes out through the network card vlan95.
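
To make the ordering concrete, here is a toy model of the two directions. It is a sketch under simplifying assumptions: ports are ignored, the DNAT mapping is the one from the PREROUTING rules in this post, and the `Packet` class and the two functions are invented for illustration rather than taken from any netfilter API.

```python
from dataclasses import dataclass, replace

# DNAT targets from the PREROUTING rules in the post (public IP -> VM IP).
DNAT = {
    "202.38.95.110": "10.8.95.2",
    "202.141.160.110": "10.8.10.2",
    "202.141.176.110": "10.8.140.2",
}
# Replies get the translation undone in the opposite direction.
REVERSE = {vm: public for public, vm in DNAT.items()}


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str


def inbound(pkt):
    """Request path: PREROUTING rewrites the destination, then routing runs."""
    after_dnat = replace(pkt, dst=DNAT.get(pkt.dst, pkt.dst))
    return {"routed_as": after_dnat, "delivered": after_dnat}


def outbound_reply(pkt):
    """Reply path: routing runs first, the source is rewritten afterwards."""
    routed_as = pkt  # [ROUTE] still sees the VM's private source address
    on_wire = replace(pkt, src=REVERSE.get(pkt.src, pkt.src))  # POSTROUTING
    return {"routed_as": routed_as, "on_wire": on_wire}


syn = Packet(src="202.38.70.7", dst="202.141.160.110")
delivered = inbound(syn)["delivered"]
reply = Packet(src=delivered.dst, dst=delivered.src)
result = outbound_reply(reply)
print(result["routed_as"].src)  # 10.8.10.2
print(result["on_wire"].src)    # 202.141.160.110
```

The routing step sees 10.8.10.2, which is the address the ip rule selectors are compared against, while the packet that finally leaves the host carries 202.141.160.110.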

When testing from the blog server in the same network segment, the destination MAC address of the link-layer frame is simply the blog server's own network card, so nothing goes wrong. But when testing from machines in other network segments, the packet goes to the gateway device of vlan95 (which is responsible for 202.38.95.110/25), and that gateway sees a source IP address of 202.141.160.110. WTF? It might just drop the packet. Once the principle is understood, this bizarre problem doesn't seem bizarre at all.
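
The post only guesses that the gateway drops the packet. Assuming it performs some kind of source address check against the segment it serves (an assumption, the gateway's real configuration is not known here), the check would look roughly like this. The function name is made up for illustration.

```python
import ipaddress


def source_belongs_to_segment(source, interface_cidr):
    """Return True if `source` lies inside the network of `interface_cidr`."""
    network = ipaddress.ip_interface(interface_cidr).network
    return ipaddress.ip_address(source) in network


# The vlan95 segment as given in the post: 202.38.95.110/25.
print(source_belongs_to_segment("202.38.95.110", "202.38.95.110/25"))    # True
print(source_belongs_to_segment("202.141.160.110", "202.38.95.110/25"))  # False
```

A reply sourced from 202.141.160.110 does not belong to the vlan95 segment, which is consistent with it vanishing at the gateway.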

The final solution to this problem is to add three ip rules (I also put them into rc.local):

ip rule add from 10.8.95.2 table 100
ip rule add from 10.8.10.2 table 101
ip rule add from 10.8.140.2 table 102
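
The three rules follow mechanically from the existing configuration: each virtual machine address gets the same table as the public address that is DNATed to it. A small sketch that derives them, using only the mappings shown in this post (the function name is hypothetical):

```python
# Public IP -> routing table, from the "from <ip> lookup <table>" rules in the post.
SOURCE_RULES = {
    "202.38.95.110": "100",
    "202.141.160.110": "101",
    "202.141.176.110": "102",
}

# Public IP -> VM IP, from the PREROUTING DNAT rules in the post.
DNAT = {
    "202.38.95.110": "10.8.95.2",
    "202.141.160.110": "10.8.10.2",
    "202.141.176.110": "10.8.140.2",
}


def rules_for_dnat_targets(source_rules, dnat):
    """Build one `ip rule add` command per DNAT target, reusing the public IP's table."""
    commands = []
    for public_ip, table in source_rules.items():
        vm_ip = dnat.get(public_ip)
        if vm_ip is None:
            continue  # no DNAT target for this line, nothing to add
        commands.append(f"ip rule add from {vm_ip} table {table}")
    return commands


for line in rules_for_dnat_targets(SOURCE_RULES, DNAT):
    print(line)
```

This prints the same three commands as above, which is a quick way to double-check that every DNAT target has a matching source rule.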

2013-05-13