About the Open Source Software Mirror Alliance (Technical Part)
Please read this first: About the Open Source Software Mirror Alliance (Non-Technical Part). Thank you.
I. DNS or 301?
Update: I previously misunderstood how DNS CNAME records work; this has now been corrected. Reference: https://tools.ietf.org/html/rfc3568
DNS solution (if I’m wrong, feel free to correct me):
Assign each mirror its own subdomain so that it can be scheduled separately, and point the subdomain's NS record at the main site
The user asks the ISP DNS server for the IP address, and the ISP DNS server forwards the query to the main site
The main site returns the IP addresses of one or more mirror nodes, chosen by the source IP of the query (the ISP DNS server's IP)
The ISP DNS server returns these addresses to the user and caches them for a period of time
The user establishes a connection with a mirror node and downloads
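The selection step in the DNS solution (the main site choosing mirror IPs from the resolver's source IP) could look roughly like the sketch below. The prefix table, the addresses (taken from documentation ranges) and the name `pick_mirrors` are all assumptions made for illustration; a real deployment would load a region or ISP database instead.

```python
import ipaddress

# Hypothetical table: resolver source prefix -> mirror node IPs.
# All addresses come from documentation ranges and are placeholders.
REGION_TABLE = [
    (ipaddress.ip_network("192.0.2.0/24"), ["198.51.100.10"]),
    (ipaddress.ip_network("203.0.113.0/24"), ["198.51.100.20", "198.51.100.21"]),
]
DEFAULT_MIRRORS = ["198.51.100.1"]  # fallback when no prefix matches


def pick_mirrors(resolver_ip):
    """Return the mirror IPs to place in the DNS answer for this resolver."""
    addr = ipaddress.ip_address(resolver_ip)
    best = None
    for net, mirrors in REGION_TABLE:
        # Longest-prefix match, so a more specific rule wins over a broad one.
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, mirrors)
    return best[1] if best else DEFAULT_MIRRORS
```

Note that the input is the ISP DNS server's address, not the user's, so the match can only be as precise as the resolver's location.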
301 solution:
The domain's A record resolves to the main site
User establishes a connection with the main site and initiates an HTTP request
The main site returns HTTP 301 status based on the user’s source IP, redirecting to a mirror node
User establishes a connection with this mirror node and downloads
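As a rough sketch of the 301 steps above, the main site could map the client's IP to a mirror base URL and answer with a `Location` header. The hostnames, prefixes and the names `redirect_location` and `RedirectHandler` are hypothetical; only Python's standard `http.server` and `ipaddress` modules are used.

```python
import ipaddress
from http.server import BaseHTTPRequestHandler

# Hypothetical mapping from client prefix to mirror base URL.
MIRROR_BY_PREFIX = [
    (ipaddress.ip_network("192.0.2.0/24"), "http://mirror-a.example.org"),
    (ipaddress.ip_network("203.0.113.0/24"), "http://mirror-b.example.org"),
]
DEFAULT_MIRROR = "http://mirror-default.example.org"


def redirect_location(client_ip, path):
    """Pick a mirror from the user's own IP and build the redirect target."""
    addr = ipaddress.ip_address(client_ip)
    for net, base in MIRROR_BY_PREFIX:
        if addr in net:
            return base + path
    return DEFAULT_MIRROR + path


class RedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The source IP of the TCP connection is the user's IP (or their proxy's).
        target = redirect_location(self.client_address[0], self.path)
        self.send_response(301)
        self.send_header("Location", target)
        self.end_headers()
```

To try it locally, one could run `HTTPServer(("", 8080), RedirectHandler).serve_forever()` after importing `HTTPServer` from `http.server`. Because the path is appended to a per-mirror base URL, each mirror could in principle use a different directory layout by extending the mapping.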
| **Comparison** | **DNS** | **301** |
| --- | --- | --- |
| Scheduling accuracy | Only the ISP DNS server's IP is known, so matching is by region | The user's IP is known, so matching can be accurate per user |
| Modification of scheduling strategy | Takes effect only after the DNS cache times out | Can be modified at any time and takes effect immediately |
| Mirror node directory structure | Must be the same | Does not have to be the same |
| User needs off-campus network access | Not necessary | Necessary(*) |
| Access delay | Almost no added delay | A connection to the main site must be established first, adding the delay of one TCP handshake |
| Main site pressure | Low, because the ISP DNS server caches answers | High, because every visit goes through the main site and parsing an HTTP request costs more than answering a DNS query |
| Main site stability requirements | High | Relatively high |
| Load balancing of multiple main sites | Can be implemented | Can be implemented |
II. Communication between mirrors
The main node needs to monitor the status of every mirror node in real time, for example with Ganglia or Nagios. When a fault is detected, the scheduling strategy must be adjusted so that new requests no longer go to the failed mirror, and an alert must be sent by email.
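The two reactions to a fault (remove the node from scheduling, raise an alert) can be sketched as below. This is not how Ganglia or Nagios work internally; it is a minimal stand-in that assumes a plain TCP connect is an acceptable health check. The names `tcp_probe` and `check_mirrors` are hypothetical, and delivering the returned messages by email (for example with `smtplib`) is left out.

```python
import socket


def tcp_probe(host, port=80, timeout=3.0):
    """Return True if a TCP connection to the mirror node succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_mirrors(mirrors, active, probe=tcp_probe):
    """Probe every mirror, update the `active` set in place, return alert messages."""
    alerts = []
    for host in mirrors:
        healthy = probe(host)
        if not healthy and host in active:
            active.discard(host)  # stop scheduling new requests to this node
            alerts.append(f"mirror {host} is DOWN")
        elif healthy and host not in active:
            active.add(host)      # node recovered, schedule it again
            alerts.append(f"mirror {host} is back UP")
    return alerts
```

The scheduler would then choose only from `active`, and alerts are produced only on a state change, so a node that stays down does not generate a message on every check.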
For each mirror, one main node in mainland China regularly synchronizes from the upstream source. When synchronization completes, it notifies the other nodes to synchronize from it (the API for this still needs to be discussed); if synchronization fails, it instead notifies the other nodes to synchronize directly from the upstream source.
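Since the notification API is still undecided, the following is only a sketch of the decision logic, with the actual sync and notify mechanisms passed in as callables. Every name here (`sync_and_notify`, `run_sync`, `notify`) is hypothetical.

```python
def sync_and_notify(run_sync, notify, peers, main_url, upstream_url):
    """Sync the main node from upstream, then tell peers where to sync from.

    run_sync(url) -> bool   hypothetical: returns True if the sync succeeded
    notify(peer, url)       hypothetical: asks a peer node to sync from url
    """
    ok = run_sync(upstream_url)
    # On success peers pull from the main node; on failure they fall back
    # to the upstream source directly.
    source = main_url if ok else upstream_url
    for peer in peers:
        notify(peer, source)
    return source
```

In practice `run_sync` might wrap an rsync invocation and `notify` an HTTP call, but both are open design questions here.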
The access logs of the mirrors have analytical value and are an important basis for adjusting the scheduling strategy. There therefore needs to be a mechanism by which each node sends its web access logs to the main site every day, and the main site then classifies them by mirror.
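Classification by mirror could be sketched as follows, assuming the nodes log in the Common or Combined Log Format and that the first path component of the request names the mirror (for example `/ubuntu/...`). Both are assumptions, as is the function name `count_by_mirror`.

```python
import re
from collections import Counter

# Matches the request part of a log line, e.g. "GET /ubuntu/dists/... HTTP/1.1",
# and captures the first path component as the mirror name.
REQUEST_RE = re.compile(r'"(?:GET|HEAD) /([^/\s"]+)[/\s]')


def count_by_mirror(lines):
    """Count requests per mirror; lines without a mirror path are skipped."""
    counts = Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Requests for the site root (`GET /`) do not name a mirror and are ignored by this sketch.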
The mirror sites should try to agree on the same Linux distribution, which makes it easier to write the “maintenance tool chain”.
III. Information that the main site web interface needs to provide
- A public mirror list showing the size of each mirror, daily update statistics, which mirror nodes already exist, and which mirrors each node maintains, so that newly joined mirror sites can choose according to their capacity and spend limited resources on the mirrors that need them most.
- Status monitoring of each mirror node
- Help information for users of each mirror
Reference for this article: http://blog.ustc.edu.cn/pipermail/ustc_lug/2013-March/009974.html