The Story of Collecting Large Model Training Corpus

Starting in July, I spent a month on my own collecting more than 200 TB of training corpus for large models, spending 200,000 RMB on traffic and cloud storage fees. As with the recently released Mate60 Pro, it truly feels like Li Bai’s line: “while the apes on both banks cry without end, the light boat has already passed ten thousand mountains”.

What’s in the 200 TB Corpus

Z-library e-books, 22.43 million volumes, totaling 31 TB
Libgen library e-books, 3.78 million volumes, totaling 33 TB
Scimag academic papers and journals, 87.6 million articles, totaling 77 TB
Various Chinese corpora, totaling 4 TB, including:
- Complete set of primary, middle, and high school textbooks, 35 GB
- Over 10,000 university textbooks and professional books, 142 GB
- Collections of dozens of classic newspapers and magazines such as “People’s Daily”, “Reference News”, “Sanlian Life Weekly”, “Global Science”, “Reader”, “China National Geographic”, totaling 1 TB
- Baidu Encyclopedia with 12 million entries, 20 GB
- Ancient books and local county annals, 1.6 TB
- Various recommended book lists, English-Chinese bilingual world classics, translations of Chinese classics, etc., over 20,000 books, about 300 GB
- Various dictionaries, 100 GB
- Various Chinese novels, about 100 GB
Various datasets:
- RedPajama dataset, an open-source replica of the LLaMA dataset, 2.8 TB
- MNBVC dataset, 1 TB
- CommonCrawl May-June 2023 version of WET plain text data, compressed to 8.6 TB
- Historical Whois data for almost all domain names worldwide (3 billion entries), 2.5 TB
- TheStack dataset, source code of well-known open-source projects on GitHub, 3 TB
- The-Eye dataset, a collection of many AI training datasets, 15 TB
- AmazonReviews dataset, 55 GB

Why did I collect so many books? Many of these books are PDFs made up of scanned page images, which need OCR before they can serve as training corpus for a text model. I had two considerations:

First, the quality of a corpus matters more than its quantity. Baidu Tieba may have more posts than there are books, but Tieba posts can only train a large model into a jokester, not one that does serious work; to master knowledge, a model still needs to learn systematically from books and literature.
Second, multimodal large models will become mainstream. Vision carries a lot of important information about the human world, and today’s text-only large models lose much of it by training on text alone. Future multimodal large models can learn knowledge that combines images and text directly from PDF books.

Whois Domain Registration History Dataset

Today, I took one of the more interesting datasets, had GPT-4 help me write the code, and spent 3 hours building a query website covering 3 billion historical Whois records for domains worldwide: whois.os.ai.

For example, if you search for Microsoft, you can see that there are actually many microsoft.* domains, and it takes a while to load them all. You can also search for your own domain. Most domains that have ever existed are in this database, and most newly registered domains can be queried in the system the next day.

This dataset originated from my assignment for the Advanced Software Engineering course at MSRA in 2013~2014. At that time, I built a website, soip.net (you can still find traces of its registration history on whois.os.ai). I got the .com and .net DNS zone files from Verisign (these gTLD zone files can currently be obtained through ICANN), then slowly crawled the Whois data for all of these tens of millions of domains (the number of .com domains has since exceeded 100 million), and also crawled the IP addresses that each domain resolved to.
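The first step, getting a list of registered names out of a zone file, can be sketched roughly as follows. This is only an illustration: it assumes a text zone file with one record per line and whitespace-separated fields (owner name, TTL, class, type, data), which may not match the layout of the 2013 Verisign files, and `extract_domains` and the sample records are hypothetical.

```python
from typing import Iterable, Set


def extract_domains(lines: Iterable[str], tld: str) -> Set[str]:
    """Collect names directly under `tld` that own at least one NS record."""
    suffix = "." + tld.lower().strip(".")
    domains: Set[str] = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";"):  # skip blanks and comments
            continue
        fields = line.split()
        if len(fields) < 5:  # assumed layout: owner, TTL, class, type, data
            continue
        owner, rtype = fields[0], fields[3]
        if rtype.lower() != "ns":  # glue A/AAAA records, SOA, etc. are ignored
            continue
        name = owner.lower().rstrip(".")
        # Keep only delegations one label below the TLD, e.g. example.com
        if name.endswith(suffix) and name.count(".") == suffix.count("."):
            domains.add(name)
    return domains


# Hypothetical excerpt, not real zone data
sample = [
    "; hypothetical zone excerpt",
    "com. 900 in soa a.example. b.example. 1 1800 900 604800 86400",
    "example.com. 172800 in ns ns1.example.net.",
    "example.com. 172800 in ns ns2.example.net.",
    "soip.net. 172800 in ns ns1.example.org.",
    "ns1.example.com. 172800 in a 192.0.2.1",
]
com_domains = extract_domains(sample, "com")
```

Each name collected this way would then become one entry in the crawl queue for Whois and DNS lookups.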

This produced a linked dataset of domains, IP addresses, and Whois registration information. Given an IP, you can look up which domains are hosted on that machine; given registration information, you can look up which domains a person has registered. At that time, privacy protection for domain registrations was not yet common, so the registrant’s real name, address, email, and phone number could be found publicly through Whois. There were already companies providing such services back then, so I built the website just for the course assignment and did not keep operating it.
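The two reverse lookups can be pictured as a pair of inverted indexes over the same records. The sketch below is an in-memory illustration only, not the site's actual storage; `DomainRecord`, `DomainIndex`, and all field names and sample values are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class DomainRecord:
    domain: str
    ip: str                 # address the domain resolved to
    registrant_email: str   # taken from the Whois record


class DomainIndex:
    """Inverted indexes: IP -> domains and registrant email -> domains."""

    def __init__(self) -> None:
        self.by_ip: Dict[str, Set[str]] = defaultdict(set)
        self.by_email: Dict[str, Set[str]] = defaultdict(set)

    def add(self, rec: DomainRecord) -> None:
        self.by_ip[rec.ip].add(rec.domain)
        self.by_email[rec.registrant_email.lower()].add(rec.domain)

    def domains_on_ip(self, ip: str) -> List[str]:
        return sorted(self.by_ip.get(ip, set()))

    def domains_of(self, email: str) -> List[str]:
        return sorted(self.by_email.get(email.lower(), set()))


# Hypothetical sample data using reserved example names and addresses
index = DomainIndex()
index.add(DomainRecord("example.com", "192.0.2.10", "alice@example.org"))
index.add(DomainRecord("example.net", "192.0.2.10", "bob@example.org"))
index.add(DomainRecord("example.org", "192.0.2.20", "alice@example.org"))
```

Here `index.domains_on_ip("192.0.2.10")` lists every domain sharing that host, and `index.domains_of("alice@example.org")` lists every domain tied to that registrant.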

But I think the history of Whois registration information is highly valuable. Like the Internet Archive’s Wayback Machine, it records one side of Internet history. So I kept maintaining it, and later added more gTLD and ccTLD data sources. Of course, my hobby project can’t reach 100% coverage, unlike companies such as WhoisXMLAPI that provide historical Whois data professionally.

Ten years on, the Whois dataset holds more than 700 million domains and close to 3 billion historical Whois records. Only a little over 200 million of these domains are currently active; more than 400 million have disappeared into the dust of history. Most of these domains were bought by “domain farmers” for investment or collection and were never really used to build websites. Some people who don’t understand the technology think that as long as they tell no one after registering a domain, nobody will know about it, but that’s not the case. For most top-level domains, the daily increments of domain registration information and DNS information are available, and anyone with a partnership agreement can get them. With the domain dataset, you can crawl many websites that search engines have not indexed.
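When only full daily snapshots of a domain list are available, a daily increment can be derived by comparing two consecutive snapshots. This is a minimal sketch of that idea, not a description of how the registries publish their increments; `diff_snapshots` and the sample names are hypothetical.

```python
from typing import Iterable, Set, Tuple


def diff_snapshots(
    yesterday: Iterable[str], today: Iterable[str]
) -> Tuple[Set[str], Set[str]]:
    """Return (newly seen domains, domains that disappeared)."""
    old, new = set(yesterday), set(today)
    return new - old, old - new


# Hypothetical snapshots of one TLD on two consecutive days
day1 = ["example.com", "old-shop.com", "blog.com"]
day2 = ["example.com", "blog.com", "brand-new.com"]
added, dropped = diff_snapshots(day1, day2)
```

The `added` set is what a crawler would fetch Whois records for the next day, while `dropped` marks domains that have left the zone.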

If I were to write this query website from scratch, it would take at least 2 days. With GPT-4, it only took 3 hours, and the front end is even more beautiful than what I could do. The source code for the entire website was basically written by GPT-4, including the front end, Flask backend, and the script for importing CSV data into MongoDB (of course, importing data took a day or two). The entire front end consists of only one file, and the backend also only has one file, totaling over 500 lines of code. If there is any problem with the code, I let GPT-4 modify it. I’m just a product manager who provides requirements, without writing a single line of code.
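For a sense of what the import script has to do, here is a minimal sketch of reading a CSV file in batches so that each batch can be written to the database in one call. It is not the GPT-4-generated script; the column names (`domain`, `registrar`, `created`) and the helper `csv_batches` are assumptions made for illustration, and the database write itself is left as a comment.

```python
import csv
import io
from typing import Dict, Iterator, List, TextIO


def csv_batches(
    handle: TextIO, batch_size: int = 1000
) -> Iterator[List[Dict[str, str]]]:
    """Yield lists of row dicts, at most `batch_size` rows per list."""
    reader = csv.DictReader(handle)
    batch: List[Dict[str, str]] = []
    for row in reader:
        # Normalize the lookup key so queries are case-insensitive
        row["domain"] = row["domain"].strip().lower()
        batch.append(row)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch


# Hypothetical CSV content; the real files and columns are not shown in the post
data = io.StringIO(
    "domain,registrar,created\n"
    "Example.COM,Hypothetical Registrar,2013-10-01\n"
    "example.net,Hypothetical Registrar,2014-01-15\n"
    "example.org,Another Registrar,2014-02-01\n"
)
batches = list(csv_batches(data, batch_size=2))
# With pymongo, each batch could be passed to collection.insert_many(batch)
```

Batching keeps memory use flat regardless of file size, which matters when the import runs for a day or two.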

Data Collection and Purchasing Data

I have also been in contact with some companies that sell data. Cleaned data is actually quite expensive, far more than the cost of collecting the data yourself. But some data is hard to crawl on your own: Tianya Forum no longer exists, articles on WeChat public accounts are hard to enumerate in full, and some industry data is simply not public.

However, for websites like Zhihu, there is no need to buy data. Zhihu now has hundreds of millions of questions and billions of answers, and buying them at data companies’ prices would cost an untold amount of money. Therefore, the ability to crawl data on your own is very important.

Data cleaning is also crucial. I have seen some large language models where the answers still contain things like “expand all”, “previous page”, “next page”, which indicates that the data has not been properly cleaned.
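A very small part of such cleaning can be sketched as a filter that drops lines consisting only of page-widget text. This is a minimal illustration under the assumption that the residue sits on its own line; the phrase list and the name `strip_ui_residue` are hypothetical, and real pipelines need far more than this.

```python
import re

# Hypothetical list of widget strings left behind by page scraping
UI_RESIDUE = ("expand all", "previous page", "next page")

# Match a line only if the whole line is one of the phrases
_RESIDUE_LINE = re.compile(
    r"^\s*(?:%s)\s*$" % "|".join(re.escape(p) for p in UI_RESIDUE),
    re.IGNORECASE,
)


def strip_ui_residue(text: str) -> str:
    """Remove lines that contain nothing but a known widget phrase."""
    kept = [line for line in text.splitlines() if not _RESIDUE_LINE.match(line)]
    return "\n".join(kept)


raw = (
    "Answer body.\n"
    "Expand all\n"
    "A sentence that mentions the next page of a book.\n"
    "Previous page\n"
    "Next page"
)
cleaned = strip_ui_residue(raw)
```

Matching whole lines rather than substrings keeps ordinary sentences that happen to contain a phrase such as "next page".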

I just used my spare time to do some preliminary data collection and cleaning, and I will share any new progress with everyone in the future.
