2023-09-12
PLDI '21 Talk Transcription: AKG: Automatic Kernel Generation for Neural Processing Units using Polyhedral Transformations

Jie Zhao, Bojie Li, Wang Nie, Zhen Geng, Renwei Zhang, Xiong Gao, Bin Cheng, Chen Wu, Yun Cheng, Zheng Li, Peng Di, Kun Zhang, Xuefeng Jin. AKG: Automatic Kernel Generation for Neural Processing Units using Polyhedral Transformations. 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation (PLDI’21). Virtual, Canada, June 20-25, 2021. pp.1233-1248. [Paper PDF] [Slides by Jie Zhao]

Read More

2023-09-12
SIGCOMM '19 Talk Transcription for SocksDirect: Datacenter Sockets can be Fast and Compatible

Large models are really amazing. This SIGCOMM 2019 talk was delivered completely off-script, as the video shows: I am standing in the middle of the stage, not looking at speaker notes. My English wasn't great at the time, I often stuttered, and the audio recording even has an echo that makes it a bit hard even for me to listen to. I didn't expect a large model to transcribe such poor speech almost perfectly.

The recognition method is described here. Because the screen captured in this video is not clear enough, I replaced the images extracted from the video with images exported from the original PPT. You can try for yourself how well off-the-shelf speech-recognition software handles the audio in this video; the ones I've tried, including Google Speech-to-Text and Whisper, are basically unusable on it.

SocksDirect: Datacenter Sockets can be Fast and Compatible. [PDF] [Slides] [Video]
Bojie Li, Tianyi Cui, Zibo Wang, Wei Bai, Lintao Zhang.
Proceedings of the 2019 SIGCOMM Conference (SIGCOMM’19).

Read More

2023-09-12
SIGCOMM '21 Talk Transcription for 1Pipe: Scalable Total Order Communication in Data Center Networks

Bojie Li, Gefei Zuo, Wei Bai, and Lintao Zhang. 1Pipe: Scalable Total Order Communication in Data Center Networks. SIGCOMM ‘21. [Paper PDF] [Slides with audio (25 min)] [Slides with audio (12 min)]

Read More

2023-09-12
Release of English Version of the Blog

To make it easier for international readers to follow my blog, I used GPT-4 to automatically translate the content of this site into English:

Automatically translated English version

Chinese main site

Read More

2023-09-10
A100/H100 too expensive, why not use 4090?

(Long text warning: this article is about 16000 words)

This is a good question. To start with the conclusion: using the 4090 to train large models is not feasible, but using it for inference/serving is not only feasible, its cost performance can even be slightly higher than the H100's. If the 4090 is optimized to the extreme, its cost performance can reach twice that of the H100.

In fact, the biggest differences between the H100/A100 and the 4090 lie in communication and memory; the gap in compute is not large.

                          H100            A100        4090
Tensor FP16 compute       989 Tflops      312 Tflops  330 Tflops
Tensor FP32 compute       495 Tflops      156 Tflops  83 Tflops
Memory capacity           80 GB           80 GB       24 GB
Memory bandwidth          3.35 TB/s       2 TB/s      1 TB/s
Communication bandwidth   900 GB/s        900 GB/s    64 GB/s
Communication latency     ~1 us           ~1 us       ~10 us
Price                     $30000~$40000   $15000      $1600
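As a back-of-envelope check, the table's dense FP16 numbers and prices give raw compute per dollar (a sketch only; the H100 price is taken as $35000, the midpoint of the quoted range, and this raw ratio deliberately ignores the memory and communication limits that dominate in training):

```python
# Back-of-envelope: dense FP16 Tflops per dollar, using the table's numbers.
specs = {
    "H100": (989, 35000),  # (dense FP16 Tflops, assumed price in USD)
    "A100": (312, 15000),
    "4090": (330, 1600),
}
for name, (tflops, price) in specs.items():
    print(f"{name}: {tflops / price * 1000:.1f} Tflops per $1000")
```

By raw compute per dollar the 4090 wins by far more than 2x; the rest of the article explains why memory and communication shrink that advantage in practice.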

NVIDIA's spec sheets contain a lot of padding. For example, H100 FP16 compute is listed as 1979 Tflops, but that figure includes sparsity; dense compute is only half that. The official marketing touts the 4090's Tensor Core compute as 1321 Tflops, but that is int8; FP16 is only 330 Tflops. The first version of this article used the wrong data for both the H100 and the 4090, and the conclusion was wildly off.

The H100's price actually carries a markup of more than 10x over cost. In 2016, when I was at MSRA, I witnessed Microsoft deploying FPGAs on every server, driving FPGA prices into the ground, and even becoming an important force behind Intel's acquisition of the supplier Altera. In 2017 I mined cryptocurrency myself, so I knew which graphics card was the most cost-effective. Later, at Huawei, I was a core participant in software development for the Kunpeng and Ascend ecosystems. So I have a rough idea of what a chip costs.

Xia Core, the chief architect of Kunpeng, wrote a well-known article, "Talking about the broken ass of the Nvidia Empire", which analyzes the cost of the H100 well:

Breaking down its cost: the SXM board costs no more than $300, and the packaging substrate plus CoWoS is another roughly $300. The big logic die in the middle looks the most expensive :) It is a 4nm, 814mm² die; a 12-inch TSMC wafer yields roughly 60 dies of that size, and Nvidia handles partial-good dies very well (it almost never sells full-good ones), so about 50 of those 60 are usable. Nvidia is a big customer and gets wafers from TSMC for about $15,000, so this expensive die costs only about $300. That leaves only the HBM. The DRAM market is currently so weak it is nearly dying; even HBM3 is basically selling at a loss, at about $15/GB, so 80 GB of capacity costs $1,200.
TSMC once told a story: the Taiwanese work hard and save money to build fabs, and such an advanced 4nm process sells for only $15,000 a wafer, yet a certain customer turns one wafer into $1,500,000 ($30,000 × 50) of product. That is quite galling. Do you understand what I mean?
As I said at the beginning, under the business rules of this world, selling something that costs $2,000 for $30,000, with only one company doing it, and at high volume, is illogical. A golden goose like this needs an aircraft carrier to guard it.
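The quoted arithmetic can be checked line by line; all figures below come from the quote itself, and the total lands near the "$2000 cost" it cites:

```python
# Reproducing the quoted H100 bill-of-materials estimate.
wafer_price = 15000                      # ~$ per 4nm 12-inch wafer (big-customer price)
usable_dies = 50                         # ~60 candidate dies, ~50 usable (partial-good)
logic_die = wafer_price / usable_dies    # ~= $300 per die
hbm = 80 * 15                            # 80 GB of HBM3 at ~$15/GB = $1200
sxm_board = 300
substrate_and_cowos = 300
total = logic_die + hbm + sxm_board + substrate_and_cowos
print(f"Estimated BOM: ${total:.0f}")    # ~= $2100, vs a ~$30000 selling price
```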

It is said that Microsoft and OpenAI have booked half of the 2024 H100 production capacity. Do you think they will play the traditional art of price negotiation, as they did with Altera? Will they really spend $40,000 × 500,000 = $20 billion to buy cards?

Let’s analyze the cost of 4090 again, the 5nm 609mm2 Die, the cost is about $250. GDDR6X, 24 GB, calculated at $10 per 1 GB, $240. Let’s count PCIe Gen4, this cheap thing, as $100. Packaging and fans, count it as $300. The total cost is at most $900, this thing sells for $1600, it is considered a conscience price, because the research and development cost is also money, not to mention that most of NVIDIA’s R&D personnel are in Silicon Valley, where the average salary of programmers is the highest in the world.

It can be said that the H100 is like a house in a Chinese first-tier city: the concrete and steel themselves are not worth much, and the price is driven entirely by supply and demand. I have been living in LA for two weeks. The house the company rents has four times the usable area of my house in Beijing but costs only 30% more, and it comes with a small courtyard; per unit area, it is about one third the price of the Beijing house. When I chat with locals, they are all surprised: your average income is so much lower than LA's, how can you afford a house in Beijing?

The question is: if the 4090 is such a good deal, why is everyone scrambling to buy the H100, causing it to sell out? Why does the H100 even have to be banned from sale to China, with a cut-down H800 made in its place?

Read More

2023-09-08
APNet'23 Talk Transcription for FastWake: Revisiting Host Network Stack for Interrupt-mode RDMA

Although most people prefer watching videos, I prefer reading text because text facilitates non-linear searching, allows for quick skimming, and is convenient for reviewing previous content at any time.

Recently, I have converted some of my conference talk videos into text, such as ClickNP, KV-Direct, and The New Golden Age of Computer Networks series. Today, I am releasing FastWake from APNet 2023. For the ClickNP and KV-Direct presentations, I would write the script in the speaker notes of the PPT and read it on the spot. This year, even the slides were rushed to completion the day before the conference; there was no time to write notes, let alone do a full rehearsal. I just went on stage and spoke.

Now, with large models, it is not difficult to convert lecture videos into slides plus a text script. In fact, I have always wanted to build such a plugin for online conferences.

  1. Extract key frames from the video to form a list of slide images. If a frame differs from the previous one by more than a threshold, assume the slide has changed. The open-source tool video2pdf can do this.
  2. OCR each image into text. Since it is all printed characters, recognition accuracy is very high; Tesseract can do it.
  3. Extract the audio for the interval each slide stays on screen and feed it to a speech-to-text model; I use OpenAI's open-source Whisper.
  4. (This last step is crucial.) Have a large language model (such as GPT-4) correct the speech-to-text transcript, using the OCR'd text of the current slide and of the title slide as reference.
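The slide-change heuristic in step 1 can be sketched in a few lines (a simplified illustration, not video2pdf's actual algorithm; frames are assumed to be 2-D grids of grayscale pixel values):

```python
def detect_slide_changes(frames, pixel_delta=25, change_fraction=0.1):
    """Return indices of frames where a new slide likely starts.

    frames: list of 2-D lists of grayscale pixel values (all the same shape).
    A frame starts a new slide when more than `change_fraction` of its
    pixels differ from the previous frame by more than `pixel_delta`.
    """
    keyframes = [0]  # the first frame always starts a slide
    for i in range(1, len(frames)):
        prev, cur = frames[i - 1], frames[i]
        total = len(cur) * len(cur[0])
        changed = sum(
            abs(cur[r][c] - prev[r][c]) > pixel_delta
            for r in range(len(cur))
            for c in range(len(cur[0]))
        )
        if changed / total > change_fraction:
            keyframes.append(i)
    return keyframes

# tiny demo: three identical dark frames, then a bright one
dark = [[0] * 4 for _ in range(4)]
bright = [[200] * 4 for _ in range(4)]
print(detect_slide_changes([dark, dark, dark, bright]))  # -> [0, 3]
```

The thresholds here are arbitrary; in practice they would be tuned so that cursor movement or video playback inside a slide does not count as a slide change.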

Current speech-to-text models are not very accurate at recognizing proper nouns and names, but many of those proper nouns appear on the slide itself, and the title slide pins down the talk's title and field. With the slide content as reference, a large language model can therefore correct most of the proper-noun errors. Without the slide content, GPT-4 is needed to fix most of the proper nouns; with it, LLaMA-2-70b-chat is enough. The language model can also smooth out colloquial expressions in the speech, making the transcript more rigorous and readable.
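The correction step might use a prompt along these lines (a hypothetical template; the post does not give the exact wording used):

```python
def build_correction_prompt(title_ocr, slide_ocr, asr_text):
    """Assemble an LLM prompt that corrects an ASR transcript using slide text."""
    return (
        "You are correcting the speech-to-text transcript of a conference talk.\n\n"
        f"Talk title and field (OCR of the title slide):\n{title_ocr}\n\n"
        f"Text on the current slide (OCR; contains the talk's proper nouns):\n{slide_ocr}\n\n"
        f"Raw transcript for this slide:\n{asr_text}\n\n"
        "Fix misrecognized proper nouns using the slide text, smooth out "
        "colloquial phrasing, and return only the corrected transcript."
    )

prompt = build_correction_prompt(
    "FastWake: Revisiting Host Network Stack for Interrupt-mode RDMA",
    "FastWake design: kernel-bypass wake-up",
    "today i will talk about fast wake, our interrupt mode r d m a work",
)
print(prompt)
```

Feeding the slide text alongside the raw transcript is what lets a smaller model like LLaMA-2-70b-chat match GPT-4 on proper-noun correction.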

The text script below is entirely auto-generated; apart from a few names, nothing has been changed. Some minor errors remain, but they are all harmless. The video2pdf, Tesseract, Whisper, and LLaMA-2-70b-chat models used in the process all run on my own Mac laptop; no internet connection is required at any point.

Read More

2023-09-06
The Story of Collecting Large Model Training Corpus

Starting in July, I spent a month on my own collecting over 200 TB of large-model training corpus, spending 200,000 RMB on traffic and cloud storage fees. Just like the recently released Mate 60 Pro, it was truly a case of "the apes on both banks cry without cease, yet the light boat has passed ten thousand mountains".

What’s in the 200 TB Corpus

  • Z-library e-books, 22.43 million volumes, totaling 31 TB
  • Libgen library e-books, 3.78 million volumes, totaling 33 TB
  • Scimag academic papers and journal articles, 87.6 million items, totaling 77 TB
  • Various Chinese corpora, totaling 4 TB, including:
    • Complete set of primary, middle, and high school textbooks, 35 GB
    • Over 10,000 university textbooks and professional books, 142 GB
    • Collections of dozens of classic newspapers and magazines such as “People’s Daily”, “Reference News”, “Sanlian Life Weekly”, “Global Science”, “Reader”, “China National Geographic”, totaling 1 TB
    • Baidu Encyclopedia with 12 million entries, 20 GB
    • Ancient books, local county annals 1.6 TB
    • Various recommended book lists, English-Chinese bilingual world classics, translations of Chinese classics, etc., over 20,000 books, about 300 GB
    • Various dictionaries 100 GB
    • Various Chinese novels about 100 GB
  • Various datasets:
    • RedPajama dataset, an open-source replica of the LLaMA dataset, 2.8 TB
    • MNBVC dataset, 1 TB
    • CommonCrawl May-June 2023 version of WET plain text data, compressed to 8.6 TB
    • Historical Whois data for almost all domain names worldwide (3 billion entries), 2.5 TB
    • TheStack dataset, source code of well-known open-source projects on GitHub, 3 TB
    • The-Eye dataset, a collection of many AI training datasets, 15 TB
    • AmazonReviews dataset, 55 GB

Why did I collect so many books? Many of them are PDFs made of scanned images and require OCR before they can serve as training corpus for a text model. I had two considerations:

  1. Corpus quality matters more than quantity. Baidu Tieba may have more posts than there are books, but Tieba posts can only train a large model into a jokester, not for serious work; to master knowledge, you still need to learn systematically from books and literature.
  2. In the future, multimodal large models will become mainstream. Vision contains a lot of important information about the human world. The current text large models only use text for training, which actually loses a lot of information. Future multimodal large models can directly learn multimodal knowledge containing images and text from PDF books.

Whois Domain Registration History Dataset

Today, I put one of the more interesting datasets to use: with GPT-4 writing the code for me, I spent 3 hours building a query website offering 3 billion historical Whois records for domains worldwide: whois.os.ai.

For example, search for Microsoft and you will see that there are actually many microsoft.* domains; it takes a while to load them all. You can also search for your own domain. Most domains that have ever existed are in this database, and most newly registered domains can be queried in the system the next day.

This dataset originated from my course project for the Advanced Software Engineering course at MSRA in 2013~2014. At that time, I built a website, soip.net (you can still find traces of its registration history on whois.os.ai), obtained the .com and .net DNS zone files from Verisign (these gTLD zone files can now be obtained through ICANN), then slowly crawled the Whois data for all of these tens of millions of domains (the number of .com domains now exceeds 100 million), and also crawled the IP address each domain resolved to.

This formed linked data of domains, IPs, and Whois registration information: given an IP, you can reverse-look-up which domains are hosted on that machine; given registration details, you can reverse-look-up which domains a person has registered. At the time, Whois privacy protection was not yet common, and a registrant's real name, address, email, and phone number were all publicly visible through Whois. There were already companies offering such services back then, so I built the website only for the course project and did not keep operating it.
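The two reverse lookups can be illustrated with a toy in-memory version of the linkage (made-up records purely for illustration; the real site queries a database):

```python
# Toy in-memory model of the domain / IP / Whois registrant linkage.
records = [
    {"domain": "example.com", "ip": "93.184.216.34", "registrant": "alice@example.com"},
    {"domain": "example.net", "ip": "93.184.216.34", "registrant": "bob@example.org"},
    {"domain": "sample.org",  "ip": "203.0.113.7",   "registrant": "alice@example.com"},
]

def domains_on_ip(ip):
    """Reverse lookup: which domains are hosted on this machine?"""
    return sorted(r["domain"] for r in records if r["ip"] == ip)

def domains_by_registrant(email):
    """Reverse lookup: which domains did this person register?"""
    return sorted(r["domain"] for r in records if r["registrant"] == email)

print(domains_on_ip("93.184.216.34"))              # -> ['example.com', 'example.net']
print(domains_by_registrant("alice@example.com"))  # -> ['example.com', 'sample.org']
```

At billions of records, these scans become secondary indexes on the IP and registrant fields, but the queries are the same joins.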

But I think the history of Whois registration information is of high value: like the Internet Archive's Wayback Machine, it records one facet of Internet history. So I kept maintaining it, later adding more gTLD and ccTLD data sources. Of course, a hobby project like mine cannot reach 100% coverage, unlike companies such as WhoisXMLAPI that provide Whois history professionally.

Ten years on, the Whois dataset covers more than 700 million domains and close to 3 billion historical records; only 200-odd million of those domains are still active, and over 400 million have vanished into the dust of history. Most of these domains were bought by "domain farmers" for investment or collection and never used to build real websites. Some people who don't understand the technology assume that if they tell no one about a newly registered domain, no one will know of it, but that is not the case: for most top-level domains, the daily increments of registration and DNS data are public, and anyone with the right partnership can obtain them. With a domain dataset, you can also crawl many websites that search engines never indexed.

If I had written this query website from scratch, it would have taken at least 2 days; with GPT-4, it took 3 hours, and the front end is prettier than anything I could have built. GPT-4 wrote essentially all of the source code: the front end, the Flask backend, and the script that imports the CSV data into MongoDB (though importing the data itself took a day or two). The front end is a single file and so is the backend, totaling just over 500 lines of code. Whenever the code had a problem, I had GPT-4 fix it. I was just a product manager supplying requirements, without writing a single line of code.

Data Collection and Purchasing Data

I have also been in contact with some companies that sell data. Cleaned data is actually quite expensive, far more than the cost of collecting it yourself. But some data is hard to crawl on your own: the Tianya Forum no longer exists today, it is hard to enumerate all articles on WeChat public accounts, and some industry data is simply not public.

For sites like Zhihu, however, there is no need to buy data. Zhihu now has hundreds of millions of questions and billions of answers; at data companies' prices, buying it would cost who knows how much. The ability to crawl data yourself is therefore very important.

Data cleaning is also crucial. I have seen large language models whose answers still contain strings like "expand all", "previous page", and "next page", which shows their training data was not properly cleaned.
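A first-pass filter for that kind of navigation residue can be very simple (a sketch with hypothetical patterns; real cleaning pipelines need many more rules):

```python
import re

# Hypothetical page-navigation boilerplate that leaks into scraped corpora,
# following the examples above.
BOILERPLATE = re.compile(r"(expand all|previous page|next page|load more)",
                         re.IGNORECASE)

def clean_lines(text):
    """Drop lines that are pure navigation chrome; keep real content."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and not BOILERPLATE.fullmatch(stripped):
            kept.append(line)
    return "\n".join(kept)

sample = "A useful answer.\nExpand all\nNext page\nMore useful text."
print(clean_lines(sample))  # drops the two navigation lines
```

Using `fullmatch` on the stripped line rather than a substring search avoids deleting real sentences that merely mention these phrases.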

I just used my spare time to do some preliminary data collection and cleaning, and I will share any new progress with everyone in the future.

Read More

2023-08-27
10 Soul-Searching Questions for AI Large Model Startups

  1. To build or not to build a foundational large model?
  2. To B or to C? Domestic or overseas?
  3. RMB capital or USD capital?
  4. Is AI Native application a mobile internet-level opportunity?
  5. Is your vision AGI?
  6. Can the problem of large models talking nonsense be solved?
  7. How does the large model infra profit?
  8. Where is your moat?
  9. Can your business model scale?
  10. How to deal with the regulation and legal responsibility of large models?

Below are my views on these 10 soul-searching questions.

Read More

2023-08-24
Tsinghua's Link Genius Boy: When Top Workers Start Their Own Business

Original video by Bilibili uploader "Bao Bao Ba 2022"

Backup of the video on this site (25:58, 121 MB)

The following is a text transcript produced by AI speech recognition:

Read More

2023-08-17
Speeches at Our Wedding by Various Guests

May 1, 2023, Shijiazhuang

  • Speech by Tan Bo
  • Speech by Mentor Lin Tao
  • Speech by Professor Tan Haisheng
  • Wedding vows of the groom, Li Bojie
  • Wedding vows of the bride, Meng Jiaying
  • Speech by the father of the groom
  • Speech by the father of the bride
  • Speech by the parents of the bride at the name-changing ceremony
  • Speech by the bride at the name-changing ceremony
  • Speech by the parents of the groom at the name-changing ceremony
Read More
RSS