Which Domestic AI Large Model Has the Most Promising Future?

(This article was first published on Zhihu)

No conflict of interest: Since I am not working on foundational large models (I work on infra and application layers) and am currently not involved in the domestic market, I can provide some information from a relatively neutral perspective.

After a few months of running a startup, I have found that I can access far more information than a typical big-company employee, and I have learned a lot from investors and core members of the world’s top AI companies. Based on what I gathered over three months in the United States, I think ByteDance and Baidu are the most promising of the big companies, and Zhipu and Moonshot are the most promising of the startups that have publicly released large models.

Although Robin Li said that hundreds of companies in China are already working on foundational large models, these models are relatively homogeneous, so the market is likely to end up like the public cloud market: the top three take most of the share, and everyone else is lumped into “others.”

Most large model startups in China have been around for only about half a year, and nothing is set in stone yet. Some hidden masters are still quietly preparing their big moves. The era of large models has just begun, and as the saying goes, as long as the green hills remain, one need not worry about firewood.

ByteDance

Why do I think ByteDance might be the most promising?

ByteDance has the most private multimodal Chinese data. High-quality data is critical for training large models. High-quality public Chinese data is already scarce and has been almost entirely scraped. The next step for large models is definitely multimodal, and ByteDance has the most multimodal data.
ByteDance has scientists from OpenAI. A few months ago, ByteDance poached several scientists from OpenAI at a high price. In April, there were rumors that ByteDance wanted to spend millions of dollars to poach people from OpenAI, but the interviewer was counter-poached by OpenAI, which became a joke. Unexpectedly, ByteDance really managed to poach people from OpenAI.
ByteDance has a lot of GPU resources. A few years ago, ByteDance began building large-scale GPU clusters and accumulated a large stockpile of GPUs. At the beginning of this year, during the GPU shortage, that stockpile made Volcano Cloud the second-largest GPU cloud service provider in China. ByteDance also started building large-scale RoCE networks for GPU cluster interconnection a few years ago. Only a few companies worldwide can manage this; Microsoft is the largest and published a paper on it just this year. Some companies that tried to emulate Microsoft’s RoCE interconnection for GPU clusters ran into pitfalls.
ByteDance has practical application scenarios. For example, many companies are working on video generation, but ByteDance has not pursued end-to-end generation the way Runway ML has. Instead, it has taken a more pragmatic approach, integrating AI capabilities into Jianying to make it easier for users to create short videos.

Note that Douyin does not use ByteDance’s latest large model, so the model in Douyin does not reflect ByteDance’s latest progress. Although ByteDance’s large models are currently not as good as Baidu’s, they are improving quickly.

Baidu

I also think Baidu is very promising among the big companies.

Baidu has a first-mover advantage. Wenxin Yiyan was the first officially released Chinese large model and already has millions of daily active users (DAUs) on the consumer side. It has reached a level between GPT-3.5 and GPT-4, with some Chinese-language capabilities already on par with GPT-4.
Baidu’s top management takes it very seriously. During the most critical period for Wenxin Yiyan, Robin Li listened to reports from the Wenxin Yiyan team every day. GPUs are the only computing resource that he schedules personally.
Baidu has accumulated a lot of text data. Before the mobile internet era, Baidu was the largest aggregator of public data on the Chinese internet. Mobile internet data is mostly siloed, and much of it is private; Tencent would not dare to use WeChat or QQ chat records to train large models. Baidu’s data team is also very strong, with professional data collection and cleaning. On data augmentation alone, it spends tens of millions a month on OpenAI API calls.

Of course, Alibaba, Tencent, and Huawei each have their own advantages: Alibaba has numerous GPUs and advanced infra, Tencent has practical application scenarios, and Huawei has its own AI chips. But the large models they have released so far are not as good as those of Baidu and ByteDance.

Moonshot

Moonshot is a representative of domestic large model startups.

Moonshot has a professional and harmonious team. Although the team is relatively young, it has very sharp technical views, including the “compression equals intelligence” view that was popular a few months ago. Since no one in the field had successfully built a large model before, being young might actually be an advantage. The founding team gets along well, has no big company disease, and is fully focused on technology.
Its large model is progressing quickly and has already surpassed the GPT-3.5 level. Among the large models released by startups, only Moonshot’s has surpassed GPT-3.5. The team did not simply copy the LLaMA architecture but made many engineering optimizations. For example, its long-context capability is the strongest in China. With reasonable prompts, the probability of extracting information from any position in the context exceeds 90%, which cannot be achieved simply with methods like LongChat. However, this does not rule out that other companies are quietly preparing big moves.
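
A per-position retrieval figure like the one above is often measured with a "needle in a haystack" test: hide one fact at different depths of a long context and check whether the model can repeat it. The sketch below is only an illustration of that idea under my own assumptions, not Moonshot's actual evaluation. The names `ask_model`, `FILLER`, and the passcode needle are hypothetical, and `perfect_reader` is a toy stand-in for a real model call.

```python
import re
from typing import Callable, Dict

# Hypothetical padding text; a real test would use varied documents.
FILLER = "The sky was clear and the market was quiet that day."


def build_context(needle: str, n_sentences: int, position: int) -> str:
    """Place `needle` at index `position` among `n_sentences` filler sentences."""
    sentences = [FILLER] * n_sentences
    sentences.insert(position, needle)
    return " ".join(sentences)


def retrieval_rate_by_position(
    ask_model: Callable[[str], str],
    n_sentences: int = 200,
    n_positions: int = 10,  # must be at least 2
    trials: int = 5,
) -> Dict[int, float]:
    """For each tested insertion position, return the fraction of trials
    in which the model's answer contains the hidden passcode."""
    # Evenly spaced insertion points from the start to the end of the context.
    positions = [round(i * n_sentences / (n_positions - 1)) for i in range(n_positions)]
    rates: Dict[int, float] = {}
    for pos in positions:
        hits = 0
        for t in range(trials):
            code = f"{pos:04d}-{t:02d}"  # a distinct passcode per trial
            needle = f"The secret passcode is {code}."
            prompt = build_context(needle, n_sentences, pos) + "\n\nWhat is the secret passcode?"
            if code in ask_model(prompt):
                hits += 1
        rates[pos] = hits / trials
    return rates


def perfect_reader(prompt: str) -> str:
    """Toy stand-in for a real model: finds the passcode with a regex."""
    match = re.search(r"passcode is ([\d-]+)\.", prompt)
    return match.group(1) if match else ""


rates = retrieval_rate_by_position(perfect_reader)
```

With a real model plugged in as `ask_model`, the claim in the text would correspond to every value in `rates` being above 0.9. A model that effectively reads only the tail of its context would score near zero for the early positions.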

Compared with the other top large model startups, Moonshot’s biggest disadvantage right now is that its funding is not the largest, and its GPU resources may not be enough to train a GPT-4 level model.

I believe that ByteDance, Baidu, Alibaba, and Huawei will definitely rely on their own large model teams. Tencent is also developing large models, but it may still acquire a large model company. At the right time, being acquired by Tencent would not be a bad outcome.

Zhipu

Zhipu is a representative of domestic large model startups in the to B (enterprise) field.

Zhipu has a unique business model. Although the to B market is relatively hard to profit from, its revenue is more secure. Most domestic large model startups mainly target the to C market, so a company with to B resources can secure a stable niche in this track. This is also why Zhipu has the largest headcount among domestic large model startups.
Large financing amount. Among domestic startups, its funding should be relatively large. Although the investment may not match that of big companies like ByteDance and Baidu, as long as the GPUs are in place, it is enough to train a GPT-4 level model.
Early start, timely pivot. Zhipu initially worked on knowledge graphs. When the large model wave arrived, it pivoted to large models in time. It took a pragmatic approach to combining knowledge graphs and large models, without forcing knowledge graphs into the Transformer.

However, the publicly available ChatGLM model has not yet fully reached the GPT-3.5 level, while the models of ByteDance, Baidu, and Moonshot have already surpassed it. I included Zhipu mainly because it will definitely occupy a place in the to B field.

What about other companies?

I cannot comment on other companies one by one. There are many strong companies not listed here, and I can only see the current progress, not predict the future.

“Happy families are all alike; every unhappy family is unhappy in its own way.”

The most common problems in big companies:

Resource fragmentation. Limited GPU and talent resources are split among multiple competing teams, wasting a lot of resources internally.
Thick department walls. The departments working on large models cannot access the data they need.
Senior experts who are not on the front line direct young experts who are, so technical solutions are not grounded enough. Large models are a new field, and except for a few top scientists, everyone is starting from the same line. The research taste that succeeded in traditional AI may not transfer to large models and may even become an obstacle. As Jelinek said decades ago, every time a linguist is fired, the accuracy of speech recognition goes up.
Locked into the company’s existing business scenarios. Large models are a general-purpose technology, but big companies often require that they first serve existing products. If existing products have few integration points with large models, the models may not be applied effectively. OpenAI was initially required by Microsoft to prioritize use in Office, which led to staff departures. Fortunately, Microsoft did not end up doing such a short-sighted thing.
Copying others’ architectures without their own innovation.
The other extreme: blindly pursuing innovation, such as insisting on a novel non-Transformer architecture without thinking it through deeply enough, and falling into pitfalls as a result.
Attacking on all fronts, but succeeding in none. Wanting to do GPT-4, multimodal, and long-context at the same time, with text and multimodal being handled by two different groups; wanting to do both to B and to C, both domestic and overseas markets.
Unable to buy/rent GPUs.

The most common problems in startups:

Founder disputes, like what happened at OpenAI recently, where core members engage in internal strife. This is the most terrifying problem for startups.
Not a big company, but suffering from big company disease. If there are many people with senior big company experience in the company, this problem is likely to occur.
Unable to recruit reliable experts in a certain field. Data, algorithms, and infra are all important directions, and it is not easy to recruit reliable people in all three.
Rushing to release immature technology, damaging the company’s reputation. For example, many Chinese large models are built by continued pretraining of LLaMA with some added Chinese corpora. Using LLaMA’s architecture but collecting and cleaning data from scratch for pretraining is already something only top startups can handle. The real masters are those preparing to deliver a GPT-4 level model in one decisive strike.
Lack of a technical moat. For example, creating a virtual Freud product with a prompt is easy, but making it truly understand Freud’s theory and act as a reliable psychotherapist is not possible without real technical accumulation. If a company is always worried about what to do when OpenAI enters its business, its technical moat is not deep enough.
Founders do not understand technology. If founders cannot read papers, it is hard for them to follow the latest developments in the large model field. They are bombarded by public account posts every day and easily lose patience.
Unable to buy/rent GPUs.
Blindly expanding scale. This not only burns through limited funds quickly but also tends to cause overstaffing and big company disease.
Unable to secure the next round of financing. No one knows how many foundational large model companies will perish in next year’s cold winter.