Data is the Moat for Internet and AI Companies
Data is the most important moat.
The Moat for Internet Companies is Data
I really like Lao Wang’s product course. Wang Huiwen is a co-founder of Xiaonei and Meituan, and his Tsinghua product course is a classic worth revisiting repeatedly. It talks about economies of scale, and about how social networks have network effects. The essence of a network effect is actually data: who are my friends, and how close am I to each of them?
Lao Wang’s course points out that replicating WeChat is difficult: Alibaba and ByteDance both tried to attack WeChat and failed. However, if one day a “Prophet” app appeared that knew all of a person’s real-life friendships and automatically generated friend relationships from them, it could potentially compete with WeChat. That is the value of WeChat’s control over friend-relationship data.
But this Prophet app would still lack WeChat’s chat history and Moments history, so something would be missing. That is the value of conversation-history data. If the Prophet app went further and knew what everyone says and does every day, then even WeChat might not be its match.
The moat for e-commerce, food-delivery, and ride-hailing platforms is also data: which merchants are on the platform and what goods or dishes they offer; which riders are available and what the traffic looks like; what users want to buy or eat, and where they want to go. The more people on a platform, the more data there is, and the more likely users are to find what they want. Imagine a food-delivery platform in Beijing with only 10 restaurants and 10 riders: users would struggle to find anything they want to eat, and even if they did, a nearby rider would be hard to find, so the platform’s value would be very limited.
For content communities like Zhihu, Bilibili, Douyin, and Xiaohongshu, the moat is also data. The most interesting questions are asked on Zhihu, and the most professional answerers respond there; that is what gives a knowledge Q&A community its value. Building a content community is not expensive, but operating one well is much harder than building it.
The moat for internet finance companies is also data. Why do they dare to lend to users without collateral? Because users’ interaction data on the platform lets them assess creditworthiness far better.
The Moat for AI Startups is Data
After the shakeout of AI entrepreneurship in 2023 and 2024, people are pondering what exactly the moat of an AI startup is. I believe that for most AI startups, the moat still rests on data.
Many AI startups, mine included, have not been very successful mainly because they overestimated the value of technology. They assumed that as long as their algorithms were good (effective, high-performance, and low-cost), they could capture the market. But customers often feel that “good enough” is sufficient. Everyone uses similar foundation models, and engineering optimization alone cannot open a gap. Foundation models are evolving so quickly that any technical lead is nullified by the next model release, putting everyone back at the same starting line.
If we look back at the underlying logic of the internet, data is the key to building a moat for AI products.
Data comes in two kinds: user data and vertical-domain data.
User Data:
- AI companion apps: an app that knows a user’s preferences, personality, and personal history can obviously chat better. This user data accumulates slowly, and once accumulated it becomes a moat.
- Productivity tools like ChatGPT: if the tool knows what the user is currently working on (such as details of the current project), the user doesn’t need to repeat the same prompts every time. ChatGPT’s bio feature already does this, but it is not done well yet.
- AI programming tools like Cursor and Windsurf: if the tool knows the user’s preferred tech stack and frameworks, or the company’s rules about them, it can apply those preferences every time it writes code (see the sketch after this list).
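To make the user-data point concrete, here is a minimal sketch, in Python, of how a tool might persist a user’s preferences and current project (as in the ChatGPT and Cursor/Windsurf examples above) and inject them into every prompt so the user never has to repeat them. All names here (`UserProfile`, `build_prompt`) are hypothetical, not any product’s actual API.

```python
from dataclasses import dataclass, field

@dataclass
class UserProfile:
    """Accumulated user data: the more of it, the deeper the moat."""
    name: str
    preferred_stack: list[str] = field(default_factory=list)  # e.g. company-mandated frameworks
    current_project: str = ""  # what the user is working on right now

    def as_context(self) -> str:
        """Render the stored preferences as a system-prompt prefix."""
        return (
            f"The user prefers this tech stack: {', '.join(self.preferred_stack)}. "
            f"They are currently working on: {self.current_project}."
        )

def build_prompt(profile: UserProfile, request: str) -> list[dict]:
    """Prepend remembered context so the user never repeats it."""
    return [
        {"role": "system", "content": profile.as_context()},
        {"role": "user", "content": request},
    ]

profile = UserProfile("alice", ["TypeScript", "React", "PostgreSQL"], "an internal CRM dashboard")
print(build_prompt(profile, "Write a paginated table component."))
```

The moat is not the few lines of glue code, which anyone can copy; it is the profile contents, which only accumulate through continued use.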
Vertical Domain Data:
- Many people are building RAG systems now and complain that the generated content is low-quality. The problem usually lies not in the large model’s generation step but in retrieval. A general search engine can guarantee neither that every result found is relevant (precision) nor that every relevant result is found (recall), so it cannot be deployed in commercial scenarios that demand high accuracy. To achieve high accuracy, the data source must be structured, whether as a database or a knowledge graph. I covered this in a talk at Jiacheng’s Flowing Banquet, “AI Agents from Demo to Implementation”; its second part discusses how vertical-domain RAG can use structured data to improve retrieval precision and recall.
- Another challenge for putting AI into production is that many industries’ data and workflows have never been digitized, or, even when digitized, are scattered across chat logs and fragmented documents rather than organized into a structured knowledge base. Collecting and cleaning data to build such a knowledge base used to be very expensive and labor-intensive, but AI can now cut that cost dramatically. This is also covered in the second part of “AI Agents from Demo to Implementation.” Foreseeably, a company with some accumulation in its industry that is the first to use AI to build an industry knowledge base will have a significant competitive advantage.
- AI can amplify the value of data and deepen the data moat. Suppose a company has the most comprehensive and accurate data in some industry, such as information on every school and teacher nationwide. Previously, that data could only power a resource site for users to search and browse, and selling it in bulk would not fetch much money. With AI, if the same data backs a RAG application, the advice users get on choosing schools and majors might be the most professional available, as if everyone had a one-on-one consultation with Zhang Xuefeng. A RAG application limited to a general search engine cannot achieve results this comprehensive and accurate. The value of the industry data is thereby realized (a minimal sketch of such structured retrieval follows this list).
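As a concrete illustration of the structured-retrieval point, here is a minimal sketch built around a hypothetical `schools` table, echoing the nationwide school-data example. The idea is that retrieval becomes an exact database query rather than a fuzzy web search, so precision and recall are bounded by the data itself, not by a ranking heuristic. The schema and function names are illustrative, not taken from the talk.

```python
import sqlite3

# Hypothetical structured knowledge base: one row per school.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE schools (
    name TEXT, city TEXT, major TEXT, admission_score INTEGER)""")
conn.executemany(
    "INSERT INTO schools VALUES (?, ?, ?, ?)",
    [("A University", "Beijing", "Computer Science", 660),
     ("B University", "Shanghai", "Computer Science", 645),
     ("C College", "Beijing", "Finance", 610)],
)

def retrieve(city: str, major: str, score: int) -> list[tuple]:
    """Exact structured retrieval: every matching row is returned (recall)
    and every returned row matches the query (precision), unlike web search."""
    return conn.execute(
        "SELECT name, admission_score FROM schools "
        "WHERE city = ? AND major = ? AND admission_score <= ? "
        "ORDER BY admission_score DESC",
        (city, major, score),
    ).fetchall()

# The retrieved rows become grounded context for the generation step of RAG.
rows = retrieve("Beijing", "Computer Science", 665)
context = "\n".join(f"{name}: admission score {score}" for name, score in rows)
print("Context passed to the LLM:\n" + context)
```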
The vertical-domain data discussed above is not limited to B2B; it applies to B2C as well.
Many B2C apps today have low retention, mainly because they lack gamification and contextualization: after the initial novelty wears off, users don’t know what to do next and get no immediate incentives or feedback. A good B2C app needs both, matching users’ preferences and needs with a system of gameplay that provides immediate incentives and feedback. The scenarios and gameplay are domain data; the users’ preferences are user data.
- An English-learning app that only chats aimlessly, the way ChatGPT does, will not work. The app needs carefully designed courses and content, recommendations matched to each user’s preferences and skill level, plus assessment and reward mechanisms.
- A companion-chat app that only offers free-form chat with virtual characters quickly leaves users unsure of what to talk about. It needs gamification, with scenarios and incentives, and behind those scenarios and gameplay systems is data. If some users need psychological counseling, professional language and guidance techniques from that field are needed too (see the sketch after this list).
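To make “gamification backed by domain data” concrete, here is a minimal sketch of how an English-learning app might pick the next lesson from a curated course library (domain data) based on the user’s level and interests (user data) and grant an immediate reward. All structures here are hypothetical, a sketch of the pattern rather than any real app.

```python
from dataclasses import dataclass

@dataclass
class Lesson:
    title: str
    level: int   # curated difficulty rating: domain data
    topic: str

COURSE_LIBRARY = [  # carefully designed content, not open-ended chat
    Lesson("Ordering coffee", 1, "travel"),
    Lesson("Small talk at work", 2, "business"),
    Lesson("Negotiating a contract", 4, "business"),
]

def next_lesson(user_level: int, interests: set[str]) -> Lesson | None:
    """Match the user's level and preferences: a topic they care about,
    at or just above their current level, so there is challenge plus relevance."""
    candidates = [l for l in COURSE_LIBRARY
                  if l.topic in interests and user_level <= l.level <= user_level + 1]
    return min(candidates, key=lambda l: l.level, default=None)

def reward(streak_days: int) -> str:
    """Immediate feedback loop: visible progress after every session."""
    return f"+10 XP, {streak_days}-day streak!"

lesson = next_lesson(user_level=1, interests={"business"})
print(lesson, reward(streak_days=3))
```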
This shows that “interesting” and “useful” are not mutually exclusive. Learning English and psychological counseling do not have to be boring; useful apps can also be very interesting.
Beyond AI application startups, data also matters greatly for foundation-model companies. Higher-quality data yields models with higher knowledge density and stronger cost competitiveness. Although most high-quality text data has already been exhausted in foundation-model pre-training, the trend of rising knowledge density continues: across 2023 and 2024, model knowledge density increased by 100 times. Why? Because everyone keeps using better large models to distill higher-quality data and then uses it to train smaller models with higher knowledge density. For example, OpenAI’s latest pre-trained model has not been released because it is too large and expensive while its capability gains are modest; its role is to generate training data that improves low-cost commercial models.
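Here is a minimal sketch of that distillation loop: a large “teacher” model generates high-quality question/answer pairs, which then become supervised fine-tuning data for a smaller “student” model. The model name, prompt, and file format are placeholders; the OpenAI Python client is used only as a generic chat API, assuming an `OPENAI_API_KEY` in the environment.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

TEACHER = "gpt-4o"  # placeholder for a large, expensive teacher model
SEED_TOPICS = ["photosynthesis", "TCP congestion control", "double-entry bookkeeping"]

def distill(topic: str) -> dict:
    """Ask the teacher model for one dense, high-quality Q&A pair on a topic."""
    resp = client.chat.completions.create(
        model=TEACHER,
        messages=[{
            "role": "user",
            "content": f"Write one concise, factually careful Q&A pair about {topic}. "
                       "Return JSON with keys 'question' and 'answer'.",
        }],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

# Each pair becomes one fine-tuning example for a smaller, cheaper student model.
with open("distilled_sft.jsonl", "w") as f:
    for topic in SEED_TOPICS:
        pair = distill(topic)
        f.write(json.dumps({"messages": [
            {"role": "user", "content": pair["question"]},
            {"role": "assistant", "content": pair["answer"]},
        ]}) + "\n")
```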
In the future, every internet company will be an AI company. Most companies will not create AI but will use AI.
From the internet era to the AI era, data has always been a very important moat. AI can also amplify the value of data.
In the future, the economies of scale of the internet will depend not only on the number of users but, more importantly, on what data users contribute and what domain data the platform controls.