This article was first published as a Zhihu answer to “Looking back at the development of the Internet, which underlying principles look simple but will keep working in the future?”

Data is the most important moat.

The moat of Internet companies is data

I really like Lao Wang’s product course. Wang Huiwen is a co-founder of Xiaonei and Meituan, and his Tsinghua product course is a classic worth revisiting again and again. In it he talks about economies of scale and how social networks have network effects. Behind network effects is actually data: which friends do I know, and how close am I to each of them?

In Lao Wang’s product course, he says WeChat is very hard to copy: Alibaba and ByteDance both tried to attack it and failed. But suppose one day there’s a “Prophet” app that knows all of a person’s real-world relationships and automatically generates friend links from them. As long as two people meet and chat a bit, it can recommend them to each other as friends, with no QR code to scan (whether to actually add each other is, of course, up to the users). Such a Prophet app could probably fight WeChat head-on. That’s the value of WeChat’s friend-relationship data.

But this Prophet app doesn’t have WeChat’s chat history, nor Moments history, so it’s still missing something. That’s the value of conversation‑history data. If this Prophet went even further and knew what everyone said and did every day, then probably even WeChat wouldn’t be its match.

The moats of e‑commerce, food delivery, and ride‑hailing platforms are essentially all data. Which merchants are on the platform and what goods and dishes they offer; which couriers are available and what the traffic conditions are; what users want to buy, want to eat, and where they want to go. The more people on the platform, the more data there is, and the more likely it is to match people with what they want. Imagine a food delivery platform in Beijing with only 10 restaurants and 10 riders. Users would struggle to find what they want to eat, and even if they did, it would be hard to find a nearby rider to deliver it. The value of such a platform would be very limited.

For content communities like Zhihu, Bilibili, Douyin, and Xiaohongshu, their moat is also data. If all the most interesting questions are asked on Zhihu, and the most professional answerers are answering there, then such a Q&A community has great value. Building a content community is not very costly, but operating a content community well is much harder than the initial development.

The moat of Internet finance companies is also data. Why do they dare to lend to users without collateral? Because they can use users’ interaction data on the platform to relatively well assess their credit status.

The moat of AI startups is data

After the 2023–2024 cooldown, AI founders have also been rethinking what really constitutes a moat for an AI startup. I believe that for most AI startups, the moat still has to come from data.

Many AI startups, my own included, haven’t gone very smoothly, mainly because we overestimated the value of technology alone. We assumed that as long as our algorithms were good, results strong, performance high, and cost low, we would win the market. But customers often feel “good enough” is enough. Everyone is using similar base models, and engineering optimization alone can’t pull you far ahead. Base models are progressing so fast that by the time you finally open up a bit of a technical lead, a new generation of base models arrives, much of your technical accumulation is wasted, and everyone is back on the same starting line again.

Reflecting on the underlying logic of the Internet, data is the key to building AI product moats.

Data falls into two broad categories: user data and vertical‑domain data.

User data:

  • AI companion apps. If an app knows a user’s preferences, personality, and personal experiences, it can obviously chat with them much better. This user data needs to be accumulated gradually, and once accumulated, it becomes a moat.
  • Productivity tools like ChatGPT. If they know what the user is currently working on over a period of time (for example, information about the current project), then you don’t need to repeat that background in the prompt every time you ask a question. In fact, ChatGPT’s bio feature already has something like this, but it doesn’t work very well yet.
  • AI coding tools like Cursor and Windsurf. If they know what tech stacks and frameworks/libraries the user is good at and likes to use, or some company-specific rules in these areas, then every time they write code they can default to those preferred stacks and frameworks.
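The user-data ideas above can be sketched as a tiny per-user preference store whose accumulated facts get injected into every prompt, so the user never has to restate background. This is an illustrative sketch under invented names (`UserMemory`, `remember`, `as_context`), not how ChatGPT’s bio feature or Cursor actually implement memory:

```python
class UserMemory:
    """Toy in-memory store of per-user facts; a real app would persist this."""

    def __init__(self) -> None:
        self.facts: dict[str, str] = {}

    def remember(self, key: str, value: str) -> None:
        # Each interaction can deposit a new fact (preferred stack,
        # current project, writing style, ...).
        self.facts[key] = value

    def as_context(self) -> str:
        # Rendered as a system-prompt preamble so the model sees the
        # accumulated background without the user repeating it.
        return "\n".join(f"- {k}: {v}" for k, v in self.facts.items())


memory = UserMemory()
memory.remember("preferred stack", "TypeScript + React")
memory.remember("current project", "internal billing dashboard")

prompt = (
    "Known user context:\n"
    f"{memory.as_context()}\n\n"
    "User: add a date picker to the settings page"
)
```

The moat here is exactly the contents of `facts`: a competitor starting from an empty store gives worse answers until it re-accumulates the same data.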

Vertical‑domain data:

  • Many people building RAG systems complain that the generated content is poor, but the main problem usually isn’t the generation step of the large model; it’s retrieval. General-purpose search engines can guarantee neither that all returned results are relevant (precision) nor that all relevant results are returned (recall), which makes them unusable in business scenarios that demand high accuracy. To achieve high precision, the data source must be structured, whether as a database or a knowledge graph. Yesterday at the Jiacheng 流水席 event I gave a talk, “AI Agents 从 demo 到落地 (From Demo to Production; Download PDF version)”, whose second part covered how vertical-domain RAG can leverage structured data to improve retrieval precision and recall.
  • Another challenge for AI adoption is that in many industries, data and workflows have never been digitized, or, even where they have, they’re scattered across chat logs and ad-hoc documents rather than organized into a structured knowledge base. Building such a knowledge base used to require expensive, labor-intensive data collection and cleaning, but AI can now cut that cost dramatically; this was also covered in the second part of yesterday’s talk. It’s foreseeable that a company that already has some accumulation in an industry and is the first to use AI to build up that industry’s knowledge base will gain a huge competitive advantage.
  • AI can amplify the value of data and deepen the data moat. Suppose a company has the most complete and accurate data in an industry, say information on every major and teacher at every school nationwide. Previously, that data could only power a reference site for users to search and browse, and even selling the dataset outright wouldn’t fetch much money. But with AI, a RAG application built on top of it could give users the most professional school and major selection advice available, the equivalent of everyone getting one-on-one guidance from Zhang Xuefeng (a well-known college-application consultant in China). A RAG app that relies only on general-purpose search engines simply cannot reach that coverage and accuracy, so the value of the industry data can finally be fully realized.
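The precision/recall point above can be made concrete with a toy dataset (hypothetical school/major records; all names invented). A field-level filter over structured rows retrieves exactly the relevant set, while naive keyword matching over flattened text lets irrelevant rows sneak in:

```python
# Hypothetical vertical-domain records; in a real system this would be
# a database or knowledge graph, not a Python list.
records = [
    {"school": "A University", "major": "Computer Science",     "city": "Beijing"},
    {"school": "B University", "major": "Computer Engineering", "city": "Shanghai"},
    {"school": "C College",    "major": "Computer Science",     "city": "Beijing"},
    {"school": "D University", "major": "Philosophy",           "city": "Beijing"},
]

def structured_retrieve(major: str, city: str) -> list[dict]:
    # Field-level filtering: only rows that match exactly.
    return [r for r in records if r["major"] == major and r["city"] == city]

def keyword_retrieve(terms: list[str]) -> list[dict]:
    # Naive full-text matching over the flattened row: the failure mode
    # of throwing unstructured text at a general-purpose search engine.
    return [
        r for r in records
        if any(t.lower() in " ".join(r.values()).lower() for t in terms)
    ]

def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    tp = len(retrieved & relevant)
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall

relevant = {"A University", "C College"}  # Computer Science programs in Beijing

kw_hits = {r["school"] for r in keyword_retrieve(["computer"])}
kw_p, kw_r = precision_recall(kw_hits, relevant)   # B University sneaks in

st_hits = {r["school"] for r in structured_retrieve("Computer Science", "Beijing")}
st_p, st_r = precision_recall(st_hits, relevant)   # exact match on both fields
```

In practice the structured query would itself be generated by the model from the user’s question, but the precision and recall gains come from the structured data, not from the model.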

The vertical‑domain data described above isn’t limited to B2B; it also applies to B2C.

Many B2C apps now have poor retention mainly because they lack gamification and contextualization. After the novelty wears off, users don’t know what to do next and can’t get instant incentives and feedback. A good B2C app needs gamification and contextualization, to match users’ preferences and needs, and to provide a system of mechanics that gives users instant incentives and feedback. These scenarios and mechanics are domain data, and user preferences are user data.

  • If an English‑learning app only chats casually like ChatGPT, learning outcomes will certainly be poor. The app needs carefully designed curricula and content, recommendations matched to each user’s preferences and proficiency level, and some assessment and reward mechanisms.
  • If a companion chat app only allows free‑form conversation with virtual characters, users will soon run out of things to say. Such an app needs gamification: scenarios and incentives. Behind those scenarios and mechanics lies data. And if some users actually come for psychological counseling, you also need professional counseling scripts and guidance techniques from that field.

You can see that “fun” and “useful” are not mutually exclusive here; English learning and psychological counseling don’t have to be boring. A useful app can also be very fun.

Beyond AI application startups, data is also extremely important for base‑model companies. With higher‑quality data, the model you train has higher knowledge density and stronger cost competitiveness. Although pretraining for base models has by now nearly exhausted the supply of high‑quality text data, the “knowledge density law” of models is still in effect: from 2023 to 2024, model knowledge density improved by 100×. Why? Because people keep using better large models to distill higher‑quality data, and then use that data to train smaller models with higher knowledge density. For example, OpenAI’s latest pretrained model hasn’t been released because it’s too big and too expensive, and its capability gains aren’t that obvious; its role is to generate training data that improves the capabilities of low‑cost commercial models.

In the future every Internet company will be an AI company. Most companies won’t build AI; they’ll use AI.

From the Internet era to the AI era, data has always been a very important moat. AI can further amplify the value of data.

In future Internet economies of scale, the key will be not just how many users a platform has but, more importantly, what data those users contribute and what domain data the platform controls.
