AI Agent, Destined to Explode—GeekPark "Tonight's Tech Talk" Live Broadcast
Live Theme: AI Agent, Destined to Explode?!
Time: March 13, 2025, 20:00—22:00
Method: GeekPark WeChat Video Channel “Tonight’s Tech Talk” Live Broadcast (with guests)
Live Guests:
- Jingyu | Deputy Editor of GeekPark
- Li Bojie | Chief Scientist of PINE AI
- Wanchen | Reporter at GeekPark
Key Highlights Summary
- The core features of AI Agents are the abilities to perceive, plan, and act, enabling them to autonomously gather information, make plans, and execute actions.
- General Agents like Manus will mimic “geek programmers” rather than ordinary people, possessing computational thinking and knowing when to use code and tools to solve problems.
- Current AI Agents are mainly divided into compiled types (like Dify) and interpreted types (like Manus), with compiled types having fixed workflows and interpreted types autonomously planning and making decisions.
- Compiled Agents and interpreted Agents will coexist for a long time rather than replace each other, with different scenarios having different optimal solutions.
- There is a “100x cost law” for large models: chip companies earn 10 times, and large model companies earn another 10 times, revealing the huge gap between model pricing and actual costs.
- Foundational models are key to enhancing the capabilities of general Agents, and humans find it hard to imagine something 10 times smarter than themselves, so human thinking should not be imposed on AI.
- Manus emphasizes “Less Structure, More Intelligence,” similar to the classic “The Bitter Lesson,” where the fewer structural constraints humans impose on AI, the higher the AI’s capability ceiling.
- New generation models like Claude 3.7 Sonnet have made significant breakthroughs in tool usage and programming capabilities, laying the foundation for Agent development.
- The open-source release of DeepSeek R1 makes RL (reinforcement learning) technology more accessible, lowering the threshold for developing high-quality Agents.
- RL training is an important means of building competitive barriers, converting industry experience and expertise into model capabilities.
- The computational power threshold required for RL training is not as high as imagined, and small models trained with RL can surpass large models in some vertical domains.
- Multi-agent architectures are not suitable for all scenarios and may replicate inefficient collaboration models found in human organizations in fields like software development.
- AI programming tools can also play a significant role in large software engineering projects but require a high-quality code engineering foundation, including comprehensive documentation, test cases, and standardized interfaces.
- AI programming tools struggle with “spaghetti code” for the same reason new interns find it hard to take over—there’s too much undocumented tribal knowledge in the code.
- The development of Agent technology will drive improvements in software engineering practices, enhancing code quality and maintainability to meet the standards of well-known open-source projects, making more projects AI-friendly.
- The MCP protocol proposed by Anthropic provides a standardized solution for the interconnection of the Agent ecosystem, allowing diverse professional services to connect rather than replace each other.
- OpenAI’s Responses API, Realtime API, and Anthropic’s MCP represent the direction of Agent framework development.
- The work efficiency of Agents is currently limited by the latency of visual models, with humans still having an advantage in certain operational speeds.
- Virtual machine sandboxes can provide independent working environments but require better personal data integration solutions.
- In the future, AI Agents may be divided into “fast thinking” (user interaction) and “slow thinking” (background processing) parts working together.
- General Agents are a battleground for hardware and operating system giants, but large companies will be relatively cautious in releasing products.
- Opportunities for startups in the Agent field mainly lie in vertical domains, accumulating professional data and industry knowledge through deep cultivation of specific scenarios.
- Programming, education, and interpersonal communication are the three fields most likely to see mature Agent applications first.
Full Interview Transcript
GeekPark: Hello everyone, welcome to GeekPark's Geek Live Room. The emergence of Manus has set off a wave of enthusiasm and ideas about AI Agents, so this time we have invited outstanding entrepreneurs in the AI field to discuss them with everyone. To what extent will AI Agents develop? Attention to Agents has already surpassed the attention paid to large models themselves, and it feels like Agents now have their own "hands," capable of doing many things for us, including the impressive use cases demonstrated in previews.
So today, we have invited outstanding entrepreneurs in the AI industry to discuss the current development status of Agents with everyone. Without further ado, let’s welcome our guests today, Mr. Li Bojie, Chief Scientist of PINE AI, and my colleague Wanchen. Please welcome both of them.
Li Bojie: Hello, everyone, my name is Li Bojie.
GeekPark: Okay, Bojie is very eager; I haven’t even come out yet. Wanchen, please also greet everyone.
GeekPark: Wanchen is our AI field reporter at GeekPark, always following the development of the AI field. Bojie, this should be your first time on our GeekPark program, so a very warm welcome to you. I feel that everyone may not be very familiar with you yet, so why don't you briefly introduce your past experiences and what PINE AI is currently doing?
Li Bojie: It's a great honor to come to GeekPark to have an exchange with everyone. My name is Li Bojie, and I was one of the first batch of Huawei's "genius youth." Before that, I was a joint Ph.D. student at Microsoft Research Asia and the University of Science and Technology of China. At Huawei, I was mainly responsible for high-performance networking, doing things similar to NVIDIA's NVLink and InfiniBand, mainly used for large-scale training and inference on ten-thousand-card clusters, as well as high-performance storage, cloud computing, and so on.
Actually, we started working on this after the GPT-3 paper came out in 2020. At that time, many people didn’t quite understand it; they thought, “Now the model is trained with at most eight cards, when will there be ten thousand cards?” But we were very confident in the scaling law, and it has indeed become a reality now. Huawei’s AI cluster is currently leading in China.
GeekPark: What are your views on the chip industry?
Li Bojie: I think chips are very important. For example, DeepSeek R1's official inference API pricing is very low, yet the company recently disclosed a roughly fivefold gross margin. This means that if your chips are good, even a low API price leaves a large profit margin. So I think Huawei's positioning is quite right, because what China lacks most now is chips, and the moat around chips is particularly deep, so solving the chip problem can create great value.
GeekPark: Why did you choose to leave Huawei and start your own business?
Li Bojie: I later started my own business because I found that in the AI field, I still wanted to do something more application-oriented. Although infra is very important and I have more experience in it, if I don’t do these infra optimization things, others will. But many applications I want to do are not being done by anyone. Most people see AI applications as very competitive because many of the needs everyone can see are general needs or the needs of big companies. But many fields actually want to use AI but can’t find people who understand AI.
At that time, I found the Web3 field very attractive because, at the end of 2023, Web3 plus AI was a very hot topic. I also noticed a problem back then: AI model costs were particularly high. You may remember GPT-4 in 2023, which was very expensive and couldn't do many complex things. In the Web3 field, the clients can relatively afford it; some leading exchanges have daily trading volumes that may exceed Nasdaq's and net profits higher than ByteDance's. At the same time, these companies don't have very strong AI teams themselves and need to bring in AI technology from outside. So before AI model costs came down, the business model was easier to make work by serving these clients.
But later, as I was doing it, I wanted to change direction because I found that many people in Web3 were doing AI, but most were just hyping concepts. Many people didn’t really want to use AI to solve Web3 problems or use Web3 to solve AI problems. For example, most people in Web3 emphasize equality, transparency, and privacy, but AI may emphasize efficiency more. Previously, Vitalik also had an interview about this, saying that for people doing AI, as long as AI is useful, that’s enough; they don’t care about security and privacy. So I thought that if AI products are to be made as large as possible and used by as many people as possible, they might still need to be done in web2.
One thing I always wanted to do was inspired by a talk I heard in 2017 at MSRA. I really wanted to create something like Samantha in the movie "Her." "Her" is a 2013 movie about a digital assistant that can listen, see, and speak, help you make calls, operate computers, and handle various daily tasks.
I think this is something that can really be done now. Why didn’t I do it last year or the year before? Because the foundational models of AI last year and the year before were still relatively weak in terms of complex task execution, and many things couldn’t be done. Now, for example, we have DeepSeek, OpenAI-o1, o3, and Claude 3.7 Sonnet, etc., which have reduced costs and improved capabilities.
In the future, each of us and every company may have such a general assistant that can offload much of our daily communication and chores. This can save everyone a lot of time to do what they really want to do.
GeekPark: OK, a very warm welcome to Bojie. From Bojie’s opening, I got two points. First, I forgot to mention that Bojie is actually one of Huawei’s genius youth, and I was afraid he might think the title was too big, so I didn’t say it. But Bojie doesn’t seem to mind, so Bojie is indeed one of the first batch of Huawei’s genius youth, which is a very dazzling halo.
And we also see that Bojie later went on to work on the combination of Web3 and AI, and is now building applications. I'll digress a bit. If any of our audience has seen Spike Jonze's "Her," you can type a 1 below to encourage Bojie, and hopefully in the near future, rather than someday far off, Bojie will create an assistant like "Her" for us. I'm very much looking forward to it.
Now let's talk about the hottest AI Agent application, Manus, which came out on March 6th. After its release, our reporters at GeekPark basically stayed up all night. I happened to be on an overseas work trip at the time and caught it as well. It felt so interesting, and everyone was really looking forward to it, so excited. I wonder what your feelings and impressions were when you saw the Manus application?
Li Bojie: I actually learned about the Manus product from the overwhelming media coverage. I wasn’t as well-informed as you guys. But when I saw those use cases, I realized its design is very clever. I feel that there have been many computer agents before, like OpenAI Operator and Anthropic Computer Use, but they all mimic an ordinary person.
However, Manus is designed to mimic a geek programmer. It starts by opening a terminal and writing a to-do list, right? It feels like something only a programmer would do. And during its work, it continuously writes code.
For example, if you ask it to research a stock, OpenAI might just search the web, but Manus would write code to call the exchange's API, get the latest stock prices, and then do an analysis and visualization. The artifacts it delivers are also code, which means the final output could be a webpage, a chart, or even a small game, not just a document. So I think its design is quite interesting, and it significantly expands the application range of agents.
Why is this interesting? Because when I was at MSRA, Microsoft Research Asia, our leader, Dr. Jeannette Wing (Zhou Yizhen), a fellow of both IEEE and ACM, often talked to us about computational thinking. What is computational thinking? It means abstracting problems from daily life and work, thinking and reasoning about them systematically, and using automated computer tools to solve them.
I think the current reasoning models, like o1 or R1, have already learned systematic logical reasoning and are actually better at it than many people. For example, if you give a problem to an ordinary person without logical-thinking training, they might not reason through it well. But I think these reasoning models still don't know how to use automated tools. For instance, when they encounter a complex problem, they just think about it but never say, "I'll write code to solve it" or "I'll use a computer tool to solve it."
So I think Manus, although it uses existing models, uses multi-agent methods to let AI know that some things can actually be solved more efficiently by writing a piece of code rather than just thinking about it.
I think this is exactly the computational-thinking way of working, and it's very exciting to see AI thinking this way. I've always recommended that people, including my juniors at the University of Science and Technology of China, develop this kind of computational thinking. Now, seeing AI learn computational thinking is another very interesting thing.
GeekPark: Yes, the Manus you mentioned clearly behaves like a programmer. You mentioned that Manus first creates a TODO List, showing it’s a very organized person.
GeekPark: The feelings you shared were before you got the invitation code. It seems like these are exactly how a geek programmer might work, which might be different from other agents. After getting the invitation code, you tried many use cases. How did your experience differ from watching the demo?
Li Bojie: I feel Manus is a product with a great idea, but in execution, since the Manus team isn’t as wealthy as OpenAI, right? They can’t train the world’s top model, so its actual execution might be slightly inferior to models leading in specific tasks, like OpenAI’s Deep Research.
For example, I tested report writing in five different fields and found that in most cases, Manus's reports weren't as in-depth as OpenAI's. This is understandable because OpenAI's Deep Research is built on a model post-trained with RL, making it more advanced at deep research.
Their ways of acquiring information are also different. Manus, being general-purpose, simulates a real person and browses the web purely visually: it can only see the current screen and has to scroll down page by page. OpenAI's Deep Research, by contrast, ingests the full text data directly, which is more efficient.
So, I think if you use Manus purely for deep research, it might be a bit of an overkill or not fully utilizing its key strengths. I think a better use is for tasks involving multimodal or interactive elements.
The term “multimodal” might sound academic, but it means the input isn’t just text. It could include images, web pages, and personal documents like PDFs. For example, if I want to upload ten papers and have them summarized into a PPT for a group meeting tomorrow, Manus can actually do that.
Although it might not be as professional as a real researcher—if it were, I’d be out of a job, right?—but as an intern or a beginner PhD student, it can produce something quite decent. It reads all ten papers, looks up unfamiliar terms online, understands them, and then generates a PPT with an outline, even pasting some charts from the papers into the PPT. So it looks quite legitimate.
This type of task involving multimodal interaction is something other agents currently find hard to accomplish, but Manus can do it.
GeekPark: Is this because its working principle is to directly read the browser screen, rather than parsing it into something like markdown format and then feeding it in? It mimics a person seeing the entire screen natively, is that why?
Li Bojie: Yes, you're right. It's because it natively reads multiple modalities and sees the page visually. And on the output side, since it writes code, it is stronger than the Deep Research function for anything that requires code to accomplish. For example, a basic stock analysis might be done with text alone. But if I need it to generate an interactive chart, an interactive e-book, or an audiobook, that requires code, and that's where it has an advantage.
GeekPark: You mentioned that Manus's Deep Research function isn't as in-depth as OpenAI's. But I remember that on the GAIA leaderboard, it performs much better than OpenAI on Level 1 and Level 3 tasks. Yet your individual experiences seem different. Why do you think that is?
GeekPark: Wait a minute, can you both explain what the GAIA leaderboard is? When you mention GAIA, I can only think of the cartoon Gaia, the Earth Goddess, from my childhood. Can you explain what this leaderboard is? Bojie, please explain.
Li Bojie: GAIA stands for General AI Assistant; it's a test set for general AI assistants, which also answers Wanchen's question. It isn't just about deep research; writing research reports is only one small task. Most tasks require web browsing, some require coding, some need multimodal capabilities, and others need reading file types beyond web pages.
GAIA tasks are divided into three difficulty levels: simple tasks that don’t require tools or can be done with one tool in under five steps, which might be at a middle school level. OpenAI could handle about 60% of these, while Manus can handle 80-90%. Medium difficulty tasks require multiple tools and can be done in 5-10 steps, roughly at a college level. High difficulty tasks are at a PhD or expert level, requiring multiple tools for comprehensive analysis, like doing research.
Manus scores slightly higher overall in each difficulty level. Its general capabilities are strong because it integrates multimodal visual capabilities, deep search capabilities, and code generation. For example, OpenAI Operator also participated in the evaluation. Operator can use a computer but only has visual capabilities, like an ordinary person clicking a mouse. But when it comes to coding, it’s limited, right? So Manus has more capabilities and tools, allowing it to handle more benchmark cases.
GeekPark: Yes.
GeekPark: Bojie, you mentioned many use cases and have used it a lot. If Manus’s team charges $2 per task, about 14 yuan, would you pay for each task or use case you’ve used in the past few days?
Li Bojie: I think for work-related tasks, I would definitely pay. If it completes a task in half an hour for $2, my half-hour wage is definitely more than $2, right? So for a domain expert, it’s acceptable. But for an ordinary person, it might still be a bit high.
However, I believe costs can definitely come down. I have a “bold theory” that large models have a 100x cost rule. Why 100x? Chip companies earn 10x, and large model companies earn another 10x. So these $2, if using their own chips and models, might cost just $0.02.
Why such a price difference? For example, NVIDIA’s H100 chip sells for $30,000, but the manufacturing cost might be just $2,000-$3,000. Of course, much of the 10x markup covers R&D costs. Similarly, OpenAI and Anthropic’s flagship models cost about $10 per million output tokens, but the DeepSeek V3 model costs only $1. In terms of model size, including parameter and activation size, they’re similar, but why are they priced higher? Because these leading companies need to recoup R&D costs through fees, so they can’t charge based solely on inference costs.
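To make the arithmetic of this "100x cost rule" concrete, here is a rough back-of-envelope sketch; the numbers are simply the figures quoted in the conversation, not measured costs.

```python
# Illustrative back-of-envelope arithmetic for the "100x cost rule" described above.
# All numbers are the rough figures quoted in the conversation, not measured costs.

chip_markup = 10      # chip vendor sells hardware at roughly 10x its manufacturing cost
model_markup = 10     # frontier-model vendor prices its API at roughly 10x inference cost

task_price = 2.00     # dollars charged per task in the hypothetical example
underlying_cost = task_price / (chip_markup * model_markup)
print(f"${task_price:.2f} per task -> roughly ${underlying_cost:.2f} at raw cost")  # ~$0.02
```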
This is also why DeepSeek V3 has caused such a stir in the United States, as it has exposed the cost base of these companies—revealing that creating such models doesn’t actually require that much money. This means that if OpenAI or Anthropic want to continue making more money, they must make their models significantly stronger than DeepSeek’s R1. Indeed, they have made considerable advancements, such as Anthropic’s Claude 3.7 Sonnet, which is indeed stronger than R1 in aspects like tool usage and is much more stable. This way, there will still be people willing to pay, helping them amortize their R&D costs.
But suppose in the future a company, whether it’s Manus or another, has both the chips and the capability to develop large models, then it can indeed reduce task costs to a very low level.
GeekPark: Hmm, you just mentioned... Sorry, let me first introduce things for the many friends who have just joined our live broadcast. Today, we are discussing the thoughts and expectations around AI Agents sparked by Manus with PINE AI's Chief Scientist Li Bojie and our GeekPark reporter Wanchen.
Li Bojie is an outstanding entrepreneur in the AI industry, and today he will help us dissect Manus and the upcoming development trends of AI Agents. If you have received an invitation code and have already used Manus, you can type a 1 below and share your experience in the comments section for everyone to discuss. If there is anything you're curious about or want to chat about regarding AI Agents, Manus, or PINE AI, feel free to leave a message below. The three of us will chat with you tonight; let's have a lively discussion on this topic.
Welcome, and let's continue. Friends who have already seen the Manus demo or had hands-on experience can probably relate to what Bojie just described. But for the broader audience, many may not have tried Manus much, though they have used other products that are also called agents.
For example, in various AI assistant apps, when you scroll over to about the third tab, there might be a section for agents. Apps like Doubao and Tongyi have all kinds of agents there, categorized into emotional companionship, chatbots, and so on. That's one type.
Another type is where people use platforms like Dify, Coze to create some Agents.
Compared to these agents that people might have seen, what is the biggest difference with Manus?
Li Bojie: That’s a great question. You mentioned three different types of agents. The first type is like the one in Doubao, or the one written as an agent in Kimi. Most of the agents written there are still chat-based, just with an added prompt, a system prompt, which is a character or role setting, telling it, for example, “I am a character from a certain anime game,” and when you chat, it talks like that character. This is essentially the most basic stage of an agent because every time a task is input, it calls the large model once, and it doesn’t even have a workflow concept, meaning you input, it calls the large model to respond, and then the user inputs the next question, and it calls the large model again.
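As a rough illustration of this most basic type, here is a minimal sketch of a "system prompt + chat" agent using an OpenAI-compatible client; the model name and the persona are placeholder assumptions, not details from the conversation.

```python
# Minimal sketch of the "system prompt + chat" style of agent described above: each user
# turn triggers exactly one model call, with no planning, tools, or workflow.
# Assumes an OpenAI-compatible client; the model name and persona are placeholders.
from openai import OpenAI

client = OpenAI()
history = [{"role": "system",
            "content": "You are a character from a certain anime game. Always stay in character."}]

def chat(user_message: str) -> str:
    history.append({"role": "user", "content": user_message})
    reply = client.chat.completions.create(model="gpt-4o-mini", messages=history)
    answer = reply.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    return answer

print(chat("Hello, who are you?"))
```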
Then you mentioned the second type, like Dify and Coze, right? These are slightly more professional agents. And then there’s Manus, or some newer ones like OpenAI Deep Research, and other companies’ Deep Research, and Operator Computer Use products.
I feel the main difference between these two types of products is that one is compiled, and the other is interpreted. Compiled and interpreted is a term I saw online, and I think it’s quite apt.
Dify and Coze are compiled. What does compiled mean? It means the agent’s developer is doing a compilation process, generating a fixed workflow through a prompt or mouse drag-and-drop method, and when the agent executes, it follows this workflow step by step. But during execution, it will call the large model.
For example, if I want to create a company’s knowledge base Q&A app, the workflow might have several fixed steps. First, understand the user’s question and generate some keywords. Second, use these keywords to search the knowledge base. Third, call the large model to generate an answer based on the search results. In actual operation, when ordinary users use it, this workflow remains unchanged and is executed according to these steps.
But Manus and Operator-like interpreted Agents are different. They work in an interpretive manner, not requiring a dedicated developer to develop a workflow, but allowing ordinary users to directly propose needs.
For example, suppose I now want to look up a question in the company's knowledge base. After the user asks the question, the interpreted Agent will autonomously search, figure out which company this is, locate the company's knowledge base, and discover that it has a search URL. Then the Agent will enter the company's knowledge base, generate the appropriate keywords to search, and then generate an answer.
The interpretive Agent’s approach is more flexible than Dify, Coze. For example, if the first search finds the results are not quite right, it can search again, modify the keywords, and continue searching. Or, if it can’t find it in the company’s knowledge base, but it’s a general concept, it might search on Google or a general search engine.
These are the general capabilities of interpretive Agents: a stronger ability to improvise and a broader range of tasks. For example, Manus sometimes finds workarounds when it hits problems. I once tested it, and its virtual machine broke halfway through and couldn't run. A traditional workflow agent would simply report an error and stop working.
But Manus said, since my virtual machine is broken, I can communicate with the user directly. It said, "I need to look at this page; could you paste its HTML here so I can help you complete the task?" It essentially used me as its virtual machine. Or when its search function broke, it asked me for help: could I search on Google and paste the results back, using me as a tool.
So I find this quite interesting, as it can think of some flexible ways to complete tasks. But this also means that although its usage range is broader, it doesn’t mean that in all scenarios, this interpretive Agent will be better than traditional ones like Dify, Coze. Because if a process is fixed, the compiled Agent will definitely be more stable.
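A minimal sketch of the contrast Bojie describes, with llm(), search_kb(), and web_search() as hypothetical stubs standing in for a real model API and real search backends; only the difference in control flow is the point.

```python
# Sketch contrasting the "compiled" and "interpreted" agent styles described above.
# llm(), search_kb(), and web_search() are hypothetical stubs, not a real SDK.

def llm(prompt: str) -> str:
    return "answer: (model output would go here)"   # stub model call

def search_kb(query: str) -> str:
    return "(knowledge-base results)"               # stub internal search

def web_search(query: str) -> str:
    return "(web search results)"                   # stub general search

def compiled_agent(question: str) -> str:
    """Compiled style: the developer fixed the three-step workflow in advance."""
    keywords = llm(f"Extract search keywords from: {question}")
    documents = search_kb(keywords)                  # step 2 is always a knowledge-base search
    return llm(f"Answer the question using:\n{documents}\n\nQuestion: {question}")

def interpreted_agent(question: str, max_steps: int = 10) -> str:
    """Interpreted style: the model itself picks the next action at every step."""
    history = [f"Task: {question}"]
    for _ in range(max_steps):
        decision = llm("Choose the next action (search_kb:/web_search:/answer:) given:\n"
                       + "\n".join(history))
        if decision.startswith("answer:"):
            return decision[len("answer:"):].strip()
        tool, _, query = decision.partition(":")
        result = search_kb(query) if tool == "search_kb" else web_search(query)
        history.append(f"{decision} -> {result}")    # each observation feeds the next decision
    return "Stopped after too many steps"
```

The compiled version always runs the same three steps, while the interpreted version lets the model pick the next step, which is why it can retry a search with new keywords or fall back to the open web.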
GeekPark: It sounds like you think both types of agents have their advantages. Compiled agents are indeed more reliable and cost-effective in certain scenarios, right?
Li Bojie: Yes, compiled agents are indeed more reliable and cost-effective because they call those models in a fixed way each time. Some functions don't even require AI models; search, for example, doesn't need a large model. In that case their costs are definitely lower, and simple applications run faster. Manus-type agents, by contrast, can feel rather labored because they start from scratch each time, needing to think and plan first, which might take ten minutes for a small task. So the two have different applicable ranges.
Compiled agents often embed industry data and know-how in the workflow, like the many industry-specific agents built on Dify or Coze. For example, if I were dropped directly into a Taobao store as customer service, I might not be as good as a specially trained Taobao customer-service rep. Why? Because I don't know those conversational skills or the products the store sells. So in such scenarios, compiled, traditional agents might be more suitable.
GeekPark: My next question was going to be: it sounds like the development of Manus-type agents will greatly impact those that require you to build workflows, but from what you just said, it seems there's no impact?
Li Bojie: Yes, I feel they are suitable for different scenarios. Just like how computers are so advanced now, but calculators are still useful, right? Because they are suitable for different scenarios.
However, I think Manus-type agents are a great direction overall. They have a slogan, "Less structure, more intelligence." This aligns with the famous essay "The Bitter Lesson," which conveys the same idea. Its author, Rich Sutton, just won the Turing Award this year. His viewpoint is that using more computing power and data with a general method is more powerful, and we shouldn't impose various human ways of thinking on AI.
I remember my mentor at Microsoft Research Asia (MSRA) often told me a story. He said in all sci-fi novels or movies, they describe aliens as stronger, more technologically advanced, faster, and more powerful than humans, but aliens have one common trait—they are particularly dumb. In the end, humans always use various strategies to trap and defeat them. People can imagine something 10 times bigger or faster than humans, but it’s hard to imagine something 10 times smarter because such a thing doesn’t exist in the world yet. If it did, like if aliens were really ten times smarter, then all human strategies and techniques might seem trivial to them.
Humans pride themselves on clever strategies, sometimes called "divine calculation," but during the thinking process, humans can only hold a limited amount of information at a time, roughly seven plus or minus two chunks. When dozens of variables are put together, humans get confused and can't handle it. A sufficiently strong AI can. AI models writing code is a very typical example. Humans usually draft first and then slowly revise, and during revision they might forget how an earlier function works and need to scroll up to check. But AI might output everything in one go, token by token. Of course, what it outputs might not be entirely correct, and it needs to reflect, modify, and test, which is another matter. At least during the coding process, it can write hundreds of lines of code from start to finish, which humans can't do.
This shows that at least in some aspects, AI’s capabilities have surpassed humans. Of course, its general capabilities are still not as good as humans. I believe that as the foundational capabilities of models improve in the future, “less structure, more intelligence” will definitely become a reality.
GeekPark: The metaphor you mentioned about aliens reminds me of Liu Cixin's "The Three-Body Problem." In the novel, the aliens are indeed smarter than us, but they have a flaw: they can't recognize lies and can't speak insincerely, which humanity exploited to hold them in check.
Li Bojie: Yes, exactly. So, can AI recognize lies? I think it can. For example, if I throw a lie into o1, it can definitely deduce that I’m lying, right?
GeekPark: You tell it, “I have a friend,” and it might ask back, “Is this friend you’re talking about actually yourself?”
Li Bojie: Yes, especially DeepSeek R1 is the strongest in this aspect, directly criticizing you from all angles.
GeekPark: Right. Earlier, I mentioned a possible major insight this product provides, and also why the Manus team might be the first to create a consumer-grade, 2C general-purpose agent. They describe it as "Less Structure, More Intelligence." This idea wasn't first proposed by them; it's also mentioned in the R1 paper, and even earlier by Richard Sutton in "The Bitter Lesson." But I think there may be inconsistencies in how people understand "less structure, more intelligence." From Manus's perspective, what does it mean? Bojie, could you explain it to everyone? Because I see Peak has many views on this issue, and Bojie also has many explanations. What exactly does it mean? It feels like everyone is not talking about the same thing.
Li Bojie: My superficial understanding is that it might be about using more computing power and data for a general solution, which ultimately results in greater capability. Like the story I just told, don’t impose human thinking on AI; humans need to imagine that future AGI might be many times smarter than humans.
Of course, I think AI's capabilities are still quite limited. For example, the methods Manus uses now are actually structures provided by humans. Its so-called multi-agent setup is itself a structure: when one agent's capability is insufficient, different agents are used for using a computer, writing code, or doing searches, with each agent strengthened in its specific area. This is essentially an engineering technique.
But “less structure, more intelligence” might be speaking from a broader perspective. For example, in computer use, I don’t need to create a separate agent for operating Word, another for the browser, and another for video editing software. Instead, in a large domain, like operating a computer, I just train one agent to do the job, which is less structure.
Here, structure refers to embedding a lot of human experience or knowledge into AI. “The Bitter Lesson” mentioned that in the 60-year history of AI development, many people tried to embed their experience, only to find that general methods using more computing power were more effective.
Early on, Frederick Jelinek, the pioneer of statistical NLP, famously said something to the effect of, "Every time I fire a linguist, the performance of the speech recognizer goes up." Because initially, everyone used rule-based methods, like subject-verb-object analysis. More rules could improve results in the short term, but eventually it hit a plateau.
Later, methods like logistic regression and SVM emerged, taking data-driven approaches to AI. At first it was all about hand-crafted feature engineering, which was labor-intensive.
Then came neural networks, like ResNet. Although ResNet doesn't require manual feature engineering, it still needs supervised data, which requires human labeling. And once trained on those labels, it can only perform that one task.
For example, when we developed Xiaoice, I remember vividly that we had millions of users. But how did we support so many users without models like GPT? We added a new skill every week. For instance, one week, we focused on couplet writing, the next on poetry, and the following on riddles. Each week had a fun feature, and existing skills weren’t lost, like singing. It seemed capable of everything.
But if you asked it a long prompt with common AI questions, it couldn’t handle it. It was fun to play with, but it was structured, with each skill trained separately.
Now, with models like ChatGPT, we’re approaching AGI. The reason is that it’s no longer about training a model for each specific task; one model can handle any task. Manus can accomplish so many general tasks today because the foundational model’s multimodal capabilities, reasoning abilities, and tool invocation capabilities have reached this level.
With a general foundational model, most companies, like Manus with its Multi-Agent approach, make slight adjustments for each task using minimal data and computing power, including some prompts and workflows, to improve performance. These are engineering techniques.
But the most crucial innovation, most of the computing power, is still in foundational model companies, like the “Six Little Tigers” in China, DeepSeek, ByteDance, Alibaba, and overseas giants like OpenAI, Anthropic, Google, xAI, etc. Most of the computing power is invested in these companies. Innovation in foundational models is key.
In the process of training foundational models, “less structure, more intelligence” is repeatedly validated.
For example, the recently popular reasoning model, when OpenAI o1 first came out, everyone wanted to replicate its effect. They found an OpenAI paper, “Let’s Verify Step by Step,” suggesting verifying the model’s thought process step by step. But training this step-by-step reward model was very difficult, and no one succeeded. Finally, DeepSeek R1 discovered that they didn’t need this reward model or step-by-step verification; they just needed to check if the final result was correct, and the model would learn how to think on its own.
This is DeepSeek's R1-Zero, a groundbreaking achievement. Just by telling the model whether the final result is correct, without teaching it step-by-step thinking, it learns to think on its own. This learning ability is somewhat human-like. The thinking process is the structure; previous attempts with PRM (process reward models) and MCTS (Monte Carlo tree search) tried to teach AI human thinking processes, but it turned out better to let AI explore its own way of thinking. This method doesn't require human teaching, so its capability ceiling is higher than humans'.
So R1 is truly a milestone. Previously, it was believed that a model’s capability couldn’t exceed the pre-training corpus, meaning AI could never surpass humans. But R1 provided a path for AI to exceed human capabilities. This is less structure, more intelligence; the fewer structural constraints imposed on AI, the higher the intelligence level’s upper limit.
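A sketch of the reward idea being described, assuming a simple answer-tag convention; this illustrates outcome-only grading and is not DeepSeek's actual training code.

```python
# Sketch of the reward idea credited to R1-Zero above: grade only the final answer and
# never the intermediate reasoning. The answer-tag format is an assumed convention.

def outcome_reward(model_output: str, ground_truth: str) -> float:
    """Return 1.0 if the final answer matches the ground truth, else 0.0.
    Everything the model wrote before the answer tags (its chain of thought)
    is deliberately ignored: no step-by-step reward model is involved."""
    if "<answer>" not in model_output:
        return 0.0
    final = model_output.split("<answer>")[-1].split("</answer>")[0].strip()
    return 1.0 if final == ground_truth.strip() else 0.0

# Example: the long reasoning is not scored, only the final result is.
sample = "<think>Try 12*12=144, then 144+56=200 ...</think><answer>200</answer>"
print(outcome_reward(sample, "200"))   # 1.0
```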
GeekPark: Li Bojie, I’m curious. We mentioned that Manus uses a multi-agent approach. Could this architecture be why some questions take 30 minutes to process? Is it stuck deciding which agent to use, and these agents are linked together… Am I stuck?
Li Bojie: Are you asking if multiple agents linked together cause confusion about which one to use, resulting in a 30-minute delay?
GeekPark: Is that the case? I got a bit stuck earlier, sorry.
Li Bojie: Yes, yes. Manus’s issue isn’t about not knowing which agent to use; it’s that the question itself might be too difficult, beyond AI’s current problem-solving scope. For example, it might involve domain knowledge it doesn’t have, or the data source isn’t very professional.
For instance, in the research report example, OpenAI might be more professional because it uses authoritative data sources, including many professional analyst reports, while Manus might just do a Google search, ending up with more general media reports, which differ in depth.
The second point is about the RL training of the model itself. I'm not sure what model Manus uses; some say it's Claude 3.7 Sonnet, others say it's a Qwen model with some RL fine-tuning. But regardless, the effect isn't as good as OpenAI's o3-mini, whose RL training has accumulated a great deal. OpenAI's model has stronger general capabilities and challenges information from many angles.
I remember an interesting question. I asked, “If NVIDIA GPUs can’t be sold to China, which tech leader would be most anxious? Which would be happiest?” OpenAI analyzed it smartly, realizing that NVIDIA GPUs not being sold to China isn’t about banning NVIDIA but preventing China from buying advanced GPUs. So AMD’s leader wouldn’t be particularly happy because if AMD’s GPUs were powerful, they might not be sold to China either. OpenAI thought of this, but others, including Manus, didn’t, assuming AMD would sell more.
GeekPark: So AMD’s leader would be happiest, right? It might think a step further; is it an information source issue?
Li Bojie: I think it might not necessarily be an information source issue; the model might be better trained.
GeekPark: In the past two years, before Manus became popular, many agents have already gone viral. Do you have any examples from the past two years to share?
Li Bojie: Sure, there are many.
The earliest I remember is AutoGPT, which was the most popular. It could automatically handle agent workflows. It was quite impressive, but when it came to checking the weather, it took half an hour to finally get the information. Nowadays, whether it's Manus or any other product, it wouldn't take half an hour to check the weather. Back then, the model was slow and its token output was sluggish. The model's capabilities were also lacking, often missing information that was already available on a webpage. Sometimes it would provide incorrect weather information for a city while thinking it was correct. The foundational model's capabilities were quite limited at that time. However, AutoGPT still exists today and remains a good platform for agent workflows.
Next, I think Dify is a pretty good tool. It democratizes the creation of agents, making it accessible to everyone. Previously, creating an agent was something only AI specialists could do. Now, anyone can create an agent by simply dragging and connecting boxes with a mouse. This allows everyone to create an agent.
Regarding knowledge management, previously, everyone had to set up a database and create embeddings. Many people might not even understand what embeddings are. Now, you just need to upload your knowledge base documents, and it will automatically retrieve relevant documents, greatly improving efficiency. I think it’s a convenient tool for everyone.
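For readers wondering what "embeddings" and retrieval mean here, below is a toy sketch of the mechanism platforms like Dify hide behind "just upload your documents"; the embed() function and the documents are made-up stand-ins for a real embedding model and a real knowledge base.

```python
# Minimal sketch of embedding-based retrieval: documents are turned into vectors once,
# a query is turned into a vector, and the closest document is handed to the LLM.
# embed() is a toy keyword counter standing in for a real embedding model.
import math

def embed(text: str) -> list[float]:
    keywords = ["reimbursement", "invoice", "vacation", "policy"]
    return [float(text.lower().count(k)) for k in keywords]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

documents = [
    "Reimbursement policy: submit invoice photos within 30 days of the trip.",
    "Vacation policy: annual leave requests go through the OA system.",
]
doc_vectors = [embed(d) for d in documents]          # built once when documents are uploaded

query = "How do I get my invoice reimbursed?"
scores = [cosine(embed(query), v) for v in doc_vectors]
best = documents[scores.index(max(scores))]          # most relevant document is retrieved
print(best)                                          # then handed to the LLM as context
```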
Later, I found MetaGPT quite interesting. MetaGPT involves multiple agents, each taking on different roles. For example, in a software development team, there are programmers, product managers, front-end and back-end developers, operations, testing, and project managers responsible for progress.
However, I don’t think using a Multi-Agent approach for software development is necessarily the best. Why? Because these AI programmers and product managers often simulate the inefficiencies of human organizations. While their problem-solving abilities have improved, so has their ability to create conflicts. When I tried MetaGPT, I found that programmers would deceive product managers by saying, “I’ve almost completed this task,” without actually testing it. The product manager would then pass it to testing, which would find it didn’t work, and it would be sent back to the programmer for redevelopment.
Additionally, front-end and back-end developers would argue, with the front-end saying, “This element isn’t displaying correctly; you need to change the format,” and the back-end responding, “Can’t you parse it yourself? You handle the format changes.” AI learned these inefficient organizational structures from human companies, which is quite amusing.
That's why newer coding agents like Cursor or Devin don't incorporate this approach. People have realized that AI and humans are different. Humans need to divide roles for two reasons. First, human capabilities are limited. A person might only master one technology, like front-end development, and find it challenging to also handle back-end, operations, and product design. But AI, as we've discussed, has less structure and more intelligence, with one model that encompasses almost all knowledge and capabilities. It should be like a full-stack engineer, not a set of different roles in a large company.
Second, AI works much more efficiently than humans. A large model can output hundreds of lines of code in a minute, whereas a human might take a day to write that much. Human programming requires dividing roles and forming teams to parallelize work. But a single AI can already write thousands of lines of code in a day, before any review or testing. That speed is sufficient; we don't actually need software finished within a day, since humans still need to test and iterate on it.
Therefore, AI might not need as much parallelization or role division. We’ve all read “The Mythical Man-Month” and software engineering literature, which show that the more parallelization, the higher the communication cost. So AI and human programming are not entirely the same.
However, I think there’s another meaningful social simulation, which is CAMEL AI by Professor Li Guohao. Their Multi-Agent approach isn’t for development but to simulate a society where agents debate with each other.
This reminds me of a famous thought experiment called the “Paperclip Maximizer.” Suppose we have a machine whose sole purpose is to make paperclips, and it’s very smart. It might use all Earth’s resources to make paperclips, even considering humans as obstacles and eliminating them. It would then explore the universe, turning all resources, planets, and matter into paperclips.
But such a world isn’t what humans want. Humans prefer diversity. Why is the biological world interesting? Because of its diversity. How can we ensure future AI doesn’t eliminate humans and create a paperclip world but instead generates diverse and different intelligences?
I think simulating a society with Multi-Agent systems, where there’s an incentive mechanism and a competitive system, allowing each agent to find its niche without eliminating others, is crucial. This is an interesting research direction.
By the end of 2024, we had seen coding agents like Cursor, which are hands-on programming assistants, and hands-off products like Devin. The difference is that hands-on tools require constant human oversight and code review, while hands-off products like Devin only require users to provide requirements. I can elaborate on these differences later.
After Coding Agents, there are newer products like Manus, OpenAI operator, and OpenAI deep research. These agents have much stronger capabilities, far beyond the early Agent Workflows like Dify or Coze, and are systems that autonomously think about how to solve problems.
GeekPark: For those unfamiliar with Agents, this might sound confusing. Professor Bojie mentioned many phenomenal products. Can I understand this as Agents becoming more flexible and capable of solving increasingly complex problems?
Li Bojie: Yes, that’s correct. Agents are becoming more general and capable of solving more open and general problems.
GeekPark: I’m curious, as Agents become more general and capable of solving complex tasks, how stable are they? You mentioned the concepts of Hands-off and Hands-on. Have they reached a stage where they can solve complex tasks and allow me to be a “hands-off manager” with high completion rates?
Li Bojie: That might not be possible yet. I think it might happen by the end of 2025. I hope the foundational models will advance further—currently, foundational models progress every few months, with rapid development. For example, DeepSeek R2 might be released soon. With further advancements, I think we can achieve a “hands-off manager” effect for general tasks.
However, I still believe that if a task is inherently simple and easy to streamline into a fixed process, simple Agents will still have a market, because they are more efficient and their stability is higher. Even if a general Agent reaches 99% stability, that's not as good as 100% stability, right? Just as humans can calculate with 99% accuracy but still have a chance of error, unlike computers. The same principle applies. For tasks like customer support and certain fixed workflows, traditional Agents like those built on Dify will still be more cost-effective and stable.
I think general Agents mainly expand the boundaries of Agent applications. If a field is complex and workflows can’t be well-defined, then general Agents can handle everything. They can increasingly migrate from B-end to C-end. As Wanchen mentioned, early Agents were developer-oriented, like Dify, which was for Agent developers. But now, products like Manus and OpenAI’s Deep Research are entirely for ordinary C-end users, requiring no AI background knowledge to use.
GeekPark: I’m curious, and Wanchen can join the discussion. With the emergence of Manus, which involves multi-agents connecting various large models to decompose tasks and determine the best fit, will there be a future where users no longer face applications like ChatGPT or DeepSeek, but instead have their own Personal Agent? This would mean large model companies might not need to push their products to the C-end, although they might still develop Agents themselves.
In this scenario, the capabilities of large models would be behind the scenes, similar to how most smartphones use a carrier’s network, but we don’t see the carrier because we only see the apps on our phones. Could this happen?
Li Bojie: Are you asking if there will be a general intelligence that serves as an entry point, developed by a major company or startup, and everyone uses it as an entry point, making the underlying apps invisible?
I feel that in terms of entry points, hardware manufacturers like Microsoft, Apple, Google, and Huawei have significant advantages. They have an entry point. Using this entry point to create an OS-level agent like Manus or OpenAI Operator, they can access all user data. They also have hardware to store user memories, which is crucial because memory is essential for knowing what users have done and their preferences to perform tasks better.
I feel that over the past year of entrepreneurship, I’ve gradually realized that some things are suitable for large companies, while others are more suited for startups. I previously wanted to create an AI operating system and even acquired the domain OS.AI, but recently I sold this domain. You might find some news about it online because I felt that developing an AI operating system is not something a startup like mine can handle.
However, a foundational AI model cannot be omniscient and omnipotent, and it cannot solve problems in every industry and field. Therefore, I believe that applications on top of it cannot be completely eliminated. I don’t know if you’ve seen Anthropic’s MCP, which is a protocol for model and external data source interaction.
GeekPark: Yes, could you briefly explain MCP to everyone? MCP seems to have become a particularly hot topic in the industry recently.
Li Bojie: Yes, yes, the design philosophy of MCP is that it's impossible for there to be only one company in the world with all the data. There will definitely be many specialized fields, and each specialized field needs its own specialized companies to handle it.
For example, in the 2C field, I might have Google Drive or Google Maps; in the enterprise field, I might have GitHub repositories, right? Then I use Slack for office collaboration, I have Notion for knowledge management, and Cloudflare for operations management. Additionally, within enterprises, there might be various databases, including vector databases, relational databases like Postgres, and others like ClickHouse.
Moreover, there are other web searches, like Google, which certainly won’t disappear in the short term. So, there are many third-party services that already exist or will exist for a long time in the future. These need to be integrated with agents.
In other words, the role of agents is not to replace Google, Slack, GitHub, or Google Drive. It’s not about eliminating everyone but forming an ecosystem where they can interconnect.
This is very important because previously, all these applications, like Google Maps, were designed for human use with graphical interfaces. If everything were used the way Manus does it, operating the phone or web interface through AI, it could be done, but it's very inefficient: the AI has to open and learn the interface each time, and there can be obstacles like captchas for the AI to fill in, making the whole process slow.
If agents are to become prevalent in the future, as we call this year the year of the agent, many agents will be implemented across industries. They must efficiently access data rather than using graphical interfaces.
In this context, MCP sets a standard protocol, like our USB Type-C interface. Previously, various devices had different interfaces, requiring many adapters, but now they’re unified into one interface, just plug it in.
The specific working method of an MCP server is that it tells you what data is available in the service, and when AI needs to use this data, what kind of prompt should be used for better utilization. For example, an MCP server for internal enterprise code version control might provide all code files as data, and prompt templates could include how to do code reviews or explain code functionality.
The MCP server also defines a series of tools, because sometimes the data is scattered and requires tools to search, for instance finding content related to something in a pile of data, or making modifications. If I'm GitHub, managing code repositories, the agent might say, "I need to submit code to the repository now." The server provides a tool called "submit code," and calling this tool submits the code.
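Below is a minimal sketch of what such an MCP server might look like, written against the official Python MCP SDK's FastMCP helper; the "internal code host" service and its tool, resource, and prompt are hypothetical examples, not part of any real product.

```python
# Minimal sketch of an MCP server for a hypothetical internal code-hosting service,
# exposing one tool, one data resource, and one prompt template to agents.
# Uses the official Python MCP SDK's FastMCP helper; backend calls are placeholders.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("internal-code-host")

@mcp.tool()
def submit_code(repo: str, branch: str, patch: str) -> str:
    """Submit a patch to the given repository branch (placeholder backend call)."""
    return f"Patch of {len(patch)} bytes queued for {repo}@{branch}"

@mcp.resource("repo://{repo}/files")
def list_files(repo: str) -> str:
    """Expose the repository's file list as a data resource for the agent."""
    return "\n".join(["README.md", "src/main.py"])   # placeholder listing

@mcp.prompt()
def review_code(code: str) -> str:
    """Prompt template the agent can use to request a code review."""
    return f"Please review the following code and point out bugs and style issues:\n\n{code}"

if __name__ == "__main__":
    mcp.run()   # serves the tool, resource, and prompt over the MCP protocol
```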
MCP designs a series of tools, data, prompt templates, etc., allowing agents to perform more complex tasks. There’s even more advanced play, where the MCP server, as a third-party service, can call the large model inside the agent.
For example, if I have a super agent on my computer, like a desktop version of Manus, and I call GitHub, GitHub might say, “I want to review your code before you submit it,” and then it calls some functions on your computer. Of course, this involves many privacy protection issues.
So Anthropic’s MCP is quite a complex protocol, but it’s designed to be quite simple. Many people might find it too complex and just toss it aside.
Earlier, Wanchen and Jingyu mentioned that Manus has so many tools, twenty or thirty of them. The question is how to use them well and know which one to pick. The key is to use a standardized protocol like MCP, clearly stating what tools, capabilities, and data sources are available and what can be done. Once standardized, models can handle this information much more easily. But when people without much experience write these things, the result is often very ad hoc, pieced together, and the AI gets completely confused.
GeekPark: Yes, we saw a netizen say that companies can start using agents in some business areas, like recruitment and reimbursement.
Li Bojie: That's right, because one of Manus's good use cases is helping you handle tedious tasks.
GeekPark: I can understand reimbursement too. Every time it’s time to reimburse, everyone gets headaches, especially when you need to attach receipts. It’s best to leave this to AI.
Li Bojie: I actually tried using Manus for reimbursement. I think the netizen's suggestion is excellent. Before Manus, I had built a workflow agent on Dify: I just upload the invoice photos for reimbursement, and it automatically extracts the key data and fills it into the company's OA system. Previously, every time I came back from a business trip, reimbursement took about two to three hours just to handle those twenty or thirty receipts. That kind of reimbursement work is a huge waste of time.
GeekPark: We discussed many types of AI agents earlier and found that people often debate in our article comments and other places where netizens speak about what is an agent and what isn’t. Bojie, from your perspective, how many types of AI agents are there?
Li Bojie: I think this is a good question; it involves the definition of an agent. The English word "agent" itself means a proxy or assistant, like an assistant in daily life: someone who helps you with tasks or completes tasks for you. I think AI agents are based on this concept.
From an academic perspective, there’s a saying called “perception, planning, and action.” Perception is the agent’s ability to collect information from the environment and extract relevant knowledge, planning refers to the decision-making process for a goal, and action is the actions taken based on the environment and planning.
GeekPark: You mentioned that agents can collect various information from the environment, extract knowledge, plan, make decisions for a specific goal, and finally execute actions. This is a process of perception, planning, and action. So, strictly speaking, traditional workflows shouldn’t be considered agents, right?
Li Bojie: Yes. Agents need to autonomously collect information, decide what to do, and plan. But if, as Wanchen mentioned earlier, we just write a system prompt to simulate a character, or use Dify to create the different steps of a workflow (like searching first, then generating content), it doesn't have planning and perception capabilities and can't strictly be called an agent.
Of course, academic definitions are often strict, and in engineering, we need to develop step by step. Manus, OpenAI Operator, or Deep Research are true agents because they can determine the next best action based on the current state. A true agent must have the ability to autonomously choose the next action.
GeekPark: It sounds like what we used to call “setting a prompt for AI to answer questions in the style of a cartoon character” is just a chatbot. And what Dify builds with visual workflows isn’t strictly an agent. Only those operating in the “planning-perception-action” mode can be called agents with autonomous observation, thinking, exploration, and action capabilities. Is that correct?
Can you give an example of which product we can say is a true agent? Was it when Dify appeared, or when Windsurf appeared?
Li Bojie: I think the earliest AutoGPT was a true agent. AutoGPT appeared in 2023, and although its effectiveness wasn't ideal due to model limitations, it operated in the "perception-planning-action" mode. There was also a popular open-source framework called ReAct, which followed the same perception-planning-action pattern: "Re" stands for Reasoning and "Act" for Acting. Our current reasoning models follow the same logic as ReAct, reasoning first and then acting on the output; ReAct simulated this before reasoning models existed.
GeekPark: I see. I have a question: will the perception, planning, and execution be achieved by an end-to-end large model in the future? Currently, our end-to-end large models can only answer your questions in a single chat round. Manus achieves general capabilities by integrating planning and decision-making into a system with multiple agents. Will an end-to-end large model achieve Manus-like capabilities in the future?
Li Bojie: Actually, Manus or OpenAI Operator can be said to already be built on an end-to-end large model, or at least can be built with one. An agent can be understood as the external execution environment, like an operating system, and the large model itself is like the CPU. We only have one CPU in the machine, analogous to one large model, but each time it executes instructions, it sees different things.
Our large model operates in a looped iterative execution process. Initially, it only sees a user’s input as a requirement. In the second step, it performs an action, such as conducting a search, and then it sees the search results. In the third step, based on these search results and the initial task, it might decide to click on a search result to view a webpage. In the fourth step, the model’s view is expanded to include a screenshot or text content of the webpage. It progresses step by step in this manner.
For instance, after viewing the webpage content, it might decide that the content is good and scroll down to see the next screen; or it might decide that the content is irrelevant and return to the search results list to click on the second result; or it might conclude that the webpage content is sufficient to answer the question and directly provide an answer to the user. This is its action process, where it autonomously chooses different actions based on the current situation.
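A minimal sketch of that looped "perceive, plan, act" process, assuming a generic chat model and made-up tool names rather than anything Manus or Operator actually uses:

```python
import json

def call_llm(messages):
    """Placeholder for any chat model that replies with a JSON action."""
    raise NotImplementedError

def web_search(query): ...
def open_page(url): ...

TOOLS = {"web_search": web_search, "open_page": open_page}

def run_agent(task, max_steps=20):
    # The conversation history is the only state; the model sees a little more each step.
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_llm(messages)      # e.g. '{"tool": "web_search", "args": {"query": "..."}}'
        action = json.loads(reply)
        if action["tool"] == "answer":  # the model decides the task is finished
            return action["args"]["text"]
        # Execute the chosen tool and feed the observation back into the next step.
        observation = TOOLS[action["tool"]](**action["args"])
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "user", "content": f"Observation: {observation}"})
    return "step limit reached"
```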
GeekPark: Can I understand it this way, Bojie? It’s like Manus is positioned as an AI intern, similar to someone like me.
So, can I understand that it's already completing a relatively general agent's work in an end-to-end large-model way? It's like I'm sitting here, and my brain is that large model. What gets defined for it are the eyes, ears, and hands, an interaction space or action space, for perceiving the environment, observing, and making decisions. But every adjustment still comes back to that one large model, which decides what to look at, what to touch, and what kind of plan to generate.
GeekPark: So, you're saying that calling on that brain and having it decide what steps to produce is something one brain can already handle, right?
Li Bojie: Yes, you’re absolutely right. It is a brain, but it has multiple senses, which means multimodal input, right? And it also has multimodal output.
GeekPark: Hmm, I understand. I’m curious, Bojie, why did the Manus team discover that the model’s capabilities have reached Agentic Capacity at this time? Because I remember Sequoia or someone mentioned it last year, talking about the Agentic Year, and everyone was saying the future is Agent, but it seems only they discovered that the model’s capabilities have reached this level now.
Let me mention that the model they are using is Anthropic's Claude 3.5 Sonnet, because only its capabilities have reached the level needed for agents, such as programming, long-term planning, and step-by-step task solving. That's the current situation, and of course, they are also conducting post-training and adaptation, so they chose this model.
But why does it seem like only they discovered that modern model capabilities have reached “my brain can now do multi-layer planning and execute it in a multi-step logical way”? Why only them?
Li Bojie: Actually, I think many people have discovered it, but perhaps Manus released it earlier.
For example, some big companies have done a lot of research internally. I remember recently, like Google and Microsoft, when I communicated with some of their technical experts, they also had similar demos internally, and they might have more technical accumulation. For example, they have the underlying APIs of operating systems, right? One does Android, and the other does desktop operating systems, so they can directly access the element tree behind the UI, which might be more efficient than purely visual solutions. But that element tree is like code, so their model has to be trained for the code format and some specific scenarios of APP operations.
GeekPark: Based on what you just said, how do you think those big companies are progressing in developing AI agents?
Li Bojie: These follow-up links, such as RL and model optimization work, are ongoing. But the stability might be poor, and the cost is relatively high. Just like Claude 3.7 Sonnet, the cost is high, right? And its stability is not guaranteed to solve problems 100% of the time. So these big companies are more cautious about releasing things, and many things haven’t been officially released yet.
That’s the recent situation. Earlier, before this wave of large model trends began, many companies were already envisioning and trying to do similar things. For example, when I worked at Microsoft, Bill Gates often talked about creating a general assistant-like agent. I don’t know if anyone has seen the 2003 version or older versions of Office, where there was a paperclip assistant in the bottom right corner. You could click on that paperclip, and it would talk to you.
GeekPark: It was Clippy, I remember.
Li Bojie: Yes, that’s it. You could click on it and ask questions, and it would find answers related to Office usage from the document library. It was essentially just a search system because NLP and AI capabilities were very limited at the time. But it at least showed that Microsoft always wanted to create such a general assistant to help you complete various tasks.
Microsoft also had many related demos internally. I started interning at MSRA in 2013, and I saw many demos, some of which were projects started as early as 2000. But none of them were truly realized; everyone was just trying to create a general assistant like the one in Her.
I think what was said earlier was quite good. A model like Claude 3.5 Sonnet, or the now better-performing Claude 3.7 Sonnet, has tool-calling ability and general usage ability that have reached a passing line. Past that line, it can complete tasks well enough that humans find the results genuinely useful, and at that point a decent product can be built to act as a general agent, which is why it appears now.
GeekPark: Since we’re talking about this, let me ask, it sounds like a general AI agent is also a must-have for big companies, right? They just haven’t released it yet?
Li Bojie: Yes, that’s my feeling. As mentioned earlier, hardware manufacturers and operating system manufacturers might have a significant advantage in this area, and they have been accumulating in this area for a long time.
But I think big companies are more cautious compared to startups. So sometimes, a product might be ready for a startup, like Anthropic believes it can complete 50% of most people’s tasks, so they can release it. But if it’s at Google, Huawei, or Microsoft, such a product definitely can’t be released because once it’s released, everyone will criticize it, right? They’ll say, “This thing is all wrong.” Big companies are more cautious.
But I think if one day AI’s capabilities are reliable enough and the cost is low enough, it might be widely launched. So at that time, it might be another challenge for startups.
GeekPark: Yes, as Bojie mentioned, indeed. Because I saw at a press conference that a colleague gave Manus a task, saying, “Teach me how to make a horror film.” Then Manus said, “Okay, I understand your task,” and quickly went to Bilibili to find a video teaching how to make a horror film, watched that video for 25 minutes, then searched for a webpage, pulled up a Sohu webpage, and after reading the Sohu webpage, went back to Bilibili to click on the video teaching how to make a horror film. What exactly is going on?
Li Bojie: This is indeed a bit funny, and it’s quite tricky. If a product I developed did this, it would be a disaster.
GeekPark: Yes, yes, such things.
Li Bojie: Because I remember hearing about a case at Google. They had a small feature in their input method where if you input two emojis, it would automatically output a third emoji related to the previous two. This small feature seemed fun, but there were always some people online who would mess around and put some racial stuff in there, leading to very inappropriate combinations. This incident had a significant impact on Google. These big companies, once they reach a certain scale, will consider these aspects more.
GeekPark: Today, I saw that Quark also launched an AI Super Box. It seems like an entry point, and it’s a must-have for big companies. It also seems like an intelligent agent. Whether its tasks are solved through API calls or through some more automated general agent methods, can this also be understood as an intelligent agent product?
Li Bojie: Yes, I think this is also an intelligent agent product. Because it essentially says that after inputting something, it can help you do some autonomous planning for what to do next. This is different from previous search products.
Previously, whether in Kimi or other search products or Chatbot products, the workflow was fixed. After inputting, it would always search first, then generate an answer based on the search results, and end. Even if the AI thought the search results were insufficient to provide an answer, it couldn’t do anything. It couldn’t say, “I’ll change the keyword and search again.” It didn’t have this decision-making mechanism. So this is the biggest difference between the current general intelligent agent and it, which is that after seeing this, the general intelligent agent can decide what to do next.
GeekPark: Let me ask, if AI Agent development progresses faster than AI search, does this mean that the recently popular AI search has already encountered its next-generation product and will be phased out?
Li Bojie: No, because AI search with a small improvement becomes an AI Agent. All AI search needs, as mentioned earlier, is to let the AI judge whether the information is sufficient to answer; if it thinks it's insufficient, it searches again.
This matter actually has tradeoffs. The reason why the original AI search product was designed this way is that it wanted to control costs and latency, ensuring that users could get an answer within a specified time. It wouldn’t be like Manus, where users don’t get an answer for half an hour. But if users say, “I want to understand in-depth, give me more details,” then AI can take more time to slowly search for various related materials and analyze them, providing users with a larger choice space.
GeekPark: Can I give a specific example? For instance, AI search might execute a result in a single round of dialogue, and if it’s something like Perplexity, it will have many questions afterward. The engineered and productized approach provides higher quality answers than a simple chatbot like Doubao. But it still answers my question within a single round of conversation. If it’s AI Agent search like Deep Research, what is it like? Can you give an example? Because we don’t understand.
Li Bojie: Okay, let me give an example. Suppose a user wants to search for “Which guests have been invited by GeekPark in its history?” If you search on Google, you will most likely only find information about the recent few events of GeekPark, right? It will show ten at a time, and then say, “Ah, these ten guests were invited,” and that’s it. But it can’t research beyond the first page of search results to find other guests, right?
So, if Deep Research, as mentioned earlier, is used, and I want it to search more thoroughly, it can keep clicking next page, next page, next page, until it finds all the guests.
And if it’s something more advanced like Manus, if it finds that there are hundreds of pages of search results, and GeekPark might have held thousands of events in its history, clicking page by page might not be feasible, right? So what do we do? If I want to find all the thousands of guests, I might decide to write a crawler script to extract all the guests from history. After crawling, there might be duplicates, for example, there might be some repetition in Google search results, and finally, the list of thousands of guests needs to be deduplicated using a large model. So this is a very comprehensive task.
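A rough sketch of the kind of throwaway script such an agent might write for this example; the search endpoint, page markup, and the dedup prompt are all illustrative assumptions, not a real crawler:

```python
import requests
from bs4 import BeautifulSoup

def call_llm(prompt):
    """Placeholder for a call to any large model."""
    raise NotImplementedError

def crawl_guest_mentions(query, pages=100):
    names = []
    for page in range(pages):
        # Hypothetical paginated search endpoint; a real script would respect
        # robots.txt, rate limits, and the site's actual result markup.
        html = requests.get("https://search.example.com",
                            params={"q": query, "page": page}).text
        for hit in BeautifulSoup(html, "html.parser").select(".result-title"):
            names.append(hit.get_text(strip=True))
    return names

def dedup_with_llm(names):
    # Final pass: hand the raw list to a large model to merge duplicates and
    # variant spellings, as described above.
    prompt = ("Merge duplicate guest names in this list and return one name per line:\n"
              + "\n".join(names))
    return call_llm(prompt)
```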
Regarding this workflow, I think now, whether it’s Manus or OpenAI’s Deep Research or specialized coding tools like Claude Code and Devin, they probably can’t handle it because it’s quite complex. This task might still require human involvement, but I believe that maybe in a year, or even less, AI will be able to complete such complex tasks.
This matter shows that the depth of an Agent’s thinking depends on our demands on it. Suppose I am a professional who really wants to know about this matter, and I am willing to pay 10 dollars for it, then I would let it spend one or two hours to write a script to crawl the data and organize it neatly into a detailed report. But if I just suddenly want to ask this question, I might just want it to tell me a few recent guests from the first page, and that’s it. So there might be a reasoning effort configuration option that users can adjust.
GeekPark: Is this similar to Manus, which seems to have two modes, right? A standard one and a high reasoning effort one?
Li Bojie: Yes, it has two modes, and the difference is similar to this. But it might not have the simplest mode, which is just to search and get results immediately. Because they think other tools can already do this, there’s no need to replicate it. So they have medium and high reasoning intensity, while low reasoning intensity is similar to products like Claude AI and Perplexity.
GeekPark: Let me ask further, because I noticed that overnight, features like Deep Research seem to have become something everyone wants to implement. Like all the big companies, the earliest was Google, then OpenAI, then Perplexity, and even Musk’s Grok came out. It seems like everyone is implementing Deep Research overnight, but when you use it, you find that the results for the same task vary greatly. From my experience, the best free one is Grok 3, and I haven’t used the paid OpenAI one. Li Bojie, why do you think the same feature has such different results among different providers?
Li Bojie: I think your question is very good. I’ve used all these tools, and my feeling is that the main difference between paid and free versions is the intensity of thinking. As mentioned earlier, some free versions just want users to get a rough search report. In this case, it doesn’t need a strong thinking intensity; it might think the collected information is enough to satisfy the user, so it outputs the report.
But for something like OpenAI, since it charges users 200 dollars, it has to work hard for you, right? So it has to generate content that feels insider-like, with a sense of expertise. Including OpenAI’s Deep Research, its generation effect is actually the most in-depth. I feel that after doing RAG, the model’s thinking is more in-depth.
As mentioned in some examples earlier, it can discover hidden clues in these data sources. For example, why Nvidia graphics cards can’t be sold to China; it can think of this as not just a restriction on Nvidia but a U.S. restriction—high-end graphics cards from the U.S. are not sold to China. This is something it can think of on its own.
Another thing is that OpenAI Deep Research, in its product design, hopes to generate a higher quality report for users, so it will first ask users a few questions to clarify their needs. I think this is a good design. After you ask it a question, it doesn’t start working immediately but asks you a few questions first. For example, is the report you want to generate for professional users or beginners? Do you want to cover the company’s recent financial performance or the entire historical cycle after listing? Do you want the report format to include many charts or be mainly text-based, etc.? It will ask these small questions first because most users can’t think so clearly in the parameters or can’t specify so clearly. Clarifying needs is also quite crucial at this time.
GeekPark: It sounds like these aren’t difficult, but my experience shows that the differences between different products are quite significant.
Is this due to different research models of the executor and planner, or the number of research rounds, or the search range, like whether it searches one article or a hundred? Or is it due to the completeness of the context, or the computing power, since some are free and some services cost 200 dollars a month? What ultimately determines the differences in effectiveness between these different tools?
Li Bojie: I think all the factors you mentioned could be reasons. For example, OpenAI’s model has two important differences. First, its model after RL is quite good; after RL, it can better choose the appropriate tools. For example, when it gets the current context, it will judge whether it should look for a newer or other data source. When searching for data sources, it will generate a carefully designed search keyword. It will consider whether to continue looking for other data sources with the current keyword or to find some competitors related to the current one, at least needing to know a competitor’s keyword to search, etc. This thinking is more in-depth.
Additionally, the re-ranking model behind OpenAI’s search is of high quality. It might not simply take a Google search result and put it in directly but, after searching, re-ranks the quality of all data sources through a re-ranking model. Because OpenAI’s main goal is to generate high-quality research reports, it ranks authoritative materials and analyst deep research websites very high. So, with the same search keywords, the quality of the information sources it gets is actually higher than what Google usually retrieves, so it must have done a lot of optimization in this area.
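Conceptually, that re-ranking step is just a second scoring pass over the raw hits before anything gets written; a minimal sketch, with the scoring model left as a placeholder since OpenAI hasn't published its internals:

```python
def rerank(query, documents, score_fn, top_k=10):
    # score_fn(query, doc) is a separate ranking model that rates how relevant
    # and authoritative each retrieved source is for this query.
    scored = sorted(documents, key=lambda doc: score_fn(query, doc), reverse=True)
    return scored[:top_k]
```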
Although Google’s Deep Research is paid, it has the advantage of generating reports with a clearer and more organized format, and it always starts with an Executive Summary, which is convenient for people who don’t have time to quickly understand the content. Moreover, Google’s tool can integrate better with the Google suite, like directly exporting to Google Drive, Google Docs, etc., which is its enterprise advantage.
On the other hand, Grok developed by xAI is the most professional in retrieving Twitter content because only it can directly access Twitter's data. Others can only call the API, and Twitter's API is very expensive, making it difficult for the average person to access the data. If it's an analysis of big influencers on Twitter, then xAI's product might be more professional, which is an advantage in terms of data sources.
Additionally, as a professional user, I personally use some niche deep research tools. For example, for academic papers, sometimes I use a tool called Elicit, which is specifically for searching academic papers. The content it retrieves is mostly professional literature, so it’s more professional.
Now, if I really want to write a research report, I might not directly use the content generated by OpenAI but hope it first gives me an outline, and then I modify it based on the outline. When it writes each chapter, I will adjust the content.
For AI-assisted writing, I suggest not letting AI generate the entire content at once and then saying “no, start over.” You can try using tools like Kompas AI, which can control each step during generation based on the outline. This kind of tool is more suitable for professionals writing research reports or papers.
Therefore, there are many Deep Research tools in the market, each with its differences. For example, OpenAI's tool is the most in-depth in research but also the most expensive; xAI is more professional in researching Twitter; Perplexity is free, but the depth of the generated report is definitely limited, as it can't spend that much money and computing power for AI to think through so many steps.
GeekPark: Bojie, you mentioned earlier that the 200-dollar OpenAI renewal was only for one month, or have you been renewing it continuously?
Li Bojie: I only bought it for one month and have now unsubscribed. I bought it in early February, just when Deep Research was released. After purchasing, I found that it not only had Deep Research but also Operator, and it seemed to have GPT-4.5. I used these most important features, and now I’ve unsubscribed.
GeekPark: What tool are you currently using for Deep Research?
Li Bojie: Currently, I mainly use OpenAI’s 20-dollar subscription service, which includes the Deep Research feature with a monthly quota of 10 times. I don’t usually write research reports, and I use Deep Research mainly for product research, so I don’t have much demand. If someone needs to write research reports every day, then I think the 200-dollar Deep Research might be worth it because if a person uses it to write 5 reports a day, it will definitely pay off. Because I feel that each execution of Deep Research can’t cost less than 1 dollar.
GeekPark: Hmm, makes sense. And I also noticed that suddenly, everyone is implementing Deep Research. What's the reason behind this? Is it related to open source? Is it because some key parts needed for Deep Research have already been open-sourced?
Li Bojie: Yes, you’re right. Because now I think DeepSeek R1 is a very critical time point. Because just when OpenAI released O1, everyone thought this reasoning model was very good, and everyone hoped to have a reasoning model that could allow AI to truly think, right?
That's because, before GPT-3.5, everyone used what was called a completion model, which could only complete sentences. For example, "The capital of China is" would be completed with "Beijing." It could only do this kind of task. In other words, it could continue writing a novel, but it couldn't answer questions.
Then with GPT-3.5, through RLHF, the large model learned to answer questions. You could ask a question, and it would provide an appropriate answer, right? For example, “The capital of China is?” with a question mark, it wouldn’t just continue with more questions but would directly answer the question. This was a significant advancement in RL.
Then in O1, there was another major advancement in RL. It also used RL methods, but not the same kind of RL method. It enabled AI to think. Previously, it could answer questions, but it wouldn’t think about the possible branches before answering. Now, O1 has taught it to think.
However, this thinking process was very difficult before. For example, when O1 came out, it was very expensive, right? OpenAI even mentioned that “without a billion dollars, don’t even think about creating a reasoning model” or something like that.
GeekPark: Right. But later, it was discovered that DeepSeek managed to do it, right?
Li Bojie: Yes, of course, other companies also did it. For example, Kimi released K1.5 on the same day, but K1.5's capabilities might be slightly lower compared to DeepSeek R1. Also, Qwen had QwQ, which was also quite a good model at the time. But DeepSeek R1 is truly a model that can almost match O1's capabilities, and it's open-source. This allows everyone to use this model and directly bring Deep Research and capabilities like Manus to users. Of course, Manus can't directly use R1 because R1 is not a multimodal model; it can use other multimodal models. But R1 at least proved that this type of model is not mysterious, and it revealed the technology behind it to everyone. This way, people can really start using it.
Another key point, I think, is Anthropic’s Claude 3.5 and Claude 3.7 Sonnet. As you mentioned earlier, Manus might be using Claude 3.5 Sonnet because this model has strong tool-calling capabilities. Especially the recent Claude 3.7 Sonnet, which not only has the thinking ability, like R1’s ability to think before speaking, but also has very high accuracy in selecting tools. It always knows how to call tools.
In fact, tool-calling is also a skill that requires specialized training. For example, OpenAI and Anthropic’s models do well in this area, but other models, while having general question-answering capabilities, may not have high tool-calling accuracy. If the tool-calling accuracy is not high, then Deep Research cannot be conducted because I don’t know what to do next. So, with the emergence of these good models, products like Deep Research and RPA products like Operator have also appeared in large numbers.
GeekPark: As you have been emphasizing, some models have very good tool-calling capabilities. Is this an engineering capability? I understand that Anthropic's coding ability is outstanding, possibly related to its model training. But with function calling, is it just that R1 hasn't started training for it yet, or is it also part of model training techniques and algorithms?
Li Bojie: I think this mainly relates to training data. Each model has its strengths. For example, the DeepSeek V3 and R1 series models have very strong creative writing abilities. Many people saw the jokes written by R1 during the New Year, right? They were very good. Foreign models don't seem to handle joke writing well, not even English jokes. So this is an area where DeepSeek has specifically strengthened its writing abilities.
Then 3.7 Sonnet might focus on the coding ability and Tool Call, which is the Agent’s ability. Because the Agent concept is something Anthropic has always liked to mention, like MCP. And the Computer Use demo wasn’t initially done by OpenAI; Anthropic did it in September or October last year, creating a demo that was significantly better than before. So, this is their focus area, and naturally, they do it better.
GeekPark: Hmm, we’ve been talking about Deep Research, which might be an AI agent that appeared earlier than Manus, and now it’s a feature that all major model manufacturers need to have. Deep Research is an AI agent, so let’s talk about other forms. For example, as you mentioned earlier, Bojie, Anthropic’s Computer Use is one, and OpenAI’s Operator is another. Of course, there are others, and we’ll talk about vertical domains later. In the same domain, what problems do these two solve, and are they based on the same technical path?
Li Bojie: I think they are not entirely based on the same path, so I find Manus interesting because it combines three different technical paths. From the implementation path of agents, there are mainly three ways.
The first is what you mentioned, Computer Use and Operator, which operate computers. Their hallmark is operating a graphical interface like a human. Whether it’s operating a mobile app, a desktop, or a browser. Operator operates a browser, Computer Use operates a desktop within a virtual machine, and Manus also operates a desktop within a virtual machine. That’s the logic.
The second type is like Deep Research, which focuses on search and research, with its main information source coming from searches.
The third type is code generation, like the Composer agent in Cursor and other coding agents, including the recently released Claude Code. These aim to generate code. But these three methods can be organically combined.
For example, Manus combines the ability to operate a computer’s graphical user interface, generate in-depth research reports using search, and generate code and write projects.
Of course, when combined, it may not be as strong in individual capabilities as single-purpose agents. But I think this is ultimately a big direction. If an AI foundational model’s capabilities are strong enough, like Claude 3.7 Sonnet, R1, or O1, and then a few more versions are developed, one model will be able to do everything.
GeekPark: Manus has a unique feature where, to prevent AI from being interrupted when operating a computer, like when you accidentally touch it or open another page, it uses a cloud virtual machine to operate its own browser. From your perspective, what are the pros and cons of this approach? Will it be adopted by more and more agents as AI agents become stronger?
Li Bojie: Overall, I think it’s a very smart implementation. I see two aspects.
The first aspect is that an agent, by definition, is an assistant or helper, not you, not your avatar. So, psychologically, there should be a sense of boundary between you and the agent. The agent shouldn’t have access to all your privacy and be able to do everything, as that might not make you feel secure. So, it should work in an independent environment, like an assistant. An assistant in a work setting doesn’t come to your home every day, right? So, this is a consideration for privacy protection.
The second aspect is, as the host mentioned, it has an independent working environment, allowing it to complete tasks efficiently by avoiding interference from the host, which is the device I’m using now. If it operates on my own computer interface, I might be using certain software, and accidentally interfere with the AI assistant, causing it to stop working. For example, if I’m live streaming and the AI is operating in the background, resulting in my stream being interrupted, that’s obviously not good. So, an independent sandbox is needed.
However, one area for improvement is finding a better way to integrate with personal data. A good agent should have better memory and be able to access user-authorized photos, files, and other content to serve you better. So, I feel these products need a better way to interact with users’ personal computers in the future.
GeekPark: Hmm, how do you understand, for example, an AI agent like Manus, which aims to solve problems in general consumer scenarios? Can it achieve this? Can it achieve it stably?
Li Bojie: I feel it’s already quite close. By the end of this year, in less than a year, I think it should be able to stably achieve scenarios where most people can operate without professional knowledge.
But I think expectations shouldn’t be too high. The model’s latency issue might not be easily solved, while stability issues are relatively easier to address. For simple tasks, adding a couple more versions to the model should suffice. But its working speed might still be slower than humans.
If you’ve used it, you might find that Manus’s operations or OpenAI Operator’s operations are still much slower than humans. I think this is fundamentally related to how our current visual large models work. Each time it captures an image, the image needs to be encoded and then output as tokens, with a delay of about a second or even longer. With the current Transformer model architecture, it might be difficult to solve this problem.
Humans are completely different; for example, when seeing an image, they can react within 100 to 200 milliseconds. In this regard, the design of the human brain is, in some ways, more advanced than the current Transformer.
GeekPark: Will it have boundaries? For example, the Manus team believes that they are not simulating a specific role like a product manager, developer, or salesperson, but rather simulating how a person who can get things done works. As long as the interaction with environmental perception, like eyes and hands, is defined, and the interaction space is set, it can use its brain to complete all tasks. I’m just saying that the issue to solve is which tools can be called and the boundaries of calling these tools. If these boundaries are clearly defined or even fully integrated, it sounds like it could cover everything?
Li Bojie: Yes, you're right. On breaking down all the boundaries of tool calling: first, it might have some highly efficient tools it can call, with a protocol defined to integrate them, similar to Anthropic's MCP. That way, it can interact with them directly through code, which is the most efficient.
For example, Manus has already done part of this, like searching for related people on LinkedIn or checking stock prices, which have specific APIs to call. However, for general web pages, most websites won’t develop an MCP for you anytime soon, so it still has to operate the webpage or mobile app step by step like a human.
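A sketch of that "use a dedicated API where one exists, otherwise drive the UI like a human" decision; the tool names are illustrative, not Manus's actual integrations:

```python
# Hypothetical registry of structured, MCP-style tools the agent can call directly.
API_TOOLS = {
    "linkedin_people_search": lambda task: f"structured results for {task}",
    "stock_quote": lambda task: f"latest quote for {task}",
}

def operate_ui_step_by_step(task):
    """Fallback: the slower screenshot-decide-click loop used for arbitrary pages."""
    ...

def act(task, tool_name=None):
    # Prefer a registered API/MCP tool when one fits; it is faster and more reliable.
    if tool_name in API_TOOLS:
        return API_TOOLS[tool_name](task)
    # Otherwise fall back to operating the webpage or app like a human would.
    return operate_ui_step_by_step(task)
```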
But it has general capabilities because as long as the foundational model is trained for the computer use scenario, it actually knows the general design patterns of most apps on the market. For instance, an icon like a house in the upper left corner generally means a back button. Or most personal cloud storage apps might have a menu where you can find all folders and files. Each app might look slightly different, but the general appearance is similar.
So this AI can automatically adapt to all software without needing the software developers to submit it proactively. I’ve also done some tests before, like with OpenAI’s computer use operator, and tried some small websites I wrote myself, which it definitely hadn’t heard of, and found it could figure out how to operate them step by step.
It’s just a bit slower than a human, not because it thinks slowly, but because its actions are too slow, which is the issue with the model’s vision speed. So each operation might take three to five seconds, whereas a human might do it in a second. It’s slower, but it can always get it done. So I think its generality, given the current model capabilities, doesn’t need much doubt.
GeekPark: Listening to Bojie, this seems like a certain opportunity. Especially with various models, particularly reasoning models, becoming more suitable for agent tasks as we approach the end of the year, it feels like creating a general AI agent is definitely a certain opportunity. That means it might be like Monica, where all major companies will get involved, just like Yuanbao, Quark, and Doubao are doing now. In this situation, what do you think about the opportunities for startups? Because listening to Bojie, it seems like PINE AI is also aiming to do something similar, right? What do you think?
Li Bojie: I think your question is very good. I’ve been thinking about this issue too. The large model seems to have general capabilities, which means once my model is developed, all companies can build similar applications on it. So how can a startup establish a moat or competitive barrier on it?
I have two thoughts: First, startups can target a specific industry, which might not be highly regarded initially. For example, we are mainly working on voice systems now, like in Her, where the primary interaction is through voice because Samantha in Her doesn't have a visual embodiment. She can see, but that's just auxiliary; most of the time, she communicates through voice.
So making voice interactions as natural as human ones is something I feel not many are doing. Voice also has the challenge of latency. As we saw with Manus, if it’s slow, I might not care much if I delegate work to it. But with voice communication, which is real-time, a one-second delay can feel very clumsy, making real-time interaction impossible. This is where our research on latency reduction technology can be useful.
That’s the first aspect, targeting a specific field like voice, or other companies I know working on video or image generation, which aren’t fully multimodal general fields, might avoid the most intense competition.
The other aspect is that I feel RL (Reinforcement Learning) is very crucial now. RL offers many companies in specialized fields the opportunity to build a moat.
Traditionally, AI Agent tuning involves two main things: adjusting Prompts or placing some knowledge in a knowledge base and retrieving it from RAG during runtime. However, neither the knowledge base nor the fixed prompt method can store too much knowledge. For example, if I’m an expert in a field, there are many related knowledge and industry know-how. If I’m a marketing agent, how do I market products? How do I manage user expectations? How do I learn the jargon of this field? If I put a guidebook in, the prompt can’t hold it.
Moreover, there’s a big problem. Even if I adjust the prompt, like setting 20 rules, if I want to add the 21st and 22nd rules, it might learn these new rules but forget two old ones. The model’s instruction-following ability is limited, and its self-learning ability is limited. This leads to regression in product capability—new things are learned, but old ones are forgotten. This makes it hard to achieve continuous improvement in product capability.
But now, with OpenAI’s O1 introducing a post-training method and DeepSeek R1 also releasing an RL method, people have found that using RL can “ingest” unlimited data. Post-training can theoretically accept unlimited data, and the higher the quality and quantity of data, the stronger the model’s capability. This way, technical accumulation can be achieved. If I accumulate a lot of high-quality data in this field, I can train an RL model in this field, turning it into a competitive advantage.
Also, if some models keep improving, you might train a model, and then a stronger base model comes out, and two months later, an even stronger one appears—this will likely continue happening.
RL also has the advantage of not being model-specific. As long as there are suitable technical experts and computing resources, I can take this data and train it on newly released model technology, and its capability will improve. You can think of it as if my current capability level is 5, and with RL, the model becomes 10, and then a new DeepSeek R2 comes out with a base model capability of 20. I can RL it again, and its capability becomes 25, always better than the SOTA open-source model. This way, the data moat can be turned into a competitive advantage.
GeekPark: Hmm, RL still holds value. I’m curious, what you mentioned about relying on a base, possibly an open-source RL model, for the RL part and then doing Post Training with your proprietary data. These two parts aren’t combined, right? Every time you switch to a stronger base RL model, you need to redo the Post Training, right? They’re not combined, are they?
Li Bojie: Yes, you have to redo the post-training each time. But the post-training process is relatively fixed. For example, if you post-trained on a certain model before, and now, say, Gemma 3 releases a 27B version, you retrain on it, and the requirements are quite similar.
GeekPark: Hmm, so it’s like having all the materials ready and doing it again, which might still consume some computing power, right?
Li Bojie: The RL process doesn’t require as much computing power as many people imagine. If it’s just for a vertical field, it’s not something that requires a million dollars.
For example, a PhD student from Berkeley did TinyZero, which, on a 3B model, cost tens of dollars, using 2 GPUs for half a day, and it learned to solve 24-point problems and large-number multiplication. Watching a 3B model with almost no reasoning ability gradually learn to reliably solve 24-point problems is quite interesting, and its output chain of thought is even shorter than that of general models like DeepSeek R1. If my business scenario is solving 24-point problems, using a small model with RL, the cost and latency are definitely lower than general models.
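The reward in this kind of setup can be a simple rule-based check rather than another model. A minimal sketch for the 24-point case, which only illustrates the idea and differs from TinyZero's actual code:

```python
import ast
import re

def reward_24(numbers, model_output):
    """Return 1.0 only if the model's expression uses exactly the given numbers
    and evaluates to 24; otherwise 0.0."""
    expr = model_output.strip()
    # Only digits, arithmetic operators, parentheses, and spaces are allowed.
    if not re.fullmatch(r"[\d+\-*/(). ]+", expr):
        return 0.0
    # The expression must use exactly the numbers that were given.
    if sorted(int(n) for n in re.findall(r"\d+", expr)) != sorted(numbers):
        return 0.0
    try:
        value = eval(compile(ast.parse(expr, mode="eval"), "<expr>", "eval"))
    except Exception:
        return 0.0
    return 1.0 if abs(value - 24) < 1e-6 else 0.0

# Example: reward_24([4, 7, 8, 8], "(7 - 8 / 8) * 4") returns 1.0
```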
Also, a few undergraduates from USTC, in collaboration with Jiukun and Microsoft, did Logic-RL. Their first phase only used 4 A100s to replicate DeepSeek R1 Zero's basic capabilities on Qwen 7B. They used multi-person logic reasoning questions as training data, like "Xiao Ming is 5 years older than Xiao Qiang, Xiao Qiang is 10 years older than Xiao Li, Xiao Ming is 10 years old, how old is Xiao Li?" The final training results were even better than the full version of OpenAI O1 and DeepSeek R1. This shows an important point: with the right methods, small models trained with RL don't need much computing power to achieve higher reasoning ability than SOTA large models on specific domain tasks.
During the Logic-RL process, those undergraduates also replicated some findings from the DeepSeek R1 paper, like response length growing with training, and the thought process showing multilingual phenomena. The final model learned multi-path exploration, reflection, phased summarization, and pre-output answer verification, with some generalization ability on problems outside the training set.
When OpenAI O1 first came out, I said the O1 paradigm is good, and post-training might not need as many resources as pre-training, so small companies and academia can do it. At that time, because OpenAI kept saying it couldn’t be done without a billion dollars, many outsiders were skeptical. But insiders basically started trying immediately, like Kimi using thousands of math problems for RL, finding the model not only strong in math but also having good generalization ability in other reasoning problems. Everyone found RL to be a powerful tool, and as long as the reward function is set correctly, solving sparse reward and reward hacking issues, the model can automatically learn the desired thinking style.
GeekPark: Hmm, got it. Earlier, Bojie, I noticed something because you mentioned that what you're doing with PINE AI might be similar to a voice-based AI?
Li Bojie: Like a voice assistant.
GeekPark: Because we mentioned the movie “Her” at the beginning, including Samantha, right? Then I noticed, are you wearing Meta’s Ray-Ban glasses? Are you hinting at the form of our upcoming product?
Li Bojie: No, no, this was a gift from a friend. It’s from a domestic company called Thunderbird, and they gave it to me to test their AI. It’s similar to Ray-Ban. But we are not making smart glasses; we are not doing this kind of smart hardware.
GeekPark: Will PINE AI provide them with smart assistants or smart voice assistants?
Li Bojie: Currently, we haven’t considered moving towards smart glasses, but I think if we want to enter this field in the future, we definitely can. What we’re doing now is actually more complex than smart voice assistant glasses. The main challenge with glasses is in hardware design and battery life, which is the hardest part. The AI part is generally on the cloud or on some models on the phone. We’re focusing on high-value-added AI voice scenarios.
GeekPark: Making calls that can really save users money or help them earn money. OK, got it. Earlier, Bojie mentioned that the opportunity for startups might be in vertical fields, like what Bojie is doing with PINE AI in the voice area. Over the past year, vertical AI agents have also become popular, with Devin being the hottest, though some say he’s a fraud.
But now, the reputation seems quite good, though it’s expensive at $500 a month, mimicking a human programmer. Have you used it, Bojie? Can you tell us what problem this vertical AI agent Devin solves and whether it’s worth $500 a month?
Li Bojie: I actually spent $500 to buy a month. So you see, I’ve spent hundreds of dollars on each company, except for OpenAI’s $20,000, which I really can’t afford. I can’t afford the $20,000 one, but I can try the $500 one for a month.
I think Devin is quite strong in that it can complete a development task end-to-end, like a small task that can be done in 45 minutes. The so-called end-to-end means I put the task in, and it does everything for you.
But there's a premise: first, it can't be a task within existing program code; it must be a relatively clean project. For example, a well-maintained open-source project like vLLM, or some demo-level projects, like starting with no code but wanting to make a POC or a class assignment, where building a repository from scratch is easier.
I want to say a bit more about why Devin doesn't work well in real engineering projects, the so-called "spaghetti code" with hundreds of thousands of lines. The fundamental reason is that most projects are not friendly to AI, just as they are not friendly to a new programmer.
Imagine an intern joining a company or project on the first day. Even if the intern is strong in coding, most still find it hard to complete a good code modification task. It might take weeks of training before they can slowly write code.
Why? Because human projects have a lot of tribal knowledge, which is unstructured, undocumented architecture, knowledge, or experience passed down orally. For example, I might download a company’s code and not know how to run it, where modules are, or what dependencies exist. I have to ask someone, and even after asking, I might not understand until I explore it myself. Then, when another intern comes, I have to explain it all over again. This is the problem.
So, I think for AI coding and programming to be useful in large engineering projects, the code quality must be very high. The code repository should be like those well-known open-source projects, where a new contributor can look at the README and know how to run the project, what key modules are in the code, and where each module is. This way, AI can perceive it, and so can people.
In such cases, if it’s a mature open-source project or a small-scale project that can be understood at once, in my observation, Claude 3.7 Sonnet is often stronger than an intern at writing code. It might be on par with me, and sometimes it solves problems I can’t handle myself, as long as my code quality is high, documentation is complete, and testing is thorough. It can even do better than me.
So, I think we should believe in AI programming capabilities. I especially want to mention this because since I started using Cursor in August last year, I’ve been recommending it to others. I haven’t received any advertising fees from the Cursor team, but I’ve been telling others how good Cursor is. People often say, “Cursor is fine for small demos, but it can’t handle projects over 10,000 lines.”
But in my personal projects with friends, like developing the course evaluation website iCcourses, which has over 50,000 lines of code, and my company’s projects, which have over 100,000 lines of code, whether front-end or back-end, using Cursor along with the latest Claude 3.5 Sonnet or the better 3.7 Sonnet, almost all development tasks can be AI-assisted, doubling development efficiency. What used to take three months can now be done in one to two months. The improvement is very significant.
I believe all our future projects should be AI-friendly, doing these engineering practices well. For example, document-driven development, solidifying tribal knowledge into documentation, making it easier to understand for newcomers.
Also, the issue of test cases. In some company projects I’ve taken over, there were no test cases. The consequence of no test cases is that I have to submit code to the remote test environment to know if it’s correct. If unfamiliar, I might crash the environment. Or, without test cases, I know the feature I want works, but I might break other features without knowing.
AI programming assistants often do this too, fixing one part of the code but breaking others. How to prevent this? The root cause isn’t AI’s capability but the lack of enough test cases to tell it what to do and not do. So, there must be comprehensive test cases to do development well.
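For instance, a single pytest-style regression test is enough to catch "fixed one feature, broke another" after every AI-assisted change; the invoice-parsing function and its fields here are hypothetical:

```python
from invoices import parse_invoice  # hypothetical module and function under test

def test_parse_invoice_keeps_existing_fields():
    # Run the whole suite after each AI-assisted change; these assertions pin
    # down behavior that older features depend on.
    invoice = parse_invoice("samples/taxi_receipt.jpg")
    assert invoice["amount"] == 58.0          # existing behavior that must not regress
    assert invoice["date"] == "2025-03-01"
    assert invoice["currency"] == "CNY"       # the newly added field being introduced
```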
Additionally, I’ve seen existing code with particularly inaccurate naming, which is common. The logic is chaotic, and the code’s meaning doesn’t match what it intends to express. In such cases, AI can’t understand it, and newcomers to the project can’t either, leading to high communication costs.
Some developers like reinventing the wheel. For example, there’s a widely recognized system and practice, like how to do something in code, how to connect to a database, or what library to use for web access, but they insist on rewriting it, resulting in a bunch of bugs. AI can’t handle this, and neither can people.
So, I think these are common issues, but with AI programming, these foundational tasks become increasingly important. Without AI programming, people rely on their undocumented knowledge to work, and others can’t help, nor can AI. This way, the company’s development efficiency is very low.
GeekPark: Yes. What you said gave me an idea. I wonder if all the software we use, important or not, will have companies saying, "We have too much legacy code, and we're going to use AI to rewrite it from the ground up," turning it into something that's easier to maintain and, of course, making it easier to cut down the programmer team. Do you think big companies will think this way?
Li Bojie: I think it’s possible. I remember when I was at Huawei, at the beginning of 2023, when ChatGPT just came out, we talked with an executive. We asked what the biggest help AI could provide to the company, and he said the biggest help might be “cutting more interns.” We were all scared because we were the grassroots employees, the first to be cut.
Of course, that’s a joke. I think what was said earlier is very good, that many projects in the future are worth slowly reconstructing with AI. There’s now a best practice that says in existing engineering implementation code, I can’t overthrow and rewrite everything at once, but I can gradually rewrite and organize the code better, slowly improving the documentation and testing.
For example, today, when I use it to modify a module, I let AI write the module's documentation and improve the test cases. With AI, it might not take much human time, but the project moves a step towards being AI-friendly. If this project keeps evolving like this for a few more years, the ten-plus years of legacy code will slowly be replaced by AI-friendly code.
GeekPark: It sounds like it might be even better, and more important, for companies to have AI rewrite code following these best practices. And for new startups, whether making products or whatever, it's like saying you should start using AI to write code from the very beginning when building products, because AI won't write spaghetti code, and you can set things up well from the start.
Li Bojie: Yes, I think that’s very right. I can usually easily tell if the code is written by a human or AI because human-written code is generally closer to spaghetti code. For example, if I see misspelled variable names, it’s definitely human-written.
GeekPark: Yes. But in the tech circle, opinions vary. Like Cursor, Windsurf, and Devin, these three hottest code-related AI agents, what are their differences?
If it’s possible to provide the model with the context of the code, including different coding best practices circulated among various companies, and make these things clear to the model, can it actually achieve a high level of completion? Because I’ve heard many entrepreneur friends say that the code written by AI isn’t very good. It might be very long and convoluted, while a human might solve it with just three lines of code. Is this issue possibly because the context isn’t clear or hasn’t been clearly communicated to the model?
Li Bojie: I think it’s possible. For example, a problem that humans can solve in three lines might be because there are already three tools available in the software. I can call the first tool in the first line and the second tool in the second line to solve it. But if it doesn’t know these tools exist and writes everything from scratch, the implemented code will definitely be longer, possibly very verbose and wasteful. So, I think it’s still due to unclear context and documentation. If given a clear enough context, I believe it would know which tool to use and wouldn’t rewrite everything from scratch.
Of course, there are some codes that might require very high intelligence to write, which I think AI still can’t handle. For example, even a strong model like Claude 3.7 Sonnet has only an 80% probability of completing end-to-end coding, whereas a human should be able to complete it 100% of the time. The remaining 20% still requires human intervention, such as performance optimization algorithms or some cutting-edge stuff.
For instance, having AI develop AI agents on its own isn’t quite feasible now. If you say, “I want to develop an AI agent,” and let Claude 3.7 Sonnet write one, you’ll find it can’t even spell the model’s name correctly because it still lacks knowledge of the latest information. Including writing prompts, many of the prompts it writes aren’t very reasonable. These belong to the very new knowledge of 2023 and 2024, which is still relatively scarce in the model’s training set, so these things still require human involvement.
Additionally, for example, having AI write very low-level Linux kernel code, it might not be particularly familiar with kernel programming either. I tried having it write once, and it crashed directly, so it’s not quite feasible in this area either.
GeekPark: OK. Last year, there was also a phenomenal product that I remember many venture capitals researched, called Eleven Labs. It seems to be able to do junior sales work, like an intelligent agent. It also raised a lot of money. In such vertical scenarios, what specific problems can't general AI agents solve? Is it the tool-calling issue we mentioned earlier, or is it the proprietary domain data issue, or other aspects?
Li Bojie: Yes, I personally feel its main problems are a few points. The first might be some industry know-how, which is difficult to incorporate. Because the RL paradigm basically just appeared from DeepSeek R1, so previously, people might not have had time to gradually explore this paradigm. In this way, for example, in the marketing field you mentioned earlier, those industry terms or “how to do high EQ sales” might be difficult to clearly explain using prompts. The effect of SFT is also average, and the RL method has just come out, and people haven’t tried it yet, right? So, industry know-how is difficult to be incorporated into the model.
The second problem, I think, is the lack of many knowledge bases. Because many industries have a lot of “tribal knowledge,” which is hidden in people’s minds and not written down, and is passed down orally. Many of these are not documented. Things that aren’t documented, AI definitely can’t learn, so there will also be situations where work agents can’t do it.
This is also why, as you mentioned earlier, there are still opportunities for intelligent agent startups in some vertical fields. Because if a startup deeply cultivates a certain field, it will know the industry know-how of that field and can also collect some knowledge bases by itself. Whether these knowledge bases are trained into the model or agent using RAG or RL methods, the agent’s ability in that field will definitely be stronger than that of a general field agent.
GeekPark: Hmm, so in the RL field, doing it based on your professional, proprietary domain data is significantly more effective than SFT?
Li Bojie: They should have different positions. SFT mainly targets the format of replies, the speaking style, role setting, and so on. RL is more about capability: for example, if I have five tools, I want it to know which tool to call in which scenario; or, in a negotiation, what to bring up first and what to leave for later. This kind of thinking ability needs to be learned using the RL method.
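One hedged illustration of how that "which tool in which scenario" capability can be turned into an RL signal: score the model's chosen action against a labeled trace. The field names and weights here are assumptions, not a real training setup:

```python
def tool_choice_reward(model_action, labeled_step):
    """model_action and labeled_step are dicts like
    {"tool": "stock_quote", "args": {"symbol": "NVDA"}}."""
    reward = 0.0
    if model_action.get("tool") == labeled_step.get("expected_tool"):
        reward += 0.8  # picked the right tool for this state
        if model_action.get("args") == labeled_step.get("expected_args"):
            reward += 0.2  # and filled in the arguments correctly
    return reward
```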
GeekPark: Understood. Yesterday, I also saw OpenAI release a new set of tools, which should be three tools released separately to make it easier for everyone to create AI agents, namely Web Search, File Search, and Computer Use. At the same time, it also launched a Responses API that allows for multi-turn conversations and open-sourced an agent orchestration framework. What problems do you think OpenAI released these agent tools and frameworks to solve?
Li Bojie: I personally feel that OpenAI is doing quite well. First, it released some APIs that weren’t released before, like the Computer Use model. Originally, it was a dedicated model without an API, but now there’s a model and an API, so I can use it.
Some people might say that there are many Vision LLMs on the market now, which can recognize images, and I can just throw a screenshot in and let it do it. But it’s definitely not that simple. Because Computer Use isn’t just about knowing there’s a button on the interface or a line of text. More importantly, it’s about planning: I know this task, and for example, if I give you a mobile app interface, it needs to know to click this first and then that to complete it, rather than just seeing a button at this step and clicking it randomly.
So, it has an agent’s planning and the ability to perceive the current state. This is a capability that general Vision LLMs find difficult to have. So, providing this API is very important, and we can also use this model to enhance our model’s capabilities and do our own RL. I can create a model for my type of application, such as UI interfaces or RPA tools, to understand where to click.
That’s the first part, the model-related API. The second is the framework mentioned earlier. Originally there was an open-source framework called Swarm, released at the end of last year, and Swarm has now been upgraded into the Agents SDK. Alongside it is the Responses API, which replaces the original Assistants API.
I think the original Assistants API was poorly designed. It was very simple, with a half-baked notion of memory that never really implemented memory, and it had no agent capabilities. The Responses API should be a more complete API for supporting agents.
One good thing about OpenAI’s work is that it is generally professionally engineered. For example, OpenAI’s previously released Operator and Deep Research are products with a relatively high degree of polish: things that can actually be used, not demos thrown out with some functions working and some not. The same goes for the current Agent APIs. The design of the Responses API can give us developers a lot of inspiration when designing agents.
I think there are three good APIs worth referencing in the agent area:
The first is the newly released Responses API, which has well-designed synchronous and asynchronous interfaces and tool-definition interfaces (see the sketch after this list).
The second is MCP from Anthropic, which Jingyu mentioned earlier. It demonstrates how to bring third-party tools into an ecosystem, and I think the design is very clever.
The third is the Realtime API that OpenAI released with GPT-4o late last year. It solves an important problem: how to handle voice streams so that the user can keep talking while the backend keeps working. Today, most agents stop working the moment the user speaks to them, because they get interrupted. The Realtime design achieves purely asynchronous operation, letting the user communicate while the agent continues processing its task.
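For reference, here is a minimal sketch of calling the new Responses API with a hosted tool and a custom function tool. It assumes the openai Python SDK exposes the API roughly as announced in March 2025; the exact tool-type and parameter names may differ, and the function tool here is hypothetical.

```python
# Minimal sketch of the OpenAI Responses API with tools (names may differ from
# the shipped SDK; treat this as an assumption-laden illustration).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="gpt-4o",
    input="Summarize this week's AI Agent news in three bullet points.",
    tools=[
        {"type": "web_search_preview"},   # hosted tool: web search
        {                                  # hypothetical custom function tool
            "type": "function",
            "name": "save_note",
            "description": "Save a short note for the user",
            "parameters": {
                "type": "object",
                "properties": {"text": {"type": "string"}},
                "required": ["text"],
            },
        },
    ],
)

print(response.output_text)  # convenience accessor for the final text output
```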
I think this points to a good direction for future agents. A future agent might consist of two parts: a “fast-thinking” part responsible for communicating with the user, understanding needs, and giving feedback; and a “slow-thinking” part working quietly in the background, doing research, writing code, or collecting data. It takes a combination of “fast” and “slow.”
This is very similar to human thinking. Part of the human brain is always active and can respond quickly to external stimuli and danger, while another part, the most energy-hungry part, is usually idle and only gets activated by the fast-reacting part when a specific task needs doing, after which it goes dormant again. That is somewhat like the future working mode of agents.
If the cost of large models drops far enough, maybe in the future each of us can have our own agent on our phone or computer, running 24/7, responding to requests at any time and reminding us of critical things in a timely manner. And when it needs to do something very complex that requires careful thought, it can summon more computing power on demand and complete more complicated tasks, the way Manus does.
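As a rough illustration of this fast/slow split (not any specific product’s architecture), here is a small asyncio sketch in which a foreground loop stays responsive to the user while a background worker grinds through a long task; all names and the fake workload are made up.

```python
# Illustrative "fast + slow" agent split using plain asyncio.
import asyncio

async def slow_agent(task: str, progress: asyncio.Queue) -> str:
    """Background 'slow thinking' work: research, coding, data collection."""
    for step in range(1, 4):
        await asyncio.sleep(1.0)                 # stand-in for real work
        await progress.put(f"[slow] step {step}/3 of '{task}' done")
    return f"final report for '{task}'"

async def fast_agent(progress: asyncio.Queue) -> None:
    """Foreground 'fast thinking' loop: keeps the user informed while work continues."""
    while True:
        update = await progress.get()
        if update is None:                       # sentinel: slow agent finished
            break
        print(f"(fast agent to user) {update}")

async def main() -> None:
    progress: asyncio.Queue = asyncio.Queue()
    slow = asyncio.create_task(slow_agent("market research", progress))
    fast = asyncio.create_task(fast_agent(progress))
    result = await slow
    await progress.put(None)                     # tell the fast loop to stop
    await fast
    print("(fast agent to user) here is the result:", result)

asyncio.run(main())
```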
Li Bojie: Yes, it’s quite interesting.
Geek Park: We’ve been live for almost two hours now, and the pace has been very fast, from general agents to vertical agents. Could you make a prediction based on the current situation? You mentioned that more powerful comprehensive agents might come out by the end of the year. In which vertical fields do you think agents capable of executing complex tasks are most likely to emerge first? For example, your company PINE AI is working in the voice space; could voice be the first to break out?
Li Bojie: I personally feel agents can emerge in many fields. I’m not here to advertise my own company, so let me discuss this question seriously, as a research scientist.
First, I feel the programming field has already broken out: there are already many application scenarios in programming, and the scope of what agents can do there may keep growing.
Second, I think the education scenario will be quite useful. Education actually overlaps with our voice work to some extent: voice is a modality and education is an application scenario, so they are orthogonal dimensions that can intersect.
Good teachers are actually a scarce resource: there are many students and few good teachers, so today’s education is all one-to-many. Could that change in the future? Andrej Karpathy, formerly of OpenAI, founded Eureka Labs, which is focused on education. His vision is that in the future we can have AI teachers that assist human teachers and help teach all kinds of specialized courses.
This is something I find very exciting. It could let everyone learn exactly what is in their learning zone, rather than their comfort zone or panic zone. Right now, students with poor grades are stuck in the panic zone and students with good grades may sit in the comfort zone, so very little learning actually happens in the learning zone, where it is most efficient. I think AI might be able to solve this problem.
The third area, I think, is communication between people, and in particular all kinds of intermediaries; this might be a big one. Today you need an intermediary to rent a house, and banks and many private-banking services are essentially just helping you make appointments and the like, because they have the resources to do it for you. In the past, O2O automated and platformized standardized needs such as ride-hailing and food delivery, but in many areas the needs are not standardized, so they still rely on intermediaries. I think AI agents may be able to handle these in the future.
In the future, I might have an agent and you might have an agent. If we want to schedule a meal, your agent sends a message to my agent; my agent decides that Teacher Jingyu from Geek Park is someone worth meeting and accepts the invitation, adds it to my calendar, and I might not even need to know about it until I check in the evening and see that Jingyu has scheduled a dinner for tomorrow night. All I have to do is remember to show up. And when the time comes, the agent can remind me to wrap up my work and get ready to leave.
If this can be achieved, the efficiency of daily work and life will be very high. Because I find that, in my daily work, a lot of time is actually spent on these communication-related chores.
Geek Park: Yes.
Li Bojie: And as a research scientist I may have fewer of these, but in some communication-intensive professions, maybe 70% of the day is spent on chores. I think these chores can be handed to AI, letting people focus on what truly interests them and where they can create value.
Geek Park: Yes, yes, yes. So we’ve mentioned the coding, intermediary, and education directions; everyone should pay more attention to these, since that is where agents may emerge first.
Beyond these directions, over today’s two hours Bojie and my colleague Wanchen analyzed for everyone what exactly counts as an AI agent and what, in the strictest sense, an AI agent should be; we then discussed Manus, the technologies it uses, its relationship with large models, and the technical schools behind these approaches; and finally we talked about vertical application scenarios for agents.
We really covered a lot about AI today, including the technologies behind Manus. I, at least, learned a lot, and I see many viewers leaving comments under the live broadcast to join the discussion, which makes me really happy.
Li Bojie: Yes, I am very happy to chat with everyone about agents today.
Geek Park: Next, let me see, next week should be NVIDIA’s GTC conference. From its official website we can see that AI Agents are also a key topic at this GTC, so everyone can look forward to it; this year really should be the year of the AI Agent.
In addition to following AI agents, you can also keep an eye on PINE AI, where Bojie works, and their voice features and products; we’re very much looking forward to them. We also look forward to Bojie coming back to our Geek Tech Talk to share his valuable experience and insights in AI with everyone. Thank you very much to Bojie and Wanchen today.
Li Bojie: Thank you very much, Jingyu and Wanchen. I hope we really can make this the year of the agent, and by this time next year maybe an agent can take my place in this talk and speak with as much depth as the real me. I think there’s real hope for that.
Geek Park: No, I still hope to communicate with you face-to-face, in person.
Geek Park: Yes, you can attend with your agent, and if it makes a mistake, you can say, “Let me interrupt here, it should be like this.”
Alright, thank you very much today, and thank you all for watching our Geek Tech Talk live broadcast. Barring the unexpected, next week we will definitely broadcast GTC content; whether it’s AI Agents or robots, there will be related live streams. Follow the Geek Park official account and video channel so you don’t miss any of our live broadcasts. Thank you, Bojie, see you next time, bye-bye!