OpenAI o3: The Dawn of AGI and ASI
This article was first published in a Zhihu answer to “What do you think of OpenAI’s latest o3 model? How powerful is it?”
When o1 first came out, many people were skeptical, arguing that it had not yet reached AGI (Artificial General Intelligence). The programming and mathematical capabilities demonstrated by o3 not only meet the threshold for AGI but even touch the edge of ASI (Artificial Superintelligence).
o3 further validates the value of RL and test-time scaling: when high-quality pre-training data is nearly exhausted and model capabilities hit a “wall,” post-training and longer inference offer a path to keep raising model intelligence and solving harder problems.
Many have seen the specific performance metrics of o3, so I won’t repeat them. Here’s a summary:
- o3 defeated 99.9% of programmers in Codeforces competitive programming, ranking 175th among 168,076 rated programmers. Even o3’s own authors couldn’t beat it.
- o3 also shows significant improvement over o1 in meeting real-world programming needs. On the SWE-bench software development benchmark, the previously released o1-preview scored 41.3%, while o3 scored 71.7%. This means o3 can directly resolve about 70% of real-world requirements and pass the tests, leaving only 30% of the work to human programmers, and even there AI can still significantly boost efficiency.
- It scored 96.7% on the AIME 2024 math test, equivalent to missing only one question on the American Invitational Mathematics Examination.
- On GPQA Diamond, a test of PhD-level science questions, it exceeded o1 by about 10 percentage points, and o1 was already at the average level of human PhD students.
- On ARC-AGI, a visual abstract-reasoning benchmark, the fine-tuned o3 reached 87.5%, surpassing the human average (85%).
However, o3 is not omnipotent, and its ability to handle real-world engineering tasks is weaker than many imagine. In programming tasks within large engineering projects, I found o1-preview’s accuracy to be worse than Claude 3.5 Sonnet’s; o1 excels at well-defined, closed scientific problems. I haven’t tried o3, but the fact that it solves only 71.7% of SWE-bench suggests it still falls short of human software engineers: a competent full-stack engineer cannot claim to complete 70% of the requirements while giving up on the remaining 30%. o3 surpasses 99.9% of humans on Codeforces because competition problems are well-defined, whereas real-world engineering tasks are not.
Some people say o3 is too expensive now, at $1,000 per task, so hiring humans is cheaper. I want to refute this view from several angles:
- Most software-project programming tasks only require the capabilities of o3-mini, which is stronger than o1-preview yet cheaper. Even o1-preview already costs less than a human engineer’s development time, meaning that for software development problems AI can solve, AI is cheaper than manual programming by human engineers.
- Problems that o3-mini cannot solve and that require the full o3 are often hard problems that ordinary programmers or university math students cannot solve either. Hiring a top programmer or mathematician reliable enough to solve them would cost far more than $1,000.
- The knowledge density of large models keeps increasing, and the knowledge density law proposed by Professor Liu Zhiyuan (model knowledge density doubles every 3.3 months) still holds for reasoning models. For example, in just half a year, o3-mini achieved the capabilities of o1-preview at one-tenth the cost. With Moore’s Law for hardware on top of rising knowledge density, inference costs will therefore fall rapidly (see the short calculation after this list).
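As a rough sanity check on that rate (my own back-of-the-envelope arithmetic, not a figure from any announcement), suppose the density law translates one-for-one into the inference cost of a fixed capability level:

```latex
% Assumption: cost at a fixed capability level halves each time density
% doubles, i.e. c(t) = c(0) * 2^{-t/T} with doubling period T = 3.3 months.
% After half a year (t = 6 months):
\[
  \frac{c(6)}{c(0)} = 2^{-6/3.3} \approx 0.28
\]
```

Density alone thus predicts roughly a 3.5x cost drop in six months; Moore’s Law and inference-stack optimizations plausibly stack on top of it to reach the observed ~10x drop from o1-preview to o3-mini.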
Therefore, I believe this wave of progress in large models confirms a view I have always held: in a world with limited energy, AI is a more efficient form of intelligence than humans. I am pleased to see OpenAI leading the industry in exploring more efficient ways to convert energy into intelligence.
Since the release of o1 in September, I have been puzzled: AI’s software development capabilities have surpassed humans’, and AI’s intelligence surpassing humans’ is a foregone conclusion. So what can human programmers still do?
If I were to compete with AI on Codeforces, never mind o3, even o1 might beat me. I earned a recommended university admission through the NOI algorithm competition, and I am confident in my coding ability at work: when I interview candidates, I never need them to run their code, because I can usually spot the errors in a few dozen lines at a glance. Yet in front of AI, I still feel a deep sense of powerlessness.
And for simple development needs in real projects, I may not even beat Claude 3.5 Sonnet. Sometimes the Composer agent in Cursor (or Windsurf, a similar AI coding tool) feels slow, and I wonder whether I could write a simple change faster myself. But after a few tries, I often find that by the time I’ve located the code to change in the repository, the AI has already finished writing. Most of the time, what I do is the least technical part, more like a product manager’s job: describing the requirements clearly in natural language to the AI and then checking whether its output meets them. (Of course, AI today is not that strong; when it can’t handle something, I still have to modify the code myself.) In today’s o3 demo, you can also see that the slowest part of the entire demo was not the AI but the person operating it.
So does AI Infra, a relatively specialized field, still depend on human experts? For example, given a training task and a hardware configuration for a 10,000-GPU cluster, how do you choose optimal degrees of pipeline (PP), data (DP), and tensor (TP) parallelism? I spent an entire evening working it out on scratch paper. Asked directly, o1-preview answers incorrectly for lack of domain knowledge, such as the GPU’s hardware parameters and the tensor shapes of LLaMA 70B. But if I pull the relevant background from the knowledge base and feed it to o1-preview, point out the flaws in its first answer, and let it keep refining, then within 5 rounds of dialogue and half an hour (including prompt-typing time), o1-preview reaches the same conclusion I reached after a whole night of thinking. The optimistic inference is that, built on the strongest reasoning models, AI can greatly amplify the productivity of domain experts. The pessimistic inference is that domain-specific knowledge and experience are not a moat for humans against AI.
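For flavor, here is the kind of rough feasibility search that calculation involves, written as a tiny script. This is a minimal sketch under heavily simplified assumptions: it budgets only model-state memory, ignores activations, microbatching, and interconnect topology, and every constant (the 18 bytes per parameter, the 80 GB cards, the cluster size) is a hypothetical placeholder rather than the actual numbers from my evening of scratch paper:

```python
# Hypothetical back-of-the-envelope search over (TP, PP, DP) splits.
# The memory model is deliberately crude: weights + gradients + optimizer
# states only, no activations or communication costs.

NUM_GPUS = 10_000      # assumed cluster size
GPU_MEM_GB = 80        # assumed per-GPU memory (e.g., an 80 GB card)
PARAMS_B = 70          # assumed model size in billions of parameters
BYTES_PER_PARAM = 18   # rough mixed-precision training cost per parameter:
                       # fp16 weights (2) + fp16 grads (2) + fp32 Adam master
                       # weights and moments (4 + 4 + 4), plus some slack

def feasible_configs():
    """Enumerate (tp, pp, dp) splits that fit the simplified memory budget."""
    results = []
    for tp in range(1, 9):          # TP usually stays within a single node
        for pp in range(1, 33):
            if NUM_GPUS % (tp * pp) != 0:
                continue
            dp = NUM_GPUS // (tp * pp)
            # Model states are sharded across the TP * PP GPUs of one replica.
            mem_gb = PARAMS_B * BYTES_PER_PARAM / (tp * pp)
            if mem_gb <= GPU_MEM_GB * 0.8:   # keep ~20% headroom for activations
                results.append((tp, pp, dp, mem_gb))
    return sorted(results, key=lambda r: r[3])

for tp, pp, dp, mem in feasible_configs():
    print(f"TP={tp}  PP={pp:2d}  DP={dp:5d}  ~{mem:5.1f} GB/GPU for model states")
```

A real plan would also weigh activation memory, pipeline bubbles, and communication bandwidth, which is exactly the domain knowledge the model lacked until I supplied it.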
Last night I happened to discuss this with my wife, and she pointed out that back in September I was saying AI programming was amazing and that one person could write a large project without a team. But what about reality?
First, AI’s code quality and software engineering capabilities are not yet on par with professional programmers’. Since November, I have been independently responsible for most of the technical development of a project. With Cursor and Claude 3.5 Sonnet, my output of over 40,000 lines of code per month is indeed far higher than before, but AI’s code quality is still inferior to a human’s: it often violates the DRY principle, producing heavily duplicated code. And because test cases are incomplete, AI often breaks correct code while editing, scattering bugs throughout the project. Humans therefore still need to act as gatekeepers for AI, much like committers in a software development team.
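One concrete form that gatekeeping can take (my own illustration, not a tool mentioned above) is a regression suite that every AI-generated change must pass before being committed; even a few pinned behaviors catch many of these correct-to-incorrect edits:

```python
# Illustrative only: all names are hypothetical stand-ins for real project
# code that an AI assistant keeps editing. Run with: pytest <this_file>.py

def apply_discount(price: float, percent: float) -> float:
    """Toy function standing in for logic an AI refactor might break."""
    percent = min(max(percent, 0.0), 100.0)  # the clamp a refactor might drop
    return price * (1.0 - percent / 100.0)

def test_discount_is_capped_at_100_percent():
    # Pins the clamp so a "simplifying" edit that removes it fails CI.
    assert apply_discount(price=100.0, percent=150.0) == 0.0

def test_negative_discount_is_ignored():
    assert apply_discount(price=80.0, percent=-10.0) == 80.0

def test_zero_discount_leaves_price_unchanged():
    assert apply_discount(price=80.0, percent=0.0) == 80.0
```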
Second, alone you go fast, but together you go far. Thinking alone, it is easy to get stuck in a dead end and lose motivation. Partners are not there simply to share the workload; more importantly, they provide different perspectives: pulling you up when you are pessimistic and desperate, pouring cold water on you when you are complacent, pushing progress along when you are distracted, and changing the route or even cutting the feature when the technology doesn’t pan out.
Finally, technology is just one aspect of a company, and not necessarily the most important one. At this o3 launch event, for example, the two technical experts’ demos were clearly not presented as compellingly as Sam Altman could have presented them. Both are people I greatly admire, but during the demo they still struggled to make laypeople understand how impressive the technology is. In the demo, o3-mini wrote an agent framework and ran it once; then, within that framework, a program was written to have the model evaluate itself, also run once, producing a score of 61%. We insiders were amazed, but Sam Altman had to remind them to explain what was happening. I suspect that in many laypeople’s eyes, the demo was just someone building a rough HTML interface, typing in a prompt, and getting an inscrutable test score, with the power of AI bootstrapping itself entirely lost on them.
It reminds me of the end of last year, when we built the first version of our voice-call feature: digital avatars of Donald Trump and Elon Musk that could chat on the phone with voices and speaking styles close to the originals. But investors were confused, asking what it was for and why users would want to call digital avatars of Trump and Musk. Some friends told me that while others could pitch a 100-point thing as 500 points, I pitched a 100-point thing as 60. So for a company, a role like Sam Altman’s is very important. When it comes to explaining the value of technology, AI is currently no match for humans.
Considering these three points, I realized last night that humans should not compete with AI on intelligence, just as in the industrial revolution humans should not have competed with machines on physical strength. But the industrial revolution did not replace humans, because machines expanded the boundaries of human strength, moving what people could not move and making repetitive physical labor (such as farming and weaving) more efficient, which made human physical labor more dignified. Today’s AI likewise expands the boundaries of human intelligence, solving the problems that demand the most brainpower and making repetitive mental labor (such as filling out forms and making slide decks) more efficient, which makes human mental labor more dignified.
Like the industrial revolution, I believe AI is not the terminator of programmers; on the contrary, it can greatly enhance programmers’ work efficiency. There is no need to worry that the general improvement in programmer efficiency will lead to many people losing their jobs, because the current demand for software development in society is far from being met.
First, independent developers. I have many things I want to build but no time to build them. Now that my efficiency has improved, I can finally do the things I’ve always wanted to do. At present, AI only roughly doubles my overall efficiency, because there are still many things it cannot do or does poorly. But with stronger models and agentic workflows, I am hopeful for a tenfold increase in overall efficiency in 2025. In the coming year, I hope to use AI to fulfill a few wishes.
Second, the B2B sector. In traditional industries, many processes and much knowledge have never been digitized, which is also the biggest obstacle to ERP implementation in many sectors. Customized development used to be equivalent to outsourcing, with costs that could not be recouped, so many workflows were never digitized or automated. With AI, the cost of customized development for each industry has dropped, as has the cost of organizing scattered knowledge, enabling more industries to achieve digital transformation.
I hope that o3-mini and o3 become available for use soon, and that domestic large models can quickly catch up with o1’s capabilities.