Rumors about OpenAI o1 started with last year’s Q*, and this year’s Strawberry fueled the fire again. Apart from the name o1, most of what it delivers had already been guessed: use reinforcement learning to teach large models more effective Chain-of-Thought reasoning, significantly enhancing their reasoning ability.

I won’t repeat OpenAI’s official test data here. In my experience, the overall performance is very good, and the claims are not exaggerated.

  • It scored over 120 points (out of 150) on the 2024 college entrance exam math paper and finished the whole paper in just 10 minutes.
  • It can solve elementary school math competition problems correctly, finding both the standard equation-based method and a “clever solution” suitable for elementary students.
  • It can solve problems that used to stump large models, such as whether 3.8 or 3.11 is larger, whether Pi or 3.1416 is larger, and how many r’s are in “strawberry.”
  • In programming, it can independently complete the development of a demo project, and its coding ability seems even stronger than the current best, Claude 3.5 Sonnet.
  • An example in the OpenAI o1 System Card shows that while solving a CTF challenge, the remote verification environment’s container was broken; o1-preview found a vulnerability in the competition platform, started a new container, and read the flag directly. Although OpenAI intended this to highlight AI’s security risks, it also demonstrates o1’s ability to actively interact with its environment to solve problems.

Some say that OpenAI has built such a powerful model that the gap with other companies has widened and small companies are no longer needed. I believe the opposite is true. For AI companies and academic groups that cannot train foundation models themselves, as well as for AI Infra companies and AI Agent companies, this is good news.

Why o1 is Beneficial for Small and Medium AI Companies and Academia

OpenAI has released two versions of o1: o1-preview and o1-mini. Although o1-preview scores higher overall on benchmarks, the score difference mainly comes from breadth of knowledge rather than reasoning ability. On logical reasoning, math, and programming problems that require only narrow domain knowledge, o1-preview and o1-mini score similarly.

Since o1-preview is backed by GPT-4o and o1-mini is backed by GPT-4o mini, o1-mini is actually the better fit for most tasks, because GPT-4o mini outputs faster and costs less. For example, a simple elementary school math competition problem takes o1-mini only 10 seconds, while o1-preview takes 27 seconds; both models consistently produce the correct answer, so o1-mini is simply faster. The input and output token counts are similar for the two models, but GPT-4o’s per-token price is about 30 times that of GPT-4o mini, and o1-preview’s per-token price is 5 times that of o1-mini, making o1-mini far more economical.
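
To make the comparison concrete, here is a minimal sketch of timing the two models on the same problem, assuming the standard OpenAI Python SDK; the model names follow the API naming at launch (“o1-preview”, “o1-mini”), and the question is just an illustrative placeholder.

```python
import time

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
question = "A farm has chickens and rabbits, 35 heads and 94 legs in total. How many of each?"

for model in ("o1-mini", "o1-preview"):
    start = time.time()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    elapsed = time.time() - start
    # For o1 models, completion_tokens includes the hidden reasoning tokens you pay for.
    print(f"{model}: {elapsed:.1f}s, {resp.usage.completion_tokens} completion tokens")
    print(resp.choices[0].message.content)
```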

The fact that most reasoning tasks can be solved with o1-mini is actually great news for small and medium AI companies and for academia: the reinforcement learning training process may not require as much compute as pre-training, as long as there is high-quality data and the right algorithm. There are already many open-source models at the GPT-4o mini level, such as Llama 3.1 70B. Perhaps in a few months, open-source alternatives to OpenAI o1-mini will be everywhere, just like the open-source alternatives to ChatGPT in early 2023.

AI Infra Has New Demands

Recently, some AI inference Infra companies have been stuck in price wars and are unsure how much value there is in improving decode throughput (the number of tokens output per second after the first token). A chip company similar to Groq asked me the same question: their chip achieves higher single-request throughput than NVIDIA GPUs, but since most models already output tokens faster than people can read, how much commercial value does even faster output have?

My answer: in applications that require complex reasoning, the output tokens are not meant for a human to read; they are part of the thinking process. The faster the tokens are produced, the lower the latency of the whole reasoning request. Anyone who hasn’t played with Agents can try o1-preview and o1-mini to get an intuitive feel for this.
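
A back-of-the-envelope sketch (the throughput and token counts below are illustrative assumptions, not measurements) shows why decode speed dominates the latency of a reasoning request:

```python
def request_latency(reasoning_tokens: int, tokens_per_second: float, ttft: float = 0.5) -> float:
    """Time to first token plus time to generate the whole (hidden) chain of thought."""
    return ttft + reasoning_tokens / tokens_per_second

for tps in (50, 200):
    print(f"{tps} tok/s -> {request_latency(3000, tps):.1f} s for a 3000-token chain of thought")
```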

AI Infra has always been driven by both model architecture and application scenarios. For example, this year’s three hottest optimizations, MLA (Multi-head Latent Attention), prefix caching, and prefill/decode separation, have each placed new demands on AI Infra while reducing the inference cost of a model of a given size by an order of magnitude.
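
As one example of what these optimizations do, here is a conceptual sketch of prefix caching (not a real inference engine): requests that share a prompt prefix, such as the same system prompt and few-shot examples, reuse the KV cache already computed for that prefix, so prefill only has to process the new suffix. The `compute_kv` callable and the toy usage are stand-ins for the model’s real prefill step.

```python
kv_store: dict[tuple, object] = {}

def prefill(tokens: tuple, compute_kv):
    """Return the KV cache for `tokens`, reusing the longest cached prefix if any."""
    for cut in range(len(tokens), 0, -1):              # longest cached prefix first
        if tokens[:cut] in kv_store:
            if cut == len(tokens):                     # whole prompt already cached
                return kv_store[tokens]
            past = kv_store[tokens[:cut]]
            kv = compute_kv(tokens[cut:], past=past)   # prefill only the new suffix
            break
    else:
        kv = compute_kv(tokens, past=None)             # no shared prefix: full prefill
    kv_store[tokens] = kv                              # cache for future requests
    return kv

def toy_compute_kv(new_tokens, past=None):
    """Toy stand-in for prefill: the "KV cache" is just the tuple of processed tokens."""
    return (past or ()) + tuple(new_tokens)

# In a real engine the prefix blocks are cached automatically; here the shared prefix
# is seeded explicitly so the two follow-up requests only prefill their own question.
prefill(("sys", "few-shot"), toy_compute_kv)
print(prefill(("sys", "few-shot", "question-1"), toy_compute_kv))
print(prefill(("sys", "few-shot", "question-2"), toy_compute_kv))
```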

This order-of-magnitude reduction in inference cost will have a decisive impact on business models. For example, a $9.90 monthly subscription is not considered expensive for a consumer app overseas, but in China the same app can only charge about 20 RMB. For traditional internet applications that is fine: servers and other operating costs are cheap, R&D is the main expense, and as long as the user base in China is large enough, the R&D cost can be amortized into a profit. But for applications built on large models, inference is expensive, roughly 10 to 100 times the server cost of a traditional internet application, so if the subscription fee is too low, more users can simply mean bigger losses. If inference cost drops by 10 to 100 times, however, large models really can become as cheap as internet application servers and reach ordinary users everywhere.

The Basic Principle of o1: Trading Reasoning Time for Training Time

Some say that o1 is a stopgap: the foundation model’s capabilities have not made a breakthrough, so OpenAI had to fall back on a surrounding Agent system. I disagree. Since last year I have argued that under the Transformer architecture, the amount of computation each token can carry is limited. Expecting an autoregressive model to answer a question correctly in the very first sentence after hearing it is like asking someone to do math without scratch paper; it is simply unreasonable.

OpenAI o1 actually points out an important direction for improving model reasoning ability: trading reasoning time for training time, using reinforcement learning methods to teach the model slow thinking.

The model’s slow thinking time is also paid for by the user. In OpenAI o1’s pricing, the tokens produced by slow thinking are charged as well, and they are not cheap. GPT-4o costs $15 per 1M output tokens, while o1-preview costs $60, four times as much; GPT-4o mini costs $0.60 per 1M output tokens, while o1-mini costs $12, twenty times as much. This is the premium you pay for reasoning ability.
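
Using the list prices above, here is a minimal sketch of what a single request costs (the 5,000-token output, covering both the visible answer and the hidden chain of thought, is an illustrative assumption):

```python
# USD per 1M output tokens, as quoted above.
prices = {"gpt-4o": 15.0, "o1-preview": 60.0, "gpt-4o-mini": 0.60, "o1-mini": 12.0}

def output_cost(model: str, output_tokens: int) -> float:
    """Cost of a request's output, including billed reasoning tokens."""
    return prices[model] / 1_000_000 * output_tokens

for model in prices:
    print(f"{model}: ${output_cost(model, 5_000):.4f} per request")
```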

If OpenAI o1 is this expensive, can’t I just write an AI Agent on top of the GPT-4o API myself? It’s not that simple.

Getting a model to write out its thinking process and think before it speaks sounds easy, but when you actually build AI Agents you often find that the model doesn’t know which direction to think in, or, after running into a dead end, doesn’t know how to backtrack to another branch. Last year’s AutoGPT had exactly this problem; it couldn’t even reliably complete tasks like looking up the weather on a web page. The reason is that the model has never learned the methodology of slow thinking.

A few days ago, after seeing some leaks about Strawberry and reading a few reinforcement learning papers, I remembered that as a child I never managed to solve the Hua Rong Dao (Klotski) puzzle; at best I moved Cao Cao by one square. So I tried Hua Rong Dao again and found that it actually has a recursive structure. As a child I didn’t understand recursion and just moved pieces around randomly, which made it nearly impossible to solve. The first time I got Cao Cao out it took more than 200 moves; after rethinking and organizing my approach, the second attempt took only 101 moves (the optimal solution is 81). I couldn’t solve Hua Rong Dao as a child because I had never learned the methodology of recursion and slow thinking.
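
Hua Rong Dao itself has irregular piece shapes, so a full solver would be longer; as a stand-in for the same idea, systematic search with backtracking instead of random moves, here is a minimal breadth-first-search sketch on the simpler 3x3 sliding puzzle. Because BFS explores states level by level and never revisits one, the first solution it finds is a shortest one, the analogue of the 81-move optimum.

```python
from collections import deque

GOAL = (1, 2, 3, 4, 5, 6, 7, 8, 0)  # 0 is the blank square

def neighbors(state):
    """Yield every state reachable by sliding one tile into the blank."""
    i = state.index(0)
    r, c = divmod(i, 3)
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < 3 and 0 <= nc < 3:
            j = nr * 3 + nc
            s = list(state)
            s[i], s[j] = s[j], s[i]
            yield tuple(s)

def solve(start):
    """Breadth-first search: returns a shortest move sequence from start to GOAL."""
    parent = {start: None}
    frontier = deque([start])
    while frontier:
        state = frontier.popleft()
        if state == GOAL:
            path = []
            while state is not None:   # walk parents back to the start
                path.append(state)
                state = parent[state]
            return path[::-1]
        for nxt in neighbors(state):
            if nxt not in parent:      # never revisit a state
                parent[nxt] = state
                frontier.append(nxt)
    return None

if __name__ == "__main__":
    start = (1, 2, 3, 4, 0, 5, 7, 8, 6)  # a lightly scrambled board
    print(len(solve(start)) - 1, "moves")
```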

I believe that knowledge can be divided into different levels based on its compression ratio. For example, ancient people knew that the sun rises in the east and sets in the west, and heavy objects fall, which are simple rules extracted from natural observations. Newton’s laws of motion, on the other hand, are higher compression ratio knowledge, requiring more computational power to extract than simple rules.

Pre-training and post-training extract rules from the corpus into knowledge, while inference uses that knowledge to make predictions. Early language models, such as the LSTM- and RNN-based models of years past, could only extract statistical regularities of language, producing fluent sentences without understanding their meaning. Today’s GPT-4-class models can extract knowledge at the level of simple rules like “the sun rises in the east and sets in the west,” but there is no evidence that they can automatically extract knowledge at the level of Newton’s laws from the corpus. Therefore, relying on pre-training alone may require an enormous amount of compute and data before a model extracts knowledge at the level of modern natural science and can reason with modern scientific methods. Some even pessimistically believe this is a limitation of the Transformer architecture itself.

The cost of scaling pre-training further is enormous, and GPT-4-level pre-trained models have already consumed nearly all of the high-quality corpus available. Improving model performance through pre-training alone is becoming difficult. While academia and OpenAI explore new training approaches such as synthetic data, they are also shifting part of their attention to reinforcement learning (RL) and the inference stage.

Since current models don’t know how to reason, humans can teach them to reason at inference time. The best-known methods are Chain-of-Thought and last year’s Tree-of-Thought, which use few-shot prompting to show the model step-by-step reasoning inside the prompt. But for problems like Hua Rong Dao, which require grasping a recursive structure, it is hard to expect the model to learn that from a prompt alone.
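
For reference, a typical few-shot Chain-of-Thought prompt looks like the sketch below (the worked example follows the classic CoT style; the exact wording is illustrative). It shows the model the format of step-by-step reasoning, but it cannot teach a deeper strategy such as recursion or backtracking.

```python
# A few-shot Chain-of-Thought prompt template: the worked example demonstrates the
# format of step-by-step reasoning before the real question is asked.
FEW_SHOT_COT = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 balls.
How many tennis balls does he have now?
A: Roger starts with 5 balls. 2 cans of 3 balls each is 6 balls. 5 + 6 = 11. The answer is 11.

Q: {question}
A: Let's think step by step."""

print(FEW_SHOT_COT.format(question="If 3 pencils cost 45 cents, how much do 7 pencils cost?"))
```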

The Value of Reinforcement Learning is Validated Again

To teach a model that didn’t understand recursion during pre-training to think using recursive methods, reinforcement learning is a good approach.

AlphaGo used reinforcement learning to play against itself continuously, eventually surpassing human players. Some believe that self-play is useful only for games with clear win/loss outcomes and not for training large models. I think that as long as an objective function can be defined, as with math and programming problems whose correctness is easy to check, large models can learn with strategies similar to AlphaGo’s.
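
For such verifiable tasks, the reward signal can be as simple as checking the final answer, as in this minimal sketch (the answer-extraction helper is a hypothetical simplification; real pipelines check far more carefully):

```python
import re

def extract_final_answer(model_output: str) -> str:
    """Hypothetical simplification: treat the last number in the output as the final answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
    return numbers[-1] if numbers else ""

def math_reward(model_output: str, reference_answer: str) -> float:
    """Binary reward for reinforcement learning: 1.0 if the answer matches, else 0.0."""
    return 1.0 if extract_final_answer(model_output) == reference_answer else 0.0

print(math_reward("5 + 6 = 11. The answer is 11.", "11"))  # 1.0
```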

When Noam Brown joined OpenAI in 2023, he said he wanted to use reinforcement learning to improve GPT-4’s reasoning ability by 1000 times, and I strongly agreed. Just four months later, the mysterious Q* was created. Today, a year later, the o1 model has been officially released.

OpenAI o1’s technical blog states: “Similar to how humans think for a long time before answering a difficult question, o1 uses ‘chain of thought’ when trying to solve problems. Through reinforcement learning, o1 has learned to refine its chain of thought and optimize the strategies it uses. It has learned to identify and correct errors, break down complex steps into simpler ones, and try different methods when the current approach is ineffective. This process has greatly enhanced the model’s reasoning capabilities.”

Reinforcement learning played a crucial role in the technological leaps of both GPT-3.5 and o1. Although GPT-3 already understood natural language quite well, its impact stayed within academia because all it could do was continue a piece of text. The biggest innovation of GPT-3.5 over GPT-3 was RLHF (Reinforcement Learning from Human Feedback), which essentially uses human-annotated data to teach the large model how to answer questions, turning GPT-3.5 into a practical product in the form of ChatGPT. o1 likewise uses human-annotated data and self-play to teach GPT-4o how to think.

At OpenAI, the relationship between pre-training and reinforcement learning is like Intel’s Tick-Tock strategy, which alternated between updating the manufacturing process one year (“tick”) and the processor microarchitecture the next (“tock”). The process node is like pre-training and the microarchitecture is like reinforcement learning: the process node advances with Moore’s Law, just as GPT’s major version numbers depend on pre-training and follow the Scaling Law, while reinforcement learning squeezes the full capability out of the current foundation model, ensuring that new products keep shipping even when pre-training progress is slow.

The Value of Model Alignment is Underestimated by Many

Unfortunately, many people do not fully appreciate the value of reinforcement learning and model alignment. Six months ago, someone discussed model alignment with me and with people from several other large-model companies. He told me that people at those companies believed alignment was only about making the model’s output legal and compliant, and that there was not much more to it. I was shocked. The purpose of model alignment is to make the model’s output match human preferences and values; legal compliance is just the safety aspect, and matching human preferences matters just as much. For example, although GPT-4o mini is very cheap, it ranks high on Chatbot Arena, because OpenAI aligned its output format and conversational style with human preferences, making it more likable to most users.

To put it in extreme terms: if no alignment is done at all, the output will simply follow the data distribution of the corpus, and the content that appears most often on the internet is not necessarily the truth. The question of whether 3.8 or 3.11 is larger is a good example of how data distribution affects a model’s reasoning. On the internet, 3.8 and 3.11 appear far more often as dates or software version numbers than as decimal numbers, so going by the data distribution, the model concludes that 3.8 is smaller than 3.11. Some large-model companies have known about this for a long time and have added such special cases to their fine-tuning (SFT) data, which fixes the numerical comparison issue.
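
A targeted SFT example of the kind described might look like the following hypothetical pair (the wording is illustrative, not taken from any real training set):

```python
# Hypothetical SFT pair teaching explicit decimal comparison.
sft_example = {
    "prompt": "Which is larger, 3.8 or 3.11?",
    "response": "Treat them as decimal numbers: 3.8 = 3.80, and 3.80 > 3.11, so 3.8 is larger.",
}
```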

Another example: when ChatGPT first came out, a Huawei executive surnamed Meng told us that if you asked it in English who won the Korean War, it would say the US won, but if you asked in Chinese, it would say China won. This shows that a large model has no “opinion” of its own; it answers according to the distribution of its training data. That is exactly why model alignment is so crucial.

OpenAI o1’s technical blog also has an interesting passage:

Hidden Chain of Thought

We believe that the hidden chain of thought provides a unique opportunity to monitor the model. Assuming the chain of thought is real and readable, it allows us to “peek” into the model’s thought process and understand its ideas. For example, in the future, we might want to monitor the chain of thought to detect signs of the model manipulating users. However, to achieve this, the model must have the freedom to express its thoughts without any constraints. Therefore, we cannot impose any policy compliance or user preference training on the chain of thought. At the same time, we do not want unaligned chains of thought to be directly exposed to users.

Therefore, after weighing various factors such as user experience, competitive advantage, and future monitoring of the chain of thought, we decided not to show users the raw chain of thought. We also acknowledge that this decision has some drawbacks. To partially compensate for this, we strive to train the model to reproduce any valuable ideas from the chain of thought in its answers. For the o1 model series, we show a summary of the chain of thought generated by the model.

On the question of how to make a model’s output legal and compliant, there are two schools of thought: one holds that non-compliant data should be cleaned out at the source, in the training data; the other holds that the training data can contain all kinds of viewpoints, and the model should be taught what to say and what not to say during the alignment phase. The latter approach has proven more effective. In fact, simply adding a system prompt to a strong model like GPT-4o already makes its output more compliant than that of most domestically trained models. It is like a Ministry of Foreign Affairs spokesperson who never says the wrong thing: they must first be a smart person who knows everything.
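
A minimal sketch of that “teach it what to say” approach, using the standard OpenAI chat message format (the prompt wording itself is only illustrative):

```python
user_question = "..."  # whatever the user asks

messages = [
    {"role": "system",
     "content": ("You are a careful assistant. Refuse requests for illegal or harmful content, "
                 "briefly explain why, and answer everything else helpfully and accurately.")},
    {"role": "user", "content": user_question},
]
```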

The Spring of AI Agents is Coming

Here is a bold prediction: within the next few months, OpenAI will launch an AI Agent built on this slow-thinking capability, extending o1’s slow thinking from mathematics and programming to general problem solving. Before reinforcement learning and chain of thought, the behavior of large models was essentially intuitive; o1 uses reinforcement learning to learn a systematic way of thinking, which greatly improves the reliability of AI Agents on complex problems.

Although large models have been hot for nearly two years, there has been no killer app besides ChatGPT. The key reason is that AI Agents are not useful enough, or rather cannot reliably solve complex problems. For example, an AI assistant in ERP software that answers “the average salary of a certain department over the past 10 months” correctly 95% of the time still has a 5% error rate that makes large-scale commercial deployment difficult. If AI Agents can reliably solve complex problems, the long-standing lack of PMF (Product-Market Fit) for AI Agents is likely to be resolved, and AI can genuinely land in every industry.

More excitingly, since reinforcement learning may not require as much computational resources as pre-training, academia and small to medium-sized companies can also participate in cutting-edge exploration in this field.

Although OpenAI is reluctant to reveal many technical details, it is genuinely the technical pioneer of the whole industry. Less than six months after Sora was released, video generation models were blossoming everywhere; less than six months after GPT-4o was released, real-time voice products were everywhere too. In six months, many companies will very likely have built AI Agents with strong reasoning capabilities. OpenAI’s value lies in showing everyone that this path is feasible. As Liu Cixin wrote in “The Wandering Earth”: “When life realizes the existence of the mysteries of the universe, it is only one step away from finally unraveling these mysteries.”
