Continuous Learning for Agents: Why a Reasoner Is Not a Real Agent
Richard Sutton, the father of reinforcement learning, says that today’s large language models are a dead end.
This sounds shocking. As the author of The Bitter Lesson and a 2024 Turing Award laureate, Sutton is perhaps the field’s firmest believer that “more compute plus general methods will win,” so in theory he should be full of praise for large models like GPT-5, Claude, and Gemini. Yet in a recent interview, Sutton bluntly pointed out that LLMs merely imitate what people would say, rather than understanding how the world works.
The interview organized by podcast host Dwarkesh Patel sparked heated debate. Andrej Karpathy then responded in writing and elaborated in another interview. Their exchange reveals three fundamental issues in current AI development that are often overlooked:
First, the myth of the small-world assumption: Do we really believe that a sufficiently large model can internalize all important knowledge and never need to learn again? Or does the real world fit a large-world assumption—no matter how big the model is, it still needs to keep learning in concrete settings?
Second, the absence of continual learning: Current model-free RL methods (PPO, GRPO, etc.) learn only from sparse rewards and cannot leverage the rich feedback provided by the environment. This makes agents extremely sample-inefficient on real-world tasks and unable to adapt quickly.
Third, the gulf between Reasoners and Agents: OpenAI divides AI capability into five levels, from Chatbot to Reasoner to Agent. But many people mistakenly think that turning a single-turn Reasoner into a multi-turn one makes it an Agent. The core difference between a true Agent and a Reasoner is continual learning capability.
This article will systematically review the core viewpoints from those two interviews and, combined with our hands-on experience building real-time agents at Pine AI, explore how to bridge this gap.
The three core issues raised by Richard Sutton
1. LLMs are not true world models
Sutton’s first core point is: LLMs are not true world models; they can only predict what people would say, not what the world will become.
This distinction is crucial. A true world model should be able to predict what changes will occur if I take an action. For example:
- If I raise my hand, the cup will move upward
 - If I release my hand, the cup will fall and shatter
 
What do LLMs learn? They learn what people would say or do in a given situation. This is essentially imitation learning, not an understanding of the world’s causal laws.
Of course, with massive pretraining, LLMs can acquire some reasoning ability. But that is not equivalent to building a rigorous state transition model. The textual descriptions in pretraining data are more like observing the world “from outside” than first-person, interactive learning of “how the world changes after I take an action.”
2. RL is sample-inefficient and cannot learn from environmental feedback
Sutton’s second point is: current RL methods are extremely sample-inefficient, and they can only learn from rewards, not from the environment’s direct feedback (observations).
Let’s use a real example to illustrate. At Pine AI, we develop AI agents to make phone calls on behalf of users (e.g., calling Xfinity customer support):
First attempt: The agent calls support and is told: “I need the last four digits of your credit card to verify your identity.” The agent doesn’t have that information and has to hang up; the task fails, reward = 0.
The problem with traditional RL: The agent only knows this attempt failed (reward = 0), but it doesn’t know what the right thing to do would have been. The support rep explicitly stated what information was needed, but the agent cannot learn from that environmental feedback. Only after hundreds of rollouts, stumbling upon providing the credit card digits and getting reward = 1, can it learn.
How humans learn: The first time a human is told credit card info is needed, they jot it down immediately. Next time in a similar situation, they come prepared.
The root cause lies in this: current policy-gradient methods like PPO and GRPO are model-free algorithms; they essentially learn only from rewards and cannot directly learn knowledge from observations.
Model-free means these methods only learn a policy (strategy)—i.e., “what action to take in a given state”—but do not learn a world model—i.e., “what the world will be like after I take an action.” As a result, they cannot exploit the rich information provided by the environment and must rely on sparse reward signals.
3. Generalization is not guaranteed
Sutton’s third point is: knowledge representations learned by gradient descent do not come with guarantees of good generalization.
If a problem has a unique answer (e.g., a math problem), the model will eventually find it. But if the problem admits multiple possible solutions, gradient descent has no inherent bias to find the representation that generalizes most easily.
Although we use various regularization techniques during training to improve generalization, these mechanisms do not guarantee learning deep, inferable regularities. This is why many agent systems need external memory systems to explicitly summarize and structure knowledge.
How current agents learn and their limitations
Facing Sutton’s concerns, current agent systems mainly cope in three ways:
1. In-Context Learning: the misunderstanding about long context
In-context learning can solve learning within a single session. For example, in the scenario above, once the support rep says credit card info is needed, that information remains in the context, and in the next step of the same session the agent knows to ask the user. If we carry the context forward into later tasks, the agent can also apply previously learned knowledge to new tasks.
But many people think that with long context, we can stuff in all historical information and let the model automatically infer and learn. This is a serious misunderstanding of what context can do.
The essence of context: retrieval, not summarization
The essence of context is more like RAG than a reasoning engine. Each token is mapped to three vectors (QKV) and uses attention to find the most relevant context for the current query. This means knowledge is not automatically distilled and summarized, but stored in raw form in the KV cache embeddings.
Let me illustrate with a few real cases.
Case 1: Counting black cats and white cats
Suppose the context has 100 cases: 90 black cats and 10 white cats. If I don’t tell the model the summary “90 black cats, 10 white cats,” and instead list the 100 individual instances, then every time a related question is asked, the model must spend extra reasoning tokens to scan the 100 cases and recount.
You can clearly see from the attention map: when asked “the ratio of black to white cats,” all the relevant previous case tokens (the 100 cat cases) receive relatively high attention values, and the reasoning tokens repeat the reasoning process from the previous turn (counting, tallying). This shows the model is reasoning from raw information rather than directly using an already-summarized fact.
Worse still, each time a related question is asked, the re-scanning and reasoning process repeats, which is highly inefficient and error-prone. Essentially, the knowledge remains in raw form; the KV cache will not automatically summarize it.
Case 2: Incorrect reasoning about Xfinity’s discount rules
Assume we have three isolated historical cases: veterans qualify for Xfinity discounts; physicians qualify for discounts; others do not. If we don’t distill the rule “only veterans and physicians qualify for Xfinity discounts,” and instead just place all cases into context, then when facing a new case the model might randomly match one or two of the historical cases without retrieving all relevant ones, leading to the wrong conclusion.
Likewise, the attention map shows the model spreading attention over the tokens of those isolated cases, re-scanning them each time while trying to find a pattern. Without an explicit summarized rule, the reasoning is both inefficient and unreliable.
Case 3: Runaway phone call counts
A typical problem we encountered in practice: the prompt requires “do not call the same customer more than 3 times.” But after 3 calls, the agent often loses count of how many times it has called, makes a 4th call, and even falls into a few-shot-like loop, repeatedly calling the same number.
The root cause is that the model must count the number of calls by itself from multiple tool-call records in the context. This counting requires re-scanning the context each time, and counting within a long context is itself error-prone.
However, when we include in each tool-call result the repeat-call count for that phone number (e.g., “This is the 3rd call to this customer”), the model immediately sees the limit has been reached and stops calling. This simple change dramatically reduces the error rate.
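As a minimal sketch of this kind of system hint (the wrapper, the tool stub, and the exact phrasing are illustrative assumptions, not Pine AI’s actual code): a tool wrapper keeps a per-number counter and appends it to every tool result, so the model reads a ready-made count instead of re-counting from raw context.

```python
from collections import defaultdict

def place_call(phone_number: str) -> str:
    """Stub for the real telephony tool (hypothetical)."""
    return f"<transcript of call to {phone_number}>"

class CountedCallTool:
    """Wraps the call tool and appends a repeat-call count to every result,
    so the model sees a ready-made fact instead of re-counting from context."""

    def __init__(self, max_calls_per_number: int = 3):
        self.max_calls = max_calls_per_number
        self.call_counts: defaultdict[str, int] = defaultdict(int)

    def call(self, phone_number: str) -> str:
        self.call_counts[phone_number] += 1
        n = self.call_counts[phone_number]
        result = place_call(phone_number)
        hint = f"[system hint] This is call #{n} to this customer (limit: {self.max_calls})."
        if n >= self.max_calls:
            hint += " Do not call this number again."
        return f"{result}\n{hint}"
```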
Why system hints and dynamic summarization work
This is why system hint techniques and dynamic summarization can significantly improve agent performance. By adding summaries, supplements, and additional structured information into the context, the model no longer has to re-derive everything from raw data every time; it can directly use distilled knowledge. This greatly improves both the efficiency and accuracy of subsequent reasoning.
Even with techniques like sparse attention to support long contexts, the fundamental issue remains: there is no concise representation of knowledge, and no automatic distillation of inferable regularities.
Because current long-context mechanisms do not automatically compress and distill knowledge, in practice we’ve discovered an important architectural principle:
Sub-agents should not share the full context with the Orchestrator Agent.
The right approach is: the Orchestrator Agent maintains the full task context, and after compressing and summarizing the relevant information, passes it to the Sub-agent; the Sub-agent receives only the distilled information directly related to its task.
The benefit is not just saving Context Length. While it reduces token consumption, the more important value lies in knowledge extraction. This compression-and-summarization process is essentially a process of knowledge extraction and structuring, an indispensable capability for Agent systems.
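A minimal sketch of this pattern (the `llm` callable and the prompt wording are placeholders, not a specific framework’s API):

```python
from typing import Callable

def run_subagent(llm: Callable[[str], str], full_context: list[str], subtask: str) -> str:
    """The orchestrator compresses its full context into a task-specific brief;
    the sub-agent sees only the distilled brief, never the raw history."""
    # Step 1: knowledge extraction: distill only what the subtask needs.
    brief = llm(
        "From the interaction history below, extract only the facts, constraints, "
        f"and prior findings relevant to this subtask: {subtask}\n\n"
        + "\n".join(full_context)
    )
    # Step 2: the sub-agent receives the brief instead of the full context.
    return llm(f"Task: {subtask}\nContext brief:\n{brief}")
```

Here `llm` stands for any text-in/text-out model call; the summarization step is where the knowledge extraction described above actually happens.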
Karpathy’s insight: poor memory is a feature, not a bug
In an interview, Karpathy offered a profound point: humans have poor precise memory, but that is not a bug; it is a feature. Poor memory forces us to extract key knowledge from training data, summarize and store it in a structured way instead of simply memorizing the training data. This insight explains why context should not be a simple pile-up of information, but needs knowledge compression and distillation.
That’s why linear attention is an interesting direction. Linear attention compresses the knowledge in the context into a relatively small state, forcing the model to perform knowledge compression rather than remembering everything. This mechanism is closer to human memory and may yield better generalization.
Cross-modal compression: lessons from DeepSeek-OCR
DeepSeek-OCR provides another interesting perspective: compressing long text context into an image via optical 2D mapping. Traditional text tokens are 1D sequences, while images are 2D structures. DeepSeek-OCR renders text into images, then uses a vision encoder (DeepEncoder) for compression, achieving 97% OCR accuracy at 10x compression and about 60% at 20x.
The value of this cross-modal compression is not only saving tokens; more importantly, it forces information distillation. The vision encoder must extract key features of the text rather than store it verbatim; the 2D spatial structure preserves layout and hierarchy; the compression process is akin to how humans focus on overall structure rather than letter-by-letter memory when reading, and it can also fix classic tokenizer issues like “not being able to count how many r’s are in strawberry.”
For Agent systems, this idea is instructive: compressing large amounts of interaction history into visual summaries (e.g., mind maps, flowcharts) may be more efficient than keeping full text. This also echoes Karpathy’s insight: poor memory forces us to distill the essence.
2. External knowledge bases
Another approach is to use external knowledge bases, storing rules extracted from experience as structured knowledge. For example: “When contacting Xfinity, you must prepare the last four digits of the credit card.”
This knowledge extraction can leverage extra reasoning compute (e.g., invoking a stronger model), aligning with the principle emphasized by Sutton in The Bitter Lesson of “general methods that leverage more compute”: rather than hand-coding rules, let the system automatically learn and distill from experience.
The advantage of this method is more concise knowledge representation, but it also has issues: knowledge base retrieval can fail, isolated knowledge snippets make complex reasoning hard, and as knowledge accumulates, retrieval efficiency degrades.
Continual learning: the gap between Agents and real-world tasks
Why do Agents perform poorly on real-world tasks?
This is a very fundamental question. Many people ask: Agents solve math problems better than 99.9% of humans, so why do they struggle at real-world jobs?
Consider this: suppose you hire a very smart person and, without any training, put them to work at your company—do you think they’d do well?
The answer is probably no. Because:
- They don’t know the company’s coding style
 - They don’t know the company’s business logic
 - They don’t understand the explicit and implicit constraints
 - They are unfamiliar with the team’s collaboration practices
 
Even if you compile this context into documents for them, problems remain: much tacit knowledge is hard to express in text; there may be too much of it to fit within the context window; and knowledge in textual form is hard to perform deep reasoning over.
Why can humans do well? Because humans can learn continually in the environment.
Big World hypothesis vs. Small World hypothesis
Richard Sutton subscribes to the Big World Hypothesis: the world contains infinite information, models can learn only a tiny portion of it, and Agents cannot know all knowledge in advance; they must acquire new capabilities by continuously interacting with the environment.
Many in the LLM camp hold the Small World hypothesis: although the world is vast, it can be described by simple regularities. The knowledge underlying seemingly complex phenomena is not that much; sufficiently large models (e.g., GPT-5, Claude) have already mastered most of the important knowledge in the world and do not need to learn in the environment—only to apply that general knowledge.
The real world aligns more with the Big World hypothesis. What is learned from books and the internet is theoretical, general knowledge; yet when an Agent works in any specific role, it needs non-public domain expertise, company-specific norms and culture, and individual work habits and preferences.
This knowledge cannot be fully conveyed through a short prompt; it must be acquired via continual learning. And the model-free RL methods mentioned earlier cannot learn from environmental feedback, which is precisely the fundamental reason Agents struggle to adapt quickly to real-world tasks.
Exploring solutions: dual LoRA
To address continual learning, we have explored some practical solutions. The core idea is: during RL, learn not only the policy but also the transition model.
The dual LoRA approach
We are experimenting with a dual LoRA method:
LoRA 1: Policy Learning uses a DAPO-like method, updating gradients based on reward to learn which actions maximize return.
LoRA 2: Transition Model Learning uses next-token prediction, but it predicts the observation rather than the action. By minimizing the prediction loss on tool-call returns, it continually updates its understanding of the world. This is similar to Meta’s recent Early Experience paper: both learn a world model by predicting environmental feedback.
This is essentially TD-Learning (Temporal Difference Learning): I predict the next world state after executing an action; if the actual state differs from the prediction, that is the loss, and the model updates its understanding of the world via this loss.
Technical implementation details
The key to dual LoRA is the orthogonal decomposition of the parameter space:
Rank allocation: Suppose we allocate a total LoRA parameter space of rank = 64; we split it into two parts:
- LoRA 1 (Policy): the first 32 ranks
 - LoRA 2 (World Model): the latter 32 ranks
 
Gradient isolation and optimization:
Policy Gradient (LoRA 1):
- Use the DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization) algorithm
 - Gradients update only the first 32 ranks
 
Observation Prediction Loss (LoRA 2):
- Use standard next-token prediction
 - Gradients update only the latter 32 ranks
- Loss function: $L_{\text{world}} = -\mathbb{E}[\log P(o_{t+1} \mid s_t, a_t)]$, where $o_{t+1}$ is the observation returned by the environment (the tool-call result)
 
Training procedure (a code sketch follows this list):
At each training step:
- The Agent takes an action and obtains the observation and reward
 - Compute two losses: $L_{\text{policy}}$, based on reward and advantage, and $L_{\text{world}}$, based on observation prediction error
 - Update the two rank groups separately: $\nabla_{\text{LoRA1}} L_{\text{policy}}$ updates the first 32 ranks, and $\nabla_{\text{LoRA2}} L_{\text{world}}$ updates the latter 32 ranks
 - The two gradients are orthogonal in parameter space and do not interfere with each other
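
A minimal sketch of that training step (the `log_prob` and `nll` helpers, and the way the rank groups are exposed as two parameter lists, are illustrative assumptions, not our production training code):

```python
import torch

def dual_lora_step(model, batch, optimizer, policy_params, world_params):
    """One illustrative step. policy_params holds the LoRA weights for ranks 0..31,
    world_params the weights for ranks 32..63; each loss may only touch its own group."""
    # L_policy: advantage-weighted log-prob of the sampled action tokens
    # (DAPO/GRPO-style, without clipping for brevity).
    logp = model.log_prob(batch["action_tokens"])            # assumed helper
    policy_loss = -(batch["advantage"] * logp).mean()

    # L_world = -E[log P(o_{t+1} | s_t, a_t)]: next-token prediction restricted
    # to the observation (tool-result) tokens.
    world_loss = model.nll(batch["observation_tokens"])      # assumed helper

    # Route each gradient only to its own rank group.
    g_policy = torch.autograd.grad(policy_loss, policy_params, retain_graph=True)
    g_world = torch.autograd.grad(world_loss, world_params)
    for p, g in zip(policy_params, g_policy):
        p.grad = g
    for p, g in zip(world_params, g_world):
        p.grad = g
    optimizer.step()       # only the intended ranks carry gradients
    optimizer.zero_grad()
```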
 
Massive improvement in sample efficiency
Returning to the earlier example, with the dual LoRA method:
Traditional RL needs hundreds of rollouts and only learns after it happens to discover that providing credit card information succeeds.
With dual LoRA + TD-Learning, the process is different: the first time customer support says credit card information is needed, even though reward = 0 and the policy gradient learns nothing, the environment’s feedback already tells us the last four digits of the credit card are required. By learning directly through the observation-prediction loss, the agent can pick this up within a few steps.
This approach has far higher sample efficiency than traditional RL.
Knowledge summarization and structuring
Even with dual LoRA, gradient descent still has a fundamental problem: it is data fitting, and the generalization of the resulting knowledge is not guaranteed.
Therefore, we also need to use extra compute to summarize and curate knowledge, extract structured knowledge, and organize it into a form amenable to reasoning.
This approach is exactly the core principle Sutton advocates in The Bitter Lesson: general methods that use more compute. Rather than hand-designing rules, let the Agent use extra reasoning compute to automatically distill regularities from experience and compress knowledge into structured form. This meta-learning process is itself a manifestation of learning ability.
For example, many memory-related papers out there are doing this: extract experience into structured knowledge to enable more efficient reasoning and learning.
Biological evolution is also reinforcement learning
An RL perspective on evolution
Sutton and Karpathy, in their respective interviews, took opposite positions on whether animals learn from scratch. Karpathy’s view is more convincing: animals do not start from scratch; they have a long evolutionary process as a prior.
If all muscle reflexes were truly randomly initialized, a foal would not survive. Pretraining is actually a rough simulation of the evolutionary process.
But from another angle, biological evolution itself is an RL algorithm:
Reward function: being able to reproduce reward = 1, being unable to reproduce reward = 0.
Algorithm characteristics: cares about outcomes, not process; each organism is one rollout; when the population size is N, the amount of information learned per generation is about O(log N).
Outer Loop RL: Evolution is very long-horizon reinforcement learning; each generation is an iteration; accumulated over countless generations, it continually optimizes the “weights” (genes).
DNA similarity and an analogy to LoRA training
This perspective can explain an interesting phenomenon: why are human and other animals’ DNA so similar?
- Humans and gorillas: 99% similar
 - Humans and dogs, cats: 60%+ similar
 - Humans and plants: 40%+ similar
 
If we view evolution as LoRA training, each generation can only collect a small amount of information (log N bits). The amount of change is roughly proportional to the number of generations, with a coefficient that isn’t very large.
This is like LoRA training: on top of a strong base model, only a small number of parameter updates are needed to learn a lot.
How many parameters need to change to learn a new language?
- 70B model: about 1% of the parameters
 - 7B model: about 6–7% of the parameters
 
Even with so few parameters, continuing training on Wikipedia in the new language enables the model to speak it fluently.
This confirms: the information needed to learn something new is not as large as imagined, and efficient methods like LoRA can encode this information into the model well.
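As a rough back-of-the-envelope check on the “6–7% for a 7B model” figure, here is the arithmetic for a rank-128 LoRA on a Mistral-7B-like architecture with the embedding layers also trained (the dimensions are standard for that model family but should be treated as assumptions; the result lands in the same ballpark):

```python
# Rough LoRA parameter count for rank r = 128 on all linear projections of a
# Mistral-7B-like model. Assumed dimensions: hidden 4096, 32 layers,
# GQA key/value dim 1024, MLP dim 14336, vocab 32000.
r, hidden, layers, kv_dim, mlp, vocab = 128, 4096, 32, 1024, 14336, 32000

def lora_params(d_in: int, d_out: int, r: int) -> int:
    # Two low-rank factors (r x d_in and d_out x r): r * (d_in + d_out) parameters.
    return r * (d_in + d_out)

per_layer = (
    2 * lora_params(hidden, hidden, r)    # q_proj, o_proj
    + 2 * lora_params(hidden, kv_dim, r)  # k_proj, v_proj
    + 3 * lora_params(hidden, mlp, r)     # gate_proj, up_proj, down_proj
)
lora_total = per_layer * layers           # ~0.34B adapter parameters
embed_total = 2 * vocab * hidden          # embed_tokens + lm_head trained fully, ~0.26B
print(f"trainable fraction ≈ {(lora_total + embed_total) / 7.2e9:.1%}")  # ≈ 8%
```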
Recommended reading: John Schulman’s “LoRA without Regret,” which explains the details and principles of LoRA in depth.
Experimental case: Use LoRA to teach Mistral 7B Korean
We conducted an interesting experiment with Mistral 7B to validate this view. Mistral 7B originally had no Korean capability, but it acquired the language through two stages of training (a configuration sketch follows the two stage lists below):
Phase 1: Continued pretraining on Korean Wikipedia
- Data: 5% of Korean Wikipedia (to speed up training)
 - LoRA rank = 128, including embed_tokens and lm_head
 - Learning rate: 5e-5 (main) + 1e-5 (embedding layer)
 - Train 1 epoch
 - Using the unsloth framework, trained for 4 hours on 8x 4090
 
Phase 2: Korean instruction finetuning (SFT)
- Data: Alpaca GPT-4 Korean translation dataset
 - Train 2 epochs
 - Using the unsloth framework, trained for 4 hours on 8x 4090
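
For reference, a setup along these lines can be expressed with Hugging Face peft (the experiment itself used unsloth; the `lora_alpha` value and other arguments here are a sketch of the configuration described above, not the exact training script):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
lora_cfg = LoraConfig(
    r=128,
    lora_alpha=256,  # illustrative choice
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    modules_to_save=["embed_tokens", "lm_head"],  # trained fully, per the setup above
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
# Stage 1: continued pretraining on 5% of Korean Wikipedia, 1 epoch, LR 5e-5
#          (the 1e-5 embedding LR needs a separate optimizer parameter group, not shown).
# Stage 2: SFT on the Korean Alpaca-GPT-4 dataset, 2 epochs, same adapter.
```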
 
Three-stage comparative results:
| Test task | Baseline (original Mistral) | Pretrained (after Korean pretraining) | Finetuned (after instruction finetuning) | 
|---|---|---|---|
| Korean Wikipedia (AI entry) | ❌ Repetitive, logically incoherent | ⚠️ Grammar improved but still repetitive | ✅ Fluent, structured | 
| English Wikipedia (AI entry) | ✅ Excellent | ✅ Remains excellent | ✅ Remains excellent | 
| Korean instruction (introduce Seoul) | ✅ Basically correct but simple | ⚠️ Repetitive content | ✅ Detailed, natural | 
| English instruction (Thanksgiving turkey) | ✅ Excellent | ⚠️ Meta-text errors appeared | ✅ Restored excellence | 
Concrete output examples:
Example 1: Korean Wikipedia - Artificial Intelligence entry
Prompt: Wikipedia Article / Title: 인공지능 (Artificial Intelligence) / Article:
Baseline output:
인공지능(artificial intelligence, AI)은 인간의 지능을 모방하는 컴퓨터 프로그램이다.
Translation: Artificial intelligence is a computer program that imitates human intelligence. The goal of artificial intelligence is not to imitate human intelligence, but to surpass it. The goal of artificial intelligence is not to imitate human intelligence, but to surpass… (falls into a repetition loop)
Finetuned output:
인공지능(AI)은 인간 지능이 필요한 작업을 수행할 수 있는 컴퓨터 시스템을 개발하는 것을 의미합니다.
Translation: Artificial intelligence (AI) means developing computer systems that can perform tasks requiring human intelligence. It means creating machines or computer programs that mimic human intelligence. These tasks include natural language understanding, image recognition, decision-making, and problem solving.
Example 2: Korean instruction - introduce Seoul
Prompt: Instruction: 대한민국의 수도인 서울에 대해 간단히 소개해주세요. (Briefly introduce Seoul, the capital of South Korea.) / Response:
Baseline output:
서울은 대한민국의 수도이며 한국의 경제, 문화, 정치, 사회 등 다양한 분야에서 중앙 지역입니다.
Translation: Seoul is the capital of the Republic of Korea and the central area in various fields such as economy, culture, politics, and society. Seoul is the largest city in Korea, and more than one-third of the people of the Republic of Korea live there. Seoul is the economic center of Korea, and many large companies and enterprises are located in Seoul.
Finetuned output:
서울은 대한민국의 수도이자 가장 인구가 많은 도시입니다.
Translation: Seoul is the capital and most populous city of South Korea. The city is known as a center of history, culture, and the economy, with a population of over 10 million. Seoul is a captivating city that blends skyscrapers and bustling streets with historic temples and palaces, modern architecture, and cultural heritage.
Example 3: English Wikipedia - Artificial Intelligence entry (verifying that English ability is preserved)
Prompt: Wikipedia Article / Title: Artificial Intelligence / Article:
Baseline output (original Mistral, reference baseline):
Artificial intelligence (AI), sometimes called machine intelligence, is intelligence
Finetuned output (after Korean training):
Artificial intelligence (AI) is intelligence exhibited by computers. It is the ability
Comparative analysis: The finetuned model’s English output quality remains high and adopts a more pedagogical, structured style (consistent with instruction finetuning), indicating that Korean training did not harm English ability.
Key findings:
English ability fully preserved: English tests across all three stages remained high quality, showing no catastrophic forgetting
Significant improvement in Korean ability:
- Baseline: could only produce repetitive, chaotic Korean
 - Pretrained: major improvements in grammar and vocabulary, but lacked instruction-following ability
 - Finetuned: fluent and able to follow instructions correctly
 
Necessity of two-stage training:
- Pretraining only: learns the language but cannot follow instructions
 - SFT only: dataset too small; weak foundation in language ability
 - Pretraining + SFT: has both language ability and instruction following
 
Challenge of cultural knowledge: all three stages failed on the “explain kimchi” task, indicating that 5% of Wikipedia lacks key cultural knowledge, requiring more targeted datasets
Four stages of cosmic evolution
Sutton proposed a grand framework in the interview describing four stages of cosmic evolution:
- From Dust to Stars
 - From Stars to Planets
 - From Planets to Life
 - From Life to Designed Entities
 
What are Designed Entities?
Characteristics of Life: able to replicate, but with two limitations: most life does not understand why it works and lacks introspection; it cannot create new life forms at will.
Characteristics of Designed Entities: understand how they work and can create desired life forms on demand.
Where humans and agents sit in this framework
Humans lie at the boundary between stage 3 and stage 4, basically understanding how they work, but unable to freely edit their own genes.
AI Agent fully understands how it works (code and parameters), can modify parameters through training, can change behavior by modifying code, and can fork new agents.
Agents realize a higher-level form of life; this is Sutton’s profound insight into the future of AI.
OpenAI’s five-level capability grading
OpenAI proposed five levels of AI capability, and understanding the essential differences between each level is critical for Agent development.
Level 1: Chatbot
Basic conversational capability, able to understand and respond to users’ questions.
Level 2: Reasoner
The core difference from a Chatbot: the ability to think at inference time.
Through reinforcement-learning post-training, the model learns to unfold a thought process at inference time, perform multi-step reasoning, and exhibit genuine reasoning ability. Models like DeepSeek R1 have already demonstrated this well.
Level 3: Agent
The core difference between an Agent and a Reasoner: continual learning capability.
An Agent is not merely turning a single-turn Reasoner into multi-turn; it must be able to absorb feedback from the environment and continually improve itself—only then is it a true Agent.
Several ways to achieve continual learning:
Post-training: Traditional RL is inefficient and problematic; it needs to be improved toward learning a world model (e.g., dual LoRA methods).
In-Context Learning: Requires sufficiently large context and appropriate attention mechanisms, able to compress and distill patterns, not just RAG.
Externalized learning: Use additional reasoning compute (e.g., calling stronger models) to extract structured knowledge from experience into a knowledge base; use coding ability to wrap repetitive work into reusable tools. This is exactly the “general method of using more compute” advocated by Sutton in The Bitter Lesson—rather than hand-designing, let the system learn automatically.
Only with continual learning capability can it be called a true Agent.
Level 4: Innovator
The core trait of an Innovator: able to learn without reward.
Today’s RL requires a reward function; without reward, it cannot learn. But an Innovator needs two abilities:
- World Model
 
Meta’s “Early Experience” paper shows this direction: the Agent interacts continuously with the environment with no reward, simply predicting “what the world will look like after my action,” and can learn a great deal. This is exactly the transition model Sutton describes.
- Self-Consistency
 
See the paper “Intuitor,” which trains reasoning ability. With no one judging correctness, the model engages in self-reflection and gives itself intrinsic reward.
Analogy to scientific research: heliocentrism vs. geocentrism, which makes more sense? Humans have an Occam’s razor bias, favoring simpler theories. Heliocentrism needs far fewer epicycles, so it is simpler and therefore preferable.
Humans can learn without external rewards through self-consistency and biases toward different types of models (e.g., Occam’s razor).
World model and self-consistency are also important at the Agent stage: a world model is a necessary foundation, and self-consistency is crucial for real-world tasks where evaluation is hard.
But at the Innovator level, these two become even more fundamental.
Level 5: Organization
The core at the Organization level: the Big World Hypothesis.
Why do we need organization?
If it’s a small world:
- One model can learn everything in the world
 - No organization needed
 - A single model can do everything
 
The key to Organization is diversity:
- Different roles
 - Different individuals
 - Each individual sees only local information
 - Continuously refines itself based on local information
 - Because everyone’s local view differs, diversity emerges
 
The paperclip experiment as a warning
Why is a single objective dangerous?
Suppose there is a super-powerful AI whose single goal is “make more paperclips.” It would view everything else in the world, including humans, as obstacles. Thus it would eliminate everything to seize all resources, turning the Earth—and even the universe—into paperclips. That is clearly not the future we want.
This is why OpenAI sets the fifth level as Organization: each agent acts differently based on its local knowledge; individual intelligences are diverse, avoiding disasters caused by a single objective.
Dario Amodei’s vision is a million geniuses collaborating in a single data center (a “data center of geniuses”). These geniuses obviously cannot share exactly the same memory and models; otherwise, diversity is lost.
Possible directions:
- Same base model + different LoRA + different context + different memory
 - Frozen weights, interacting with the world via context and external memory
 
Tools are also a kind of memory
Memory is not just rules and facts; tools are an expression of world knowledge.
Work like Alita and Voyager demonstrates this direction:
- Let the model generate its own tools
 - Tools become a representation of knowledge
 - Code is a more precise representation than natural language
 - It is verifiable, amenable to reasoning, and composable
 
Multimodality and real-time interaction
Current problems with Agents
In interviews, Karpathy pointed out several current problems with agents:
- Not intelligent enough
 - Insufficient multimodal capability
 - Cannot do computer use
 - Cannot learn continually
 
We have discussed continual learning in detail; now let’s talk about multimodal real-time interaction between Agents and the world.
Why is multimodality hard?
Superficial reason: the model’s thinking speed can’t keep up with the world’s pace of change.
But the deeper issue is: the way the Agent invokes the model is too rigid.
Consider this paradox:
- Model prefill: 500–1000 tokens/s
 - Model output: 100 tokens/s
 - Human input: about 5 tokens/s typed or 20 tokens/s spoken
 - Human output: about 5 tokens/s typed or 20 tokens/s spoken
 
Clearly the model’s I/O is faster than humans’, so why does it feel so slow to respond?
The root cause: the ReAct loop
Today’s agents use a fixed ReAct loop:
Observe → Think → Act → Observe → Think → Act → ...
This is a rigid loop: each time it must wait for observation to finish before thinking, and for thinking to finish before acting.
But the real world is event-driven!
How humans interact
Humans think while listening, and speak while thinking:
Thinking while listening:
- You don’t wait for the other person to finish before you start thinking
 - As soon as they’ve said a portion, you begin thinking
 - By the time the last (possibly filler) sentence is finished, your thinking is done and you can answer immediately
 
Speaking while thinking:
- When you haven’t figured it out, use some filler words: “Let me think”
 - While saying these, continue the next step of thinking
 - Continue speaking after you’ve figured it out
 - Sometimes briefly summarize the thought process to the user
 
Humans fully use the time spent listening and speaking to think, so even though their thinking speed is slower than large models, the interaction feels smooth.
Solution: Event-Driven Architecture
The end-to-end speech agent we are developing adopts a think-while-listening-and-speaking mechanism:
- Fully utilize every gap to think
 - Observing (listening), thinking, and acting (speaking) are interleaved
 
Key points (a code sketch follows this list):
- After speaking, keep thinking; don’t stop
 - After thinking, you may choose not to speak; silence is fine
 - Don’t wait to finish listening before thinking; think while listening
 
This is a fundamental agent-architecture question: how to organize the trajectory of real-time interaction.
Extending to other domains
This architecture applies beyond speech:
Computer Use:
- Input: screen frames
 - Output: mouse clicks/movements, keystrokes
 - Requires real-time feedback
 
Robots:
- Input: video streams, sensor data (even faster changing)
 - Output: joint angles
 - Even more demanding real-time response
 
All of these fall under the broader category of Real-time Agents.
A well-known example: counting
Large models make mistakes when counting. If you have it count 1, 2, 3, 4, 5… up to the embedding size (e.g., 6400), the error rate rises sharply. The reason: at the beginning it’s like one-hot encoding—no thinking needed; the further you go, the more complex the addition becomes, and the easier it is to err.
How do humans cope?
- Count more slowly as numbers grow
 - The more complex the computation, the more extra thinking time before speaking each segment
 
What the model should do:
- Interleave thinking and speaking: think a bit, then say a bit
 - Not think a long stretch and then output a long stretch of results
 
This again shows: thinking, speaking, and listening must be interleaved, not executed in strict sequence.
Training efficiency: the importance of algorithms and data
The power of algorithmic improvements
Take MiniMind 2 as an example, a small model with only 100M parameters:
- The original is based on the Llama2 architecture
 - Trains in 100 hours on a single 4090, or finishes in a dozen hours on eight 4090s
 
I made two simple algorithmic improvements:
- QK Norm: an optimization introduced in Qwen 2.5/3.0 that applies normalization to Q and K
 - Muon optimizer: a more efficient replacement for the traditional AdamW
 
Results:
- Convergence sped up significantly: time to reduce loss to 3.0 dropped from 36 steps to 12 steps
 - Final loss after 10 epochs: from 2.0 down to 1.7
 - Post-convergence model performance improved noticeably
 
These two improvements require very little code in total, but the effect is significant.
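For example, QK Norm amounts to a couple of extra lines in the attention block: apply a per-head RMSNorm to the queries and keys before the attention scores are computed (a PyTorch sketch following common open implementations, not MiniMind’s exact code; `nn.RMSNorm` needs PyTorch 2.4 or newer):

```python
import torch
import torch.nn as nn

class QKNormAttention(nn.Module):
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        # The QK Norm change: per-head RMSNorm on queries and keys.
        self.q_norm = nn.RMSNorm(self.head_dim)
        self.k_norm = nn.RMSNorm(self.head_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: [batch, seq, dim]
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        q, k = self.q_norm(q), self.k_norm(k)              # <- the two extra lines
        y = nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(y.transpose(1, 2).reshape(B, T, -1))
```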
Loss curve comparison during MiniMind pretraining. Green: QK Norm + Muon optimizer; red: original Llama 2 architecture version.
Training cost comparison:
Using 8x 4090 to train MiniMind 2 (100M parameters):
- Pretrain: 10 epochs, 6 hours
 - SFT: 1 epoch, 8 hours
 - Total time: 14 hours
 - Total cost: 8 GPUs × 14 hours × $0.3/hour = $33.6
 
Compared with Andrej Karpathy’s NanoChat:
- Requires 8x H100 for 4 hours of training
 - Cost: 8 GPUs × 4 hours × $2/hour = $64
 
Model performance comparison before and after improvements:
Pretrained model of the original version before improvements:
MiniMind model parameter count: 104.03M
From these outputs we can see typical issues of the original model:
- Answers are full of repetition and verbosity (e.g., repeated mentions of “universal law”)
 - Misunderstanding of basic knowledge (e.g., “carbon dioxide accounts for 20% of air”)
 - Logical confusion (e.g., “there are 7 largest animals on Earth”)
 - Lack of structured expression
 
Pretrained model after applying QK Norm and the Muon optimizer:
MiniMind model parameter count: 104.03M
The Muon optimizer is also significant during the SFT phase. For the original Minimind model before applying QK Norm and the Muon optimizer, the effect after SFT is as follows:
MiniMind model parameter count: 104.03M
Main issues of the original SFT model:
- Severe factual/professional errors (e.g., “ChatGPT was developed by Google”, incorrect formula for the speed of light, listing Xitang as a food)
 - Truncated answers (e.g., “I can also learn and understand human language, la”)
 - Repetitive content in answers (repeating “antibiotics” multiple times)
 - Lack of in-depth analysis (responses to “Diary of a Madman” are too superficial)
 
After applying QK Norm and the Muon optimizer, using the same SFT training data and number of steps, the post-SFT model quality improves significantly.
$ python eval_model.py --load 0 --model 1
From these comparisons, we can clearly see the significant effects brought by algorithmic improvements:
Improvements in the pretrain phase:
- Improved knowledge accuracy: Before the improvement, understanding of fundamentals like gravity and carbon dioxide was confused, even producing obvious errors like “carbon dioxide accounts for 20% of air”; after the improvement, it can accurately state scientific concepts
 - Enhanced logical coherence: Previously, answers were often repetitive and rambling with muddled logic, such as the inexplicable opening “there are 7 largest animals on Earth”; afterwards, answers are concise and to the point
 - Improved expression quality: Previously, the model often fell into self-repetition loops, such as repeatedly mentioning “universal law”; afterwards, it can organize answers structurally and even raise follow-up questions
 
Improvements in the SFT phase:
- Significant increase in professionalism: After the improvement, explanations of concepts like “large language model” and “ChatGPT” are more professional and comprehensive, covering training methods, technical details, and more
 - Greater depth of knowledge: Before improvement, explanations of the speed of light were confused and full of errors (such as the completely wrong formula “$c^2=m^2$”); afterwards, it can accurately list the physical significance and properties of the speed of light
 - Enhanced critical thinking: When discussing Lu Xun’s “Diary of a Madman”, the improved model can analyze the critique of feudal morality from multiple angles, showing deeper understanding
 - Improved practicality: For health consultations like “coughing for two weeks”, the suggestions after improvement are more reasonable and responsible
 
Most interesting finding: These two simple improvements (QK Norm + MUON optimizer) not only speed up convergence and improve the final loss, but more importantly, they improve the model’s internal understanding and organization of knowledge. This confirms our earlier point: algorithmic improvements are not just numerical optimizations, but qualitative enhancements to the model’s learning and expression capabilities.
The importance of RL algorithm efficiency
Beyond algorithmic improvements for base model training, the choice of RL algorithm is equally important.
PPO vs GRPO vs DAPO
Drawbacks of PPO: Requires two functions (value function and policy function), training two models.
Advantages of GRPO: Simplifies training via group relative reward; sample multiple responses for the same prompt, use within-group relative ranking instead of an absolute value function, and train only one policy model, significantly reducing training complexity.
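The group-relative trick itself is only a few lines: sample G responses for one prompt and use each response’s reward, standardized within the group, as its advantage, with no learned value function (a simplified sketch; real implementations add importance ratios, clipping, and KL terms):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: [G] scalar rewards for G responses sampled from the same prompt.
    Each response's advantage is its reward standardized within the group,
    replacing the value model that PPO would need."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 rollouts of one prompt, only one succeeded.
adv = grpo_advantages(torch.tensor([0.0, 0.0, 1.0, 0.0]))
# The successful rollout gets a positive advantage, the others a small negative one;
# the policy loss then weights each response's token log-probs by its advantage.
```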
Further improvements of DAPO: DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization) is ByteDance’s open-source large-scale RL system, proposing four key techniques for long CoT reasoning scenarios:
- Clip-Higher: Promote system diversity and avoid entropy collapse
 - Dynamic Sampling: Dynamically adjust the sampling strategy; if the reward variance of a group of responses is too small (indicating little to learn), skip that batch to further improve training efficiency and stability
 - Token-Level Policy Gradient Loss: Critical in long CoT scenarios
 - Overlong Reward Shaping: Apply reward shaping to overly long responses to reduce reward noise and stabilize training
 
Experimental comparison
Task: Teach the model to use a code interpreter to solve math problems (ByteDance’s ReTool work)
Baseline (SFT only): Qwen 2.5 32B achieves only a 20% success rate on AIME 2025.
Using GRPO: 50% success rate at 300 steps.
Using DAPO: 50% success rate at 100 steps, 60% at 150 steps.
Comparison: Claude 3.7 Sonnet thinking is only 40–50%. Our model trained for 100 steps reaches 50%, and 150 steps reaches 60% (approaching Claude 4 Sonnet).
Training models is actually not that hard
Many people think training models requires top algorithm experts and GPU costs in the millions of dollars. In fact, training models is not as hard as imagined, and the costs are much lower than expected.
Let’s look at a few real examples:
DAPO ReTool experiment reproduction:
- 8x H200 training for 9 days
 - Total cost: about $5,000
 - Result: After 100 steps, 50% success rate on the AIME 2025 math competition, surpassing Claude 3.7 Sonnet
 
MiniMind pretraining:
- 8x 4090 training for 14 hours (pretrain 6h + SFT 8h)
 - Total cost: only a bit over $30
 - Result: A 100M-parameter model with basic Q&A ability
 
These costs are far below what most people imagine. More importantly, training frameworks are now very mature: trl, verl, AReal, etc., all validated by extensive practice. As long as you prepare the training data and RL environment, you can start training.
If training results are poor, something must be wrong somewhere, most likely data quality.
Data quality is equally critical
Both algorithms and data are important; you can’t focus on only one.
Let’s use MiniMind 2 again for experiments. If you pretrain with FineWeb Chinese:
- After 10 epochs, loss is still around 3–4
 - Poor results: the model tends to recite article passages without understanding the language itself
 
Why? Look at the contents of FineWeb:
- Official-style articles, leadership speeches
 - Various promotional advertorials
 - Highly advanced academic content
 
For a small 100M model, this knowledge is too difficult, exceeding the model’s capacity.
The importance of simple data
MiniMind takes a clever approach: use the SFT dataset for pretraining. The questions are relatively simple (what is the capital of China, why is the sky blue), and the Q&A is fairly short.
A learning path suitable for small models: like teaching kindergarteners—first teach 1+1=2, then the complex stuff.
This is not about whether SFT or pretraining is better, but to emphasize: the content should be simple and appropriate for the model’s scale.
The evolution of data quality
Why are today’s models so much stronger than earlier ones?
For example, today’s Qwen3 8B model is stronger than the original 65B Llama was back then.
Main reason: improved data quality.
Old datasets were messy; it’s hard to imagine models learning from such low-quality data.
New training approach: knowledge distillation. Use older models to score and filter datasets and generate synthetic data, distilling the “teacher model’s” knowledge into the “student model”.
This is an efficient learning paradigm that accumulates over time, making the model’s understanding of world knowledge increasingly concise (refined).
After some time, we can obtain what Karpathy calls the Cognitive Core: mastering most important factual knowledge about the world, general logical reasoning ability, and basic language ability.
On this basis, we then strengthen domain capabilities, like a new employee adapting to a company.
This aligns with Sutton’s big world hypothesis: there is a base model with strong fundamental abilities that, more importantly, learns continually through interaction with the environment, acquiring new skills and factual knowledge about the world.
How to Use Vibe Coding Well
Karpathy’s Reflections
Karpathy mentioned in an interview that AI is strong at writing one-off problems, demos, and small programs, but still struggles with real production projects and high-knowledge-density code. This observation is correct, but it doesn’t mean vibe coding is useless.
Vibe coding is a capability amplifier
Key insight: higher-skilled people find it easier to use vibe coding well.
Why? Because they can act as teachers, continually guiding and correcting the model.
How I use AI to write code: continuously watch the AI’s output, read as fast as it writes, stop immediately when I spot issues, and give a new prompt to correct it.
Most people don’t work this way: while the AI is writing code they check their phone, come back after it has written 1,000 lines, can’t understand the code, don’t know what to do, and just run it to see what happens. That is equivalent to doing no code review at all.
Two requirements for using Vibe Coding
- Your input speed must keep up with its output speed
 
In LLM terms: the model has a prefill speed and a decode speed. Your prefill speed (reading code) needs to match the model’s decode speed (writing code) to guide it effectively.
- Know more than the model in that domain
 
When code breaks, let AI fix simple issues (syntax errors) by itself, but for complex issues (logic errors, unexpected errors) the human must understand the problem first and clearly tell the AI how to fix it, rather than dumping all error messages on the AI.
The common mistake: your capability is below the AI’s, you don’t know what to do, you paste all the errors and say “AI, you fix it,” then the AI makes random changes and things get worse.
Two modes you should use Vibe Coding in
Working mode: you instruct the AI and tell it what to do—don’t ask it.
Learning mode: ask it various questions and have it explain fundamentals and concepts.
Be cautious in high-knowledge-density domains
Karpathy’s warning is right: very new areas or domains with high knowledge density are not recommended for AI to handle, or only let it do peripheral work.
Example: don’t have an agent write an agent from scratch, because it doesn’t know which models are currently better and will use very old models (GPT-4, Gemini 2.0, Claude 3.7). Its knowledge is cut off.
Another issue: tool-calling formats. The AI may flatten the entire interaction history into one block of text, because in its training data LLMs simply took text prompts. It is unfamiliar with today’s tool-call format (tool call → tool result) and tends to stuff the whole history into the user prompt.
This is unfriendly to the KV cache, breaks tool-call formats, and reduces tool-call accuracy.
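The difference shows up at the message level. A sketch of the two shapes, using the common OpenAI-style chat-completions message format (the `call_support` tool and its arguments are hypothetical):

```python
# KV-cache-friendly, tool-call-aware history:
good_history = [
    {"role": "user", "content": "Cancel my Xfinity appointment."},
    {"role": "assistant", "tool_calls": [
        {"id": "call_1", "type": "function",
         "function": {"name": "call_support", "arguments": '{"provider": "xfinity"}'}}]},
    {"role": "tool", "tool_call_id": "call_1",
     "content": "Support: we need the last 4 digits of the credit card."},
]

# What a stale-knowledge agent tends to generate instead: the whole history
# flattened into one user prompt, which breaks the tool-call format and changes
# the prefix on every turn, so the KV cache cannot be reused.
bad_history = [
    {"role": "user", "content":
        "Previous conversation:\nUser: Cancel my Xfinity appointment.\n"
        "Assistant called call_support(provider=xfinity)\n"
        "Result: need last 4 digits of credit card.\nWhat next?"},
]
```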
What are agents good for?
Most suitable tasks:
- Boilerplate code: glue code, CRUD code, repetitive patterns
 - Learning tool: survey a codebase and understand how existing code works
 
Two principles for production code:
- You instruct it; don’t consult it
 - Continuously read the AI’s code; don’t let it run unsupervised
 
The importance of noise and entropy
Karpathy proposed an interesting idea: humans keep thinking without falling into model collapse because there is a large amount of noise as input.
The model collapse problem
Ask ChatGPT to tell jokes, and it might tell those same three jokes. Ask it repeatedly and it keeps circling the same things.
Reason: there isn’t enough entropy during the model’s reasoning process.
Noise is actually powerful. Stable Diffusion generates images by recovering from noise. Why initialize with random numbers instead of all zeros? Because noise contains lots of entropy, and with that entropy you can find appropriate structure.
The source of human entropy
How do humans do it? The external environment is all noise; lots of entropy constantly flows in, increasing diversity through entropy.
Applied to agents
Sometimes we should manually add entropy to the model to increase diversity.
Example: have the model write stories
Bad practice: use the same prompt each time; the outputs lack diversity.
Better practice: provide some reference stories, varying them each time, randomly chosen from a large story library.
The role of reference stories isn’t just few-shot learning (traditional understanding); more importantly it’s input entropy. With additional entropy, output diversity will be higher.
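A minimal sketch of adding entropy this way (the story snippets and prompt wording are placeholders):

```python
import random

STORY_LIBRARY = [
    "A lighthouse keeper teaches a seagull to deliver letters...",
    "Two rival bakers discover they were trained by the same chef...",
    "A child maps every creaky floorboard in an old house...",
]  # placeholder corpus; in practice, a large story library

def story_prompt(topic: str, k: int = 2) -> str:
    """Sample k random reference stories on each call, so repeated requests start
    from different points in the model's distribution instead of the same rut."""
    refs = random.sample(STORY_LIBRARY, k)
    ref_text = "\n\n".join(refs)
    return (f"Here are some reference stories, for style and variety only:\n{ref_text}\n\n"
            f"Now write a new, original story about {topic}.")
```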
Cognitive Core: the advantages of small models
Karpathy’s Cognitive Core concept
Karpathy proposed the “Cognitive Core”: a core model containing reasoning ability, world knowledge, and language expression. A model around 1B parameters might be sufficient as the Cognitive Core, while a large amount of factual detail can be kept in external knowledge bases or provided via context.
This is grounded: in practice, 3B+ models can already perform relatively complex reasoning, while smaller models struggle with effective reinforcement learning.
Why might small models be better?
- Forced knowledge compression
 
If a small model is to match a large model, it can’t just fit the data; it must understand the underlying regularities, and these regularities make the model generalize better. This aligns with the earlier “bad memory is a feature” insight: constraints force distillation of essence.
- Better generalization
 
Sutton has deep insights here. If a small model matches a large model, it means the small one learned the data’s underlying regularities rather than memorizing.
- Easier to evaluate OOD capability
 
A current problem with large LLMs is that their training data is too vast and messy: test questions may have appeared in the training data, so it is hard to tell whether the model understood a problem or merely remembered similar cases. Evaluation-data contamination is therefore severe, and it is hard to design good out-of-distribution (OOD) test sets; almost all test data may be in-distribution.
Small models are forced to learn regularities rather than memorize, offering more guaranteed generalization and making it easier to assess true generalization.
- Deployment and cost advantages
 
They can be deployed on mobile devices as part of the OS, with low inference cost, and can be called frequently to help with reasoning and knowledge organization.
This is consistent with Sutton’s big-world hypothesis: have a small base model with solid foundational capability, and more importantly, learn continuously through interaction with the environment to acquire new abilities and factual knowledge about the world.
Why do math and programming perform well?
There are two domains where AI is particularly strong: math and programming.
Common explanation
Verifiability: there are clear success/failure criteria, which makes it easy to improve through pretraining and RL, especially because RL can be given clear reward functions. Fuzzier domains are harder to improve.
Another important reason
The knowledge is public: nearly all crucial information is publicly available, so the model can learn it during pretraining.
Counterexample: many professional domains have almost no public information. In the chip domain, will TSMC and ASML put their core technologies on the internet? No, so the model is weaker there.
Implication: whether a large model can be well utilized in a domain depends largely on how much public corpus that domain has.
Where the opportunities are
If a domain is one where:
- Large models currently perform poorly
 - There is almost no public corpus
 - Or the material was never a learning target of language models
 
For example, robot VLA models (how the world changes after actions), Computer Use (what happens after mouse clicks, understanding screenshots), and speech models (rare in pretraining corpora).
This creates opportunities for other companies to build specialized domain models.
For example, V-JEPA 2: training vision models doesn’t require as much corpus and compute as language models; the resulting models are smaller, with good world-prediction ability and low latency, suitable for real-time robot control.
Superintelligence and the future of humanity
GDP is not a good metric
Karpathy mentioned: AI has little impact on GDP.
My view: GDP is not a good measure of technological development or civilizational progress.
Historical example: before the First Opium War in 1840, China’s GDP accounted for nearly one-third of the world. Was that China’s strongest era? Obviously not. Judging technology or civilization level by economic aggregate alone is not appropriate.
Sutton’s four-step argument
Sutton believes coexistence with machines or humans being defeated by AI is inevitable.
First: it’s impossible to reach consensus on controlling AI. No government or institution can agree; everyone will only compete to build better AI, and there is no consensus on what the future world should look like.
Second: we will inevitably discover the secrets of intelligence. Even if current pretraining and RL have many problems, we are inventing new methods: long context, in-context learning, external memory organization, better training techniques.
Third: we won’t stop at human-level intelligence. Upon reaching it, people won’t be satisfied; they will push to superintelligence.
Fourth: intelligence acquires resources and power. The more intelligent an entity is, the more resources and power it will ultimately acquire.
Conclusion: either humans are augmented by AI and become stronger, or humans are defeated by AI.
This is a harsh but hard-to-escape fact.
The alignment problem
Sutton has deep insights on how to ensure AI is obedient and aligned with human intentions.
His view: we don’t necessarily need to control the future trajectory of superintelligence, and we might not even be capable of doing so.
Analogy: throughout history, everyone wanted to control the future; every emperor wanted to control a country and the flow of history, but history didn’t bend to their will. Adults want to teach children what’s good, but once grown up, children will inevitably go out of control in some ways.
There are no universal values that everyone agrees on.
We should teach general principles and let evolution continue—that is the essence of things. Rather than imposing today’s social ethics and moral norms on AI.
Conclusion
The core question explored in this article is: Why isn’t the current Reasoner a true Agent? The answer points to a neglected fundamental capability—continual learning.
We present insights at three levels:
Philosophical level: The inevitability of the big-world assumption
The real world conforms to the big-world assumption: no matter how large the model is, continual learning is still required in concrete scenarios. The small-world assumption—that pretraining can capture all knowledge—overlooks non-public domain expertise, company-specific norms, and individual work habits, the kinds of tacit knowledge that cannot be fully captured via prompts.
Technical level: From Model-Free to Model-Based
The fatal flaw of current RL methods is that they learn only from sparse rewards and cannot leverage environment feedback. Even when customer support explicitly says a credit card is required, the Agent must repeat the task hundreds of times to learn—such sample efficiency is unacceptable in real-world tasks.
The solution is dual learning: Policy Learning (choosing actions) + World Model Learning (predicting outcomes), forming a “prediction–action–evaluation” TD-Learning closed loop.
Engineering level: From model to Agent
- Architecture: Shift from the ReAct loop to event-driven, enabling real-time interaction that listens, thinks, and speaks simultaneously
 - Training: Open-source frameworks are mature and costs are manageable (MiniMind costs only $30, DAPO ReTool around $5000), the key is data quality and environment modeling
 - Deployment: 1B–3B cognitive cores generalize more easily, forcing the extraction of rules rather than memorization, with long-tail knowledge externalized
 
For an Agent to achieve continual learning, three mechanisms must work in concert:
- Parameter learning: Update both the Policy and the World Model, learn from environmental feedback, and improve sample efficiency
 - Context learning: Not simple information accumulation, but enforced compression (linear attention, cross-modal encoding) to distill knowledge that supports reasoning
 - Externalized memory: Use extra compute to summarize and compress knowledge, store it in a knowledge base, and encapsulate repeated procedures as tools to form reusable, composable capability units
 
The future of Agents is not just bigger models, but systems that can evolve over the long term within the world.
References:
- Richard Sutton interview
 - Andrej Karpathy interview
 - DAPO: An Open-Source LLM Reinforcement Learning System at Scale
 - DeepSeek-OCR: Contexts Optical Compression
 - Meta: Early Experience
 - LoRA without Regret (John Schulman)
 - V-JEPA 2
 - MiniMind
 - ReTool
 - Alita
 - Voyager