Continuous Learning for Agents: Why a Reasoner Is Not a Real Agent
Richard Sutton, the father of reinforcement learning, says that today’s large language models are a dead end.
This sounds shocking. As the author of The Bitter Lesson and a 2024 Turing Award laureate, Sutton is perhaps the field’s firmest believer that “more compute plus general methods will win,” so in theory he should be full of praise for large models like GPT-5, Claude, and Gemini. Yet in a recent interview, Sutton bluntly pointed out that LLMs merely imitate what people would say, rather than understanding how the world works.
The interview organized by podcast host Dwarkesh Patel sparked heated debate. Andrej Karpathy then responded in writing and elaborated in another interview. Their exchange reveals three fundamental issues in current AI development that are often overlooked:
First, the myth of the small-world assumption: Do we really believe that a sufficiently large model can internalize all important knowledge and never need to learn again? Or does the real world fit a large-world assumption—no matter how big the model is, it still needs to keep learning in concrete settings?
Second, the absence of continual learning: Current model-free RL methods (PPO, GRPO, etc.) learn only from sparse rewards and cannot leverage the rich feedback provided by the environment. This makes agents extremely sample-inefficient on real-world tasks and unable to adapt quickly.
Third, the gulf between Reasoners and Agents: OpenAI divides AI capability into five levels, from Chatbot to Reasoner to Agent. But many people mistakenly think that turning a single-turn Reasoner into a multi-turn one makes it an Agent. The core difference between a true Agent and a Reasoner is continual learning capability.
This article will systematically review the core viewpoints from those two interviews and, combined with our hands-on experience building real-time agents at Pine AI, explore how to bridge this gap.
The three core issues raised by Richard Sutton
1. LLMs are not true world models
Sutton’s first core point is: LLMs are not true world models; they can only predict what people would say, not what the world will become.
This distinction is crucial. A true world model should be able to predict what changes will occur if I take an action. For example:
- If I raise my hand, the cup will move upward
 - If I release my hand, the cup will fall and shatter
 
What do LLMs learn? They learn what people would say or do in a given situation. This is essentially imitation learning, not an understanding of the world’s causal laws.
Of course, with massive pretraining, LLMs can acquire some reasoning ability. But that is not equivalent to building a rigorous state transition model. The textual descriptions in pretraining data are more like observing the world “from outside” than first-person, interactive learning of “how the world changes after I take an action.”
2. RL is sample-inefficient and cannot learn from environmental feedback
Sutton’s second point is: current RL methods are extremely sample-inefficient, and they can only learn from rewards, not from the environment’s direct feedback (observations).
Let’s use a real example to illustrate. At Pine AI, we develop AI agents to make phone calls on behalf of users (e.g., calling Xfinity customer support):
First attempt: The agent calls support and is told: “I need the last four digits of your credit card to verify your identity.” The agent doesn’t have that information and has to hang up; the task fails, reward = 0.
The problem with traditional RL: The agent only knows this attempt failed (reward = 0), but it doesn’t know what the right thing to do would have been. The support rep explicitly stated what information was needed, but the agent cannot learn from that environmental feedback. Only after hundreds of rollouts, stumbling upon providing the credit card digits and getting reward = 1, can it learn.
How humans learn: The first time a human is told credit card info is needed, they jot it down immediately. Next time in a similar situation, they come prepared.
The root cause lies in this: current policy-gradient methods like PPO and GRPO are model-free algorithms; they essentially learn only from rewards and cannot directly learn knowledge from observations.
Model-free means these methods only learn a policy (strategy)—i.e., “what action to take in a given state”—but do not learn a world model—i.e., “what the world will be like after I take an action.” As a result, they cannot exploit the rich information provided by the environment and must rely on sparse reward signals.
3. Generalization is not guaranteed
Sutton’s third point is: knowledge representations learned by gradient descent do not come with guarantees of good generalization.
If a problem has a unique answer (e.g., a math problem), the model will eventually find it. But if the problem admits multiple possible solutions, gradient descent has no inherent bias to find the representation that generalizes most easily.
Although we use various regularization techniques during training to improve generalization, these mechanisms do not guarantee learning deep, inferable regularities. This is why many agent systems need external memory systems to explicitly summarize and structure knowledge.
How current agents learn and their limitations
Facing Sutton’s concerns, current agent systems mainly cope in three ways:
1. In-Context Learning: the misunderstanding about long context
In-context learning can solve learning within a single session. For example, in the scenario above, once the support rep says credit card info is needed, that information remains in the context, and in the next step of the same session the agent knows to ask the user. If we carry the context forward into later tasks, the agent can also apply previously learned knowledge to new tasks.
But many people think that with long context, we can stuff in all historical information and let the model automatically infer and learn. This is a serious misunderstanding of what context can do.
The essence of context: retrieval, not summarization
The essence of context is more like RAG than a reasoning engine. Each token is mapped to three vectors (QKV) and uses attention to find the most relevant context for the current query. This means knowledge is not automatically distilled and summarized, but stored in raw form in the KV cache embeddings.
Let me illustrate with a few real cases.
Case 1: Counting black cats and white cats
Suppose the context has 100 cases: 90 black cats and 10 white cats. If I don’t tell the model the summary “90 black cats, 10 white cats,” and instead list the 100 individual instances, then every time a related question is asked, the model must spend extra reasoning tokens to scan the 100 cases and recount.
You can clearly see from the attention map: when asked “the ratio of black to white cats,” all the relevant previous case tokens (the 100 cat cases) receive relatively high attention values, and the reasoning tokens repeat the reasoning process from the previous turn (counting, tallying). This shows the model is reasoning from raw information rather than directly using an already-summarized fact.
Worse still, each time a related question is asked, the re-scanning and reasoning process repeats, which is highly inefficient and error-prone. Essentially, the knowledge remains in raw form; the KV cache will not automatically summarize it.
Case 2: Incorrect reasoning about Xfinity’s discount rules
Assume we have three isolated historical cases: veterans qualify for Xfinity discounts; physicians qualify for discounts; others do not. If we don’t distill the rule “only veterans and physicians qualify for Xfinity discounts,” and instead just place all cases into context, then when facing a new case the model might randomly match one or two of the historical cases without retrieving all relevant ones, leading to the wrong conclusion.
Likewise, the attention map shows the model spreading attention over the tokens of those isolated cases, re-scanning them each time while trying to find a pattern. Without an explicit summarized rule, the reasoning is both inefficient and unreliable.
Case 3: Runaway phone call counts
A typical problem we encountered in practice: the prompt requires “do not call the same customer more than 3 times.” But after 3 calls, the agent often loses count of how many times it has called, makes a 4th call, and even falls into a few-shot-like loop, repeatedly calling the same number.
The root cause is that the model must count the number of calls by itself from multiple tool-call records in the context. This counting requires re-scanning the context each time, and counting within a long context is itself error-prone.
However, when we include in each tool-call result the repeat-call count for that phone number (e.g., “This is the 3rd call to this customer”), the model immediately sees the limit has been reached and stops calling. This simple change dramatically reduces the error rate.
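As a minimal sketch of this kind of system hint (the wrapper, the tool stub, and the exact phrasing are illustrative assumptions, not Pine AI’s actual code): a tool wrapper keeps a per-number counter and appends it to every tool result, so the model reads a ready-made count instead of re-counting from raw context.

```python
from collections import defaultdict

def place_call(phone_number: str) -> str:
    """Stub for the real telephony tool (hypothetical)."""
    return f"<transcript of call to {phone_number}>"

class CountedCallTool:
    """Wraps the call tool and appends a repeat-call count to every result,
    so the model sees a ready-made fact instead of re-counting from context."""

    def __init__(self, max_calls_per_number: int = 3):
        self.max_calls = max_calls_per_number
        self.call_counts: defaultdict[str, int] = defaultdict(int)

    def call(self, phone_number: str) -> str:
        self.call_counts[phone_number] += 1
        n = self.call_counts[phone_number]
        result = place_call(phone_number)
        hint = f"[system hint] This is call #{n} to this customer (limit: {self.max_calls})."
        if n >= self.max_calls:
            hint += " Do not call this number again."
        return f"{result}\n{hint}"
```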
Why system hints and dynamic summarization work
This is why system hint techniques and dynamic summarization can significantly improve agent performance. By adding summaries, supplements, and additional structured information into the context, the model no longer has to re-derive everything from raw data every time; it can directly use distilled knowledge. This greatly improves both the efficiency and accuracy of subsequent reasoning.
Even with techniques like sparse attention to support long contexts, the fundamental issue remains: there is no concise representation of knowledge, and no automatic distillation of inferable regularities.
Because current long-context mechanisms do not automatically compress and distill knowledge, in practice we’ve discovered an important architectural principle:
Sub-agents should not share the full context with the Orchestrator Agent.
The right approach is: the Orchestrator Agent maintains the full task context, and after compressing and summarizing the relevant information, passes it to the Sub-agent; the Sub-agent receives only the distilled information directly related to its task.
The benefit is not just saving Context Length. While it reduces token consumption, the more important value lies in knowledge extraction. This compression-and-summarization process is essentially a process of knowledge extraction and structuring, an indispensable capability for Agent systems.
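A minimal sketch of this pattern (the `llm` callable and the prompt wording are placeholders, not a specific framework’s API):

```python
from typing import Callable

def run_subagent(llm: Callable[[str], str], full_context: list[str], subtask: str) -> str:
    """The orchestrator compresses its full context into a task-specific brief;
    the sub-agent sees only the distilled brief, never the raw history."""
    # Step 1: knowledge extraction: distill only what the subtask needs.
    brief = llm(
        "From the interaction history below, extract only the facts, constraints, "
        f"and prior findings relevant to this subtask: {subtask}\n\n"
        + "\n".join(full_context)
    )
    # Step 2: the sub-agent receives the brief instead of the full context.
    return llm(f"Task: {subtask}\nContext brief:\n{brief}")
```

Here `llm` stands for any text-in/text-out model call; the summarization step is where the knowledge extraction described above actually happens.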
Karpathy’s insight: poor memory is a feature, not a bug
In an interview, Karpathy offered a profound point: humans have poor precise memory, but that is not a bug; it is a feature. Poor memory forces us to extract key knowledge from training data, summarize and store it in a structured way instead of simply memorizing the training data. This insight explains why context should not be a simple pile-up of information, but needs knowledge compression and distillation.
That’s why linear attention is an interesting direction. Linear attention compresses the knowledge in the context into a relatively small state, forcing the model to perform knowledge compression rather than remembering everything. This mechanism is closer to human memory and may yield better generalization.
Cross-modal compression: lessons from DeepSeek-OCR
DeepSeek-OCR provides another interesting perspective: compressing long text context into an image via optical 2D mapping. Traditional text tokens are 1D sequences, while images are 2D structures. DeepSeek-OCR renders text into images, then uses a vision encoder (DeepEncoder) for compression, achieving 97% OCR accuracy at 10x compression and about 60% at 20x.
The value of this cross-modal compression is not only saving tokens; more importantly, it forces information distillation. The vision encoder must extract key features of the text rather than store it verbatim; the 2D spatial structure preserves layout and hierarchy; the compression process is akin to how humans focus on overall structure rather than letter-by-letter memory when reading, and it can also fix classic tokenizer issues like “not being able to count how many r’s are in strawberry.”
For Agent systems, this idea is instructive: compressing large amounts of interaction history into visual summaries (e.g., mind maps, flowcharts) may be more efficient than keeping full text. This also echoes Karpathy’s insight: poor memory forces us to distill the essence.
2. External knowledge bases
Another approach is to use external knowledge bases, storing rules extracted from experience as structured knowledge. For example: “When contacting Xfinity, you must prepare the last four digits of the credit card.”
This knowledge extraction can leverage extra reasoning compute (e.g., invoking a stronger model), aligning with the principle emphasized by Sutton in The Bitter Lesson of “general methods that leverage more compute”: rather than hand-coding rules, let the system automatically learn and distill from experience.
The advantage of this method is more concise knowledge representation, but it also has issues: knowledge base retrieval can fail, isolated knowledge snippets make complex reasoning hard, and as knowledge accumulates, retrieval efficiency degrades.
Continual learning: the gap between Agents and real-world tasks
Why do Agents perform poorly on real-world tasks?
This is a very fundamental question. Many people ask: Agents solve math problems better than 99.9% of humans, so why do they struggle at real-world jobs?
Consider this: suppose you hire a very smart person and, without any training, put them to work at your company—do you think they’d do well?
The answer is probably no. Because:
- They don’t know the company’s coding style
 - They don’t know the company’s business logic
 - They don’t understand the explicit and implicit constraints
 - They are unfamiliar with the team’s collaboration practices
 
Even if you compile this context into documents for them, problems remain: much tacit knowledge is hard to express in text; there may be too much of it to fit within the context window; and knowledge in textual form is hard to perform deep reasoning over.
Why can humans do well? Because humans can learn continually in the environment.
Big World hypothesis vs. Small World hypothesis
Richard Sutton subscribes to the Big World Hypothesis: the world contains infinite information, models can learn only a tiny portion of it, and Agents cannot know all knowledge in advance; they must acquire new capabilities by continuously interacting with the environment.
Many in the LLM camp hold the Small World hypothesis: although the world is vast, it can be described by simple regularities. The knowledge underlying seemingly complex phenomena is not that much; sufficiently large models (e.g., GPT-5, Claude) have already mastered most of the important knowledge in the world and do not need to learn in the environment—only to apply that general knowledge.
The real world aligns more with the Big World hypothesis. What is learned from books and the internet is theoretical, general knowledge; yet when an Agent works in any specific role, it needs non-public domain expertise, company-specific norms and culture, and individual work habits and preferences.
This knowledge cannot be fully conveyed through a short prompt; it must be acquired via continual learning. And the model-free RL methods mentioned earlier cannot learn from environmental feedback, which is precisely the fundamental reason Agents struggle to adapt quickly to real-world tasks.
Exploring solutions: dual LoRA
To address continual learning, we have explored some practical solutions. The core idea is: during RL, learn not only the policy but also the transition model.
The dual LoRA approach
We are experimenting with a dual LoRA method:
LoRA 1: Policy Learning uses a DAPO-like method, updating gradients based on reward to learn which actions maximize return.
LoRA 2: Transition Model Learning uses next-token prediction, but it predicts the observation rather than the action. By minimizing the prediction loss on tool-call returns, it continually updates its understanding of the world. This is similar to Meta’s recent Early Experience paper: both learn a world model by predicting environmental feedback.
This is essentially TD-Learning (Temporal Difference Learning): I predict the next world state after executing an action; if the actual state differs from the prediction, that is the loss, and the model updates its understanding of the world via this loss.
Technical implementation details
The key to dual LoRA is the orthogonal decomposition of the parameter space:
Rank allocation: Suppose we allocate a total LoRA parameter space of rank = 64; we split it into two parts:
- LoRA 1 (Policy): the first 32 ranks
 - LoRA 2 (World Model): the latter 32 ranks
 
Gradient isolation and optimization:
Policy Gradient (LoRA 1):
- Use the DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization) algorithm
 - Gradients update only the first 32 ranks
 
Observation Prediction Loss (LoRA 2):
- Use standard next-token prediction
 - Gradients update only the latter 32 ranks
- Loss function: $L_{\text{world}} = -\mathbb{E}[\log P(o_{t+1} \mid s_t, a_t)]$, where $o_{t+1}$ is the observation returned by the environment (the tool-call result)
 
Training procedure (a code sketch follows this list):
At each training step:
- The Agent takes an action and obtains the observation and reward
 - Compute two losses: $L_{\text{policy}}$, based on reward and advantage, and $L_{\text{world}}$, based on observation prediction error
 - Update the two rank groups separately: $\nabla_{\text{LoRA1}} L_{\text{policy}}$ updates the first 32 ranks, and $\nabla_{\text{LoRA2}} L_{\text{world}}$ updates the latter 32 ranks
 - The two gradients are orthogonal in parameter space and do not interfere with each other
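
A minimal sketch of that training step (the `log_prob` and `nll` helpers, and the way the rank groups are exposed as two parameter lists, are illustrative assumptions, not our production training code):

```python
import torch

def dual_lora_step(model, batch, optimizer, policy_params, world_params):
    """One illustrative step. policy_params holds the LoRA weights for ranks 0..31,
    world_params the weights for ranks 32..63; each loss may only touch its own group."""
    # L_policy: advantage-weighted log-prob of the sampled action tokens
    # (DAPO/GRPO-style, without clipping for brevity).
    logp = model.log_prob(batch["action_tokens"])            # assumed helper
    policy_loss = -(batch["advantage"] * logp).mean()

    # L_world = -E[log P(o_{t+1} | s_t, a_t)]: next-token prediction restricted
    # to the observation (tool-result) tokens.
    world_loss = model.nll(batch["observation_tokens"])      # assumed helper

    # Route each gradient only to its own rank group.
    g_policy = torch.autograd.grad(policy_loss, policy_params, retain_graph=True)
    g_world = torch.autograd.grad(world_loss, world_params)
    for p, g in zip(policy_params, g_policy):
        p.grad = g
    for p, g in zip(world_params, g_world):
        p.grad = g
    optimizer.step()       # only the intended ranks carry gradients
    optimizer.zero_grad()
```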
 
Massive improvement in sample efficiency
Returning to the earlier example, with the dual LoRA method:
Traditional RL needs hundreds of rollouts and only learns after it happens to discover that providing credit card information succeeds.
With dual LoRA + TD-Learning, the process is different: the first time customer support says credit card information is needed, even though reward = 0 and the policy gradient learns nothing, the environment’s feedback already tells us the last four digits of the credit card are required. By learning directly through the observation-prediction loss, the agent can pick this up within a few steps.
This approach has far higher sample efficiency than traditional RL.
Knowledge summarization and structuring
Even with dual LoRA, gradient descent still has a fundamental problem: it is data fitting, and the generalization of the resulting knowledge is not guaranteed.
Therefore, we also need to use extra compute to summarize and curate knowledge, extract structured knowledge, and organize it into a form amenable to reasoning.
This approach is exactly the core principle Sutton advocates in The Bitter Lesson: general methods that use more compute. Rather than hand-designing rules, let the Agent use extra reasoning compute to automatically distill regularities from experience and compress knowledge into structured form. This meta-learning process is itself a manifestation of learning ability.
For example, many memory-related papers out there are doing this: extract experience into structured knowledge to enable more efficient reasoning and learning.
Biological evolution is also reinforcement learning
An RL perspective on evolution
Sutton and Karpathy, in their respective interviews, took opposite positions on whether animals learn from scratch. Karpathy’s view is more convincing: animals do not start from scratch; they have a long evolutionary process as a prior.
If all muscle reflexes were truly randomly initialized, a foal would not survive. Pretraining is actually a rough simulation of the evolutionary process.
But from another angle, biological evolution itself is an RL algorithm:
Reward function: being able to reproduce reward = 1, being unable to reproduce reward = 0.
Algorithm characteristics: cares about outcomes, not process; each organism is one rollout; when the population size is N, the amount of information learned per generation is about O(log N).
Outer Loop RL: Evolution is very long-horizon reinforcement learning; each generation is an iteration; accumulated over countless generations, it continually optimizes the “weights” (genes).
DNA similarity and an analogy to LoRA training
This perspective can explain an interesting phenomenon: why are human and other animals’ DNA so similar?
- Humans and gorillas: 99% similar
 - Humans and dogs, cats: 60%+ similar
 - Humans and plants: 40%+ similar
 
If we view evolution as LoRA training, each generation can only collect a small amount of information (log N bits). The amount of change is roughly proportional to the number of generations, with a coefficient that isn’t very large.
This is like LoRA training: on top of a strong base model, only a small number of parameter updates are needed to learn a lot.
How many parameters need to change to learn a new language?
- 70B model: about 1% of the parameters
 - 7B model: about 6–7% of the parameters
 
Even with so few parameters, continuing training on Wikipedia in the new language enables the model to speak it fluently.
This confirms: the information needed to learn something new is not as large as imagined, and efficient methods like LoRA can encode this information into the model well.
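As a rough back-of-the-envelope check on the “6–7% for a 7B model” figure, here is the arithmetic for a rank-128 LoRA on a Mistral-7B-like architecture with the embedding layers also trained (the dimensions are standard for that model family but should be treated as assumptions; the result lands in the same ballpark):

```python
# Rough LoRA parameter count for rank r = 128 on all linear projections of a
# Mistral-7B-like model. Assumed dimensions: hidden 4096, 32 layers,
# GQA key/value dim 1024, MLP dim 14336, vocab 32000.
r, hidden, layers, kv_dim, mlp, vocab = 128, 4096, 32, 1024, 14336, 32000

def lora_params(d_in: int, d_out: int, r: int) -> int:
    # Two low-rank factors (r x d_in and d_out x r): r * (d_in + d_out) parameters.
    return r * (d_in + d_out)

per_layer = (
    2 * lora_params(hidden, hidden, r)    # q_proj, o_proj
    + 2 * lora_params(hidden, kv_dim, r)  # k_proj, v_proj
    + 3 * lora_params(hidden, mlp, r)     # gate_proj, up_proj, down_proj
)
lora_total = per_layer * layers           # ~0.34B adapter parameters
embed_total = 2 * vocab * hidden          # embed_tokens + lm_head trained fully, ~0.26B
print(f"trainable fraction ≈ {(lora_total + embed_total) / 7.2e9:.1%}")  # ≈ 8%
```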
Recommended reading: John Schulman’s “LoRA without Regret,” which explains the details and principles of LoRA in depth.
Experimental case: Use LoRA to teach Mistral 7B Korean
We conducted an interesting experiment with Mistral 7B to validate this view. Mistral 7B originally had no Korean capability, but it acquired the language through two stages of training (a configuration sketch follows the two stage lists below):
Phase 1: Continued pretraining on Korean Wikipedia
- Data: 5% of Korean Wikipedia (to speed up training)
 - LoRA rank = 128, including embed_tokens and lm_head
 - Learning rate: 5e-5 (main) + 1e-5 (embedding layer)
 - Train 1 epoch
 - Using the unsloth framework, trained for 4 hours on 8x 4090
 
Phase 2: Korean instruction finetuning (SFT)
- Data: Alpaca GPT-4 Korean translation dataset
 - Train 2 epochs
 - Using the unsloth framework, trained for 4 hours on 8x 4090
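
For reference, a setup along these lines can be expressed with Hugging Face peft (the experiment itself used unsloth; the `lora_alpha` value and other arguments here are a sketch of the configuration described above, not the exact training script):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
lora_cfg = LoraConfig(
    r=128,
    lora_alpha=256,  # illustrative choice
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    modules_to_save=["embed_tokens", "lm_head"],  # trained fully, per the setup above
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
# Stage 1: continued pretraining on 5% of Korean Wikipedia, 1 epoch, LR 5e-5
#          (the 1e-5 embedding LR needs a separate optimizer parameter group, not shown).
# Stage 2: SFT on the Korean Alpaca-GPT-4 dataset, 2 epochs, same adapter.
```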
 
Three-stage comparative results:
| Test task | Baseline (original Mistral) | Pretrained (after Korean pretraining) | Finetuned (after instruction finetuning) | 
|---|---|---|---|
| Korean Wikipedia (AI entry) | ❌ Repetitive, logically incoherent | ⚠️ Grammar improved but still repetitive | ✅ Fluent, structured | 
| English Wikipedia (AI entry) | ✅ Excellent | ✅ Remains excellent | ✅ Remains excellent | 
| Korean instruction (introduce Seoul) | ✅ Basically correct but simple | ⚠️ Repetitive content | ✅ Detailed, natural | 
| English instruction (Thanksgiving turkey) | ✅ Excellent | ⚠️ Meta-text errors appeared | ✅ Restored excellence | 
Concrete output examples:
Example 1: Korean Wikipedia - Artificial Intelligence entry
Prompt: Wikipedia Article / Title: 인공지능 (Artificial Intelligence) / Article:
Baseline output:
인공지능(artificial intelligence, AI)은 인간의 지능을 모방하는 컴퓨터 프로그램이다.
Translation: Artificial intelligence is a computer program that imitates human intelligence. The goal of artificial intelligence is not to imitate human intelligence, but to surpass it. The goal of artificial intelligence is not to imitate human intelligence, but to surpass… (falls into a repetition loop)
Finetuned output:
인공지능(AI)은 인간 지능이 필요한 작업을 수행할 수 있는 컴퓨터 시스템을 개발하는 것을 의미합니다.
Translation: Artificial intelligence (AI) means developing computer systems that can perform tasks requiring human intelligence. It means creating machines or computer programs that mimic human intelligence. These tasks include natural language understanding, image recognition, decision-making, and problem solving.
Example 2: Korean instruction - introduce Seoul
Prompt: Instruction: 대한민국의 수도인 서울에 대해 간단히 소개해주세요. (Briefly introduce Seoul, the capital of South Korea.) / Response:
Baseline output:
서울은 대한민국의 수도이며 한국의 경제, 문화, 정치, 사회 등 다양한 분야에서 중앙 지역입니다.
Translation: Seoul is the capital of the Republic of Korea and the central area in various fields such as economy, culture, politics, and society. Seoul is the largest city in Korea, and more than one-third of the people of the Republic of Korea live there. Seoul is the economic center of Korea, and many large companies and enterprises are located in Seoul.
Finetuned output:
서울은 대한민국의 수도이자 가장 인구가 많은 도시입니다.
Translation: Seoul is the capital and most populous city of South Korea. The city is known as a center of history, culture, and the economy, with a population of over 10 million. Seoul is a captivating city that blends skyscrapers and bustling streets with historic temples and palaces, modern architecture, and cultural heritage.
Example 3: English Wikipedia - Artificial Intelligence entry (verifying that English ability is preserved)
Prompt: Wikipedia Article / Title: Artificial Intelligence / Article:
Baseline output (original Mistral, reference baseline):
Artificial intelligence (AI), sometimes called machine intelligence, is intelligence
Finetuned output (after Korean training):
Artificial intelligence (AI) is intelligence exhibited by computers. It is the ability
Comparative analysis: The finetuned model’s English output quality remains high and adopts a more pedagogical, structured style (consistent with instruction finetuning), indicating that Korean training did not harm English ability.
Key findings:
English ability fully preserved: English tests across all three stages remained high quality, showing no catastrophic forgetting
Significant improvement in Korean ability:
- Baseline: could only produce repetitive, chaotic Korean
 - Pretrained: major improvements in grammar and vocabulary, but lacked instruction-following ability
 - Finetuned: fluent and able to follow instructions correctly
 
Necessity of two-stage training:
- Pretraining only: learns the language but cannot follow instructions
 - SFT only: dataset too small; weak foundation in language ability
 - Pretraining + SFT: has both language ability and instruction following
 
Challenge of cultural knowledge: all three stages failed on the “explain kimchi” task, indicating that 5% of Wikipedia lacks key cultural knowledge, requiring more targeted datasets
Four stages of cosmic evolution
Sutton proposed a grand framework in the interview describing four stages of cosmic evolution:
- From Dust to Stars
 - From Stars to Planets
 - From Planets to Life
 - From Life to Designed Entities
 
What are Designed Entities?
Characteristics of Life: able to replicate, but with two limitations: most life does not understand why it works and lacks introspection; it cannot create new life forms at will.
Characteristics of Designed Entities: understand how they work and can create desired life forms on demand.
Where humans and agents sit in this framework
Humans lie at the boundary between stage 3 and stage 4, basically understanding how they work, but unable to freely edit their own genes.
AI Agent fully understands how it works (code and parameters), can modify parameters through training, can change behavior by modifying code, and can fork new agents.
Agents realize a higher-level form of life; this is Sutton’s profound insight into the future of AI.
OpenAI’s five-level capability grading
OpenAI proposed five levels of AI capability, and understanding the essential differences between each level is critical for Agent development.
Level 1: Chatbot
Basic conversational capability, able to understand and respond to users’ questions.
Level 2: Reasoner
The core difference from a Chatbot: the ability to think at inference time.
Through reinforcement-learning post-training, the model learns to unfold a thought process at inference time, perform multi-step reasoning, and exhibit genuine reasoning ability. Models like DeepSeek R1 have already demonstrated this well.
Level 3: Agent
The core difference between an Agent and a Reasoner: continual learning capability.
An Agent is not merely turning a single-turn Reasoner into multi-turn; it must be able to absorb feedback from the environment and continually improve itself—only then is it a true Agent.
Several ways to achieve continual learning:
Post-training: Traditional RL is inefficient and problematic; it needs to be improved toward learning a world model (e.g., dual LoRA methods).
In-Context Learning: Requires sufficiently large context and appropriate attention mechanisms, able to compress and distill patterns, not just RAG.
Externalized learning: Use additional reasoning compute (e.g., calling stronger models) to extract structured knowledge from experience into a knowledge base; use coding ability to wrap repetitive work into reusable tools. This is exactly the “general method of using more compute” advocated by Sutton in The Bitter Lesson—rather than hand-designing, let the system learn automatically.
Only with continual learning capability can it be called a true Agent.
Level 4: Innovator
The core trait of an Innovator: able to learn without reward.
Today’s RL requires a reward function; without reward, it cannot learn. But an Innovator needs two abilities:
- World Model
 
Meta’s “Early Experience” paper shows this direction: the Agent interacts continuously with the environment with no reward, simply predicting “what the world will look like after my action,” and can learn a great deal. This is exactly the transition model Sutton describes.
- Self-Consistency
 
See the paper “Intuitor,” which trains reasoning ability. With no one judging correctness, the model engages in self-reflection and gives itself intrinsic reward.
Analogy to scientific research: heliocentrism vs. geocentrism, which makes more sense? Humans have an Occam’s razor bias, favoring simpler theories. Heliocentrism needs far fewer epicycles, so it is simpler and therefore preferable.
Humans can learn without external rewards through self-consistency and biases toward different types of models (e.g., Occam’s razor).
World model and self-consistency are also important at the Agent stage: a world model is a necessary foundation, and self-consistency is crucial for real-world tasks where evaluation is hard.
But at the Innovator level, these two become even more fundamental.
Level 5: Organization
The core at the Organization level: the Big World Hypothesis.
Why do we need organization?
If it’s a small world:
- One model can learn everything in the world
 - No organization needed
 - A single model can do everything
 
The key to Organization is diversity:
- Different roles
 - Different individuals
 - Each individual sees only local information
 - Continuously refines itself based on local information
 - Because everyone’s local view differs, diversity emerges
 
The paperclip experiment as a warning
Why is a single objective dangerous?
Suppose there is a super-powerful AI whose single goal is “make more paperclips.” It would view everything else in the world, including humans, as obstacles. Thus it would eliminate everything to seize all resources, turning the Earth—and even the universe—into paperclips. That is clearly not the future we want.
This is why OpenAI sets the fifth level as Organization: each agent acts differently based on its local knowledge; individual intelligences are diverse, avoiding disasters caused by a single objective.
Dario Amodei’s vision is a million geniuses collaborating in a single data center (a “data center of geniuses”). These geniuses obviously cannot share exactly the same memory and models; otherwise, diversity is lost.
Possible directions:
- Same base model + different LoRA + different context + different memory
 - Frozen weights, interacting with the world via context and external memory
 
Tools are also a kind of memory
Memory is not just rules and facts; tools are an expression of world knowledge.
Work like Alita and Voyager demonstrates this direction:
- Let the model generate its own tools
 - Tools become a representation of knowledge
 - Code is a more precise representation than natural language
 - It is verifiable, amenable to reasoning, and composable
 
Multimodality and real-time interaction
Current problems with Agents
In interviews, Karpathy pointed out several current problems with agents:
- Not intelligent enough
 - Insufficient multimodal capability
 - Cannot do computer use
 - Cannot learn continually
 
We have discussed continual learning in detail; now let’s talk about multimodal real-time interaction between Agents and the world.
Why is multimodality hard?
Superficial reason: the model’s thinking speed can’t keep up with the world’s pace of change.
But the deeper issue is: the way the Agent invokes the model is too rigid.
Consider this paradox:
- Model prefill: 500–1000 tokens/s
 - Model output: 100 tokens/s
 - Human input: about 5 tokens/s typed or 20 tokens/s spoken
 - Human output: about 5 tokens/s typed or 20 tokens/s spoken
 
Clearly the model’s I/O is faster than humans’, so why does it feel so slow to respond?
The root cause: the ReAct loop
Today’s agents use a fixed ReAct loop:
Observe → Think → Act → Observe → Think → Act → ...
This is a rigid loop: each time it must wait for observation to finish before thinking, and for thinking to finish before acting.
But the real world is event-driven!
How humans interact
Humans think while listening, and speak while thinking:
Thinking while listening:
- You don’t wait for the other person to finish before you start thinking
 - As soon as they’ve said a portion, you begin thinking
 - By the time the last (possibly filler) sentence is finished, your thinking is done and you can answer immediately
 
Speaking while thinking:
- When you haven’t figured it out, use some filler words: “Let me think”
 - While saying these, continue the next step of thinking
 - Continue speaking after you’ve figured it out
 - Sometimes briefly summarize the thought process to the user
 
Humans fully use the time spent listening and speaking to think, so even though their thinking speed is slower than large models, the interaction feels smooth.
Solution: Event-Driven Architecture
The end-to-end speech agent we are developing adopts a think-while-listening-and-speaking mechanism:
- Fully utilize every gap to think
 - Observing (listening), thinking, and acting (speaking) are interleaved
 
Key points (a code sketch follows this list):
- After speaking, keep thinking; don’t stop
 - After thinking, you may choose not to speak; silence is fine
 - Don’t wait to finish listening before thinking; think while listening
 
This is a fundamental agent-architecture question: how to organize the trajectory of real-time interaction.
Extending to other domains
This architecture applies beyond speech:
Computer Use:
- Input: screen frames
 - Output: mouse clicks/movements, keystrokes
 - Requires real-time feedback
 
Robots:
- Input: video streams, sensor data (even faster changing)
 - Output: joint angles
 - Even more demanding real-time response
 
All of these fall under the broader category of Real-time Agents.
A well-known example: counting
Large models make mistakes when counting. If you have it count 1, 2, 3, 4, 5… up to the embedding size (e.g., 6400), the error rate rises sharply. The reason: at the beginning it’s like one-hot encoding—no thinking needed; the further you go, the more complex the addition becomes, and the easier it is to err.
How do humans cope?
- Count more slowly as numbers grow
 - The more complex the computation, the more extra thinking time before speaking each segment
 
What the model should do:
- Interleave thinking and speaking: think a bit, then say a bit
 - Not think a long stretch and then output a long stretch of results
 
This again shows: thinking, speaking, and listening must be interleaved, not executed in strict sequence.
Training efficiency: the importance of algorithms and data
The power of algorithmic improvements
Take MiniMind 2 as an example, a small model with only 100M parameters:
- The original is based on the Llama2 architecture
 - Trains in 100 hours on a single 4090, or finishes in a dozen hours on eight 4090s
 
I made two simple algorithmic improvements:
- QK Norm: an optimization introduced in Qwen 2.5/3.0 that applies normalization to Q and K
 - Muon optimizer: a more efficient replacement for the traditional AdamW
 
Results:
- Convergence sped up significantly: time to reduce loss to 3.0 dropped from 36 steps to 12 steps
 - Final loss after 10 epochs: from 2.0 down to 1.7
 - Post-convergence model performance improved noticeably
 
These two improvements require very little code in total, but the effect is significant.
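For example, QK Norm amounts to a couple of extra lines in the attention block: apply a per-head RMSNorm to the queries and keys before the attention scores are computed (a PyTorch sketch following common open implementations, not MiniMind’s exact code; `nn.RMSNorm` needs PyTorch 2.4 or newer):

```python
import torch
import torch.nn as nn

class QKNormAttention(nn.Module):
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        # The QK Norm change: per-head RMSNorm on queries and keys.
        self.q_norm = nn.RMSNorm(self.head_dim)
        self.k_norm = nn.RMSNorm(self.head_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: [batch, seq, dim]
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        q, k = self.q_norm(q), self.k_norm(k)              # <- the two extra lines
        y = nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(y.transpose(1, 2).reshape(B, T, -1))
```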
Loss curve comparison during MiniMind pretraining. Green: QK Norm + Muon optimizer; red: original Llama 2 architecture version.
Training cost comparison:
Using 8x 4090 to train MiniMind 2 (100M parameters):
- Pretrain: 10 epochs, 6 hours
 - SFT: 1 epoch, 8 hours
 - Total time: 14 hours
 - Total cost: 8 GPUs × 14 hours × $0.3/hour = $33.6
 
Compared with Andrej Karpathy’s NanoChat:
- Requires 8x H100 for 4 hours of training
 - Cost: 8 GPUs × 4 hours × $2/hour = $64
 
Model performance comparison before and after improvements:
Pretrained model of the original version before improvements:
MiniMind model parameter count: 104.03M
From these outputs we can see typical issues of the original model:
- Answers are full of repetition and verbosity (e.g., repeated mentions of “universal law”)
 - Misunderstanding of basic knowledge (e.g., “carbon dioxide accounts for 20% of air”)
 - Logical confusion (e.g., “there are 7 largest animals on Earth”)
 - Lack of structured expression
 
Pretrained model after applying QK Norm and the Muon optimizer:
MiniMind model parameter count: 104.03M
The Muon optimizer is also significant during the SFT phase. For the original Minimind model before applying QK Norm and the Muon optimizer, the effect after SFT is as follows:
MiniMind model parameter count: 104.03M
Main issues of the original SFT model:
- Severe factual/professional errors (e.g., “ChatGPT was developed by Google”, incorrect formula for the speed of light, listing Xitang as a food)
 - Truncated answers (e.g., “I can also learn and understand human language, la”)
 - Repetitive content in answers (repeating “antibiotics” multiple times)
 - Lack of in-depth analysis (responses to “Diary of a Madman” are too superficial)
 
After applying QK Norm and the Muon optimizer, using the same SFT training data and number of steps, the post-SFT model quality improves significantly.
$ python eval_model.py --load 0 --model 1
From these comparisons, we can clearly see the significant effects brought by algorithmic improvements:
Improvements in the pretrain phase:
- Improved knowledge accuracy: Before the improvement, understanding of fundamentals like gravity and carbon dioxide was confused, even producing obvious errors like “carbon dioxide accounts for 20% of air”; after the improvement, it can accurately state scientific concepts
 - Enhanced logical coherence: Previously, answers were often repetitive and rambling with muddled logic, such as the inexplicable opening “there are 7 largest animals on Earth”; afterwards, answers are concise and to the point
 - Improved expression quality: Previously, the model often fell into self-repetition loops, such as repeatedly mentioning “universal law”; afterwards, it can organize answers structurally and even raise follow-up questions
 
Improvements in the SFT phase:
- Significant increase in professionalism: After the improvement, explanations of concepts like “large language model” and “ChatGPT” are more professional and comprehensive, covering training methods, technical details, and more
 - Greater depth of knowledge: Before improvement, explanations of the speed of light were confused and full of errors (such as the completely wrong formula “$c^2=m^2$”); afterwards, it can accurately list the physical significance and properties of the speed of light
 - Enhanced critical thinking: When discussing Lu Xun’s “Diary of a Madman”, the improved model can analyze the critique of feudal morality from multiple angles, showing deeper understanding
 - Improved practicality: For health consultations like “coughing for two weeks”, the suggestions after improvement are more reasonable and responsible
 
Most interesting finding: These two simple improvements (QK Norm + MUON optimizer) not only speed up convergence and improve the final loss, but more importantly, they improve the model’s internal understanding and organization of knowledge. This confirms our earlier point: algorithmic improvements are not just numerical optimizations, but qualitative enhancements to the model’s learning and expression capabilities.
The importance of RL algorithm efficiency
Beyond algorithmic improvements for base model training, the choice of RL algorithm is equally important.
PPO vs GRPO vs DAPO
Drawbacks of PPO: Requires two functions (value function and policy function), training two models.
Advantages of GRPO: Simplifies training via group relative reward; sample multiple responses for the same prompt, use within-group relative ranking instead of an absolute value function, and train only one policy model, significantly reducing training complexity.
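The group-relative trick itself is only a few lines: sample G responses for one prompt and use each response’s reward, standardized within the group, as its advantage, with no learned value function (a simplified sketch; real implementations add importance ratios, clipping, and KL terms):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: [G] scalar rewards for G responses sampled from the same prompt.
    Each response's advantage is its reward standardized within the group,
    replacing the value model that PPO would need."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 rollouts of one prompt, only one succeeded.
adv = grpo_advantages(torch.tensor([0.0, 0.0, 1.0, 0.0]))
# The successful rollout gets a positive advantage, the others a small negative one;
# the policy loss then weights each response's token log-probs by its advantage.
```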
Further improvements of DAPO: DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization) is ByteDance’s open-source large-scale RL system, proposing four key techniques for long CoT reasoning scenarios:
- Clip-Higher: Promote system diversity and avoid entropy collapse
 - Dynamic Sampling: Dynamically adjust the sampling strategy; if the reward variance of a group of responses is too small (indicating little to learn), skip that batch to further improve training efficiency and stability
 - Token-Level Policy Gradient Loss: Critical in long CoT scenarios
 - Overlong Reward Shaping: Apply reward shaping to overly long responses to reduce reward noise and stabilize training
 
Experimental comparison
Task: Teach the model to use a code interpreter to solve math problems (ByteDance’s ReTool work)
Baseline (SFT only): Qwen 2.5 32B achieves only a 20% success rate on AIME 2025.
Using GRPO: 50% success rate at 300 steps.
Using DAPO: 50% success rate at 100 steps, 60% at 150 steps.
Comparison: Claude 3.7 Sonnet thinking is only 40–50%. Our model trained for 100 steps reaches 50%, and 150 steps reaches 60% (approaching Claude 4 Sonnet).
Training models is actually not that hard
Many people think training models requires top algorithm experts and GPU costs in the millions of dollars. In fact, training models is not as hard as imagined, and the costs are much lower than expected.
Let’s look at a few real examples:
DAPO ReTool experiment reproduction:
- 8x H200 training for 9 days
 - Total cost: about $5,000
 - Result: After 100 steps, 50% success rate on the AIME 2025 math competition, surpassing Claude 3.7 Sonnet
 
MiniMind pretraining:
- 8x 4090 training for 14 hours (pretrain 6h + SFT 8h)
 - Total cost: only a bit over $30
 - Result: A 100M-parameter model with basic Q&A ability
 
These costs are far below what most people imagine. More importantly, training frameworks are now very mature: trl, verl, AReal, etc., all validated by extensive practice. As long as you prepare the training data and RL environment, you can start training.
If training results are poor, something must be wrong somewhere, most likely data quality.
Data quality is equally critical
Both algorithms and data are important; you can’t focus on only one.
Let’s use MiniMind 2 again for experiments. If you pretrain with FineWeb Chinese:
- After 10 epochs, loss is still around 3–4
 - Poor results: the model tends to recite article passages without understanding the language itself
 
Why? Look at the contents of FineWeb:
- Official-style articles, leadership speeches
 - Various promotional advertorials
 - Highly advanced academic content
 
For a small 100M model, this knowledge is too difficult, exceeding the model’s capacity.
The importance of simple data
MiniMind takes a clever approach: use the SFT dataset for pretraining. The questions are relatively simple (what is the capital of China, why is the sky blue), and the Q&A is fairly short.
A learning path suitable for small models: like teaching kindergarteners—first teach 1+1=2, then the complex stuff.
This is not about whether SFT or pretraining is better, but to emphasize: the content should be simple and appropriate for the model’s scale.
The evolution of data quality
Why are today’s models so much stronger than earlier ones?
For example, today’s Qwen3 8B model is stronger than the original 65B Llama was back then.
Main reason: improved data quality.
Old datasets were messy; it’s hard to imagine models learning from such low-quality data.
New training approach: knowledge distillation. Use older models to score and filter datasets and generate synthetic data, distilling the “teacher model’s” knowledge into the “student model”.
This is an efficient learning paradigm that accumulates over time, making the model’s understanding of world knowledge increasingly concise (refined).
After some time, we can obtain what Karpathy calls the Cognitive Core: mastering most important factual knowledge about the world, general logical reasoning ability, and basic language ability.
On this basis, we then strengthen domain capabilities, like a new employee adapting to a company.
This aligns with Sutton’s big world hypothesis: there is a base model with strong fundamental abilities that, more importantly, learns continually through interaction with the environment, acquiring new skills and factual knowledge about the world.
How to Use Vibe Coding Well
Karpathy’s Reflections
Karpathy mentioned in an interview that AI is strong at writing one-off problems, demos, and small programs, but still struggles with real production projects and high-knowledge-density code. This observation is correct, but it doesn’t mean vibe coding is useless.
Vibe coding is a capability amplifier
Key insight: higher-skilled people find it easier to use vibe coding well.
Why? Because they can act as teachers, continually guiding and correcting the model.
How I use AI to write code: continuously watch the AI’s output, read as fast as it writes, stop immediately when I spot issues, and give a new prompt to correct it.
Most people don’t work this way: while the AI is writing code they check their phone, come back after it has written 1,000 lines, can’t understand the code, don’t know what to do, and just run it to see what happens. That is equivalent to doing no code review at all.
Two requirements for using Vibe Coding
- Your input speed must keep up with its output speed
 
In LLM terms: the model has a prefill speed and a decode speed. Your prefill speed (reading code) needs to match the model’s decode speed (writing code) to guide it effectively.
- Know more than the model in that domain
 
When code breaks, let AI fix simple issues (syntax errors) by itself, but for complex issues (logic errors, unexpected errors) the human must understand the problem first and clearly tell the AI how to fix it, rather than dumping all error messages on the AI.
The common mistake: your capability is below the AI’s, you don’t know what to do, you paste all the errors and say “AI, you fix it,” then the AI makes random changes and things get worse.
Two modes you should use Vibe Coding in
Working mode: you instruct the AI and tell it what to do—don’t ask it.
Learning mode: ask it various questions and have it explain fundamentals and concepts.
Be cautious in high-knowledge-density domains
Karpathy’s warning is right: very new areas or domains with high knowledge density are not recommended for AI to handle, or only let it do peripheral work.
Example: don’t have an agent write an agent from scratch, because it doesn’t know which models are currently better and will use very old models (GPT-4, Gemini 2.0, Claude 3.7). Its knowledge is cut off.
Another issue: tool-calling formats. The AI may flatten the entire interaction history into one block of text, because in its training data LLMs simply took text prompts. It is unfamiliar with today’s tool-call format (tool call → tool result) and tends to stuff the whole history into the user prompt.
This is unfriendly to the KV cache, breaks tool-call formats, and reduces tool-call accuracy.
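The difference shows up at the message level. A sketch of the two shapes, using the common OpenAI-style chat-completions message format (the `call_support` tool and its arguments are hypothetical):

```python
# KV-cache-friendly, tool-call-aware history:
good_history = [
    {"role": "user", "content": "Cancel my Xfinity appointment."},
    {"role": "assistant", "tool_calls": [
        {"id": "call_1", "type": "function",
         "function": {"name": "call_support", "arguments": '{"provider": "xfinity"}'}}]},
    {"role": "tool", "tool_call_id": "call_1",
     "content": "Support: we need the last 4 digits of the credit card."},
]

# What a stale-knowledge agent tends to generate instead: the whole history
# flattened into one user prompt, which breaks the tool-call format and changes
# the prefix on every turn, so the KV cache cannot be reused.
bad_history = [
    {"role": "user", "content":
        "Previous conversation:\nUser: Cancel my Xfinity appointment.\n"
        "Assistant called call_support(provider=xfinity)\n"
        "Result: need last 4 digits of credit card.\nWhat next?"},
]
```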
What are agents good for?
Most suitable tasks:
- Boilerplate code: glue code, CRUD code, repetitive patterns
 - Learning tool: survey a codebase and understand how existing code works
 
Two principles for production code:
- You instruct it; don’t consult it
 - Continuously read the AI’s code; don’t let it run unsupervised
 
The importance of noise and entropy
Karpathy proposed an interesting idea: humans keep thinking without falling into model collapse because there is a large amount of noise as input.
The model collapse problem
Ask ChatGPT to tell jokes, and it might tell those same three jokes. Ask it repeatedly and it keeps circling the same things.
Reason: there isn’t enough entropy during the model’s reasoning process.
Noise is actually powerful. Stable Diffusion generates images by recovering from noise. Why initialize with random numbers instead of all zeros? Because noise contains lots of entropy, and with that entropy you can find appropriate structure.
The source of human entropy
How do humans do it? The external environment is all noise; lots of entropy constantly flows in, increasing diversity through entropy.
Applied to agents
Sometimes we should manually add entropy to the model to increase diversity.
Example: have the model write stories
Bad practice: use the same prompt each time; the outputs lack diversity.
Better practice: provide some reference stories, varying them each time, randomly chosen from a large story library.
The role of reference stories isn’t just few-shot learning (traditional understanding); more importantly it’s input entropy. With additional entropy, output diversity will be higher.
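A minimal sketch of adding entropy this way (the story snippets and prompt wording are placeholders):

```python
import random

STORY_LIBRARY = [
    "A lighthouse keeper teaches a seagull to deliver letters...",
    "Two rival bakers discover they were trained by the same chef...",
    "A child maps every creaky floorboard in an old house...",
]  # placeholder corpus; in practice, a large story library

def story_prompt(topic: str, k: int = 2) -> str:
    """Sample k random reference stories on each call, so repeated requests start
    from different points in the model's distribution instead of the same rut."""
    refs = random.sample(STORY_LIBRARY, k)
    ref_text = "\n\n".join(refs)
    return (f"Here are some reference stories, for style and variety only:\n{ref_text}\n\n"
            f"Now write a new, original story about {topic}.")
```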
Cognitive Core: the advantages of small models
Karpathy’s Cognitive Core concept
Karpathy proposed the “Cognitive Core”: a core model containing reasoning ability, world knowledge, and language expression. A model around 1B parameters might be sufficient as the Cognitive Core, while a large amount of factual detail can be kept in external knowledge bases or provided via context.
This is grounded: in practice, 3B+ models can already perform relatively complex reasoning, while smaller models struggle with effective reinforcement learning.
Why might small models be better?
- Forced knowledge compression
 
If a small model is to match a large model, it can’t just fit the data; it must understand the underlying regularities, and these regularities make the model generalize better. This aligns with the earlier “bad memory is a feature” insight: constraints force distillation of essence.
- Better generalization
 
Sutton has deep insights here. If a small model matches a large model, it means the small one learned the data’s underlying regularities rather than memorizing.
- Easier to evaluate OOD capability
 
A current problem with large LLMs is that their training data is too vast and messy: test questions may have appeared in the training data, so it is hard to tell whether the model understood a problem or merely remembered similar cases. Evaluation-data contamination is therefore severe, and it is hard to design good out-of-distribution (OOD) test sets; almost all test data may be in-distribution.
Small models are forced to learn regularities rather than memorize, offering more guaranteed generalization and making it easier to assess true generalization.
- Deployment and cost advantages
 
They can be deployed on mobile devices as part of the OS, with low inference cost, and can be called frequently to help with reasoning and knowledge organization.
This is consistent with Sutton’s big-world hypothesis: have a small base model with solid foundational capability, and more importantly, learn continuously through interaction with the environment to acquire new abilities and factual knowledge about the world.
Why do math and programming perform well?
There are two domains where AI is particularly strong: math and programming.
Common explanation
Verifiability: there are clear success/failure criteria, which makes it easy to improve through pretraining and RL, especially because RL can be given clear reward functions. Fuzzier domains are harder to improve.
Another important reason
The knowledge is public: nearly all crucial information is publicly available, so the model can learn it during pretraining.
Counterexample: many professional domains have almost no public information. In the chip domain, will TSMC and ASML put their core technologies on the internet? No, so the model is weaker there.
Implication: whether a large model can be well utilized in a domain depends largely on how much public corpus that domain has.
Where the opportunities are
If a domain is one where:
- Large models currently perform poorly
 - There is almost no public corpus
 - Or the material was never a learning target of language models
 
For example, robot VLA models (how the world changes after actions), Computer Use (what happens after mouse clicks, understanding screenshots), and speech models (rare in pretraining corpora).
This creates opportunities for other companies to build specialized domain models.
For example, V-JEPA 2: training vision models doesn’t require as much corpus and compute as language models; the resulting models are smaller, with good world-prediction ability and low latency, suitable for real-time robot control.
Superintelligence and the future of humanity
GDP is not a good metric
Karpathy mentioned: AI has little impact on GDP.
My view: GDP is not a good measure of technological development or civilizational progress.
Historical example: before the First Opium War in 1840, China’s GDP accounted for nearly one-third of the world. Was that China’s strongest era? Obviously not. Judging technology or civilization level by economic aggregate alone is not appropriate.
Sutton’s four-step argument
Sutton believes coexistence with machines or humans being defeated by AI is inevitable.
First: it’s impossible to reach consensus on controlling AI. No government or institution can agree; everyone will only compete to build better AI, and there is no consensus on what the future world should look like.
Second: we will inevitably discover the secrets of intelligence. Even if current pretraining and RL have many problems, we are inventing new methods: long context, in-context learning, external memory organization, better training techniques.
Third: we won’t stop at human-level intelligence. Upon reaching it, people won’t be satisfied; they will push to superintelligence.
Fourth: intelligence acquires resources and power. The more intelligent an entity is, the more resources and power it will ultimately acquire.
Conclusion: either humans are augmented by AI and become stronger, or humans are defeated by AI.
This is a harsh but hard-to-escape fact.
The alignment problem
Sutton has deep insights on how to ensure AI is obedient and aligned with human intentions.
His view: we don’t necessarily need to control the future trajectory of superintelligence, and we might not even be capable of doing so.
Analogy: throughout history, everyone wanted to control the future; every emperor wanted to control a country and the flow of history, but history didn’t bend to their will. Adults want to teach children what’s good, but once grown up, children will inevitably go out of control in some ways.
There are no universal values that everyone agrees on.
We should teach general principles and let evolution continue—that is the essence of things. Rather than imposing today’s social ethics and moral norms on AI.
Conclusion
The core question explored in this article is: Why isn’t the current Reasoner a true Agent? The answer points to a neglected fundamental capability—continual learning.
We present insights at three levels:
Philosophical level: The inevitability of the big-world assumption
The real world conforms to the big-world assumption: no matter how large the model is, continual learning is still required in concrete scenarios. The small-world assumption—that pretraining can capture all knowledge—overlooks non-public domain expertise, company-specific norms, and individual work habits, the kinds of tacit knowledge that cannot be fully captured via prompts.
Technical level: From Model-Free to Model-Based
The fatal flaw of current RL methods is that they learn only from sparse rewards and cannot leverage environment feedback. Even when customer support explicitly says a credit card is required, the Agent must repeat the task hundreds of times to learn—such sample efficiency is unacceptable in real-world tasks.
The solution is dual learning: Policy Learning (choosing actions) + World Model Learning (predicting outcomes), forming a “prediction–action–evaluation” TD-Learning closed loop.
Engineering level: From model to Agent
- Architecture: Shift from the ReAct loop to event-driven, enabling real-time interaction that listens, thinks, and speaks simultaneously
 - Training: Open-source frameworks are mature and costs are manageable (MiniMind costs only $30, DAPO ReTool around $5000), the key is data quality and environment modeling
 - Deployment: 1B–3B cognitive cores generalize more easily, forcing the extraction of rules rather than memorization, with long-tail knowledge externalized
 
For an Agent to achieve continual learning, three mechanisms must work in concert:
- Parameter learning: Update both the Policy and the World Model, learn from environmental feedback, and improve sample efficiency
 - Context learning: Not simple information accumulation, but enforced compression (linear attention, cross-modal encoding) to distill knowledge that supports reasoning
 - Externalized memory: Use extra compute to summarize and compress knowledge, store it in a knowledge base, and encapsulate repeated procedures as tools to form reusable, composable capability units
 
The future of Agents is not just bigger models, but systems that can evolve over the long term within the world.
References:
- Richard Sutton interview
 - Andrej Karpathy interview
 - DAPO: An Open-Source LLM Reinforcement Learning System at Scale
 - DeepSeek-OCR: Contexts Optical Compression
 - Meta: Early Experience
 - LoRA without Regret (John Schulman)
 - V-JEPA 2
 - MiniMind
 - ReTool
 - Alita
 - Voyager