[This article is adapted from a Zhihu answer. It was written the old-fashioned way, by hand, and is not AI-generated.]

For People and Models Alike, Context Is What Matters Most

Yesterday morning I was in a bad mood. I read two technical reports and felt like almost every well-known technical report had someone I knew on it, while I myself hadn’t produced anything.

Then I heard part of Jiayi Weng’s interview. Roughly, he said: “I think the first profession to be replaced by AI is researcher. Next will be infra engineers like me. The hardest to replace is sales, because convincing someone to pay is not that easy for AI; it still needs human-to-human communication.”

That instantly cheered me up, because what we do is exactly communication and negotiation with people. This thing isn’t as hard as I imagined, and yet someone as senior as Jiayi Weng thinks it’s unlikely AI can do it well… I think one explanation is context.

In the interview, Jiayi Weng repeatedly emphasized the importance of context. He feels that his own work at OpenAI isn’t that hard and doesn’t require especially high IQ; if you swapped him out for someone else and gave them all of his context, they could do it too. I actually have exactly the same thought, and whenever I tell friends this, they all say I’m being too modest.

He believes that the biggest problem in teamwork is inconsistent context. For example, he writes some code and another person takes it over. The successor’s context is different from his, and problems appear. He thinks one of the age-old problems of human organizations is that it’s very hard to maintain consistency in context sharing across the org chart, which leads to bloated infra and organizational structures.

In his view, the biggest reason AI can’t replace people in the short term is also context. AI does not live in the same environment as humans; inside a company, the context it can access is far less than a human employee’s. So it’s hard for it to complete work fully autonomously.

For this reason, people at OpenAI tend to overestimate AI’s impact on humans. He said when strawberry (o1) came out, he thought that within a year or two AI would be able to clean up the “shit mountain” of infra code for him, but even today it still can’t. Technological change in the world is very, very slow and incremental.

To sum up, context is what matters most for both people and models. I’m sometimes amazed to see people copy code from Cursor into ChatGPT, then paste the modified code back into Cursor.

This is not an intelligence problem; it’s a context problem. Fei Xiaotong has a wonderful passage in From the Soil:

In the eyes of city dwellers, country folk are “stupid.” Of course we remember quite a few friends who advocated for rural work, who linked stupidity with illness and poverty, and took these as symptoms of China’s countryside. For illness and poverty, we seem to still have some objective standards we can talk about; but on what basis do we say rural people are “stupid”? A villager hears a car honking repeatedly behind him on the road, panics, and doesn’t know whether to dodge east or west. The driver slams the brakes, sticks half his head out the window, and spits out at the old peasant: “Idiot!”—if that’s what we call stupidity, it really wrongs them. Once I took students to the countryside. Corn was growing in the fields, and a young lady, posing as an insider, said: “The wheat is growing so tall this year.” A friend from the village standing nearby, though he didn’t spit at her, gave a slight smile that could easily be translated as “Idiot.” Country folk have never seen the ways of the city, so they don’t understand how to deal with cars. That’s a knowledge problem, not an intelligence problem—just like how city people go to the countryside and don’t even know how to shoo away a dog.

If Life Is a Game, Your Score Is the Number of People Who Remember Your Name

On how to evaluate a life, Jiayi Weng said that as early as his third year of high school he had this idea: “If life is a game, your score is the number of people who remember your name.” More concretely, it means doing things that are meaningful to others so that more people remember you. To that end, he did two things:

  • He created open-source projects like Tianshou (a reinforcement learning framework) and non-profit websites like tuixue (U.S. visa status checker).
  • He pursued getting his name on as many OpenAI technical reports as possible.

This is actually quite similar to me, and you could even say that many people in the competition scene end up with a similar life evaluation standard.

So when I was in school, I worked with classmates from the LUG (Linux User Group) to build a course review community and many online services. All of these were non-profit. Today, when undergrads from USTC contact me, about half of them mention the course review site.

During my PhD, my biggest regret is that several papers’ corresponding projects weren’t open-sourced because they had commercial value for Microsoft. To this day, only one research project I was involved in has been open-sourced—the AKG operator generator from Huawei. The main work I did at Huawei, Unified Bus, didn’t even have its project name disclosed until last year’s open standard release.

Many people doing quant trading fall into self-doubt after earning their first pot of gold, because quant is a “get rich quietly” game; unless you’re a big enough name, not many people will remember you.

I’m digressing. In the interview, the host raised a pointed question: You said at first that you wanted to break free from external evaluation standards like GPA and titles, and only pursue intrinsic reward (just being happy yourself). Now this “doing things meaningful for others so more people remember you” seems to be an external standard again, doesn’t it?

Jiayi Weng said this external recognition is not recognition by an existing evaluation system, but a kind of consensus—people genuinely, from the heart, giving you a thumbs-up.

He also said he keeps adjusting his evaluation criteria and doesn’t let them trap him—for example, he hasn’t done open-source projects in many years. He believes external evaluation systems are there for quick filtering of people and are hard to change in the short term, and that they should become more personalized.

The host also asked a few more sharp questions: Is OpenAI’s culture of secrecy at odds with your original intention of “breaking information asymmetry”? OpenAI initially promised to build AGI that benefits all of humanity; should OpenAI open source or stay closed source to better fit “benefiting all humanity”? If OpenAI open sourced, wouldn’t it get community feedback more easily and iterate faster?

Jiayi Weng said this is a trade-off: between doing open source and breaking information asymmetry on one hand, and doing the most impactful thing on the other.

On why OpenAI doesn’t open source, he sees it as a game theory problem. Two basic assumptions: (1) Training the best models requires a lot of money. (2) There will always be people who are extremely profit-driven.

If OpenAI open sourced its models, others could just take them, fine-tune them a bit, and capture all the profits. OpenAI wouldn’t make money or attract investment, and so could no longer afford to train the best models. By this game-theoretic logic, the best models can only be closed source.

He believes that under these two assumptions, “benefiting all of humanity” means letting every user use the best AI model.

If OpenAI had unlimited resources, he’d be very happy to open source the RL infra they’ve built over the last two or three years. He even discussed with John Schulman whether to open source it.

Infra Iteration Speed Is the Lifeline of Model Companies

In Jiayi Weng’s view, the lifeline of a base-model company is the iteration speed of its infra. DeepSeek’s internal infra is very good; its internal iteration is very fast. That’s what really spooked OpenAI. It’s not leaderboard numbers that woke OpenAI up—OpenAI stopped leaderboard chasing a long time ago.

Imagine plotting a curve with the number of iterations on the x-axis and success rate on the y-axis. The slope of this curve is critical.
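
This curve can be made concrete with a toy model (all names and numbers here are my own illustration, not from the interview): two teams sit on the same per-iteration learning curve, but one packs more iterations into the same wall-clock time.

```python
# Toy model of "iteration speed as the lifeline" (illustrative numbers only):
# both teams share the same diminishing-returns curve; only the iteration
# rate differs.

def success_rate(n_iterations: float) -> float:
    """Success rate vs. number of iterations: approaches 1.0, halving the
    remaining gap every 10 iterations."""
    return 1.0 - 0.5 ** (n_iterations / 10.0)

def iterations_in(days: float, iterations_per_day: float) -> float:
    return days * iterations_per_day

# A startup iterating 5x/day vs. a large org iterating 1x/day, over 30 days:
startup = success_rate(iterations_in(30, 5.0))   # 150 iterations
big_org = success_rate(iterations_in(30, 1.0))   # 30 iterations
print(f"startup: {startup:.3f}, big org: {big_org:.3f}")
```

On this toy curve the startup ends near saturation while the large org is still at 0.875; the gap comes entirely from the slope over wall-clock time, not from any difference in capability.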

Startups have low communication costs, small codebases, and only specific use cases to consider, so they iterate fastest. But as a company grows, it has to consider a wide variety of use cases, and iteration naturally slows. As he noted about teamwork, one of the age-old problems of human organizations is keeping context sharing consistent across the org chart, which leads to bloated infra and organizational structures.

Therefore, he thinks that if a model had unlimited context, the biggest application scenario would be acting as CEO, responsible for organizational context sharing.

Compared with LLM post-training, Jiayi Weng believes there is no fundamentally new challenge in Agent post-training; at root they are the same thing, and the only difference is the environment. But the cost of trial and error for Agents in their environments is very high.

For example, his own Infra Engineer role is not easy for AI to replace in the short term, for two reasons:

  1. AI infra work is almost entirely out of distribution relative to models’ training data;
  2. The cost of verifying (trial and error) in AI infra is very high.

I think that for Agents to be useful in high trial-and-error-cost niche areas like AI infra, the key is to solve RL’s sample efficiency problem. As works like Nested Learning show, if we can scale up few-shot in-context learning, the problem of Agents’ autonomous (continual) learning might be solvable. From this perspective, long context and RL are two paths to the same destination.

Jiayi Weng also thinks that new paradigms for RL and new paradigms for pretraining are both possible; every day comes with new challenges.

He said that inside OpenAI, people didn’t actually see ChatGPT as something revolutionary, because it was the result of gradual evolution, and they never expected ChatGPT to have such a huge impact. He thinks building a good model isn’t that hard; direction matters most. As long as you’re going in the right direction and do every single thing well and correctly, you can get there.

Right now, OpenAI’s bottleneck is infra throughput, i.e., how many bugs can be fixed per unit time. The current problem is that infra still has too many bugs, so they haven’t fully scaled up, and they are refactoring infra.

A Personal Take: The Boundary Between Infra and Applications Is Undergoing a Major Transformation

I’ll sneak in a bit of a personal take at the end: I think the boundary between Infra and applications is undergoing a major transformation, shifting from the OS to the LLM context.

Traditional Infra usually works on everything below the OS; the most important thing is to propose new OS abstractions (for example, UNIX). The names of the two top systems conferences (OSDI - Operating Systems Design and Implementation, SOSP - Symposium on Operating Systems Principles) are both centered on OS.

In the future, as LLM inference costs drop, LLMs will become the infrastructure for most applications. Applications will only need to care about the LLM context and won’t need to care about OS-level details.

I think SOTA models in 2025 have gradually started to turn this corner, but haven’t fully done so yet. For models from mid-2025 and earlier, if you ask them to write some text classification code, they’re very likely to write a bunch of rules full of edge cases. I have to be extremely careful to force them to call an LLM to do the text classification, to avoid them producing that kind of awful “spaghetti” code. But today’s SOTA models, with only a bit of prompting, will fluently use an LLM to do text classification.
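
As a sketch of what this looks like in code (everything here, including the labels and the `call_llm` placeholder, is my own illustration, not a real API): instead of a pile of keyword rules, the classifier becomes one prompt plus one model call.

```python
# Illustrative only: LABELS and call_llm are placeholders, not a real API.
LABELS = ["bug report", "feature request", "question"]

def build_prompt(text: str) -> str:
    """Put the classification task entirely into the LLM context."""
    options = ", ".join(LABELS)
    return (
        f"Classify the following text as exactly one of: {options}.\n"
        f"Reply with the label only.\n\nText: {text}"
    )

def classify(text: str, call_llm) -> str:
    """One model call replaces the pile of keyword rules and edge cases."""
    reply = call_llm(build_prompt(text)).strip().lower()
    # Minimal guard for an unexpected reply; real code might retry instead.
    return reply if reply in LABELS else LABELS[0]
```

Here `call_llm` stands for whatever client function sends a prompt to a model and returns its text reply. The point is exactly the boundary shift described above: the application code manages only the LLM context (the prompt and the label set), not classification logic.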

For people working in Infra, recognizing this transformation is extremely important. Previously, many of us were dedicated to improving programmability at the OS layer, optimizing performance on diverse workloads, and trying to strike some trade-off between performance and programmability. But if LLMs come to dominate in the future, then Infra only needs to serve this single LLM workload well; OS-level programmability and the performance of other workloads may no longer be that important. Of course, legacy systems will continue to exist for a long time, but just like the stock prices of NVIDIA and Intel, their relative importance will change.

This is why I suggest some Infra folks pay attention to Agents. How Agents should best use the LLM context today is like memory management in the 1970s: there are many possible solutions, and we are far from converging. But it’s clear that a good Agent is not something you design arbitrarily; there is some emerging consensus—for example, treating a coding agent and a file system as the foundation of all general-purpose agents, and using progressive disclosure like Claude Skills. The performance and programmability of Agents have once again become a new kind of trade-off. This is where the next UNIX might be born.

Supplement: On Determinism

Quite a few friends are interested in Jiayi Weng’s view that “the fate of the world/people can be predicted; God does not play dice.” Here are some of my personal thoughts:

Some propositions are relatively easy to predict (low perplexity), and some are very hard to predict (high perplexity). Things that involve the big picture and are determined by physical laws are generally easier to predict. Things that involve details and are determined by human nature are generally much harder to predict. When I discussed determinism with our CEO, he said: don’t ignore human nature; human nature is very complex and very hard to predict.

To give a not-necessarily-perfect example: if you ask whether there will be any follow-up between Li Xinye and Hua Shimei, that has low perplexity; but if you ask exactly when there will be follow-up, that has very high perplexity.

Fortune-tellers only make predictions about things with low perplexity. That’s why fortune-telling can “work”. In fact, the way fortune-telling works is very similar to today’s reasoning models: it’s essentially a logical reasoning framework (the Bagua and such are just symbols used in the reasoning), plus a predictive model that the master has constructed from their experience (a probabilistic model for each step of reasoning).
