Silicon Valley AI Observations: The Million-Dollar-Salary Model Wars and How Startups Survive
(This article is based on an invited talk I gave at the AWS re:Invent 2025 Beijing Meetup)
Click here to view Slides (HTML), Download PDF version
Thanks to AWS for the invitation, which gave me the opportunity to attend AWS re:Invent 2025. During this trip to the US, I not only attended this world-class tech conference, but was also fortunate enough to have in-depth conversations with frontline practitioners from top Silicon Valley AI companies such as OpenAI, Anthropic, and Google DeepMind. Most of the viewpoints were cross-validated by experts from different companies.
From the re:Invent venue in Las Vegas, to NeurIPS in San Diego, and then to AI companies in the Bay Area, more than ten days of intensive exchanges taught me a great deal. Mainly in the following aspects:
Practical experience of AI-assisted programming (Vibe Coding): An analysis of the differences in efficiency improvement in different scenarios—from 3–5x efficiency gains in startups, to why the effect is limited in big tech and research institutions.
Organization and resource allocation in foundation model companies: An analysis of the strengths and weaknesses of companies like Google, OpenAI, xAI, Anthropic, including compute resources, compensation structure, and the current state of collaboration between model teams and application teams.
A frontline perspective on Scaling Law: Frontline researchers generally believe that Scaling Law is far from over, which diverges from the public statements of top scientists such as Ilya Sutskever and Richard Sutton. Engineering approaches can address sampling efficiency and generalization issues, and there is still substantial room for improvement in foundation models.
Scientific methodology for application development: An introduction to the rubric-based evaluation systems that top AI application companies widely adopt.
Core techniques of Context Engineering: A discussion of three major techniques to cope with context rot: dynamic system prompts, dynamic loading of prompts (skills), sub-agents plus context summarization. Also, the design pattern of using the file system as the agent interaction bus.
Strategic choices for startups: Based on real-world constraints of resources and talent, an analysis of the areas startups should avoid (general benchmarks) and the directions they should focus on (vertical domains + context engineering).
I. Vibe Coding (AI Programming)
1. Polarized views on Vibe Coding
After in-depth discussions with practitioners from multiple Silicon Valley companies (including top AI coding startups, OpenAI, Google, Anthropic, etc.), we found a shared understanding: the effectiveness of Vibe Coding is highly polarized.
In some scenarios, efficiency gains can reach 3–5x, but in others, AI is almost useless or even counterproductive. The key factors are: the nature of the task, the type of organization, and the type of code.
Scenario 1: MVP development at startups (3–5x efficiency gain)
Why does it take off?
From 0 to 1 prototype development:
- The most important things are speed and finding PMF (Product Market Fit)
- No need to patch complex existing systems
- Code quality requirements are relatively low; the focus is on rapid validation
- New features can be shipped weekly or even daily for rapid experimentation
Relatively simple tech stack:
- Typically based on mainstream frameworks (React, Django, FastAPI, etc.)
- AI has abundant training data on these mainstream stacks
- Large amounts of boilerplate code can be generated directly
Small team, low communication overhead:
- No need for cross-department coordination
- Fast decision-making and execution
- Simple code review processes
Typical tasks:
- CRUD business logic
- Simple API development
- Frontend forms and pages
- Data processing scripts
Scenario 2: One-off scripts and CRUD code (applies to all companies, 3–5x efficiency gain)
Universally efficient scenarios:
One-off scripts:
- Data analysis scripts
- Data migration scripts
- Batch processing tools
- Disposable scripts with low quality requirements
Glue / boilerplate code:
- Configuration files
- Data transformation layers
- API call wrappers
- Test case generation
Why does it work well?
- Task boundaries are clear, with well-defined inputs and outputs
- No need for deep understanding of complex business logic
- Even if it fails, the blast radius is limited
- Even researchers at OpenAI and Google heavily use AI to write this type of code
Scenario 3: Daily development at big tech (limited efficiency gain)
Why are the gains limited here?
Coding is a small fraction of the job:
- Big tech engineers don’t spend most of their time writing code
- Time allocation: meetings 30%, negotiation/coordination 20%, documentation 20%, bug hunting 15%, coding 15%
- AI can only optimize the last 15% of the time
High system complexity:
- Requires deep understanding of existing architecture
- Involves code owned by multiple teams
- Must consider backward compatibility
- AI can easily introduce regressions
Strict code review:
- Multiple rounds of code review
- Must pass various linters and tests
- Long deployment pipelines
- Even if AI writes code quickly, the downstream processes are unchanged
Tasks suitable for AI:
- Repetitive work during refactoring
- Filling in test coverage
- Simple bug fixes
- Documentation generation
Scenario 4: Research code (almost not applicable)
Why can’t AI help?
Intellectually intensive code:
- Modifying model architectures (e.g., attention structures)
- Adjusting training algorithms
- Optimizing data mixtures
- You might only change 3 lines, but think about them for a long time
AI itself doesn’t understand:
- This is cutting-edge research, absent in the training data
- Requires deep theoretical background
- Requires innovative thinking
- AI currently cannot substitute for this
Highly customized:
- Every research project is unique
- There is no “standard practice”
- Requires a large amount of experimentation and debugging
How researchers use AI:
- Only for auxiliary scripts (data analysis, visualization)
- Not for core algorithms
- Keep the core logic for humans
Scenario 5: Core infrastructure code (not suitable, may be harmful)
Why be cautious?
Performance-sensitive:
- Extremely strict latency requirements
- Needs manual optimization
- AI-generated code is usually not optimal
High stability requirements:
- Any bug may cause large-scale outages
- Requires deep consideration of edge cases
- AI tends to miss corner cases
Security-critical:
- Authentication and authorization code
- Encryption and signing logic
- No room for oversights
Conclusion: This type of code is best written by humans.
2. Best practices for Vibe Coding (from frontline Silicon Valley teams)
How do you prevent AI from messing up your codebase? Summarizing the experience of multiple companies, we arrived at an engineering workflow:
2.1 PR line limit
Core principle: strictly cap a single PR at 500 lines
What this means for AI:
- Overly long context leads to hallucinations and logical breakdowns
- Clear task boundaries make it less likely for AI to “go off track”
- Reduces coupling risk between different modules
What this means for humans:
- 500 lines is roughly the cognitive limit for human code review
- Beyond this, review quality drops significantly
Specific thresholds (vary by scenario):
- Backend / core agent code: 500 lines
- Frontend React code: can be relaxed up to 2000 lines
- Test cases: can exceed 500 lines (e.g., generating 100 test cases)
How to achieve this:
- Human responsibility: do requirement breakdown and task decomposition in advance
- This is one of the “core human values” in Vibe Coding
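As a rough illustration, a CI gate enforcing these limits might look like the sketch below. The thresholds and path prefixes are assumptions based on the numbers above, not any specific company's configuration.

```python
import subprocess

# Hypothetical per-path limits on added lines, mirroring the thresholds above.
# "" is the default bucket (backend / core agent code); tests could be exempted entirely.
LIMITS = {"frontend/": 2000, "": 500}

def added_lines_by_file(base: str = "origin/main") -> dict[str, int]:
    """Count added lines per file in the current branch relative to the base branch."""
    out = subprocess.run(
        ["git", "diff", "--numstat", base],
        capture_output=True, text=True, check=True,
    ).stdout
    added = {}
    for line in out.splitlines():
        plus, _minus, path = line.split("\t")
        if plus != "-":  # "-" marks binary files
            added[path] = int(plus)
    return added

def check_pr_size() -> bool:
    totals: dict[str, int] = {}
    for path, n in added_lines_by_file().items():
        bucket = next((p for p in LIMITS if p and path.startswith(p)), "")
        totals[bucket] = totals.get(bucket, 0) + n
    ok = True
    for bucket, total in totals.items():
        limit = LIMITS[bucket]
        if total > limit:
            print(f"PR too large for '{bucket or 'backend/core'}': {total} added lines > {limit}")
            ok = False
    return ok

if __name__ == "__main__":
    raise SystemExit(0 if check_pr_size() else 1)
```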
2.2 Fully automated multi-agent collaboration workflow
A leading AI coding company’s fully automated process:
Step 1: Auto-trigger coding
- When an issue appears in production or a new feature is requested
- The system automatically creates a bug/feature ticket
- Automatically triggers a coding agent (such as OpenAI Codex or Claude Code) to start writing code
Step 2: “Three-court trial” review mechanism
- After code is generated, it is not submitted as a PR directly
- 3–5 different review agents are automatically launched to run in parallel
- These agents are configured with:
- Different foundation models
- Some are purchased third-party code-review services
Step 3: Automated testing
- Run the automated test suite simultaneously
- Including unit tests and integration tests
- All tests must pass
Step 4: Conditions for creating a PR
- Must satisfy:
- All 3–5 review agents deem it “problem-free”
- All automated tests pass
- Once satisfied: automatically create a PR and notify relevant people for review
Step 5: Human Final Review
- Human does the final approval
- Check business logic issues the AI might have missed
- Check whether it fits the overall project architecture
- Decide whether to merge
For simple problems: this workflow can be fully automated, with humans only doing the final confirmation
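A minimal sketch of this trigger → code → parallel review → test → PR pipeline is shown below. The functions `coding_agent`, `review_agent`, `run_tests`, and `create_pr` are hypothetical stand-ins for whatever agent APIs and CI hooks a team actually uses.

```python
from concurrent.futures import ThreadPoolExecutor

REVIEWERS = ["model-a-reviewer", "model-b-reviewer", "security-reviewer", "third-party-service"]

def coding_agent(ticket: dict) -> str:
    """Hypothetical: call a coding agent (e.g. Codex or Claude Code) and return a patch."""
    ...

def review_agent(reviewer: str, patch: str) -> bool:
    """Hypothetical: one review agent (a different model or a purchased service) votes pass/fail."""
    ...

def run_tests(patch: str) -> bool:
    """Hypothetical: apply the patch in a sandbox and run the unit + integration test suite."""
    ...

def create_pr(patch: str, ticket: dict) -> None:
    """Hypothetical: open a PR and notify the humans responsible for the final review."""
    ...

def handle_ticket(ticket: dict) -> None:
    patch = coding_agent(ticket)                        # Step 1: auto-triggered coding
    with ThreadPoolExecutor() as pool:                  # Step 2: 3-5 review agents in parallel
        votes = list(pool.map(lambda r: review_agent(r, patch), REVIEWERS))
    tests_ok = run_tests(patch)                         # Step 3: automated tests
    if all(votes) and tests_ok:                         # Step 4: PR only if everything passes
        create_pr(patch, ticket)                        # Step 5: human does the final approval
    # Otherwise, failures would be fed back to the coding agent for another round (not shown).
```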
2.3 Test-driven quality assurance
Core idea: Treat AI as a “not very reliable junior developer”
Comprehensive test cases
- There must be sufficient test coverage
- Test cases are the “safety net” for AI-written code
- Code without tests is not allowed to be casually modified by AI
Protection of test cases
- Don’t let the AI casually delete or modify test cases
- Only allow the AI to modify tests within designated scopes
- Prevent the AI from deleting tests just to “make tests pass”
Rubric-based Benchmark
- Establish internal evaluation standards (similar to the Evaluation discussed in Chapter 6)
- Different scenarios have different evaluation metrics
- Can score automatically to quickly verify the effect of changes
2.4 Handling large-scale refactors
For large refactors or new feature development that cannot be completed within 500 lines:
Human–AI collaboration workflow:
Human writes the Design Document first
- Detailed technical design document
- Clear architecture and module breakdown
- Define interfaces and data flow
Task decomposition
- Break down into multiple subtasks based on the Design Doc
- Keep each subtask within 500 lines
Division of labor between human and AI
- Core business logic: written by humans
  - Parts with extremely high performance requirements
  - Parts requiring deep understanding of the business
  - Parts involving complex architectural decisions
- Peripheral code: written by AI
  - Data transformation and formatting
  - Wrapping API calls
  - Repetitive CRUD operations
  - Test case generation
Iteration and integration
- AI writes the first version
- Human reviews and decides what to keep or modify
- Gradual integration with continuous testing
2.5 Code Ownership principles
Core principle: Everyone has clearly defined Code Ownership
II. Application development at top AI companies: the “scientific methodology”
From talking with the application development teams at Google DeepMind, OpenAI, and Anthropic, what struck me most was how scientifically and rigorously top AI companies build applications. They are not “trying out” prompts; they are measuring systems.
Although the three companies differ in specific practices, their core philosophy is highly consistent: data-driven, strictly evaluated, continuously iterated.
1. Rigorous Evaluation System
1.1 Core ideas of rubric-based evaluation
Top AI companies don’t ship applications based on “gut feeling”; they build a Rubric-based Evaluation System.
Launch criteria:
- The system automatically scores every metric for every case
- Before any version goes live, all metrics must meet certain thresholds
- If any metric fails, it must be specially approved by senior leadership with a clear follow-up remediation plan
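A minimal sketch of such a launch gate, assuming per-metric scores have already been produced by the automated scoring step (the metric names and thresholds are invented for illustration):

```python
# Hypothetical per-metric launch thresholds, averaged over the test dataset.
THRESHOLDS = {"factuality": 0.95, "instruction_following": 0.90, "formatting": 0.98}

def launch_gate(scores: dict[str, list[float]]) -> bool:
    """Return True only if every metric's average meets its threshold."""
    ok = True
    for metric, threshold in THRESHOLDS.items():
        avg = sum(scores[metric]) / len(scores[metric])
        status = "PASS" if avg >= threshold else "FAIL"
        print(f"{metric:22s} avg={avg:.3f} threshold={threshold:.2f} {status}")
        ok = ok and avg >= threshold
    return ok  # any FAIL means escalation to leadership plus a remediation plan

if __name__ == "__main__":
    # scores[metric] holds one score per evaluated case
    launch_gate({
        "factuality": [1.0, 1.0, 0.9],
        "instruction_following": [1.0, 0.8, 1.0],
        "formatting": [1.0, 1.0, 1.0],
    })
```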
2. Data flywheel and automated iteration
2.1 Construction and maintenance of test datasets
Dedicated owners for dataset construction:
- Each functional module has its own dedicated test dataset
- There are dedicated people responsible for constructing and maintaining datasets based on online bad cases
- Continuous updates: new issues discovered are added to the dataset
2.2 Fully automated evaluation workflow
Evaluation cycle:
- Usually takes several hours to run through a few hundred cases
- Submit the job at night, review results in the morning
Automated scoring:
- Use an LLM as the judge
- Each metric has its own prompt for evaluation
- Scoring is done according to predefined rubrics
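The per-metric judging step might look roughly like the sketch below; `call_llm` and the rubric texts are placeholders, not any company's actual prompts.

```python
import json

# Hypothetical rubrics: each metric gets its own judging prompt.
RUBRICS = {
    "factuality": "Score 0-1: are all claims in the answer supported by the provided context?",
    "tone": "Score 0-1: is the answer professional, concise, and free of speculation?",
}

def call_llm(prompt: str) -> str:
    """Placeholder for an actual LLM API call (OpenAI, Anthropic, Gemini, ...)."""
    raise NotImplementedError

def judge_case(case: dict, answer: str) -> dict[str, float]:
    """Score one (question, answer) pair against every rubric using an LLM judge."""
    scores = {}
    for metric, rubric in RUBRICS.items():
        prompt = (
            f"You are grading an AI answer.\nRubric: {rubric}\n"
            f"Question: {case['question']}\nAnswer: {answer}\n"
            'Reply with JSON only: {"score": <number between 0 and 1>, "reason": "<short>"}'
        )
        scores[metric] = float(json.loads(call_llm(prompt))["score"])
    return scores
```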
Human Eval:
- Have humans label data and evaluate results
- Achieving human labels that outperform an LLM judge is extremely costly
3. Division of labor between foundation model teams and application teams
Priorities of the model team:
The four major benchmarks:
- Math (mathematical ability)
- Coding (coding ability)
- Computer Use (computer operation)
- Deep Research
Long-term model capability improvements:
- Intelligence improvement
- General capability enhancement
- These are considered most important for the model’s long-term development
Vertical domain needs:
- Very low priority
- Basically won’t get a response
Common feature: application teams can hardly influence model teams
Important insight: application development inside foundation model companies ≈ application startups on the outside
They face the same constraints:
- Use the same foundation model APIs
- Cannot influence model training directions
- Cannot request targeted optimization
Two advantages for app teams inside foundation model companies:
- They can more easily get model teams to review prompts and improve context engineering. But startups also have opportunities to talk to foundation model engineers. For example, AWS re:Invent offers face-to-face discussions with Anthropic; at the Anthropic booth their engineers gave a lot of concrete suggestions for Agent optimization.
- Token costs are internally priced, much cheaper than external API calls. For example, Cursor’s API cost is much higher than the subscription fee for Claude Code.
Advantages for external startups:
- Can use multiple models: Google internally can only use Gemini
- Can build mashup agents: mixing OpenAI + Anthropic + Gemini
- More flexible technology choices
III. The many faces of Silicon Valley giants: strengths, weaknesses, and inside stories
1. Google DeepMind: four strengths and two soft spots
Strength 1: Founders’ strong commitment + powerful organizational capability
Sergey Brin’s return:
- After ChatGPT launched, Google cofounder Sergey Brin personally returned to the company
Demis Hassabis’s leadership:
- DeepMind CEO, extremely strong in both technology and management
- Can unite thousands of smart people
- Avoids severe internal friction and office politics
- Although the merger caused some frictions, overall the team has strong cohesion
Comparison with other companies:
- Meta: Zuckerberg doesn’t personally oversee AI, delegating to Alexandr Wang, who also doesn’t manage the details and delegates to teams
- Microsoft, Apple: senior leadership has limited understanding of AI technical details
Why was the Gemini App merged into DeepMind? Building applications is essentially research:
- Requires a scientific methodology
- Requires a large amount of experimentation and data-driven decision-making
- Requires a complete evaluation system
- This mindset is consistent with doing research
Strength 2: Overwhelming compute resources
TPU + GPU dual track:
- Self-developed TPUs with ongoing production capacity
- Years of accumulated Nvidia GPU purchases
- Total compute may be several times that of OpenAI
Advantages in model scale:
- OpenAI’s main models: GPT-4o series, hundreds of billions of parameters
  - Although they’ve evolved several generations, parameter counts haven’t increased dramatically
  - Mainly use test-time scaling (chains of thought) to improve capabilities
- Google’s models:
  - Gemini 2.5 Pro / 3 Pro: trillions of parameters, an order of magnitude larger than OpenAI’s main models
  - Gemini 3 Flash’s parameter scale is roughly the same as GPT-4o and much larger than Gemini 2.5 Flash
Why doesn’t OpenAI use larger models?
- Insufficient compute: both training and serving require huge resources
- Too many users: ChatGPT is still the flagship of the AI industry, with over 1 billion users and massive API usage
- Gemini has about 600 million users and around 1/5 of OpenAI’s API traffic
- So Gemini can “indulgently” use larger models
Strength 3: Abundant human resources
Typical example: comparison of image generation model teams
Nano Banana Pro (Gemini’s image generation model):
- Algorithm team: fewer than 10 people
- Data + infra team: nearly 1,000 people
OpenAI’s corresponding model:
- Algorithm team: fewer than 10 people
- Data + infra headcount: an order of magnitude smaller
Where does having more people help?
- They can construct large amounts of training data for specific scenarios
- For example: schematics, 3×3 image grids, etc.
- These all require manual labeling and data construction
- Today’s foundation models still heavily rely on human-constructed data; internet data alone is far from enough
Strength 4: Natural advantages from ecosystem entry points
Browser (Chrome):
- Gemini button directly integrated in the top-right corner
- Better experience than the ChatGPT Atlas and Perplexity Comet browsers
- Can directly ask questions about the current page and summarize long articles
Workspace integration:
- Google Calendar: can let Gemini schedule calendar events
- Google Drive: can let Gemini read and write documents
- Gmail: can let Gemini handle emails
- These are all natural user bases
YouTube data:
- Video data accumulated over many years
- A precious resource for multimodal training
Search engine:
- Google Search directly shows AI Summary
- Becomes a new traffic entry point
Disadvantage 1: Big-company efficiency problems
Cumbersome processes:
- Building datasets and evaluations takes a huge amount of time
- Product iteration speed is far behind startups
- In comparison: even OpenAI (a mid-sized company) iterates faster than Google
How internal employees feel:
- What they worry about most is the speed of OpenAI and startups
- They do have more resources, but “it’s hard to turn a big ship around”
Disadvantage 2: Only optimizing general needs, not vertical domains
Priorities of the model team:
- General benchmarks
- Improving model intelligence and long-term capabilities
- Basically no bandwidth left for vertical-domain needs
What does this mean for startups?
- Window of opportunity: Vertical domains are where startups have a chance
- Google won’t come to grab this: For now, specialized scenarios still require a lot of context engineering work; you can’t solve it with a generic data flywheel
- Technical parity: The models we use are the same ones the Gemini App team uses
- Startups may even have unique advantages: They can flexibly choose and mix SOTA models from different vendors (OpenAI + Anthropic + Gemini). For different scenarios and different subagents they can use the most suitable model, and can even have multiple models review each other to improve reliability. But internal Google teams can only use Gemini, and OpenAI teams can only use GPT. This limitation actually leads to foundation-model companies building Agents that are weaker than those built by external teams.
2. OpenAI: anxiety, academic baggage, and resource constraints
Sources of anxiety
- Strong brand, heavy pressure:
- They are the “flag bearer” of AI, with the highest brand recognition
- But internally they are very anxious about Google catching up
- Google has the advantage in resources, compute, and data
Talent-structure issues
They hired too many people with academic backgrounds:
Conflict between academic mindset and engineering mindset:
- Academic people want to validate their academic ideas
- But these ideas may not be practical from an engineering perspective
- This absolutely would not happen at xAI; Elon Musk would not allow it
Contradictions in resource allocation:
- In 2024, Ilya Sutskever and other core scientists left
- Main reason: they believed Sam Altman was allocating resources unfairly
- Too many resources were used to serve online users, leaving insufficient resources to train new models
Compute-resource dilemma
OpenAI’s double bind:
- Most users: More than 1 billion users
- Limited compute: Not as well-resourced as Google
- Must trade off: Balance between training and serving
Compromises in user experience:
- In early ChatGPT Plus ($20/month), they brutally truncated context
- Context window was only 32k tokens
- This caused severe hallucinations: earlier context got dropped, and what came after was nonsense
- My own experience: upload a book and ask it to summarize; the first few pages were fine, then everything after was hallucination
- So I later stopped subscribing and just called the API directly
Controversy over small/large model routing:
- GPT-5 did automatic routing, using small models for small questions
- But routing was inaccurate, and important questions were sent to small models
- Users couldn’t see which model was responding
- Experience degraded, triggering many complaints
OpenAI’s strength: Codex
- Codex really is excellent:
- Very strong code-generation capabilities
- Optimized heavily for real usage scenarios
- Not just tuned for benchmarks
3. xAI (Elon Musk): extreme intensity and zero-tolerance results orientation
“No Research Engineer, only Engineer”
Elon Musk’s declaration:
- We have no Research Engineers, only Engineers
- Meaning: don’t do academic research, only deliver engineering results
In contrast with OpenAI:
- OpenAI has many people with academic backgrounds burning resources to try ideas from papers, causing engineering delivery to be delayed
- xAI absolutely does not allow this
Work intensity: everyone 70+ hours/week
Culture of “never let machines wait for humans”:
- It’s normal for engineers to work more than 70 hours per week
- To avoid wasting training time:
- Get up at midnight to check loss curves and various metrics
- If problems are found, resubmit immediately
- Otherwise, another dozen hours are wasted the next day
Compared with other companies:
- OpenAI, Anthropic, DeepMind: Core teams generally work 60+ hours/week
- xAI: Everyone 70+ hours/week, and must come to the office every day
- Non-core Google departments: Go to the office 3 days a week
4. Anthropic: focused on coding and agents
Strategic focus:
- Top priority: Coding
- Second priority: Agents (including Computer Use)
Among the “big three” and xAI, Anthropic has the strongest research atmosphere. For example, Anthropic has a classic series of blogs on interpretability.
Anthropic has done the best job in building a developer community: From technical output to hands-on support, Anthropic has shown genuine respect for and willingness to help developers.
Leading technical innovations:
Claude Skills:
- Although it appears as a Claude feature, in essence it’s a general technique
- Other models can learn from it and implement similar mechanisms
- Demonstrates best practices in context engineering
Artifacts:
- An interaction pattern first proposed by Anthropic
- Now an industry standard
- Many products are learning from and imitating it
Claude Code:
- Currently the most useful coding agent
- Performs most stably in real-world development
- Context engineering is executed most thoroughly
Rich public resources:
- Large number of public documents and tutorials on context engineering
- Technical blogs and paper sharing
- Developer guides and cookbooks
Practical support for startups:
Anthropic is very willing to help startups optimize context engineering; many startups have channels to get direct help from them.
Personal experience:
At the Expo at AWS re:Invent, we spent an entire afternoon talking at Anthropic’s booth. Anthropic engineers gave a large amount of very concrete context-engineering optimization advice tailored to our specific business scenarios:
- How to design an evaluation system
- How to design a sub-agent architecture
- How to optimize system prompts
- How to leverage the Skills mechanism
This kind of hands-on technical support is extremely valuable for startups. They don’t just provide tools; they are also willing to help developers use the tools well.
5. Compensation and the talent war: AI’s “arms race”
The truth about sky-high annual packages
Top fresh PhDs:
- Total comp: $1.5–2 million per year
- Condition: relatively strong research performance
- Mostly options, but OpenAI is already big enough that they can be cashed out
Algorithm engineers with some experience at top AI companies:
- Total comp: $4–5 million per year
- Condition: some experience at a top AI company, or some well-known academic work
Meta Super Intelligence–level top experts:
- Total comp: more than $10 million per year
- Mainly Meta stock (can be cashed out)
- The “$100 million hiring” headlines are real
First wave of salary hikes:
- Meta’s Super Intelligence team started aggressively poaching
- This pushed up salary levels across the entire market
Salary gap between AI and non-AI
3–4x gap for engineers at the same level:
- Non-AI engineers: $250–300k/year (normal level at Google)
- AI engineers: $1M+/year
- The gap is 3–4x
What does this mean?
- Knowing AI vs not knowing AI makes a world of difference in pay
- The industry has already become highly stratified
Massive marketing spend
Coming out of SFO, for dozens of kilometers along the highway, the hundreds of billboards are almost all from AI companies
Traditional companies like Snowflake are also telling AI stories
Interesting ad:
- Redis: “My boss really wants you to know we’re an AI company”
- All companies are trying to lean into AI
The AI talent war: similar to the group-buying wars, but with a completely different playbook
Then: group-buying wars vs now: AI wars
Group-buying wars (internet era):
- Money went into: hiring operations, doing offline promotion, running ads
- Core: fighting for market share, merchants, and users
- Manpower-heavy tactics; whoever had the larger operations team won
AI wars:
- Money goes into: hiring top talent + buying GPUs
- Core: training models, competing on compute, fighting for talent
- Elite tactics; whoever has the stronger research team wins
Comparison of the order of magnitude of resource investment:
- Core Research teams at base model companies:
- Compute per capita: 500–1000 GPUs/person
- Compensation per capita: Over 1 million USD/year
- People’s compensation and GPU costs are in the same order of magnitude
The resource dilemma of base model companies: they can only do “big things”
More resources = must do the most impactful things:
Each person’s opportunity cost is extremely high:
- They must work on general capabilities that affect hundreds of millions of users
- They cannot be assigned to small, vertical-domain scenarios
Why not work on vertical domains?
- Vertical markets are small; the ROI doesn’t add up
- Having people with multi-million salaries work on a single vertical industry? The return on investment is too low
They can only focus on general capabilities:
- Math, Coding, Computer Use, Deep Research, etc.
- These are foundational capabilities that affect all users and are the only things worthy of such resource investment
Conclusion: this is the opportunity window for startups
6. Strategic insights for startups: how to survive between giants?
Insight 1: Don’t confront base model companies head-on
Areas startups should absolutely avoid:
- General-purpose Coding Agent
- General-purpose Deep Research
- General-purpose Computer Use
Why will startups lose?
You can’t afford the people:
- People who truly understand model training have extremely high salaries
- Startups typically raise only several million to tens of millions of USD
- Hiring just a few such people will burn all the money
Insufficient compute:
- Training a general model, even post-training, requires hundreds to thousands of GPUs
- Startups cannot afford to rent that many GPUs
Insufficient data:
- General capabilities require massive amounts of high-quality data
- Big companies have ecosystem advantages (YouTube, Web Search)
- Startups can’t get access to this data
Conclusion: Unless you start out with exceptionally deep pockets, don’t touch general-purpose domains
Insight 2: Don’t lightly touch model training
The threshold for model training is extremely high:
People who truly understand models are too expensive:
- These people are all in big companies and won’t easily leave
- Startups simply can’t poach them
People at a medium level cannot build competitive products:
- Model training needs a lot of trial and error
- Newcomers burn through money easily before a usable model is trained
The dilemma of open-source models + fine-tuning:
- Open-source models lag behind closed-source by 2 generations (roughly 6–12 months)
- Leading performance on benchmarks for open-source models does not mean leadership in real production environments
- Closed-source models often have internal benchmarks and are heavily optimized for real user scenarios
- Fine-tuning on top of open-source models can hardly bridge this gap
- Unless it’s an extremely niche vertical domain
When can you consider training?
- Only for small models targeting specific scenarios (e.g., 8B/32B)
- The domain is niche enough that general models are inadequate
- You have a clear data advantage
Insight 3: The optimal talent strategy for startups
Core principle: hire smart people with strong learning ability but no AI background
Why is this strategy effective?
The AI field moves too fast; compounding effects are weak:
- AI best practices change every 3–6 months
Newcomers can reach the frontier quickly:
- Smart + strong learning ability + willing to dive deep
- Can reach mid-to-upper industry level within 6–12 months
- No PhD or big-tech background required
Huge cost advantage:
- 5–10x cost difference
Insight 4: What should startups do? Vertical domains
Core strategy: Go Vertical (focus on vertical domains)
Giants will not optimize for every niche segment; this is the opportunity window for startups. You need to build three core capabilities:
1. Professional Context Engineering
This is not simple work; it requires very specialized skills:
- Giants will not invest so much effort into every vertical niche
- Requires deep domain knowledge
- Requires repeated refinement with customers (Customer Iteration)
- These are things big companies are unwilling or unable to do
2. Build a Domain Knowledge Base
General-purpose LLMs don’t understand your industry; your knowledge base is a competitive advantage:
- Data and knowledge accumulation takes time
- Competitors cannot easily replicate it in the short term
- Over time, the advantage compounds
3. Build a feedback data flywheel around real business scenarios
You must think about the data flywheel from day one; it determines how far you can go:
- Capture users’ real interactions and feedback
- Learn from failures in production environments
- Use real data to optimize prompts and the knowledge base
- Continuously improve Agent performance
Insight 5: Maintain your mindset and wait for your Wave
Don’t be anxious about the sky-high salaries of AI talent
Seeing AI talent with annual packages in the millions or even tens of millions of USD can easily cause anxiety and frustration. But we need to view this rationally:
Every person and company is at a different stage of development:
- Why can they command such high salaries?
- Because they caught this Wave
- They did the right things at the right time
- Their accumulated skills happen to match current demand
The key is: Stay Relevant
When there is no Wave, don’t give up:
- Steadily lay a solid foundation
- For example, don’t assume that pre-LLM experience in CV/NLP is useless in the LLM era
- Don’t lie flat just because you don’t see immediate opportunities
Prepare your team:
- Build a team with strong engineering ability, strong learning ability, and strong cohesion
Stay Relevant:
- First Principle Thinking
- Closely track cutting-edge products and research
Stay Ahead of the Curve:
- Anticipate trends in models and products
- When the Wave comes, you’re already prepared
Wait for and seize your Wave:
- When the opportunity comes, can you quickly build a product and rapidly go to market?
- When user growth explodes, have you already built your data flywheel? ChatGPT, Gemini App, and Cursor all have data flywheels.
7. The work routine in the Silicon Valley AI scene
Core teams generally work at high intensity
Weekly working hours:
- OpenAI, Anthropic, DeepMind core teams: 60+ hours
- xAI: 70+ hours
- Google non-core departments: <30 hours (slacking off)
Why is it so intense?
- Model training cycles are very long
- Submitting a task may crash after several hours
- If you don’t handle it in time, you lose another dozen hours
- So everyone cares a lot about not wasting machine time
A typical day
Around 5–6 pm:
- Before leaving work, submit a training task
- Go home for dinner
Midnight (12 am):
- Wake up and check the result
- If it’s crashed, quickly adjust
- Possibly work until 2–3 am
- Resubmit
Next morning:
- Go to the office and check the result
- If you didn’t fix it last night, another whole night was wasted
Cultural differences between companies
Anthropic:
- Only requires 25% of the time in the office
- But the workload is heavy, actual workweek is above 60 hours
xAI:
- Must come to the office every day
- Extreme execution and discipline
Google non-core departments:
- Only work 3 days a week
- But AI core teams are nothing like this
8. Scaling Law: cognitive divergence between frontline researchers and top scientists
After talking with frontline researchers at several top AI companies, I noticed an interesting phenomenon: frontline researchers at top companies generally believe the Scaling Law has not ended, which is in clear contrast with public remarks by top scientists like Ilya Sutskever, Richard Sutton, and Yann LeCun.
Why this divergence?
Frontline researchers believe that while Ilya, Sutton, and LeCun are titans of academia, they are relatively removed from frontline engineering practice. The issues they point out are indeed very important:
- The sampling efficiency problem in RL
- The generalization problem of models
But the attitude of frontline researchers is: all these problems have engineering solutions.
How do engineering methods solve these problems?
1. Poor sampling efficiency → make up for it with compute; brute force works wonders
- RL’s sampling efficiency is indeed much lower than supervised learning
- But with enough compute, brute-force sampling still works
- This is one of the reasons top companies pour money into buying GPUs
2. Poor generalization → manually construct domain data + RL environments
Summary from frontline researchers:
- Midtrain / SFT: Manually construct high-quality domain data for continued training
- Domain datasets: Collect and label data for specific scenarios
- Sim Env (simulation environment): Build a simulation environment so the model can learn in a controllable setting
- Rubrics-based Reward: Reward mechanisms based on detailed rubrics instead of simple binary feedback
This methodology can solve problems in a large number of practical domains. It’s not a “silver bullet”, but it does work in engineering practice.
Base models still have plenty of room for improvement
Frontline researchers’ consensus: whether it’s pretrain or post-train, no one sees a ceiling yet.
- Release cadence: Currently one major release every six months
- Capability gains: Each version shows a clear capability leap
- Expectation: There is still huge room for improvement; no need to be overly pessimistic
Idealized continual learning really is still at the research stage
The kind of non-intervention, autonomously continual learning described by Ilya and Sutton is acknowledged by frontline researchers to still be at the research stage; there is no such ideal method yet.
But the key point is: engineering approaches all work in practice. They require human intervention and lots of engineering investment, but they produce real results.
On concrete details, different companies’ understanding diverges a lot
Interestingly, on specific technical details, people at different companies don’t always agree; some have figured things out, some haven’t yet:
Example 1: Learner–Sampler mismatch in RL
- Some researchers believe: This problem significantly affects training stability for larger models and needs special handling
- Other researchers believe: They haven’t encountered many such problems in practice, and the impact is small
Example 2: Linear Attention / Hybrid Attention
- Some researchers believe: Linear Attention degrades the base model’s CoT ability and instruction-following ability, doesn’t scale well, and is not recommended
- Other researchers believe: Linear Attention layers force the model to compress and extract knowledge from the context; compressing into a compact representation is itself a process of learning knowledge, so for the model’s overall in‑context learning ability it is not only harmless but actually helpful
Shared understanding: validate on small models → then scale up
Although they differ on specifics, frontline researchers share a strong consensus:
A technique that works on small models does not necessarily work on large models.
Typical experimental path:
- 4B/8B scale: First validate that the idea is feasible on small models
- GPT-OSS scale (e.g., GPT-OSS 120B A20B): Scale to medium size to see whether it still works, and whether it works on MoE models
- Production scale: Finally scale to the largest production models
This is also why top companies need so much compute: not only to train large models, but also to train a large number of small models for experiments.
IV. Core technical practices: context engineering and file systems
1. Context Engineering
This is the technical direction Anthropic’s team emphasizes most, and it was also the core topic of our in‑depth conversations with Anthropic experts at re:Invent.
Definition: Context engineering is the discipline of optimizing the utility of tokens against the inherent constraints of LLMs.
1.1 The full framework of context engineering
Anthropic has proposed a systematic context engineering framework with four core dimensions:
Dimension 1: System prompt
- Core principle: “Say less, mean more”
- Use the fewest but most precise instructions
- Use clear, simple, and direct language
- Structure content
- Choose an appropriate level of abstraction (not too rigid, not too vague)
Dimension 2: Tools (tool design)
- Core principle: “Every tool earns its place”
- Self-contained (independent), non-overlapping, purpose‑clear
- Well‑defined parameters
- Concise and clear descriptions
- Clear success/failure modes
Dimension 3: Data retrieval
- Core principle: “Load what you need, when you need it”
- JIT context (Just‑In‑Time context)
- Balance preloading and dynamic fetching
- Agents can autonomously retrieve data
- Well‑designed retrieval tools
- “Don’t send the entire library. Send a librarian.”
Dimension 4: Long-horizon optimizations
- Core strategies:
- Compaction strategy for history
- Structured note‑taking
- Use of sub‑agent architectures when appropriate
1.2 Paradigm shift in data retrieval
Old approach: pre-loading (traditional RAG)
- Preload all potentially relevant data
New approach: Just‑In‑Time loading
This is one of the most important shifts in context engineering, with three core strategies:
Strategy 1: Lightweight identifiers
Principle:
- Pass IDs instead of full objects
- The agent requests details only when needed
Example:
- Pass user_id: "12345" instead of the full object
- When needed, the agent calls get_user() and obtains the full user profile
Strategy 2: Progressive disclosure
Principle:
- Start from summaries
- Let the agent drill down as needed
Example:
- File list → file metadata → file contents
Strategy 3: Autonomous exploration
Principle:
- Agentic search: provide discovery tools rather than dumping data
- The agent autonomously navigates the information space
Example:
- search_docs() + read_doc(detail_level), instead of loading all documents
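Taken together, the three strategies can be expressed as a pair of retrieval tools the agent calls on demand. This is a hedged sketch with a toy in-memory document store; the function names mirror the example above but everything else is illustrative.

```python
# Toy in-memory store: the agent only ever sees IDs and summaries until it asks for more.
DOCS = {
    "doc-001": {"title": "Q3 incident review", "summary": "Outage post-mortem ...", "body": "<full text>"},
    "doc-002": {"title": "Agent design notes", "summary": "Sub-agent patterns ...", "body": "<full text>"},
}

def search_docs(query: str) -> list[dict]:
    """Agentic search: return lightweight identifiers plus one-line summaries, never full bodies."""
    return [
        {"id": doc_id, "title": d["title"], "summary": d["summary"]}
        for doc_id, d in DOCS.items()
        if query.lower() in (d["title"] + " " + d["summary"]).lower()
    ]

def read_doc(doc_id: str, detail_level: str = "summary") -> str:
    """Progressive disclosure: drill down from summary to full body only when needed."""
    doc = DOCS[doc_id]
    return doc["summary"] if detail_level == "summary" else doc["body"]

# The agent calls search_docs() first and read_doc(id, "full") only for the
# one or two documents that actually matter, instead of loading everything.
```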
1.3 Context window and context rot
Limits of the context window:
- All frontier models have a maximum token limit per interaction
- Anthropic’s context window is 200k tokens
What is context rot?
As context grows, output quality degrades. Main causes include:
- Context poisoning: Conflicting information breaks reasoning
- Context distraction: Irrelevant information diverts attention
- Context confusion: Similar items become blurred
- Context clash: Instructions contradict one another
Research conclusion: All models show performance degradation on long contexts (Chroma Technical Report: Context-Rot: How Increasing Input Tokens Impacts LLM Performance)
1.4 Three strategies for long-horizon tasks
When a task exceeds the capacity of the context window, you can use:
Strategy 1: Compaction
Approach:
- Periodically summarize intermediate steps and/or compress history
- Reset the context with the compressed summary
- Retain only key information
Trade-off:
- Sacrifice some detail in exchange for continued operation
Example:
- “The user wants X, we tried Y, we learned Z” vs. the full conversation
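A minimal sketch of compaction over a generic chat-style message list, assuming a `summarize` helper backed by whatever model you use:

```python
def summarize(messages: list[dict]) -> str:
    """Placeholder: ask the model to compress history into goal, attempts, learnings, current state."""
    raise NotImplementedError

def compact(messages: list[dict], keep_last: int = 5, max_messages: int = 40) -> list[dict]:
    """When the history grows too long, replace old turns with a compressed summary."""
    if len(messages) <= max_messages:
        return messages
    head, tail = messages[:-keep_last], messages[-keep_last:]
    summary = summarize(head)  # e.g. "The user wants X, we tried Y, we learned Z"
    return [{"role": "user", "content": f"Summary of earlier work:\n{summary}"}] + tail
```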
Strategy 2: Structured memory / note‑taking
Approach:
- The agent maintains explicit memory artifacts (externally persisted storage)
- Store “working notes”: decisions, learnings, state (in structured formats)
- Retrieve on demand instead of keeping everything in context
Example:
- Decision logs
- Key findings documents
Strategy 3: Sub‑agent architectures
Approach:
- Break complex tasks into specialized agents
- Each sub‑agent has a focused, clear, narrow context
- The main agent orchestrates and synthesizes results
Example:
- A code‑review agent spawns a doc‑checker sub‑agent
1.5 How the skills mechanism works
What are skills?
Skills are organized folders containing instructions, scripts, and resources that Claude can dynamically discover and load.
Implementation of progressive disclosure:
- Entry point: pdf/SKILL.md (the skill’s main file)
- Discovery: Claude navigates and discovers more detail as needed
- Executable scripts: Token‑efficient, and provide deterministic reliability for operations that are better done with traditional code
Application of Skills in Different Products:
- Apps: Automatic invocation for the best user experience
- Developer Platform: Deploy to products via the Code Execution API
- Claude Code: Automatic invocation, suitable for developer workflows
1.6 Tool Design Best Practices
Elements of good tool design:
- Simple and precise tool names
- Detailed and well-formatted descriptions
- Include what the tool returns, how to use it, etc.
- Avoid overly similar tool names or descriptions
- Tools work best when they perform a single operation
- Try to keep nesting to at most 1 level of parameters
- Provide examples – expected input/output formats
- Pay attention to the format of tool results
- Test your tools! Ensure the Agent can use them well
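To make these points concrete, a tool definition in the JSON-schema style that most tool-calling APIs accept (exact field names vary by vendor) might look like this; the tool itself is made up:

```python
# Hypothetical tool spec: simple name, detailed description, flat parameters.
GET_INVOICE_TOOL = {
    "name": "get_invoice",  # simple, precise name; performs a single operation
    "description": (
        "Fetch one invoice by its ID. Returns JSON with 'id', 'customer', 'amount_usd', and "
        "'status' ('paid' | 'open' | 'void'). Fails with INVOICE_NOT_FOUND if the ID does not "
        "exist. Example: get_invoice(invoice_id='inv_123')."
    ),
    "input_schema": {  # flat parameters, at most one level of nesting
        "type": "object",
        "properties": {
            "invoice_id": {"type": "string", "description": "Invoice ID, e.g. 'inv_123'."}
        },
        "required": ["invoice_id"],
    },
}
```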
1.7 Practical Cases of Context Engineering
Case 1: Deep Research Agent
Problems:
- Needs to read a large number of documents
- Requires multiple rounds of search and analysis
- Context easily explodes
Solution (applying Progressive Disclosure):
- Stage 1 (Search): Main Agent plans the search strategy
- Stage 2 (Analysis): Sub-Agent analyzes each document
- Receives only document IDs and analysis goals (Lightweight Identifiers)
- Reads the full document only when needed
- Stage 3 (Synthesis): A new Agent synthesizes all analysis results
- Reads all analysis files written by Sub-Agents (Structured Memory)
- Produces the final report
Case 2: Coding Agent
- Applying the Context Engineering framework:
- System Prompt: Clear coding standards and architectural guidelines
- Tools: ls, read, write, edit, grep, bash and other self-contained tools
- Data Retrieval: Load files on demand via grep/search instead of preloading the entire codebase
- Long Horizon: Use the file system to record decisions and state; Sub-Agents handle independent file modifications
1.8 Benefits of Effectively Establishing and Maintaining Context
| Challenge | Solution | Benefit |
|---|---|---|
| Handle context window limits | Compaction, Sub-Agents, Structured Memory | Reliability |
| Reduce context rot | Progressive Disclosure, JIT Loading | Accuracy |
| Optimize for prompt caching | Structure context properly | Cost & Latency |
1.9 Context Engineering Strategies for Different Models
Claude (Anthropic):
- ✅ Prefer using the Skills mechanism
- ✅ Dynamic prompt loading works best
- ✅ Training is specifically optimized for dynamic context update scenarios
- ✅ Supports Progressive Disclosure
GPT (OpenAI) and Gemini (Google):
- ❌ Dynamic loading not recommended (poor results)
- ✅ Define all rules in the initial System Prompt
- ✅ Use Sub-Agents to isolate tasks
- ✅ When new rules are needed, summarize + create a new Agent
- ✅ But strategies like Progressive Disclosure and JIT Loading are still applicable
General advice:
- Make full use of each model’s characteristics
- Do not rigidly apply other models’ best practices
- The four-dimensional framework (System Prompt, Tools, Data Retrieval, Long Horizon) is universal
- Choose the right model based on the specific scenario
2. File System as the Agent’s Interaction Bus
Anthropic’s core view: the Coding Agent is the foundation of all general-purpose Agents.
In a classic article on context engineering, Manus also points out that the file system is at the core of Agents.
2.1 Why Use a File System?
Problems with a single Tool Call outputting a large amount of content:
Unstable:
- A single Tool Call outputs hundreds of lines
- If it’s interrupted midway, all prior work is wasted
- Cannot resume
Non-iterable:
- Output is just output, it cannot be modified
- To change a section, you must regenerate everything
- Cannot support a “draft–revise–finalize” workflow
Advantages of a file system:
Persistence:
- Once written to a file, the content is saved
- Even if the Agent crashes, the file remains
- You can resume work
Iterability:
- You can read a file and modify part of it
- You can revise multiple times, improving it step by step
- Just like how humans write documents
Generality:
- ls: list files
- read_file: read contents
- write_file: write contents
- edit_file: modify contents
- delete_file: delete files
- All models understand these operations
2.2 Coding Agent as a Foundational Capability
Why is Coding considered foundational?
Broad sense of Coding:
- Not just writing Python/Java code
- Includes writing documents, reports, and any structured content
- The core is: reading, writing, and editing files
Deep Research case:
- Needs to generate a long research report
- Wrong approach: Output the entire report in one Tool Call
- Thousands of lines, easy to get interrupted
- Cannot be modified after it’s written
- Right approach: Use the file system
- First write Chapter 1 -> write_file("report.md", ...)
- Research more information
- Write Chapter 2 -> append_file("report.md", ...)
- Discover Chapter 1 needs revision -> edit_file("report.md", ...)
All mainstream models have strong Coding capabilities
2.3 Information Passing Between Main Agent and Sub-Agent
Why use a file system instead of function calls?
Limitations of function calls:
- Need to serialize complex data structures
- String length has limits
- Not flexible enough
Advantages of a file system:
- Main Agent: Writes task descriptions and input data to files
- Sub-Agent: Reads files, performs tasks, writes results back to files
- Main Agent: Reads result files and proceeds to the next step
- Clear input-output boundaries, just like Unix pipelines
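A minimal sketch of this file-system-as-bus pattern, using the Deep Research case described earlier; `run_sub_agent` is a stand-in for however sub-agents are actually launched:

```python
import json
from pathlib import Path

WORKDIR = Path("agent_workspace")

def run_sub_agent(task_file: Path, result_file: Path) -> None:
    """Placeholder: launch a sub-agent whose only contract is 'read this file, write that file'."""
    raise NotImplementedError

def main_agent(doc_ids: list[str]) -> str:
    WORKDIR.mkdir(exist_ok=True)
    result_files = []
    for i, doc_id in enumerate(doc_ids):
        task_file = WORKDIR / f"task_{i}.json"
        result_file = WORKDIR / f"analysis_{i}.md"
        # Main agent writes the task description and inputs to a file ...
        task_file.write_text(json.dumps({"doc_id": doc_id, "goal": "summarize key findings"}))
        # ... the sub-agent reads it, works in its own narrow context, and writes results back.
        run_sub_agent(task_file, result_file)
        result_files.append(result_file)
    # The main agent (or a fresh synthesis agent) then reads all result files for the next step.
    return "\n\n".join(f.read_text() for f in result_files)
```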
V. Key Q&A Transcript
Q1: For vertical domains, should we do finetuning or use closed-source models?
Conclusion: Closed-source models + Context Engineering > Open-source models + Finetune
Reasons:
Gap in knowledge density:
- Closed-source models are two generations ahead of open-source models
- Higher-quality training data
- More parameter-efficient
Gap in thinking density:
- Open-source models (e.g., Qwen, Kimi) often have very long chains of thought
- Need to rely on extended thinking time to reach good performance
- Closed-source models have more compact and efficient CoT
Gap in generalization:
- Open-source models are mainly optimized for public Benchmarks
- Weaker generalization in non-Benchmark scenarios
- Closed-source models have large internal Benchmarks and generalize better
When to use open-source + finetuning?
Extremely niche domains:
- Almost no related data on the public internet
- Must inject domain knowledge via finetuning
Strict data privacy requirements:
- Data cannot be sent overseas
- Must deploy locally
Extreme cost sensitivity:
- Huge usage volume makes API cost unaffordable
- Self-hosting open-source models is more economical
MiniMind finetuning case:
Experiment background:
- Ultra-small 100MB model
- Attempted to finetune on complex dialogue data
Reasons for failure:
- Data was too difficult, beyond the model’s capability
- Small models can only learn simple knowledge (like “Beijing is the capital of China”)
- Cannot learn complex logical reasoning
Lesson:
- Use data appropriate to the model’s capability
- Official MiniMind training data is at elementary-school-level knowledge
- Overly difficult data will cause the model to collapse during training
Q2: Is Personalization a real problem?
It is a real problem and will be a core competitive edge in the future
Why is it important?
Evolution of recommendation systems:
- Traditional: everyone reads the same daily newspaper
- ByteDance: everyone sees different content
- ByteDance believes: each person lives in a different world and has different values
- Such personalized products better align with human nature, so users are more willing to use them
The future of AI will be the same:
- There should not be only one Universal Value
- It should adapt to each user’s values and preferences
- Differences in values at the level of details are large
Technical challenges:
Factual Information is relatively easy:
- Birthday, address, card number, work information
- Just remember it; there’s no ambiguity
- We’re already doing well here
User Preference is very hard:
Highly context-dependent:
- When the user is writing a paper, they request academic formatting
- That doesn’t mean travel guides should also use academic formatting
- AI easily overgeneralizes preferences
One-off behavior vs. long-term preference:
- The user says, “Yesterday I ordered Sichuan food”
- You can’t record that as “The user likes Sichuan cuisine”
- It might just be their friend’s preference, or a whim
Requires extremely fine-grained evaluation:
- Must rely on data and tests to balance
- Can’t rely on intuition
Q3: What are the prospects and challenges of device–cloud collaborative Agents?
Prospects: an inevitable trend
- Why do we need device–cloud collaboration?
- Agents need to run continuously in the background
- They can’t occupy the foreground (users still need to use their phones)
- The cloud has stronger compute and can run large models
Core challenge: APP state synchronization
Login state issues:
- The user is logged into WeChat on their phone
- Should the cloud Agent also log into WeChat?
- WeChat only allows one client; they’ll kick each other off
Risk control issues:
- Cloud IP and phone IP are different
- Possibly in different countries
- Easily triggers risk control, leading to account bans
The hassle of repeated logins:
- Every website and APP would need to be re-logged-in on the cloud
- Very poor user experience
- Privacy concerns
Solution: system-level state synchronization
Doubao Phone’s attempt:
- Created a “shadow system” locally
- The Agent operates APPs in the background
- APPs think they’re in the foreground
- The user operates the real foreground
- Two systems run in parallel without interfering with each other
A more ideal solution:
- Requires system-level support from Android/iOS/HarmonyOS
- Able to synchronize APP state to the cloud
- Cloud Agent operates a cloud APP mirror
- Syncs back to the phone when needed
- Companies did cross-device state migration 10 years ago; this may now become a basic capability
Device side vs. cloud side: two dimensions
- Agent runtime location: device side or cloud side
- Model runtime location: device side or cloud side
- These are two independent dimensions that can be flexibly combined
Q4: Any advice for using AI to write code?
Core principle: humans must know more than the AI
- Why?
- You need to be able to review the code AI writes
- AI makes mistakes; humans must catch them
- Especially simple syntax errors: AI can catch them
- But complex architectural issues are hard for AI to detect
Role shift: from Coder to Architect + Reviewer
New workflow:
Requirement decomposition (human):
- Break big requirements into small tasks
- Each task within 500 lines
- Clearly define inputs and outputs
Implementation (AI):
- AI writes code based on task descriptions
- Multiple AIs can work in parallel
Automated testing (AI):
- Run test cases
- Ensure correctness
Code review (AI):
- 5–6 different Review Agents
- Check code from different angles
- Only after all pass is a PR generated
Final Review (human):
- Human acts as final gatekeeper
- Check business logic
- Check architectural consistency
- Approve or request changes
Flexibility of the 500-line limit:
- Core Python backend code: 500 lines
- Frontend React: 1000–2000 lines (lots of boilerplate)
- Test cases: can be more (e.g., 100 tests)
- C language: increase accordingly (the language itself is more verbose)
Cross-file? Of course
- 500 lines refers to the total size of a single PR
- You can modify multiple files
- The key is to keep tasks small and easy to review
Q5: How to ensure workflows still run after model upgrades?
The only answer: Evaluation
Why is Evaluation so important?
Models will upgrade:
- Base models upgrade every few months
- Each upgrade may break existing workflows
- Need rapid compatibility verification
Prompts will be adjusted:
- Prompts often need optimization
- Changing A might break B
- Requires comprehensive regression testing
Avoid subjective judgment:
- Can’t just manually test a few examples
- First, it’s time-consuming
- Second, it’s subjective and easily misses issues
A complete Evaluation system:
Test dataset:
- Extract representative cases from production
- Continuously update
Rubric-based evaluation:
- Don’t just look at overall quality
- Break it into multiple sub-metrics
- Score each independently
Automated runs:
- Switch model, run the whole test suite
- Review the report a few hours later
- Compare data objectively
Continuous improvement:
- When new issues are found, add them to the dataset
- Form a closed loop
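A minimal sketch of how this closed loop can be applied to a model upgrade: run the same dataset and rubrics against the old and new models, then diff the per-metric averages (all names and the tolerance are placeholders):

```python
def run_eval(model: str, dataset: list[dict]) -> dict[str, float]:
    """Placeholder: run every case through `model`, score with the rubric judge, return per-metric averages."""
    raise NotImplementedError

def compare_models(old: str, new: str, dataset: list[dict], tolerance: float = 0.02) -> bool:
    """Flag any metric that drops by more than `tolerance` after switching models."""
    old_scores, new_scores = run_eval(old, dataset), run_eval(new, dataset)
    ok = True
    for metric, old_avg in old_scores.items():
        delta = new_scores[metric] - old_avg
        flag = "REGRESSION" if delta < -tolerance else "ok"
        print(f"{metric:22s} {old_avg:.3f} -> {new_scores[metric]:.3f} ({delta:+.3f}) {flag}")
        ok = ok and delta >= -tolerance
    return ok  # gate the switch to the new model on this result
```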
Alternative to automated Evaluation:
- If you don’t have an automated system, at least maintain a manual test checklist
- Before each release, manually test each key scenario
- It’s slow, but better than nothing
Q6: How to get cutting-edge AI information?
Top recommendation: X (Twitter)
Why X?
- The three major Chinese AI media accounts (Xinzhiyuan, Synced, QbitAI) all source from X
- First-hand papers and technical discussions
- Many top researchers share there
How to use X?
- Follow technical leaders
- Follow accounts that push papers
Q7: What to do about Prompts that help one thing but hurt another?
Two solutions:
1. Use Evaluation to prevent regression
- After changing Prompts, run full tests
- Ensure other capabilities aren’t broken
- Data-driven, objectively evaluated
2. Structured Prompt organization
Don’t pile on rules:
- ❌ 101 rigid rules, just appending more and more
- ✅ Structure it like a book, with hierarchy
Make it like a new-hire handbook:
- Not just a flat list of rules
- A logically organized guidebook
- Consider various situations and exceptions
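As an illustration of the “new-hire handbook” idea, a structured system prompt might be organized like this rather than as a flat rule list; the product, sections, and rules are invented:

```python
# Hypothetical structured system prompt: organized like a handbook, not a pile of rules.
SYSTEM_PROMPT = """\
# Support Agent Handbook

## Who you are
A billing-support assistant for ExampleCo (an invented product).

## How to answer
1. Answer from the provided context; if the answer is not there, say so and offer to escalate.
2. Keep replies under 150 words unless the user asks for more detail.

## Refunds (exception-heavy area)
- Default: refunds within 30 days of purchase.
- Exception: annual plans get a prorated refund instead.
- Escalate anything involving chargebacks to a human.

## Tone
Professional and concise; no speculation about internal systems.
"""
```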
This article was compiled by an AI Agent I developed based on my spoken narration