2025-12-21
The Full Story of the Storage Performance Issue in the Course Review Community

This month, the Course Review Community encountered a storage performance issue that lasted nearly two weeks, causing slow service responses and degraded user experience. This post documents how the issue was discovered, investigated, and resolved, covering NFS performance, ZFS logs, Proxmox VE virtualization storage configuration, and more.

Read More

2025-12-20
Claude’s Context Engineering Playbook: Best Practices Learned from Anthropic

(This article is compiled from talks and in-depth conversations with the Anthropic team during AWS re:Invent 2025)

View Slides (HTML) (note: these are not official Anthropic slides; they are my own reconstruction from photos and recordings)

Slides source code

What This Article Covers

Claude is already smart enough—intelligence is not the bottleneck, context is. Every organization has its own workflows, standards, and knowledge systems, and Claude doesn’t inherently know any of these. This article summarizes Anthropic’s best practices for Context Engineering, covering key topics such as Skills, Agent SDK, MCP, and evaluation systems, to help you build more efficient AI applications.

  • 01 | Skills System – Teach Claude your organization-specific knowledge
  • 02 | Context Engineering Framework – Four pillars for maximizing token efficiency
  • 03 | Context Window & Context Rot – Understand context limits and degradation
  • 04 | Tool Design Best Practices – How to build powerful tools
  • 05 | Claude Agent SDK – A framework for production-ready agents
  • 06 | Sub-agent Configuration Best Practices – Auto-invocation and permissioning
  • 07 | MCP (Model Context Protocol) – A standardized protocol for tooling
  • 08 | Evaluations – Why evaluation matters and how to do it well
  • 09 | Building Coding Agents – Lessons from Claude Code
  • 10 | Ecosystem Synergy – How Prompts, MCP, Skills, and Subagents work together
Read More

2025-12-20
The Next Station of Agent–Human Interaction: Real-Time Voice and Generative UI

(This article is the invited talk I gave at the first Intelligent Agent Network and Application Innovation Conference on December 20, 2025)

View talk slides (HTML)

Talk slides source code

Abstract

Current Agent–human interaction is centered on text, but this deviates from the natural pattern of human cognition. From first principles, the output modality humans are best at is speech (speaking is about three times faster than typing), and the input modality humans are best at is vision; and what vision handles best is not text but intuitive UI.

The first step is to achieve real-time voice interaction. The problem with the traditional VAD–ASR–LLM–TTS serial architecture is that the system must wait for the user to finish speaking before it can start thinking, and it cannot output before thinking is complete. Through the Interactive ReAct continuous thinking mechanism, an Agent can listen, think, and speak at the same time: it starts thinking while the user is speaking, and continues deeper reasoning while it is speaking itself, fully utilizing all time gaps.
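The overlap described above can be sketched with asyncio. This is a minimal illustration with hypothetical names (`user_speech`, `think`, `interactive_react` are mine, not Pine AI's implementation): reasoning starts on the first partial transcript instead of waiting for end-of-utterance, so thinking runs concurrently with the rest of the user's speech.

```python
import asyncio

async def user_speech(partials):
    """Simulated streaming ASR: partial transcripts arrive while the user talks."""
    for text in partials:
        await asyncio.sleep(0.01)   # next audio chunk arriving
        yield text

async def think(first_partial, log):
    """Reasoning that runs concurrently with the rest of the utterance."""
    log.append(f"thinking on: {first_partial}")
    await asyncio.sleep(0.03)       # deep reasoning, overlapped with speech
    return "draft reply"

async def interactive_react(partials, log):
    """Kick off thinking on the first partial transcript instead of waiting
    for end-of-utterance, then speak once both listening and thinking finish."""
    think_task, transcript = None, ""
    async for partial in user_speech(partials):
        transcript = partial
        log.append(f"heard: {partial}")
        if think_task is None:      # start reasoning while the user still speaks
            think_task = asyncio.create_task(think(partial, log))
    draft = await think_task
    log.append(f"speak: {draft} (final transcript: {transcript})")

log = []
asyncio.run(interactive_react(
    ["I want to", "I want to cancel", "I want to cancel my plan"], log))
```

In the event log, "thinking on" appears before the final partial transcript arrives: the time gap the serial architecture wastes is reclaimed.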

The second step is to extend the observation space and action space on top of real-time voice. By expanding the Observation Space (from voice input to Computer Use visual perception) and the Action Space (from voice output to UI generation and computer operation), the Agent can make phone calls while operating existing computer/phone GUI interfaces, and generate dynamic UIs to interact with the user. One implementation path for generative UI is to generate frontend code, for which Claude 4.5 Sonnet has already reached the threshold. Another path is to generate images, for which Nano Banana Pro is also close to the threshold.

This is exactly the implementation path of Samantha in the movie Her. As an operating system, Samantha needs six core capabilities: real-time voice conversation with the user; making phone calls and handling matters on the user’s behalf; operating traditional computers and phones for the user; connecting data across the user’s existing devices and online services; having her own generative UI interface; and maintaining powerful long-term user memory to deliver personalized proactive services.

Read More

2025-12-19
Silicon Valley AI Observations: The Million‑Dollar‑Salary Model War and How Startups Survive

(This article is the invited talk I gave at AWS re:Invent 2025 Beijing Meetup)

View Slides (HTML)

Thanks to AWS for the invitation, which gave me the opportunity to attend AWS re:Invent 2025. During this trip to the US, I not only participated in this world‑class tech conference, but also had in‑depth conversations with many frontline practitioners from top Silicon Valley AI companies such as OpenAI, Anthropic, and Google DeepMind. Most of the viewpoints were cross‑validated by experts from different companies.

From the re:Invent venue in Las Vegas, to NeurIPS in San Diego, and then to AI companies in the Bay Area, more than a dozen days of intensive conversations taught me a great deal, mainly in the following areas:

Practical experience with AI-assisted programming (Vibe Coding): Analyzing efficiency differences across scenarios, from 3–5x acceleration at startups to why large companies and research institutes see limited gains.

Organization and resource allocation of foundation model companies: Analyzing the advantages and disadvantages of companies like Google, OpenAI, xAI, and Anthropic, including compute resources, compensation structures, and the current collaboration status between model teams and application teams.

A frontline view of the Scaling Law: Frontline researchers generally believe the Scaling Law has not ended, which diverges from public statements by top scientists such as Ilya Sutskever and Richard Sutton. Engineering methods can solve sampling efficiency and generalization problems, and there is still large room for improvement in foundation models.

A scientific methodology for application development: Introducing the rubric‑based evaluation systems broadly adopted by top AI application companies.

Core techniques of Context Engineering: Discussing three major techniques for dealing with context rot: dynamic system prompts, dynamic loading of prompts (skills), sub‑agents plus context summarization, and the design pattern of using the file system as the interaction bus between agents.
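One of the techniques above, dynamic loading of prompts (skills), can be sketched in a few lines. This is an illustrative toy (the `build_context` function and the skill names are my own, not any company's API): instead of one ever-growing system prompt, only the skill modules relevant to the current task are pulled into context.

```python
def build_context(task, base_prompt, skills):
    """Assemble the prompt from a small base plus only the skill modules
    relevant to this task, keeping the default context window lean."""
    selected = [text for name, text in skills.items() if name in task.lower()]
    return "\n\n".join([base_prompt] + selected)

# Hypothetical skill library; real systems load these from files on demand
# and use the model itself (or embeddings) to judge relevance.
skills = {
    "refund": "Refund negotiation playbook: ...",
    "cancellation": "Subscription cancellation playbook: ...",
}
prompt = build_context("handle a cancellation call", "You are a phone agent.", skills)
```

The cancellation playbook enters the context; the refund playbook stays out, leaving tokens free for the task at hand.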

Strategic choices for startups: Based on real constraints of resources and talent, analyzing which areas startups should avoid (general benchmarks) and which directions they should focus on (vertical domains + context engineering).

Read More

2025-12-18
Clash Verge TUN Mode: Avoiding the Performance Pitfalls of Layer-3 Tunnels

In the previous article, “Set Up an Install-Free IKEv2 Layer-3 Tunnel to Bypass Cursor Region Restrictions”, we introduced how to use an IKEv2 layer-3 tunnel to bypass geo-restrictions of software like Cursor. Although the IKEv2 solution has the advantage of not requiring a client installation, layer-3 tunnels themselves have some inherent performance issues.

This article will introduce a more efficient alternative: using Clash Verge’s TUN mode together with the VLESS protocol, which keeps things transparent to applications while avoiding the performance overhead brought by layer-3 tunnels.

Performance Pitfalls of Layer-3 Tunnels

The IKEv2 + VLESS/WebSocket architecture from the previous article has three main performance issues:

  1. TCP over TCP: application-layer TCP is encapsulated and transported inside the tunnel’s TCP (WebSocket), so the retransmission and congestion-control logic of the two stacked TCP state machines interfere with each other
  2. Head-of-Line Blocking: multiple application connections are multiplexed over the same tunnel; packet loss on one connection blocks all connections
  3. QoS Limits on Long Connections: a single long-lived connection is easily throttled by middleboxes on the network
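Issue 2 is easy to see in a toy simulation (names and packet layout are mine, purely illustrative): the tunnel's outer TCP delivers strictly in sequence, so losing one stream's segment stalls every stream multiplexed behind it.

```python
def tunnel_deliver(packets, lost_seq):
    """Single ordered tunnel multiplexing several streams: the outer TCP
    delivers strictly in sequence, so one lost segment stalls every stream
    behind it, not just the stream it belonged to."""
    delivered, buffered, next_seq = [], {}, 0
    for seq, stream, payload in packets:
        if seq == lost_seq:
            continue                     # lost in transit; awaiting retransmit
        buffered[seq] = (stream, payload)
        while next_seq in buffered:      # in-order delivery only
            delivered.append(buffered.pop(next_seq))
            next_seq += 1
    return delivered

# Streams A and B share the tunnel; only B's segment (seq 1) is lost,
# yet A's later segment is also held back until the retransmit arrives.
packets = [(0, "A", "a0"), (1, "B", "b0"), (2, "A", "a1"), (3, "B", "b1")]
stalled = tunnel_deliver(packets, lost_seq=1)
```

With per-stream ordering (separate connections, or a QUIC-style transport), A's later segment would still be delivered; in the shared ordered tunnel it is not.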
Read More

2025-11-14
Self-Evolving Real-Time Agents: Think While Listening, Speak While Thinking, Learn While Acting

[This article is based on my invited talk at FAISys’25 (The 1st Frontier AI Systems Workshop).]

View slides (HTML)

Slides source code

[The following is an auto-generated Chinese translation from the English slides. It’s recommended to read the original slides.]

Hello everyone, it’s a great honor to speak at FAISys’25. Today I’ll be sharing Self-Evolving Real-Time Agents: Think While Listening, Speak While Thinking, Learn While Acting.

I am the co-founder and Chief Scientist of Pine AI. At Pine AI, our product uses AI to make phone calls and operate computers to help users handle daily tasks — for example, negotiating prices, canceling subscriptions, filing complaints, and obtaining compensation. We have already saved users over 3 million USD, with a success rate of 93%, and on average we save each user 270 minutes of time.

Learning from experience represents the core challenge of machine learning. Current autonomous AI agents face two key challenges in practical applications: real-time interaction with the environment and learning from experience. Today I’ll introduce our technical breakthroughs in these two areas.

Two Core Challenges

Challenge 1: High Latency in Real-Time Interaction

Real-time voice agents must respond within 1 second, as humans do, but traditional architectures built on reasoning LLMs introduce 2–10 seconds of latency.

Challenges of VAD (Voice Activity Detection):

  • Must wait 500–800 ms of continuous silence to confirm the user has finished speaking
  • Backchannel utterances like “uh-huh” are misdetected as interruptions
  • Acoustic information is lost (emotion, environmental sounds)

Challenges of ASR (Automatic Speech Recognition):

  • Lack of context leads to high error rate (emails, names, phone numbers)
  • Lack of world knowledge leads to transcription errors

Challenges of LLMs:

  • Forced to wait, cannot think while listening
  • Cannot speak while thinking (5–10 seconds of silence)
  • Poor turn detection (deciding when to speak or stay silent)
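A back-of-envelope budget makes the latency gap concrete. The individual numbers below are illustrative assumptions (mid-range of the figures cited above, plus assumed ASR/TTS costs), not measurements from the talk:

```python
# Back-of-envelope latency budget, in milliseconds (all values assumed).
vad_silence  = 650   # mid-range of the 500-800 ms end-of-speech wait
asr_finalize = 300   # final transcript after speech ends (assumed)
llm_thinking = 2000  # reasoning before the first output token (assumed)
tts_first_ms = 200   # time to first synthesized audio (assumed)

# Serial VAD -> ASR -> LLM -> TTS: every stage waits for the previous one.
serial_latency = vad_silence + asr_finalize + llm_thinking + tts_first_ms

# "Think while listening": the LLM starts on partial transcripts, so most
# reasoning is hidden behind the utterance itself (80% hidden, assumed).
visible_thinking = llm_thinking * 0.2
overlapped_latency = vad_silence + max(asr_finalize, visible_thinking) + tts_first_ms
```

Under these assumptions the serial pipeline lands above 3 seconds, while overlapping the thinking with the user's speech brings the response close to the 1-second target.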

Challenge 2: Learning from Experience

Models are smart but not skilled — like top graduates with little real-world work experience.

Fixed models cannot learn:

  • Cannot learn from successful trajectories
  • Cannot learn from failed trajectories
  • Parameters are frozen after deployment

Big World Hypothesis:
The world is too large to pre-encode all knowledge:

  • Business processes are dynamic and not publicly documented
  • Verification information differs across companies
  • Service rules change continuously
  • Pretrained knowledge is not sufficient for deployment
Read More

2025-10-24
The Dilemma of Continuous Learning for Agents: Why a Reasoner Is Not a True Agent

Reinforcement learning pioneer Richard Sutton says that today’s large language models are a dead end.

This sounds shocking. As the author of “The Bitter Lesson” and the 2024 Turing Award winner, Sutton is the one who believes most strongly that “more compute + general methods will always win,” so in theory he should be full of praise for large models like GPT-5, Claude, and Gemini. But in a recent interview, Sutton bluntly pointed out: LLMs merely imitate what humans say; they don’t understand how the world works.

The interview, hosted by podcaster Dwarkesh Patel, sparked intense discussion. Andrej Karpathy later responded in writing and further expanded on the topic in another interview. Their debate reveals three fundamental, often overlooked problems in current AI development:

First, the myth of the small-world assumption: Do we really believe that a sufficiently large model can master all important knowledge in the world and thus no longer needs to learn? Or does the real world follow the large-world assumption—no matter how big the model is, it still needs to keep learning in concrete situations?

Second, the lack of continuous learning: Current model-free RL methods (PPO, GRPO, etc.) only learn from sparse rewards and cannot leverage the rich feedback the environment provides. This leads to extremely low sample efficiency for Agents in real-world tasks and makes rapid adaptation difficult.

Third, the gap between Reasoner and Agent: OpenAI divides AI capabilities into five levels, from Chatbot to Reasoner to Agent. But many people mistakenly think that turning a single-step Reasoner into a multi-step one makes it an Agent. The core difference between a true Agent and a Reasoner is the ability to learn continuously.

This article systematically reviews the core viewpoints from those two interviews and, combined with our practical experience developing real-time Agents at Pine AI, explores how to bridge this gap.

Read More

2025-10-16
From Memory to Cognition: How AI Agents Enable Truly Personalized Service

View Slides (HTML)

Slides Source Code

Contents

  • 01 | The Importance and Challenges of Memory - Personalization Value · Three Capability Layers
  • 02 | Representation of Memory - Notes · JSON Cards
  • 03 | Retrieval of Memory - RAG · Context Awareness
  • 04 | Evaluation of Memory - Rubric · LLM Judge
  • 05 | Frontier Research - ReasoningBank

Starting from personalization needs → understanding memory challenges → designing storage schemes → implementing intelligent retrieval → scientific evaluation and iteration

Read More

2025-09-28
The Thinking Behind Unified Bus

The protocol documentation for Unified Bus has finally been released. Most of the initial design work for the protocol was done four or five years ago, and I haven’t worked on interconnects for more than two years. Yet reading this 500+ page document today still feels very familiar.

As with most protocol documents, the UB documentation presents a wealth of details about the Unified Bus protocol, but rarely touches on the thinking behind its design. As a small foot soldier who participated in UB in its early days, I’ll share some of my personal thoughts. The productized UB today may differ in many ways from what we designed back then, so don’t take this as an authoritative guide—just read it as anecdotes.

Why UB

To understand the inevitability of Unified Bus (UB), we must return to a fundamental contradiction in computer architecture: the split between the Bus and the Network.

For a long time, the computing world has been divided into islands by these two completely different interconnect paradigms.

  • Inside an island (for example, within a single server or a chassis), we use bus technologies such as PCIe or NVLink. They are designed for tightly coupled systems; devices share a unified physical address space, communication latency can be on the order of nanoseconds, and bandwidth is extremely high. This is a performance paradise, but its territory is very limited—the physical distance and the number of devices a bus can connect are strictly constrained.
  • Between islands, we rely on network technologies such as Ethernet or InfiniBand. They are born for loosely coupled systems, excel at connecting tens of thousands of nodes, and have superb scalability. But that scalability comes at a cost: complex protocol stacks, additional forwarding overhead, and latencies in the microsecond or even millisecond range create an orders-of-magnitude gap compared with buses.

This “inside vs. outside” architecture worked well for a long time. However, a specter began to haunt the computing world: the Scaling Law.

About 10 years ago, researchers in deep learning discovered a striking regularity: as long as you keep increasing model size, data, and compute, model performance predictably and steadily improves. This discovery changed the game. What used to be a “good enough” single machine with 8 GPUs suddenly became a drop in the bucket in the face of models with tens or hundreds of billions of parameters.

At that moment, a clear and urgent need presented itself to system architects everywhere: can we tear down the wall between buses and networks? Can we create a unified interconnect that offers bus-level programming simplicity and extreme performance, while also providing network-level massive scalability?

This is UB’s core mission. It’s not merely a patch or improvement on existing protocols but a thorough rethinking. UB aims to build a true “datacenter-scale computer,” seamlessly connecting heterogeneous compute, memory, and storage across the entire cluster into a unified, programmable whole. In this vision, accessing memory on a remote server should be as simple and natural as accessing local memory; tens of thousands of processors should collaborate as efficiently as if they were on a single chip.

Read More

2025-09-12
Qwen3-Next: Hybrid Attention + Ultra-Sparse MoE + MTP = SOTA Inference Speed

Recently, Alibaba’s Qwen team released the Qwen3-Next model, another major innovation after Qwen3. The model achieves multiple breakthroughs in architectural design, especially reaching industry-leading levels in the balance between inference efficiency and performance. This article briefly summarizes Qwen3-Next’s core innovations.

Three major breakthroughs of Qwen3-Next:

  1. Hybrid attention architecture: 3 layers of linear attention + 1 layer of traditional attention, incorporating DeltaNet’s delta rule idea
  2. Ultra-sparse MoE: only 11 of 512 experts activated; 80B parameters with only 3B activated
  3. 100+ tokens/s inference speed: reaches a state-of-the-art level via MTP (Multi-Token Prediction)

Core value: at 1/10 the compute cost and 10× the token throughput, it surpasses 32B dense models and is comparable to Gemini 2.5 Flash.
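The sparsity arithmetic behind these numbers, plus a minimal top-k gating sketch (illustrative only, not Qwen3-Next's actual router):

```python
import math

# Headline numbers from the post
total_experts, active_experts = 512, 11        # experts consulted per token
total_params, active_params = 80e9, 3e9        # parameters touched per token

expert_sparsity  = active_experts / total_experts   # ~2.1% of experts per token
param_activation = active_params / total_params     # ~3.75% of weights per token

def top_k_route(logits, k):
    """Minimal top-k gating sketch: keep the k highest-scoring experts
    and renormalize with a softmax over just those."""
    chosen = sorted(range(len(logits)), key=lambda i: -logits[i])[:k]
    exps = {i: math.exp(logits[i]) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

# Four experts, route each token to the top 2: experts 1 and 3 win here.
weights = top_k_route([0.1, 2.0, -1.0, 1.0], k=2)
```

Only the chosen experts' weights are read for a given token, which is why an 80B-parameter model can run at the cost of a ~3B dense one.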

Read More