Bojie Li
2025-12-18
In the previous article, “Set Up an Install-Free IKEv2 Layer-3 Tunnel to Bypass Cursor Region Restrictions”, we introduced how to use an IKEv2 layer-3 tunnel to bypass geo-restrictions of software like Cursor. Although the IKEv2 solution has the advantage of not requiring a client installation, layer-3 tunnels themselves have some inherent performance issues.
This article introduces a more efficient alternative: Clash Verge’s TUN mode combined with the VLESS protocol, which stays transparent to applications while avoiding the performance overhead introduced by layer-3 tunnels.
Performance Pitfalls of Layer-3 Tunnels
The IKEv2 + VLESS/WebSocket architecture from the previous article has three main performance issues:
- TCP over TCP: application-layer TCP is encapsulated inside the tunnel’s own TCP (WebSocket) transport, so two TCP state machines, each with its own retransmission and congestion control, interfere with each other
- Head-of-Line Blocking: multiple application connections are multiplexed over the same ordered tunnel stream, so a single lost packet stalls delivery for all of them (a toy simulation follows this list)
- QoS Limits on Long Connections: a single long-lived connection is easily throttled by middleboxes on the network
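To make the head-of-line effect concrete, here is a toy Python simulation (my own illustration, not from the article; the 2% loss rate and 50 ms retransmission delay are assumptions). Several application streams share one ordered, reliable tunnel stream, so a retransmission anywhere delays every stream:

```python
# Toy model of head-of-line blocking on a multiplexed tunnel (assumed
# numbers: 2% segment loss, 50 ms retransmission delay, 1 ms per segment).
import random

random.seed(0)
LOSS_RATE, RTT_MS = 0.02, 50

def delivery_times(num_segments: int) -> list[float]:
    """Time at which each segment of the ordered tunnel stream is delivered."""
    clock, times = 0.0, []
    for _ in range(num_segments):
        clock += 1.0                      # 1 ms to transmit a segment
        if random.random() < LOSS_RATE:   # loss: wait one RTT for retransmit
            clock += RTT_MS
        times.append(clock)               # ordered delivery: later segments
    return times                          # can never overtake a lost one

# Four application streams interleaved over the tunnel: stream s owns
# segments s, s+4, s+8, ... A loss in any stream's segment stalls all four.
times = delivery_times(400)
for s in range(4):
    worst = max(times[i] - (i + 1) for i in range(s, 400, 4))
    print(f"stream {s}: worst extra delay {worst:.0f} ms")
```

With per-flow proxy connections (the usual VLESS setup without multiplexing), a loss only delays the flow it belongs to.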
2025-10-24
Reinforcement learning pioneer Richard Sutton says that today’s large language models are a dead end.
This sounds shocking. As the author of “The Bitter Lesson” and the 2024 Turing Award winner, Sutton is the one who believes most strongly that “more compute + general methods will always win,” so in theory he should be full of praise for large models like GPT-5, Claude, and Gemini. But in a recent interview, Sutton bluntly pointed out: LLMs merely imitate what humans say; they don’t understand how the world works.
The interview, hosted by podcaster Dwarkesh Patel, sparked intense discussion. Andrej Karpathy later responded in writing and further expanded on the topic in another interview. Their debate reveals three fundamental, often overlooked problems in current AI development:
First, the myth of the small-world assumption: Do we really believe that a sufficiently large model can master all important knowledge in the world and thus no longer needs to learn? Or does the real world follow the large-world assumption—no matter how big the model is, it still needs to keep learning in concrete situations?
Second, the lack of continuous learning: Current model-free RL methods (PPO, GRPO, etc.) only learn from sparse rewards and cannot leverage the rich feedback the environment provides. This leads to extremely low sample efficiency for Agents in real-world tasks and makes rapid adaptation difficult (a toy sketch of this follows below).
Third, the gap between Reasoner and Agent: OpenAI divides AI capabilities into five levels, from Chatbot to Reasoner to Agent. But many people mistakenly think that turning a single-step Reasoner into a multi-step one makes it an Agent. The core difference between a true Agent and a Reasoner is the ability to learn continuously.
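To make the sparse-reward point (the second problem above) concrete, here is a toy REINFORCE-style sketch of my own (not code from the interviews; all sizes and numbers are arbitrary): every step of an episode is scaled by one scalar reward, so zero-reward episodes yield zero gradient, regardless of how much the environment actually revealed.

```python
import torch

policy = torch.nn.Linear(4, 2)            # toy policy: 4-dim state, 2 actions
opt = torch.optim.SGD(policy.parameters(), lr=0.01)

def episode():
    """Roll out 8 steps; the environment produces rich observations, but the
    learner only keeps one scalar reward for the whole episode."""
    states = torch.randn(8, 4)
    logps = []
    for s in states:
        dist = torch.distributions.Categorical(logits=policy(s))
        action = dist.sample()
        logps.append(dist.log_prob(action))
    reward = float(torch.rand(()) < 0.1)  # sparse: usually 0, rarely 1
    return torch.stack(logps), reward

logps, reward = episode()
# REINFORCE-style update: every step is scaled by the same scalar reward,
# so the many zero-reward episodes produce exactly zero gradient signal.
loss = -(logps * reward).sum()
opt.zero_grad(); loss.backward(); opt.step()
```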
This article systematically reviews the core viewpoints from those two interviews and, combined with our practical experience developing real-time Agents at Pine AI, explores how to bridge this gap.
2025-10-16
View Slides (HTML), Download PDF Version
Contents
- 01 | The Importance and Challenges of Memory - Personalization Value · Three Capability Levels
- 02 | Representation of Memory - Notes · JSON Cards
- 03 | Retrieval of Memory - RAG · Context Awareness
- 04 | Evaluation of Memory - Rubric · LLM Judge
- 05 | Frontier Research - ReasoningBank
Starting from personalization needs → Understanding memory challenges → Designing storage schemes → Implementing intelligent retrieval → Scientific evaluation and iteration
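As a tiny illustration of the “JSON Cards” representation mentioned above (my own sketch; the talk’s actual schema may differ), a memory can be stored as a small structured card that is easy to retrieve, update, and score:

```python
import json

# A hypothetical memory card; the field names are my own illustration.
card = {
    "id": "mem-0042",
    "type": "preference",                  # e.g. preference / fact / event
    "content": "User prefers concise answers with code examples.",
    "source": "conversation on 2025-10-12",
    "confidence": 0.9,
    "updated_at": "2025-10-16",
}
print(json.dumps(card, indent=2))
```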
2025-09-28
The protocol documentation for Unified Bus has finally been released. Most of the initial design work for the protocol was done four or five years ago, and I haven’t worked on interconnects for more than two years. Yet reading this 500+ page document today still feels very familiar.
As with most protocol documents, the UB documentation presents a wealth of details about the Unified Bus protocol, but rarely touches on the thinking behind its design. As a small foot soldier who participated in UB in its early days, I’ll share some of my personal thoughts. The productized UB today may differ in many ways from what we designed back then, so don’t take this as an authoritative guide—just read it as anecdotes.
Why UB
To understand the inevitability of Unified Bus (UB), we must return to a fundamental contradiction in computer architecture: the split between the Bus and the Network.
For a long time, the computing world has been divided into islands by these two completely different interconnect paradigms.
- Inside an island (for example, within a single server or a chassis), we use bus technologies such as PCIe or NVLink. They are designed for tightly coupled systems; devices share a unified physical address space, communication latency can be on the order of nanoseconds, and bandwidth is extremely high. This is a performance paradise, but its territory is very limited—the physical distance and the number of devices a bus can connect are strictly constrained.
- Between islands, we rely on network technologies such as Ethernet or InfiniBand. They are born for loosely coupled systems, excel at connecting tens of thousands of nodes, and have superb scalability. But that scalability comes at a cost: complex protocol stacks, additional forwarding overhead, and latencies in the microsecond or even millisecond range create an orders-of-magnitude gap compared with buses.
This “inside vs. outside” architecture worked well for a long time. However, a specter began to haunt the computing world—Scaling Law.
About 10 years ago, researchers in deep learning discovered a striking regularity: as long as you keep increasing model size, data, and compute, model performance predictably and steadily improves. This discovery changed the game. What used to be a “good enough” single machine with 8 GPUs suddenly became a drop in the bucket in the face of models with tens or hundreds of billions of parameters.
At that moment, a clear and urgent need presented itself to system architects everywhere: can we tear down the wall between buses and networks? Can we create a unified interconnect that offers bus-level programming simplicity and extreme performance, while also providing network-level massive scalability?
This is UB’s core mission. It’s not merely a patch or improvement on existing protocols but a thorough rethinking. UB aims to build a true “datacenter-scale computer,” seamlessly connecting heterogeneous compute, memory, and storage across the entire cluster into a unified, programmable whole. In this vision, accessing memory on a remote server should be as simple and natural as accessing local memory; tens of thousands of processors should collaborate as efficiently as if they were on a single chip.
2025-09-12
Recently, Alibaba’s Qwen team released the Qwen3-Next model, another major innovation after Qwen3. The model achieves multiple breakthroughs in architectural design, especially reaching industry-leading levels in the balance between inference efficiency and performance. This article briefly summarizes Qwen3-Next’s core innovations.
Three major breakthroughs of Qwen3-Next:
- Hybrid attention architecture: 3 layers of linear attention + 1 layer of traditional attention, incorporating DeltaNet’s delta rule idea
- Ultra-sparse MoE: only 11 of 512 experts activated; 80B parameters with only 3B activated (a toy routing sketch follows below)
- 100+ tokens/s inference speed: reaches a state-of-the-art level via MTP (Multi-Token Prediction)
Core value: With 1/10 the compute cost and 10× the token processing speed, it achieves performance surpassing 32B dense models, benchmarking against Gemini 2.5 Flash.
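For intuition about the ultra-sparse MoE figure above, here is a deliberately naive top-k routing sketch (an illustration under assumed sizes, not Qwen3-Next’s implementation; it treats all 11 activated experts as routed): only the selected experts run for each token, which is why 80B total parameters can cost only ~3B per forward pass.

```python
import torch

NUM_EXPERTS, TOP_K, D = 512, 11, 64   # expert counts taken from the summary

router = torch.nn.Linear(D, NUM_EXPERTS, bias=False)
experts = torch.nn.ModuleList(
    torch.nn.Linear(D, D) for _ in range(NUM_EXPERTS)
)

def moe_forward(x: torch.Tensor) -> torch.Tensor:
    """x: (tokens, D). Route each token to its top-k experts."""
    logits = router(x)                            # (tokens, 512)
    weights, idx = logits.topk(TOP_K, dim=-1)     # pick 11 experts per token
    weights = torch.softmax(weights, dim=-1)      # normalize over the chosen
    out = torch.zeros_like(x)
    for t in range(x.size(0)):                    # naive loop, for clarity
        for k in range(TOP_K):
            out[t] += weights[t, k] * experts[idx[t, k].item()](x[t])
    return out

with torch.no_grad():
    print(moe_forward(torch.randn(4, D)).shape)   # torch.Size([4, 64])
```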
2025-09-08
I was honored to be invited by Prof. Zhang Jiaxing to give an academic talk titled “The Two Dark Clouds over Agents: Real‑time Interaction with the Environment, Learning from Experience” at Lion Rock Artificial Intelligence Lab on September 4. Today I’m sharing the slides and video from the talk for your reference and discussion.
📰 Official coverage: [Industry-Research Matchmaking] Session 2, “FAIR plus × Asking the Way at Lion Rock,” successfully held, exploring the bottlenecks and breakthroughs of AI Agents and all-terrain embodied intelligence
Talk materials
- 🎬 Talk video
- 📖 Slides in English
- 📖 Slides in Chinese
Talk overview
In 1900, Lord Kelvin said in a speech: “The beauty and clearness of the dynamical theory, which asserts heat and light to be modes of motion, is at present obscured by two clouds…”. These two “small clouds” later triggered the revolutions of relativity and quantum mechanics. Today, the AI Agent field is facing a similar pair of “dark clouds”.
First dark cloud: challenges of real‑time interaction
Current AI Agents suffer from severe latency issues when interacting with the environment in real time:
The dilemma of voice interaction
- Serial processing vs. real-time needs: the Agent must wait for the user to finish speaking before it starts thinking, and finish thinking before it starts speaking
- Fast vs. slow thinking: deep thinking needs 10+ seconds (users lose patience), while fast responses are prone to errors
- Technical bottlenecks: every step is a wait (VAD detection, ASR recognition, LLM thinking, TTS synthesis); see the pipelining sketch below
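To see why the serial chain hurts, here is a toy asyncio sketch (illustrative stage latencies and structure, not Pine AI’s production code) comparing a strictly serial ASR → LLM → TTS chain with a pipelined one that overlaps stages across audio chunks:

```python
import asyncio, time

COSTS = {"ASR": 0.3, "LLM": 0.5, "TTS": 0.2}   # assumed per-chunk latencies (s)

async def process(stage: str, chunk: int) -> int:
    await asyncio.sleep(COSTS[stage])           # stand-in for real work
    return chunk

async def serial(n: int) -> None:
    """Each chunk waits for the full ASR -> LLM -> TTS chain to finish."""
    for c in range(n):
        for stage in ("ASR", "LLM", "TTS"):
            await process(stage, c)

async def pipelined(n: int) -> None:
    """Stages overlap: while TTS speaks chunk 0, the LLM works on chunk 1."""
    async def worker(stage, inbox, outbox):
        while (c := await inbox.get()) is not None:
            await process(stage, c)
            await outbox.put(c)
        await outbox.put(None)                  # propagate end-of-stream
    qs = [asyncio.Queue() for _ in range(4)]
    tasks = [asyncio.create_task(worker(s, qs[i], qs[i + 1]))
             for i, s in enumerate(("ASR", "LLM", "TTS"))]
    for c in range(n):
        await qs[0].put(c)
    await qs[0].put(None)
    await asyncio.gather(*tasks)

for fn in (serial, pipelined):
    t0 = time.perf_counter()
    asyncio.run(fn(5))
    print(fn.__name__, f"{time.perf_counter() - t0:.1f}s")
# serial: ~5.0s; pipelined: ~3.0s, bounded by the slowest stage, not the sum
```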
The “last mile” challenge of GUI operations
- Agents operate computers 3–5× slower than humans
- Every click requires a new screenshot and thinking (3–4 seconds of latency)
- “Moravec’s paradox”: the model “knows” what to do, but “can’t do it” well
2025-08-18
[This article is based on the first live session of the Turing Community AI Agent Practical Bootcamp. See the slides link and download the PDF version.]
Purchase link for Turing Community “AI Agent Practical Bootcamp”
Developing your own AI Agent starts here. This article not only systematically introduces the foundational technical path for building a general-purpose AI Agent from scratch (such as context engineering, RAG systems, tool calling, multimodal interaction, etc.), but also covers advanced techniques such as slow/fast thinking and multi-Agent collaboration. Through 9 weeks of hands-on projects, you will gradually master the full lifecycle of Agent development and core advanced capabilities.
This course was first previewed via livestream on August 18 and will officially start on September 11. Each weekly session is about 2 hours and covers all the fundamental and advanced content below. Of course, 2 hours of lectures per week is definitely not enough—you’ll also need to spend time on hands-on programming practice.
Core Goals of the Bootcamp
Developing your own AI Agent starts here
🎯 Master core architecture and engineering capabilities
- Deeply understand Agent architecture: Systematically grasp the core design paradigm of LLM + context + tools (a minimal sketch follows this list).
- Become proficient in context engineering: Master multi-level context management techniques from conversation history and users’ long-term memory to external knowledge bases (RAG) and file systems.
- Master dynamic tool calling: Reliably integrate Agents with external APIs and MCP Servers, and enable self-evolution via code generation.
- Build advanced Agent patterns: Design and implement complex Agent collaboration patterns such as slow/fast thinking (Mixture-of-Thoughts) and Orchestration.
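As a concrete reference for the LLM + context + tools paradigm, here is a minimal Agent loop (a sketch under assumptions: the model name and the search tool are placeholders, and it presumes the OpenAI-compatible Python SDK; it is not the bootcamp’s actual code):

```python
import json
from openai import OpenAI   # assumes OPENAI_API_KEY is set in the environment

client = OpenAI()

def search_web(query: str) -> str:
    """Placeholder tool; a real Agent would call a search API here."""
    return f"(pretend search results for: {query})"

TOOLS = [{
    "type": "function",
    "function": {
        "name": "search_web",
        "description": "Search the web for up-to-date information.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

def run_agent(goal: str, max_steps: int = 5) -> str:
    context = [{"role": "user", "content": goal}]   # the evolving context
    for _ in range(max_steps):
        msg = client.chat.completions.create(
            model="gpt-4o",                         # placeholder model name
            messages=context,
            tools=TOOLS,
        ).choices[0].message
        context.append(msg)
        if not msg.tool_calls:                      # no tool needed: done
            return msg.content
        for call in msg.tool_calls:                 # execute requested tools
            args = json.loads(call.function.arguments)
            context.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": search_web(**args),
            })
    return "(step limit reached)"
```

The loop is the whole paradigm in miniature: the LLM decides, the tool acts, and every result is appended back into the context for the next decision.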
💡 Build systematic understanding of development and deployment
- Understand the path of technological evolution: See clearly the evolution path from basic RAG to Agents that can autonomously develop tools.
- Master the full lifecycle of an Agent: Be capable of independently completing the closed loop of Agent project design, development, evaluation using LLM as a Judge, and deployment.
- Build domain knowledge: Accumulate cross-domain Agent development experience through multiple hands-on projects in law, academia, programming, and more.
- Solidify your knowledge system: Co-create the book “In-depth yet Accessible AI Agent” and turn fragmented knowledge into a systematic output.
9-Week Practical Plan Overview
| Week | Topic | Content Overview | Practical Case |
|---|---|---|---|
| 1 | Agent Basics | Agent structure and taxonomy, workflow-based vs. autonomous | Hands-on building an Agent that can search the web |
| 2 | Context Design | Prompt templates, conversation history, users’ long-term memory | Add role settings and long-term memory to your Agent |
| 3 | RAG and Knowledge Bases | Document structuring, retrieval strategies, incremental updates | Build a legal Q&A Agent |
| 4 | Tool Calling and MCP | Tool wrapping and MCP integration, external API calls | Connect to an MCP Server to implement a deep-research Agent |
| 5 | Programming and Code Execution | Understanding codebases, reliable code modification, consistent runtime environments | Build an Agent that can develop Agents by itself |
| 6 | Model Evaluation and Selection | Evaluating model capabilities, LLM as a Judge, safety guardrails | Build an evaluation dataset and use LLM as a Judge to automatically evaluate Agents |
| 7 | Multimodal and Real-Time Interaction | Real-time voice Agents, operating computers and phones | Implement a voice-call Agent & integrate browser-use to operate a computer |
| 8 | Multi-Agent Collaboration | A2A communication protocol, Agent team division and collaboration | Design a multi-Agent collaboration system to “operate the computer while on a call” |
| 9 | Project Integration and Demo | Final integration and demo of the Agent project, polishing final deliverables | Showcase your unique general-purpose Agent |
9-Week Advanced Topics
| Week | Topic | Advanced Content Overview | Advanced Practical Case |
|---|---|---|---|
| 1 | Agent Basics | Importance of context | Explore how missing context affects Agent behavior |
| 2 | Context Design | Organizing user memory | Build a personal knowledge management Agent for long-text summarization |
| 3 | RAG and Knowledge Bases | Long-context compression | Build an academic paper analysis Agent to summarize core contributions |
| 4 | Tool Calling and MCP | Learning from experience | Enhance the deep-research Agent’s expert capabilities (sub-agents and domain experience) |
| 5 | Programming and Code Execution | Agent self-evolution | Build an Agent that can autonomously leverage open-source software to solve unknown problems |
| 6 | Model Evaluation and Selection | Parallel sampling and sequential revision | Add parallelism and revision capabilities to the deep-research Agent |
| 7 | Multimodal and Real-Time Interaction | Combining fast and slow thinking | Implement a real-time voice Agent that combines fast and slow thinking |
| 8 | Multi-Agent Collaboration | Orchestration Agent | Use an Orchestration Agent to dynamically coordinate phone calls and computer operations |
| 9 | Project Integration and Demo | Comparing Agent learning methods | Compare four ways Agents learn from experience |
2025-08-03
Following “Solving LLM Constrained Sampling Interview Question with Vibe Coding”, I’m sharing another Vibe Coding interview question from our company (Pine AI), this one about the fundamental principles of LLMs.
Many people misunderstand Vibe Coding, thinking it’s just about constantly asking AI, “How do you do this? How do you implement that?” This approach is doomed to fail. True Vibe Coding requires you to be the architect and product manager, guiding the AI like a teacher instructing a student, not the other way around.
This interview question assesses a candidate’s understanding of the basic principles of Transformers and their engineering ability to quickly build a working solution via vibe coding. That is exactly the kind of person we need: someone who understands models and has strong engineering skills.
The Challenge: Attention-Based LLM Hallucination Detector
1. Background & Problem Statement
In many applications, large language models (LLMs) need to answer questions or extract information based on a given context, a process often referred to as “In-Context Learning.” However, LLMs have a well-known, serious flaw: when asked about information not present in the context, they may “hallucinate” a fluent, correctly formatted but factually wrong answer instead of admitting that the information is missing.
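One plausible line of attack (my own sketch, not the official solution; the model and the scoring rule are assumptions) is to measure how much attention mass the answer’s tokens place on the context span: an answer that barely attends to the context is likely drawn from the model’s parametric priors.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

NAME = "gpt2"   # stand-in open model; any causal LM with attentions works
tok = AutoTokenizer.from_pretrained(NAME)
model = AutoModelForCausalLM.from_pretrained(NAME)

def context_attention_score(context: str, question: str, answer: str) -> float:
    """Mean attention mass that answer tokens place on context tokens
    (last layer, averaged over heads). Token boundaries are approximate."""
    ctx_len = len(tok(context)["input_ids"])
    ans_len = len(tok(answer)["input_ids"])
    ids = tok(context + question + answer, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        att = model(ids, output_attentions=True).attentions[-1][0]
    att = att.mean(dim=0)                 # (seq, seq), averaged over heads
    answer_rows = att[-ans_len:]          # attention from answer positions
    return answer_rows[:, :ctx_len].sum(dim=-1).mean().item()

# Lower score => the answer barely looks at the context => flag as suspect.
score = context_attention_score(
    "Alice lives in Paris. ", "Where does Bob live? ", "Bob lives in Rome."
)
print(f"context attention mass: {score:.2f}")
```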
2025-07-30
[This article is based on a talk given at Turing Community’s Large Model Tech Study Camp. Slides: Slides link, Download PDF version]
A deep dive into the design philosophy and practical strategies of AI Agents: moving from the dialogue paradigm of chatbots to the action paradigm of Agents, and systematically designing and managing an Agent’s information environment to build efficient, reliable AI Agent systems.
Table of Contents
- Part 1: Paradigm Shift - From Chatbot to Agent
- Part 2: Core Analysis of Agents
- Part 3: Context Engineering
- Part 4: Memory and Knowledge Systems
Part 1: Paradigm Shift - From Chatbot to Agent
From Chatbot to Agent: A Fundamental Paradigm Shift
We are undergoing a fundamental transformation in AI interaction patterns:
Chatbot Era
- 🗣️ Conversational interaction: user asks → AI answers → repeated Q&A loop
- 📚 Knowledgeable advisor: can “talk” but not “act,” passively responding to user needs
- 🛠️ Typical products: ChatGPT, Claude Chat
Agent Era
- 🎯 Autonomous action mode: user sets goal → Agent executes → autonomous planning and decision-making
- 💪 Capable assistant: can both “think” and “do,” actively discovering and solving problems
- 🚀 Typical products: Claude Code, Cursor, Manus
2025-07-25
In AI application development, choosing the right LLM API service is crucial. Whether you are building an intelligent dialogue system, developing an AI Agent, or participating in an AI Hackathon, this article provides a comprehensive API usage guide covering mainstream services such as OpenRouter, the Anthropic API, Volcano Engine, and SiliconFlow.
Why Do You Need Multiple API Services?
Different LLM models have their own advantages. Especially when developing AI Agents, you need to choose the right model for each scenario (a minimal multi-provider client sketch follows the list below):
- Claude (Anthropic): Excels in complex reasoning, programming, and Agent tasks, particularly suitable for scenarios requiring deep thinking
- Gemini (Google): Performs well in long text processing and multimodal understanding, suitable for handling multimedia content such as images and videos
- GPT (OpenAI): Strong in image understanding and mathematical reasoning, excellent for everyday conversation experiences
- Doubao (ByteDance): Fast access speed in China, good voice dialogue experience, especially suitable for real-time interaction scenarios
- Open Source Models: Low cost, highly customizable, suitable for large-scale deployment
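Most of the services above expose OpenAI-compatible endpoints, so a single client can switch providers by changing base_url and the model name. A minimal sketch (the model identifier is illustrative; check each provider’s current catalog and authentication details):

```python
from openai import OpenAI

# OpenRouter speaks the OpenAI wire protocol; only base_url and key change.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-...",                   # your OpenRouter API key
)

resp = client.chat.completions.create(
    model="anthropic/claude-sonnet-4",     # illustrative model identifier
    messages=[{"role": "user", "content": "Summarize RAG in one sentence."}],
)
print(resp.choices[0].message.content)
```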