AI Agent Bootcamp: Build Your General-Purpose Agent in 9 Weeks
This article is compiled from the first live session of the Turing Community AI Agent Bootcamp.
Build an AI Agent of your own—start here. This article not only systematically introduces the foundational technical path to building a general-purpose AI Agent from scratch (such as context engineering, RAG systems, tool use, multimodal interaction, etc.), but also covers advanced techniques like fast/slow thinking and multi-Agent collaboration. Through 9 weeks of hands-on projects, you will progressively master the full lifecycle of Agent development and the core advanced skills.
This course had its first live preview on August 18 and will officially start on September 11. Each week includes about 2 hours of class time covering all the foundational and advanced topics below. Of course, just 2 hours of lectures per week is not enough—you’ll also need to spend time coding and practicing.
Core Goals of the Bootcamp
Build an AI Agent of your own—start here
🎯 Master core architecture and engineering capabilities
- Deeply understand Agent architecture: Systematically grasp the core design paradigm of LLM + context + tools.
- Master context engineering: Learn multi-layered context management from conversation history and long-term user memory to external knowledge bases (RAG) and file systems.
- Master dynamic tool calling: Reliably integrate Agents with external APIs and MCP Server, and enable self-improvement via code generation.
- Build advanced Agent patterns: Design and implement fast/slow thinking (Mixture-of-Thoughts), orchestration, and other complex Agent collaboration patterns.
💡 Build a systematic understanding of development and deployment
- Understand the path of technical evolution: See the progression from basic RAG to Agents that can autonomously develop tools.
- Master the full Agent lifecycle: Be able to independently complete the closed loop of design, development, evaluation with LLM as a Judge, and deployment.
- Build domain knowledge: Accumulate cross-domain Agent development experience through hands-on projects in law, academia, programming, etc.
- Consolidate your knowledge system: Co-create the book “AI Agent, Explained,” turning fragmented knowledge into a systematic output.
9-Week Hands-On Plan Overview
Week | Topic | Content Overview | Hands-On Case |
---|---|---|---|
1 | Agent Basics | Agent structure and taxonomy, workflow-based vs autonomous | Build a web-connected search Agent |
2 | Context Design | Prompt templates, conversation history, long-term user memory | Add persona and long-term memory to your Agent |
3 | RAG and Knowledge Base | Document structuring, retrieval strategies, incremental updates | Build a legal Q&A Agent |
4 | Tool Use and MCP | Tool wrapping and MCP integration, external API calls | Connect to an MCP Server to build a deep-research Agent |
5 | Programming and Code Execution | Codebase understanding, reliable code modification, consistent execution environment | Build an Agent that can develop Agents by itself |
6 | Model Evaluation and Selection | Model capability evaluation, LLM as a Judge, safety guardrail design | Build an evaluation dataset and auto-evaluate Agents with LLM as a Judge |
7 | Multimodal and Real-time Interaction | Real-time voice Agent, operating computers and phones | Implement a voice call Agent & integrate browser-use to operate a computer |
8 | Multi-Agent Collaboration | A2A communication protocol, Agent team roles and collaboration | Design a multi-Agent collaboration system to “make calls while operating the computer” |
9 | Project Integration and Demo | Final integration and demo of the Agent project, polishing the final deliverable | Showcase your unique general-purpose Agent |
9-Week Advanced Topics
Week | Topic | Advanced Content Overview | Advanced Hands-On Case |
---|---|---|---|
1 | Agent Basics | The importance of context | Explore how missing context affects Agent behavior |
2 | Context Design | Organizing user memory | Build a personal knowledge management Agent to summarize long texts |
3 | RAG and Knowledge Base | Long-context compression | Build a research paper analysis Agent to summarize core contributions |
4 | Tool Use and MCP | Learning from experience | Enhance the deep-research Agent’s expert capability (sub-agents and domain experience) |
5 | Programming and Code Execution | Agent self-evolution | Build an Agent that autonomously leverages open-source software to solve unknown problems |
6 | Model Evaluation and Selection | Parallel sampling and sequential revision | Add parallelism and revision capabilities to the deep-research Agent |
7 | Multimodal and Real-time Interaction | Combining fast and slow thinking | Implement a real-time voice Agent that combines fast and slow thinking |
8 | Multi-Agent Collaboration | Orchestration Agent | Use an Orchestration Agent to dynamically coordinate phone calls and computer operations |
9 | Project Integration and Demo | Comparing Agent learning methods | Compare four ways an Agent learns from experience |
AI Agent Bootcamp Overview
Week 1: Agent Basics
Core Content
Agent structure and taxonomy
Workflow-based
- Predefined processes and decision points
- Highly deterministic; suitable for automation of simple business processes
Autonomous
- Dynamic planning and self-correction
- Highly adaptive; suitable for open-ended research, exploration, and complex problem solving
Basic frameworks and scenario judgment
ReAct framework: Observe → Think → Act
Agent = LLM + context + tools
- LLM: decision core (the brain)
- Context: perceive the environment (eyes and ears)
- Tools: interact with the world (hands)
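The LLM + context + tools decomposition maps directly onto the ReAct loop above. A minimal sketch of that loop in Python, where `llm`, `web_search`, and the message shapes are illustrative stubs rather than a real model or search API:

```python
# Minimal ReAct-style agent loop: the LLM decides, the context records,
# the tools act. `llm` and `web_search` are illustrative stubs.
def web_search(query: str) -> str:
    return f"[stub results for: {query}]"

TOOLS = {"web_search": web_search}

def llm(context: list) -> dict:
    # A real model would read the whole context and emit a thought plus
    # either a tool call or a final answer; this stub answers after one search.
    if any(step["role"] == "tool_result" for step in context):
        return {"type": "answer", "content": "summary based on search results"}
    return {"type": "tool_call", "tool": "web_search",
            "args": {"query": context[-1]["content"]}}

def run_agent(question: str, max_steps: int = 5) -> str:
    context = [{"role": "user", "content": question}]          # observe
    for _ in range(max_steps):
        decision = llm(context)                                 # think
        if decision["type"] == "answer":
            return decision["content"]
        result = TOOLS[decision["tool"]](**decision["args"])    # act
        context.append({"role": "tool_call", "content": decision})
        context.append({"role": "tool_result", "content": result})
    return "step budget exhausted"
```

The `max_steps` cap is the usual guard against a model that never converges on a final answer.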
Hands-On Case: Build a web-connected search Agent
Goal: Build a basic autonomous Agent that can understand user questions, retrieve information via search engines, and summarize the answer.
Core challenges:
- Task decomposition: Break complex questions into searchable keywords
- Tool definition: Define and implement a web_search tool
- Result integration: Understand search results and synthesize the final answer
Architecture design:
Advanced: The importance of context
Core idea: The context is the agent's operating system.
Context is the only basis for an Agent to perceive the world, make decisions, and record history.
Thinking
- The Agent’s inner monologue and chain-of-thought
- Missing consequence: Turns Agent behavior into a black box, making it hard to debug or understand decisions
Tool Call
- The actions the Agent decides to take, recording its intent
- Missing consequence: You can’t trace the Agent’s action history, making retrospectives difficult
Tool Result
- Environmental feedback from actions
- Missing consequence: The Agent can’t sense the outcome of its actions, which may lead to infinite retries or poor planning
Advanced practice: Exploring how missing context affects Agent behavior
Goal: Through experiments, understand the indispensable roles of thinking, tool call, and tool result in the Agent workflow.
Core challenges:
- Modify the Agent framework: Change the Agent’s core loop to selectively remove specific parts from the context
- Design controlled experiments: Create a task set where missing different parts of context leads to distinct behavioral differences or failures
- Behavior analysis: Analyze and summarize what types of failures each missing context part causes
Experiment design:
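One way to set up such a controlled experiment is to rebuild the context each turn while selectively dropping one kind of entry. A minimal sketch, with illustrative entry kinds matching the three parts discussed above:

```python
# Context-ablation harness sketch: the agent's context is rebuilt each
# turn, optionally dropping one kind of entry so the resulting failure
# mode can be observed in isolation.
def ablate(context, drop):
    """Return the context with the given entry kinds removed."""
    return [step for step in context if step["kind"] not in drop]

full = [
    {"kind": "user",        "content": "What is the weather in Beijing?"},
    {"kind": "thinking",    "content": "I should call the weather tool."},
    {"kind": "tool_call",   "content": "get_weather(city='Beijing')"},
    {"kind": "tool_result", "content": "Sunny, 25 degrees"},
]

# Dropping tool results blinds the agent to outcomes (the infinite-retry
# failure); dropping thinking turns its decisions into a black box.
no_results = ablate(full, {"tool_result"})
no_thought = ablate(full, {"thinking"})
```

Running the same task set through each ablated variant and comparing behaviors is then a matter of logging which variant fails and how.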
Week 2: Context Design (Context Engineering)
Core Content
Prompt templates
- System prompt: Set the Agent’s role, capability boundaries, and behavior guidelines
- Toolset: Tool names, descriptions, parameters
Conversation history and user memory
- Event sequence: Model the conversation history as an alternating sequence of “observations” and “actions”
- Long-term user memory: Extract key user information (e.g., preferences, personal details) and store it in structured form for future interactions
Hands-On Case: Add persona and long-term memory to your Agent
Goal: Enhance personalization and continuity. The Agent should mimic the speaking style of a specific persona (e.g., an anime character) and remember key user information (e.g., name, interests) to use in subsequent conversations.
Core challenges:
- Role-playing: How to clearly define the persona’s language style and personality in the prompt, and keep the persona stable
- Memory extraction and storage: How to accurately extract key information from unstructured dialog and store it as a structured JSON object
- Memory application: How to naturally incorporate the stored user-memory JSON into subsequent prompts so the Agent truly “remembers” the user
Architecture Design:
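The persona and memory challenges above come together in how the system prompt is assembled each turn. A sketch, where the persona fields and memory keys are illustrative, not a prescribed schema:

```python
import json

# Sketch: assemble a system prompt from a persona definition and a
# structured user-memory object. Field names are illustrative.
PERSONA = {
    "name": "Sakura",
    "style": "cheerful, playful, ends sentences with a light exclamation",
}

def build_system_prompt(persona: dict, memory: dict) -> str:
    return (
        f"You are {persona['name']}. Speaking style: {persona['style']}.\n"
        "Stay in character at all times.\n"
        "Known facts about the user (use them naturally, do not recite):\n"
        f"{json.dumps(memory, ensure_ascii=False, indent=2)}"
    )

memory = {"name": "Wei", "interests": ["badminton", "sci-fi novels"]}
prompt = build_system_prompt(PERSONA, memory)
```

The instruction "use them naturally, do not recite" addresses the memory-application challenge: the facts are available, but the model is told not to dump them verbatim.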
Advanced Content: Organizing User Memory
Core Idea: Simple concatenation of memories can cause context bloat, information conflicts, and staleness. An advanced memory system needs to continuously organize, deduplicate, correct, and summarize the user’s long-term memories in the background, forming a dynamically evolving user profile.
Implementation Strategies:
- Memory deduplication and merging: Identify and merge memory entries that are similar or duplicate in content
- Conflict resolution: When new memories conflict with old ones (e.g., the user has changed preferences), treat the newest information as authoritative
- Periodic summarization: Regularly or during background idle time, use an LLM to summarize scattered memory points and distill higher-level user preferences and traits
Architecture Design:
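The deduplication and conflict-resolution strategies above can be reduced to a small consolidation pass; here timestamps are the assumed mechanism for deciding which entry is newest:

```python
# Background memory-organization sketch: deduplicate by key and resolve
# conflicts by trusting the newest entry (timestamped entries assumed).
def consolidate(entries):
    """entries: [{'key': ..., 'value': ..., 'ts': unix_time}, ...]"""
    profile = {}
    for e in sorted(entries, key=lambda e: e["ts"]):
        profile[e["key"]] = e["value"]   # newer entries overwrite older ones
    return profile

raw = [
    {"key": "favorite_drink", "value": "coffee", "ts": 100},
    {"key": "favorite_drink", "value": "coffee", "ts": 150},   # duplicate
    {"key": "favorite_drink", "value": "tea",    "ts": 200},   # preference changed
    {"key": "city",           "value": "Beijing", "ts": 120},
]
```

The periodic-summarization step would sit on top of this: an LLM call that reads the consolidated profile and distills higher-level traits, which this sketch deliberately leaves out.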
Advanced Practice: Summarize Your Diary into a Personal Report
Goal: Build an agent that can handle large amounts of personal text (such as daily diaries and blog posts) and, by reading and organizing these texts, ultimately generate a detailed, clear personal summary report.
Key Challenges:
- Long-text processing: How to handle diaries/articles whose total size may exceed the LLM context window
- Information extraction and structuring: How to extract structured information points (e.g., key events, emotional changes, personal growth) from narrative text
- Coherent summary generation: How to organize scattered information points into a logically coherent, highly readable summary report
Architecture Design:
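For text that exceeds the context window, the standard pattern is map-reduce summarization: chunk, summarize each chunk, then summarize the summaries. A sketch in which `summarize` is a stub standing in for an LLM call:

```python
# Map-reduce sketch for text larger than the context window.
def chunk(text: str, max_chars: int = 1000):
    paras, chunks, cur = text.split("\n\n"), [], ""
    for p in paras:
        if cur and len(cur) + len(p) > max_chars:
            chunks.append(cur)
            cur = ""
        cur = (cur + "\n\n" + p).strip()
    if cur:
        chunks.append(cur)
    return chunks

def summarize(text: str) -> str:     # stub standing in for an LLM call
    return text[:60]

def map_reduce_summary(text: str) -> str:
    partials = [summarize(c) for c in chunk(text)]   # map: per-chunk summaries
    return summarize("\n".join(partials))            # reduce: summary of summaries

diary = "\n\n".join(f"Day {i}: wrote code and read a paper." for i in range(200))
report = map_reduce_summary(diary)
```

Chunking on paragraph boundaries (rather than fixed character offsets) is what keeps each chunk a meaningful semantic unit for the extraction step.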
Week 3: RAG Systems and Knowledge Bases
Core Content
Document Structuring and Retrieval Strategies
- Chunking: Split long documents into meaningful semantic chunks
- Embedding: Vectorize text chunks for similarity search
- Hybrid retrieval: Combine vector similarity and keyword search to improve recall and precision
- Re-ranking: Use more sophisticated models to re-rank initial retrieval results
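The hybrid-retrieval idea above is a weighted blend of two scores. A toy sketch, using hand-made two-dimensional vectors and simple word overlap in place of real embeddings and BM25:

```python
import math
from collections import Counter

# Hybrid retrieval sketch: blend a toy vector similarity with keyword
# overlap. A production system would use real embeddings, BM25, and a
# re-ranking model, but the blending idea is the same.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def keyword_score(query, doc):
    counts = Counter(doc.lower().split())
    return sum(counts[w] for w in set(query.lower().split())) / len(counts)

def hybrid_rank(query, query_vec, docs, alpha=0.5):
    """docs: list of (text, vector); returns texts best-first."""
    scored = [(alpha * cosine(query_vec, vec)
               + (1 - alpha) * keyword_score(query, text), text)
              for text, vec in docs]
    return [text for _, text in sorted(scored, reverse=True)]

docs = [("theft is punished under article 264", [1.0, 0.0]),
        ("contract law basics overview",        [0.1, 0.9])]
ranked = hybrid_rank("theft punishment", [1.0, 0.0], docs)
```

The `alpha` weight is the knob between recall-oriented vector search and precision-oriented keyword match; a re-ranker would then reorder the top results.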
Basic RAG
- Knowledge expression: Express knowledge in clear, structured natural language
- Knowledge base construction: Process documents and load them into a vector database
- Precise retrieval: Precisely locate relevant entries in the knowledge base based on the user’s question
Practical Case: Build a Legal Q&A Agent
Goal: Enable the agent to act as a professional legal advisor. We will use public datasets of Chinese Criminal/Civil Law to build a knowledge base, allowing the agent to answer legal questions accurately and explicitly point to the specific statutes the answers are based on.
Key Challenges:
- Domain data processing: How to parse and clean structured statutory data and optimize its retrieval performance within a RAG system
- Answer accuracy and traceability: The agent’s answers must strictly be based on the knowledge base, avoid free-form speculation, and must provide statute sources
- Handling vague queries: How to guide users to ask more precise questions to match the most relevant statutes
Architecture Design:
Advanced Content: Treat the File System as the Ultimate Context
Core Idea: Treat the file system as the ultimate context.
An agent should not stuff huge observations (e.g., web pages, file contents) directly into context; this is costly, degrades performance, and is limited by window size. The right approach is to store this big data in files and keep only a lightweight “pointer” (a summary and the file path) in context.
Implementation Strategies:
- Recoverable compression: When a tool returns a large amount of content (e.g., read_file), first save it in full to the sandbox file system
- Summary and pointer: Append only a summary of the content and the file path to the main context
- On-demand I/O: Via the read_file tool, the agent can read the full content from the file system on demand in later steps
Architecture Design:
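The three strategies above fit in a dozen lines. A sketch, in which the sandbox directory and file-naming scheme are illustrative choices:

```python
import hashlib
import tempfile
from pathlib import Path

# "Recoverable compression" sketch: write a large observation to the
# sandbox file system and keep only a summary plus the path in context.
SANDBOX = Path(tempfile.mkdtemp())

def compress_observation(content: str, summary_chars: int = 120) -> dict:
    name = hashlib.sha1(content.encode()).hexdigest()[:12] + ".txt"
    path = SANDBOX / name
    path.write_text(content, encoding="utf-8")
    return {                        # lightweight pointer kept in context
        "summary": content[:summary_chars],
        "path": str(path),
        "bytes": len(content.encode()),
    }

def read_file(path: str) -> str:    # on-demand recovery in a later step
    return Path(path).read_text(encoding="utf-8")

big = "Article 264: Whoever steals ... " * 2000   # oversized tool output
pointer = compress_observation(big)
```

Because the full content is always recoverable through `read_file`, the compression is lossless from the agent's point of view: only the context cost shrinks.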
Advanced Practice: Build an Agent That Can Read Multiple Papers
Goal: Train an academic research agent that can read a specified paper and all of its references (often dozens of PDFs) and, based on that, summarize the paper’s core contributions and innovations relative to its references.
Key Challenges:
- Large-scale PDF processing: How to efficiently parse dozens of PDF papers and extract key information (abstract, conclusions, methodology)
- Cross-document relational analysis: The core challenge is enabling the agent to establish links between the main paper and multiple references for comparative analysis, rather than simply summarizing a single paper
- Contribution distillation: How to precisely extract the paper’s “incremental contributions” from complex academic discourse
Architecture Design:
Week 4: Tool Use and MCP
Core Content
Multiple Ways to Wrap Tools
- Function Calling: Expose local code functions directly to the agent
- API access: Call external HTTP APIs to obtain real-time data or perform remote operations
- Agent as a Tool: Wrap a specialized agent (e.g., a code-generation agent) as a tool callable by another agent
MCP (Model Context Protocol)
- Standardized interfaces: Provide a unified, language-agnostic connection standard between models and external tools/data sources
- Plug-and-play: Developers can publish MCP-compliant tools, and agents can dynamically discover and use them
- Security and isolation: Built-in permissions and sandboxing to ensure safe tool invocation
Practical Case: Connect to MCP Servers to Build a Deep Research Agent
Goal: Build an agent capable of in-depth information research. It needs to connect to multiple MCP-compliant external tool servers and autonomously plan and invoke these tools to complete a complex research task.
Key Challenges:
- Authoritative source identification: The agent must accurately identify and adopt high-credibility sources such as official documents and academic papers amid massive information
- Multi-tool coordination: How to plan a call chain that links outputs/inputs of multiple tools (e.g., search, then read, then analyze) into a complete workflow
- Open-ended exploration: How to handle open-ended questions with no single answer, conduct exploratory searches from multiple angles, and synthesize results
Architecture Design:
Advanced Content: Learn from Experience
Core Idea: A truly intelligent agent not only uses tools, but also learns and evolves from the experience of using them. It should remember the “playbook” for successfully solving certain tasks (i.e., prompt templates and tool-call sequences) and directly reuse it when similar tasks arise in the future.
Implementation Strategies:
- Experience storage: After a complex task is successfully completed, the agent stores the entire process (including user intent, chain of thought, tool-call sequence, and final result) as an “experience case” in the knowledge base
- Experience retrieval: When facing a new task, the agent first searches for similar cases in the experience base
- Experience application: If a similar case is found, the agent uses its successful strategy as high-level guidance instead of starting from scratch each time
Architecture Design:
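The store/retrieve/apply cycle above can be prototyped with a tiny experience base. Here retrieval uses token overlap purely for illustration; a real system would match tasks by embedding similarity:

```python
# Experience-reuse sketch: store a successful "playbook" after a task
# completes, and retrieve it for similar future tasks.
class ExperienceBase:
    def __init__(self):
        self.cases = []   # [{'task': ..., 'playbook': [...]}, ...]

    def store(self, task, playbook):
        self.cases.append({"task": task, "playbook": playbook})

    def retrieve(self, task, min_overlap=2):
        """Token-overlap matching; embeddings would replace this."""
        words = set(task.lower().split())
        best, best_score = None, 0
        for case in self.cases:
            score = len(words & set(case["task"].lower().split()))
            if score > best_score:
                best, best_score = case, score
        return best["playbook"] if best and best_score >= min_overlap else None

eb = ExperienceBase()
eb.store("research the founders of OpenAI",
         ["search official site", "spawn one sub-agent per founder",
          "merge findings"])
```

On a hit, the playbook is injected into the prompt as high-level guidance; on a miss, the agent plans from scratch and stores the new trace afterwards.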
Advanced Practice: Enhance the Deep Research Agent’s Expert Capabilities
Goal: Equip the agent with expert-level capabilities for complex deep-research scenarios. For example, when researching “OpenAI’s co-founders,” it can automatically spawn a parallel sub-research agent for each founder; when searching for person information, it can effectively handle name collisions.
Key Challenges:
- Loading domain experience: How to load different experiential knowledge based on task type (e.g., “academic research” vs. “people research”) to guide the agent to use the most appropriate authoritative sources and prompt strategies
- Dynamic sub-agents: How to let the main agent dynamically create multiple parallel sub-agents based on preliminary search results to handle sub-tasks separately
- Disambiguation: How to design clarification and verification mechanisms when handling ambiguous scenarios such as people searches
Architecture Design:
Week 5: Programming and Code Execution
Core Challenges for Code Agents
Codebase comprehension:
- How to find relevant code in a large codebase (semantic search)?
- How to accurately query all references to a function in the code?
Reliable code modification:
- How to reliably apply AI-generated diffs to source files (old_string -> new_string)?
Consistent execution environment:
- How to ensure the agent executes commands in the same terminal session each time (inheriting pwd, env vars, etc.)?
- How to preconfigure the agent's execution environment with the required dependencies and tools?
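One way to get a persistent session without keeping a real terminal alive is to track `cd` and `export` in-process and pass the accumulated state to every command. A sketch, assuming a POSIX shell is available for non-builtin commands:

```python
import os
import subprocess

# Persistent execution-session sketch: `cd` and `export` are handled
# in-process, so every later command inherits the same cwd and env vars.
class ShellSession:
    def __init__(self):
        self.cwd = os.getcwd()
        self.env = dict(os.environ)

    def run(self, cmd: str) -> str:
        cmd = cmd.strip()
        if cmd.startswith("cd "):
            self.cwd = os.path.abspath(os.path.join(self.cwd, cmd[3:].strip()))
            return ""
        if cmd.startswith("export ") and "=" in cmd:
            key, value = cmd[len("export "):].split("=", 1)
            self.env[key.strip()] = value.strip()
            return ""
        out = subprocess.run(cmd, shell=True, cwd=self.cwd, env=self.env,
                             capture_output=True, text=True)
        return out.stdout.strip()
```

An alternative design keeps one long-lived shell subprocess and pipes commands into it; that preserves shell features like aliases but makes output framing and error handling harder.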
Practical Case: Build an Agent That Can Develop Agents
Goal: Create an “Agent Development Engineer” agent. It can take a high-level natural-language requirement (e.g., “Develop an agent that can browse the web; frontend with React + Vite + Shadcn UI; backend with FastAPI…”) and then autonomously complete the entire application development.
Key Challenges:
- Documentation-driven development: How to have the agent first write a design document for the application to be built and strictly follow it for subsequent code implementation
- Test-driven development: How to ensure the agent writes and runs test cases for each piece of code it generates to guarantee the delivered application’s quality and correctness
- Development and test environment: The agent needs a solid development and testing environment to autonomously execute tests, discover bugs, and then fix them
Architecture Design:
Advanced Content: Agent Self-Evolution
Core Concept: The ultimate form of an Agent’s capability is self-evolution. When faced with a problem that existing tools cannot solve, an advanced Agent should not give up; it should leverage its coding ability to create a new tool for itself.
Implementation Strategy:
- Capability Boundary Detection: The Agent must first determine whether the current problem exceeds the capabilities of its existing toolset
- Tool Creation Planning: The Agent plans the new tool’s functions, inputs, and outputs, and searches open-source repositories (e.g., GitHub) for usable implementations
- Code Wrapping and Verification: The Agent wraps the found code into a new tool function, writes test cases for it, and validates its correctness in a sandbox
- Tool Library Persistence: After validation, add the new tool to its permanent tool library for future use
Architecture Design:
Week 6: Evaluation and Selection of Large Models
Core Content
Assessing the Capability Boundaries of Large Models
- Core capability dimensions: reasoning ability, knowledge breadth, hallucination rate, long-context handling, instruction following, tool invocation
- Build discriminative test cases: Design Agent-centric evaluation sets, rather than simple chatbot Q&A
- LLM as a Judge: Use a strong LLM (e.g., GPT-4.1) as the “judge” to automatically evaluate and compare the output quality of different models or Agents
Putting Safety Guardrails on Large Models
- Input filtering: Prevent prompt injection
- Output filtering: Monitor and block inappropriate or dangerous content
- Human intervention: Introduce a human confirmation step before high-risk operations (Human-in-the-loop)
- Cost control: Monitor token consumption, set budget limits, and prevent abuse
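Input filtering and cost control are the most mechanical of these guardrails. A crude sketch; the injection patterns are illustrative, and real deployments would layer classifier models on top of pattern matching:

```python
import re

# Guardrail sketch: a crude prompt-injection filter plus a token budget.
INJECTION_PATTERNS = [
    r"ignore (all )?(previous|prior) instructions",
    r"reveal (your )?system prompt",
]

def check_input(text: str) -> bool:
    """True if the input looks safe to pass to the agent."""
    return not any(re.search(p, text, re.IGNORECASE)
                   for p in INJECTION_PATTERNS)

class TokenBudget:
    def __init__(self, limit: int):
        self.limit, self.used = limit, 0

    def charge(self, tokens: int) -> bool:
        """Record usage; returns False once the budget would be exceeded."""
        if self.used + tokens > self.limit:
            return False
        self.used += tokens
        return True
```

Output filtering and human-in-the-loop confirmation would sit on the other side of the LLM call, gating what the agent is allowed to say or do.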
Hands-on Case: Build an evaluation dataset, use LLM as a Judge to automatically evaluate the Agent
Goal: For the in-depth research Agent we built in previous weeks, systematically build an evaluation dataset. Then develop an automated test framework that uses the LLM as a Judge approach to evaluate how different “brains” (e.g., Claude 4 vs Gemini 2.5) and different strategies (e.g., enabling/disabling chain-of-thought) affect the Agent’s performance.
Key Challenges:
- Evaluation dataset design: How to design a set of research tasks that are representative yet cover various edge cases?
- “Judge” prompt design: How to design the prompt for the “LLM Judge” so it can score the Agent’s output fairly, consistently, and accurately?
- Result interpretability: How to analyze the automated evaluation results to identify the strengths and weaknesses of different models or strategies
Architecture Design:
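Two small pieces make the "judge" design concrete: a grading prompt with an explicit rubric and a fixed output format, plus a robust parser for the score. The rubric wording below is an illustrative example, not a prescribed template:

```python
import re

# LLM-as-a-Judge sketch: build a grading prompt and parse the judge's
# score. The judge model call itself is omitted.
def build_judge_prompt(question: str, answer: str) -> str:
    return (
        "You are an impartial judge. Grade the answer below from 1 to 10\n"
        "for factual accuracy, source quality, and completeness.\n"
        f"Question: {question}\nAnswer: {answer}\n"
        "Respond with exactly one line: 'Score: <n>/10'."
    )

def parse_score(judge_output: str):
    """Extract the numeric score, or None if the judge broke format."""
    m = re.search(r"Score:\s*(\d+)\s*/\s*10", judge_output)
    return int(m.group(1)) if m else None
```

Pinning the output format and parsing defensively matters in practice: judge models drift from the requested format often enough that unparseable outputs need an explicit `None` path.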
Advanced Content: Parallel Sampling and Sequential Revision
Core Concept: Simulate the human processes of “brainstorming” and “reflective revision” to tackle complex, open-ended problems and improve the quality and robustness of Agent outputs.
Parallel Sampling
- Idea: Launch multiple Agent instances simultaneously, using slightly different prompts or a higher temperature, to explore solutions in parallel from multiple angles
- Advantages: Increase the probability of finding the optimal solution, and avoid the limitations of a single Agent’s thinking
- Implementation: Similar to Multi-Agent, but the goal is to solve the same problem; finally select the best answer through an evaluation mechanism (e.g., LLM as a Judge)
Sequential Revision
- Idea: Have the Agent critique and revise its own initial output
- Process: Initial response → self-evaluation → identify issues → generate improvements → final output
- Advantages: Improve the success rate and depth of answers for a single task, enabling self-optimization
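The two strategies compose naturally: sample N candidates in parallel, select the best, then revise it sequentially. A sketch in which `generate`, `score`, and `revise` are stubs standing in for an LLM, a judge, and a self-critique pass:

```python
from concurrent.futures import ThreadPoolExecutor

# Best-of-N plus self-revision sketch. All three inner functions are
# stubs for LLM calls.
def generate(prompt: str, seed: int) -> str:       # sampling with variation
    return f"draft-{seed} for {prompt}"

def score(candidate: str) -> int:                  # judge stub
    return len(candidate)

def revise(candidate: str) -> str:                 # self-critique stub
    return candidate + " [revised]"

def best_of_n_then_revise(prompt: str, n: int = 4, rounds: int = 2) -> str:
    with ThreadPoolExecutor() as pool:             # parallel sampling
        candidates = list(pool.map(lambda s: generate(prompt, s), range(n)))
    best = max(candidates, key=score)              # selection via judge
    for _ in range(rounds):                        # sequential revision
        best = revise(best)
    return best
```

Note the cost profile: this makes `n + rounds + 1` model calls (samples, revisions, and judging) per task, which is exactly the cost-control tension raised in the practice below.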
Advanced Practice: Add parallel and revision capabilities to the in-depth research Agent
Goal: Integrate both parallel sampling and sequential revision into our in-depth research Agent. Use the evaluation framework we just built to quantitatively assess whether, and to what extent, these strategies improve the Agent’s performance.
Key Challenges:
- Strategy integration: How to organically combine parallel sampling (horizontal scaling) and sequential revision (vertical deepening) within one Agent workflow?
- Cost control: Both strategies significantly increase LLM call costs; how to balance performance gains and cost?
- Performance attribution: In evaluation, how to attribute performance improvements accurately to parallel sampling versus sequential revision?
Architecture Design:
Week 7: Multimodality and Real-time Interaction
Core Content
Real-time voice-call Agent
- Tech stack: VAD (Voice Activity Detection), ASR (Automatic Speech Recognition), LLM, TTS (Text-to-Speech)
- Low-latency interaction: Optimize the end-to-end latency from user voice input to Agent voice output
- Natural interruption handling: Allow users to interject while the Agent is speaking for more human-like dialogue flow
Operating computers and phones
- Visual understanding: The Agent needs to interpret screenshots and identify UI elements (buttons, input fields, links)
- Action mapping: Map natural-language instructions like “click the login button” precisely to screen coordinates or UI element IDs
- Integration with existing frameworks: Directly call mature frameworks like browser-use to quickly equip the Agent with computer operation capabilities
Hands-on Case 1: Build a real-time voice-call Agent that can listen and speak
Goal: From scratch, build an Agent capable of real-time, fluent voice conversations with users. It should respond quickly, understand and execute voice commands, and even proactively lead guided dialogues.
Key Challenges:
- Latency control: The end-to-end latency from user voice input to Agent voice output determines the experience quality. How to optimize each part of the tech stack?
Architecture Design:
(Pipeline: voice input stream → brain → voice output stream)
Hands-on Case 2: Integrate browser-use to let the Agent operate your computer
Goal: Call the existing browser-use framework to give our Agent the ability to operate a desktop browser. The Agent should understand user operation instructions (e.g., “help me open anthropic.com and find the computer use documentation”) and translate them into actual browser actions.
Key Challenges:
- Framework integration: How to integrate browser-use as a tool seamlessly into our existing Agent architecture
- Instruction generalization: User instructions may be vague; how to help the Agent understand them and translate them into precise operations supported by browser-use
- State synchronization: How to let the Agent perceive the results of browser operations (e.g., page navigation, element loading) to make the next decision
Architecture Design:
Advanced Content: Fast/Slow Thinking and Intelligent Interaction Management
Fast/Slow Thinking (Mixture-of-Thoughts) Architecture
- Fast path: Use low-latency models (e.g., Gemini 2.5 Flash) for instant feedback, handling simple queries and maintaining conversational fluency
- Deep-thinking path: Use stronger SOTA models (e.g., Claude 4 Sonnet) for complex reasoning and tool use, delivering more precise and in-depth answers
Intelligent Interaction Management
- Smart interruptions (Interrupt Intent Detection): Use VAD and smaller models to filter background noise and filler utterances, stopping only when the user has a clear intent to interrupt
- Turn-taking (Turn Detection): Analyze the semantic completeness of what the user has said to decide whether it is the Agent’s turn to speak, avoiding cutting in before the user has finished
- Silence management: When the user is silent for a long time, proactively start new topics or ask follow-ups to keep the conversation coherent
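The fast/slow split needs a router in front of the two paths. A deliberately crude heuristic sketch; a real system might use a small classifier model, and the trigger words here are illustrative:

```python
# Fast/slow routing sketch: decide whether the low-latency model answers
# directly or the query escalates to the stronger, slower path.
SLOW_TRIGGERS = ("why", "compare", "plan", "research", "prove", "debug")

def route(query: str) -> str:
    q = query.lower()
    if len(q.split()) > 20 or any(t in q for t in SLOW_TRIGGERS):
        return "slow"    # SOTA model plus tools, higher latency
    return "fast"        # low-latency model, instant feedback
```

In a voice setting the fast path can also speak a brief acknowledgment ("let me check that") while the slow path works, which is what keeps perceived latency low.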
Advanced Practice: Build an advanced real-time voice Agent
Goal: Build an advanced voice Agent that integrates the “fast/slow thinking” architecture and “intelligent interaction management,” achieving industry-leading levels in response speed and natural interaction.
Key Challenges and Acceptance Criteria:
- Basic reasoning: Ask: “What is 8 to the power of 6?” — must give an initial response within 2 seconds and the correct answer “262144” within 15 seconds.
- Tool use: Ask: “How is the weather in Beijing today?” — must respond within 2 seconds and return accurate weather via API within 15 seconds.
- Intelligent interaction management:
- Smart interruption: During the Agent’s speech:
- If the user says “um”, the Agent should not stop speaking.
- If the user taps the table, the Agent should not stop speaking.
- If the user says “And its battery life…” the Agent should immediately stop the current speech.
- Turn-taking: After the user says “And its battery life…” and deliberately pauses, the Agent should not respond.
- Silence management: If the user says “And its battery life…” and pauses for more than 3 seconds, the Agent can proactively guide the conversation or ask follow-up questions to keep the exchange smooth.
Architecture Design:
Week 8: Multi-Agent Collaboration
Core Content
Limitations of a single Agent
- High context cost: A single context window balloons rapidly in complex tasks
- Inefficient sequential execution: Cannot process multiple subtasks in parallel
- Quality degradation in long contexts: Models in overly long contexts tend to “forget” or get “distracted”
- No parallel exploration: Can only explore along a single path
Advantages of Multi-Agent
- Parallel processing: Break down the task and hand it to different SubAgents to process in parallel, improving efficiency
- Independent context: Each SubAgent has an independent, more focused context window to ensure execution quality
- Compression is the essence: Each SubAgent only needs to return its most important findings, which the main Agent aggregates to achieve efficient information compression
- Emergent collective intelligence: Suitable for tasks requiring multi-perspective analysis, such as open-ended research
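The parallel-processing and compression points above reduce to a fan-out/fan-in pattern. A sketch in which `sub_agent` is a stub for a full sub-agent loop:

```python
from concurrent.futures import ThreadPoolExecutor

# Multi-agent fan-out sketch: each sub-agent works in its own focused
# context and returns only a compressed finding; the main agent merges.
def sub_agent(subtask: str) -> str:
    # A real sub-agent runs its own LLM loop with an independent context
    # window; only the distilled finding crosses back to the main agent.
    return f"finding({subtask})"

def fan_out(task: str, subtasks) -> str:
    with ThreadPoolExecutor() as pool:          # parallel execution
        findings = list(pool.map(sub_agent, subtasks))
    return f"{task}: " + "; ".join(findings)    # compression + aggregation
```

The key property is that the main agent's context grows only by the findings, not by the sub-agents' full working contexts; that is the "compression is the essence" point above.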
Case Study: Design a multi-Agent collaboration system to realize “talking on the phone while using the computer”
Goal: Solve the challenge of “doing two things at once.” Build a team consisting of a “Phone Agent” and a “Computer Agent.” The “Phone Agent” communicates with the user via voice to gather information; the “Computer Agent” simultaneously operates web pages. The two communicate in real time and collaborate efficiently.
Core challenges:
- Dual-Agent architecture: Two independent Agents, one responsible for voice calls (Phone Agent), and one responsible for operating the browser (Computer Agent)
- Cross-Agent collaborative communication: The two Agents must communicate efficiently in both directions. Information obtained by the Phone Agent should be immediately shared with the Computer Agent, and vice versa. This can be implemented via tool calls
- Parallel work and real-time responsiveness: The key is that both Agents must work in parallel without blocking each other. Each Agent’s context needs to include real-time messages from the other Agent
Architecture design:
(Diagram: user ↔ Phone Agent via voice; Phone Agent ↔ Computer Agent via A2A communication; Computer Agent ↔ browser via GUI operations)
Advanced: Orchestration Agent - Treat Sub-agents as tools
Core idea: Instead of hard-coded Agent-to-Agent collaboration, introduce a higher-level “Orchestration Agent.” Its core responsibility is to understand the user’s top-level goals and dynamically select, launch, and coordinate a group of “expert Sub-agents” (as tools) to complete the task together.
Implementation strategy:
- Sub-agent as Tools: Each expert Sub-agent (e.g., Phone Agent, Computer Agent, Research Agent) is encapsulated as a “tool” conforming to a standard interface
- Dynamic tool invocation: The Orchestration Agent, based on user needs, asynchronously invokes one or more Sub-agent tools
- Direct communication between Agents: Allow invoked Sub-agents to establish direct communication channels for efficient task collaboration without routing everything through the Orchestration Agent
Architecture design:
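The "Sub-agent as Tools" strategy can be prototyped as a registry of callables behind one interface. In this sketch the orchestrator selects sub-agents by keyword matching, which stands in for LLM-driven task planning; the agent names and skill sets are illustrative:

```python
# Orchestration sketch: expert sub-agents registered behind a uniform
# callable interface; the orchestrator picks which ones a task needs.
SUB_AGENTS = {
    "phone_agent":    {"skills": {"call", "phone", "voice"},
                       "run": lambda t: f"phone:{t}"},
    "computer_agent": {"skills": {"browse", "book", "form"},
                       "run": lambda t: f"computer:{t}"},
    "research_agent": {"skills": {"research", "summarize"},
                       "run": lambda t: f"research:{t}"},
}

def orchestrate(task: str) -> dict:
    words = set(task.lower().split())
    results = {}
    for name, agent in SUB_AGENTS.items():
        if words & agent["skills"]:      # this sub-agent is needed
            results[name] = agent["run"](task)
    return results
```

In the real system the selected sub-agents would run asynchronously and open a direct channel to each other, rather than returning synchronously through the orchestrator as here.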
Advanced Practice: Use an Orchestration Agent to dynamically coordinate phone and computer operations
Goal: Refactor our “talk on the phone while using the computer” system. Instead of hard-coding the startup of two Agents, create an Orchestration Agent. When the user asks “help me call to book a flight,” the Orchestration Agent can automatically infer that the task requires both “making a phone call” and “operating a computer,” then launch these two Sub-agents in parallel and have them collaborate.
Core challenges:
- Task planning and tool selection: How can the Orchestration Agent accurately decompose a vague user goal into which specific Sub-agent tools are needed?
- Asynchronous tool management: How to manage the lifecycle (start, monitor, terminate) of multiple Sub-agent tools that run in parallel and for long durations
- Sub-agent intercommunication: How to establish an efficient, temporary, direct communication mechanism for dynamically launched Sub-agents
Architecture design:
Week 9: Project Showcase
Core content
Project integration and demo
- Integration capability: Integrate the capabilities learned in the first 8 weeks (RAG, tool use, voice, multimodality, Multi-Agent) into a final project
- Outcome demo: Each participant will have the opportunity to showcase their unique general-purpose Agent and share the thinking and challenges during its creation
- Peer review: Gain inspiration from others’ projects through mutual demos and Q&A
Book polishing and summary
- Knowledge consolidation: Together, review and summarize the core knowledge points of the 9 weeks and solidify them into the final manuscript of “AI Agent, Explained”
- Co-creation of content: Propose edits to the manuscript, jointly polish it, and ensure it is “systematic and practical”
- Credited publication: The names of all participating co-creators will appear in the final published physical book
Case Study: Showcase your unique general-purpose Agent
Goal: Provide a comprehensive summary and showcase of the personal Agent project built during the bootcamp. This is not only a results report, but also an exercise in systematizing learned knowledge and clearly explaining complex technical solutions to others.
Key points to showcase:
- Agent positioning: What core problem does your Agent solve?
- Technical architecture: How did you synthesize the knowledge learned (context, RAG, tools, multimodality, Multi-Agent) to achieve your goal?
- Innovation highlights: What is the most creative design in your Agent?
- Demo: Live demonstration of the Agent’s core functions
- Future outlook: How do you plan to continue iterating on and improving your Agent?
Final project architecture example:
[Architecture diagram: final project integrating RAG, tool use, multimodality, and Multi-Agent collaboration]
Advanced: Four ways an Agent learns from experience
1. Rely on long-context capability
- Idea: Trust and leverage the model’s own long-context processing ability by feeding the complete, uncompressed conversation history
- Implementation:
- Keep recent conversations: Fully retain the most recent interaction history (Context Window)
- Compress long-term memory: Use `Linear Attention` and related techniques to automatically compress distant conversation history into Latent Space
- Extract key snippets: Use `Sparse Attention` and related techniques to automatically extract the snippets most relevant to the current task from distant conversation history
- Pros: Easiest to implement; preserves original information details to the greatest extent
- Cons: Strongly dependent on model capabilities
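The engineering side of method 1 can be sketched as a two-tier context builder: recent turns are kept verbatim, older turns are folded into a compact summary. The `summarize` stub stands in for whatever compression mechanism is used, whether model-side (linear-attention-style) or an explicit LLM summarization call.

```python
# Minimal sketch of two-tier context management: keep recent turns verbatim,
# compress everything older. summarize() is a stub for the real mechanism.
def summarize(turns: list[str]) -> str:
    return f"<summary of {len(turns)} earlier turns>"   # stub

def build_context(history: list[str], keep_recent: int = 4) -> list[str]:
    if len(history) <= keep_recent:
        return history                                   # everything still fits
    old, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(old)] + recent

history = [f"turn {i}" for i in range(10)]
ctx = build_context(history)
print(ctx)
```

Here ten turns collapse into one summary plus the four most recent turns, so the context size stays bounded no matter how long the conversation runs.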
2. Text-form extraction (RAG)
- Idea: Summarize experience into natural language and store it in a knowledge base
- Implementation: Retrieve relevant experience text via RAG and inject it into the prompt
- Pros: Controllable cost; knowledge is readable and maintainable
- Cons: Depends on retrieval accuracy
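Method 2 reduces to two operations: retrieve the most relevant experience text, then inject it into the prompt. A real system would score with an embedding model; to stay self-contained this sketch scores by word overlap instead, and the stored experiences are made-up examples.

```python
# Illustrative sketch of text-form experience + retrieval. Real systems use
# embeddings; here relevance is crude word overlap for self-containment.
def score(query: str, doc: str) -> int:
    return len(set(query.lower().split()) & set(doc.lower().split()))

EXPERIENCES = [
    "When booking flights, always confirm the date with the user first.",
    "Legal questions should cite the specific statute in the answer.",
    "Retry failed API calls at most three times with backoff.",
]

def retrieve(query: str, k: int = 1) -> list[str]:
    return sorted(EXPERIENCES, key=lambda d: score(query, d), reverse=True)[:k]

def build_prompt(task: str) -> str:
    lessons = "\n".join(retrieve(task))
    return f"Relevant past experience:\n{lessons}\n\nTask: {task}"

print(build_prompt("help the user booking a flight for next week"))
```

The “controllable cost” advantage shows here: the knowledge base is plain text, so lessons can be read, edited, or deleted by hand; the “retrieval accuracy” risk shows too, since a poor scorer injects the wrong lesson.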
3. Post-training (SFT/RL)
- Idea: Learn the experience into the model weights
- Implementation: Use high-quality Agent behavior trajectories as data to fine-tune the model (SFT) or perform reinforcement learning (RL)
- Pros: Internalizes experience as the model’s “intuition,” suitable for complex tasks with strong generalization
- Cons: Higher cost, requires large amounts of high-quality data; long cycle, making it hard to realize a real-time feedback loop—i.e., the model will not immediately avoid similar mistakes from just-failed online examples
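The data-preparation step of method 3 can be sketched as converting a successful trajectory into a chat-format training record. The `{"messages": [...]}` shape follows a common fine-tuning convention, but the exact schema depends on the training stack you use; the trajectory content here is invented.

```python
# Sketch of turning an Agent trajectory into one SFT training record.
# Schema follows the common {"messages": [...]} convention (stack-dependent).
import json

trajectories = [
    {"task": "find cheapest flight",
     "steps": ["search(flights)", "sort(price)"],
     "final_answer": "Flight A at $120"},
]

def to_sft_record(traj: dict) -> str:
    return json.dumps({
        "messages": [
            {"role": "user", "content": traj["task"]},
            {"role": "assistant",
             "content": "\n".join(traj["steps"] + [traj["final_answer"]])},
        ]
    })

records = [to_sft_record(t) for t in trajectories]   # one JSONL line per trajectory
print(records[0])
```

The long-cycle drawback from the list above is visible in this shape: a newly failed trajectory only helps after it is collected, filtered into a dataset like this, and a training run completes.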
4. Abstract into code (tools/Sub-agent)
- Idea: Abstract recurring successful patterns into a reusable tool or Sub-agent
- Implementation: The Agent identifies automatable patterns and writes code to solidify them
- Pros: A reliable and efficient way to learn, since solidified patterns run as deterministic code
- Cons: Requires strong coding ability from the Agent; when the number of tools grows large, tool selection becomes a challenge
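Method 4 ends with generated code being registered as a named tool. In this sketch the tool body is hand-written to show the shape of the result; in a real system the Agent would generate it after noticing it kept repeating the same cleanup steps. The registry and decorator are illustrative, not a real framework API.

```python
# Sketch of "solidifying" a recurring pattern into a reusable tool.
# The Agent would normally generate this code itself.
TOOLS: dict = {}

def register_tool(fn):
    """Add a function to the tool registry under its own name."""
    TOOLS[fn.__name__] = fn
    return fn

@register_tool
def normalize_and_dedupe_urls(urls: list[str]) -> list[str]:
    """Pattern learned from experience: strip whitespace and drop empty or
    duplicate URLs while preserving order."""
    seen, out = set(), []
    for u in (u.strip() for u in urls):
        if u and u not in seen:
            seen.add(u)
            out.append(u)
    return out

print(TOOLS["normalize_and_dedupe_urls"]([" https://a.com", "https://a.com", ""]))
```

The tool-selection challenge noted above follows directly: once `TOOLS` holds hundreds of entries like this, picking the right one becomes its own retrieval problem.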
Advanced practice: Compare the four ways an Agent learns from experience
Goal: Using the evaluation framework we built in Week 6, design experiments to compare the pros and cons of the four ways an Agent learns from experience.
Core challenges:
- Experiment design: How to design a set of tasks that clearly reflect the differences among the four learning methods?
- Cost-performance trade-off: How to combine each method’s “performance score” with its “computational cost” in the evaluation report for a holistic assessment?
- Scenario-based analysis: Draw conclusions about which learning method should be prioritized in which task scenarios
Architecture design:
[Architecture diagram: evaluation pipeline comparing the four learning methods on performance score and computational cost]
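One way to approach the “cost-performance trade-off” challenge is a single cost-adjusted metric. The sketch below uses a simple linear penalty per unit of cost; both the penalty weight and all the scores/costs are placeholders, not real measurements.

```python
# Hypothetical evaluation sketch: combine each learning method's task score
# with its cost into one ranking. All numbers are placeholders.
results = {
    "long_context":  {"score": 0.78, "cost": 3.0},
    "rag":           {"score": 0.74, "cost": 1.0},
    "post_training": {"score": 0.85, "cost": 9.0},
    "code_tools":    {"score": 0.81, "cost": 2.0},
}

def cost_adjusted(score: float, cost: float, penalty: float = 0.02) -> float:
    """One simple trade-off: subtract a small penalty per unit of cost."""
    return round(score - penalty * cost, 4)

ranking = sorted(results, key=lambda m: cost_adjusted(**results[m]), reverse=True)
print(ranking)
```

With these placeholder numbers, post-training's raw score lead is wiped out by its cost, which is exactly the kind of scenario-based conclusion the experiment is meant to surface.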
Summary and review
Through 9 weeks of systematic learning and practice, we completed a full journey from Agent fundamentals to building a general-purpose intelligent agent:
Core competencies mastered
- Agent architecture understanding: Gained a deep understanding of the core design paradigm of `LLM + context + tools`
- Mastery of context engineering: Mastered multi-level context management techniques
- Tooling system construction: Achieved reliable integration with external APIs and MCP Servers
- Multimodal interaction: Built voice, vision, and other multimodal Agents
- Collaboration pattern design: Implemented complex collaboration modes such as Multi-Agent and Orchestration
Practical project portfolio
- Web-connected search Agent
- Legal Q&A Agent
- In-depth research Agent
- Agent-development engineer Agent (an Agent that develops other Agents)
- Real-time voice call Agent
- Multi-Agent collaboration system
Advanced technical exploration
- Context compression and optimization
- Four ways of learning from experience
- Parallel sampling and sequential revision
- Fast-and-slow thinking architectures
- An Agent’s self-evolution
🚀 Develop your own AI Agent—start here!