AI Agent Bootcamp: Build Your General-Purpose Agent in 9 Weeks
【This article is compiled from the first livestream of the Turing Community AI Agent Bootcamp, Slides link】
Turing Community “AI Agent Bootcamp” purchase link
Start here to develop an AI Agent of your own. This article not only systematically introduces the foundational technical path to building a general-purpose AI Agent from scratch (such as context engineering, RAG systems, tool use, multimodal interaction, etc.), but also covers advanced techniques like fast/slow thinking and multi-Agent collaboration. Through 9 weeks of hands-on projects, you will gradually master the full lifecycle of Agent development and key advanced capabilities.
This course had its first livestream preview on August 18 and will officially start on September 11. Each weekly session is about 2 hours and covers all the foundational and advanced content below. Of course, just 2 hours of lectures per week won’t be enough—you’ll also need to spend time coding and practicing.
Bootcamp Core Objectives
Start here to build an AI Agent of your own
🎯 Master core architecture and engineering skills
- Deeply understand Agent architecture: Systematically master the core design paradigm of
LLM + context + tools
. - Excel at context engineering: Master multi-level context management techniques, from conversation history and user long-term memory to external knowledge bases (RAG) and file systems.
- Master dynamic tool use: Reliably integrate Agents with external APIs and MCP servers, and enable self-improvement via code generation.
- Build advanced Agent patterns: Design and implement complex collaboration patterns such as slow/fast thinking (Mixture-of-Thoughts) and Orchestration.
💡 Build a systematic understanding of development and deployment
- Understand the evolution path: See the progression from basic RAG to Agents that can autonomously develop tools.
- Master the Agent lifecycle: Be able to independently complete the closed loop of Agent project design, development, evaluation with LLM as a Judge, and deployment.
- Build domain knowledge: Accumulate cross-domain Agent development experience through hands-on projects in law, academia, programming, and more.
- Consolidate a knowledge system: Co-create the book “AI Agents, Explained” to systematize fragmented knowledge.
9-Week Hands-on Plan Overview
Week | Topic | Content Overview | Hands-on Case |
---|---|---|---|
1 | Agent Basics | Agent structure and taxonomy; workflow-based vs autonomous | Build an Agent that can search the web |
2 | Context Design | Prompt templates; conversation history; user long-term memory | Add persona and long-term memory to your Agent |
3 | RAG and Knowledge Bases | Document structuring; retrieval strategies; incremental updates | Build a legal Q&A Agent |
4 | Tool Use and MCP | Tool wrapping and MCP integration; external API calls | Connect to an MCP server to build a deep-research Agent |
5 | Programming and Code Execution | Codebase understanding; reliable code edits; consistent execution environments | Build an Agent that can develop Agents |
6 | Model Evaluation and Selection | Model capability evaluation; LLM as a Judge; safety guardrails | Build a benchmark and auto-evaluate Agents with LLM as a Judge |
7 | Multimodal and Real-time Interaction | Real-time voice Agents; operating PCs and phones | Implement a voice-call Agent & integrate browser-use for computer control |
8 | Multi-Agent Collaboration | A2A communication protocol; Agent team roles and collaboration | Design a multi-Agent system to “make calls while operating a computer” |
9 | Project Integration and Demo | Final assembly and demo; polishing the final deliverable | Showcase your unique general-purpose Agent |
9-Week Advanced Topics
Week | Topic | Advanced Content Overview | Advanced Hands-on Case |
---|---|---|---|
1 | Agent Basics | The importance of context | Explore the impact of missing context on Agent behavior |
2 | Context Design | Organizing user memory | Build a personal knowledge management Agent for long-text summarization |
3 | RAG and Knowledge Bases | Long-context compression | Build an academic paper analysis Agent to summarize core contributions |
4 | Tool Use and MCP | Learning from experience | Enhance the deep-research Agent’s expertise (sub-agents and domain experience) |
5 | Programming and Code Execution | Agent self-evolution | Build an Agent that autonomously leverages open-source software to solve unknown problems |
6 | Model Evaluation and Selection | Parallel sampling and sequential revision | Add parallelism and revision to the deep-research Agent |
7 | Multimodal and Real-time Interaction | Combining fast and slow thinking | Implement a real-time voice Agent that combines fast and slow thinking |
8 | Multi-Agent Collaboration | Orchestration Agent | Use an Orchestration Agent to dynamically coordinate calling and computer operation |
9 | Project Integration and Demo | Comparing Agent learning methods | Compare four ways Agents learn from experience |
AI Agent Bootcamp Overview
Week 1: Agent Basics
Core Content
Agent structure and taxonomy
Workflow-based
- Predefined flows and decision points
- Highly deterministic; suitable for automating simple business processes
Autonomous
- Dynamic planning and self-correction
- Highly adaptable; suitable for open-ended research and exploration, solving complex problems
Basic frameworks and scenario fit
ReAct framework: Observe → Think → Act
Agent = LLM + context + tools
- LLM: decision core (the brain)
- Context: perceives the environment (eyes and ears)
- Tools: interact with the world (hands)
Hands-on Case: Build an Agent that can search the web
Goal: Build a basic autonomous Agent that can understand user questions, fetch information via a search engine, and produce a summarized answer.
Core challenges:
- Task decomposition: Break complex questions into searchable keywords
- Tool definition: Define and implement a
web_search
tool - Result synthesis: Understand search results and synthesize the final answer
Architecture design:
1 | ┌──────────────┐ |
Advanced: The importance of context
Core idea: The context is the agent's operating system.
Context is the only basis for an Agent to perceive the world, make decisions, and record history.
Thinking
- The Agent’s inner monologue and chain of thought
- If missing: Turns the Agent’s behavior into a black box, making it impossible to debug or understand its decisions
Tool Call
- The actions the Agent decides to take, recording its intent
- If missing: You can’t trace the Agent’s action history, making retrospectives difficult
Tool Result
- Environmental feedback from actions taken
- If missing: The Agent can’t perceive the consequences of its actions, potentially causing infinite retries or flawed plans
Advanced Practice: Exploring the impact of missing context on Agent behavior
Goal: Through experiments, understand the indispensable roles of thinking
, tool call
, and tool result
in the Agent workflow.
Core challenges:
- Modify the Agent framework: Change the Agent’s core loop to selectively remove specific parts from the context
- Design comparative experiments: Create tasks where Agents missing different context parts exhibit clear behavioral differences or failures
- Behavior analysis: Analyze and summarize what types of failures are caused by each missing context component
Experiment design:
1 | ┌─────────┐ ┌─────────────────────┐ |
Week 2: Context Design (Context Engineering)
Core Content
Prompt templates
- System prompt: Define the Agent’s persona, capability boundaries, and behavioral guidelines
- Toolset: Names, descriptions, and parameters of tools
Conversation history and user memory
- Event sequence: Model conversation history as an alternating sequence of “observations” and “actions”
- User long-term memory: Extract key information from conversations (e.g., preferences, personal info) and store it in structured form for future interactions
Hands-on Case: Add persona and long-term memory to your Agent
Goal: Enhance the Agent’s personalization and continuity. The Agent should mimic a specific character’s speaking style (e.g., an anime character) and remember the user’s key information (e.g., name, interests) to apply in subsequent conversations.
Core challenges:
- Role-playing: How to clearly define the character’s linguistic style and personality in the prompt and keep the Agent stably in character
- Memory extraction and storage: How to accurately extract key information from unstructured dialogue and store it as a structured JSON object
- Memory application: How to naturally incorporate the stored user-memory JSON into subsequent prompts so the Agent truly appears to “remember” the user
Architecture Design:
1 | ┌─────────────┐ |
Advanced Content: Organizing User Memory
Core Idea: Naively stitching memories together leads to context bloat, information conflicts, and staleness. An advanced memory system should continuously organize, deduplicate, correct, and summarize the user’s long-term memories in the background, forming a dynamically evolving user profile.
Implementation Strategies:
- Memory deduplication and merging: Identify and merge memory entries that are similar or duplicates
- Conflict resolution: When new memories conflict with old ones (e.g., the user changed preferences), prefer the latest information
- Periodic summarization: Periodically or during idle time, use an LLM to summarize scattered memory points and distill higher-level user preferences and traits
Architecture Design:
1 | ┌───────────────────────────┐ |
Advanced Practice: Summarize Your Diary into a Personal Report
Goal: Build an Agent that can process large volumes of personal text (e.g., daily diaries, blog posts) and, by reading and organizing these texts, ultimately produce a thorough, clear personal summary report.
Key Challenges:
- Long-text processing: How to handle diaries/articles whose total size may exceed the LLM’s context window
- Information extraction and structuring: How to extract structured information points from narrative text (e.g., key events, emotional changes, personal growth)
- Coherent summary generation: How to organize scattered information points into a logically coherent, highly readable summary report
Architecture Design:
1 | ┌─────────────────────┐ |
Week 3: RAG Systems and Knowledge Bases
Core Content
Document Structuring and Retrieval Strategies
- Chunking: Split long documents into meaningful semantic chunks
- Embedding: Vectorize text chunks for similarity search
- Hybrid retrieval: Combine vector similarity and keyword search to improve recall and precision
- Re-ranking: Use more sophisticated models to re-rank the initial retrieval results
Basic RAG
- Knowledge expression: Use clear, structured natural language to express knowledge
- Knowledge base construction: Process documents and load them into a vector database
- Precise retrieval: Accurately locate relevant entries in the knowledge base based on the user’s question
Case Study: Build a Legal Q&A Agent
Goal: Make the Agent a professional legal consultant. We’ll build a knowledge base using public Chinese Criminal/Civil Law datasets so the Agent can accurately answer users’ legal questions and explicitly cite the specific statutes on which the answers are based.
Key Challenges:
- Domain data processing: How to parse and clean structured legal provisions and optimize their retrieval performance in a RAG system
- Answer accuracy and traceability: The Agent’s answers must be strictly grounded in the knowledge base, avoid improvisation, and must provide statute sources
- Handling ambiguous queries: How to guide users to pose more precise questions to match the most relevant legal provisions
Architecture Design:
1 | ┌──────────────────┐ |
Advanced Content: Treat the File System as the Ultimate Context
Core Idea: Treat the file system as the ultimate context.
An Agent should not stuff massive observations (e.g., web pages, file contents) directly into the context; that causes high costs, degraded performance, and window limits. The right approach is to store these large data in files and keep only a lightweight “pointer” (a summary and the file path) in the context.
Implementation Strategies:
- Recoverable compression: When a tool returns a large amount of content (e.g.,
read_file
), first save it completely to the sandbox file system - Summary and pointer: Append only the content’s summary and file path to the main context
- On-demand I/O: Through the
read_file
tool, the Agent can read full content from the file system on demand in later steps
Architecture Design:
1 | 正确做法 ✅ |
Advanced Practice: Build an Agent That Can Read Multiple Papers
Goal: Train a research Agent that can read a target paper and all its references (often dozens of PDFs), and, on that basis, summarize the focal paper’s core contributions and innovations relative to its references.
Key Challenges:
- Handling many PDFs: How to efficiently parse dozens of PDF papers and extract key information (abstracts, conclusions, methodology)
- Cross-document relational analysis: The core challenge is to build links between the main paper and multiple references for comparative analysis, rather than merely summarizing a single paper
- Contribution extraction: How to precisely distill the paper’s “incremental contributions” from complex academic discourse
Architecture Design:
1 | ┌─────────────────────┐ |
Week 4: Tool Use and MCP
Core Content
Multiple Ways to Wrap Tools
- Function Calling: Expose local code functions directly to the Agent
- API integration: Call external HTTP APIs to fetch real-time data or perform remote operations
- Agent as a Tool: Wrap a specialized Agent (e.g., a code-generation Agent) as a tool callable by another Agent
MCP (Model Context Protocol)
- Standardized interface: Provide a unified, language-agnostic connection standard for models and external tools/data sources
- Plug-and-play: Developers can publish tools conforming to the MCP spec, and Agents can discover and use them dynamically
- Security and isolation: Built-in permissions and sandboxing to ensure safe tool invocation
Case Study: Connect to an MCP Server to Build a Deep Research Agent
Goal: Build an Agent capable of deep information research. It should connect to multiple external tool servers conforming to MCP and autonomously plan and invoke these tools to complete a complex research project.
Key Challenges:
- Authoritative source identification: The Agent must precisely identify and prioritize high-credibility sources such as official docs and academic papers amid vast information
- Multi-tool orchestration: How to plan a call chain that connects the inputs/outputs of multiple tools (e.g., search, then read, then analyze) into a complete workflow
- Open-ended exploration: How to handle questions with no single answer, performing exploratory searches from multiple angles and aggregating results
Architecture Design:
1 | ┌───────────────────┐ |
Advanced Content: Learning from Experience
Core Idea: A truly intelligent agent should not only use tools, but also learn and evolve from the experience of using them. It should remember the “playbook” for successfully solving certain tasks (i.e., prompt templates and tool-invocation sequences) and reuse it directly when encountering similar tasks in the future.
Implementation Strategies:
- Experience storage: After successfully completing a complex task, the Agent stores the entire process (including user intent, chain of thought, tool-invocation sequence, and final result) in the knowledge base as an “experience case”
- Experience retrieval: When facing a new task, the Agent first searches the experience base for similar cases
- Experience application: If similar cases are found, the Agent uses their successful strategies as high-level guidance rather than starting from scratch each time
Architecture Design:
1 | ┌───────────┐ ┌─────────────┐ |
Advanced Practice: Enhance the Deep Research Agent’s Expert Capabilities
Goal: Equip the Agent with expert-level capabilities for complex deep-research scenarios. For example, when researching “OpenAI’s co-founders,” it can automatically launch a parallel sub-research Agent for each founder; when searching for people, it can effectively handle name collisions.
Key Challenges:
- Loading domain experience: How to load different experiential knowledge based on task type (e.g., “academic research” vs. “people research”) to guide the Agent to use the most suitable authoritative sources and prompt strategies
- Dynamic sub-agents: How to let the main Agent dynamically create multiple parallel sub-agents based on preliminary search results to handle sub-tasks separately
- Disambiguation: How to design clarification and verification mechanisms for ambiguity-prone scenarios such as people search
Architecture Design:
1 | ┌──────────────────────────┐ |
Week 5: Programming and Code Execution
Core Challenges for Code Agents
Codebase understanding:
- How to find relevant code in a large codebase (semantic search)?
- How to accurately find all call sites/references of a function in code?
Reliable code modification:
- How to reliably apply AI-generated diffs to source files (
old_string
->new_string
)?
- How to reliably apply AI-generated diffs to source files (
Consistent execution environment:
- How to ensure the Agent runs commands in the same terminal session each time (inheriting
pwd
,env var
, etc.)? - How to preconfigure the necessary dependencies and tools for the Agent’s execution environment?
- How to ensure the Agent runs commands in the same terminal session each time (inheriting
Case Study: Build an Agent That Can Develop Agents
Goal: Build an “Agent Development Engineer” Agent. It can take a high-level natural-language requirement (e.g., “Develop an Agent that can browse the web; frontend uses React + Vite + Shadcn UI, backend uses FastAPI…”) and then autonomously complete the entire application development.
Key Challenges:
- Document-driven development: How to have the Agent first write a design document for the application to be built and strictly follow it for subsequent code implementation
- Test-driven development: How to ensure the Agent writes and runs tests for every piece of code it generates to guarantee the final application’s quality and correctness
- Development and testing environment: The Agent needs a solid dev and test environment to autonomously run test cases, find bugs, and then fix them
Architecture Design:
1 | ┌──────────────────────────────┐ |
Advanced Topic: Agent Self-Evolution
Core Idea: The ultimate form of an Agent’s capability is self-evolution. When faced with a problem that existing tools can’t solve, an advanced Agent shouldn’t give up; instead, it should use its coding ability to create a new tool for itself.
Implementation Strategy:
- Capability Boundary Recognition: The Agent must first determine whether the current problem exceeds the capabilities of its existing toolset
- Tool Creation Planning: The Agent plans the new tool’s functions, inputs, and outputs, and searches open-source repositories (e.g., GitHub) for usable implementations
- Code Encapsulation and Verification: The Agent wraps the discovered code into a new tool function, writes test cases for it, and verifies correctness in a sandbox
- Tool Library Persistence: After validation, add the new tool to its permanent tool library for future use
Architecture Design:
1 | ┌────────────┐ ┌────────────┐ |
Week 6: Evaluation and Selection of Large Models
Core Content
Evaluating the Capability Boundaries of Large Models
- Core Capability Dimensions: reasoning ability, knowledge breadth, hallucination, long-text handling, instruction following, tool invocation
- Build Discriminative Test Cases: Design Agent-centric evaluation sets rather than simple chatbot Q&A
- LLM as a Judge: Use a strong LLM (e.g., GPT-4.1) as the “judge” to automatically evaluate and compare the output quality of different models or Agents
Adding Safety Guardrails to Large Models
- Input Filtering: Prevent prompt injection
- Output Filtering: Monitor and block inappropriate or dangerous content
- Human Intervention: Introduce a human-in-the-loop confirmation step before high-risk actions
- Cost Control: Monitor token usage, set budget limits, and prevent abuse
Practical Case: Build an Evaluation Dataset and Use LLM as a Judge to Auto-Evaluate the Agent
Goal: For the in-depth research Agent we built in previous weeks, systematically construct an evaluation dataset. Then develop an automated testing framework that uses the LLM as a Judge approach to assess how different “brains” (e.g., Claude 4 vs Gemini 2.5) and different strategies (e.g., enabling/disabling chain-of-thought) affect the Agent’s performance.
Key Challenges:
- Evaluation Dataset Design: How to craft a set of research tasks that are representative yet cover edge cases?
- “Judge” Prompt Design: How to design the prompt for the “LLM Judge” so it can score the Agent’s outputs fairly, consistently, and accurately?
- Result Interpretability: How to analyze the auto-evaluation results to identify the strengths and weaknesses of different models or strategies
Architecture Design:
1 | ┌─────────────────┐ |
Advanced Topic: Parallel Sampling and Sequential Revision
Core Idea: Simulate humans’ “brainstorming” and “reflect-and-revise” processes to handle complex, open-ended problems and improve the quality and robustness of Agent outputs.
Parallel Sampling (Parallel Sampling)
- Idea: Launch multiple Agent instances simultaneously, using slightly different prompts or a higher temperature to explore solutions in parallel from multiple angles
- Advantages: Increase the chance of finding the optimal solution and avoid the limitations of a single Agent’s thinking
- Implementation: Similar to Multi-Agent, but aimed at solving the same problem; finally select the best answer via an evaluation mechanism (e.g., LLM as a Judge)
Sequential Revision (Sequential Revision)
- Idea: Have the Agent critique and revise its own initial output
- Process: Initial response → self-evaluation → issue identification → generate improvements → final output
- Advantages: Improve single-task success rate and depth of answers, achieving self-optimization
Advanced Practice: Add Parallel and Revision Capabilities to the In-Depth Research Agent
Goal: Integrate both Parallel Sampling and Sequential Revision into our in-depth research Agent, and use the evaluation framework we just built to quantify whether and to what extent these strategies improve the Agent’s performance.
Key Challenges:
- Strategy Fusion: How to organically combine Parallel Sampling (horizontal expansion) and Sequential Revision (vertical deepening) into a single Agent workflow?
- Cost Control: Both strategies significantly increase LLM call costs; how to design mechanisms that balance performance gains and cost?
- Performance Attribution: In evaluation, how to accurately attribute performance improvements to Parallel Sampling versus Sequential Revision?
Architecture Design:
1 | ┌────────────────┐ |
Week 7: Multimodal and Real-Time Interaction
Core Content
Real-Time Voice Call Agent
- Tech Stack: VAD (Voice Activity Detection), ASR (Automatic Speech Recognition), LLM, TTS (Text-to-Speech)
- Low-Latency Interaction: Optimize end-to-end latency from user voice input to Agent voice output
- Natural Interrupt Handling: Allow users to interject while the Agent is speaking, achieving more human-like dialogue flow
Operating Computers and Phones
- Visual Understanding: The Agent needs to understand screenshots and identify UI elements (buttons, input fields, links)
- Action Mapping: Map natural-language commands like “click the login button” precisely to screen coordinates or UI element IDs
- Integration with Existing Frameworks: Invoke mature frameworks like
browser-use
to quickly give the Agent the ability to operate a computer
Practical Case 1: Build a Real-Time Voice Call Agent That Can Listen and Speak
Goal: From scratch, build an Agent that can engage in real-time, fluent voice conversations with users. It needs to respond quickly, understand and execute voice commands, and even proactively initiate guided dialogue.
Key Challenge:
- Latency Control: The end-to-end latency from user voice input to Agent voice output is critical to the experience. How to optimize each part of the stack?
Architecture Design:
1 | 语音输入流 大脑 语音输出流 |
Practical Case 2: Integrate browser-use to Let the Agent Operate Your Computer
Goal: Call the existing browser-use
framework to give our Agent the ability to operate the computer browser. The Agent should understand user operation commands (e.g., “help me open anthropic.com and find the computer use documentation”) and translate them into actual browser actions.
Key Challenges:
- Framework Integration: How to smoothly integrate
browser-use
as a tool into our existing Agent architecture - Instruction Generalization: User commands may be ambiguous; how to help the Agent understand them and translate them into precise operations supported by
browser-use
- State Synchronization: How to make the Agent aware of browser operation results (e.g., navigation, element loading) to inform the next decision
Architecture Design:
1 | ┌───────────────────┐ |
Advanced Topic: Fast–Slow Thinking and Intelligent Interaction Management
Fast–Slow Thinking (Mixture-of-Thoughts) Architecture
- Fast Response Path: Use low-latency models (e.g., Gemini 2.5 Flash) for instant feedback, handling simple queries and maintaining conversational flow
- Deep Thinking Path: Use stronger SOTA models (e.g., Claude 4 Sonnet) for complex reasoning and tool invocation to provide more precise, in-depth answers
Intelligent Interaction Management
- Smart Interrupts (Interrupt Intent Detection): Use VAD and small models to filter background noise and meaningless backchannels, and only stop speaking when the user has a clear intent to interrupt
- Turn-Taking Judgment (Turn Detection): Analyze the semantic completeness of what the user has said to decide whether the AI should continue speaking, avoiding talking over the user
- Silence Management (Silence Management): When the user is silent for an extended time, proactively start a new topic or ask follow-ups to maintain continuity
Advanced Practice: Build an Advanced Real-Time Voice Agent
Goal: Build an advanced voice Agent that integrates the “fast–slow thinking” architecture and “intelligent interaction management,” achieving industry-leading response speed and naturalness of interaction.
Key Challenges and Acceptance Criteria:
- Basic Reasoning: Ask: “What is 8 to the 6th power?” — must give an initial response within 2 seconds and the correct answer “262144” within 15 seconds.
- Tool Invocation: Ask: “How is the weather in Beijing today?” — must respond within 2 seconds and return accurate weather via API within 15 seconds.
- Intelligent Interaction Management:
- Smart Interrupts: During the Agent’s speech:
- If the user says “uh-huh,” the Agent should not stop speaking.
- If the user taps the table, the Agent should not stop speaking.
- If the user says “Then its battery life…,” the Agent should immediately stop the current utterance.
- Turn-Taking Judgment: After the user says “Then its battery life…” and deliberately pauses, the Agent should not respond.
- Silence Management: If the user pauses for more than 3 seconds after saying “Then its battery life…,” the Agent should proactively guide the conversation or ask a follow-up to keep the exchange smooth.
- Smart Interrupts: During the Agent’s speech:
Architecture Design:
1 | ┌──────────┐ ┌──────────┐ |
Week 8: Multi-Agent Collaboration
Core Content
Limitations of a Single Agent
- High Context Cost: A single context window balloons quickly in complex tasks
- Inefficiency of Sequential Execution: Cannot process multiple sub-tasks in parallel
- Quality Degradation with Long Contexts: Models tend to “forget” or get “distracted” in overly long contexts
- No Parallel Exploration: Can only explore along a single path
Advantages of Multi-Agent
- Parallel processing: Break down the task and have different SubAgents process in parallel to improve efficiency
- Independent context: Each SubAgent has an independent, more focused context window to ensure execution quality
- Compression is essence: Each SubAgent only needs to return its most important findings, aggregated by the main Agent to achieve efficient information compression
- Emergent collective intelligence: Suited for open-ended research and other tasks requiring multi-angle analysis
Practical case: Design a Multi-Agent collaborative system to achieve “talk on the phone while using a computer”
Goal: Solve the challenge of “multitasking.” Build a team composed of a “Phone Agent” and a “Computer Agent.” The “Phone Agent” handles voice communication with the user to gather information; the “Computer Agent” operates the web in sync. The two communicate in real time and collaborate efficiently.
Key challenges:
- Dual-Agent architecture: Two independent Agents, one responsible for voice calls (Phone Agent), and one for operating the browser (Computer Agent)
- Collaborative communication between Agents: The two Agents must communicate bidirectionally and efficiently. Information obtained by the Phone Agent should be immediately conveyed to the Computer Agent, and vice versa. This can be implemented via tool calls
- Parallel work and real-time performance: The key is that the two Agents must work in parallel without blocking each other. Each one’s context needs to include real-time messages from the other Agent
Architecture design:
1 | ┌──────────┐ 语音 ┌──────────────┐ A2A通信 ┌──────────────┐ GUI操作 ┌──────────────┐ |
Advanced topic: Orchestration Agent - Treat Sub-agents as tools
Core idea: Instead of hard-coded inter-Agent collaboration, introduce a higher-level “Orchestration Agent”. Its core responsibility is to understand the user’s top-level goal and dynamically select, launch, and coordinate a set of “expert Sub-agents” (as tools) to accomplish the task together.
Implementation strategy:
- Sub-agent as Tools: Each expert Sub-agent (e.g., Phone Agent, Computer Agent, Research Agent) is encapsulated as a “tool” conforming to a standard interface
- Dynamic tool invocation: The Orchestration Agent, based on user needs, asynchronously invokes one or more Sub-agent tools
- Direct inter-Agent communication: Allow invoked Sub-agents to establish direct communication channels for efficient task collaboration, without everything relayed through the Orchestration Agent
Architecture design:
1 | ┌──────────────────┐ |
Advanced practice: Use an Orchestration Agent to dynamically coordinate phone and computer operations
Goal: Refactor our “talk on the phone while using a computer” system. Instead of hard-coding the launch of two Agents, create an Orchestration Agent. When the user asks “help me call to book a flight”, the Orchestration Agent can automatically understand that the task requires both “making a phone call” and “operating a computer”, then launch these two Sub-agents in parallel and have them collaborate.
Key challenges:
- Task planning and tool selection: How can the Orchestration Agent accurately decompose a vague user goal into which specific Sub-agent tools are needed?
- Asynchronous tool management: How to manage the lifecycle (start, monitor, terminate) of multiple parallel, long-running Sub-agent tools?
- Communication between Sub-agents: How to establish an efficient, temporary direct communication mechanism for dynamically launched Sub-agents?
Architecture design:
1 | ┌────────────────────────┐ |
Week 9: Project Showcase
Core content
Project integration and showcase
- Integration capability: Combine skills from the first 8 weeks (RAG, tool calling, speech, multimodal, Multi-Agent) into a final project
- Results presentation: Each participant will have the opportunity to showcase their unique general-purpose Agent and share the thinking and challenges during creation
- Peer review: Through mutual demos and Q&A, gain inspiration and ideas from classmates’ projects
Book polishing and summary
- Knowledge consolidation: Together review and summarize the core knowledge points of the 9 weeks and solidify them into the final manuscript of 《In-Depth yet Simple AI Agent》
- Co-creating content: Propose edits to the manuscript and polish it together to ensure it is “systematic and practical”
- Credited publication: All participants who co-create will have their names appear in the final printed book
Practical case: Showcase your unique general-purpose Agent
Goal: Provide a comprehensive summary and demo of the personal Agent project built during the camp. This is not only a results report, but also an exercise in systematizing what you learned and clearly explaining complex technical solutions to others.
Presentation highlights:
- Agent positioning: What core problem does your Agent solve?
- Technical architecture: How did you combine what you learned (context, RAG, tools, multimodal, Multi-Agent) to achieve the goal?
- Innovation highlights: What is the most creative design in your Agent?
- Demo: Live demo of the Agent’s core capabilities
- Future outlook: How do you plan to continue iterating and improving your Agent?
Final project architecture example:
1 | ┌──────────────┐ |
Advanced topic: Four ways an Agent learns from experience
1. Rely on long-context capability
- Idea: Trust and leverage the model’s own long-context processing ability, providing the complete, uncompressed conversation history as input
- Implementation:
- Keep recent conversation: Fully retain the recent interaction history (Context Window)
- Compress long-term memory: Use
Linear Attention
to automatically compress distant conversation history into the latent space - Extract key snippets: Use
Sparse Attention
to automatically extract segments most relevant to the current task from distant conversation history
- Pros: Easiest to implement, preserves original information detail to the greatest extent
- Cons: Strongly dependent on model capability
2. Extract in text form (RAG)
- Idea: Summarize experience into natural language and store it in a knowledge base
- Implementation: Retrieve relevant experience text via RAG and inject it into the prompt
- Pros: Cost-controllable, knowledge is readable and maintainable
- Cons: Depends on retrieval accuracy
3. Post-training (SFT/RL)
- Idea: Learn the experience into the model weights
- Implementation: Use high-quality Agent behavior trajectories as data to fine-tune (SFT) or reinforcement-train (RL) the model
- Pros: Internalizes experience as the model’s “intuition”, suitable for complex tasks with strong generalization
- Cons: Higher cost, requires lots of high-quality data; longer cycles, hard to realize a real-time feedback loop—i.e., examples that just failed online won’t immediately prevent similar mistakes
4. Abstract into code (tools/Sub-agent)
- Idea: Abstract recurring successful patterns into a reusable tool or Sub-agent
- Implementation: The Agent identifies automatable patterns and writes code to solidify them
- Pros: Reliable and efficient learning method
- Cons: Requires strong coding ability from the Agent; as tool count grows, tool selection becomes a challenge
Advanced practice: Compare the four ways an Agent learns from experience
Goal: Using the evaluation framework we built in Week 6, design experiments to compare the pros and cons of the four learning-from-experience approaches for Agents.
Key challenges:
- Experimental design: How to design a set of tasks that clearly highlights the differences among the four learning methods?
- Cost-performance tradeoff: In the evaluation report, how to combine each method’s “performance score” with its “computational cost” for a holistic assessment?
- Scenario-based analysis: Draw conclusions about which learning method to prioritize under which task scenarios?
Architecture design:
1 | ┌────────────┐ |
Summary recap
Through 9 weeks of systematic study and practice, we completed the full journey from getting started with Agents to building general-purpose intelligent agents:
Core competencies mastered
- Agent architecture understanding: Deeply understood the core design paradigm of
LLM + context + tools
- Context engineering mastery: Mastered multi-level context management techniques
- Tooling system construction: Implemented reliable integrations with external APIs and MCP Server
- Multimodal interaction: Built voice and vision multimodal Agents
- Collaboration pattern design: Implemented complex collaboration patterns such as Multi-Agent and Orchestration
Practical project portfolio
- Web-connected search Agent
- Legal Q&A Agent
- In-depth research Agent
- Agent development engineer Agent
- Real-time voice call Agent
- Multi-Agent collaborative system
Advanced technical exploration
- Context compression and optimization
- Four ways to learn from experience
- Parallel sampling and sequential revision
- Fast and slow thinking architecture
- An Agent’s self-evolution
🚀 Start building your own AI Agent right here!