AI Agent Bootcamp: Build Your General-Purpose Agent in 9 Weeks
This article is compiled from the first live session of the Turing Community AI Agent Bootcamp.
Build an AI Agent of your own—start here. This article not only systematically introduces the foundational technical path to building a general-purpose AI Agent from scratch (such as context engineering, RAG systems, tool use, multimodal interaction, etc.), but also covers advanced techniques like fast/slow thinking and multi-Agent collaboration. Through 9 weeks of hands-on projects, you will progressively master the full lifecycle of Agent development and the core advanced skills.
This course had its first live preview on August 18 and will officially start on September 11. Each week includes about 2 hours of class time covering all the foundational and advanced topics below. Of course, just 2 hours of lectures per week is not enough—you’ll also need to spend time coding and practicing.
Core Goals of the Bootcamp
Build an AI Agent of your own—start here
🎯 Master core architecture and engineering capabilities
- Deeply understand Agent architecture: Systematically grasp the core design paradigm of LLM + context + tools.
- Master context engineering: Learn multi-layered context management from conversation history and long-term user memory to external knowledge bases (RAG) and file systems.
- Master dynamic tool calling: Reliably integrate Agents with external APIs and MCP Server, and enable self-improvement via code generation.
- Build advanced Agent patterns: Design and implement fast/slow thinking (Mixture-of-Thoughts), orchestration, and other complex Agent collaboration patterns.
💡 Build a systematic understanding of development and deployment
- Understand the path of technical evolution: See the progression from basic RAG to Agents that can autonomously develop tools.
- Master the full Agent lifecycle: Be able to independently complete the closed loop of design, development, evaluation with LLM as a Judge, and deployment.
- Build domain knowledge: Accumulate cross-domain Agent development experience through hands-on projects in law, academia, programming, etc.
- Consolidate your knowledge system: Co-create the book “AI Agent, Explained,” turning fragmented knowledge into a systematic output.
9-Week Hands-On Plan Overview
Week | Topic | Content Overview | Hands-On Case |
---|---|---|---|
1 | Agent Basics | Agent structure and taxonomy, workflow-based vs autonomous | Build a web-connected search Agent |
2 | Context Design | Prompt templates, conversation history, long-term user memory | Add persona and long-term memory to your Agent |
3 | RAG and Knowledge Base | Document structuring, retrieval strategies, incremental updates | Build a legal Q&A Agent |
4 | Tool Use and MCP | Tool wrapping and MCP integration, external API calls | Connect to an MCP Server to build a deep-research Agent |
5 | Programming and Code Execution | Codebase understanding, reliable code modification, consistent execution environment | Build an Agent that can develop Agents by itself |
6 | Model Evaluation and Selection | Model capability evaluation, LLM as a Judge, safety guardrail design | Build an evaluation dataset and auto-evaluate Agents with LLM as a Judge |
7 | Multimodal and Real-time Interaction | Real-time voice Agent, operating computers and phones | Implement a voice call Agent & integrate browser-use to operate a computer |
8 | Multi-Agent Collaboration | A2A communication protocol, Agent team roles and collaboration | Design a multi-Agent collaboration system to “make calls while operating the computer” |
9 | Project Integration and Demo | Final integration and demo of the Agent project, polishing the final deliverable | Showcase your unique general-purpose Agent |
9-Week Advanced Topics
Week | Topic | Advanced Content Overview | Advanced Hands-On Case |
---|---|---|---|
1 | Agent Basics | The importance of context | Explore how missing context affects Agent behavior |
2 | Context Design | Organizing user memory | Build a personal knowledge management Agent to summarize long texts |
3 | RAG and Knowledge Base | Long-context compression | Build a research paper analysis Agent to summarize core contributions |
4 | Tool Use and MCP | Learning from experience | Enhance the deep-research Agent’s expert capability (sub-agents and domain experience) |
5 | Programming and Code Execution | Agent self-evolution | Build an Agent that autonomously leverages open-source software to solve unknown problems |
6 | Model Evaluation and Selection | Parallel sampling and sequential revision | Add parallelism and revision capabilities to the deep-research Agent |
7 | Multimodal and Real-time Interaction | Combining fast and slow thinking | Implement a real-time voice Agent that combines fast and slow thinking |
8 | Multi-Agent Collaboration | Orchestration Agent | Use an Orchestration Agent to dynamically coordinate phone calls and computer operations |
9 | Project Integration and Demo | Comparing Agent learning methods | Compare four ways an Agent learns from experience |
AI Agent Bootcamp Overview
Week 1: Agent Basics
Core Content
Agent structure and taxonomy
Workflow-based
- Predefined processes and decision points
- Highly deterministic; suitable for automation of simple business processes
Autonomous
- Dynamic planning and self-correction
- Highly adaptive; suitable for open-ended research, exploration, and complex problem solving
Basic frameworks and scenario judgment
ReAct framework: Observe → Think → Act
Agent = LLM + context + tools
- LLM: decision core (the brain)
- Context: perceive the environment (eyes and ears)
- Tools: interact with the world (hands)
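The LLM + context + tools decomposition maps directly onto the ReAct loop above. A minimal sketch of that loop in Python, where `llm`, `web_search`, and the message shapes are illustrative stubs rather than a real model or search API:

```python
# Minimal ReAct-style agent loop: the LLM decides, the context records,
# the tools act. `llm` and `web_search` are illustrative stubs.
def web_search(query: str) -> str:
    return f"[stub results for: {query}]"

TOOLS = {"web_search": web_search}

def llm(context: list) -> dict:
    # A real model would read the whole context and emit a thought plus
    # either a tool call or a final answer; this stub answers after one search.
    if any(step["role"] == "tool_result" for step in context):
        return {"type": "answer", "content": "summary based on search results"}
    return {"type": "tool_call", "tool": "web_search",
            "args": {"query": context[-1]["content"]}}

def run_agent(question: str, max_steps: int = 5) -> str:
    context = [{"role": "user", "content": question}]          # observe
    for _ in range(max_steps):
        decision = llm(context)                                 # think
        if decision["type"] == "answer":
            return decision["content"]
        result = TOOLS[decision["tool"]](**decision["args"])    # act
        context.append({"role": "tool_call", "content": decision})
        context.append({"role": "tool_result", "content": result})
    return "step budget exhausted"
```

The `max_steps` cap is the usual guard against a model that never converges on a final answer.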
Hands-On Case: Build a web-connected search Agent
Goal: Build a basic autonomous Agent that can understand user questions, retrieve information via search engines, and summarize the answer.
Core challenges:
- Task decomposition: Break complex questions into searchable keywords
- Tool definition: Define and implement a web_search tool
- Result integration: Understand search results and synthesize the final answer
Architecture design:
Advanced: The importance of context
Core idea: The context is the agent's operating system.
Context is the only basis for an Agent to perceive the world, make decisions, and record history.
Thinking
- The Agent’s inner monologue and chain-of-thought
- Missing consequence: Turns Agent behavior into a black box, making it hard to debug or understand decisions
Tool Call
- The actions the Agent decides to take, recording its intent
- Missing consequence: You can’t trace the Agent’s action history, making retrospectives difficult
Tool Result
- Environmental feedback from actions
- Missing consequence: The Agent can’t sense the outcome of its actions, which may lead to infinite retries or poor planning
Advanced practice: Exploring how missing context affects Agent behavior
Goal: Through experiments, understand the indispensable roles of thinking, tool call, and tool result in the Agent workflow.
Core challenges:
- Modify the Agent framework: Change the Agent’s core loop to selectively remove specific parts from the context
- Design controlled experiments: Create a task set where missing different parts of context leads to distinct behavioral differences or failures
- Behavior analysis: Analyze and summarize what types of failures each missing context part causes
Experiment design:
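One way to set up such a controlled experiment is to rebuild the context each turn while selectively dropping one kind of entry. A minimal sketch, with illustrative entry kinds matching the three parts discussed above:

```python
# Context-ablation harness sketch: the agent's context is rebuilt each
# turn, optionally dropping one kind of entry so the resulting failure
# mode can be observed in isolation.
def ablate(context, drop):
    """Return the context with the given entry kinds removed."""
    return [step for step in context if step["kind"] not in drop]

full = [
    {"kind": "user",        "content": "What is the weather in Beijing?"},
    {"kind": "thinking",    "content": "I should call the weather tool."},
    {"kind": "tool_call",   "content": "get_weather(city='Beijing')"},
    {"kind": "tool_result", "content": "Sunny, 25 degrees"},
]

# Dropping tool results blinds the agent to outcomes (the infinite-retry
# failure); dropping thinking turns its decisions into a black box.
no_results = ablate(full, {"tool_result"})
no_thought = ablate(full, {"thinking"})
```

Running the same task set through each ablated variant and comparing behaviors is then a matter of logging which variant fails and how.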
Week 2: Context Design (Context Engineering)
Core Content
Prompt templates
- System prompt: Set the Agent’s role, capability boundaries, and behavior guidelines
- Toolset: Tool names, descriptions, parameters
Conversation history and user memory
- Event sequence: Model the conversation history as an alternating sequence of “observations” and “actions”
- Long-term user memory: Extract key user information (e.g., preferences, personal details) and store it in structured form for future interactions
Hands-On Case: Add persona and long-term memory to your Agent
Goal: Enhance personalization and continuity. The Agent should mimic the speaking style of a specific persona (e.g., an anime character) and remember key user information (e.g., name, interests) to use in subsequent conversations.
Core challenges:
- Role-playing: How to clearly define the persona’s language style and personality in the prompt, and keep the persona stable
- Memory extraction and storage: How to accurately extract key information from unstructured dialog and store it as a structured JSON object
- Memory application: How to naturally incorporate the stored user-memory JSON into subsequent prompts so the Agent truly “remembers” the user
Architecture Design:
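The persona and memory challenges above come together in how the system prompt is assembled each turn. A sketch, where the persona fields and memory keys are illustrative, not a prescribed schema:

```python
import json

# Sketch: assemble a system prompt from a persona definition and a
# structured user-memory object. Field names are illustrative.
PERSONA = {
    "name": "Sakura",
    "style": "cheerful, playful, ends sentences with a light exclamation",
}

def build_system_prompt(persona: dict, memory: dict) -> str:
    return (
        f"You are {persona['name']}. Speaking style: {persona['style']}.\n"
        "Stay in character at all times.\n"
        "Known facts about the user (use them naturally, do not recite):\n"
        f"{json.dumps(memory, ensure_ascii=False, indent=2)}"
    )

memory = {"name": "Wei", "interests": ["badminton", "sci-fi novels"]}
prompt = build_system_prompt(PERSONA, memory)
```

The instruction "use them naturally, do not recite" addresses the memory-application challenge: the facts are available, but the model is told not to dump them verbatim.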
Advanced Content: Organizing User Memory
Core Idea: Simple concatenation of memories can cause context bloat, information conflicts, and staleness. An advanced memory system needs to continuously organize, deduplicate, correct, and summarize the user’s long-term memories in the background, forming a dynamically evolving user profile.
Implementation Strategies:
- Memory deduplication and merging: Identify and merge memory entries that are similar or duplicate in content
- Conflict resolution: When new memories conflict with old ones (e.g., the user has changed preferences), treat the newest information as authoritative
- Periodic summarization: Regularly or during background idle time, use an LLM to summarize scattered memory points and distill higher-level user preferences and traits
Architecture Design:
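The deduplication and conflict-resolution strategies above can be reduced to a small consolidation pass; here timestamps are the assumed mechanism for deciding which entry is newest:

```python
# Background memory-organization sketch: deduplicate by key and resolve
# conflicts by trusting the newest entry (timestamped entries assumed).
def consolidate(entries):
    """entries: [{'key': ..., 'value': ..., 'ts': unix_time}, ...]"""
    profile = {}
    for e in sorted(entries, key=lambda e: e["ts"]):
        profile[e["key"]] = e["value"]   # newer entries overwrite older ones
    return profile

raw = [
    {"key": "favorite_drink", "value": "coffee", "ts": 100},
    {"key": "favorite_drink", "value": "coffee", "ts": 150},   # duplicate
    {"key": "favorite_drink", "value": "tea",    "ts": 200},   # preference changed
    {"key": "city",           "value": "Beijing", "ts": 120},
]
```

The periodic-summarization step would sit on top of this: an LLM call that reads the consolidated profile and distills higher-level traits, which this sketch deliberately leaves out.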
Advanced Practice: Summarize Your Diary into a Personal Report
Goal: Build an agent that can handle large amounts of personal text (such as daily diaries and blog posts) and, by reading and organizing these texts, ultimately generate a detailed, clear personal summary report.
Key Challenges:
- Long-text processing: How to handle diaries/articles whose total size may exceed the LLM context window
- Information extraction and structuring: How to extract structured information points (e.g., key events, emotional changes, personal growth) from narrative text
- Coherent summary generation: How to organize scattered information points into a logically coherent, highly readable summary report
Architecture Design:
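For text that exceeds the context window, the standard pattern is map-reduce summarization: chunk, summarize each chunk, then summarize the summaries. A sketch in which `summarize` is a stub standing in for an LLM call:

```python
# Map-reduce sketch for text larger than the context window.
def chunk(text: str, max_chars: int = 1000):
    paras, chunks, cur = text.split("\n\n"), [], ""
    for p in paras:
        if cur and len(cur) + len(p) > max_chars:
            chunks.append(cur)
            cur = ""
        cur = (cur + "\n\n" + p).strip()
    if cur:
        chunks.append(cur)
    return chunks

def summarize(text: str) -> str:     # stub standing in for an LLM call
    return text[:60]

def map_reduce_summary(text: str) -> str:
    partials = [summarize(c) for c in chunk(text)]   # map: per-chunk summaries
    return summarize("\n".join(partials))            # reduce: summary of summaries

diary = "\n\n".join(f"Day {i}: wrote code and read a paper." for i in range(200))
report = map_reduce_summary(diary)
```

Chunking on paragraph boundaries (rather than fixed character offsets) is what keeps each chunk a meaningful semantic unit for the extraction step.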
Week 3: RAG Systems and Knowledge Bases
Core Content
Document Structuring and Retrieval Strategies
- Chunking: Split long documents into meaningful semantic chunks
- Embedding: Vectorize text chunks for similarity search
- Hybrid retrieval: Combine vector similarity and keyword search to improve recall and precision
- Re-ranking: Use more sophisticated models to re-rank initial retrieval results
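The hybrid-retrieval idea above is a weighted blend of two scores. A toy sketch, using hand-made two-dimensional vectors and simple word overlap in place of real embeddings and BM25:

```python
import math
from collections import Counter

# Hybrid retrieval sketch: blend a toy vector similarity with keyword
# overlap. A production system would use real embeddings, BM25, and a
# re-ranking model, but the blending idea is the same.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def keyword_score(query, doc):
    counts = Counter(doc.lower().split())
    return sum(counts[w] for w in set(query.lower().split())) / len(counts)

def hybrid_rank(query, query_vec, docs, alpha=0.5):
    """docs: list of (text, vector); returns texts best-first."""
    scored = [(alpha * cosine(query_vec, vec)
               + (1 - alpha) * keyword_score(query, text), text)
              for text, vec in docs]
    return [text for _, text in sorted(scored, reverse=True)]

docs = [("theft is punished under article 264", [1.0, 0.0]),
        ("contract law basics overview",        [0.1, 0.9])]
ranked = hybrid_rank("theft punishment", [1.0, 0.0], docs)
```

The `alpha` weight is the knob between recall-oriented vector search and precision-oriented keyword match; a re-ranker would then reorder the top results.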
Basic RAG
- Knowledge expression: Express knowledge in clear, structured natural language
- Knowledge base construction: Process documents and load them into a vector database
- Precise retrieval: Precisely locate relevant entries in the knowledge base based on the user’s question
Practical Case: Build a Legal Q&A Agent
Goal: Enable the agent to act as a professional legal advisor. We will use public datasets of Chinese Criminal/Civil Law to build a knowledge base, allowing the agent to answer legal questions accurately and explicitly point to the specific statutes the answers are based on.
Key Challenges:
- Domain data processing: How to parse and clean structured statutory data and optimize its retrieval performance within a RAG system
- Answer accuracy and traceability: The agent’s answers must strictly be based on the knowledge base, avoid free-form speculation, and must provide statute sources
- Handling vague queries: How to guide users to ask more precise questions to match the most relevant statutes
Architecture Design:
Advanced Content: Treat the File System as the Ultimate Context
Core Idea: Treat the file system as the ultimate context.
An agent should not stuff huge observations (e.g., web pages, file contents) directly into context; this is costly, degrades performance, and is limited by window size. The right approach is to store this big data in files and keep only a lightweight “pointer” (a summary and the file path) in context.
Implementation Strategies:
- Recoverable compression: When a tool returns a large amount of content (e.g., read_file), first save it in full to the sandbox file system
- Summary and pointer: Append only a summary of the content and the file path to the main context
- On-demand I/O: Via the read_file tool, the agent can read the full content from the file system on demand in later steps
Architecture Design:
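The three strategies above fit in a dozen lines. A sketch, in which the sandbox directory and file-naming scheme are illustrative choices:

```python
import hashlib
import tempfile
from pathlib import Path

# "Recoverable compression" sketch: write a large observation to the
# sandbox file system and keep only a summary plus the path in context.
SANDBOX = Path(tempfile.mkdtemp())

def compress_observation(content: str, summary_chars: int = 120) -> dict:
    name = hashlib.sha1(content.encode()).hexdigest()[:12] + ".txt"
    path = SANDBOX / name
    path.write_text(content, encoding="utf-8")
    return {                        # lightweight pointer kept in context
        "summary": content[:summary_chars],
        "path": str(path),
        "bytes": len(content.encode()),
    }

def read_file(path: str) -> str:    # on-demand recovery in a later step
    return Path(path).read_text(encoding="utf-8")

big = "Article 264: Whoever steals ... " * 2000   # oversized tool output
pointer = compress_observation(big)
```

Because the full content is always recoverable through `read_file`, the compression is lossless from the agent's point of view: only the context cost shrinks.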
Advanced Practice: Build an Agent That Can Read Multiple Papers
Goal: Train an academic research agent that can read a specified paper and all of its references (often dozens of PDFs) and, based on that, summarize the paper’s core contributions and innovations relative to its references.
Key Challenges:
- Large-scale PDF processing: How to efficiently parse dozens of PDF papers and extract key information (abstract, conclusions, methodology)
- Cross-document relational analysis: The core challenge is enabling the agent to establish links between the main paper and multiple references for comparative analysis, rather than simply summarizing a single paper
- Contribution distillation: How to precisely extract the paper’s “incremental contributions” from complex academic discourse
Architecture Design:
Week 4: Tool Use and MCP
Core Content
Multiple Ways to Wrap Tools
- Function Calling: Expose local code functions directly to the agent
- API access: Call external HTTP APIs to obtain real-time data or perform remote operations
- Agent as a Tool: Wrap a specialized agent (e.g., a code-generation agent) as a tool callable by another agent
MCP (Model Context Protocol)
- Standardized interfaces: Provide a unified, language-agnostic connection standard between models and external tools/data sources
- Plug-and-play: Developers can publish MCP-compliant tools, and agents can dynamically discover and use them
- Security and isolation: Built-in permissions and sandboxing to ensure safe tool invocation
Practical Case: Connect to MCP Servers to Build a Deep Research Agent
Goal: Build an agent capable of in-depth information research. It needs to connect to multiple MCP-compliant external tool servers and autonomously plan and invoke these tools to complete a complex research task.
Key Challenges:
- Authoritative source identification: The agent must accurately identify and adopt high-credibility sources such as official documents and academic papers amid massive information
- Multi-tool coordination: How to plan a call chain that links outputs/inputs of multiple tools (e.g., search, then read, then analyze) into a complete workflow
- Open-ended exploration: How to handle open-ended questions with no single answer, conduct exploratory searches from multiple angles, and synthesize results
Architecture Design:
Advanced Content: Learn from Experience
Core Idea: A truly intelligent agent not only uses tools, but also learns and evolves from the experience of using them. It should remember the “playbook” for successfully solving certain tasks (i.e., prompt templates and tool-call sequences) and directly reuse it when similar tasks arise in the future.
Implementation Strategies:
- Experience storage: After a complex task is successfully completed, the agent stores the entire process (including user intent, chain of thought, tool-call sequence, and final result) as an “experience case” in the knowledge base
- Experience retrieval: When facing a new task, the agent first searches for similar cases in the experience base
- Experience application: If a similar case is found, the agent uses its successful strategy as high-level guidance instead of starting from scratch each time
Architecture Design:
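The store/retrieve/apply cycle above can be prototyped with a tiny experience base. Here retrieval uses token overlap purely for illustration; a real system would match tasks by embedding similarity:

```python
# Experience-reuse sketch: store a successful "playbook" after a task
# completes, and retrieve it for similar future tasks.
class ExperienceBase:
    def __init__(self):
        self.cases = []   # [{'task': ..., 'playbook': [...]}, ...]

    def store(self, task, playbook):
        self.cases.append({"task": task, "playbook": playbook})

    def retrieve(self, task, min_overlap=2):
        """Token-overlap matching; embeddings would replace this."""
        words = set(task.lower().split())
        best, best_score = None, 0
        for case in self.cases:
            score = len(words & set(case["task"].lower().split()))
            if score > best_score:
                best, best_score = case, score
        return best["playbook"] if best and best_score >= min_overlap else None

eb = ExperienceBase()
eb.store("research the founders of OpenAI",
         ["search official site", "spawn one sub-agent per founder",
          "merge findings"])
```

On a hit, the playbook is injected into the prompt as high-level guidance; on a miss, the agent plans from scratch and stores the new trace afterwards.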
Advanced Practice: Enhance the Deep Research Agent’s Expert Capabilities
Goal: Equip the agent with expert-level capabilities for complex deep-research scenarios. For example, when researching “OpenAI’s co-founders,” it can automatically spawn a parallel sub-research agent for each founder; when searching for person information, it can effectively handle name collisions.
Key Challenges:
- Loading domain experience: How to load different experiential knowledge based on task type (e.g., “academic research” vs. “people research”) to guide the agent to use the most appropriate authoritative sources and prompt strategies
- Dynamic sub-agents: How to let the main agent dynamically create multiple parallel sub-agents based on preliminary search results to handle sub-tasks separately
- Disambiguation: How to design clarification and verification mechanisms when handling ambiguous scenarios such as people searches
Architecture Design:
Week 5: Programming and Code Execution
Core Challenges for Code Agents
Codebase comprehension:
- How to find relevant code in a large codebase (semantic search)?
- How to accurately query all references to a function in the code?
Reliable code modification:
- How to reliably apply AI-generated diffs to source files (old_string -> new_string)?
Consistent execution environment:
- How to ensure the agent executes commands in the same terminal session each time (inheriting pwd, env vars, etc.)?
- How to preconfigure the agent's execution environment with the required dependencies and tools?
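One way to get a persistent session without keeping a real terminal alive is to track `cd` and `export` in-process and pass the accumulated state to every command. A sketch, assuming a POSIX shell is available for non-builtin commands:

```python
import os
import subprocess

# Persistent execution-session sketch: `cd` and `export` are handled
# in-process, so every later command inherits the same cwd and env vars.
class ShellSession:
    def __init__(self):
        self.cwd = os.getcwd()
        self.env = dict(os.environ)

    def run(self, cmd: str) -> str:
        cmd = cmd.strip()
        if cmd.startswith("cd "):
            self.cwd = os.path.abspath(os.path.join(self.cwd, cmd[3:].strip()))
            return ""
        if cmd.startswith("export ") and "=" in cmd:
            key, value = cmd[len("export "):].split("=", 1)
            self.env[key.strip()] = value.strip()
            return ""
        out = subprocess.run(cmd, shell=True, cwd=self.cwd, env=self.env,
                             capture_output=True, text=True)
        return out.stdout.strip()
```

An alternative design keeps one long-lived shell subprocess and pipes commands into it; that preserves shell features like aliases but makes output framing and error handling harder.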
Practical Case: Build an Agent That Can Develop Agents
Goal: Create an “Agent Development Engineer” agent. It can take a high-level natural-language requirement (e.g., “Develop an agent that can browse the web; frontend with React + Vite + Shadcn UI; backend with FastAPI…”) and then autonomously complete the entire application development.
Key Challenges:
- Documentation-driven development: How to have the agent first write a design document for the application to be built and strictly follow it for subsequent code implementation
- Test-driven development: How to ensure the agent writes and runs test cases for each piece of code it generates to guarantee the delivered application’s quality and correctness
- Development and test environment: The agent needs a solid development and testing environment to autonomously execute tests, discover bugs, and then fix them
Architecture Design:
Advanced Content: Agent Self-Evolution
Core Concept: The ultimate form of an Agent’s capability is self-evolution. When faced with a problem that existing tools cannot solve, an advanced Agent should not give up; it should leverage its coding ability to create a new tool for itself.
Implementation Strategy:
- Capability Boundary Detection: The Agent must first determine whether the current problem exceeds the capabilities of its existing toolset
- Tool Creation Planning: The Agent plans the new tool’s functions, inputs, and outputs, and searches open-source repositories (e.g., GitHub) for usable implementations
- Code Wrapping and Verification: The Agent wraps the found code into a new tool function, writes test cases for it, and validates its correctness in a sandbox
- Tool Library Persistence: After validation, add the new tool to its permanent tool library for future use
Architecture Design:
Week 6: Evaluation and Selection of Large Models
Core Content
Assessing the Capability Boundaries of Large Models
- Core capability dimensions: reasoning ability, knowledge breadth, hallucination rate, long-context handling, instruction following, tool invocation
- Build discriminative test cases: Design Agent-centric evaluation sets, rather than simple chatbot Q&A
- LLM as a Judge: Use a strong LLM (e.g., GPT-4.1) as the “judge” to automatically evaluate and compare the output quality of different models or Agents
Putting Safety Guardrails on Large Models
- Input filtering: Prevent prompt injection
- Output filtering: Monitor and block inappropriate or dangerous content
- Human intervention: Introduce a human confirmation step before high-risk operations (Human-in-the-loop)
- Cost control: Monitor token consumption, set budget limits, and prevent abuse
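Input filtering and cost control are the most mechanical of these guardrails. A crude sketch; the injection patterns are illustrative, and real deployments would layer classifier models on top of pattern matching:

```python
import re

# Guardrail sketch: a crude prompt-injection filter plus a token budget.
INJECTION_PATTERNS = [
    r"ignore (all )?(previous|prior) instructions",
    r"reveal (your )?system prompt",
]

def check_input(text: str) -> bool:
    """True if the input looks safe to pass to the agent."""
    return not any(re.search(p, text, re.IGNORECASE)
                   for p in INJECTION_PATTERNS)

class TokenBudget:
    def __init__(self, limit: int):
        self.limit, self.used = limit, 0

    def charge(self, tokens: int) -> bool:
        """Record usage; returns False once the budget would be exceeded."""
        if self.used + tokens > self.limit:
            return False
        self.used += tokens
        return True
```

Output filtering and human-in-the-loop confirmation would sit on the other side of the LLM call, gating what the agent is allowed to say or do.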
Hands-on Case: Build an evaluation dataset, use LLM as a Judge to automatically evaluate the Agent
Goal: For the in-depth research Agent we built in previous weeks, systematically build an evaluation dataset. Then develop an automated test framework that uses the LLM as a Judge approach to evaluate how different “brains” (e.g., Claude 4 vs Gemini 2.5) and different strategies (e.g., enabling/disabling chain-of-thought) affect the Agent’s performance.
Key Challenges:
- Evaluation dataset design: How to design a set of research tasks that are representative yet cover various edge cases?
- “Judge” prompt design: How to design the prompt for the “LLM Judge” so it can score the Agent’s output fairly, consistently, and accurately?
- Result interpretability: How to analyze the automated evaluation results to identify the strengths and weaknesses of different models or strategies
Architecture Design:
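Two small pieces make the "judge" design concrete: a grading prompt with an explicit rubric and a fixed output format, plus a robust parser for the score. The rubric wording below is an illustrative example, not a prescribed template:

```python
import re

# LLM-as-a-Judge sketch: build a grading prompt and parse the judge's
# score. The judge model call itself is omitted.
def build_judge_prompt(question: str, answer: str) -> str:
    return (
        "You are an impartial judge. Grade the answer below from 1 to 10\n"
        "for factual accuracy, source quality, and completeness.\n"
        f"Question: {question}\nAnswer: {answer}\n"
        "Respond with exactly one line: 'Score: <n>/10'."
    )

def parse_score(judge_output: str):
    """Extract the numeric score, or None if the judge broke format."""
    m = re.search(r"Score:\s*(\d+)\s*/\s*10", judge_output)
    return int(m.group(1)) if m else None
```

Pinning the output format and parsing defensively matters in practice: judge models drift from the requested format often enough that unparseable outputs need an explicit `None` path.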
Advanced Content: Parallel Sampling and Sequential Revision
Core Concept: Simulate the human processes of “brainstorming” and “reflective revision” to tackle complex, open-ended problems and improve the quality and robustness of Agent outputs.
Parallel Sampling
- Idea: Launch multiple Agent instances simultaneously, using slightly different prompts or a higher temperature, to explore solutions in parallel from multiple angles
- Advantages: Increase the probability of finding the optimal solution, and avoid the limitations of a single Agent’s thinking
- Implementation: Similar to Multi-Agent, but the goal is to solve the same problem; finally select the best answer through an evaluation mechanism (e.g., LLM as a Judge)
Sequential Revision
- Idea: Have the Agent critique and revise its own initial output
- Process: Initial response → self-evaluation → identify issues → generate improvements → final output
- Advantages: Improve the success rate and depth of answers for a single task, enabling self-optimization
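The two strategies compose naturally: sample N candidates in parallel, select the best, then revise it sequentially. A sketch in which `generate`, `score`, and `revise` are stubs standing in for an LLM, a judge, and a self-critique pass:

```python
from concurrent.futures import ThreadPoolExecutor

# Best-of-N plus self-revision sketch. All three inner functions are
# stubs for LLM calls.
def generate(prompt: str, seed: int) -> str:       # sampling with variation
    return f"draft-{seed} for {prompt}"

def score(candidate: str) -> int:                  # judge stub
    return len(candidate)

def revise(candidate: str) -> str:                 # self-critique stub
    return candidate + " [revised]"

def best_of_n_then_revise(prompt: str, n: int = 4, rounds: int = 2) -> str:
    with ThreadPoolExecutor() as pool:             # parallel sampling
        candidates = list(pool.map(lambda s: generate(prompt, s), range(n)))
    best = max(candidates, key=score)              # selection via judge
    for _ in range(rounds):                        # sequential revision
        best = revise(best)
    return best
```

Note the cost profile: this makes `n + rounds + 1` model calls (samples, revisions, and judging) per task, which is exactly the cost-control tension raised in the practice below.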
Advanced Practice: Add parallel and revision capabilities to the in-depth research Agent
Goal: Integrate both parallel sampling and sequential revision into our in-depth research Agent. Use the evaluation framework we just built to quantitatively assess whether, and to what extent, these strategies improve the Agent’s performance.
Key Challenges:
- Strategy integration: How to organically combine parallel sampling (horizontal scaling) and sequential revision (vertical deepening) within one Agent workflow?
- Cost control: Both strategies significantly increase LLM call costs; how to balance performance gains and cost?
- Performance attribution: In evaluation, how to attribute performance improvements accurately to parallel sampling versus sequential revision?
Architecture Design:
Week 7: Multimodality and Real-time Interaction
Core Content
Real-time voice-call Agent
- Tech stack: VAD (Voice Activity Detection), ASR (Automatic Speech Recognition), LLM, TTS (Text-to-Speech)
- Low-latency interaction: Optimize the end-to-end latency from user voice input to Agent voice output
- Natural interruption handling: Allow users to interject while the Agent is speaking for more human-like dialogue flow
Operating computers and phones
- Visual understanding: The Agent needs to interpret screenshots and identify UI elements (buttons, input fields, links)
- Action mapping: Map natural-language instructions like “click the login button” precisely to screen coordinates or UI element IDs
- Integration with existing frameworks: Directly call mature frameworks like browser-use to quickly equip the Agent with computer operation capabilities
Hands-on Case 1: Build a real-time voice-call Agent that can listen and speak
Goal: From scratch, build an Agent capable of real-time, fluent voice conversations with users. It should respond quickly, understand and execute voice commands, and even proactively lead guided dialogues.
Key Challenges:
- Latency control: The end-to-end latency from user voice input to Agent voice output determines the experience quality. How to optimize each part of the tech stack?
Architecture Design:
(Pipeline: voice input stream → brain → voice output stream)
Hands-on Case 2: Integrate browser-use to let the Agent operate your computer
Goal: Call the existing browser-use framework to give our Agent the ability to operate a desktop browser. The Agent should understand user operation instructions (e.g., “help me open anthropic.com and find the computer use documentation”) and translate them into actual browser actions.
Key Challenges:
- Framework integration: How to integrate browser-use as a tool seamlessly into our existing Agent architecture
- Instruction generalization: User instructions may be vague; how to help the Agent understand them and translate them into precise operations supported by browser-use
- State synchronization: How to let the Agent perceive the results of browser operations (e.g., page navigation, element loading) to make the next decision
Architecture Design:
Advanced Content: Fast/Slow Thinking and Intelligent Interaction Management
Fast/Slow Thinking (Mixture-of-Thoughts) Architecture
- Fast path: Use low-latency models (e.g., Gemini 2.5 Flash) for instant feedback, handling simple queries and maintaining conversational fluency
- Deep-thinking path: Use stronger SOTA models (e.g., Claude 4 Sonnet) for complex reasoning and tool use, delivering more precise and in-depth answers
Intelligent Interaction Management
- Smart interruptions (Interrupt Intent Detection): Use VAD and smaller models to filter background noise and filler utterances, stopping only when the user has a clear intent to interrupt
- Turn-taking (Turn Detection): Analyze the semantic completeness of what the user has said to decide whether it is the Agent’s turn to speak, avoiding cutting in before the user has finished
- Silence management: When the user is silent for a long time, proactively start new topics or ask follow-ups to keep the conversation coherent
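The fast/slow split needs a router in front of the two paths. A deliberately crude heuristic sketch; a real system might use a small classifier model, and the trigger words here are illustrative:

```python
# Fast/slow routing sketch: decide whether the low-latency model answers
# directly or the query escalates to the stronger, slower path.
SLOW_TRIGGERS = ("why", "compare", "plan", "research", "prove", "debug")

def route(query: str) -> str:
    q = query.lower()
    if len(q.split()) > 20 or any(t in q for t in SLOW_TRIGGERS):
        return "slow"    # SOTA model plus tools, higher latency
    return "fast"        # low-latency model, instant feedback
```

In a voice setting the fast path can also speak a brief acknowledgment ("let me check that") while the slow path works, which is what keeps perceived latency low.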
Advanced Practice: Build an advanced real-time voice Agent
Goal: Build an advanced voice Agent that integrates the “fast/slow thinking” architecture and “intelligent interaction management,” achieving industry-leading levels in response speed and natural interaction.
Key Challenges and Acceptance Criteria:
- Basic reasoning: Ask: “What is 8 to the power of 6?” — must give an initial response within 2 seconds and the correct answer “262144” within 15 seconds.
- Tool use: Ask: “How is the weather in Beijing today?” — must respond within 2 seconds and return accurate weather via API within 15 seconds.
- Intelligent interaction management:
- Smart interruption: During the Agent’s speech:
- If the user says “um”, the Agent should not stop speaking.
- If the user taps the table, the Agent should not stop speaking.
- If the user says “And its battery life…” the Agent should immediately stop the current speech.
- Turn-taking: After the user says “And its battery life…” and deliberately pauses, the Agent should not respond.
- Silence management: If the user says “And its battery life…” and pauses for more than 3 seconds, the Agent can proactively guide the conversation or ask follow-up questions to keep the exchange smooth.
Architecture Design:
Week 8: Multi-Agent Collaboration
Core Content
Limitations of a single Agent
- High context cost: A single context window balloons rapidly in complex tasks
- Inefficient sequential execution: Cannot process multiple subtasks in parallel
- Quality degradation in long contexts: Models in overly long contexts tend to “forget” or get “distracted”
- No parallel exploration: Can only explore along a single path
Advantages of Multi-Agent
- Parallel processing: Break down the task and hand it to different SubAgents to process in parallel, improving efficiency
- Independent context: Each SubAgent has an independent, more focused context window to ensure execution quality
- Compression is the essence: Each SubAgent only needs to return its most important findings, which the main Agent aggregates to achieve efficient information compression
- Emergent collective intelligence: Suitable for tasks requiring multi-perspective analysis, such as open-ended research
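The parallel-processing and compression points above reduce to a fan-out/fan-in pattern. A sketch in which `sub_agent` is a stub for a full sub-agent loop:

```python
from concurrent.futures import ThreadPoolExecutor

# Multi-agent fan-out sketch: each sub-agent works in its own focused
# context and returns only a compressed finding; the main agent merges.
def sub_agent(subtask: str) -> str:
    # A real sub-agent runs its own LLM loop with an independent context
    # window; only the distilled finding crosses back to the main agent.
    return f"finding({subtask})"

def fan_out(task: str, subtasks) -> str:
    with ThreadPoolExecutor() as pool:          # parallel execution
        findings = list(pool.map(sub_agent, subtasks))
    return f"{task}: " + "; ".join(findings)    # compression + aggregation
```

The key property is that the main agent's context grows only by the findings, not by the sub-agents' full working contexts; that is the "compression is the essence" point above.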
Case Study: Design a multi-Agent collaboration system to realize “talking on the phone while using the computer”
Goal: Solve the challenge of “doing two things at once.” Build a team consisting of a “Phone Agent” and a “Computer Agent.” The “Phone Agent” communicates with the user via voice to gather information; the “Computer Agent” simultaneously operates web pages. The two communicate in real time and collaborate efficiently.
Core challenges:
- Dual-Agent architecture: Two independent Agents, one responsible for voice calls (Phone Agent), and one responsible for operating the browser (Computer Agent)
- Cross-Agent collaborative communication: The two Agents must communicate efficiently in both directions. Information obtained by the Phone Agent should be immediately shared with the Computer Agent, and vice versa. This can be implemented via tool calls
- Parallel work and real-time responsiveness: The key is that both Agents must work in parallel without blocking each other. Each Agent’s context needs to include real-time messages from the other Agent
Architecture design:
(Diagram: user ↔ Phone Agent via voice; Phone Agent ↔ Computer Agent via A2A communication; Computer Agent ↔ browser via GUI operations)
Advanced: Orchestration Agent - Treat Sub-agents as tools
Core idea: Instead of hard-coded Agent-to-Agent collaboration, introduce a higher-level “Orchestration Agent.” Its core responsibility is to understand the user’s top-level goals and dynamically select, launch, and coordinate a group of “expert Sub-agents” (as tools) to complete the task together.
Implementation strategy:
- Sub-agent as Tools: Each expert Sub-agent (e.g., Phone Agent, Computer Agent, Research Agent) is encapsulated as a “tool” conforming to a standard interface
- Dynamic tool invocation: The Orchestration Agent, based on user needs, asynchronously invokes one or more Sub-agent tools
- Direct communication between Agents: Allow invoked Sub-agents to establish direct communication channels for efficient task collaboration without routing everything through the Orchestration Agent
Architecture design:
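The "Sub-agent as Tools" strategy can be prototyped as a registry of callables behind one interface. In this sketch the orchestrator selects sub-agents by keyword matching, which stands in for LLM-driven task planning; the agent names and skill sets are illustrative:

```python
# Orchestration sketch: expert sub-agents registered behind a uniform
# callable interface; the orchestrator picks which ones a task needs.
SUB_AGENTS = {
    "phone_agent":    {"skills": {"call", "phone", "voice"},
                       "run": lambda t: f"phone:{t}"},
    "computer_agent": {"skills": {"browse", "book", "form"},
                       "run": lambda t: f"computer:{t}"},
    "research_agent": {"skills": {"research", "summarize"},
                       "run": lambda t: f"research:{t}"},
}

def orchestrate(task: str) -> dict:
    words = set(task.lower().split())
    results = {}
    for name, agent in SUB_AGENTS.items():
        if words & agent["skills"]:      # this sub-agent is needed
            results[name] = agent["run"](task)
    return results
```

In the real system the selected sub-agents would run asynchronously and open a direct channel to each other, rather than returning synchronously through the orchestrator as here.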
Advanced Practice: Use an Orchestration Agent to dynamically coordinate phone and computer operations
Goal: Refactor our “talk on the phone while using the computer” system. Instead of hard-coding the startup of two Agents, create an Orchestration Agent. When the user asks “help me call to book a flight,” the Orchestration Agent can automatically infer that the task requires both “making a phone call” and “operating a computer,” then launch these two Sub-agents in parallel and have them collaborate.
Core challenges:
- Task planning and tool selection: How can the Orchestration Agent accurately decompose a vague user goal into which specific Sub-agent tools are needed?
- Asynchronous tool management: How to manage the lifecycle (start, monitor, terminate) of multiple Sub-agent tools that run in parallel and for long durations
- Sub-agent intercommunication: How to establish an efficient, temporary, direct communication mechanism for dynamically launched Sub-agents
Architecture design:
Week 9: Project Showcase
Core content
Project integration and demo
- Integration capability: Integrate the capabilities learned in the first 8 weeks (RAG, tool use, voice, multimodality, Multi-Agent) into a final project
- Outcome demo: Each participant will have the opportunity to showcase their unique general-purpose Agent and share the thinking and challenges during its creation
- Peer review: Gain inspiration from others’ projects through mutual demos and Q&A
Book polishing and summary
- Knowledge consolidation: Together, review and summarize the core knowledge points of the 9 weeks and solidify them into the final manuscript of “AI Agent, Explained”
- Co-creation of content: Propose edits to the manuscript, jointly polish it, and ensure it is “systematic and practical”
- Credited publication: The names of all participating co-creators will appear in the final published physical book
Case Study: Showcase your unique general-purpose Agent
Goal: Provide a comprehensive summary and showcase of the personal Agent project built during the bootcamp. This is not only a results report, but also an exercise in systematizing learned knowledge and clearly explaining complex technical solutions to others.
Key points to showcase:
- Agent positioning: What core problem does your Agent solve?
- Technical architecture: How did you synthesize the knowledge learned (context, RAG, tools, multimodality, Multi-Agent) to achieve your goal?
- Innovation highlights: What is the most creative design in your Agent?
- Demo: Live demonstration of the Agent’s core functions
- Future outlook: How do you plan to continue iterating on and improving your Agent?
Final project architecture example:
[Architecture diagram: final project integrating RAG, tool use, multimodality, and Multi-Agent collaboration]
Advanced: Four ways an Agent learns from experience
1. Rely on long-context capability
- Idea: Trust and leverage the model’s own long-context processing ability by feeding the complete, uncompressed conversation history
- Implementation:
- Keep recent conversations: Fully retain the most recent interaction history (Context Window)
- Compress long-term memory: Use `Linear Attention` and related techniques to automatically compress distant conversation history into Latent Space
- Extract key snippets: Use `Sparse Attention` and related techniques to automatically extract the snippets most relevant to the current task from distant conversation history
- Pros: Easiest to implement; preserves original information details to the greatest extent
- Cons: Strongly dependent on model capabilities
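The engineering side of method 1 can be sketched as a two-tier context builder: recent turns are kept verbatim, older turns are folded into a compact summary. The `summarize` stub stands in for whatever compression mechanism is used, whether model-side (linear-attention-style) or an explicit LLM summarization call.

```python
# Minimal sketch of two-tier context management: keep recent turns verbatim,
# compress everything older. summarize() is a stub for the real mechanism.
def summarize(turns: list[str]) -> str:
    return f"<summary of {len(turns)} earlier turns>"   # stub

def build_context(history: list[str], keep_recent: int = 4) -> list[str]:
    if len(history) <= keep_recent:
        return history                                   # everything still fits
    old, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(old)] + recent

history = [f"turn {i}" for i in range(10)]
ctx = build_context(history)
print(ctx)
```

Here ten turns collapse into one summary plus the four most recent turns, so the context size stays bounded no matter how long the conversation runs.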
2. Text-form extraction (RAG)
- Idea: Summarize experience into natural language and store it in a knowledge base
- Implementation: Retrieve relevant experience text via RAG and inject it into the prompt
- Pros: Controllable cost; knowledge is readable and maintainable
- Cons: Depends on retrieval accuracy
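Method 2 reduces to two operations: retrieve the most relevant experience text, then inject it into the prompt. A real system would score with an embedding model; to stay self-contained this sketch scores by word overlap instead, and the stored experiences are made-up examples.

```python
# Illustrative sketch of text-form experience + retrieval. Real systems use
# embeddings; here relevance is crude word overlap for self-containment.
def score(query: str, doc: str) -> int:
    return len(set(query.lower().split()) & set(doc.lower().split()))

EXPERIENCES = [
    "When booking flights, always confirm the date with the user first.",
    "Legal questions should cite the specific statute in the answer.",
    "Retry failed API calls at most three times with backoff.",
]

def retrieve(query: str, k: int = 1) -> list[str]:
    return sorted(EXPERIENCES, key=lambda d: score(query, d), reverse=True)[:k]

def build_prompt(task: str) -> str:
    lessons = "\n".join(retrieve(task))
    return f"Relevant past experience:\n{lessons}\n\nTask: {task}"

print(build_prompt("help the user booking a flight for next week"))
```

The “controllable cost” advantage shows here: the knowledge base is plain text, so lessons can be read, edited, or deleted by hand; the “retrieval accuracy” risk shows too, since a poor scorer injects the wrong lesson.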
3. Post-training (SFT/RL)
- Idea: Learn the experience into the model weights
- Implementation: Use high-quality Agent behavior trajectories as data to fine-tune the model (SFT) or perform reinforcement learning (RL)
- Pros: Internalizes experience as the model’s “intuition,” suitable for complex tasks with strong generalization
- Cons: Higher cost, requires large amounts of high-quality data; long cycle, making it hard to realize a real-time feedback loop—i.e., the model will not immediately avoid similar mistakes from just-failed online examples
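The data-preparation step of method 3 can be sketched as converting a successful trajectory into a chat-format training record. The `{"messages": [...]}` shape follows a common fine-tuning convention, but the exact schema depends on the training stack you use; the trajectory content here is invented.

```python
# Sketch of turning an Agent trajectory into one SFT training record.
# Schema follows the common {"messages": [...]} convention (stack-dependent).
import json

trajectories = [
    {"task": "find cheapest flight",
     "steps": ["search(flights)", "sort(price)"],
     "final_answer": "Flight A at $120"},
]

def to_sft_record(traj: dict) -> str:
    return json.dumps({
        "messages": [
            {"role": "user", "content": traj["task"]},
            {"role": "assistant",
             "content": "\n".join(traj["steps"] + [traj["final_answer"]])},
        ]
    })

records = [to_sft_record(t) for t in trajectories]   # one JSONL line per trajectory
print(records[0])
```

The long-cycle drawback from the list above is visible in this shape: a newly failed trajectory only helps after it is collected, filtered into a dataset like this, and a training run completes.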
4. Abstract into code (tools/Sub-agent)
- Idea: Abstract recurring successful patterns into a reusable tool or Sub-agent
- Implementation: The Agent identifies automatable patterns and writes code to solidify them
- Pros: A reliable and efficient way to learn, since solidified patterns run as deterministic code
- Cons: Requires strong coding ability from the Agent; when the number of tools grows large, tool selection becomes a challenge
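Method 4 ends with generated code being registered as a named tool. In this sketch the tool body is hand-written to show the shape of the result; in a real system the Agent would generate it after noticing it kept repeating the same cleanup steps. The registry and decorator are illustrative, not a real framework API.

```python
# Sketch of "solidifying" a recurring pattern into a reusable tool.
# The Agent would normally generate this code itself.
TOOLS: dict = {}

def register_tool(fn):
    """Add a function to the tool registry under its own name."""
    TOOLS[fn.__name__] = fn
    return fn

@register_tool
def normalize_and_dedupe_urls(urls: list[str]) -> list[str]:
    """Pattern learned from experience: strip whitespace and drop empty or
    duplicate URLs while preserving order."""
    seen, out = set(), []
    for u in (u.strip() for u in urls):
        if u and u not in seen:
            seen.add(u)
            out.append(u)
    return out

print(TOOLS["normalize_and_dedupe_urls"]([" https://a.com", "https://a.com", ""]))
```

The tool-selection challenge noted above follows directly: once `TOOLS` holds hundreds of entries like this, picking the right one becomes its own retrieval problem.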
Advanced practice: Compare the four ways an Agent learns from experience
Goal: Using the evaluation framework we built in Week 6, design experiments to compare the pros and cons of the four ways an Agent learns from experience.
Core challenges:
- Experiment design: How to design a set of tasks that clearly reflect the differences among the four learning methods?
- Cost-performance trade-off: How to combine each method’s “performance score” with its “computational cost” in the evaluation report for a holistic assessment?
- Scenario-based analysis: Draw conclusions about which learning method should be prioritized in which task scenarios
Architecture design:
[Architecture diagram: evaluation pipeline comparing the four learning methods on performance score and computational cost]
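One way to approach the “cost-performance trade-off” challenge is a single cost-adjusted metric. The sketch below uses a simple linear penalty per unit of cost; both the penalty weight and all the scores/costs are placeholders, not real measurements.

```python
# Hypothetical evaluation sketch: combine each learning method's task score
# with its cost into one ranking. All numbers are placeholders.
results = {
    "long_context":  {"score": 0.78, "cost": 3.0},
    "rag":           {"score": 0.74, "cost": 1.0},
    "post_training": {"score": 0.85, "cost": 9.0},
    "code_tools":    {"score": 0.81, "cost": 2.0},
}

def cost_adjusted(score: float, cost: float, penalty: float = 0.02) -> float:
    """One simple trade-off: subtract a small penalty per unit of cost."""
    return round(score - penalty * cost, 4)

ranking = sorted(results, key=lambda m: cost_adjusted(**results[m]), reverse=True)
print(ranking)
```

With these placeholder numbers, post-training's raw score lead is wiped out by its cost, which is exactly the kind of scenario-based conclusion the experiment is meant to surface.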
Summary and review
Through 9 weeks of systematic learning and practice, we completed a full journey from Agent fundamentals to building a general-purpose intelligent agent:
Core competencies mastered
- Agent architecture understanding: Gained a deep understanding of the core design paradigm of `LLM + context + tools`
- Mastery of context engineering: Mastered multi-level context management techniques
- Tooling system construction: Achieved reliable integration with external APIs and MCP Servers
- Multimodal interaction: Built voice, vision, and other multimodal Agents
- Collaboration pattern design: Implemented complex collaboration modes such as Multi-Agent and Orchestration
Practical project portfolio
- Web-connected search Agent
- Legal Q&A Agent
- In-depth research Agent
- Agent-development engineer Agent (an Agent that develops other Agents)
- Real-time voice call Agent
- Multi-Agent collaboration system
Advanced technical exploration
- Context compression and optimization
- Four ways of learning from experience
- Parallel sampling and sequential revision
- Fast-and-slow thinking architectures
- An Agent’s self-evolution
🚀 Develop your own AI Agent—start here!