UCAS Spring 2026 AI Agent Practical Projects
This document provides a series of carefully designed AI Agent practical projects, covering three difficulty levels from easy to hard. These projects are intended to help students deeply understand the core technologies and design patterns of AI Agents, including tool use, multi-agent collaboration, long-term memory management, externalized learning, and other frontier topics. Each project includes clear experimental objectives, detailed descriptions of the experimental content, and specific acceptance criteria, ensuring that students can master the key skills needed to build advanced AI Agent systems through hands-on practice.
The projects are divided into three levels by difficulty. Students are advised to choose appropriate projects according to their own background and improve their abilities step by step.
Project Index
Difficulty: Easy
- Enhancing mathematical and logical reasoning ability using code generation tools
- Natural language interactive ERP Agent
- Werewolf Agent
Difficulty: Medium
- Personal photo search engine
- Intelligent video editing
- PPT generation Agent
- Book translation Agent
- Agent that collects information from multiple websites simultaneously
Difficulty: Hard
- A user memory that understands you better
- Agent that uses a computer while talking on the phone
- Computer operation Agent that gets more proficient the more you use it
- Agent that can create Agents
Difficulty: Easy
Enhancing mathematical and logical reasoning ability using code generation tools
Experimental Objective
This is a comprehensive experiment aimed at verifying that a single Agent can handle different types of reasoning tasks through a code generation tool (code interpreter). Specifically, it uses Python code generation to compensate for two weaknesses of large language models:
- Mathematical computation: addressing the limitations of pure chain-of-thought reasoning in numerical precision and complex symbolic manipulation, and solving exactly using mathematical libraries.
- Logic puzzles: addressing the tendency of pure text reasoning to overlook constraints or miss logical conflicts in complex logic problems (such as Knights and Knaves problems), by using constraint solvers for exhaustive search.
The experiment will evaluate how a single Agent can flexibly use different code libraries (math libraries vs. constraint libraries) to adapt to different task domains.
Background Knowledge
AIME (American Invitational Mathematics Examination): A high-difficulty mathematics competition in the U.S., positioned between AMC and USAMO. Its problems usually do not require advanced calculus knowledge, but demand extremely high arithmetic accuracy, algebraic skills, and rigorous logical thinking. For LLMs, AIME problems are an excellent benchmark to test complex multi-step reasoning and accurate calculation ability, because pure text reasoning often produces “calculation hallucinations” in intermediate steps.
Knights and Knaves problems: A classic type of logic puzzle popularized by logician Raymond Smullyan. On this fictional island, “knights” always tell the truth and “knaves” always lie. The solver must infer the true identity of each person based on the islanders’ statements (for example, A says: “At least one of us is a knave”). These problems severely test the ability to check logical consistency; pure LLMs can easily fall into logical dead ends or miss implicit constraints.
Experimental Content Description
Equip the Agent with a fully functional code interpreter tool, providing a Python sandbox environment that includes the following key libraries:
- Mathematics domain: sympy (symbolic computation), numpy (matrix operations), scipy (numerical optimization)
- Logic domain: python-constraint (constraint satisfaction problem solving)
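As a hedged illustration of the mathematics side (the equation below is a toy example, not an actual AIME problem), sympy lets the Agent compute exact answers instead of textual approximations:

```python
from sympy import Rational, solve, symbols

x = symbols("x")
# Solve exactly instead of estimating roots in natural language.
roots = solve(x**2 - 5*x + 6, x)          # [2, 3]
# Exact rational arithmetic avoids floating-point "calculation hallucinations".
total = Rational(1, 3) + Rational(1, 6)   # 1/2
print(roots, total)
```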
The Agent needs to automatically determine and adopt the appropriate problem-solving strategy based on the type of problem input by the user:
- For math problems (such as AIME competition questions), it should formalize the problem into algebraic equations or computational procedures, call the mathematical libraries to obtain exact solutions, and avoid vague answers such as “approximately” or “possibly”.
- For logic puzzles (such as Knights and Knaves problems), it should convert natural language constraints into formal constraint variables and conditions, call python-constraint to define the problem space and search for feasible solutions, avoiding logical loopholes in manual reasoning.
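For the logic side, a minimal sketch of how the Agent might encode a two-person Knights and Knaves puzzle with python-constraint (the puzzle wording and variable names are illustrative):

```python
from constraint import Problem

problem = Problem()
# True = knight (always tells the truth), False = knave (always lies)
problem.addVariables(["A", "B"], [True, False])

# A says: "At least one of us is a knave."
# The statement holds exactly when the speaker is a knight.
problem.addConstraint(lambda a, b: a == ((not a) or (not b)), ("A", "B"))

print(problem.getSolutions())  # [{'A': True, 'B': False}] -> A is a knight, B is a knave
```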
Evaluation Method:
The experiment includes two independent test sets, and the Agent needs to automatically adapt without manual prompt switching:
- Mathematical ability: Use the AIME 2025 dataset to compare the accuracy of pure chain-of-thought reasoning and code-assisted reasoning.
- Logical ability: Use the K&K Puzzle dataset, requiring the code-assisted mode to achieve over 90% accuracy on variants of different complexity.
Expected Acceptance Results
- The Agent can automatically recognize the problem type and load the correct Python libraries for solving.
- On mathematical tasks (AIME 2025), the accuracy of the code-assisted mode is significantly higher than that of the pure chain-of-thought mode, with no numerical precision errors.
- On logical tasks (K&K Puzzle), the code-assisted mode can correctly model and solve complex puzzles, with accuracy > 90%.
- Demonstrate how the Agent uses the same “Thinking -> Coding -> Execution” paradigm across two completely different domains.
Natural language interactive ERP Agent
Experimental Objective
ERP (Enterprise Resource Planning) software is a key software system for enterprises. Today it is generally operated through a GUI, where complex operations require many mouse clicks and are cumbersome. An AI Agent can convert a user’s natural language query into SQL statements, enabling automated querying.
Experimental Content Description
Core Requirement: Agent rather than Workflow
The key point of this experiment is to build a real Agent, not just a linear Text-to-SQL workflow. In practice, SQL generated by the model in one shot often contains syntax errors, hallucinated field names, or logical flaws. The core value of an Agent lies in its self-correction capability.
You need to implement an Agent with reflection and retry mechanisms:
- Execution and monitoring: After generating SQL, the Agent calls the database tool to execute it.
- Error capturing: If the database returns an error, the Agent must capture the error message.
- Autonomous debugging: The Agent analyzes the cause of the error based on the error message and the current table schema (for example, “I used a non-existent column salary_date; the correct column is pay_date”), and generates a corrected SQL statement.
- Automatic retry: Execute the corrected SQL again, repeating until it succeeds or the maximum retry count is reached.
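A minimal sketch of this reflection-and-retry loop, assuming a psycopg2 connection and a hypothetical llm.generate(prompt) wrapper:

```python
import psycopg2

MAX_RETRIES = 3

def answer_question(conn, llm, question: str, schema: str):
    """Generate SQL, execute it, and self-correct on database errors."""
    sql = llm.generate(
        f"Schema:\n{schema}\nQuestion: {question}\nReturn one PostgreSQL query only."
    )
    for _ in range(MAX_RETRIES):
        try:
            with conn.cursor() as cur:
                cur.execute(sql)                 # execution and monitoring
                return cur.fetchall()
        except psycopg2.Error as e:              # error capturing
            conn.rollback()
            # autonomous debugging: feed the error and schema back to the model
            sql = llm.generate(
                f"The query below failed.\nSchema:\n{schema}\nQuery:\n{sql}\n"
                f"Error: {e.pgerror}\nReturn a corrected PostgreSQL query only."
            )
    raise RuntimeError("Maximum retry count reached without a successful query")
```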
Experimental Simulation Environment
You are required to set up a PostgreSQL database containing two tables:
- Employee table, including employee ID, employee name, department name, current level, start date, and end date (null indicates still employed).
- Salary table, including employee ID, pay date, and salary, with one pay record per month.
The Agent is required to automatically answer the following questions:
- On average, how long does each employee stay employed at the company?
- How many current employees are there in each department of the company?
- Which department has the highest average level of employees?
- How many new employees joined each department this year and last year respectively?
- From March of the year before last to May of last year, what was the average salary in department A?
- Between department A and department B, which had the higher average salary last year?
- What is the average salary of employees at each level this year?
- What is the average salary in the most recent month for employees whose length of service is within one year, one to two years, and two to three years?
- Who are the 10 employees with the largest salary increase from last year to this year?
- Has there ever been a case of delayed salary payment, i.e., a month in which an employee was employed but no salary was paid?
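As a hedged illustration of the expected output for the first question (table and column names are assumptions based on the table descriptions above), the Agent might converge on SQL along these lines:

```python
# Illustrative SQL for "On average, how long does each employee stay employed?"
# Assumed schema: employee(start_date, end_date), where a NULL end_date means still employed.
AVG_TENURE_SQL = """
SELECT AVG(COALESCE(end_date, CURRENT_DATE) - start_date) AS avg_days_employed
FROM employee;
"""
```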
Werewolf Agent
Experimental Objective
Werewolf is a classic social deduction game that tests players’ reasoning ability, deception skills, and social strategy. This experiment aims to build a Werewolf system that supports mixed play between humans and AI, focusing on assessing the Agent’s logical reasoning, deception, and multi-party game-theoretic ability in an environment with asymmetric information.
Experimental Content Description
The main development work of this project is divided into two independent parts:
Werewolf battle platform (Game Platform):
This is a standard Web application that does not contain any LLM logic. You need to develop an independent backend service (Game Server) and a frontend Web interface. The backend is responsible for maintaining the game state machine (dealing cards, state transitions, settlement), and the frontend provides a graphical interface for human players to participate in the game (view identities, input text to speak, click to vote).
Werewolf Agent (AI Player):
This is the core of the experiment. You need to build AI Agents that can call the platform API to participate in the game. Each Agent plays a game role and has independent observation, reasoning, and decision-making capabilities.
Experimental Architecture Design
1. Game state management: The platform backend maintains centralized game state, including player survival status, current phase, historical events, etc.
2. Information access control: The platform should strictly control information flow. When the Agent obtains the current game state via API, it can only obtain the information that should be known from the perspective of its role (for example, werewolves can see their wolf teammates, the seer can see the results of checks, and villagers can only see public information).
3. Interaction mechanism:
- Speaking: Human players input text through the Web interface; Agents generate textual speeches and send them via API.
- Voting: Human players click to vote on the interface; Agents vote via API based on their reasoning results.
4. Agent’s reasoning and strategy: The intelligence of each Agent is reflected in its reasoning and decision-making ability. The following are several key design points:
Werewolf disguise strategy: The prompt for the werewolf Agent should include common werewolf rhetoric and strategies, for example: “You should speak like an ordinary villager, express suspicion of certain players, but don’t be too aggressive so as not to attract attention. If a seer jumps out and checks you as a werewolf, you can bite back and accuse the other party of being a fake seer. When voting, try to follow the crowd (vote for the target most people are voting for) to avoid standing out.”
Seer’s identity proof: When multiple players claim to be the seer (the real seer and a werewolf making a risky fake claim), the real seer needs to prove themselves through logic. The prompt can guide: “If someone makes a risky fake seer claim, you need to compare your check information with theirs and point out contradictions or unreasonable points in their information. For example, if a player they checked behaves later in a way that clearly does not match the claimed identity, this is a flaw. You can also ask the witch or hunter to cooperate in verifying your authenticity.”
Villager’s logical reasoning: The Villager Agent needs to possess basic Werewolf/Mafia-style reasoning skills. The prompt can include: “Analyze whether each player’s statements are self-consistent; pay attention to players who are eager to steer the discussion, obscure their identity, or frequently change their stance. Focus on voting behavior—werewolves tend to concentrate their votes on the good players who pose the greatest threat to them. If someone claiming to be the Seer gives check results that don’t match subsequent deaths, they are likely a werewolf making an aggressive fake claim. Don’t suspect people at random; every inference you make should be based on concrete facts and logic.”
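As a hedged sketch of design point 2 (information access control), the platform’s state endpoint could filter the centralized game state by role before returning it; the field and role names below are illustrative:

```python
def visible_state(game, player_id: str) -> dict:
    """Return only the information this player's role is allowed to see."""
    me = game.players[player_id]
    state = {
        "phase": game.phase,                     # night / day / voting
        "alive": [p.id for p in game.players.values() if p.alive],
        "public_log": game.public_events,        # speeches and vote results
        "my_role": me.role,
    }
    if me.role == "werewolf":                    # werewolves see their teammates
        state["teammates"] = [p.id for p in game.players.values()
                              if p.role == "werewolf" and p.id != player_id]
    if me.role == "seer":                        # the seer sees past check results
        state["checks"] = game.seer_checks.get(player_id, [])
    return state
```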
Acceptance criteria:
- Set up a 6–8 player game (1 human player + 5–7 AI Agents)
- Configure roles: 2 werewolves, 1 seer, 1 witch, the rest are villagers, with the human player randomly assigned a role
- The game can proceed normally for at least 3 complete rounds (night–day–voting cycle)
- The speeches and actions of the AI Agents are consistent with their role identities and game strategies
- The Werewolf Agents can effectively hide their identities
- The Seer Agent can choose an appropriate time to reveal and announce check results
- The Villager Agents’ reasoning is not random, but based on logical analysis of statements and behaviors
- The game can correctly determine victory or defeat at the end
Difficulty: Medium
Personal photo search engine
Experimental objective
Build a personal photo search engine based on multimodal embeddings and a Vision LLM, to validate the effectiveness of cross-modal retrieval technology in real application scenarios, and to verify the capabilities of the Vision LLM in image understanding and natural language description generation.
Experimental description
This experiment is divided into two parts: indexing phase and query phase.
Indexing phase (offline processing):
- Photo scanning: traverse the user-specified folder and identify all image files
- Image description generation: call the Vision LLM API for each photo to generate a natural language description
- Embedding generation: use an embedding model to generate an embedding vector for the description text of each photo
- Index construction: store the photo path, description text, and embedding vector into a vector database
Query phase (online interaction):
- User input: receive the user’s natural language queries, such as “photos from last summer at the beach”, “group photo of me with friends”, “Christmas decorations”
- Query embedding: convert the user’s query into an embedding vector
- Similarity search: perform similarity search (e.g., cosine similarity, dot product) in the vector database to retrieve the Top-K most relevant photos
- Result display: display the retrieved photos in a grid or list, each photo accompanied by its auto-generated description and similarity score
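A minimal end-to-end sketch of the two phases, assuming chromadb as the vector database and hypothetical wrappers describe_image (Vision LLM captioning) and embed (embedding model):

```python
import pathlib
import chromadb

client = chromadb.Client()
photos = client.get_or_create_collection("photos")

def index_folder(folder: str, describe_image, embed):
    """Indexing phase: describe each photo, embed the description, store it."""
    for path in pathlib.Path(folder).rglob("*"):
        if path.suffix.lower() not in {".jpg", ".jpeg", ".png"}:
            continue
        description = describe_image(str(path))          # Vision LLM caption
        photos.add(ids=[str(path)],
                   documents=[description],
                   embeddings=[embed(description)],
                   metadatas=[{"path": str(path)}])

def search(query: str, embed, k: int = 5):
    """Query phase: embed the query and return the Top-K most similar photos."""
    result = photos.query(query_embeddings=[embed(query)], n_results=k)
    return list(zip(result["ids"][0], result["documents"][0], result["distances"][0]))
```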
Expected acceptance results
- Successfully index at least 100 personal photos, with generated descriptions accurately reflecting the photo content
- Support diverse natural language queries, including:
- Scene queries (e.g., “photos in the mountains”, “city night views”)
- Person queries (e.g., “photos with people in them”, “group photos”)
- Activity queries (e.g., “dining”, “sports”, “travel”)
- Time queries (combined with metadata, such as “photos from last year”, “photos from winter”)
- Emotion queries (e.g., “joyful scenes”, “serene landscapes”)
Intelligent video editing
Background knowledge
Render–Critique mechanism: This is a quality assurance paradigm particularly suitable for generating visual content (such as videos, PPTs, images). Its core idea is: after the Agent generates code or configuration, it cannot directly determine whether the final rendering effect meets expectations; it must actually render and use a Vision LLM to inspect the visual output. The workflow is: the Editor Agent generates code → rendering is executed → the Critic Agent uses a Vision LLM to analyze the rendered result → proposes improvements → the Editor Agent adjusts the code → re-renders, iterating in this way until the quality meets the standard.
This mechanism solves the “semantic gap” between code generation and visual effects—even if the code is syntactically correct and logically sound, the final visual presentation may still have issues (such as inaccurate video clip starting points, fonts that are too small in PPTs, overlapping image elements, etc.). Only by actually rendering and visually checking can such problems be discovered.
Experimental description
Experimental goal
Verify the Agent’s ability to perform video editing by generating Blender Python API code, and evaluate the role of the visual-feedback-based Render–Critique mechanism in quality control for multimedia content processing.
Key challenges
Understand the user’s natural language editing requirements and translate them into precise API call sequences; handle code implementation for multiple editing operations (cutting, merging, adding subtitles, audio track mixing, visual effects); ensure that the generated Blender Python scripts are syntactically correct and can execute properly. The key point is: after the Editor Agent writes the code, it cannot directly judge whether the video effect meets expectations; it must rely on actual rendering and use a Vision LLM to inspect key frame images.
Technical approach
The user provides a sports or travel video (for example, raw footage containing surfing, hiking, skiing, etc.) and describes editing requirements in natural language, such as “cut out the surfing parts”, “extract the first 10 minutes of hiking footage and add background music”, “clip together the exciting ski jumps and add slow-motion effects”.
The Editor Agent calls a dedicated video analysis Sub-Agent to identify scenes, adopting a two-step localization strategy:
First step, coarse-grained localization: The main Agent calls the video analysis Sub-Agent, passing in parameters: video file path, time range, screenshot interval (every 10 seconds), and the question to answer (“During which time periods does surfing appear in the video?”). The Sub-Agent is a simple workflow-based agent that executes a fixed process: first, use the ffmpeg tool to capture key frames at the specified interval, then feed all key frames along with the question into a Vision LLM to obtain scene recognition results. The Sub-Agent returns the results to the main Agent (e.g., “surfing scenes appear between 40 seconds and 110 seconds”).
Second step, fine-grained localization: The main Agent calls the video analysis Sub-Agent again, this time passing in a narrower time range around the coarse boundary (30 seconds to 40 seconds; since the coarse pass samples only every 10 seconds, the true start may lie shortly before the first frame in which surfing was detected), a denser screenshot interval (every second), and a more precise question (“What is the exact time when the surfer stands up on the board?”). The Sub-Agent again performs the screenshot–analysis process and returns the precise time point (e.g., “surfing starts: 38 seconds”).
This coarse-to-fine two-step localization ensures efficiency (avoiding dense sampling across the entire video) while guaranteeing precision (accurately finding boundaries within the target region). By encapsulating video analysis as a Sub-Agent, the main Agent can avoid massive screenshots from occupying its context and causing rapid context exhaustion. After scene localization is completed, the main Agent generates a Blender Python API script to implement the editing operations.
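A minimal sketch of the Sub-Agent’s fixed screenshot-and-ask workflow, using the real ffmpeg CLI for frame capture and a hypothetical vision_llm_answer wrapper for the Vision LLM call:

```python
import os
import subprocess
import tempfile

def sample_frames(video_path: str, start_s: int, end_s: int, interval_s: int, out_dir: str):
    """Capture one frame every interval_s seconds within [start_s, end_s] using ffmpeg."""
    frames = []
    for t in range(start_s, end_s + 1, interval_s):
        out = os.path.join(out_dir, f"frame_{t:05d}.png")
        subprocess.run(["ffmpeg", "-y", "-ss", str(t), "-i", video_path,
                        "-frames:v", "1", out],
                       check=True, capture_output=True)
        frames.append((t, out))
    return frames

def analyze_video(video_path, start_s, end_s, interval_s, question, vision_llm_answer):
    """Fixed Sub-Agent workflow: sample frames, then ask the Vision LLM one question."""
    with tempfile.TemporaryDirectory() as tmp:
        frames = sample_frames(video_path, start_s, end_s, interval_s, tmp)
        # Only a short textual answer is returned to the main Agent,
        # so the screenshots never enter the main Agent's context.
        return vision_llm_answer(question,
                                 images=[p for _, p in frames],
                                 timestamps=[t for t, _ in frames])

# Coarse pass: analyze_video(v, 0, 600, 10, "During which periods does surfing appear?", ask)
# Fine pass:   analyze_video(v, 30, 40, 1, "At what second does the surfer stand up?", ask)
```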
Introducing the Render–Critique mechanism: The Critic Agent executes the script to generate a quick preview (rather than a full render), extracts key frame images, and uses a Vision LLM to verify whether the clip has accurately captured the target content, check visual quality, and propose improvement suggestions (such as “the crop start point is too late; the surfing has already started”). The Editor adjusts the script and regenerates the preview, iterating quickly until the editing effect meets the standard, and only then performs a full high-quality render to output the final video.
Acceptance criteria
The Agent can accurately identify different scenes in the video (such as regions with surfing, hiking, skiing, etc.) and correctly generate editing scripts according to natural language instructions. The resulting video clips contain the user-specified content, with start and end points accurate (error not exceeding 3 seconds). If the instructions include special effects requirements (such as slow motion, transitions, subtitles), the output video should correctly apply these effects. The Render–Critique mechanism can detect obvious editing errors (such as missing key content, including irrelevant segments) and trigger corrections. The final output video file should be in a correct format, play normally, and have expected image quality.
Expected acceptance results
- Demonstrate the coarse-grained and fine-grained two-step localization strategy, showing how the system first locates the approximate range of a scene, then precisely finds the boundary time points
- Verify the effectiveness of the Render–Critique mechanism: show issues in the initial edit (such as offset start time), the Critic Agent’s feedback, and the improved results after the Editor Agent adjusts according to the feedback
- Compare editing quality with and without the Render–Critique mechanism, demonstrating the key role of visual feedback in improving final output quality
- Record the entire iterative process, including each preview, the issues discovered, the adjustments made, until an acceptable quality standard is reached
PPT generation Agent
Experimental objective
Creating PPTs is often time-consuming and labor-intensive. A typical academic presentation PPT may contain dozens of slides, each requiring careful layout design, key point extraction, and figure selection. Traditional PPT authoring tools, though powerful, pose a challenge for AI Agents—these GUI-driven tools require complex mouse operations, drag-and-drop positioning, and style adjustments, which are difficult to control precisely in a programmatic way.
Background knowledge
Proposer–Reviewer paradigm: This is a multi-Agent collaboration pattern that improves quality through division of labor. The Proposer is responsible for generating the initial solution; the Reviewer is responsible for evaluating the solution quality and proposing improvements. This division of labor brings two core advantages: first, specialization—the Proposer focuses on creative generation, the Reviewer focuses on critical evaluation, each performing its own role; second, context isolation—the Reviewer only needs to analyze the current version, avoiding context inflation caused by accumulating all historical versions in a single context.
In the PPT generation scenario, the Editor Agent plays the role of Proposer, and the Critic Agent plays the role of Reviewer. The Critic renders the PPT as images and uses a Vision LLM to conduct multi-dimensional evaluation (content density, readability, layout rationality, visual aesthetics), generating structured improvement suggestions. The Editor adjusts the code according to these suggestions, forming an iterative loop.
Slidev framework: Slidev is a presentation tool designed specifically for developers, using Markdown and HTML to define content, and themes and CSS to control styles. This design turns PPT creation into a code generation problem, making it very suitable for Agent operation—the Agent only needs to generate Markdown/HTML code, without understanding complex GUI software.
Experimental description
For Agents, if the PPT creation problem is reframed as a code generation problem, its complexity can be greatly reduced. Modern PPT frameworks (such as Slidev) adopt an elegant design philosophy: using Markdown and HTML to define presentation content. Slidev is a presentation tool specifically designed for developers; it separates slide content from styles—content is written in concise Markdown syntax, styles are controlled via themes and CSS, and interactions and animations can be implemented with Vue components.
This means that creating a slide only requires writing a concise piece of markup language—using Markdown to express textual content and structure, and HTML to embed images and custom elements. The framework automatically handles details such as rendering, layout, and animation. For Agents that have mastered code generation capabilities, this pattern is extremely friendly: they do not need to understand complex GUI software, only Markdown and HTML as structured languages, which is precisely one of the tasks that large language models are best at.
However, being able to generate PPT code alone is not enough. After the Editor Agent finishes writing the code, it has no idea what the actual rendered effect looks like—whether the content is too crowded, whether text overflows the boundaries, whether the image size is appropriate, whether the font size is reasonable, whether the color scheme is harmonious. These visual issues can only be discovered through actual rendering. This is completely consistent with how humans create PPTs: we see the actual effect of operations through a GUI, then further refine the content based on visual feedback. An excellent presentation requires not only correct content but also clear and professional visual presentation.
Therefore, quality assurance needs to introduce a Render-Critique mechanism, which is precisely the concrete application of the Proposer-Reviewer paradigm in the PPT generation scenario. The system is designed as two collaborating Agents: Editor Agent (Proposer) and Critic Agent (Reviewer). The Editor Agent is responsible for generating Slidev-format slide code based on user-provided content (such as a paper abstract, conference talk, or product introduction). It needs to understand the logical structure of the content, break it down into reasonable pages, choose suitable layout patterns for each page, organize text using Markdown syntax, and embed necessary media elements using HTML.
The Critic Agent plays the role of visual quality inspector. Its workflow includes several key steps. First is render execution: the Critic uses Slidev’s command-line tools to render the code generated by the Editor into a PDF or a series of PNG images. Each slide is converted into a high-resolution image, and these images become the input for visual analysis. Second is multi-dimensional evaluation: the Critic uses a Vision LLM’s multimodal understanding capability to analyze the rendering results of each slide. Evaluation dimensions include: content density (is the page too crowded or too sparse), readability (is the font size appropriate, is the contrast between text and background sufficient), layout rationality (are elements neatly aligned, is important information prominent), and visual aesthetics (is the color scheme harmonious, is the overall style consistent).
Based on this analysis, the Critic generates structured improvement suggestions. These suggestions are not vague comments like “this page doesn’t look good,” but specific, actionable guidance, such as: “Page 3: too much content, consider splitting the three bullet points in the lower half into a new page,” “Page 7: code block font is too small, consider reducing the number of code lines or increasing the font size to 14pt,” “Page 12: the image overlaps with the text, consider moving the image to the right and left-aligning the text.” These suggestions are formatted as structured feedback objects, including fields such as page number, issue type, severity, and concrete recommendation.
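A minimal sketch of such a structured feedback object (the field names follow the description above; the exact schema is up to the implementer):

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class SlideFeedback:
    page: int                                   # slide number, 1-based
    issue_type: Literal["density", "readability", "layout", "aesthetics"]
    severity: Literal["low", "medium", "high"]
    recommendation: str                         # concrete, actionable fix

feedback = [
    SlideFeedback(3, "density", "high",
                  "Split the three bullet points in the lower half into a new slide"),
    SlideFeedback(7, "readability", "medium",
                  "Increase the code block font to 14pt or shorten the snippet"),
]
```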
After receiving the Critic’s feedback, the Editor Agent does not blindly apply all suggestions, but instead understands the intent of each suggestion and makes adjustments while preserving content integrity. If the suggestion is “too much content, needs splitting,” the Editor needs to identify natural split points and ensure that logic remains coherent after the split; if the suggestion is “font too small,” the Editor needs to adjust the layout while increasing the font size to avoid text overflow. After the modifications are completed, the new version of the code is submitted to the Critic for review again, forming an iterative loop.
This loop may go through multiple rounds until the Critic believes that the visual effect of all pages has reached an acceptable standard, or until a preset maximum number of iterations is reached (for example, 5 rounds). Each iteration gradually improves the quality of the PPT, evolving from initially “functionally usable” to “visually professional.”
It is worth noting that you can also use a single Agent to repeatedly perform the render-modify loop, each time feeding the rendered images to a Vision LLM for self-review. But this approach tends to cause rapid context expansion—a complete PPT may contain dozens of slides, and each slide image consumes many tokens (a 1080p screenshot may use up several thousand tokens). Multiple iterations will quickly push the context over its limit. More seriously, when a large number of historical versions of rendered images accumulate in the context, the model may confuse differences between versions, or even hallucinate that an issue that has already been fixed still exists.
The Editor-Critic division-of-labor pattern avoids the problem of accumulating all historical rendered images in a single context by having the Critic focus on reviewing the current version. During each iteration, the Critic only needs to analyze the latest rendering results, so its context remains relatively clean. Although the Editor’s context will accumulate the Critic’s historical feedback, this feedback is structured text, which consumes far fewer tokens than images and is much easier for the model to understand and reason about. This design ensures review quality while enabling more efficient use of context.
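A minimal sketch of the resulting Editor-Critic loop, assuming hypothetical editor_agent and critic_agent components and Slidev’s export command for rendering (exact CLI flags may differ across Slidev versions):

```python
import subprocess

MAX_ROUNDS = 5

def render_slides(md_path: str, out_dir: str) -> None:
    """Render the deck to one PNG per slide via the Slidev CLI."""
    subprocess.run(["npx", "slidev", "export", md_path,
                    "--format", "png", "--output", out_dir], check=True)

def generate_deck(source_text: str, editor_agent, critic_agent) -> str:
    slides_md = editor_agent.draft(source_text)             # Proposer: first version
    for round_no in range(MAX_ROUNDS):
        with open("slides.md", "w", encoding="utf-8") as f:
            f.write(slides_md)
        render_slides("slides.md", f"render_round_{round_no}")
        feedback = critic_agent.review(f"render_round_{round_no}")  # list of SlideFeedback
        if not feedback:                                     # Reviewer is satisfied
            break
        slides_md = editor_agent.revise(slides_md, feedback)
    return slides_md
```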
Experiment requirements:
- Prepare an input with rich content (such as an extended abstract of an academic paper, including background, method, experiments, conclusions, etc., about 2000–3000 Chinese characters)
- Implement the Editor Agent so that it can convert the input content into Slidev-format slides, including title page, table-of-contents page, content pages, conclusion page, etc.
- Implement the Critic Agent so that it can invoke Slidev’s rendering tools to convert code into images and use a Vision LLM to perform visual quality assessment
- Implement the iterative loop mechanism, allowing the Editor to revise based on the Critic’s feedback until the quality meets the standard or the maximum number of iterations is reached
- Log the feedback content and modification actions of each iteration, showing the process of gradual PPT quality improvement
- Compare the single-Agent self-review pattern with the dual-Agent collaboration pattern in terms of context consumption, generation quality, and iteration efficiency
Expected acceptance results
- Show the complete iterative process: record each round of the Critic Agent’s visual evaluation feedback (such as “Page 3 has too much content,” “Page 7 font is too small”) and the corresponding adjustments by the Editor Agent
- Compare the single-Agent self-review pattern and the dual-Agent collaboration pattern: analyze the differences between the two in context consumption, generation quality, and iteration efficiency, and verify the advantage of the division-of-labor pattern in avoiding context bloat
- Demonstrate at least 3 rounds of the Render-Critique loop, showing how PPT quality gradually improves from initially “functionally usable” to “visually professional”
- The final PPT produced should meet professional standards: moderate content density, reasonable layout, clear fonts, and harmonious visuals
Book translation Agent
Experimental objective
Book translation is a typical complex task that requires collaboration among multiple Agents. Translating a technical book is not merely converting text from one language to another; it also requires ensuring consistency of technical terms, accuracy of context, and overall fluency of reading. The complexity of this task makes it an ideal scenario for validating the effectiveness of the Manager pattern.
Background knowledge
Manager Pattern: This is a multi-Agent collaboration architecture used to solve context management problems in complex, long tasks. The core idea is task decomposition and separation of responsibilities—a single Manager Agent coordinates multiple specialized Sub-Agents, each of which is responsible only for one sub-stage or sub-module of the overall task.
Key advantages of this pattern:
- Context isolation—each Sub-Agent works within an independent, simplified context, focusing only on its own subtask and avoiding interference from irrelevant information
- Manageability—the Manager’s context primarily contains task planning, execution status, and result indexes, rather than the complete content of all subtasks, keeping it within a controllable scope
- Scalability—multiple Sub-Agents can be executed in parallel, or Agent instances can be dynamically created and destroyed as needed
In the book translation scenario, the Manager Agent coordinates three types of specialized Agents: the Glossary Agent (extracts terminology and builds a glossary), the Translation Agent (translates a single chapter), and the Proofreading Agent (checks consistency across the full text). Agents exchange data through a shared file system rather than passing complete content in context.
Problems with the single-Agent solution: If a single Agent is used to handle the translation of an entire book, its context will continuously accumulate the entire book’s content, the glossary, thoughts during the translation process, and so on, and will easily exceed the context window. More seriously, when the context becomes very long, the Agent can easily “get lost”—it may forget the unified terminology conventions for the whole book or hallucinate.
Experimental content description
This experiment validates the effectiveness of the Manager pattern in controlling context expansion and improving task completion quality through a book translation task.
Consider translating an English technical book into Chinese (for example, Nathan Lambert’s RLHF Book). Suppose the book contains 10 chapters, each discussing a different topic, such as neural network basics, convolutional networks, recurrent networks, and attention mechanisms. A large number of technical terms appear repeatedly throughout the translation process; some terms have several commonly used translations, and a single choice must be applied consistently across the entire book.
An intuitive implementation is to use a single Agent to handle the entire translation process. Provide this Agent with a detailed prompt, instructing it to first browse the entire book to extract terminology and build a glossary, then translate chapter by chapter, and finally check for consistency. Provide necessary tools such as file reading and writing, glossary management, and text comparison. Then let the Agent autonomously execute the entire workflow.
However, this single-Agent solution faces serious context management problems. As the Agent processes each chapter, its context keeps expanding: the glossary for the entire book, the chapters already translated, the paragraph currently being processed, the thoughts during translation, and the results of tool calls. A technical book with 10 chapters, each containing perhaps 5,000–10,000 words, can easily exceed the model’s context window if all the content is included. More seriously, when the context becomes extremely long, the Agent easily “gets lost”—it may forget the unified terminology conventions for the whole book and use terms in Chapter 8 that are inconsistent with Chapter 2; or repeatedly check the same issue during proofreading, wasting computational resources; or even, due to scattered attention, hallucinate and “remember” a terminology translation rule that does not actually exist.
The Manager pattern elegantly solves these problems through task decomposition and separation of responsibilities. In this architecture, the system is designed as a Manager Agent coordinating the collaboration of multiple specialized Agents.
The Glossary Agent (terminology glossary Agent) is responsible for the first phase of work: it receives the content of the entire book (or chapter titles and summaries of key paragraphs), identifies all repeatedly occurring technical terms, and generates recommended Chinese translations for each term. This process may require the Glossary Agent to search professional dictionaries, refer to existing translation standards, or analyze the morphology of terms to infer reasonable translations. The Glossary Agent’s output is a structured glossary, possibly in JSON or CSV format, containing the English term, Chinese translation, part of speech, usage context, and other information. After this task is completed, the Glossary Agent’s work is done, and it can be destroyed to free up resources. The glossary is written to the shared file system for subsequent Agents to use.
Translation Agent (chapter translation Agent) is responsible for the actual translation work. The Manager will create an independent Translation Agent instance for each chapter (or reuse the same Agent while passing in different chapter content each time). The Translation Agent receives three inputs: the English content of the current chapter (read from the file system), the terminology glossary generated by the Glossary Agent (read from the file system), and the translation guidelines (such as target reader level and language style preferences, passed via tool parameters). Its task is to translate the chapter content into fluent Chinese, strictly using the prescribed translations when encountering terms in the glossary, and for new terms not in the glossary, infer translations from context and mark them for later review. The Translation Agent only focuses on a single chapter, and its context is not polluted by the content of other chapters, which makes the translation process more focused and accurate. After the translation is completed, the translated text is written to the shared file system (such as chapter1_zh.md), and newly discovered terms are appended to a review list.
The Manager can start Translation Agents for multiple chapters in parallel (if resources permit), or process them sequentially chapter by chapter. In either case, each Translation Agent works in its own independent context without interfering with each other.
The Proofreading Agent (full-text proofreading Agent) appears after all chapters have been translated. It receives the translations of all chapters (read from the file system) and the terminology glossary, and performs consistency checks: scanning the entire text to verify that all technical terms are translated according to the glossary, identifying possible inconsistencies (such as the same concept being translated differently in different chapters), and checking the fluency and readability of the translation (for example, whether there are awkward literal translations and whether the context is coherent). The Proofreading Agent generates a proofreading report listing all identified issues and their locations, and writes the report to the file system.
Based on this report, the Manager may send specific chapters back to the Translation Agent for revision (by reading the problem descriptions in the report, extracting the chapters that need modification, invoking the Translation Agent again and passing in the revision requirements), or directly apply minor fixes (if the issues are simple, the Manager can modify the files itself).
In this architecture, the Manager Agent plays the role of a project manager. Its context mainly contains: the overall task description and goals, the execution plan for each stage, the invocation records and return results of each specialized Agent, and the current task progress and status. The Manager does not need to store the complete translated content of each chapter; this content is stored in the file system, and the Manager only needs to maintain an index of the files (such as {"chapter1": "chapter1_zh.md", "chapter2": "chapter2_zh.md", ...}). This design keeps the Manager’s context within a manageable range, preventing overflow even when handling a large book.
More importantly, the use of specialized Agents achieves context isolation. The Glossary Agent only sees the content needed for terminology extraction, the Translation Agent only sees the current chapter and the glossary, and although the Proofreading Agent needs access to the full text, it only focuses on consistency checks. Each Agent works within a simplified, focused context, which not only improves efficiency but also reduces the likelihood of errors—Agents are less likely to be distracted by information overload.
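A minimal sketch of the Manager’s scheduling logic under this pattern, assuming hypothetical sub-agent wrappers (glossary_agent, translation_agent, proofreading_agent) that exchange data through the shared file system:

```python
import json
import pathlib

def manage_translation(book_dir, glossary_agent, translation_agent, proofreading_agent):
    chapters = sorted(pathlib.Path(book_dir).glob("chapter*.md"))

    # Phase 1: the Glossary Agent builds the glossary once, then is no longer needed.
    glossary = glossary_agent.extract_terms([c.read_text() for c in chapters])
    pathlib.Path("glossary.json").write_text(json.dumps(glossary, ensure_ascii=False))

    # Phase 2: one Translation Agent call per chapter; the Manager keeps only a file index.
    index = {}
    for ch in chapters:
        out = ch.with_name(ch.stem + "_zh.md")
        translation_agent.translate(src=str(ch), glossary="glossary.json", dst=str(out))
        index[ch.stem] = str(out)

    # Phase 3: full-text consistency check; flagged chapters go back for revision.
    report = proofreading_agent.check(list(index.values()), "glossary.json")
    for issue in report:                     # each issue names a chapter and a problem
        translation_agent.revise(index[issue["chapter"]], issue["description"],
                                 glossary="glossary.json")
    return index
```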
Experiment requirements:
- Select a technically deep book as the translation target, such as Nathan Lambert’s RLHF Book
- Implement the Manager Agent and design a clear task decomposition and Agent scheduling logic
- Implement the Glossary Agent so that it can extract technical terms from the entire book and generate a glossary
- Implement the Translation Agent so that it can translate a single chapter according to the glossary and mark newly appearing terms
- Implement the Proofreading Agent so that it can check terminology consistency and translation quality across the whole text
- Record the context consumption of each Agent and verify the effectiveness of the manager pattern in controlling context bloat
- Compare the single-Agent solution (if feasible) with the manager-pattern solution in terms of translation quality, execution efficiency, and resource consumption
Agent that collects information from multiple websites simultaneously
Experiment objective
This experiment explores the application of parallel multi-Agent execution in information-gathering scenarios.
Background knowledge
Parallel multi-Agent execution: When a task can be decomposed into multiple mutually independent subtasks, running multiple Agents in parallel can greatly improve efficiency. The key is that each Agent runs in an independent execution environment (process/thread), has independent resources (such as a browser session), and does not block each other. This is different from the serial scheduling of the manager pattern—parallel mode aims for simultaneous execution in time.
Orchestration Agent: Responsible for coordinating the execution of multiple parallel Agents. Its core responsibilities include: dynamically starting Agent instances and assigning tasks; monitoring the execution status of each Agent in real time; handling message passing and coordination between Agents; triggering cascading termination when conditions are met (for example, when one Agent succeeds, all other Agents are terminated); aggregating the results of all Agents and reporting them to the user.
Cascading termination mechanism: This is a key technical challenge in parallel execution. When an Agent completes its goal, it must immediately notify all other still-running Agents to stop execution to avoid resource waste. This requires Agents to be able to respond to external signals during execution and gracefully clean up resources (such as closing the browser and disconnecting) without leaving hanging processes. At the same time, it must handle race conditions—such as when multiple Agents succeed almost simultaneously.
Experiment description
This experiment uses a multi-website information-gathering task to verify the performance advantages of parallel execution over serial execution, as well as the capabilities of the Orchestration Agent in coordination and control.
Problem description:
Given multiple school websites of a university (for example, the School of Computer Science, School of Mathematics, School of Physics, School of Chemistry, School of Biology, etc., a total of 10 schools), the task is to search the faculty directories of these schools for a teacher with a specified name (such as “Zhang Wei”), and once found, return the school, position, research areas, and other information of that teacher.
If this task is executed serially—visiting each school’s website one by one, parsing the page, and searching for the name—it may take a long time (assuming each site takes an average of 30 seconds to process, 10 sites would take 5 minutes). Moreover, if the target teacher is found on the first school’s website, then visiting the remaining 9 school websites is a complete waste.
Experiment objective:
Build an Orchestration Agent that can:
- Dynamically start multiple parallel Computer Use Agents, each responsible for one school website
- Monitor the execution status of all Agents in real time
- Immediately notify other Agents to stop execution when any one Agent finds the target information
- Aggregate the results of the successful Agent and report them to the user
Core challenges:
1. Dynamic startup of parallel Agents
The Orchestration Agent needs to dynamically create 10 Computer Use Agent instances according to the task requirements (10 school websites). Each instance should be an independent process or thread with an independent browser session, able to run concurrently without blocking each other. When starting Agents, the Orchestration Agent needs to pass each instance: the target website URL, the teacher’s name to be searched, and a task identifier (for subsequent message routing).
2. Real-time monitoring of task progress
Each Computer Use Agent should periodically send status update messages during execution, such as: “loading website”, “parsing faculty directory”, “target not found, task completed”, “match found, detailed information as follows: …”. The Orchestration Agent receives these updates through a message bus and maintains a task status table to understand in real time which Agents are still running, which have completed, and which have encountered errors.
3. Cascading termination after success
When an Agent (suppose the Agent responsible for the School of Computer Science) successfully finds the target teacher, it sends a {"type": "target_found", "agent_id": "agent_3", "data": {...}} message to the bus. After receiving this message, the Orchestration Agent immediately performs the following operations:
- Sends a {"type": "terminate", "reason": "target_found_by_agent_3"} message to all other still-running Agents
- Each Agent that receives the termination message should gracefully stop its current operation, clean up resources (close the browser session), and send a {"type": "terminated"} confirmation message
- The Orchestration Agent waits for termination confirmations from all Agents (or times out), then aggregates the results
This cascading termination mechanism is the key difficulty of the experiment. It requires that:
- Agents can respond to external termination signals at any time during execution (similar to the interruption mechanism discussed in Chapter 4)
- Termination must be graceful, without leaving hanging processes or unclosed resources
- The Orchestration Agent must handle potential race conditions—what if two Agents find the target almost simultaneously?
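As a hedged sketch, the parallel launch and cascading termination can be illustrated with asyncio task cancellation standing in for an explicit message bus (search_school is a hypothetical per-site agent coroutine that would drive a real browser session):

```python
import asyncio

async def search_school(url: str, name: str, agent_id: str):
    # ... drive a browser session, parse the faculty directory, look for `name` ...
    await asyncio.sleep(1)                      # placeholder for real agent work
    return None                                 # or a dict with the teacher's details

async def orchestrate(urls, name, timeout_s=120):
    tasks = [asyncio.create_task(search_school(u, name, f"agent_{i}"))
             for i, u in enumerate(urls)]
    result = None
    for next_done in asyncio.as_completed(tasks, timeout=timeout_s):
        try:
            found = await next_done
        except (asyncio.TimeoutError, TimeoutError):
            break                               # overall deadline reached
        except Exception:
            continue                            # one site failing must not stop the rest
        if found:                               # first success wins
            result = found
            break
    for t in tasks:                             # cascading termination of the rest
        t.cancel()                              # cancelled agents must close their browsers
    await asyncio.gather(*tasks, return_exceptions=True)
    return result
```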
4. Handling failures and timeouts
The following exceptional situations may occur:
- A school website is inaccessible (network error, server down)
- The structure of a website does not match expectations, and the Agent cannot parse it correctly
- All Agents finish searching, but none finds the target teacher
The Orchestration Agent needs to design handling strategies for these situations:
- Set a timeout for each Agent, treating it as a failure if it times out
- When an Agent reports an error, record the error without affecting the execution of other Agents
- When all Agents have completed (regardless of success or failure), aggregate the results: if any Agent succeeded, return the success information; if all failed, report “target teacher not found” to the user along with a summary of failure reasons
Experiment requirements:
- Implement an Orchestration Agent that can dynamically start multiple parallel Agents
- Implement the Computer Use Agent (or Web Scraping Agent) so that it can access university school websites, parse faculty directories, and search for the target name
- Implement a message bus or messaging mechanism to support bidirectional communication between the Orchestration Agent and multiple sub-Agents
- Implement a cascading termination mechanism after success, ensuring that once the target is found, all other Agents can quickly stop
- Handle various exceptional situations (website access failure, parsing errors, all sites not finding the target)
- Record and compare the time differences between parallel and serial execution to verify the performance gains brought by parallelization
Difficulty: Hard
A user memory that understands you better
Objective
Apply the multi-round iterative retrieval capabilities of Agentic RAG to user conversation history to build a retrievable long-term memory system. Verify this method’s capabilities and limitations across the three levels of the user memory evaluation framework (basic recall, multi-session retrieval, proactive service).
Background Knowledge
Three levels of the user memory evaluation framework:
The first level is basic recall, requiring the Agent to accurately recall specific information provided by the user within a single session, such as simple factual queries like “What is my checking account number?”
The second level is multi-session retrieval, requiring the Agent to integrate information across multiple sessions that occur at different times. This includes two sub-challenges: first, integrating scattered information (for example, the user talks about two cars in different sessions, and the Agent needs to gain a comprehensive understanding of all the vehicles the user owns); second, handling conflicting facts (for example, multiple family members sequentially modify the same instruction in different sessions, and the Agent must determine which is the final valid one).
The third level is proactive service, requiring the Agent to proactively discover hidden connections between different pieces of information and provide early-warning suggestions. For example, the user mentioned in a conversation months ago that their passport is about to expire, and recently booked an international flight. The Agent should proactively connect these two facts and remind the user to renew their passport.
Advanced JSON Cards memory pattern: This is a structured method for representing user memory, storing the user’s core facts in JSON format. Each Card contains multiple fields: content (the factual content, such as “The passport will expire on February 18, 2026”), backstory (the source and context of the information), person (the people involved), relationship (the relationships between people), and other metadata. These JSON Cards are typically fixed in the Agent’s prompt as persistent context, enabling the Agent to always grasp an overview of the user’s core information.
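An illustrative card might look as follows (the field names follow the description above; the values are made up):

```python
passport_card = {
    "content": "The passport will expire on February 18, 2026",
    "backstory": "Mentioned while the user discussed renewal paperwork in an earlier session",
    "person": "user",
    "relationship": None,
    "source": "session_042, chunk 3",   # other metadata: where the fact came from
}
```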
Experiment Description
The core idea is to treat the user’s complete conversation history as a knowledge base. During the indexing phase, the system chunks and indexes the history using fixed windows (for example, every 20 turns of conversation); during the application phase, the Agent actively retrieves these “memories” via the search_user_memory tool.
Implementation of context-aware retrieval: For each conversation chunk, generate a prefix summary that includes background information such as time, people, and intent. For example, the isolated utterance “Okay, let’s book this one” becomes “[Context: The user is confirming a $500 one-way ticket from Shanghai to Seattle] Okay, let’s book this one.” This contextual prefix solves the fundamental flaw of traditional document chunking methods—the semantic information loss caused by separating tightly related context.
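A minimal sketch of this context-aware indexing step, assuming hypothetical summarize_context and embed wrappers and a generic vector store:

```python
def index_history(sessions, summarize_context, embed, store):
    """Chunk each session into fixed 20-turn windows and prefix each chunk with context."""
    for session in sessions:
        turns = session["turns"]
        for i in range(0, len(turns), 20):
            chunk = turns[i:i + 20]
            # e.g. "[Context: the user is confirming a $500 one-way ticket
            #        from Shanghai to Seattle]"
            prefix = summarize_context(session["metadata"], turns[:i], chunk)
            text = prefix + "\n" + "\n".join(f"{t['speaker']}: {t['text']}" for t in chunk)
            store.add(id=f"{session['id']}_chunk_{i // 20}",
                      text=text, embedding=embed(text))
    return store
```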
Context enhancement shows a decisive advantage in handling conflicting facts. In the test cases, the wife, the husband, and then the wife again modify the same wire transfer instruction, three times in sequence. The contextual prefixes (“Wife sets up initial wire transfer”, “Husband modifies wire transfer”, “Wife modifies again after husband’s modification”) provide the key clues for determining the final valid instruction.
Two-layer memory architecture: The experiment further explores a two-layer architecture combining Advanced JSON Cards with context-aware RAG. Advanced JSON Cards, as persistent context, store structured core facts (such as “The passport will expire in February 2026”, “A ticket to Tokyo on January 15 has been booked”), while context-aware RAG provides on-demand retrieval of unstructured conversational details.
In the third-level test, when the user asks “For my trip to Tokyo in January, is there anything else I need to prepare?”, the Agent’s workflow is: first, examine the JSON Cards in the persistent context to grasp the two core facts of “Tokyo trip (January 15)” and “passport information (expires on February 18)”; by comparing them, discover that the flight date is close to the passport expiration date and identify a potential risk; use RAG to retrieve relevant conversation details for confirmation and find the original segments where the ticket and passport were discussed as “evidence”; finally, proactively remind the user: “Your passport will expire one month after your trip. It is strongly recommended to apply for an expedited renewal immediately.”
Expected Acceptance Criteria
- After context enhancement, the Agent can correctly handle scenarios with conflicting facts and determine the final valid instruction.
- The two-layer memory architecture runs successfully: JSON Cards provide an overview of core facts, and RAG provides on-demand retrieval of conversation details.
- In the third-level test, the Agent can proactively discover hidden connections between different pieces of information and provide early-warning suggestions.
- Understand the value of coordinated work between structured knowledge management and unstructured information retrieval.
Agent that uses a computer while talking on the phone
Objective
In many real-world scenarios, task completion requires multiple capabilities to operate in parallel rather than being executed serially. Imagine a human assistant handling urgent matters for their boss: they might be on the phone with a client while looking up related documents on their computer and taking notes on key points of the conversation at the same time. This kind of “multitasking” is extremely challenging for a single Agent—if one Agent has to both handle real-time voice conversations and operate a computer interface, it will inevitably keep switching between the two tasks, causing pauses in the conversation or interruptions in computer operations, resulting in a very poor user experience.
Experiment Description
The core idea of multi-Agent parallel execution is: let different Agents each focus on tasks with high real-time requirements, coordinate via asynchronous message passing, and thereby achieve true parallel processing. In the typical “talking on the phone while using the computer” scenario, we need two independent Agents running simultaneously: one responsible for the phone conversation, and one responsible for computer operations. These two Agents are not just doing simple task allocation, but are specifically optimized for different interaction modalities—the Phone Agent requires low-latency speech recognition and synthesis, while the Computer Agent requires powerful visual understanding and operation planning capabilities.
Problem Description:
Imagine a scenario: an AI Agent needs to help a user complete an online booking task, such as filling out a complex flight reservation form. During this process, the Agent needs to operate the webpage while asking the user questions and confirming personal information (such as name, ID number, flight preferences, etc.) via phone.
This task poses a huge challenge for a single Agent. Both phone communication and computer operations require high real-time performance. If an Agent is concentrating on “looking at” the screen and clicking buttons, it cannot simultaneously listen to the user and respond, and vice versa. This leads to call stuttering or interrupted operations, resulting in a poor experience.
The goal of this experiment is to build a multi-agent system where two Agents work collaboratively to solve this “multitasking” problem. One Agent is responsible for the call, and the other is responsible for operating the computer. They communicate in real time and complete the task efficiently.
Core Challenges and Requirements:
1. Dual-Agent Architecture
You need to build two independent Agents:
Phone Agent: Responsible for voice calls with the user. It needs to be implemented using APIs based on ASR (Automatic Speech Recognition) + LLM (Large Language Model) + TTS (Text-to-Speech). It should be able to understand the user’s natural language responses, extract key information from them, and send that information to the Computer Agent via a messaging framework. At the same time, it needs to receive messages from the Computer Agent (such as what information is needed, what problems were encountered) and generate appropriate wording to ask or inform the user accordingly.
Computer Agent: Responsible for operating the browser on the computer to complete tasks such as filling out web forms. It is recommended to build it on top of existing browser-operation frameworks such as Anthropic Computer Use, browser-use, or similar frameworks. It should be able to understand the webpage structure, identify the form fields that need to be filled, execute filling operations based on the received information, and seek help from the Phone Agent when encountering problems.
2. Inter-Agent Collaborative Communication
The two Agents must be able to communicate efficiently in both directions. There are two ways to implement the communication mechanism:
Method One (Simple Solution): Implement point-to-point communication via tool calls. The Phone Agent’s toolset includes send_message_to_computer_agent(message), and the Computer Agent’s toolset includes send_message_to_phone_agent(message). When an Agent calls this tool, the message is added as a new input event to the target Agent’s trajectory.
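As a concrete picture of Method One, here is a minimal in-process sketch in Python. The tool names come from the description above; the queue wiring and return strings are illustrative assumptions rather than a prescribed implementation:

```python
import queue

# In-process sketch: each Agent owns an inbox, and the "send" tools simply push
# into the peer's inbox. In a real system these tools would go through whatever
# messaging framework the two Agents share.
phone_inbox: queue.Queue = queue.Queue()
computer_inbox: queue.Queue = queue.Queue()

def send_message_to_computer_agent(message: str) -> str:
    """Tool exposed to the Phone Agent: forward extracted information or requests."""
    computer_inbox.put({"from": "phone_agent", "content": message})
    return "message delivered to computer agent"

def send_message_to_phone_agent(message: str) -> str:
    """Tool exposed to the Computer Agent: ask the user (via the Phone Agent) for help."""
    phone_inbox.put({"from": "computer_agent", "content": message})
    return "message delivered to phone agent"

def drain_inbox(inbox: queue.Queue) -> list[dict]:
    """Called once per ReAct iteration: new messages become input events in the trajectory."""
    messages = []
    while not inbox.empty():
        messages.append(inbox.get_nowait())
    return messages
```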
Method Two (Enhanced Solution): Implement a lightweight message bus and an Orchestration Agent. The message bus is responsible for routing and distributing messages, and the Orchestration Agent is responsible for overall task coordination and state monitoring. All Agents communicate using a unified message format that includes fields such as sender, receiver, type, and content.
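For Method Two, a possible shape for the unified message format and a minimal message bus is sketched below. The four fields follow the description above; the class names, the broadcast convention, and the timestamp field are assumptions of this sketch:

```python
from dataclasses import dataclass, field
import queue
import time

@dataclass
class AgentMessage:
    # Unified envelope: the four fields named above, plus a timestamp for logging.
    sender: str       # e.g. "phone_agent", "computer_agent", "orchestrator"
    receiver: str     # a registered agent id, or "broadcast"
    type: str         # e.g. "info_request", "info_reply", "status_update"
    content: str
    timestamp: float = field(default_factory=time.time)

class MessageBus:
    """Minimal routing hub: every registered Agent gets its own inbox queue."""

    def __init__(self) -> None:
        self.inboxes: dict[str, queue.Queue] = {}

    def register(self, agent_id: str) -> queue.Queue:
        self.inboxes[agent_id] = queue.Queue()
        return self.inboxes[agent_id]

    def publish(self, msg: AgentMessage) -> None:
        if msg.receiver == "broadcast":
            targets = [aid for aid in self.inboxes if aid != msg.sender]
        else:
            targets = [msg.receiver]
        for agent_id in targets:
            self.inboxes[agent_id].put(msg)
```

In this arrangement, the Orchestration Agent could itself be registered on the bus and listen for status_update messages to monitor overall task progress.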
3. Parallel Work and Real-Time Performance
The key is that the two Agents must be able to work in parallel. While the Computer Agent is searching for page elements or entering text, the Phone Agent must remain online and maintain normal conversation with the user; for example, it might say, “Okay, I’m entering your name now… Could you please tell me your ID number?” This requires:
The two Agents run in independent execution threads or processes, each maintaining its own ReAct loop. The Phone Agent’s execution loop includes: receiving user speech → ASR transcription → LLM understanding and response generation → TTS synthesis → playback to the user → checking for messages from the Computer Agent. The Computer Agent’s execution loop includes: capturing the screen → Vision LLM understanding the page → operation planning → executing operations (clicks, inputs, etc.) → checking for messages from the Phone Agent.
The inputs to both Agents need to include information from the other Agent. The Phone Agent’s LLM input context should not only include the user’s speech transcription and conversation history, but also a specially marked field containing the latest message from the Computer Agent (for example, [FROM_COMPUTER_AGENT] Can't find the "Next" button, may need the user to confirm whether to continue). Similarly, the Computer Agent’s multimodal model input should not only include browser screenshots and operation history, but also messages from the Phone Agent (for example, [FROM_PHONE_AGENT] The user said the name is Zhang San, and the ID number is 123456).
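Putting these pieces together, the following is a minimal runnable skeleton of the two parallel loops using asyncio. The asr_listen, tts_speak, llm_respond, capture_screen, and plan_and_act functions are stand-in stubs (assumptions) for the real ASR/TTS, LLM, and browser-operation calls; the loops run until interrupted:

```python
import asyncio

phone_inbox: asyncio.Queue = asyncio.Queue()      # messages destined for the Phone Agent
computer_inbox: asyncio.Queue = asyncio.Queue()   # messages destined for the Computer Agent

# --- Stand-ins for the real ASR / TTS / LLM / browser calls (assumptions) ---
async def asr_listen() -> str:
    await asyncio.sleep(1.0)
    return "My name is Zhang San"

async def tts_speak(text: str) -> None:
    print(f"[PHONE >> user] {text}")

async def llm_respond(context: list[str]) -> str:
    await asyncio.sleep(0.3)
    return "Got it, I'm filling that in now. Could you tell me your ID number?"

async def capture_screen() -> bytes:
    await asyncio.sleep(0.5)
    return b"<screenshot>"

async def plan_and_act(screenshot: bytes, notes: list[str]) -> None:
    print(f"[COMPUTER] acting on page, peer notes: {notes}")

def drain(inbox: asyncio.Queue) -> list[str]:
    """Non-blocking collection of all pending peer messages."""
    msgs = []
    while not inbox.empty():
        msgs.append(inbox.get_nowait())
    return msgs

async def phone_agent_loop():
    history: list[str] = []
    while True:
        user_text = await asr_listen()
        peer = drain(phone_inbox)
        # Peer messages enter the LLM context under a specially marked field.
        context = history + [f"[FROM_COMPUTER_AGENT] {m}" for m in peer] + [f"User: {user_text}"]
        reply = await llm_respond(context)
        await tts_speak(reply)
        # Relay extracted facts to the Computer Agent (extraction logic omitted here).
        await computer_inbox.put(user_text)
        history += [f"User: {user_text}", f"Assistant: {reply}"]

async def computer_agent_loop():
    while True:
        screenshot = await capture_screen()
        notes = [f"[FROM_PHONE_AGENT] {m}" for m in drain(computer_inbox)]
        await plan_and_act(screenshot, notes)

async def main():
    # The two ReAct loops run as independent tasks and never block each other.
    await asyncio.gather(phone_agent_loop(), computer_agent_loop())

if __name__ == "__main__":
    asyncio.run(main())
```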
Expected Acceptance Criteria
- Demonstrate that the two Agents truly work in parallel: while the Phone Agent is conversing with the user, the Computer Agent is operating the browser and filling out forms, and they do not block each other.
- Demonstrate real-time communication between Agents: when the Computer Agent encounters a problem, it sends a message to the Phone Agent requesting information; after the Phone Agent receives the message, it asks the user and sends back the answer.
- Compare the total time cost between the single-Agent serial execution mode (finishing the call first to collect all information, then operating the computer) and the dual-Agent parallel mode, to verify the efficiency gains brought by parallelization.
- Record a complete task execution log, including each Agent’s ReAct loop, the time points and contents of message passing, and the total time to complete the task.
- Verify real-time performance: during the Computer Agent’s work, the Phone Agent can maintain smooth conversation with no obvious pauses or waiting.
A Computer-Operation Agent That Gets More Proficient with Use
Experiment Objective
This experiment applies the externalized learning method to computer operation scenarios, building a browser automation Agent that can learn from experience and become more proficient the more it is used. The intended shift from "thinking and executing" every task anew to "directly replaying" a learned procedure is a typical manifestation of capability solidification achieved through externalized learning: liberating repetitive processes from the model's temporary reasoning and turning them into independent external programs that can be executed precisely.
Background Knowledge
Core idea of externalized learning: Traditional Agents rely on in-context learning or parametric memory—by providing examples in the prompt, or relying on knowledge the model learned during training. Both approaches have limitations: in-context learning is constrained by context window size, and parametric memory cannot be dynamically updated. Externalized learning adopts a completely different strategy—it stores the Agent’s experience and skills in an external, editable knowledge base.
Voyager’s “explore–distill–reuse” paradigm: Voyager is an Agent system that autonomously explores in the game Minecraft and demonstrates the powerful potential of externalized learning. Its workflow includes three key stages:
Exploration stage: The Agent interacts with the environment to complete tasks such as “build a crafting table”. During this process, it determines which game actions to take through multi-step reasoning and eventually completes the task successfully.
Distillation stage: Once a task is successful, Voyager "distills" the entire operation sequence into a piece of executable JavaScript code, named as a skill function (e.g., craftWorkbench()). This code contains all the necessary steps to complete the task and is stored in an external skill library. The key point is that this code is independent and directly executable, without requiring the model to reason again.
Reuse stage: When facing a new task (such as "build a wooden sword, which requires a crafting table first"), Voyager retrieves the skill library and discovers that the skill craftWorkbench() already exists. It can directly call this function instead of re-reasoning about how to build a crafting table. This reuse greatly improves efficiency and stability.
The core value of this paradigm lies in “capability solidification”—transforming one-off, temporary reasoning processes into permanent, reusable capability units.
Experiment Description
This experiment applies the above externalized learning method to browser automation. The current computer-operation Agent has to perform full visual reasoning with a multimodal large model every time it executes a task, and must re-reason even when it has previously completed exactly the same task successfully. The goal of this experiment is to enable the Agent to learn from experience, distill successful web operation processes into replayable workflows, and directly reuse them when executing similar tasks later.
Problem background: When performing each task (such as web browsing or form filling), the current computer-operation Agent must run full visual reasoning with a multimodal large model—observe screen screenshots, decide where to click, and what text to input. Even if it has successfully completed exactly the same task before, the Agent cannot leverage that experience and must reason from scratch. This approach has three problems: low efficiency (each run requires multiple expensive LLM calls), instability (LLM randomness leads to inconsistent operation paths), and high cost (a large number of visual understanding calls). Essentially, this is because the Agent relies on in-context learning or parametric memory, without externalizing successful experience into reusable tools.
Experiment goal: Build a computer-operation Agent with learning ability so that it can:
- Use multimodal large-model reasoning to complete operations when executing a task for the first time, while capturing a stable operation process
- Store successful operation processes in a knowledge base using stable selectors (such as XPath and CSS Selector)
- Recognize task similarity on subsequent similar tasks and retrieve matching workflows from the knowledge base
- Directly replay the operation steps in the workflow, without calling the large model for step-by-step reasoning again, thus achieving fast and stable execution
Technical solution: The experiment is a secondary development based on the browser-use framework. Browser-use provides browser automation capabilities integrated with Playwright, and assigns temporary IDs to interactive elements on the page when interacting with multimodal LLMs. When the LLM outputs operation instructions (such as click(13), type(7, "test@example.com")), browser-use creates DOMHistoryElement objects to record detailed information about the element, including its XPath, CSS Selector, element type, text content, and so on. The core task of this experiment is to use this information to implement the learning and replay mechanisms.
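To make the recorded data concrete, here is a sketch of the structured step record the experiment could keep. The field names are this sketch's own schema, not browser-use internals; the locator values are assumed to be copied out of the element information that the framework records for each acted-on element:

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class RecordedStep:
    """One replayable operation, derived from the element info captured during learning mode."""
    action: str              # "click", "type", "select", ...
    xpath: str               # stable locator, preferred for replay
    css_selector: str = ""   # fallback locator
    value: str = ""          # parameter template, e.g. "{{recipient}}"
    postcondition: str = ""  # validation hint, e.g. "element_visible://input[@name='subjectbox']"

# Example record for "type the recipient address" (XPath value is a placeholder)
step = RecordedStep(
    action="type",
    xpath="//textarea[@name='to']",
    value="{{recipient}}",
    postcondition="element_visible://input[@name='subjectbox']",
)
print(json.dumps(asdict(step), indent=2))
```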
Implementation of the learning stage (corresponding to the "explore–distill" stage): When the Agent executes a task for the first time, the system runs in "learning mode" and captures the complete operation process as the web operation is carried out:
- The Agent completes the task through an observe–think–act loop with a multimodal large model. Each time the LLM decides to perform an action (click, input, etc.), the system extracts a stable identifier of the operated element from browser-use’s internal state (prioritizing XPath, since it is more robust to minor structural changes in the page).
- Each operation step is recorded as a structured step that includes: operation type (click, type, select, etc.), the target element’s XPath, related parameters (such as the text content to input), and post-execution state validation information (such as page URL changes or appearance of specific elements). This converts the action sequence into structured data that can be replayed.
- After the task is successfully completed, the system prompts the LLM to generate a semantic label and description for this task. The semantic label is an abstract expression of the task intent (such as “send email”), while the description contains the key components of the task (such as “recipient field, subject field, content field, send button”). This information will be used for future task matching.
- The step sequence, semantic label, and description are stored together in the knowledge base, forming a "workflow" entry (a possible schema is sketched right after this list). This workflow is an externalized, reusable operation capability.
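A possible shape for such a workflow entry, shown here as a plain Python dict that could be serialized to JSON. All field names and the Gmail-style XPath values are illustrative assumptions:

```python
# Illustrative knowledge-base entry for the "send email" workflow; the field names
# and XPath values are placeholders, not a fixed format.
workflow_entry = {
    "semantic_label": "send email",
    "description": "recipient field, subject field, content field, send button",
    "parameters": ["recipient", "subject", "body"],
    "steps": [
        {"action": "click", "xpath": "//div[text()='Compose']"},
        {"action": "type",  "xpath": "//textarea[@name='to']",            "value": "{{recipient}}"},
        {"action": "type",  "xpath": "//input[@name='subjectbox']",       "value": "{{subject}}"},
        {"action": "type",  "xpath": "//div[@aria-label='Message Body']", "value": "{{body}}"},
        {"action": "click", "xpath": "//div[text()='Send']"},
    ],
    "stats": {"times_replayed": 0, "marked_outdated": False},
}
```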
Implementation of the application stage (corresponding to the “retrieve–reuse” stage): When the Agent receives a new task, the system first attempts to retrieve a matching workflow from the knowledge base. The matching mechanism combines semantic similarity (comparing the embedding vector similarity between the new task description and stored workflow descriptions) and key-element checks (whether the new task involves similar operation objects and goals). If it finds a workflow whose match score exceeds the threshold, the Agent enters “replay mode”—this is exactly where externalized learning brings an efficiency leap:
- Execute the operations step by step in the order recorded in the workflow. As modern web pages are dynamically loaded, executing steps in rapid succession will lead to failure (the target element may not have loaded yet). Therefore, before executing each step, Playwright's waiting mechanisms must be used (such as page.locator(xpath).wait_for(state='visible', timeout=15000)) to ensure that the target element is loaded and interactive (a minimal replay sketch follows this list).
- For operations that involve parameters (such as text input), what is stored in the workflow is a parameterized template (such as "input {{email}} into the recipient field"). During replay, actual parameter values (such as "test@example.com") must be extracted from the current task instructions and filled into the template. This parameter extraction can be done with a simple LLM call, but does not require full visual reasoning, greatly reducing the cost.
- If a step fails during workflow replay (for example, the target element cannot be found or the wait times out), it indicates that the page structure may have changed and the stored workflow is outdated. At this point, the Agent should log the failure, mark this workflow as "possibly outdated", and fall back to learning mode to complete the task again through LLM reasoning, while generating a new workflow to replace the old one. This reflects a mechanism of continuous iterative improvement.
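A minimal replay sketch using Playwright's sync API, assuming the workflow_entry schema sketched earlier. Parameter extraction is reduced here to simple template substitution, and the exception handler stands in for the fallback-to-learning-mode mechanism described above:

```python
from playwright.sync_api import sync_playwright

def render(template: str, params: dict[str, str]) -> str:
    """Fill "{{name}}" placeholders with values extracted from the new task instruction."""
    for key, value in params.items():
        template = template.replace("{{" + key + "}}", value)
    return template

def replay_workflow(page, workflow: dict, params: dict[str, str]) -> bool:
    """Replay recorded steps without per-step visual reasoning; False means 'fall back to learning mode'."""
    for step in workflow["steps"]:
        locator = page.locator(step["xpath"])  # selectors starting with // are treated as XPath
        try:
            # Wait until the target element is loaded and interactable before acting.
            locator.wait_for(state="visible", timeout=15_000)
            if step["action"] == "click":
                locator.click()
            elif step["action"] == "type":
                locator.fill(render(step.get("value", ""), params))
        except Exception:
            # Element missing or wait timed out: the page structure may have changed,
            # so mark the workflow as possibly outdated and let the caller relearn it.
            workflow.setdefault("stats", {})["marked_outdated"] = True
            return False
    return True

# Usage sketch (URL, login handling, and workflow_entry are placeholders):
# with sync_playwright() as p:
#     page = p.chromium.launch(headless=False).new_page()
#     page.goto("https://mail.google.com")
#     ok = replay_workflow(page, workflow_entry,
#                          {"recipient": "another@example.com",
#                           "subject": "Follow-up Test",
#                           "body": "This is the second test email"})
```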
Acceptance scenario and metrics: Select a specific web-operation task for acceptance testing, such as sending an email via Gmail’s web interface.
First execution (learning stage):
- Task instruction: “Send an email to test@example.com with subject ‘Test Email’ and content ‘This is a test email to verify the Agent’s learning capability’”
- Observation and recording: Demonstrate how the Agent uses a multimodal LLM to observe the Gmail interface, identify and click the “Compose” button, identify the recipient input box and enter the email address, identify and fill the subject and body input fields, and identify and click the “Send” button. Record the entire process’s operation steps, time consumption, and number of LLM calls.
- Workflow generation: Show the generated workflow, including the XPath, operation type, and parameters of each step.
Repeated execution (application stage):
- Task instruction: “Send an email to another@example.com with subject ‘Follow-up Test’ and content ‘This is the second test email’”
- Workflow matching: Demonstrate how the system recognizes that the new task matches the stored “send email” workflow and extracts the new parameter values (recipient, subject, content).
- Fast replay: Demonstrate the Agent directly executing the steps in the workflow without requiring the LLM for visual reasoning. Each step only needs to wait for the element to load and then execute the operation; the entire process should be significantly faster than the first execution.
- Performance comparison: Record the time consumption and number of LLM calls for the second execution and compare them with the first execution.
Knowledge update:
- During repeated execution in the application stage, simulate a page redesign scenario (for example, manually modify the test page’s HTML so that the XPath of a certain button changes) and verify whether the Agent can detect workflow failure and fall back to the learning stage.
- Demonstrate how, after detecting failure, the Agent relearns a new workflow, updates the knowledge base, and uses the new workflow for the next execution.
An Agent That Can Create Agents
Experiment Objective
Build a Coding Agent with metaprogramming capabilities so that it can automatically create new Agent systems according to user requirements. The core challenge is ensuring that the generated Agents follow current best practices: standard context-management and tool-calling formats, up-to-date SOTA models and APIs, and sound Agent engineering patterns.
Background Knowledge
Challenges of metaprogramming: Allowing an Agent to directly generate Agent code faces a fundamental problem—the training data of large language models usually lags behind the latest technologies. If we let a Coding Agent write Agent code from scratch, it is very likely to use outdated API formats, deprecated model names, or architectural patterns that no longer represent best practices. For example, it might generate the deprecated OpenAI Function Calling format instead of the latest Tool Calling protocol.
Example-based code generation: An effective way to solve this problem is to provide the Coding Agent with a high-quality Agent implementation that follows current best practices as a reference example. This example code should include: correct message formats (such as the latest OpenAI Chat Completions API format), standard tool-calling protocols, recommended model choices (such as gpt-5, claude-4), good context-management patterns (such as properly handling multi-turn conversation history), and engineering practices such as error handling and logging.
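To make this concrete, below is a minimal sketch of the kind of tool-calling main loop such a reference example might contain, written against the OpenAI Python SDK's Chat Completions interface. The model name, system prompt, and web_search stub are placeholders, and since the whole point of this project is that such details drift over time, the reference repository should pin and periodically refresh its own versions:

```python
import json
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-5"  # placeholder: pin to whatever the reference example currently recommends

TOOLS = [{
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Search the web and return the top results as text.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

def web_search(query: str) -> str:
    return f"(stub) results for: {query}"  # real implementation supplied per project

def run_agent(user_request: str, max_turns: int = 8) -> str:
    # Multi-turn context management: the full history, including tool results, is kept in `messages`.
    messages = [{"role": "system", "content": "You are a helpful agent."},
                {"role": "user", "content": user_request}]
    for _ in range(max_turns):
        resp = client.chat.completions.create(model=MODEL, messages=messages, tools=TOOLS)
        msg = resp.choices[0].message
        if not msg.tool_calls:
            return msg.content or ""      # final answer, no further tool use
        messages.append(msg)              # keep the assistant's tool-call turn in context
        for call in msg.tool_calls:
            args = json.loads(call.function.arguments)
            result = web_search(**args) if call.function.name == "web_search" else "unknown tool"
            messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
    return "Stopped after reaching the turn limit."
```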
Experiment Description
The technical plan of this experiment is to provide the Coding Agent with a high-quality Agent implementation as a reference example. When asked to create a new Agent, the Coding Agent first copies this example code as the basic framework and then makes targeted modifications based on the user's specific requirements, instead of generating everything from scratch. This "copy-and-modify" pattern guarantees a lower bound on the quality of the generated code.
Experiment workflow:
- Preparation phase: Create a high-quality example Agent code, including a complete project structure (dependency management, configuration files, main program, tool definitions, etc.)
- Input requirements: Provide the Coding Agent with the requirement to create a new Agent (e.g., "Create an Agent that can search the web and summarize information")
- Observe the generation process: Record how the Agent understands the requirements and how it modifies the example code (such as adding a web_search tool, adjusting prompts, modifying the main loop logic)
- Quality check: Examine the quality of the generated code—whether the message format is standard, whether the tool invocation protocol is correct, whether the model selection is appropriate, and whether the code structure is clear
- Functional testing: Actually run the generated Agent and test whether it can successfully complete the specified tasks
Controlled experiment: Compare the two modes of “modifying based on an example” and “generating from scratch”:
- Code quality: Check the correctness of API formats, model names, error handling, and other aspects
- Development efficiency: Record the time from requirement input to a runnable Agent
- Success rate: Count whether the generated Agent can pass the test in one go, or how many rounds of debugging and fixing are needed
Expected Acceptance Results
- The Coding Agent can successfully create new Agents, and the generated code can run and complete basic tasks.
- The generated Agent code uses standard message formats and tool invocation protocols.
- The generated Agent uses the currently recommended models and APIs, rather than an outdated tech stack.
- The generated Agent can correctly manage multi-turn dialogue context and state.
- The “modifying based on an example” mode yields higher code quality and development efficiency than the “generating from scratch” mode.