Zhongguancun AI Academy & UCAS 2025 Summer AI Agent Practical Topics
The AI Agent hackathon at UCAS in February 2025 was a great success, so I will again host two AI Agent practical sessions: July 27-30, 2025 at Zhongguancun AI Academy, and July 31 to August 4 at UCAS.
Many thanks to Professor Zheng Shuxin, Vice Dean of Zhongguancun AI Academy, and Professor Liu Junming from UCAS for inviting me to host these two AI Agent practical activities.
All topics in this AI Agent practice will take you deep into the cutting-edge technologies for building the next generation of AI Agents. You will have the opportunity to practice:
- Multimodal models and thinking model applications: Build the “brain” of the agent with industry-leading multimodal models and thinking models such as Gemini 2.5 Pro and Claude 4 Sonnet.
- Real-time voice interaction: Integrate VAD, ASR, LLM, and TTS technology stacks to create real-time voice agents capable of streaming conversations.
- Autonomous GUI operation: Develop agents that can stably operate browsers and other GUIs to complete complex real-world tasks.
- Advanced Agent Architecture: Explore advanced architectures such as “fast and slow thinking,” “thinking while listening,” and multi-agent collaboration to equip agents with both real-time response and deep thinking capabilities.
- Learning from experience: Build agents that can learn from experience, allowing them to become more proficient with repeated tasks.
- Identifying authoritative information sources: Enable agents to accurately identify and adopt high-credibility information such as official documents and academic papers from vast amounts of information.
- Autonomous tool invocation and creation: Enable agents not only to use existing tools but also to autonomously learn and create new tools to solve open-ended problems.
Suggestions for AI-assisted programming: In this AI Agent practice, we encourage everyone to use AI-assisted programming, which means “developing agents with agents.” We recommend using Cursor for Vibe Coding, and here are some suggestions:
- Documentation first, code later: First, let Cursor write the design document. Your role is to provide improvement suggestions for the AI-generated design document and iterate with the AI until satisfied. Then, let Cursor write the code according to the final design document. During coding, always keep the design document in the agent’s context as a reference.
- Choose the right model: Do not use Cursor’s “auto” mode; be sure to choose a model with thinking ability (with a brain icon next to it), such as Claude 4 Sonnet.
- Test-driven: Be sure to have AI write and execute test cases for its code to ensure code quality.
Feel free to form teams and choose any one of the following topics to start your creative journey!
Topic Directory
- Topic 1: Real-time Voice Agent Combining Fast and Slow Thinking
- Topic 2: Interview Cheating Agent with Thinking While Listening
- Topic 3: Deep Search Agent Capable of Identifying Authoritative Information Sources
- Topic 4: Agent Capable of Operating Computers and Becoming More Proficient
- Topic 5: Agent Capable of Creating Tools Independently
- Topic 6: Agent Capable of Operating Computers While Making Phone Calls
Topic 1: Real-time Voice Agent Combining Fast and Slow Thinking
Objective: Build an advanced real-time voice dialogue system that integrates industry-standard voice processing technology stacks (VAD, ASR, TTS) and large language models (LLM) to achieve natural, low-latency human-machine voice interaction.
Core Challenge: The core of this topic is to implement a “Mixture-of-Thoughts” architecture that simulates the complex thinking process of humans. The system needs to run two thinking modes in parallel:
- Fast Response Path: Use low-latency models (such as GPT-4o, Gemini 2.5 Flash) to achieve instant feedback, handle simple queries, and maintain conversation fluency.
- Deep Thinking Path: Use more powerful SOTA models (such as GPT-4.1, Claude 4 Sonnet, Gemini 2.5 Pro) for complex reasoning and tool invocation (such as online search) to provide users with more accurate and in-depth answers.
Technical Requirements:
- Architecture: Implement a complete application with a server and web front end. The front-end interface should focus on functionality.
- Mixture-of-Thoughts: Implement the fast/slow dual-path collaboration described above. While the slow thinking path is processing, the fast thinking path should produce filler speech so the conversation is not interrupted. The slow thinking model must support streaming output, transmitting its intermediate reasoning to the fast thinking model in real time, so that the fast model can generate more meaningful filler words from these intermediate results (see the sketch after this list).
- Tool Invocation: The deep thinking model must have the ability to invoke external tools, with at least one real-time network search tool integrated.
- Voice Interaction: The system must integrate VAD (Voice Activity Detection) technology (Silero VAD is recommended) to achieve automatic voice segmentation and response without manual triggering by the user (a streaming VAD sketch also follows this list).
- Interruption Mechanism: Implement a controllable interruption function, allowing users to insert their speech while the AI is speaking.
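To make the dual-path mechanism concrete, here is a minimal sketch of the Mixture-of-Thoughts loop. It assumes an OpenAI-compatible streaming chat API; the model names, the 200-character summarization threshold, and the `speak` TTS callback are illustrative placeholders, not prescribed choices.

```python
import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI()  # any OpenAI-compatible endpoint works here

async def slow_think(question: str, thoughts: asyncio.Queue) -> None:
    """Deep-thinking path: stream intermediate reasoning into a queue."""
    stream = await client.chat.completions.create(
        model="gpt-4.1",  # placeholder slow model
        messages=[{"role": "user", "content": question}],
        stream=True,
    )
    async for chunk in stream:
        await thoughts.put(chunk.choices[0].delta.content or "")
    await thoughts.put(None)  # sentinel: slow thinking finished

async def fast_fill(thoughts: asyncio.Queue, speak) -> None:
    """Fast path: turn accumulated intermediate thoughts into filler speech."""
    await speak("Let me think...")  # initial filler
    buffer = ""
    while True:
        try:
            delta = await asyncio.wait_for(thoughts.get(), timeout=2.0)
        except asyncio.TimeoutError:
            continue
        if delta is None:
            break  # the slow path's final answer takes over from here
        buffer += delta
        if len(buffer) > 200:  # enough new reasoning to summarize aloud
            resp = await client.chat.completions.create(
                model="gpt-4o-mini",  # placeholder fast model
                messages=[{"role": "user", "content":
                           "In one short spoken sentence, tell the user what you "
                           f"are working on, based on these notes:\n{buffer}"}],
            )
            await speak(resp.choices[0].message.content)
            buffer = ""
```

Run both paths concurrently, e.g. `await asyncio.gather(slow_think(q, queue), fast_fill(queue, tts_speak))`; when `slow_think` finishes, hand its accumulated answer to TTS and stop the filler loop.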
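For the VAD requirement, a streaming sketch using Silero VAD's `VADIterator` (loaded via `torch.hub`, following the snakers4/silero-vad README) might look like the following; the 512-sample chunk size at 16 kHz matches recent Silero releases, but verify against the version you install.

```python
import torch

# Load the Silero VAD model and streaming helpers from torch.hub
model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')
(get_speech_timestamps, _, read_audio, VADIterator, _) = utils

vad = VADIterator(model, sampling_rate=16000)

def on_audio_chunk(chunk: torch.Tensor) -> None:
    """Feed consecutive 512-sample chunks of 16 kHz mono audio from the mic."""
    event = vad(chunk, return_seconds=True)
    if event and 'start' in event:
        pass  # user started speaking: begin buffering audio for ASR
    elif event and 'end' in event:
        pass  # user stopped: flush the buffered segment to ASR and respond
```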
Acceptance Criteria:
- Basic Latency: After the user completes a simple greeting (such as “Nice to meet you”), the system must generate a voice response within 2 seconds.
- Real-time Interaction: In multi-round dialogue games, the system must demonstrate quick understanding and reaction. For example, in a turn-taking counting game where “four” is skipped, after the user says “three,” the system must accurately respond “five” within 1.5 seconds.
- Mixture-of-Thoughts Capability:
- Basic Reasoning: For questions requiring logical reasoning, the system must quickly respond and provide an answer. For example, when a user asks, “What is 8 to the power of 6?”, the system must start responding within 2 seconds after the user finishes asking (using filler words if necessary) and provide the correct answer “262144” within 15 seconds.
- Tool Invocation: For questions requiring online queries (such as “What’s the weather like in Beijing today?”), the system must start responding within 2 seconds after the user finishes asking and return accurate weather information through API invocation within 15 seconds, without interrupting the conversation.
- Intelligent Filler Word Mechanism: When the slow thinking model is in deep thought, the fast thinking model is responsible for real-time dialogue with the user. If the initial filler words of the fast thinking model (such as “Let me think”) are finished and the slow thinking is not completed, the fast thinking model needs to receive the streaming intermediate thinking process of the slow thinking model and summarize it into natural speech to continue communicating with the user, ensuring the conversation is not interrupted. For example, when a user asks a complex question, the fast thinking might first say “Let me think,” and then continue with “This question requires considering several aspects… I’m analyzing the data…” based on the intermediate process of slow thinking.
Bonus Points:
- Smarter Interruption:
- Interruption Robustness: Judge from the content of the user’s speech whether it is a backchannel acknowledgment (e.g., “um,” “okay”) or a genuine objection (e.g., “wait a minute,” “no”), filtering out meaningless background noise and short affirmations and stopping its own speech only when the user shows a clear intention to interrupt.
- Example: When the AI is introducing: “This phone uses the latest A18 chip, with excellent performance…” and the user says “um,” the AI should continue; but if the user says “How about its battery life?”, the AI should stop the introduction immediately after the user says “How about” and switch to answering the battery life question.
- Smarter Speech:
- Turn-taking Judgment: The system should have the ability to predict the user’s dialogue intention. By analyzing the semantic completeness of the user’s spoken content, it can determine whether the user might continue speaking. For example, when the user says “I want to ask about…”, the system should judge that the intention is not fully expressed and choose to wait rather than immediately interrupt.
- Silence Management: After the user completes a full intention expression, if there is a long awkward silence, the system should be able to proactively and naturally start a new topic or ask follow-up questions to maintain the flow of the conversation. For example, after answering a question, if the user does not respond for a few seconds, the AI can say: “Do you have any other questions about this topic?”
Technical Selection Suggestions:
To ensure low latency, if access to overseas APIs is restricted, consider domestic providers (such as Doubao, Tongyi Qianwen, Siliconflow) for the LLM/TTS/ASR APIs. Doubao-1.5-Lite is recommended for fast thinking, and Doubao-1.5-Pro for slow thinking.
Topic 2: Interview Cheating Agent with Thinking While Listening
Problem Description:
In technical interviews, interviewers often pose complex, multi-part questions that require candidates to quickly understand, organize their thoughts, and respond clearly in a short time. This is a huge challenge for anyone.
The goal of this topic is to build an interview cheating agent with “thinking while listening” capabilities. It does not start working only after the interviewer finishes speaking but synchronously thinks and retrieves information while the other party is speaking, displaying preliminary thoughts and key points to the user in real-time, helping the user gain an advantage and respond calmly.
Core Requirements:
- Thinking While Listening: The agent must process streaming ASR results in real time. While the interviewer is still speaking, the agent should feed the incomplete sentence fragments into a thinking model to generate and iteratively update its “internal monologue/thoughts.” These thought processes are then used as context for generating the final answer (see the sketch after this list).
- Real-time Thought Display: The user’s interface needs an area to display the agent’s internal thought process in real-time. For example, when the interviewer mentions a technical term, related keywords, definitions, pros, and cons should immediately appear. This allows the user to see preliminary answer points while listening to the question.
- Dual-Response Mechanism:
- Quick Response: After the interviewer stops speaking, if the “thinking while listening” deep thought process is not yet complete, the system must provide a brief core point within 5 seconds to help the user quickly start speaking and avoid awkward silence.
- Deep Answer: After the quick response, the system continues deep thinking and starts streaming a more comprehensive and organized detailed answer within 15 seconds.
- UI Interface: A simple interface (web or desktop application) is needed to display the agent’s output.
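As a starting point, here is a minimal sketch of the “thinking while listening” update loop, assuming an OpenAI-compatible API; the model name, the 30-character debounce threshold, and the `ui.render_notes` hook are hypothetical.

```python
import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI()  # any OpenAI-compatible endpoint

MONOLOGUE_PROMPT = (
    "You are silently preparing interview answer notes. Given the partial, "
    "possibly incomplete question below, update terse bullet-point notes: "
    "key terms, definitions, pros and cons.\n\nPartial question:\n{q}"
)

async def update_monologue(partial_transcript: str, ui) -> None:
    """Regenerate the on-screen 'internal thoughts' from a partial transcript."""
    resp = await client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder: a fast model keeps update latency low
        messages=[{"role": "user",
                   "content": MONOLOGUE_PROMPT.format(q=partial_transcript)}],
    )
    ui.render_notes(resp.choices[0].message.content)  # hypothetical UI hook

async def on_asr_partial(transcript: str, state: dict, ui) -> None:
    """Called on every streaming ASR update; debounce, then re-think."""
    if len(transcript) - state.get("last_len", 0) < 30:
        return  # not enough new speech to justify another LLM call
    state["last_len"] = len(transcript)
    if (task := state.get("task")) and not task.done():
        task.cancel()  # a newer partial supersedes the in-flight update
    state["task"] = asyncio.create_task(update_monologue(transcript, ui))
```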
Acceptance Criteria:
- Scenario Simulation: A real person acts as a mock interviewer to conduct a mock interview.
- “Listen and Think” Ability Test:
- Test Case: A real interviewer asks a long question at a steady pace, such as: “Please explain in detail the architecture of the Transformer model, including its motivation, the principle of self-attention mechanism, the role of positional encoding, the concept of multi-head attention, and how the encoder and decoder are stacked. Finally, compare its advantages over earlier architectures like RNN.”
- Acceptance Requirement: During the interviewer’s question, the Agent’s “internal thinking” area needs to update in real-time. For example, when the interviewer mentions “attention mechanism,” keywords like “Query, Key, Value” should immediately appear on the interface.
- Complete Response Process Test:
- Test Case: The long question from the previous point can be used.
- Acceptance Requirement: After the interviewer finishes speaking, if the complete answer is still being generated, the Agent must display a core point summary within 3 seconds (e.g., “Key points: self-attention, multi-head attention, positional encoding, encoder-decoder architecture”). Subsequently, within 30 seconds, it should start streaming a structured detailed response.
Bonus Points:
- Non-question Filtering: The Agent can distinguish between formal questions and casual chat from the interviewer (e.g., “The weather is nice today”) and only triggers search and response for key questions.
- Context Understanding: The Agent can understand the context during the interview process. If the interviewer follows up on a detail, the Agent’s search and response should be based on the previous question, not a new start. For example, after answering the CAP theorem, if the interviewer asks, “Is the Raft protocol CP or AP?” the Agent should directly answer that Raft is a CP consistency algorithm.
- Internet Search Integration: The Agent’s deep thinking process can integrate real-time web search results to answer questions about the latest technologies or events.
Topic 3: Deep Search Agent Capable of Identifying Authoritative Information Sources
Background and Challenges:
Existing deep research agents (e.g., OpenAI Deep Research, Gemini Deep Research, Kimi Deep Research, GenSpark, etc.) show great potential in information retrieval but still face two core challenges: identifying authoritative sources and the timeliness of information. The internet is filled with outdated, inaccurate, and even contradictory information. The goal of this topic is to build an autonomous agent that can overcome these challenges, not only gathering information but also critically evaluating and logically reasoning to produce highly credible answers.
Task Description:
Develop a fully autonomous agent that, when given a complex query, can autonomously plan and execute a series of actions, including but not limited to:
- Calling search engines for preliminary exploration.
- Deeply browsing web content and extracting key information.
- Parsing and understanding PDF documents.
- Cross-verifying and logically reasoning based on collected multi-source information.
Core Requirements:
- Authoritative Source Identification: The agent must be able to identify and prioritize information from high-credibility sources such as official documents, academic papers, and authoritative technical communities, actively filtering out low-quality content from non-professional media or forums.
- Reasoning Ability: For questions where answers are not directly provided, the agent needs to have the ability to perform calculations or logical deductions from existing information.
- Generality: The agent’s solution strategy must be general, prohibiting hard-coded prompts or workflows for specific test questions.
Technical Implementation Plan:
- Context with URL: After the Agent browses a webpage, when providing the content (or a summary) of that page to the large language model (LLM), the complete URL of the page must be included as context. This lets the LLM judge the authority of the information source from the domain name (e.g., `nvidia.com`, `arxiv.org`).
- Prompting for Authority: The Agent’s System Prompt must contain explicit instructions guiding the LLM to think critically and requiring it to be accurate. For example:
“You are a meticulous investigator tasked with finding the most accurate answer to a question. You must prioritize information from official websites, official technical documents, or top academic conference papers. Information from third-party media, blogs, or forums should be considered unreliable unless corroborated by official sources. Before answering, please repeatedly cross-verify and clearly state your information sources.”
- Tool Set: The Agent must have at least the following tools (a sketch follows this list):
  - `search(query)`: Call a search engine to obtain search results.
  - `browse(url)`: Access a webpage and extract its text content.
  - `parse_pdf(url_or_path)`: Parse a PDF document and extract its text content.
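A minimal sketch of these three tools, illustrating the “Context with URL” rule by prepending the source URL to everything returned to the LLM. The use of `requests` and `pypdf` is an implementation assumption, and `search` is left to whatever search API you have access to.

```python
from io import BytesIO

import requests
from pypdf import PdfReader

def browse(url: str) -> str:
    """Fetch a page; prepend its URL so the LLM can judge source authority."""
    text = requests.get(url, timeout=30).text  # production code should strip HTML
    return f"[SOURCE URL: {url}]\n{text[:20000]}"

def parse_pdf(url_or_path: str) -> str:
    """Extract PDF text, again tagged with its source URL/path."""
    if url_or_path.startswith("http"):
        data = BytesIO(requests.get(url_or_path, timeout=60).content)
    else:
        data = url_or_path
    reader = PdfReader(data)
    body = "\n".join(page.extract_text() or "" for page in reader.pages)
    return f"[SOURCE URL: {url_or_path}]\n{body[:20000]}"

def search(query: str) -> str:
    """Call your search provider of choice; return titles, URLs, and snippets."""
    raise NotImplementedError  # depends on which search API you use
```

These functions are then exposed to the LLM through the model's function-calling interface, paired with the authority-focused system prompt above.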
Acceptance Criteria:
The agent must provide accurate answers to at least 3 out of the following 5 questions in a fully autonomous mode. For questions that cannot be confirmed, the agent should clearly state its inability to answer rather than hallucinate.
- Tensor FP16 FLOPS performance (without sparsity) of the NVIDIA RTX 4090
- List of OpenAI founders (co-chairs) and their current affiliations
- The current total number of transactions on the Ethereum blockchain
- The exact number of racks, Ascend 910 nodes, NPUs, CPUs, and UB Switches inside a supernode in Huawei CloudMatrix384
- What are the full names of Bojie Li’s wife and ex-girlfriend? (Bojie Li is Co-Founder and Chief Scientist of Pine AI)
Special Note:
- The above questions are quite challenging, and some answers require reasoning from original materials (e.g., the number of UB switches in CloudMatrix384).
- Direct searches may yield misleading results (e.g., for the FP16 tensor performance of the 4090, many media reports and materials give the with-sparsity figure or the non-tensor FP16 figure), so the agent needs excellent source identification capabilities.
- The setting of this topic references the industry-leading GAIA benchmark (https://huggingface.co/spaces/gaia-benchmark/leaderboard), aiming to challenge the comprehensive information acquisition capabilities of agents in the real world.
Recommended Reference Projects:
Encourage reference or secondary development based on the following cutting-edge open-source projects to tackle the challenges of this topic:
- https://github.com/inclusionAI/AWorld
- https://github.com/camel-ai/owl
- https://github.com/FoundationAgents/OpenManus
Topic 4: An Agent That Can Operate a Computer and Become More Proficient Over Time
Problem Description:
Current AI Agents do not typically learn from past experiences when performing repetitive tasks. Most Agents, regardless of how many times a task is executed, approach it as if it were the first time, making repeated mistakes.
The goal of this topic is to build an Agent that can learn from experience. After completing a task, the Agent can summarize successful experiences, forming “knowledge” or “shortcuts,” and directly utilize this knowledge when encountering the same or similar tasks next time, significantly improving execution speed and success rate.
Scenario Setting:
We will use real web application operation tasks as an example. You need to create an Agent to learn and accelerate these daily “computer usage” tasks.
- Target Application: Use a website with a clear function as an example, such as a weather query website, web-based email (like Gmail), online shopping, or ticket booking website.
- Build Agent:
- The Agent receives text task instructions, such as “Check the weather in Beijing for me” or “Send an email to test@example.com.”
- The Agent needs to have basic browser operation capabilities, able to browse web pages, take screenshots, input text, click links/buttons, etc.
- The Agent’s “thinking” ability relies on multimodal large models (e.g., GPT-4o, Claude 4 Sonnet, Gemini 2.5 Pro), deciding the next operation by sending webpage screenshots or DOM structures and instructions to the model.
- The Agent needs to implement a “Knowledge Base” for storing and retrieving learned operation workflows.
Technical Implementation Plan:
- Framework Suggestion: It is recommended to do secondary development based on the browser-use code repository, which provides basic browser operation capabilities built on Playwright.
- Learning Phase: Capturing Stable Operation Flows:
  - When interacting with the large model, `browser-use` assigns temporary numbers (e.g., `13`) to the clickable elements on the page. After the model outputs an instruction (e.g., `click(13)`), you need to capture that element's stable identifier from `browser-use`'s internal state.
  - `browser-use` creates a `DOMHistoryElement` object for each operated element, containing rich details such as `xpath` and `css_selector`.
  - Your task is to extract this XPath or CSS Selector after the Agent executes each step and store it, together with the operation type (`click`, `type`) and related parameters (e.g., input text), as one step of your workflow. XPath is recommended, as it is usually more robust to minor changes in page structure.
- Application Phase: Reliably Replaying Operation Flows:
  - When the Agent retrieves a matching workflow from the Knowledge Base, it executes the recorded steps in order.
  - Since modern web pages load dynamically, firing clicks and inputs back-to-back is likely to fail. Before executing each operation, you must wait for the target element to appear on the page and become interactive.
  - Playwright's `locator.wait_for()` method can implement this waiting mechanism. For example, before a click operation, use `page.locator(xpath).wait_for(state='visible', timeout=15000)` to ensure the element has loaded. (A combined sketch of recording, replaying, and the knowledge base follows this list.)
- Knowledge Base Design:
- The Knowledge Base can be a simple persistent storage (such as a JSON file or a small database).
- Its core function is to map the user’s “task intent” (e.g., “send an email”) to a specific operation workflow (i.e., the sequence of steps you recorded). You need to design a simple mechanism to match new tasks with stored intents.
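Below is a minimal combined sketch of the three pieces above: storing recorded steps, replaying them with Playwright waits, and a JSON-file knowledge base. The step dictionary layout and the `param` substitution scheme are illustrative assumptions; the XPath values are what you extract from browser-use's `DOMHistoryElement` objects, whose exact attribute names may vary across browser-use versions.

```python
import json
from pathlib import Path

from playwright.sync_api import sync_playwright

KB_PATH = Path("knowledge_base.json")  # the knowledge base is a plain JSON file

def save_workflow(intent: str, steps: list[dict]) -> None:
    """steps example: [{"action": "type", "xpath": "...", "param": "recipient"},
    {"action": "click", "xpath": "..."}], with XPaths taken from browser-use's
    DOMHistoryElement after each successful model action."""
    kb = json.loads(KB_PATH.read_text()) if KB_PATH.exists() else {}
    kb[intent] = steps
    KB_PATH.write_text(json.dumps(kb, indent=2))

def replay_workflow(intent: str, params: dict, start_url: str) -> None:
    """Replay a stored workflow, substituting new task parameters."""
    steps = json.loads(KB_PATH.read_text())[intent]
    with sync_playwright() as p:
        page = p.chromium.launch(headless=False).new_page()
        page.goto(start_url)
        for step in steps:
            locator = page.locator(f"xpath={step['xpath']}")
            # Dynamic pages: wait until the element is visible before acting
            locator.wait_for(state="visible", timeout=15000)
            if step["action"] == "click":
                locator.click()
            elif step["action"] == "type":
                # e.g., params={"recipient": "another@example.com"}
                locator.fill(params[step["param"]])
```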
Acceptance Criteria:
Choose a scenario for acceptance testing, such as “sending an email.”
First Task Execution (Learning Phase):
- Preconditions: The Agent’s knowledge base is empty.
- Task: Issue a command to the Agent, such as “Write an email to test@example.com with the subject ‘Hello’ and the content ‘This is a test email.’”
- Acceptance Requirements:
- Demonstrate the Agent completing the task through the “observe-think-act” loop of a multimodal large model.
- After the task is successful, show the operation process generated by the Agent and stored in the knowledge base, based on stable selectors (e.g., XPath).
- Record and report the time taken and the number of steps in this process.
Repeated Task Execution (Application Experience Phase):
- Preconditions: The knowledge base already contains the workflow for “sending an email.”
- Task: Issue a similar command to the Agent, such as “Send an email to another@example.com…”
- Acceptance Requirements:
- Demonstrate that the Agent can correctly match and retrieve the “send email” process from the knowledge base.
- Demonstrate that the Agent replays the recorded steps directly (including correctly filling in new email parameters) instead of invoking the large model for exploration from scratch.
- Compare and prove that the time taken and the number of steps for the second task execution are significantly less than the first.
Bonus Points:
- Knowledge Generalization: The Agent can apply learned knowledge to broader scenarios. For example, after learning “check Beijing weather,” it can reuse most of the process when asked to “check Shanghai weather,” only replacing the city name. After learning “send an email,” it can handle emails with different recipients and content.
- Knowledge Update and Verification: The Agent can realize that stored knowledge may be outdated (e.g., a website redesign makes the “send” button unfindable). When it discovers that a stored process is invalid, the Agent can record this failure, clear outdated knowledge, and revert to learning mode to find the correct operation process again.
Topic 5: An Agent That Can Create Its Own Tools
Problem Description:
Most current AI Agents rely on predefined toolsets, which limits their flexibility and scalability in handling open, complex tasks. When faced with a problem that no existing tool can solve, the Agent often becomes helpless.
The goal of this topic is to build an Agent with “self-evolution” capabilities, which can autonomously create and integrate new tools based on task requirements. We draw on the ideas from the Alita paper (Alita: Generalist Agent Enabling Scalable Agentic Reasoning with Minimal Predefinition and Maximal SELF-EVOLUTION), which emphasizes “minimizing predefinition and maximizing self-evolution.”
You need to build an Agent that does not rely on a large preset tool library. When encountering a new task, the Agent should be able to:
- Understand Task Requirements: Analyze the task and determine if new capabilities/tools are needed to complete it.
- Search for Solutions: Search the open-source world (e.g., GitHub) for relevant libraries or APIs to implement the required functionality.
- Learn and Integrate: Read documentation or code examples, learn how to use the found library/API, and dynamically generate code to call it, thereby “creating” a new tool.
- Execute the Task: Use the newly created tool to solve the problem.
Acceptance Criteria:
The Agent must fully autonomously create tools for at least one of the following tasks and execute it successfully, without hallucination. The Agent must be general-purpose; hardcoding tools or workflows for specific problems is not allowed.
Scenario 1: YouTube Video Content Understanding
- Task: Given a question: “In the YouTube 360 VR video from March 2018 narrated by the voice actor of Lord of the Rings’ Gollum, what number was mentioned by the narrator directly after dinosaurs were first shown in the video?”
- Agent Execution Process (Reference):
- The Agent analyzes the need to obtain subtitles from the YouTube video.
- The Agent autonomously searches online and finds a suitable Python library.
- The Agent reads the library’s usage, writes Python code to download the subtitles of the specified video.
- The Agent analyzes the subtitle content to find the answer to the question.
- Acceptance: The Agent outputs the correct answer “100000000.”
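For reference only (the point of the topic is that the agent finds a solution itself): the tool the agent generates might resemble the sketch below, assuming it discovers the youtube-transcript-api library. The classic `get_transcript` call is shown; releases from 1.0 onward use an instance-based `fetch` method instead, so check the version the agent installs.

```python
# Hypothetical tool the agent could generate after finding youtube-transcript-api
from youtube_transcript_api import YouTubeTranscriptApi

def fetch_subtitles(video_id: str) -> str:
    """Return timestamped subtitles for a YouTube video as plain text."""
    entries = YouTubeTranscriptApi.get_transcript(video_id)
    # Each entry is {'text': ..., 'start': seconds, 'duration': seconds}
    return "\n".join(f"[{e['start']:.1f}s] {e['text']}" for e in entries)
```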
Scenario 2: Real-time Financial Data Query
- Task: Given a question, such as “What is the latest stock price of NVIDIA (NVDA)?”
- Agent Execution Process (Reference):
- The Agent analyzes the need to query real-time stock prices, which requires calling a financial data API.
- The Agent autonomously searches online, finds a free stock data API, and learns its documentation.
- The Agent writes code to call the API according to its requirements (may need to register for a free API Key) to query the latest price of NVDA.
- The Agent parses the API return result to extract the price information.
- Acceptance: The Agent outputs the latest stock price of NVDA (allowing for slight delays or data source differences).
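Similarly for reference: the agent might settle on a key-free library such as yfinance rather than an API requiring registration. A sketch under that assumption:

```python
# Hypothetical tool the agent could generate after finding the yfinance library
import yfinance as yf

def latest_price(ticker: str) -> float:
    """Return the most recent daily closing price for a ticker."""
    data = yf.Ticker(ticker).history(period="1d")
    return float(data["Close"].iloc[-1])

print(latest_price("NVDA"))  # may lag real time by the data source's delay
```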
Bonus Points:
- Tool Reuse and Management: The Agent can save the tools created once (e.g., “YouTube Subtitle Fetcher” or “Stock Price Query Tool”). When encountering similar tasks in the future (e.g., querying another video or another stock), it can directly reuse existing tools instead of recreating them.
- Robustness Handling: The tools created by the Agent may encounter various errors during execution (e.g., API key expiration, network issues, library version incompatibility, etc.). The Agent can understand these errors and attempt to fix them, such as re-searching for other libraries/APIs.
Topic 6: An Agent That Can Operate a Computer While Making a Phone Call
Problem Description:
Imagine a scenario where an AI Agent needs to help a user complete an online booking task, such as filling out a complex flight booking form. During this process, the Agent needs to operate the webpage while simultaneously asking and confirming personal information (such as name, ID number, flight preferences, etc.) with the user over the phone.
This task poses a significant challenge for a single Agent because both phone communication and computer operation require high real-time performance. If an Agent is focused on “looking” at the screen and clicking buttons, it cannot simultaneously listen to the user’s speech and respond, and vice versa. This can lead to call stuttering or operation interruptions, resulting in a poor experience.
The goal of this topic is to build a multi-agent system with two Agents working collaboratively to solve this “multitasking” challenge. One Agent is responsible for making phone calls, and the other is responsible for operating the computer. They communicate in real-time to efficiently complete the task.
Core Challenges and Requirements:
- Dual-Agent Architecture: You need to build two independent Agents:
- Phone Agent: Responsible for voice communication with the user. You need to implement it based on ASR (Automatic Speech Recognition) + LLM (Large Language Model) + TTS (Text-to-Speech) APIs. It can refer to the implementation ideas from Topic 1.
- Computer Agent: Responsible for operating the computer’s browser to complete tasks such as filling out web forms. It is recommended to base it on existing browser operation frameworks, such as Anthropic Computer Use or browser-use or other similar frameworks.
- Agent Intercommunication:
- The two Agents must be able to communicate efficiently in both directions. When the Phone Agent obtains information from the user (e.g., “My name is Zhang San”), it needs to immediately “inform” the Computer Agent. When the Computer Agent encounters a problem during operation (e.g., “Cannot find the ‘Next’ button”) or completes a step, it also needs to “inform” the Phone Agent.
- This communication can be achieved through tool calls (Tool-use): the Phone Agent calls a `send_message_to_computer_agent` tool, and the Computer Agent calls a `send_message_to_phone_agent` tool (a minimal sketch follows this list).
- Parallel Work and Real-time Performance:
- The key is that the two Agents must be able to work in parallel. While the Computer Agent is searching for page elements or entering text, the Phone Agent must remain online and able to have a normal conversation with the user, such as saying “Okay, I’m filling in your name… What is your ID number?”
- The input of both Agents needs to include information from the other. For example, the input to the Phone Agent’s language model should include not only the user’s speech transcription but also a specially marked field containing messages from the Computer Agent (e.g., `[FROM_COMPUTER_AGENT] Cannot find the 'Next' button`). Similarly, the input to the Computer Agent’s multimodal model should include not only browser screenshots but also messages from the Phone Agent (e.g., `[FROM_PHONE_AGENT] User says the name is Zhang San`).
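As referenced above, here is a minimal sketch of the two messaging tools and the parallel agent loops, using in-process `asyncio` queues as the transport (any IPC mechanism would do); the loop bodies are placeholders for your ASR/TTS and browser logic.

```python
import asyncio

# In-process mailboxes backing the two messaging tools
to_computer: asyncio.Queue = asyncio.Queue()
to_phone: asyncio.Queue = asyncio.Queue()

async def send_message_to_computer_agent(msg: str) -> None:
    await to_computer.put(msg)

async def send_message_to_phone_agent(msg: str) -> None:
    await to_phone.put(msg)

def drain(q: asyncio.Queue, tag: str) -> list[str]:
    """Collect pending peer messages, tagged for insertion into model input."""
    msgs = []
    while not q.empty():
        msgs.append(f"[{tag}] {q.get_nowait()}")
    return msgs

async def phone_agent_loop() -> None:
    while True:
        peer = drain(to_phone, "FROM_COMPUTER_AGENT")
        # ... combine the ASR transcript + peer messages into the LLM input,
        # speak the reply via TTS, and call send_message_to_computer_agent()
        # whenever the user provides form information
        await asyncio.sleep(0.1)

async def computer_agent_loop() -> None:
    while True:
        peer = drain(to_computer, "FROM_PHONE_AGENT")
        # ... combine a browser screenshot + peer messages into the multimodal
        # model input, execute the chosen action, and report progress or
        # errors via send_message_to_phone_agent()
        await asyncio.sleep(0.1)

async def main() -> None:
    # The two agents run in parallel, so the call never blocks the conversation
    await asyncio.gather(phone_agent_loop(), computer_agent_loop())

asyncio.run(main())
```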
Reference Materials:
- You can refer to the design ideas of Google’s Agent-to-Agent (A2A) communication protocol.
Acceptance Criteria:
- Choose an Online Form: Choose a public website, such as a registration page, a booking form, or a contact us page.
- Demonstrate Collaborative Workflow:
- After starting the system, the Phone Agent proactively calls the user (played by a real person) or starts a voice conversation, explaining the task goal (“Hello, I will help you fill out the XX form”) and begins asking for the first required item (e.g., “What is your name?”).
- After the user answers, the Phone Agent immediately transmits the information to the Computer Agent.
- After receiving the information, the Computer Agent finds the corresponding input box in the browser and fills it in.
- During the Computer Agent’s operation, the Phone Agent should not be silent and can provide feedback to the user (“Okay, the name has been filled in.”) and then ask the next question.
- The entire form-filling process should be smooth, with no significant blocking between phone communication and computer operation.
- Demonstrate Exception Handling:
- When the Computer Agent encounters a situation it cannot handle (e.g., the user-provided information format is incorrect, causing a webpage error), it should inform the Phone Agent of this error.
- After receiving the error, the Phone Agent can relay the problem to the user and request new information (e.g., “Sorry, the email format you provided seems incorrect. Can you say it again?”).