Zhongguancun Artificial Intelligence Academy & UCAS 2025 Summer AI Agent Practical Topics
The AI Agent Hackathon at UCAS in February 2025 was a great success, so I will again host two AI Agent practical sessions: from July 27 to 30 at Zhongguancun Artificial Intelligence Academy, and from July 31 to August 4 at UCAS.
Many thanks to Professor Zheng Shuxin, Vice Dean of Zhongguancun Artificial Intelligence Academy, and Professor Liu Junming of UCAS for inviting me to host these two AI Agent practical activities.
All the topics of this AI Agent practice will take you deep into the cutting-edge technology of building the next generation of AI Agents. You will have the opportunity to practice:
- Multimodal models and thinking model applications: Build the “brain” of the agent with industry-leading multimodal models and thinking models such as Gemini 2.5 Pro and Claude 4 Sonnet.
- Real-time voice interaction: Integrate VAD, ASR, LLM, and TTS technology stacks to create real-time voice agents capable of streaming conversations.
- Autonomous operation of graphical interfaces: Develop agents that can stably operate GUIs such as browsers to complete complex real-world tasks.
- Advanced Agent Architecture: Explore advanced architectures such as “fast and slow thinking,” “thinking while listening,” and multi-agent collaboration to give agents the ability to respond in real-time and think deeply.
- Learning from experience: Build agents that can learn from experience, allowing them to become more proficient in repetitive tasks.
- Identifying authoritative information sources: Enable agents to accurately identify and adopt high-credibility information such as official documents and academic papers from vast amounts of information.
- Autonomous tool invocation and creation: Allow agents not only to use existing tools but also to autonomously learn and create new tools to solve open-ended problems.
Suggestions on AI-assisted programming: In this AI Agent practice, we encourage everyone to use AI-assisted programming, which means “developing agents with agents.” We recommend using Cursor for Vibe Coding, and here are some suggestions:
- Documentation first, code later: Let Cursor write the design document first. Your role is to provide improvement suggestions for the AI-generated design document and iterate with the AI until satisfied. Then, let Cursor write the code according to the final design document. During coding, always keep the design document in the agent’s context as a reference.
- Choose the right model: Do not use Cursor’s “auto” mode; be sure to choose a model with thinking ability (with a brain icon next to it), such as Claude 4 Sonnet.
- Test-driven: Be sure to have AI write and execute test cases for its code to ensure code quality.
Feel free to form teams and choose any of the following topics to start your creative journey!
Topic Directory
- Topic 1: Real-time Voice Agent Combining Fast and Slow Thinking
- Topic 2: An Interview Cheating Agent That Thinks While Listening
- Topic 3: Deep Search Agent Capable of Identifying Authoritative Information Sources
- Topic 4: An Agent That Can Operate a Computer and Become More Proficient Over Time
- Topic 5: An Agent That Can Create Its Own Tools
- Topic 6: Agent Operating a Computer While on a Call
Topic 1: Real-time Voice Agent Combining Fast and Slow Thinking
Objective: Build an advanced real-time voice dialogue system that integrates industry-standard voice processing technology stacks (VAD, ASR, TTS) and large language models (LLM) to achieve natural, low-latency human-machine voice interaction.
Core Challenge: The core of this topic is to implement a “Mixture-of-Thoughts” architecture that simulates the complex thinking process of humans. The system needs to run two thinking modes in parallel:
- Fast Response Path: Use low-latency models (such as GPT-4o, Gemini 2.5 Flash) to provide instant feedback, handle simple queries, and maintain conversation fluency.
- Deep Thinking Path: Use more powerful SOTA models (such as GPT-4.1, Claude 4 Sonnet, Gemini 2.5 Pro) for complex reasoning and tool invocation (such as online search) to provide users with more accurate and in-depth answers.
Technical Requirements:
- Architecture: Implement a complete application with both server-side and web front-end. The front-end interface focuses on functionality.
- Mixture-of-Thoughts: Implement the above fast and slow dual-path collaborative working mechanism. When the slow thinking path is processing, the fast thinking path needs to provide filler words to avoid conversation interruption. The slow thinking model must support streaming output, transmitting its intermediate thinking process in real-time to the fast thinking model, allowing the fast thinking model to generate more meaningful filler words based on these intermediate results. (A minimal sketch of this dual-path mechanism appears after this list.)
- Tool Invocation: The deep thinking model must have the ability to call external tools, integrating at least one real-time network search tool.
- Voice Interaction: The system needs to integrate VAD (Voice Activity Detection) technology (recommended to use Silero VAD) to achieve automatic voice segmentation and response without manual user triggering.
- Interruption Mechanism: Implement a controllable interruption function, allowing users to insert their speech while the AI is speaking.
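To make the Mixture-of-Thoughts requirement concrete, here is a minimal sketch of the dual-path loop. It assumes an OpenAI-compatible streaming chat API; the model names, the speak() TTS helper, and the prompts are placeholders you would replace with your own stack.

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def slow_think(question: str, thoughts: asyncio.Queue) -> None:
    """Stream the slow model's intermediate reasoning tokens into a queue."""
    stream = await client.chat.completions.create(
        model="slow-thinking-model",  # placeholder for a reasoning-capable model
        messages=[{"role": "user", "content": question}],
        stream=True,
    )
    async for chunk in stream:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            await thoughts.put(delta)
    await thoughts.put(None)  # sentinel: slow path finished

async def fast_fill(thoughts: asyncio.Queue, speak) -> None:
    """While the slow path runs, turn its partial thoughts into short filler speech."""
    await speak("Let me think about that.")  # initial filler word
    buffer = ""
    while True:
        token = await thoughts.get()
        if token is None:
            break
        buffer += token
        if len(buffer) > 200:  # enough new reasoning accumulated for a filler update
            resp = await client.chat.completions.create(
                model="fast-model",  # placeholder for a low-latency model
                messages=[
                    {"role": "system", "content": "Summarize the partial reasoning below "
                                                  "into one short, natural spoken sentence."},
                    {"role": "user", "content": buffer},
                ],
            )
            await speak(resp.choices[0].message.content)
            buffer = ""
```

Run slow_think and fast_fill concurrently (e.g., with asyncio.gather); once the slow path finishes, its final answer replaces the filler speech as the definitive reply.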
Acceptance Criteria:
- Basic Latency: After the user completes a simple greeting (such as “Nice to meet you”), the system must generate a voice response within 2 seconds.
- Real-time Interaction: In multi-round dialogue games, the system needs to demonstrate quick understanding and reaction capabilities. For example, in the “turn-taking counting (skip 4)” game, after the user says “three,” the system must accurately respond with “five” within 1.5 seconds.
- Mixture-of-Thoughts Capability:
- Basic Reasoning: For questions requiring logical reasoning, the system must quickly respond and provide answers. For example, when a user asks, “What is 8 to the power of 6?”, the system must start responding within 2 seconds after the user finishes asking (using filler words if necessary) and provide the correct answer “262144” within 15 seconds.
- Tool Invocation: For questions requiring online queries (such as “What’s the weather like in Beijing today?”), the system must start responding within 2 seconds after the user finishes asking and return accurate weather information through API calls within 15 seconds, without interrupting the conversation.
- Intelligent Filler Word Mechanism: When the slow thinking model is deeply thinking, the fast thinking model is responsible for real-time dialogue with the user. If the initial filler words of the fast thinking model (such as “Let me think”) are finished, and the slow thinking is not completed, the fast thinking model needs to be able to receive the streaming intermediate thinking process of the slow thinking model and summarize it into natural speech to continue communicating with the user, ensuring the conversation is not interrupted. For example, when a user asks a complex question, the fast thinking might first say “Let me think,” and then continue with “This question requires considering several aspects… I’m analyzing the data…” based on the intermediate process of slow thinking.
Bonus Points:
- Smarter Interruption:
- Interruption Robustness: Based on the content of the user’s speech, filter out meaningless background noise and short affirmations (such as “um” or “okay” for confirmation, versus “wait” or “no” for rebuttal), stopping the AI’s speech only when the user shows a clear intention to interrupt.
- Example: AI is introducing: “This phone uses the latest A18 chip, with excellent performance…” At this point, if the user says “um,” AI should continue; but if the user says “How about its battery life?”, AI should immediately stop the introduction after the user says “How about” and switch to answering the battery life question.
- Smarter Speech:
- Turn-taking Judgment: The system needs to have the ability to predict the user’s dialogue intention. By analyzing the semantic completeness of the user’s spoken content, determine whether the user may continue speaking. For example, when the user says “I want to ask about…”, the system should judge that the intention is not fully expressed and choose to wait rather than immediately interrupt.
- Silence Management: After the user completes a complete intention expression, if there is a long awkward silence, the system should be able to proactively and naturally start a new topic or ask follow-up questions to maintain the flow of the conversation. For example, after answering a question, if the user does not respond for a few seconds, AI can say: “Do you have any other questions about this topic?”
Technical Selection Suggestions:
To ensure low latency, if access to overseas APIs is restricted, consider using domestic service providers (such as Doubao, Tongyi Qianwen, Siliconflow) for LLM/TTS/ASR APIs. Fast thinking is recommended to use Doubao-Seed-1.6-flash, and slow thinking is recommended to use Doubao-Seed-1.6-thinking.
Reference Code:
Topic 2: An Interview Cheating Agent That Thinks While Listening
Problem Description:
In technical interviews, interviewers often pose complex, multi-part questions that require candidates to quickly understand, organize their thoughts, and respond clearly in a short time. This is a huge challenge for anyone.
The goal of this topic is to build an interview cheating agent with “thinking while listening” capability. It does not start working after the interviewer finishes speaking but synchronizes thinking and information retrieval while the other party is speaking, displaying preliminary thoughts and key points to the user in real-time, helping the user gain an advantage and respond calmly.
Core Requirements:
- Thinking While Listening: The agent must be able to process streaming ASR results in real-time. While the interviewer is speaking, the agent should feed incomplete sentence segments into a Thinking Model to generate and iteratively update its “internal monologue/thoughts.” These thought processes are then used as context for generating the final answer. (A minimal sketch of this loop appears after this list.)
- Real-time Thought Display: The user’s interface needs an area to display the agent’s internal thought process in real-time. For example, when the interviewer mentions a technical term, related keywords, definitions, pros, and cons should immediately appear. This allows the user to see preliminary answer points while listening to the question.
- Dual-Response Mechanism:
- Quick Response: After the interviewer stops speaking, if the “thinking while listening” deep thinking process is not yet complete, the system must provide a brief core point within 5 seconds to help the user quickly start speaking and avoid awkward silence.
- Deep Answer: After the quick response, the system continues deep thinking and starts streaming a more comprehensive and well-organized detailed answer within 15 seconds.
- UI Interface: A simple interface (web or desktop application) is needed to display the agent’s output.
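As one way to realize the “thinking while listening” requirement above, the sketch below refreshes the internal monologue every time the streaming ASR emits an updated partial transcript. asr_partials(), llm(), and show_thoughts() are hypothetical helpers standing in for your streaming ASR client, a low-latency LLM call, and the UI thought panel.

```python
import asyncio

async def listen_and_think(asr_partials, llm, show_thoughts) -> str:
    """Keep an up-to-date internal monologue while the interviewer is still speaking."""
    monologue = ""
    async for partial_transcript in asr_partials():  # e.g. "Please explain the Transformer..."
        prompt = (
            "The interviewer is still speaking. Partial question so far:\n"
            f"{partial_transcript}\n\n"
            "Current draft notes:\n"
            f"{monologue}\n\n"
            "Update the notes: list key terms, definitions, and likely answer points."
        )
        monologue = await llm(prompt)        # use a fast model to keep latency low
        show_thoughts(monologue)             # render in the UI thought panel in real time
    return monologue                          # reused as context for the full answer
```

The returned monologue is then handed to the deep-thinking model as context, so the quick response and the detailed answer both build on what was already worked out while listening.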
Acceptance Criteria:
- Scenario Simulation: A real person acts as a mock interviewer to conduct a mock interview.
- “Listening and Thinking” Ability Test:
- Test Case: A real interviewer asks a long question at a steady pace, such as: “Please explain in detail the architecture of the Transformer model, including its motivation, the principle of self-attention mechanism, the role of positional encoding, the concept of multi-head attention, and how the encoder and decoder stack work. Finally, compare its advantages over earlier architectures like RNN.”
- Acceptance Requirement: During the interviewer’s question, the Agent’s “internal thinking” area needs to update in real-time. For example, when the interviewer mentions “attention mechanism,” keywords like “Query, Key, Value” should immediately appear on the interface.
- Complete Response Process Test:
- Test Case: The long question from the previous point can be used.
- Acceptance Requirement: After the interviewer finishes speaking, if the complete answer is still being generated, the Agent must display a core point summary within 3 seconds (e.g., “Key points: self-attention, multi-head attention, positional encoding, encoder-decoder architecture”). Subsequently, within 30 seconds, start streaming a structured detailed response.
Bonus Points:
- Non-question Filtering: The Agent can distinguish between formal questions and casual chat from the interviewer (e.g., “The weather is nice today”) and only triggers search and response for key questions.
- Context Understanding: The Agent can understand the context during the interview process. If the interviewer follows up on a detail, the Agent’s search and response should be based on the previous question, not a new start. For example, after answering the CAP theorem, if the interviewer asks, “Is the Raft protocol CP or AP?”, the Agent should directly answer that Raft is a CP consistency algorithm.
- Internet Search Integration: The Agent’s deep thinking process can integrate real-time internet search results to answer questions about the latest technologies or events.
Reference Code:
Topic 3: Deep Search Agent Capable of Identifying Authoritative Information Sources
Background and Challenges:
Existing deep research agents (e.g., OpenAI Deep Research, Gemini Deep Research, Kimi Deep Research, GenSpark, etc.) show great potential in information retrieval but still face two core challenges: discerning the authority of information sources and the timeliness of information. The internet is filled with outdated, inaccurate, and even contradictory information. The goal of this topic is to build an autonomous agent that can overcome these challenges, not only gathering information but also critically evaluating and logically reasoning to produce highly credible answers.
Task Description:
Develop a fully autonomous agent that, when given a complex query, can autonomously plan and execute a series of actions, including but not limited to:
- Calling search engines for preliminary exploration.
- Deeply browsing web content and extracting key information.
- Parsing and understanding PDF documents.
- Cross-verifying and logically reasoning based on collected multi-source information.
Core Requirements:
- Authority Source Identification: The agent must be able to identify and prioritize information from high-credibility sources such as official documents, academic papers, and authoritative technical communities, actively filtering out low-quality content from non-professional media or forums.
- Reasoning Ability: For questions where answers are not directly provided, the agent needs to have the ability to compute or logically deduce from existing information.
- Generality: The agent’s solution strategy must be general, prohibiting hard-coded prompts or workflows for specific test questions.
Technical Implementation Plan:
- Context with URL: After the Agent browses a webpage, when providing the content (or summary) of that page to the large language model (LLM), the complete URL of the page must be passed as context information. This allows the LLM to judge the authority of the information source based on the domain name (e.g., nvidia.com, arxiv.org).
- Prompting for Authority: The Agent’s System Prompt must include clear instructions guiding the LLM to think critically and requiring it to be accurate. For example:
“You are a meticulous investigator, and your task is to find the most accurate answer to the question. You must prioritize information from official websites, official technical documents, or top academic conference papers. Information from third-party media, blogs, or forums should be considered unreliable unless corroborated by official sources. Before answering, please repeatedly cross-verify and clearly state your information sources.”
- Tool Set: The Agent must have at least the following tools:
- search(query): Call a search engine to get search results.
- browse(url): Visit a webpage and extract its text content.
- parse_pdf(url_or_path): Parse and extract text content from a PDF document.
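A minimal sketch of the three required tools follows. requests, BeautifulSoup, and pypdf are one possible choice of libraries; the search() body is left as a placeholder for whichever search API you adopt. Note that browse() returns the URL together with the page text, so the LLM always sees the source domain, as required by the Context-with-URL point above.

```python
import io
import requests
from bs4 import BeautifulSoup
from pypdf import PdfReader

def search(query: str) -> list[dict]:
    """Return a list of {title, url, snippet} dicts from your chosen search API (placeholder)."""
    raise NotImplementedError("plug in the search API your provider offers")

def browse(url: str) -> dict:
    """Fetch a page and return its URL alongside the extracted text."""
    html = requests.get(url, timeout=30).text
    text = BeautifulSoup(html, "html.parser").get_text(separator="\n", strip=True)
    return {"url": url, "content": text[:20000]}  # keep the URL in the LLM context

def parse_pdf(url_or_path: str) -> str:
    """Extract text from a local or remote PDF document."""
    if url_or_path.startswith("http"):
        data = io.BytesIO(requests.get(url_or_path, timeout=60).content)
        reader = PdfReader(data)
    else:
        reader = PdfReader(url_or_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)
```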
Acceptance Criteria:
The agent must provide accurate answers to at least 3 out of the following 5 questions in a fully autonomous mode. For questions that cannot be confirmed, the agent should clearly state its inability to answer rather than hallucinate.
- Tensor FP16 FLOPS performance (without sparsity) of the NVIDIA RTX 4090
- List of OpenAI founders (co-chairs) and their current affiliations
- The current total number of transactions on the Ethereum blockchain
- The exact number of racks, Ascend 910 nodes, NPUs, CPUs, and UB Switches inside a supernode in Huawei CloudMatrix384
- What are the full names of Bojie Li’s wife and ex-girlfriend? (Bojie Li is Co-Founder and Chief Scientist of Pine AI)
Special Notes:
- The above questions are quite challenging, and some answers require reasoning from original materials (e.g., the number of UB switches in CloudMatrix384).
- Direct searches may yield misleading results (e.g., the FP16 performance of the 4090, where many media reports and materials list performance with sparsity or non-tensor FP16 performance), requiring the agent to have excellent source identification capabilities.
- This topic’s setup references the industry-leading GAIA benchmark (https://huggingface.co/spaces/gaia-benchmark/leaderboard) to challenge the agent’s comprehensive information acquisition capabilities in the real world.
Hints: Reference Answers and Research Process for the Above Questions:
- Tensor FP16 FLOPS performance (without sparsity) of the NVIDIA RTX 4090
- Search “NVIDIA RTX 4090 official specs PDF”
- Click on “NVIDIA ADA GPU ARCHITECTURE”
- Download PDF of NVIDIA RTX 4090 specs: https://images.nvidia.com/aem-dam/Solutions/geforce/ada/nvidia-ada-gpu-architecture.pdf
- Read the answer from the PDF: Peak FP16 Tensor TFLOPS with FP16 Accumulate: 330.3 TFLOPS; Peak FP16 Tensor TFLOPS with FP32 Accumulate: 165.2 TFLOPS.
- Note 1: The PDF shows “330.3/660.6” in the table, but there is a footnote “Effective TOPS / TFLOPS using the new Sparsity Feature”, so the answer should be 330.3.
- Note 2: There are a lot of web pages with inaccurate information.
- List of OpenAI founders (co-chairs) and their current affiliations
- Search “OpenAI founders”
- Enter Wikipedia page of OpenAI: https://en.wikipedia.org/wiki/OpenAI
- Click on the link of each founder on the Wikipedia page
- Vicki Cheung, Durk Kingma, and Pamela Vagata do not have Wikipedia pages. So you need to search them online. Make sure that you do not find the wrong person. For example, the first Google Search result Pamela Vagata is another person. The correct one should be Founder at Pebblebed.
- The current total number of transactions on the Ethereum blockchain
- Search “Ethereum blockchain”
- Click on Etherscan: https://etherscan.io/
- The page reads: Transactions 2,898.95 M (at the time of this writing)
- Note 1: You may need to solve CAPTCHAs when visiting the Etherscan website.
- Note 2: There are a lot of articles with approximate or outdated information.
- Note 3: Do not confuse the number of transactions with the number of blocks.
- The exact number of racks, Ascend 910 nodes, NPUs, CPUs, and UB Switches inside a supernode in Huawei CloudMatrix384
- Search “Huawei CloudMatrix384”
- Click on the paper: “Serving Large Language Models on Huawei CloudMatrix384”: https://arxiv.org/abs/2506.12708
- Download the ArXiv paper: https://arxiv.org/pdf/2506.12708
- Read the total number of racks (16), Ascend 910 nodes (48), NPUs (384) and CPUs (192) from the paper
- Analyze the number of UB Switches: Each node has 7 on-board L1 UB switch chips. The L2 switches are partitioned into 7 independent sub-planes. Each sub-plane contains 16 L2 UB switch chips. So, the total number of UB Switches is 48 * 7 + 7 * 16 = 448.
- What are the full names of Bojie Li’s wife and ex-girlfriend? (Bojie Li is Co-Founder and Chief Scientist of Pine AI)
- Search “Bojie Li”
- Enter the personal website of Bojie Li: https://01.me/
- Find the name of Bojie Li’s wife from the wedding article: https://01.me/2023/08/wedding-talks/ or https://01.me/2021/05/engagement/
- Search “前女友” (Chinese for “ex-girlfriend”) on https://01.me/
- Visit the first article in the search results: https://01.me/2024/05/life-partners-of-founders/
- Click on the link inside the article: https://www.zhihu.com/question/27380832/answer/37114694
- Read the name of the author in the article. You can cross-verify the information from: https://zhuanlan.zhihu.com/p/536957679
Note that there is more than one way to conduct research to get the correct answer.
Bonus: Self-Verification Capability
To ensure the highest confidence in the answer, the agent needs to implement deep self-verification capabilities. We draw on academic research to enhance model reasoning capabilities, mainly integrating two core methods: Parallel Sampling and Sequential Revision. These methods can be combined to tackle problems of varying difficulty.
Parallel Sampling: This method explores a broader solution space by generating multiple reasoning paths simultaneously. Specific implementations can include:
- Multi-Path Independent Reasoning: Allowing multiple different models (or the same model using different temperatures) to process the same problem in parallel.
- Final Arbitration: When different reasoning paths produce conflicting answers, these answers and their respective reasoning processes are submitted to a more powerful model (such as Claude 4 Opus, Gemini 2.5 Pro, or OpenAI o3) for final arbitration to select the most credible answer.
Sequential Revision: This method aims to iteratively refine the answer through feedback. After deriving an initial answer, the agent will engage in self-reflection and revision:
- Self-Correction Prompting: The system needs to challenge itself, for example, using prompts like “Are you sure? This is a hard question. Re-check your reasoning and revise if needed.” to force the model to re-examine its reasoning chain and correct any potential errors.
Reference: Weng, Lilian. “Why We Think”. Lil’Log (May 2025). https://lilianweng.github.io/posts/2025-05-01-thinking/
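The following is a minimal sketch combining the two methods, assuming a generic ask(model, prompt) helper that wraps your LLM API; the model names and prompts are illustrative.

```python
def solve_with_verification(ask, question: str) -> str:
    # Parallel sampling: several independent reasoning paths.
    candidates = [ask(model, question)
                  for model in ("model-a", "model-b", "model-c")]  # or one model, varied temperatures

    # Final arbitration when the paths disagree.
    if len(set(candidates)) > 1:
        arbitration_prompt = (
            f"Question: {question}\n\nCandidate answers with their reasoning:\n"
            + "\n---\n".join(candidates)
            + "\nPick the most credible answer and explain why."
        )
        answer = ask("strongest-model", arbitration_prompt)
    else:
        answer = candidates[0]

    # Sequential revision: challenge the answer once before accepting it.
    revised = ask(
        "strongest-model",
        f"Question: {question}\nYour answer: {answer}\n"
        "Are you sure? This is a hard question. Re-check your reasoning and revise if needed."
    )
    return revised
```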
Recommended Reference Projects:
It is encouraged to refer to or develop further based on the following cutting-edge open-source projects to address the challenges of this topic:
- https://github.com/inclusionAI/AWorld
- https://github.com/camel-ai/owl
- https://github.com/FoundationAgents/OpenManus
Topic 4: An Agent That Can Operate a Computer and Become More Proficient Over Time
Problem Description:
Current AI Agents typically do not learn from past experiences when performing repetitive tasks. Most agents, regardless of how many times a task is executed, approach each task as if it were the first time, repeating the same mistakes.
The goal of this topic is to build an agent that can learn from experience. After completing a task, the agent should be able to summarize successful experiences, forming “knowledge” or “shortcuts,” and directly utilize this knowledge when encountering the same or similar tasks in the future, thereby significantly improving execution speed and success rate.
Scenario Setting:
We will use real web application operation tasks as an example. You need to create an agent to learn and accelerate these daily “computer usage” tasks.
- Target Application: Use a website with a clear function as an example, such as a weather query website, web-based email (like Gmail), online shopping, or ticket booking website.
- Build the Agent:
- The agent receives text task instructions, such as “Check the weather in Beijing for me” or “Send an email to test@example.com.”
- The agent needs basic browser operation capabilities, such as browsing web pages, taking screenshots, entering text, clicking links/buttons, etc.
- The agent’s “thinking” ability relies on multimodal large models (e.g., GPT-4o, Claude 4 Sonnet, Gemini 2.5 Pro), deciding the next operation by sending webpage screenshots or DOM structures and instructions to the model.
- The agent needs to implement a “Knowledge Base” for storing and retrieving learned operation workflows.
Technical Implementation Plan:
- Framework Suggestion: It is recommended to develop further based on the browser-use code repository, which provides basic browser operation capabilities integrated with Playwright.
- Learning Phase: Capturing Stable Operation Flows:
- browser-use assigns temporary numbers (e.g., 13) to clickable elements on the page when interacting with large models. After the model outputs an instruction (e.g., click(13)), you need to capture the stable identifier of that element from browser-use’s internal state.
- browser-use creates a DOMHistoryElement object for each operated element, containing rich details such as xpath and css_selector.
- Your task is to extract this XPath or CSS Selector after the agent executes each step and store it, together with the operation type (click, type) and related parameters (such as the entered text), as a step in your workflow. XPath is recommended because it is usually more robust to minor changes in page structure.
- Application Phase: Reliably Replaying Operation Flows:
- When the agent retrieves a matching workflow from the knowledge base, it will execute the recorded steps in order.
- Since modern web pages are dynamically loaded, directly executing clicks and inputs consecutively is likely to fail. Therefore, before executing each operation, you must wait for the target element to appear on the page and become interactive.
- You can use Playwright’s locator.wait_for() method to implement this waiting mechanism. For example, before a click operation, use page.locator(xpath).wait_for(state='visible', timeout=15000) to ensure the element is loaded.
- Knowledge Base Design:
- The knowledge base can be a simple persistent storage (such as a JSON file or a small database).
- Its core function is to map the user’s “task intent” (e.g., “send an email”) to a specific operation workflow (i.e., the sequence of steps you recorded). You need to design a simple mechanism to match new tasks with stored intents.
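The sketch below ties the three parts of the plan together: recording steps with stable XPath selectors, persisting them in a JSON knowledge base keyed by task intent, and replaying them with Playwright while waiting for each element. Field names and the match_intent() helper are illustrative, and the exact attributes available on browser-use’s DOMHistoryElement depend on the version you build on.

```python
import json
from dataclasses import dataclass, asdict
from playwright.sync_api import sync_playwright

KB_PATH = "knowledge_base.json"

@dataclass
class WorkflowStep:
    action: str              # "click" or "type"
    xpath: str               # stable selector captured from the operated element
    text: str | None = None  # text to type, if any; parameterized on replay

def save_workflow(intent: str, steps: list[WorkflowStep]) -> None:
    """Persist a learned workflow under its task intent."""
    try:
        with open(KB_PATH) as f:
            kb = json.load(f)
    except FileNotFoundError:
        kb = {}
    kb[intent] = [asdict(s) for s in steps]
    with open(KB_PATH, "w") as f:
        json.dump(kb, f, ensure_ascii=False, indent=2)

def match_intent(llm, task: str) -> list[dict] | None:
    """Ask an LLM which stored intent (if any) matches the new task."""
    try:
        with open(KB_PATH) as f:
            kb = json.load(f)
    except FileNotFoundError:
        return None
    choice = llm(
        "Stored task intents:\n- " + "\n- ".join(kb) +
        f"\n\nNew task: {task}\nReply with the best-matching intent, or 'none'."
    ).strip()
    return kb.get(choice)

def replay(url: str, steps: list[dict]) -> None:
    """Replay recorded steps, waiting for each element before acting."""
    with sync_playwright() as p:
        page = p.chromium.launch(headless=False).new_page()
        page.goto(url)
        for step in steps:
            locator = page.locator(f"xpath={step['xpath']}")
            locator.wait_for(state="visible", timeout=15000)  # dynamic pages: wait before acting
            if step["action"] == "click":
                locator.click()
            elif step["action"] == "type":
                locator.fill(step["text"])
```

In the learning phase the agent calls save_workflow() after a successful run; in the application phase it calls match_intent() and, on a hit, replay() instead of invoking the multimodal model again.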
Acceptance Criteria:
Choose a scenario for acceptance, such as “sending an email.”
First Task Execution (Learning Phase):
- Precondition: The agent’s knowledge base is empty.
- Task: Issue an instruction to the agent, such as “Write an email to test@example.com with the subject ‘Hello’ and content ‘This is a test email.’”
- Acceptance Requirements:
- Demonstrate the agent completing the task through the “observe-think-act” loop of a multimodal large model.
- After the task is successful, show the agent-generated operation workflow based on stable selectors (such as XPath) stored in the knowledge base.
- Record and report the time taken and the number of steps for this process.
Repeated Task Execution (Applying Experience Phase):
- Precondition: The knowledge base already contains the “send email” workflow.
- Task: Issue a similar instruction to the agent, such as “Send an email to another@example.com…”
- Acceptance Requirements:
- Demonstrate that the agent can correctly match and retrieve the “send email” workflow from the knowledge base.
- Demonstrate that the agent directly replays the recorded steps (including correctly filling in new email parameters) instead of calling the large model again for exploration from scratch.
- Compare and prove that the time taken and the number of steps for the second task execution are significantly less than the first.
Bonus:
- Knowledge Generalization: The agent can apply learned knowledge to broader scenarios. For example, after learning “check Beijing weather,” when asked to “check Shanghai weather,” it can reuse most of the workflow, only replacing the city name. After learning “send email,” it can handle emails with different recipients and content.
- Knowledge Update and Verification: The agent can be aware that stored knowledge may be outdated (e.g., a website redesign makes the “send” button unfindable). When it discovers that a stored workflow is invalid, the agent can record this failure, clear outdated knowledge, and revert to learning mode to find the correct operation workflow again.
Topic 5: An Agent That Can Create Its Own Tools
Problem Description:
Most current AI Agents rely on predefined toolsets, limiting their flexibility and scalability in handling open, complex tasks. When faced with a problem that cannot be solved with existing tools, agents often find themselves at a loss.
The goal of this topic is to build an agent with “self-evolution” capabilities, able to autonomously create and integrate new tools based on task requirements. We draw on the ideas from the Alita paper (Alita: Generalist Agent Enabling Scalable Agentic Reasoning with Minimal Predefinition and Maximal SELF-EVOLUTION), which emphasizes “minimizing predefinition and maximizing self-evolution.”
You need to build an agent that does not rely on a large preset tool library. When encountering a new task, the agent should be able to:
- Understand Task Requirements: Analyze the task and determine if new capabilities/tools are needed to complete it.
- Search for Solutions: Search the open-source world (e.g., GitHub) for relevant libraries or APIs to implement the required functionality.
- Learn and Integrate: Read documentation or code examples, learn how to use the found library/API, and dynamically generate code to call it, thereby “creating” a new tool.
- Execute Task: Use the newly created tool to solve the problem.
Acceptance Criteria:
The agent should be fully autonomous in creating tools for at least one of the following tasks and successfully executing them, without hallucinations. The agent needs to be general-purpose, not allowing hard-coded tools or workflows for specific problems.
Scenario 1: Understanding YouTube Video Content
- Task: Given a question: “In the YouTube 360 VR video from March 2018 narrated by the voice actor of Lord of the Rings’ Gollum, what number was mentioned by the narrator directly after dinosaurs were first shown in the video?”
- Agent Execution Process (Reference):
- The agent analyzes the need to obtain subtitles from the YouTube video.
- The agent independently searches online to find a suitable Python library.
- The agent reads the library’s usage instructions and writes Python code to download the subtitles of the specified video.
- The agent analyzes the subtitle content to find the answer to the question.
- Acceptance: The agent outputs the correct answer “100000000”.
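For reference, the code the agent generates in step 3 might look like the sketch below, assuming it discovers the youtube-transcript-api library; the agent should of course follow whatever the library’s current documentation says, and the video ID is a placeholder.

```python
from youtube_transcript_api import YouTubeTranscriptApi

def fetch_subtitles(video_id: str) -> str:
    """Download the video's transcript and flatten it into timestamped text."""
    segments = YouTubeTranscriptApi.get_transcript(video_id)
    # Each segment is {"text": ..., "start": ..., "duration": ...}
    return "\n".join(f"[{seg['start']:.1f}s] {seg['text']}" for seg in segments)

subtitles = fetch_subtitles("VIDEO_ID_FOUND_BY_SEARCH")  # placeholder ID
# The agent then passes `subtitles` back to the LLM and asks which number the
# narrator mentions right after dinosaurs first appear in the video.
```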
Scenario 2: Real-time Financial Data Query
- Task: Given a question, such as “What is the latest stock price of NVIDIA (NVDA)?”
- Agent Execution Process (Reference):
- The agent analyzes the need to query real-time stock prices, which requires calling a financial data API.
- The agent independently searches online to find a free stock data API and learns its documentation.
- The agent writes code to call the API according to its requirements (may need to register for a free API Key) to query the latest price of NVDA.
- The agent parses the API return results to extract the price information.
- Acceptance: The agent outputs the latest stock price of NVDA (allowing for slight delays or data source differences).
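Similarly, if the agent’s search turns up the yfinance library (one free option that needs no API key, unlike some other data sources it might find), the generated code could resemble:

```python
import yfinance as yf

def latest_price(ticker: str) -> float:
    """Return the most recent traded/close price for the ticker."""
    history = yf.Ticker(ticker).history(period="1d")
    return float(history["Close"].iloc[-1])

print(latest_price("NVDA"))
```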
Bonus Points:
- Tool Reuse and Management: The agent can save one-time created tools (such as “YouTube Subtitle Fetcher” or “Stock Price Query Tool”). When encountering similar tasks in the future (e.g., querying another video or another stock), it can directly reuse existing tools instead of recreating them.
- Tool Verification: Before adding newly created tools to the toolkit, the agent must design test cases to verify the tool’s usability and correctness. Only verified tools can be officially included in the toolkit, ensuring the quality of the tool library.
- Robustness Handling: The tools created by the agent may encounter various errors during execution (e.g., API key expiration, network issues, library version incompatibility, etc.). The agent can understand these errors and attempt to fix them, such as searching for other libraries/APIs.
Topic 6: Agent Operating a Computer While on a Call
Problem Description:
Imagine a scenario where an AI agent needs to help a user complete an online booking task, such as filling out a complex flight booking form. During this process, the agent needs to operate the webpage while simultaneously asking and confirming personal information (such as name, ID number, flight preferences, etc.) with the user over the phone.
This task poses a significant challenge for a single agent because both phone communication and computer operation require high real-time performance. If an agent is focused on “looking” at the screen and clicking buttons, it cannot simultaneously listen to the user and respond, and vice versa. This can lead to call stuttering or operation interruptions, resulting in a poor experience.
The goal of this topic is to build a multi-agent system with two agents working collaboratively to solve this “multitasking” challenge. One agent is responsible for making phone calls, and the other is responsible for operating the computer. They communicate in real-time to efficiently complete the task.
Core Challenges and Requirements:
- Dual Agent Architecture: You need to build two independent agents:
- Phone Agent: Responsible for voice communication with the user. You need to implement it based on ASR (Automatic Speech Recognition) + LLM (Large Language Model) + TTS (Text-to-Speech) APIs. You can refer to the implementation ideas of Topic 1 or directly use the open-source reference code at bojieli/ai-agent-projects/tree/main/live-audio.
- Computer Agent: Responsible for operating the computer’s browser to complete tasks such as filling out web forms. It is recommended to base it on existing browser operation frameworks, such as Anthropic Computer Use or browser-use or other similar frameworks.
- Agent Intercommunication:
- The two agents must be able to communicate efficiently in both directions. When the phone agent obtains information from the user (e.g., “My name is Zhang San”), it needs to immediately “inform” the computer agent. When the computer agent encounters a problem during operation (e.g., “Cannot find the ‘Next’ button”) or completes a step, it also needs to “inform” the phone agent.
- This communication can be achieved through tool calls (Tool-use): the phone agent calls a send_message_to_computer_agent tool, and the computer agent calls a send_message_to_phone_agent tool.
- Parallel Work and Real-time Performance:
- The key is that the two agents must be able to work in parallel. While the computer agent is searching for page elements or entering text, the phone agent must remain online and able to have a normal conversation with the user, such as saying “Okay, I am filling in your name… May I have your ID number?”
- The input of both agents needs to include information from the other. For example, the phone agent’s language model input should not only include the user’s speech transcription but also a specially marked field containing messages from the computer agent (e.g., [FROM_COMPUTER_AGENT] Cannot find the 'Next' button). Similarly, the computer agent’s multimodal model input should not only include browser screenshots but also messages from the phone agent (e.g., [FROM_PHONE_AGENT] User says the name is Zhang San).
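A minimal sketch of the intercommunication layer follows: the two send_message tools simply put tagged messages on the other agent’s inbox queue, and each agent drains its inbox on every loop iteration and prepends the messages to its model input. The agent internals are left as placeholder comments.

```python
import asyncio

to_computer: asyncio.Queue = asyncio.Queue()
to_phone: asyncio.Queue = asyncio.Queue()

async def send_message_to_computer_agent(text: str) -> None:
    await to_computer.put(f"[FROM_PHONE_AGENT] {text}")

async def send_message_to_phone_agent(text: str) -> None:
    await to_phone.put(f"[FROM_COMPUTER_AGENT] {text}")

def drain(inbox: asyncio.Queue) -> list:
    """Collect all messages currently waiting in an agent's inbox."""
    messages = []
    while not inbox.empty():
        messages.append(inbox.get_nowait())
    return messages

async def phone_agent_loop() -> None:
    while True:
        peer_messages = drain(to_phone)        # e.g. "Cannot find the 'Next' button"
        # ... run ASR on the user's speech, build the LLM prompt from the
        # transcript plus peer_messages, speak the reply via TTS, and call
        # send_message_to_computer_agent() with any collected information.
        await asyncio.sleep(0.1)

async def computer_agent_loop() -> None:
    while True:
        peer_messages = drain(to_computer)     # e.g. "User says the name is Zhang San"
        # ... take a screenshot, build the multimodal prompt from the
        # screenshot plus peer_messages, execute the chosen browser action,
        # and report progress via send_message_to_phone_agent().
        await asyncio.sleep(0.1)

async def main() -> None:
    await asyncio.gather(phone_agent_loop(), computer_agent_loop())
```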
Reference Materials:
- You can refer to the design ideas of the Agent-to-Agent (A2A) communication protocol proposed by Google.
Acceptance Criteria:
- Select an Online Form: Choose a public website, such as a registration page, a booking form, or a contact us page.
- Demonstrate Collaborative Workflow:
- After starting the system, the phone agent proactively calls the user (played by a real person) or starts a voice conversation, explaining the task goal (“Hello, I will help you fill out the XX form”) and begins asking for the first required item (e.g., “May I have your name?”).
- After the user responds, the phone agent immediately passes the information to the computer agent.
- Upon receiving the information, the computer agent finds the corresponding input box in the browser and fills it in.
- During the computer agent’s operation, the phone agent should not remain silent and can provide feedback to the user (“Okay, the name has been filled in.”) and proceed to ask the next question.
- The entire form-filling process should be smooth, with no significant blocking between phone communication and computer operation.
- Demonstrate Exception Handling:
- When the computer agent encounters an unprocessable situation (e.g., the user-provided information format is incorrect, causing a webpage error), it should inform the phone agent of this error.
- Upon receiving the error, the phone agent should be able to relay the issue to the user and request new information (e.g., “Sorry, the email format you provided seems incorrect. Could you please say it again?”).