The AI Agent Practical Course is a hands-on course co-taught by Professor Liu Junming of UCAS and myself. The first session, in 2024, had over 50 participants, and the second, in 2025, had over 100. The 2025 Spring AI Agent Practical Course took place in early February 2025 in Beijing.

Course Content

Explore the wonders of AI Agent programming practice!

With the advent of large models, intelligent agents (AI Agents) are no longer an unattainable concept but a part of our lives and studies. Now, you have the opportunity to shape this future with your own hands!

This AI Agent programming practical course aims to lead undergraduates interested in technology and innovation to deeply understand the mysteries of large models through practice and create their own Agents.

Students are free to form teams for this practical course, and each team can choose one of the following projects. It is recommended that different teams choose different projects:

Project 1: Interactive Novel

Have you ever wanted to turn a novel into an interactive game and experience the world of a novel firsthand?

An interactive novel allows users to input a chapter from a novel, and the AI extracts the plot and characters from the novel to automatically generate an interactive game.

Overall Process:

  1. The user inputs a chapter from a novel (e.g., a chapter from “Journey to the West”).
  2. The AI extracts the plot background and characteristics of each character from the novel, and these extracted contents become part of the prompt for each subsequent generation.
  3. Based on the novel content, the AI splits the plot of this chapter into several levels, designs the plot content and clearance conditions for each level (all described in a paragraph), and generates a background image for each level.
  4. At the start of the game, the AI generates a descriptive image of the plot overview as the background image.
  5. At the start of the game, a list of characters is presented for the user to choose from.
  6. At the start of each level, the AI displays the pre-determined plot content and background image to the user. Based on the overall plot, the AI selects a character (of course, not the one chosen by the user) to speak. The platform’s voice cloning feature can be used to output the character’s speech in voice.
  7. The user must role-play as the chosen character and speak within the specified time. The AI converts the user’s voice input into text and judges whether the level is passed.
  8. If the level is passed, proceed to the next level.
  9. If the level is not passed, based on the chat history and overall plot, the AI selects a character to speak and enters the next round of the loop.
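The game loop in steps 6–9 can be sketched as a small state machine. This is a minimal sketch with hypothetical names (`Level`, `GameState`, `advance`); the LLM calls that judge the clearance condition and pick the next speaking character are left out as stubs.

```python
from dataclasses import dataclass, field

@dataclass
class Level:
    plot: str            # pre-generated plot description shown at level start
    pass_condition: str  # clearance condition, to be judged by the LLM

@dataclass
class GameState:
    levels: list
    current: int = 0
    history: list = field(default_factory=list)  # chat history fed back into the prompt

    def record(self, speaker: str, text: str) -> None:
        """Append one utterance to the chat history."""
        self.history.append((speaker, text))

def advance(state: GameState, passed: bool) -> bool:
    """Move to the next level if the LLM judged this round as passed.

    Returns True once all levels are cleared (game over).
    """
    if passed:
        state.current += 1
    return state.current >= len(state.levels)
```

The extracted plot background, character traits, and `state.history` together form the prompt for each generation round, as described in step 2.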

Project 2: Voice Werewolf

Werewolf is a fun social-deduction role-playing game. AI Agents can also play the various roles in Werewolf, letting them play the game together with humans. Werewolf tests an AI’s reasoning ability and its ability to conceal its true identity.

Requirements:

  1. Use the platform’s real-time voice capability to develop a voice Werewolf game where a real user and several AI characters play Werewolf in the same room via voice connection.
  2. At least the roles of Judge (moderator), Werewolf, Villager, Witch, and Seer are needed. Roles like Hunter and Sheriff can be added if interested.
  3. One role in the game is a real person, and the others are AI Agents.
  4. Each AI Agent and human participant must follow the game rules, with roles assigned randomly, only seeing the information they should see, and not seeing information they shouldn’t.
  5. Agents need some basic game skills (which can be specified in the prompt): for example, werewolves should generally not reveal their identity, should rarely target themselves at night, and should learn to disguise themselves, while the Witch and Seer should use their abilities wisely.
  6. Agents need to have the ability to analyze others’ speeches and infer who the werewolves are, not choosing randomly.
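Requirement 4 (random role assignment, each player seeing only what they should) can be sketched as follows. `assign_roles` and `visible_info` are hypothetical helpers, and the role list is just one example configuration; werewolves additionally see their teammates, matching the standard rules.

```python
import random

# Example role configuration for a six-player game (an assumption, not fixed by the course)
ROLES = ["Werewolf", "Werewolf", "Villager", "Villager", "Witch", "Seer"]

def assign_roles(players, roles=ROLES, seed=None):
    """Randomly assign exactly one role to each player."""
    rng = random.Random(seed)
    shuffled = list(roles)
    rng.shuffle(shuffled)
    return dict(zip(players, shuffled))

def visible_info(player, assignment):
    """Return only the information this player is allowed to see:
    their own role, plus teammates if they are a werewolf."""
    info = {"you": assignment[player]}
    if assignment[player] == "Werewolf":
        info["teammates"] = [p for p, r in assignment.items()
                             if r == "Werewolf" and p != player]
    return info
```

Each AI Agent’s prompt would then be built only from its own `visible_info`, never from the full `assignment`.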

Project 3: Intelligence Gathering Expert

We often need to gather information online, but many people are not skilled at using search engines, whereas AI’s search and summarization capabilities are now very strong.

This project requires AI to automatically analyze complex information-gathering questions, search step by step, read search results, and obtain answers.

The overall process for the Agent is:

  1. Analyze the question and propose search terms.
  2. Call the Google Search API.
  3. Fetch the returned webpages and extract their main content.
  4. Send the webpage content to the AI model to answer the question. If it can answer, the Agent process ends.
  5. If the searched content is insufficient to answer, the AI can choose to click on links within the webpage to enter new pages and repeat step 3; it can also choose to propose new search terms and repeat step 2.
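The five steps above can be sketched as one retrieval loop. All components (`llm`, `search`, `fetch`) are injected callables standing in for the real model, the Google Search API, and a page-extraction step; the query refinement in step 5 is stubbed, since in practice it would be another LLM call.

```python
def answer_with_search(question, llm, search, fetch, max_steps=5):
    """Iteratively search, read pages, and retry until the LLM can answer.

    Injected callables (hypothetical signatures):
      search(query) -> list of result URLs              # step 2
      fetch(url)    -> main text of the page            # step 3
      llm(question, context) -> answer string, or None if context is insufficient
    """
    context = []
    query = question                      # step 1: initial search terms
    for _ in range(max_steps):
        for url in search(query):         # step 2: call the search API
            context.append(fetch(url))    # step 3: extract page content
        answer = llm(question, context)   # step 4: try to answer
        if answer is not None:
            return answer
        query = "refined: " + query       # step 5: propose new terms (stubbed)
    return None
```

Following in-page links (the first branch of step 5) could be added by having `llm` also return candidate URLs to `fetch` on the next iteration.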

Requirements:

  1. For the 32 high-difficulty information-retrieval questions from the past 6 years of Hackergame competitions, the AI must answer at least 30% (10 questions) correctly to pass the project.
  2. The AI’s capabilities must be general-purpose: it is not allowed to hard-code the questions into the AI or to search for Hackergame write-ups (solutions to these questions can be found directly on Google, but using those write-ups is not allowed).

Project 4: Paper Video Explanation

Google has a popular product called NotebookLM: give it any paper and it uses AI to generate a two-host podcast explaining the paper.

However, an explanation carried by two talking voices contains too little information: it is hard to see the paper’s figures or convey its structure clearly. A more effective paper explanation might still take the form of a Bilibili-style video.

This project aims to use AI to generate a video explanation of a paper, inputting any paper and generating an explanatory video. The image part of the video is an AI-generated PPT, and the voice part of the video is the voice explanation of this PPT.

The principle of PPT generation is to let the large model generate several pages of PPT based on the paper, with each page of PPT being a segment of SVG or HTML code. The large model can generate the explanation text for each page of PPT while generating the PPT content. Then, using a speech synthesis model, the explanation text is synthesized into speech, and finally, the generated PPT content and speech are combined to obtain the explanatory video.
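One way to sketch the assembly step described above: represent each slide as an SVG/HTML snippet plus its narration script, then place slides on a timeline using the duration of each synthesized audio clip. The names here are hypothetical, and the actual SVG rendering, TTS call, and video muxing (e.g. with ffmpeg) are omitted.

```python
from dataclasses import dataclass

@dataclass
class Slide:
    svg: str     # slide content as an SVG (or HTML) snippet from the LLM
    script: str  # narration text generated alongside the slide

def build_timeline(slides, durations):
    """Pair each slide with its start time, given the duration (in seconds)
    of each slide's synthesized narration clip.

    Returns (timeline, total_length), where timeline is a list of
    (start_time, slide) pairs.
    """
    timeline, t = [], 0.0
    for slide, d in zip(slides, durations):
        timeline.append((t, slide))
        t += d
    return timeline, t
```

The final step would render each `svg` to an image and show it from its start time while the corresponding narration audio plays.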

This experiment will provide models for AI-generated structured PPT text content and speech synthesis. For structured PPT generation, you can experience the PPT generation feature in the Alibaba Tongyi Qianwen app.

Bonus: The generated PPT not only contains the generated text outline but also includes charts from the original paper. The charts and explanatory text in the PPT need to correspond.

Project 5: Multimodal AI Assistant

“Her” is an interesting 2013 movie about a man who falls in love with an AI. The latest products from OpenAI, Anthropic, and Google all carry echoes of “Her.” Samantha in “Her” is an AI operating system that can listen, see, and speak; she helps complete work by operating a computer, makes phone calls on the user’s behalf (easing social anxiety), and provides emotional value to users.

Of course, we still find it difficult to create a full-featured version of Her. But we can make a simplified version that supports voice input (listening), voice output (speaking), and can see the content in front through a camera (seeing).

The requirement is to achieve the capabilities demonstrated in the Gemini demo video: https://www.youtube.com/watch?v=UIZAiXYceBI (Bilibili link: https://www.bilibili.com/video/BV1Xg4y1o7PB/)
It should be able to answer questions with voice based on the content seen by the camera and the user’s voice questions.
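A single turn of the listen-see-speak loop might be wired as below, with the ASR, vision-language model, and TTS passed in as interchangeable components. All names and signatures are hypothetical stubs, not a specific platform API.

```python
def assistant_turn(frame, audio_in, asr, vlm, tts):
    """One listen-see-speak turn. Injected components (hypothetical):
      asr(audio)        -> transcribed user question
      vlm(frame, text)  -> reply from a vision-language model
      tts(text)         -> synthesized reply audio
    """
    question = asr(audio_in)       # listening: speech to text
    reply = vlm(frame, question)   # seeing: answer based on the camera frame
    return tts(reply), reply       # speaking: text to speech
```

A real assistant would run this turn in a loop, continuously grabbing camera frames and microphone audio and keeping a dialogue history across turns.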

Bonus:

  1. Support for web search
  2. Support for operating apps on the phone or computer (can use open-source projects like AppAgent, Mobile Agent, etc.)
  3. Support for controlling smart home devices

