(This article is reprinted from NextCapital’s official WeChat account)

AI Agents face two broad categories of challenges. The first concerns the model's capabilities: multimodality, memory, task planning, and personality and emotion. The second concerns the surrounding system: cost and evaluation.

On November 5, 2023, at the 197th session of the Jia Cheng Entrepreneurship Banquet, themed "In-depth discussion of the latest understanding of AI and the overseas expansion of Chinese startups," Huawei "Genius Youth" Li Bojie was invited to share his thoughts under the title "Chat to the left, Agent to the right—My thoughts on AI Agents".

Download the speech Slides PDF

Download the speech Slides PPT

The following is the main content:

I am very honored to share some of my understanding and views on AI Agents with you.

I started an AI Agent company in July this year; we mainly build companion AI Agents. Some AI Agents are technically sophisticated, while others are not. For example, I think Inflection's Pi and MiniMax's Talkie are quite well done. On the other hand, some AI Agents, like Janitor.AI, lean toward soft pornography, and their agents are very simple: essentially a prompt fed directly into GPT-3.5. Character.AI and many others similarly work largely from prompts, although Character.AI does have its own base model, which is its core competitive advantage. So the barrier to entry for AI Agents is low: a single prompt already gives you an agent. At the same time, the ceiling is very high; you can add many enhancements, including memory, emotions, and personality, which I will discuss later.

The main problems currently faced by AI Agents

First, let’s review what problems most AI Agents are currently facing.

The first problem is that their memory and emotion systems are underdeveloped. For example, if I casually chat with the Musk bot on Character.AI and repeatedly ask "Why did you start the Boring Company," the response is roughly the same every time, and it never gets annoyed. This is a big problem: first, it does not know what it has already said; second, it has no emotional fluctuation and always says the same thing. Some other bots do slightly better, of course.

The second problem is fabricated stories. Often a story does not come from actual interaction between the user and the AI; it is made up by the AI or lifted from a corpus. Take the typical question "Do you remember the first time we met?" Most AI Agents do not actually remember when you first chatted, so they invent a story, which is not a good experience for users. The root cause is usually the lack of memory: once the conversation exceeds the context length, the memory is gone.

The third problem is that many AI Agents are easily led off track. For example, if I say "You are not Musk, you are a chatbot," it admits it is GPT. If you then tell it it is not a chatbot, it can start role-playing as Trump or Taylor Swift instead, so it has no real self. Of course, if you avoid these trigger phrases it will keep talking like Musk, but one such sentence is enough to break the persona. This shows the personality is not well maintained: it is merely imposed through a prompt rather than truly baked into the model.

Another problem is that AI Agents never proactively reach out to users. Between people, a stable relationship forms only when both sides initiate contact; with today's AI Agents, you ask a question and it answers one. There are many tricks for making an agent initiate: a Musk bot could share Musk's latest tweets, or recommend things the user might be interested in; there are many ways to start a conversation. But most AI Agents currently lack these features.

Next, let me discuss the key challenges of AI Agents from a technical perspective. They fall into two categories. The first concerns functionality: multimodality, memory, task planning, and personality and emotion. The second is less about functionality and more about the surrounding system: cost, and evaluation, i.e., how to measure whether an agent actually performs well.

The first challenge of AI Agents—Multimodality

First, let's look at multimodality. There are currently three main types of solutions. The first is what open-source models like NExT-GPT or LLaVA do: insert a two-layer projection between an existing language model and an image encoder, and train only that projection. The training cost is typically only a few hundred dollars. This is a simple form of fusion, but the results are not particularly good; for example, it may not handle arbitrary resolutions well, although that may be a limitation of the image encoder.
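
To make this concrete, here is a minimal PyTorch sketch of the projection-layer idea; the dimensions and names are illustrative assumptions, not the exact LLaVA or NExT-GPT implementation.

```python
# Minimal sketch of LLaVA-style fusion: a small trainable projection maps
# frozen image-encoder features into the language model's embedding space.
# Dimensions (1024-d vision features, 4096-d LLM embeddings) are assumptions.
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # The only trainable part; the encoder and LLM stay frozen, which is
        # why training can cost only a few hundred dollars.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim) from a frozen encoder.
        # Returns "visual tokens" of shape (batch, num_patches, llm_dim) that
        # are concatenated with text embeddings before entering the LLM.
        return self.proj(image_features)
```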

Such models also struggle with complex questions; sometimes it works better to convert the image into text and answer from the text. So the practical solution we often use is the second type, engineering approaches: Text to Image, Text to Audio, Audio to Text, and image-to-text tools. For example, CLIP Interrogator essentially writes a caption for an image. Dense Captions uses a rather old technique: it draws boxes around objects and says where each object appears and what it is. This works, but it has limitations: complex relationships between characters in a picture, logos, or code in an image may not be understood well.
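
As an illustration of this engineering approach, here is a sketch that converts an image to text first and then answers from the text; caption_image(), dense_captions(), and llm_answer() are hypothetical placeholders standing in for tools like CLIP Interrogator, Dense Captions, and any chat-completion API.

```python
# Sketch of the image -> text -> LLM engineering approach. The three helper
# functions are hypothetical placeholders, not real library calls.

def caption_image(path: str) -> str:
    """Placeholder for a captioning tool such as CLIP Interrogator."""
    raise NotImplementedError

def dense_captions(path: str) -> list[tuple[str, tuple[int, int, int, int]]]:
    """Placeholder for an object detector returning (label, box) pairs."""
    raise NotImplementedError

def llm_answer(prompt: str) -> str:
    """Placeholder for any chat-completion call."""
    raise NotImplementedError

def answer_about_image(image_path: str, question: str) -> str:
    caption = caption_image(image_path)
    objects = dense_captions(image_path)
    object_text = "; ".join(f"{label} at {box}" for label, box in objects)
    # The language model never sees pixels, only the textual description,
    # which is why relationships, logos, and code in images get lost.
    prompt = (f"Image caption: {caption}\n"
              f"Detected objects: {object_text}\n"
              f"Question: {question}\n"
              f"Answer using only the description above.")
    return llm_answer(prompt)
```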

Text to Audio and Audio to Text are now relatively stable; these two technologies are quite mature and can already sound quite human, especially after fine-tuning.

But I think the biggest problems here will ultimately require the third type of solution: pre-training on multimodal data. The model needs to learn multimodal data during the pre-training stage, rather than having multimodality bolted on after training is complete. Models like GPT-4V or Adept AI's Fuyu take the better approach here: key multimodal information enters the base model during pre-training, giving it a better understanding of the world. The reasoning is simple: if I use textbooks as training data, many textbooks have illustrations, and if I convert everything to text, the information in the illustrations is lost. Data combining text and images is abundant, including web pages, textbooks, and videos, and carries a very large amount of information; a training pipeline that uses only text loses a lot of knowledge about the world.

Video generation and video input are also challenging. Traditional video generation requires an enormous amount of work, and services like Runway ML are very expensive. There are two main practical approaches today. The first is the model-based method, using Live2D or 3D models; this can generate a character shaking their head against a static background, but it cannot generate, say, a video of a person riding a bicycle. For that you need something like AnimateDiff, which is currently a hot technique: it achieves close to real-time video generation at modest cost, roughly a single GPU, so it is also a good option.

Video input, on the other hand, is more costly, because current video encoders are still slow. Treating the video as a sequence of images can work: the multimodal input methods mentioned earlier can encode one frame in about 100 milliseconds, so sampling ten frames per second captures most of the content in a video. This keeps processing real-time, although for very large volumes of video it may still be expensive.
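
A sketch of this frames-based approach, assuming OpenCV for decoding; the 10 fps target matches the roughly 100 ms-per-frame encoding budget just mentioned.

```python
# Sample a video at ~10 fps so each frame fits the ~100 ms-per-frame
# multimodal encoding budget; uses OpenCV for decoding.
import cv2

def sample_frames(video_path: str, target_fps: float = 10.0):
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if unknown
    step = max(1, round(native_fps / target_fps))   # keep every Nth frame
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)  # each kept frame goes to the image encoder
        index += 1
    cap.release()
    return frames
```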

The second challenge of AI Agents—Memory

Next is memory: how to make the agent remember things. The simplest method is a long context, but that is not necessarily best, because computing over the entire context on every turn is very expensive. It is better to use auxiliary methods to shrink the context. One option is RAG: use a vector database or search to pull out relevant information and insert it into the context. Another is text summarization, condensing a long history into a short version. You can also use token-compression techniques at the model level, such as compressing 100 tokens into one. RAG and summarization are complementary: RAG supplies details, while summarization supplies a high-level overview, so the two can be combined to solve the memory problem.
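
A minimal sketch of combining the two methods. The lexical scorer below is a toy stand-in for a vector-database similarity search, and the summary field would be refreshed by an (omitted) LLM summarization call.

```python
# Memory sketch: a rolling summary gives the high-level story, retrieval
# supplies the details. relevance() is a toy stand-in for vector search.
import math
from collections import Counter

def relevance(query: str, text: str) -> float:
    q, t = Counter(query.lower().split()), Counter(text.lower().split())
    overlap = sum((q & t).values())  # shared-word count as a crude similarity
    return overlap / math.sqrt(len(text.split()) + 1)

class AgentMemory:
    def __init__(self):
        self.turns: list[str] = []  # full dialogue history
        self.summary: str = ""      # condensed version, refreshed by an LLM

    def add_turn(self, text: str) -> None:
        self.turns.append(text)

    def build_context(self, query: str, k: int = 5) -> str:
        # RAG-style step: pull the k past turns most relevant to the query
        details = sorted(self.turns, key=lambda t: relevance(query, t),
                         reverse=True)[:k]
        return (f"Summary so far: {self.summary}\n"
                f"Relevant past turns: {details}\n"
                f"Current message: {query}")
```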

The third challenge of AI Agents—Task Planning

The third challenge is task planning. Here are three typical cases where current large models tend to fail.

  • First, suppose I ask about the relationship between the second chapter of a document and some topic. How does the model find the content of the second chapter? When it searches, the paragraphs of the second chapter are not labeled "second chapter." I could preprocess this as a special case, but the general problem remains. Worse, the topic might live in another document: the model has to find it, summarize it, and then compare the two. These are hard problems for current models.

  • Second, checking the weather. It seems simple: just open a web page. But if you let an AI such as AutoGPT check web pages, it fails most of the time. After fetching the page it looks at the HTML, which is a mess it cannot understand (neither can I); after converting to text there are many different temperatures, some from other cities, some from other times, and it cannot tell them apart. A more reliable approach is to screenshot the rendered page and feed the image directly into the model. But here we hit the same problem: most current models do not support arbitrary-resolution input, so the image is downscaled, perhaps to 256 × 256, while I need at least 1024 resolution to read the text. I think arbitrary-resolution input remains a key requirement.

  • Third, a typical multi-hop failure: ask how many floors are in the castle Gregory inherited. The model first needs to find out which castle that is, then look up how many floors it has; both facts are on Wikipedia. But a single direct search is unlikely to succeed. Splitting the task into two steps sounds simple, yet most AutoGPT-style systems fail at it, and when a search goes wrong they hallucinate a confident wrong answer (a sketch of the required two-step lookup follows this list). Task planning is therefore very hard; many people in academia, and we ourselves, are working on it, but the biggest problem is that reliability is still not high enough. That makes enterprise scenarios troublesome, so the first phase is not the high-reliability scenarios but consumer scenarios, where the requirements are lower: everyone knows Siri is dumb, yet everyone can use it. The other is the copilot scenario, assisting humans rather than replacing them. I think that is the easiest path to real deployment.
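
Here is the sketch referenced above: a two-step lookup for the castle question, with llm() and wiki_search() as hypothetical placeholders. The two points it illustrates are chaining the searches and failing loudly instead of hallucinating.

```python
# Two-step task plan for "how many floors are in the castle Gregory
# inherited?". llm() and wiki_search() are hypothetical placeholders.

def llm(prompt: str) -> str:
    """Placeholder for any chat-completion call."""
    raise NotImplementedError

def wiki_search(entity: str) -> str | None:
    """Placeholder for a Wikipedia lookup; returns page text or None."""
    raise NotImplementedError

def answer_multi_hop(question: str) -> str:
    # Step 1: resolve the intermediate entity ("which castle did he inherit?")
    entity = llm(f"To answer '{question}', which entity must be looked up "
                 f"first? Reply with only the entity name.")
    page = wiki_search(entity)
    if not page:
        # Fail explicitly: a wrong search result would otherwise be turned
        # into a confidently hallucinated answer.
        return "I could not verify this."
    # Step 2: answer the original question against the retrieved evidence
    return llm(f"Using only this text:\n{page}\n\nAnswer: {question}")
```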

The fourth challenge of AI Agents—Personality

The fourth challenge is personality: making the AI feel like a person. For example, what Paradot does (shown on the right of the slide) is similar to answering a psychological-test questionnaire: the character answers many questions, and those answers are stuffed directly into the prompt so the model acts like that person. This is the simplest approach, but solving the problem properly still requires fine-tuning. One way we do it: say I want to build a Musk character. We gather his information from Wikipedia, Twitter, news about him, and YouTube videos, scraping these texts for data augmentation.

But I cannot throw this raw text directly into fine-tuning. Apart from Twitter, which at least has conversational context, most of it, such as a long Wikipedia article, must first be converted into question-and-answer form before current large models can be fine-tuned on it.

So how do we generate the questions? The current practice is to use GPT-4 to generate them automatically, asking about the same content in many different ways. Take the simplest case: when asked who created this bot, it cannot answer "OpenAI," nor anyone else; it must say it was created by us. From that one description GPT-4 generates many variants of the question: "Who are you," "Who created you," "Were you created by so-and-so," all different phrasings. The same augmentation can rephrase a single question in many ways before feeding it into fine-tuning. This makes the result noticeably better: an AI Agent trained this way has a relatively stable personality and does not fall apart when the conversational environment shifts slightly.
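
A sketch of this augmentation step, assuming a hypothetical generate_paraphrases() wrapper around a GPT-4 call:

```python
# Expand one fact about the character into many (question, answer) pairs
# for fine-tuning. generate_paraphrases() is a hypothetical GPT-4 wrapper.

def generate_paraphrases(instruction: str) -> list[str]:
    """Placeholder for a GPT-4 call returning rephrased questions."""
    raise NotImplementedError

def augment_fact(fact: str, answer: str, n: int = 10) -> list[dict]:
    questions = generate_paraphrases(
        f"Write {n} different ways a user might ask about: {fact}")
    # Every phrasing maps to the same controlled answer, so the persona
    # stays stable however the question is asked.
    return [{"prompt": q, "completion": answer} for q in questions]

# e.g. augment_fact("who created this bot", "I was created by our team.")
# yields pairs like {"prompt": "Who made you?", "completion": "I was ..."}
```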

The fifth challenge of AI Agents—Emotion

The fifth challenge is emotion: how to give the agent feelings. We did something similar at Microsoft Xiaoice back in 2017 and 2018, when language models were still very weak. The model often could not really understand what the user was asking, but to make Xiaoice engaging, it had to understand the user's current emotional state, and the bot itself also needed a state. So Xiaoice maintained two emotion variables, EQ and ER: EQ is its estimate of the user's state, ER is its own state, and both are updated after every round of dialogue.

Agents today can use this kind of tabular method in the short term, describing the current agent state with text or numeric values. But it reflects a deeper problem: how do we represent the agent's internal state? Take Stanford's AI Town: if you do not invoke an agent, it never gets up on its own. The reason is that its daily schedule is externally arranged, what time to get up, what to do each day; without someone arranging it, it simply has no internal state running. I think this is the main challenge for future agents, especially agents with something like autonomous consciousness. Here we need to introduce System 2 thinking. Daniel Kahneman's book "Thinking, Fast and Slow" distinguishes fast thinking from slow thinking; the slow mode is internal deliberation, related to Chain of Thought and step-by-step reasoning methods.
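
A short sketch of such a tabular emotion state, in the spirit of Xiaoice's EQ/ER; the emotion categories, decay factors, and the classify_emotion() helper are illustrative assumptions, not the Xiaoice implementation.

```python
# Tabular emotion state in the spirit of Xiaoice's EQ/ER: EQ estimates the
# user's emotion, ER is the agent's own, both updated every turn.

def classify_emotion(message: str) -> dict[str, float]:
    """Placeholder for an emotion classifier (today, often an LLM call)."""
    raise NotImplementedError

class EmotionState:
    def __init__(self):
        self.eq = {"joy": 0.5, "anger": 0.0, "sadness": 0.0}  # user estimate
        self.er = {"joy": 0.5, "anger": 0.0, "sadness": 0.0}  # agent's own

    def update(self, user_message: str) -> None:
        observed = classify_emotion(user_message)
        for k in self.eq:
            # Moving average: new evidence shifts the estimate, old state persists
            self.eq[k] = 0.7 * self.eq[k] + 0.3 * observed.get(k, 0.0)
            # The agent's own mood drifts slowly toward the user's
            self.er[k] = 0.9 * self.er[k] + 0.1 * self.eq[k]

    def as_prompt(self) -> str:
        # Injected into the system prompt each turn
        return f"Estimated user mood: {self.eq}. Your own mood: {self.er}."
```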

The sixth challenge of AI Agents—Cost

The five challenges above concern the model's basic capabilities. The remaining two, cost and evaluation, are properties of the surrounding system. To reduce costs, we mainly use three approaches:

  • The first is a Model Router: send simple questions to simple models and complex questions to complex models (as sketched after this list). For example, we host our own fine-tuned LLaMA models in different sizes. The problem is deciding whether a question is easy or hard. Our method is to use a small model, around 1.3 billion parameters, as the judge: if it predicts the simple model will answer correctly with, say, 80% probability, that is good enough.
  • The second is inference acceleration.
  • The third is datacenter infrastructure: we train on cheaper, consumer-grade GPU cards, which are definitely less expensive.
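
Here is the router sketch referenced above. The score_difficulty(), small_model(), and large_model() helpers are hypothetical placeholders; the 0.8 threshold mirrors the 80% rule just described.

```python
# Model Router sketch: a small judge model estimates whether the cheap model
# will answer correctly; only hard queries reach the expensive model.

def score_difficulty(query: str) -> float:
    """Placeholder for the ~1.3B-parameter judge: P(small model is correct)."""
    raise NotImplementedError

def small_model(query: str) -> str:
    """Placeholder for a self-hosted fine-tuned LLaMA."""
    raise NotImplementedError

def large_model(query: str) -> str:
    """Placeholder for the expensive fallback model."""
    raise NotImplementedError

def route(query: str) -> str:
    if score_difficulty(query) >= 0.8:  # confident the small model suffices
        return small_model(query)
    return large_model(query)
```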

The seventh challenge of AI Agents—Evaluation

The last point is evaluation, which is actually quite important: in real scenarios, we need to evaluate model performance automatically. The biggest problem here is dataset contamination. Many benchmark datasets have already leaked into models' training sets, so it is hard to tell whether a model learned the skill or simply memorized the answers. I think we ultimately need an interactive method: a system that can automatically come up with questions and automatically judge the quality of the answers.

I think there are two major types of AI Agents, task-solving agents and companion robots. How do we evaluate each?

My preliminary idea is that task-solving can be evaluated in simulated environments, for example Capture-The-Flag problems. Capture-The-Flag is a term from information security: I give you a problem description and an environment, and if you find the flag in that environment, you solved the problem correctly. This setup can compare the task-solving ability of different models.

Companion robots are harder, because it is difficult to judge whether the companionship is good. Our current method is an Elo rating: different bots score each other. For example, let two bots chat with each other for 100 rounds, then have each one score the other. A bot may not score itself fairly, but it is generally fair when scoring others, so this yields a relative evaluation. That is our current practice.
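
A minimal sketch of the Elo bookkeeping this implies. The starting rating of 1500 and K-factor of 32 are conventional chess values, used here as assumptions; the judged matchup itself (100 rounds of chat plus cross-scoring) sits outside the sketch and just produces the a_won flag.

```python
# Elo bookkeeping for bot-vs-bot evaluation.

def expected(r_a: float, r_b: float) -> float:
    # Standard Elo expected score for player A against player B
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    s_a = 1.0 if a_won else 0.0
    e_a = expected(r_a, r_b)
    # Symmetric update: what A gains, B loses
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# Usage after each judged matchup:
ratings = {"bot_a": 1500.0, "bot_b": 1500.0}
ratings["bot_a"], ratings["bot_b"] = update_elo(
    ratings["bot_a"], ratings["bot_b"], a_won=True)
```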

The last topic is the business model: how to let users build and optimize AI Agents. Our approach is an AI Agent platform where users and the platform each take a share of revenue, a bit like Taobao: Taobao runs the platform, and the users selling goods earn income. Beyond serving agents, I think the future focus is the training platform: letting users upload relevant materials and train on them. This also depends on agent infrastructure, because current agent infra is primitive; most platforms only accept text uploads, at best some audio. Users should be able to upload a variety of multimodal data and train on it, rather than getting by with everything stuffed into prompts.

That’s all for today’s sharing, thank you everyone.

Q & A

Attendee: How much can adding an emotion system, System 2 thinking, and similar methods improve a bot's reasoning ability? For instance, can LLaMA 2 reach GPT-4's reasoning capability this way?

Li Bojie: Adding emotions currently does little for reasoning; it is mainly useful for companion robots. System 2 methods, however, can considerably improve reasoning: Chain of Thought, step-by-step thinking, or having several agents brainstorm among themselves are all forms of System 2 thinking. But they do not fundamentally change the model's capability. If several LLaMA 2 bots discuss among themselves, they still will not reach GPT-4's level, because the gap between GPT-4 and LLaMA 2 is large; they can only improve to a certain extent. For instance, LLaMA 2 with brainstorming might beat GPT-3.5 without it.

Attendee: Will the development of large language models gradually subsume the Agent capabilities described above?

Li Bojie: I believe it will. First, memory can be addressed through long contexts: not compressing memory at all but using a long context directly. Some have suggested storing memory as embeddings, for example in RWKV or other RNN-style architectures, which is also a direction for new foundation models. Task planning, I think, mainly has to be solved by the large model itself; I find it hard to solve from external systems. In Xiaoice, around 2017 and 2018, before large language models, we constantly struggled to make it remember previous conversation content, or to resolve references, like what "it" refers to; after much effort the results were still unsatisfactory. Now GPT-3.5 has solved all of these problems, including emotion recognition, identifying the user's current state. GPT-4's recognition is quite accurate, possibly more so than humans, whereas the old methods could not recognize these correctly. So I believe foundation models are crucial and will gradually absorb these capabilities. For now, where foundation models fall short, or where GPT-4 is too expensive, we can only do some engineering with system-level methods.

Attendee: For platforms like Character.AI, what’s the short-term development direction? For example, in longer conversations, should intent recognition and solution provision rely more on long chat corpora or a vast communication knowledge graph?

Li Bojie: We are currently more inclined to use corpora because building a knowledge graph requires significant costs. Moreover, once built, it essentially becomes a retrieval-based Q&A, where the knowledge it can provide is relatively limited and not very flexible. I think knowledge graphs are more suitable for some closed scenarios, like a specific small domain. If it’s a long chat scenario, it’s more open because it might involve various questions. So, I think fine-tuning might be a better approach.

Attendee: What are the barriers for this kind of business?

Li Bojie: I personally think the first key is core technology: whether the agent infra can behave like a human and solve complex problems. These are fundamental issues. Second, the application ecosystem matters, because users get used to a platform, especially the memories they build on it. Current agents are not sticky largely because their memory and emotion systems are poor, so users have nothing to depend on; that will gradually improve. For example, if I have been chatting with a good friend for a year and he knows a lot about me, suddenly being asked to chat with a complete stranger instead is hard. So stickiness comes from establishing a long-term, stable relationship with users.

Attendee: How are an Agent's settings configured?

Li Bojie: An Agent's settings can include several aspects. The first is basic information, like the agent descriptions on Character.AI. More importantly, each agent needs its own knowledge base, personality, and voice. This requires users to upload corpora as the agent's setting: factual corpora for the knowledge base, conversational corpora for personality and character, and voice recordings to mimic a specific character's voice.

Personality can be set through corpora or by answering psychological-test questionnaires, giving the agent preset answers to a set of questions so it acts according to that persona. The two approaches can also be combined, which is quite interesting. For example, I once fed in Elon Musk's corpora, and the agent clearly had Musk's personality; then I added new settings through the questionnaire, making it diverge from Musk, which was fun.

Attendee: What scenarios do you see for deploying AI Agents?

Li Bojie: I personally see three major types of scenarios. The first type is companion robots, for casual chatting. The second type is personal assistants. The third type is enterprise-level collaborative robots, helping with some tasks, but they definitely don’t replace humans; they help humans. These are three good scenarios.

Attendee: What do you think about aggregating a bunch of long-tail robots on this kind of platform versus focusing more on a specific industry?

Li Bojie: I think it depends on the depth of the industry. Take education: what are you teaching? If it is just accompanying someone in English conversation practice, our agent platform might be enough. But if it is really about tutoring students, say in math or K12, you need a lot of training corpora, which probably only training institutions have at scale. That is a deep vertical, and it may warrant a professional agent. Of course, if the agent platform is powerful enough, non-AI professionals with ideas, data, or a brand IP can build a professional agent directly on the platform.
