2024 Yunqi Conference: Two Bitter Lessons on Foundational Models, Applications, and Computing Power
Invited to attend the 2024 Yunqi Conference on September 20-21, I spent nearly two days exploring all three exhibition halls and engaged with almost every booth of interest.
- Hall 1: Breakthroughs and Challenges in Foundational Models
- Hall 2: Computing Power and Cloud-Native, the Core Architecture Supporting AI
- Hall 3: Application Implementation, AI Empowering Various Industries
My previous research focused on the computing infrastructure and cloud-native aspects of Hall 2. Now, I mainly work on AI applications, so I am also familiar with the content of Hall 1 and Hall 3. After two days of discussions, I really felt like I had fully experienced the Yunqi Conference.
After the conference, I spoke into a recorder for over two hours, then had AI organize the recording into this nearly 30,000-word article. I didn't finish organizing it on September 22, and with a busy work schedule I only found time during the National Day holiday to revise it with AI. Including the recording, the whole process took about 9 hours. In the past, without AI, writing 30,000 words in 9 hours would have been unimaginable.
Outline of the full text:
Hall 1 (Foundational Models): The Primary Driving Force of AI
- Video Generation: From Single Generation to Diverse Scenarios
- From Text-to-Video Only to Multimodal Input Generation
- Action Reference Generation: From Static Images to Dynamic Videos
- Lip Sync and Video Generation for Digital Humans
- Speech Recognition and Synthesis
- Speech Recognition Technology
- Speech Synthesis Technology
- Music Synthesis Technology
- Future Direction: End-to-End Multimodal Models
- Agent Technology
- Inference Technology: The Driving Force Behind a Hundredfold Cost Reduction
Hall 3 (Applications): AI Moving from Demos to Various Industries
- AI-Generated Design: A New Paradigm for Generative AI
- PPT Generation (Tongyi Qianwen)
- Graphical Chat Assistant (Kimi's Mermaid Charts)
- Displaying Generated Content in Image Form (New Interpretation of Chinese)
- Design Draft Generation (Motiff)
- Application Prototype Generation (Anthropic Claude)
- AI Travel Assistant: Data is Key
- Digital Humans and Virtual Avatars: The Challenge of Not Looking Like Real People
- AI Outbound Calls: Using Real Human Recordings Instead of TTS
- Humanoid Robots: Dual Bottlenecks of Mechanics and AI
- Autonomous Driving: Finally Mature Enough for Commercial Use
- Smart Consumer Electronics: High Expectations, Slow Progress
- AI-Assisted Operations: From Hotspot Information Push to Fan Interaction
- Disruptive Applications of AI in Education: From Personalized to Contextual Learning
Hall 2 (Computing Infrastructure): The Computing Power Foundation of AI
- CXL Architecture: Efficient Integration of Cloud Resources
- Cloud Computing and High-Density Servers: Optimization of Computing Power Clusters
- Cloud-Native and Serverless
- Confidential Computing: Data Security and Trust Transfer in the AI Era
Conclusion: Two Bitter Lessons on Foundational Models, Computing Power, and Applications
- The Three Halls of the Yunqi Conference Reflect Two Bitter Lessons
- Lesson One: Foundational Models are Key to AI Applications
- Lesson Two: Computing Power is Key to Foundational Models
Hall 1 (Foundational Models): The Primary Driving Force of AI
At the 2024 Yunqi Conference, Hall 1, though not large in area, attracted the highest density of visitors, showcasing the most cutting-edge foundational model technologies in China.
Video Generation: From Single Generation to Diverse Scenarios
Video generation technology has advanced rapidly in the past two years, especially after the release of Sora. Many companies have launched video generation models, showcasing their capabilities at the Yunqi Conference, covering text-based, image-based, and video-based generation methods. The generated content includes not only videos but also 3D models.
1. From Text-to-Video Only to Multimodal Input Generation
In the early stages, AI video generation mainly followed the “text-to-video” mode, converting a user's text description into a 5-10 second video. While this method helped users achieve visual expression to some extent, it had obvious limitations in style consistency.
Style consistency refers to the uniformity of visual style and action performance across multiple generated video clips. For a complete video work, inconsistent styles can lead to a fragmented viewing experience. For example, an AI-generated video might include multiple different scenes and characters, but due to the diversity of generation algorithms, the characters’ styles and lighting effects between scenes might be inconsistent. This issue is particularly prominent in scenarios requiring high artistic unity, such as advertising and film production. While improving generation efficiency and content diversity, AI models must maintain overall style consistency, which is a significant challenge for video generation technology.
Currently, video generation has gradually expanded from single text generation to supporting more modalities of input, including images, skeleton diagrams, and 3D models. Compared to single text input, image-based video generation can more accurately reproduce character appearances and scene details.
For example, in the advertising industry, users can upload a product image, and the AI model generates corresponding advertising video clips based on the image. This method can generate more complex and dynamic content by combining text prompts.
Some models under development support 3D models as input, allowing text to control the actions of 3D models, which can then be integrated into AI-generated backgrounds. Compared to image input methods, 3D models can achieve more precise style consistency control, making them important for industries like film and gaming that require highly consistent character images.
2. Action Reference Generation: From Static Images to Dynamic Videos
Another significant breakthrough is the action reference generation mode. In this mode, users can upload static images and action skeleton diagrams to generate animated dynamic videos. For example, users upload a static image of a character and combine it with a reference action (such as dancing or walking), and the AI model will recognize and simulate the action skeleton, “animating” the static character to generate an action video.
For example, the model behind Tongyi Dance King, Animate Anyone, can turn static character images into dynamic characters by inputting character images and reference action skeleton diagrams. By matching and animating the skeleton, AI can generate videos that match the reference actions. This technology has already impacted multiple creative fields, especially in short videos, social media, and film animation production, significantly reducing the time and cost of traditional animation production.
Unlike early “text-to-video” methods, image and action skeleton-based generation modes can more accurately control the posture and actions of characters in videos. Especially in character animation, this method allows users to customize the dynamic performance of characters without manually adjusting complex animation frames. Through AI-generated methods, many users can easily create complex dynamic video content with simple image inputs.
3. Lip Sync and Video Generation for Digital Humans
The typical technical route for digital humans is “lip sync,” where users can upload a pre-recorded speaking video, and the AI model will recognize the character’s face in the video. When the digital human needs to speak, the model adjusts the character’s mouth movements to sync with the new input audio. The key is to replace the character’s facial expressions and lip movements without changing the overall background and dynamics of the video. This mode has been widely used in short videos, virtual hosts, digital humans, and especially in real-time interactive scenarios.
However, digital human technology still requires the “real person” to upload a video, which can be difficult in many scenarios. Additionally, this type of digital human can only modify lip movements, not actions or backgrounds, leading to some unnatural situations and limited application scenarios.
Tongyi’s EMO model is a new type of digital human technology that can generate highly realistic speaking videos from a single photo and an audio clip, bringing figures like the Terracotta Warriors and historical characters “back to life.” Tongyi also collaborated with CCTV to launch an AI-generated “Terracotta Warriors Singing” program. Before watching this video, I really couldn’t imagine what it would be like for the Terracotta Warriors to sing.
At the Yunqi Conference, models like EMO and Animate Anyone opened their APIs through the Bailian platform and provided to-C services through the “My Past and Present Life” digital human feature in the Tongyi Qianwen App. In the “My Past and Present Life” feature, users can upload their photos, and AI trains a digital human in 20 minutes, allowing users to have real-time voice and video conversations with their digital avatars.
The principle of the “My Past and Present Life” digital avatar is to first match the user’s photo with an existing character, use DeepFake face-swapping technology to replace the face in a historical character image with the user’s face, and then use the EMO model to convert it into a short speaking video of the digital human. This video is then used to train a digital human, which adjusts the digital human’s lip movements to sync with the synthesized TTS voice during real-time conversations.
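Pieced together from the booth explanation, the pipeline is roughly the sketch below. Every function here is a stub of my own naming, not an actual Bailian or Tongyi API; it only shows how the stages chain together.

```python
# Sketch of the "My Past and Present Life" pipeline as described at the booth.
# Every function body is a stub; the real services sit behind the Bailian
# platform and their APIs are not reproduced here.

class Avatar:
    def __init__(self, seed_video: bytes):
        self.seed_video = seed_video

    def lipsync(self, audio: bytes) -> bytes:
        # Real-time stage: only the mouth region is re-rendered to match the
        # new TTS audio; background and pose stay fixed (stub).
        return self.seed_video

def match_historical_figure(photo: bytes) -> bytes:
    return photo                   # pick a stock historical-character image (stub)

def swap_face(character_image: bytes, user_photo: bytes) -> bytes:
    return character_image         # DeepFake-style face replacement (stub)

def emo_generate(image: bytes, audio: bytes) -> bytes:
    return image                   # EMO: one image + audio -> talking video (stub)

def train_lipsync_avatar(seed_video: bytes) -> Avatar:
    return Avatar(seed_video)      # ~20 minutes of training per the demo (stub)

def build_avatar(user_photo: bytes, sample_audio: bytes) -> Avatar:
    character = match_historical_figure(user_photo)
    swapped = swap_face(character, user_photo)
    seed_video = emo_generate(swapped, sample_audio)   # offline, GPU-heavy step
    return train_lipsync_avatar(seed_video)            # enables real-time chat

avatar = build_avatar(b"<user photo>", b"<sample audio>")
```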
This digital avatar technology route has two drawbacks:
- The current EMO model focuses on facial expressions and can only generate short speaking videos, not complex actions like dancing, and the background is fixed. To generate complex actions and background changes, like in movies, a general video generation model is needed.
- Both the EMO model and general video generation models based on diffusion models require high computing power and cannot achieve real-time video generation. Therefore, for real-time interactive digital humans, traditional digital human technology is still needed for lip sync.
Speech Recognition and Synthesis
Advancements in speech technology mainly fall into two directions: speech recognition and speech synthesis. Both are important components of multimodal technology, each with different technical challenges and development trends. This section will explore the principles, performance challenges, and current status of speech recognition and speech synthesis in detail.
1. Speech Recognition Technology
The core of Automatic Speech Recognition (ASR) technology is to convert speech signals into text content. Current mainstream speech recognition models like Whisper and Alibaba Cloud’s FunAudioLLM have made significant progress in accuracy.
Compared to some overseas models, the main advantage of domestic speech recognition models like Alibaba’s is their support for dialect recognition, whereas overseas models typically only support standard Mandarin.
Currently, speech recognition technology still faces several challenges, mainly the accuracy of recognizing specialized terms, emotion recognition, and latency issues.
- Accuracy Issue: Due to the smaller size of speech recognition models and their knowledge bases, the accuracy of recognizing specialized terms is not high.
- Emotion Recognition Issue: Most existing speech recognition models cannot output the emotions expressed in speech. If an application requires emotion recognition, an additional classification model is needed.
- Latency Issue: Real-time speech interaction requires streaming recognition, and latency is a key indicator. Streaming recognition latency has two metrics: the delay of the first word output and the delay for the recognition result to stabilize. The first word delay refers to how long it takes for the recognition model to output the first word after hearing the first word of a speech segment; the stabilization delay refers to how long it takes for the model to provide a final stable text result after the entire sentence is spoken. Older recognition models like Google’s streaming recognition model, despite having slightly lower recognition rates, have relatively shorter latency, typically not exceeding 100 milliseconds. Newer recognition models, while improving recognition accuracy, also have higher latency, generally between 300 to 500 milliseconds. Although this latency may seem short, it can significantly impact user experience in end-to-end systems. Ideally, speech recognition latency should be controlled within 100 milliseconds.
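To make these two latency metrics concrete, here is a small sketch that computes both from a log of timestamped partial recognition results; the event format is my own assumption, not any particular vendor's API.

```python
from dataclasses import dataclass

@dataclass
class PartialResult:
    t: float     # seconds since the speaker started talking
    text: str    # cumulative hypothesis emitted at time t

def streaming_latency(events, speech_start, speech_end):
    """Compute first-word delay and stabilization delay for streaming ASR."""
    # First-word delay: first non-empty partial result after speech starts.
    first_t = next(e.t for e in events if e.text.strip())
    # Stabilization delay: from the end of speech until the hypothesis
    # last changed to its final value.
    final_text = events[-1].text
    stable_t = events[-1].t
    for e in reversed(events):           # walk back while the text is final
        if e.text != final_text:
            break
        stable_t = e.t
    return first_t - speech_start, max(0.0, stable_t - speech_end)

# Toy trace: speech spans 0.0-2.0 s; the hypothesis settles at 2.35 s.
events = [PartialResult(0.42, "hel"), PartialResult(0.80, "hello"),
          PartialResult(2.35, "hello world"), PartialResult(2.50, "hello world")]
fw, stab = streaming_latency(events, speech_start=0.0, speech_end=2.0)
print(f"first-word delay {fw*1000:.0f} ms, stabilization delay {stab*1000:.0f} ms")
```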
2. Speech Synthesis Technology
Speech synthesis (Text-to-Speech, TTS) encompasses various task scenarios, including fixed voice synthesis and voice cloning. Voice cloning is further divided into synthesis based on a large amount of reference speech and synthesis based on a small amount of reference speech.
Starting with GPT-soVITS, there have been significant advancements in speech synthesis technology, especially with end-to-end Transformer-based models like ChatTTS, Fish Speech, and Alibaba’s Cosy Voice. These models have greatly improved in terms of naturalness of pronunciation and few-shot voice cloning capabilities compared to traditional models like VITS.
Similar to speech recognition, speech synthesis also faces performance challenges. The main issues are synthesis speed and latency. Two key metrics are Real-Time Ratio (RTR) and Time to First Token.
- Real-Time Ratio: the ratio of synthesis time to the duration of the audio produced. For example, CosyVoice can achieve an RTR of 0.6 on a V100 GPU, meaning it takes only 6 seconds to synthesize 10 seconds of speech, fast enough to support real-time voice calls. The speech API service on the Bailian platform has made some inference optimizations and outperforms the open-source version. (See the measurement sketch after this list.)
- Time to First Token: This refers to the time taken from the start of synthesis to the generation of the first audio segment. While older models like VITS and GPT-SoVITS can achieve a Time to First Token of less than 1 second, newer models like ChatTTS and CosyVoice, despite producing more human-like speech, have longer delays, often requiring 1 second or more for the first token.
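A minimal sketch of how these two metrics could be measured against a streaming TTS interface; the chunked-synthesis interface and the numbers in the toy stand-in are assumptions, not a real vendor API.

```python
import time

def measure_tts(synthesize_stream, text: str):
    """Measure first-packet latency and Real-Time Ratio for a streaming TTS
    interface that yields audio chunks annotated with their duration in
    seconds (an assumed interface)."""
    start = time.monotonic()
    first_packet_latency = None
    audio_seconds = 0.0
    for chunk_duration in synthesize_stream(text):
        if first_packet_latency is None:
            first_packet_latency = time.monotonic() - start
        audio_seconds += chunk_duration
    wall = time.monotonic() - start
    rtr = wall / audio_seconds    # 0.6 means 6 s of compute per 10 s of audio
    return first_packet_latency, rtr

def fake_stream(text: str):
    """Stand-in synthesizer: 10 chunks of 1 s audio, ~60 ms compute each."""
    for _ in range(10):
        time.sleep(0.06)
        yield 1.0

latency, rtr = measure_tts(fake_stream, "hello")
print(f"first packet after {latency:.2f} s, RTR = {rtr:.2f}")
```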
In terms of voice cloning, the best results are still achieved by fine-tuning models based on a large amount of reference speech, requiring a substantial amount of training data (tens of minutes). Synthesis based on a small amount of reference speech (a few seconds to a dozen seconds), known as “zero-shot synthesis,” still needs improvement.
From hands-on experience at the venue, the best zero-shot synthesis model currently is Fish Speech 1.4, which also has good inference performance, achieving latency as low as the previous-generation VITS technology, with an RTR around 0.1 (i.e., it takes only 1 second to synthesize 10 seconds of speech).
3. Music Synthesis Technology
Music synthesis technology differs somewhat from speech synthesis. Current music synthesis technologies, such as AI singing and AI instrument playing, produce results that are even more realistic than speech synthesis. Live experiments at the Yunqi Conference showed that most people find it difficult to distinguish between AI-generated music and human singing or playing. In contrast, current speech synthesis is still easily recognizable as AI-generated content.
Although there are more researchers in speech synthesis and its potential applications are broader, music synthesis is more mature, with several AI music synthesis applications having tens of millions of users both domestically and internationally.
4. Future Direction: End-to-End Multimodal Models
The ultimate goal of speech technology is to integrate speech recognition, large model responses, and speech synthesis into a single end-to-end large model, achieving multimodal interaction similar to GPT-4o. Alibaba is also set to release a similar end-to-end multimodal model. According to on-site engineers, although end-to-end models achieve lower latency, they always have some hard-to-solve corner cases, so the user experience is currently still not as good as a pipeline in which speech recognition, the large text model, and speech synthesis are each optimized separately.
Agent Technology
Alibaba has open-sourced Mobile Agent, a vision-based mobile smart assistant. Through this smart assistant, users can automate mobile operations such as opening apps, ordering food, sending WeChat messages, and sending emails. Mobile Agent uses a vision-based approach, unlike App Agents that use XML element trees, so Mobile Agent can directly control various apps without training or fine-tuning for specific applications, and it works with common domestic and international apps.
Currently, the main bottleneck of Mobile Agent is the task execution speed. Due to its Multi-Agent architecture, each mobile interface interaction requires 3-4 calls to the large model for inference, with each call having an end-to-end latency of 3-5 seconds, resulting in a delay of over ten seconds for a single mobile interface operation, much slower than human reaction speed. An operation like ordering food requires multiple mobile interface interactions, taking several minutes in total. Of course, using models with lower latency can speed up Mobile Agent’s interactions.
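My reading of such a vision-based agent loop, reduced to a sketch: each stubbed model call below stands for one LLM round trip, which is where the 3-4 calls at 3-5 seconds each per screen interaction come from. The role split and function names are my simplifications, not Mobile Agent's actual code.

```python
# Simplified sketch of a vision-based mobile agent loop; all functions are
# stubs of my own design, not Mobile Agent's actual implementation.

def screenshot() -> bytes:
    return b""                                   # grab the current screen (stub)

def plan(task: str, history: list) -> str:
    # Planner LLM: decide the next subgoal (one 3-5 s model call)
    return "DONE" if history else "open the food-delivery app"

def locate(image: bytes, subgoal: str) -> tuple:
    # Grounding VLM: find where to tap on the screenshot (one 3-5 s model call)
    return (540, 1200)

def reflect(image: bytes, subgoal: str) -> bool:
    # Reflection LLM: check whether the action succeeded (one 3-5 s model call)
    return True

def tap(xy: tuple) -> None:
    print(f"tap at {xy}")                        # inject the touch event (stub)

def run_task(task: str, max_steps: int = 20) -> None:
    history: list = []
    for _ in range(max_steps):
        img = screenshot()
        subgoal = plan(task, history)            # model call 1
        if subgoal == "DONE":
            return
        tap(locate(img, subgoal))                # model call 2
        ok = reflect(screenshot(), subgoal)      # model call 3 -> 3 x 3-5 s
        history.append((subgoal, ok))            # >10 s for one screen action

run_task("order lunch on a delivery app")
```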
Additionally, I suggested to the Mobile Agent team that the scenarios with greater user demand are not simple tasks like ordering food or sending WeChat messages, but tasks requiring mechanical repetition, such as price comparison, information collection, and bulk messaging.
Price Comparison: Price comparison is one area where App assistants can significantly improve efficiency. Currently, users often need to manually browse multiple e-commerce platforms, compare prices one by one, and check for discounts. This process is time-consuming and prone to errors. In the future, AI assistants can automatically aggregate information from different platforms based on user needs and provide personalized recommendations. For example, when buying a phone, the assistant can automatically collect prices and promotional information from major platforms and summarize the best purchase channels.
Information Collection: App assistants can help users search for and summarize information from multiple sources. For example, if a user needs to understand the latest market trends for a product, the AI assistant can automatically monitor relevant websites and news channels and promptly push related information.
Batch Repetitive Operations: App assistants are not just for executing a single task but can handle similar tasks in batches. For example, when users need to send messages in bulk, manage orders in bulk, or process files in bulk, the smart assistant can help complete these repetitive tasks. In the future, smart assistants will not only be automation tools on mobile phones but also work partners that help users handle a large number of repetitive tasks.
Inference Technology: The Driving Force Behind a Hundredfold Cost Reduction
Another highlight at the Yunqi Conference was the breakthrough in inference technology. Compared to the initial version of the GPT-4 model, GPT-4o mini can achieve the same user experience but with a hundredfold reduction in inference cost and a tenfold increase in output speed. In the past, high inference costs often deterred enterprises, especially in consumer-facing applications, where high costs prevented large-scale adoption of large models. Now, with the significant reduction in inference costs, large models can be made available to ordinary users.
The reduction in inference costs involves a lot of complex engineering and algorithm optimization work.
- Maturity of Quantization Technology: Through model compression and quantization techniques, inference efficiency has been greatly improved.
- Application of MLA (Multi-head Latent Attention) and Mixture of Experts (MoE) Models: These technologies improve both training and inference efficiency. MLA compresses attention keys and values into low-rank latent vectors, shrinking the KV cache and memory traffic, while MoE models activate only a small subset of experts per token, reducing ineffective computation and significantly shortening computation time.
- Prefix Cache Technology: By caching parts of the results (KV Cache) already computed during model inference for reuse in future inference tasks, redundant computation costs are significantly reduced. This caching technology is particularly suitable for multi-turn conversations or long text generation scenarios, greatly improving inference speed. Alibaba Cloud uses low-cost SCM (Storage-Class Memory) to build a CXL memory pool, caching KV Cache in the low-cost memory pool to reduce the cost of recomputing KV. Due to the high bandwidth of SCM and CXL, the cache loading latency is much lower compared to caching on SSDs.
- Prefill/Decode Separation Technology: The prefill stage is compute-intensive while the decode stage is memory-bandwidth-intensive. Running the two stages separately allows each to fully utilize the computational power or memory bandwidth of the GPU model best suited to it. (See the sketch after this list.)
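A toy sketch combining the last two ideas: a prefix cache that lets a later turn reuse the “KV cache” computed for an earlier prompt, with prefill and decode as separate stages. Plain tuples stand in for per-layer key/value tensors, and nothing here reflects Alibaba Cloud's actual implementation.

```python
# Toy sketch: prefix caching plus prefill/decode separation. Plain tuples
# stand in for per-layer key/value tensors; this illustrates the idea only.

_prefix_cache: dict[tuple, tuple] = {}    # prompt tokens -> cached "KV"

def prefill(tokens: tuple) -> tuple:
    """Compute-intensive stage: process the prompt in parallel, reusing the
    longest previously cached prefix instead of recomputing it."""
    for cut in range(len(tokens), 0, -1):
        if tokens[:cut] in _prefix_cache:
            kv, rest = _prefix_cache[tokens[:cut]], tokens[cut:]
            break
    else:
        kv, rest = (), tokens
    kv = kv + rest                        # "attend" over the uncached remainder
    _prefix_cache[tokens] = kv            # save for later turns
    return kv

def decode(kv: tuple, max_new: int) -> list:
    """Memory-bandwidth-intensive stage: one token at a time, re-reading the
    whole KV cache at every step."""
    out = []
    for _ in range(max_new):
        token = len(kv) % 100             # stand-in for sampling the next token
        out.append(token)
        kv = kv + (token,)
    return out

turn1 = tuple(range(100))                 # first-turn prompt
reply = decode(prefill(turn1), max_new=4)
turn2 = turn1 + tuple(reply) + (42,)      # multi-turn prompts grow monotonically
decode(prefill(turn2), max_new=4)         # turn1's KV is found and reused
```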
Hall 3 (Applications): AI Moving from Demos to Various Industries
In Hall 3 of the Yunqi Conference 2024, the focus was on AI application scenarios. Unlike Hall 1, which focused on foundational model displays, Hall 3 explored how AI is moving from the lab to real-world industry applications. From intelligent design tools to digital humans, to industrial robots and autonomous driving, the exhibits in Hall 3 were full of explorations of AI’s specific applications across various industries.
As I explored Hall 3, I increasingly felt that AI applications are reaching a critical turning point: moving from showcasing innovative technology to truly empowering industries. The key to application landing is whether it can solve industry pain points and effectively combine with the deep needs of specific industries.
AI-Generated Design: A New Paradigm for Generative AI
AI is reshaping design processes and methods: from automatically generating PPTs to more complex generation of mixed text-and-image content, a variety of applications are changing traditional design workflows.
Early chatbots merely presented tokens generated by large models in text form. Although simple formatting methods like Markdown and code syntax highlighting were used, this pure text presentation was not suitable for human reading habits. Now, AI-generated design has taken an important step forward by presenting tokens generated by large models in more intuitive forms like charts and images, greatly enhancing user experience.
PPT Generation (Tongyi Qianwen)
Tongyi Qianwen’s newly launched PPT generation feature parses user-uploaded documents or PDFs, generates content outlines, and fills in content based on existing layouts in the template library, creating PPTs. It can quickly and effectively summarize and organize information, making it particularly suitable for internal reports or quickly generating initial PPT frameworks.
Working Principle: Users first upload a PDF file and select a PPT template. The AI system summarizes the PDF file, generates an outline, and then selects suitable templates from the layout library to fill in text or generated image content, forming a complete PPT.
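Mapped onto code, the working principle above is roughly the pipeline below; every function is a hypothetical placeholder rather than the actual Tongyi Qianwen implementation.

```python
# Rough sketch of the document-to-PPT pipeline described above. Every
# function is a hypothetical placeholder, not the real Tongyi Qianwen API.

def extract_text(pdf_path: str) -> str:
    return "full text of the uploaded PDF"        # PDF parsing (stub)

def summarize_to_outline(text: str) -> list:
    # LLM call: condense the document into a slide outline (stub)
    return [{"title": "Overview", "bullets": ["key point 1", "key point 2"]}]

def pick_layout(slide: dict, template: str) -> str:
    # Choose a layout from the template's built-in layout library (stub)
    return f"{template}/title-and-bullets"

def render_deck(outline: list, template: str) -> list:
    slides = []
    for slide in outline:
        layout = pick_layout(slide, template)
        # Fill the chosen layout with generated text (and generated images)
        slides.append(f"[{layout}] {slide['title']}: {slide['bullets']}")
    return slides

deck = render_deck(summarize_to_outline(extract_text("report.pdf")), "corporate-blue")
print("\n".join(deck))
```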
Current Drawbacks:
- It cannot reuse existing charts and images from the document, especially professional charts and example figures.
- It cannot automatically generate charts based on data in the PDF.
- Customization capabilities are limited, only allowing selection from built-in PPT templates, which affects its use in formal settings with specific company templates.
Graphical Chat Assistant (Kimi’s Mermaid Charts)
Kimi's chat assistant has introduced a chart-generation feature, particularly for Mermaid diagrams: the model generates chart code, and the interface automatically renders the corresponding chart. This graphical approach greatly enhances the reading experience, providing more intuitive, visual expression than pure text.
- Working Principle: After the user inputs data, the system parses the generated chart code, such as Mermaid code, and then automatically draws the chart, providing a visual information display.
- Current Drawbacks: The current chart generation still faces some flexibility issues. Fine control of chart layout and drawing complex graphics need further optimization.
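On the front-end side, this only requires detecting fenced chart code in the model's output and handing it to a renderer. A minimal sketch (the rendering itself is left as a placeholder):

```python
import re

FENCE = "`" * 3   # three backticks, built dynamically so this sketch nests cleanly

# Matches fenced mermaid blocks in an assistant reply
MERMAID_BLOCK = re.compile(FENCE + r"mermaid\n(.*?)" + FENCE, re.DOTALL)

def split_reply(reply: str) -> list:
    """Split a reply into plain-text segments and Mermaid chart sources so the
    UI can render each chart inline instead of showing raw code."""
    parts, last = [], 0
    for m in MERMAID_BLOCK.finditer(reply):
        parts.append(("text", reply[last:m.start()]))
        parts.append(("chart", m.group(1)))      # hand this to a Mermaid renderer
        last = m.end()
    parts.append(("text", reply[last:]))
    return parts

reply = (
    "The flow is:\n"
    f"{FENCE}mermaid\ngraph TD; A[Question] --> B[Answer]\n{FENCE}\n"
    "Hope this helps."
)
for kind, body in split_reply(reply):
    print(kind, "->", body.strip())
```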
Displaying Generated Content in Image Form (New Interpretation of Chinese)
The New Interpretation of Chinese application greatly enhances users’ desire for social sharing by displaying the generated text content in the form of game cards. Users input a word, and the system not only generates interesting anecdotes related to it but also designs it into exquisite cards, giving the content visual appeal.
Working Principle: The system first generates the text content, then fits the text into a design template, and through the combination of background images and layout, generates visual cards that encourage users to share.
Interestingly, the process of fitting the design template in the New Interpretation of Chinese application is not done outside the large model but is achieved using Claude’s code generation and Artifacts capabilities. This shows that a general chat assistant can directly become a development and deployment platform for small applications by leveraging code generation and Artifacts capabilities.
The Claude prompt for New Interpretation of Chinese is as follows:
```lisp
(defun 新汉语老师 ()
  ...)
```
Current Drawbacks: The current generation method relies on fixed design templates, with limited capabilities for personalization and custom design. There is still significant room for improvement in dynamically adjusting the visual effects of image generation according to different user needs.
Design Draft Generation (Motiff)
In the application showcase at Hall 3, AI design tools were a highlight. The main function of these tools is to help designers generate user interfaces or graphic designs through simple prompts, greatly improving design efficiency and reducing manual labor. For example, AI tools can automatically generate interfaces in different styles based on prompts and fill suitable UI elements into the design drafts.
Motiff is a representative AI design tool showcased in Hall 3. Motiff is an AI-assisted design application started by the Yuanfudao team in 2023. Users can input a text requirement, and the AI will generate a design draft.
For example, input the following prompt:
An AIGC (AI-generated) content sharing community. Composed of 5 main pages:
- The homepage displays all user-created images and videos, supporting category filtering and different sorting methods;
- The content page displays an image or video, the AI model and prompt used to generate the video, and users can comment below the image or video, or like or dislike the image;
- In the creation page, users can select an AI model, input a prompt, and generate an image or video;
- Generation result page: After generation is complete, users can preview the result, regenerate if unsatisfied, or click publish if satisfied;
- Personal center page: Displays all images and videos created by the user, with support for deletion.
Within 3 minutes, you get the design draft below, which makes a decent prototype, at least better looking than any interface I would design myself; the full generated draft can be viewed in Motiff.
Current Drawbacks:
- Currently, both Motiff and Figma AI rely on existing template libraries; they cannot follow a project's existing design system or ensure design consistency across multiple interfaces. For example, the above design draft is closer to the style of Xiaohongshu, while the weather application draft generated by Figma AI is closer to Apple's style.
- Motiff AI currently does not support automatically modifying existing design drafts based on user prompts.
- The design drafts generated by AI design tools are static and cannot automatically generate user interaction logic.
- When exporting AI-generated design drafts into HTML, CSS, or React front-end code, there are issues with code clutter and unclear logic, making it unusable directly by front-end engineers for actual development. Pasting the design draft images into GPT-4o also cannot yield front-end code that fully matches the style. How to automatically generate clean front-end code from design drafts remains a challenge.
Application Prototype Generation (Anthropic Claude)
Anthropic Claude can not only generate text content but also generate complete, runnable application prototypes based on user instructions. This greatly simplifies the application development process, significantly shortening the time from conception to realization.
Working Principle:
- User provides application requirement description
- Claude analyzes the requirements and generates front-end and back-end code
- Automatically deploys the generated code, creating a directly runnable application
Main Advantages:
- High Code Quality: Claude generates code with clear structure and complete comments, which can run directly without much manual adjustment.
- Full-Stack Capability: It can generate not only front-end interfaces but also back-end APIs and database structures.
- Rapid Iteration: Users can request Claude to modify or add features through conversation, achieving rapid prototype iteration.
- File Generation: In addition to code, it can generate various formats of documents, configuration files, etc., supporting direct preview and download.
Application Cases:
- Web Applications: Users describe a simple blog system, and Claude can generate a complete website with an article list, detail page, and comment function within minutes.
- Mobile Applications: Through simple instructions, Claude can generate React Native code, quickly creating cross-platform mobile application prototypes.
- Mini-Games: Claude can generate HTML5 mini-games that run directly in the browser based on users’ game ideas, making it easy to share on social media.
- Data Visualization: Users provide datasets and visualization requirements, and Claude can generate interactive data visualization applications.
Social Sharing:
Claude-generated application prototypes can be easily deployed to online platforms, and users can share them on social networks via links. This instant creation and sharing model greatly increases Claude’s exposure and user stickiness.
AI Travel Assistant: Data is Key
There are currently many AI applications for travel assistant scenarios, such as:
- GenSpark.AI integrates web search results and uses AI to generate Wiki pages about each tourist attraction and city. Users can search these Wiki pages by keyword or create and edit such pages themselves. In the three months since launch, over 7 million Wiki pages have been generated. I knew about GenSpark before; it is not limited to travel scenarios but is a general research-report-writing tool. For example, if I input my name, it can generate a detailed Wiki page.
- Go China is a travel assistant for overseas tourists, automatically collecting data from government-published public accounts, official scenic spot websites, and specific information channels, and then generating high-quality travel guides based on this data. It helps users get the latest information about scenic spots in real-time, such as current exhibitions, event arrangements, and whether the attractions are open.
- Huangshan Travel Assistant collaborates with the Huangshan scenic area, collecting data on Huangshan’s signature attractions, routes, food, accommodation, and transportation. Users can take a photo in the scenic area and get a detailed introduction of the photo and recommended routes to nearby attractions.
I noticed that the key to these AI travel-related applications providing timely and comprehensive guide services lies in their information sources and update mechanisms. These applications automatically collect data from government-published public accounts, official scenic spot websites, and specific information channels, integrating this scattered information into the application.
This intelligent data processing method is particularly suitable for tourist guide services. Especially in a country like China, rich in tourism resources and with complex scenic area management, many local scenic areas rely on local government or related institutions’ public accounts and websites for information release. It is difficult for tourists to get all the information through a unified platform, especially for specific exhibit introductions, exhibition times, or special event arrangements at some niche attractions, which may be hidden in hard-to-search places. AI applications can automatically identify these information sources and update them in real-time, ensuring users get the latest and most accurate guide content.
Non-Consensus: Data is More Important than Technology
In communication with relevant vendors, I found an interesting phenomenon: Despite the progress in AI technology making digital humans and guide tools appear more intelligent, their actual utility largely depends on the richness of data and timeliness of updates. If the application cannot timely obtain key data sources, no matter how intelligent the AI interaction interface is, the actual experience of the application will be greatly reduced.
First, traditional applications rely solely on general search engines to retrieve public data, while a large amount of data in China is in “silos” on closed platforms, such as WeChat public accounts, Xiaohongshu, Douyin, etc. These platforms are not friendly to search engines, making it difficult for AI applications to obtain this data. Secondly, data on many niche attractions’ exhibits is not directly available on the public network, and must be obtained through direct contact with the scenic area or cultural institutions.
To solve this problem, AI guide applications must deeply cooperate with various scenic areas and museums to obtain exclusive exhibit data and real-time updated scenic area dynamic information. Interestingly, most scenic areas and museums are willing to cooperate with AI applications because developing an app for a scenic area using traditional outsourcing methods requires a lot of manpower and resources, while cooperating with AI companies can greatly reduce development costs.
Digital Humans and Virtual Avatars: The Challenge of Not Looking Like Real People
Digital Humans that Did Not Pass the Turing Test
With multimodal capabilities, the voice and video interaction experience of digital humans and virtual avatars seems smooth, but AI still shows obvious limitations.
- Although digital human technology can simulate human facial expressions, actions, and even voices, facial expressions and voices are not natural enough, making it obvious that they are AI-generated.
- The reaction speed of AI digital humans is slower than that of humans. For example, some virtual tour guides in the showcase can recognize the user’s position through the camera and follow the user’s line of sight, but their response speed is still slow, incomparable to real tour guides. This technical limitation mainly comes from the current digital human technology stack not using end-to-end large models but a pipeline composed of multiple models such as speech recognition, large models, speech synthesis, and lip-syncing, resulting in high end-to-end latency.
- Large models perform well in executing simple tasks, but when faced with open-ended questions or emotional interactions from users, the content generated by AI often lacks depth and flexibility. Many AIs still rely on fixed template dialogue frameworks, making it difficult to generate truly personalized responses based on users’ immediate needs. For example, in a travel guide scenario, users may ask in-depth questions about local history or culture. If the AI has not pre-connected to the relevant knowledge base, it often cannot give satisfactory answers and may even fall into repetitive dialogue logic.
AI Outbound Calls: Using Real Human Recordings Instead of TTS
A commercially successful application scenario is AI outbound calls. Two companies in Hall 3 showcased AI-based outbound call systems, with applications including telemarketing, customer service, and after-sales follow-up. The core function of these systems is to simulate conversations with users through AI models, reducing human involvement while increasing outbound efficiency. However, this technology also faces significant challenges, especially in making AI-generated voices sound more human-like to reduce users’ wariness of robots.
Currently, the main issue with outbound calls is that the generated voice cannot fully mimic a human, making it easy for users to recognize it as a machine, thus affecting user experience and communication effectiveness. To address this issue, some companies have adopted a combination of generated scripts and human recordings. That is, the AI system generates standardized dialogue frameworks and script templates, while humans are responsible for recording these templates. This way, in outbound calls, users still hear human voices instead of AI-generated synthetic voices.
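In effect, the LLM is constrained to choose among a finite set of scripted branches that voice actors have pre-recorded. A minimal sketch of that dispatch logic, with invented intents and file names:

```python
# Minimal sketch of the "generated script + human recordings" approach: the
# model only selects among pre-recorded branches, so callers always hear a
# human voice. Intents and file names are invented for illustration.

RECORDINGS = {
    "greeting":     "audio/greeting.wav",
    "pitch":        "audio/pitch.wav",
    "handle_price": "audio/price_objection.wav",
    "handle_busy":  "audio/callback_later.wav",
    "goodbye":      "audio/goodbye.wav",
}

def classify_intent(user_speech: str) -> str:
    """ASR + LLM stub: map the user's reply onto one scripted branch."""
    text = user_speech.lower()
    if "price" in text or "expensive" in text:
        return "handle_price"
    if "busy" in text or "later" in text:
        return "handle_busy"
    return "pitch"

def next_audio(user_speech=None) -> str:
    intent = "greeting" if user_speech is None else classify_intent(user_speech)
    # Anything off-script falls back to a canned branch, which is exactly
    # why these calls feel rigid in complex conversations.
    return RECORDINGS.get(intent, RECORDINGS["goodbye"])

print(next_audio())                         # audio/greeting.wav
print(next_audio("that sounds expensive"))  # audio/price_objection.wav
```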
Although this method increases the “humanization” of outbound calls, it also brings new challenges. First, the cost of human recordings is relatively high, especially in large-scale outbound scenarios where the diversity of recording content requires additional resource investment. Second, since the recordings are pre-prepared, they cannot be personalized based on users’ immediate feedback, making outbound calls appear rigid and inflexible in handling complex conversations.
Non-consensus: Although outbound call systems have improved the naturalness of conversations through a combination of human recordings and generated scripts, this method still cannot match humans in handling complex, personalized conversations. For AI to truly achieve large-scale application in outbound scenarios, breakthroughs in voice synthesis quality, dialogue logic flexibility, and real-time response speed are necessary to truly replace human outbound operations.
Humanoid Robots: Dual Bottlenecks of Mechanics and AI
Humanoid robots have always been a symbol of sci-fi movies and future technology. Hall 3 of the Yunqi Conference showcased several humanoid robot projects from different manufacturers. However, after communicating with multiple exhibitors, I found that the current technical bottleneck of humanoid robots is not solely from mechanical structures but more importantly from the limitations of AI algorithms, especially the shortcomings of large models and traditional reinforcement learning.
The Dilemma Between Traditional Reinforcement Learning and Large Models
Currently, humanoid robots mainly rely on traditional reinforcement learning algorithms to complete task planning and motion control. These algorithms can optimize the robot’s action path in the environment through continuous trial and error, ensuring the completion of specified tasks. However, despite the excellent performance of reinforcement learning in laboratory environments, it exposes robustness issues in practical applications. Traditional reinforcement learning algorithms cannot flexibly respond to sudden situations and changing tasks in complex, dynamic real-world environments, making them inadequate in applications requiring high flexibility.
In recent years, large models, with their capabilities in natural language understanding, multi-modal perception, and complex reasoning, theoretically can enhance the intelligence of humanoid robots in task execution. However, the reasoning speed of large models is relatively slow, with high latency, making it difficult to meet the real-time response requirements of humanoid robots in practical scenarios; additionally, large models before OpenAI o1 were not good at complex task planning and reasoning. Compared to planning algorithms based on traditional reinforcement learning, large models still cannot effectively generate complex, long-chain task planning when reasoning complex task sequences.
AI is Currently the Biggest Bottleneck for Humanoid Robots
I once thought that the biggest obstacle to the development of humanoid robots was mechanical limitations, especially in precise mechanical control and motion flexibility. However, after in-depth discussions with multiple manufacturers, I found that this perception needs to be corrected. In fact, current humanoid robot mechanical technology has made considerable progress, with continuous optimization in precision, speed, and cost, especially with the rise of domestically produced components significantly reducing mechanical costs. For example, many key robot components no longer rely on expensive foreign parts, with Shenzhen manufacturers providing components that perform as well as imports, greatly reducing overall costs. However, these mechanical advancements have not led to the expected explosion in robot applications, with the main bottleneck instead appearing in AI.
In actual robot control, the reasoning speed and complex planning capabilities of AI are the core issues limiting robot flexibility and precision. Currently, AI can only solve simple tasks such as voice dialogue and basic perception relatively well, but its performance is still unsatisfactory in handling complex planning tasks and real-time adjustments. Traditional reinforcement learning performs poorly in path planning, while the reasoning capabilities of large models have not met expectations. The OpenAI o1 model has recently made good progress in reasoning capabilities using reinforcement learning, but its interaction latency still cannot meet the real-time requirements of robots.
Non-consensus: The real bottleneck in the current humanoid robot field is AI reasoning capabilities. In the future, if the reasoning speed and capabilities of large models can be significantly improved, combining reinforcement learning with large models, we may see robots becoming more flexible, robust, and efficient in complex tasks.
Autonomous Driving: Finally Mature Enough for Commercial Use
Autonomous driving has been a hot topic for many years and has recently matured to the point where it can be commercially deployed on a large scale. In fact, I feel that the AI requirements for autonomous driving are very similar to those for humanoid robots, both needing strong real-time perception and path planning capabilities.
In the Hall 3 exhibition, Tesla showcased its latest Full Self-Driving (FSD) system, which achieves a highly automated driving experience. The futuristic electric pickup Cybertruck, which launched with FSD, was also on display; it was my first time seeing a real Cybertruck.
Tesla’s FSD system is based on a pure vision route for autonomous driving, which contrasts sharply with other companies that use LiDAR or multi-sensor fusion. Tesla’s display indicates that the vision-first route has significant cost advantages in certain application scenarios because it does not rely on expensive LiDAR equipment, allowing for faster mass-market adoption.
However, the vision-first autonomous driving route also faces skepticism. Although vision systems trained on extensive data can recognize their surroundings accurately in ideal environments, vehicles relying solely on vision often struggle to make accurate judgments in complex weather (such as fog or heavy rain) or extreme lighting conditions.
Smart Consumer Electronics: High Expectations, Slow Progress
Perhaps because the Yunqi Conference is not a hardware exhibition, the display of smart wearable devices was relatively limited, with only a few AR/VR manufacturers.
Domestic consumer electronics products rely on supply chain advantages, with low price as their main selling point. For example, the Vision Pro is priced at 30,000 RMB, while manufacturers like Rokid charge only four to five thousand RMB; although the experience still lags well behind the Vision Pro, the queue to try it was long. Humane's AI Pin is priced at $699, while domestic competitors with similar functions cost just over $100.
Most companies developing large models agree that the smart assistant on mobile phones is a good entry point for AI. But for some reason, top phone manufacturers like Apple and Huawei, despite their deep accumulation across the whole system stack, have made relatively slow progress on self-developed foundational models and large model applications. Not only are their large model products slow to arrive, their internal adoption of large models for coding and office work is also relatively conservative.
Alibaba has widely adopted Tongyi large models to assist in coding. For example, the ModelScope AIGC section (a community for experiencing AIGC models, similar to Civitai and Stable Diffusion WebUI) is built by a team of only 8 people, all working on AI Infra and algorithms, with no designers or front-end engineers. After spending a year or two polishing algorithms and optimizing performance, they learned React in just two months and completed the development and launch of the website's front and back ends, something that would have been impossible without AI-assisted programming. In contrast, only some Huawei employees voluntarily use AI tools to assist in programming, and such tools have not been integrated into the software development process; moreover, due to information security concerns, the company restricts the use of world-leading models like GPT.
For everyday AI assistants, the mobile phone form may not be the most suitable because its input and output forms cannot meet the needs of AI’s multi-modal capabilities. Users need to hold the phone when using the AI assistant, and the posture of holding the phone is usually not conducive to the camera seeing the environment in front of them.
I believe that the goal of AI assistants is to explore the larger world with users. Smart wearable devices will change the paradigm of human-computer interaction, allowing AI assistants to see, hear, and speak, interacting with humans naturally in a multi-modal way.
Since movies like “Her,” smart wearable devices have carried high expectations, but no sufficiently good product has appeared so far. AR/VR/spatial-computing products can input and output video and audio, solving the problem of multimodal interaction, but they are only suitable for indoor scenarios and are inconvenient to wear during outdoor activities. Products like the AI Pin require holding out your hand as a screen, which is impractical. This also indicates that the story of large models has just begun, and there is still much room for exploration in product forms. When Zuckerberg announced AI Studio, he said that even if the capabilities of foundational models stopped progressing, product forms would still need five years to evolve, let alone now, when foundational model capabilities are still rapidly advancing.
AI-Assisted Operations: From Hotspot Information Push to Fan Interaction
With the increasing demand for corporate social media operations, AI-assisted automated operation systems are gradually becoming key tools for improving work efficiency.
Automatically Generating and Pushing Hotspot Content
In daily social media management, operators need to keep an eye on industry trends, capture hotspot information, and quickly generate relevant content to post on platforms like Twitter and WeChat. AI’s powerful information collection and summarization capabilities have greatly optimized this process. Through automated analysis, the system can filter out hotspot content related to the enterprise from massive information and automatically generate concise news, tweets, and other text content, helping operators quickly respond to market hotspots.
Automatically Interacting with Fans
AI can also help enterprises interact with fans on social media platforms. Traditional fan interaction requires a lot of time and effort, especially when the number of fans is large, making it difficult for enterprises to respond to each one. AI systems can automatically generate personalized replies based on users’ comments and questions. This interaction can not only improve fan satisfaction but also greatly enhance operational efficiency.
However, despite AI’s significant optimization of operational processes, the content generated by the system still lacks precision and personalization in actual operations, especially in fan interactions, where AI’s responses often appear “mechanical” and lack human touch. This is a major bottleneck in AI’s application in operations.
Non-consensus
Although AI’s role in social media operations is significant, its effectiveness is still limited by the current model’s reasoning and generation capabilities. AI-generated content performs well in summarizing hotspot news, but in-depth interactions with fans still appear stiff and lack personalization. In the future, for AI to truly become a main tool in operations, it needs to further enhance its content generation flexibility and interaction personalization levels.
Disruptive Applications of AI in Education: From Personalized to Contextual Learning
Education has always been the cornerstone of social development, and the emergence of AI has brought unprecedented revolutionary changes to the education industry. In the application hall of this Yunqi Conference, several education-related applications were showcased, further revealing AI’s potential in the education field. I particularly focused on how these AI applications perform in education, especially how they achieve personalized teaching, contextual learning, and help students improve programming and language skills. These technologies not only improve the efficiency and coverage of traditional teaching but also have the potential to completely change the future landscape of education.
Shortage of Teacher Resources and AI One-on-One Teaching
One of the biggest challenges in the education industry for a long time has been the shortage of teacher resources. Each teacher needs to be responsible for the learning progress of multiple students simultaneously, often leading to a lack of personalized education. Traditional teaching models, especially in large classrooms, find it difficult to ensure that each student receives attention that matches their level and needs. The advent of AI is expected to change this situation. AI can act as a one-on-one tutor, providing personalized learning plans for each student and dynamically adjusting teaching content based on each student’s progress and feedback.
In the field of language learning, AI has already demonstrated its powerful potential. By integrating language learning courses with AI, students can practice anytime and anywhere, without being restricted by time and place. This method is particularly advantageous in speaking practice. Traditional language learning requires face-to-face communication with foreign teachers, but this mode is costly and unrealistic for many students in various regions. AI can simulate real language communication scenarios through natural language processing technology, allowing students to converse with AI in daily life. This not only significantly lowers the learning threshold but also enhances learning flexibility. Whether in the classroom or on the go, students can achieve a contextual learning experience through AI, improving learning outcomes.
Contextual and Immersive Learning: Expanding AI’s Educational Scenarios
Contextual learning is another major application of AI in education. Traditional language learning or other knowledge learning is often confined to fixed classrooms or textbook content, making it difficult to achieve a truly immersive experience. However, by combining AI with contextual learning, learners can use the knowledge they have learned in real-life situations. For example, some language learning applications can simulate immersive overseas travel, life, and campus scenarios with the help of VR devices, allowing students to interact with AI in virtual scenarios as if they have a “foreign teacher” accompanying them at all times; or with the assistance of AR devices, act as a “tour guide” and “foreign teacher” in real city walks or travel scenarios, helping you see a bigger world. This contextual learning can greatly enhance the flexibility and realism of language application, helping students make rapid progress in actual communication.
The same concept applies to learning in other disciplines. With the help of AI, programming learning can become more intuitive and interactive. In traditional programming learning, students often face tedious documents and examples, making the learning process slow and prone to errors. AI can provide real-time programming guidance to students, not only pointing out errors in the code but also offering optimization suggestions. For example, in the process of learning programming languages, AI can provide students with instant code reviews and corrections, helping them develop good programming habits. Such real-time feedback and interactive experiences enable students to master programming skills more effectively and gradually improve code quality.
Application of AI in Programming Education: From “Assistant” to “Mentor”
In the field of programming, AI’s role is not just a tool but more like a real-time mentor. In the past, programming learning often led students to develop bad programming habits due to the lack of a team or code review mechanism. However, with the help of AI, every line of code written by students can receive timely feedback. If the code format is not standardized, AI can immediately prompt and provide optimization suggestions, helping students form standardized programming styles. At the same time, AI can help students quickly master new programming languages and frameworks. In the past, learning a new programming language often required consulting a large number of documents, but AI can quickly generate corresponding code snippets based on students’ needs and explain their functions. This efficient and intuitive learning method not only saves time but also greatly improves learning effectiveness.
Taking programming assistants like Cursor and GitHub Copilot as examples, these AI tools can provide real-time suggestions and feedback while students are coding. AI can understand students’ intentions based on the context, automatically complete code, optimize logic, and even point out potential security risks. This makes programming no longer an isolated process but an interactive learning experience. Additionally, AI can help students solve problems instantly during the learning process, such as automatically generating HTTP request code or providing suitable libraries and parameters. This interactive learning experience allows students to quickly master new skills and gradually improve their programming abilities through continuous attempts and corrections.
AI-Assisted Mathematics Learning and Logical Thinking Training
AI’s application in education is not limited to language learning and programming; it can also be widely applied in mathematics, logical reasoning, and other fields. In the past, a significant challenge for AI in STEM education was the poor reasoning ability of large models, resulting in low accuracy in solving problems independently, thus relying only on question banks to match answers.
OpenAI o1 indicates that reinforcement learning and slow thinking can solve the reasoning problem. OpenAI o1 mini can now solve most undergraduate science course problems, such as the four major mechanics courses, mathematical analysis, linear algebra, stochastic processes, and differential equations: it solves about 70%-80% of complex calculation problems and over 90% of simple concept and calculation problems. Undergraduate computer science coding problems are even less of a challenge. I think o1 mini could graduate from the mathematics, physics, and computer science departments, and the official version of o1 is expected to be even stronger. While testing it, I joked that since my own intelligence is limited and I cannot fully understand advanced mathematics, all I can do is create something smarter than me to make up for my lack of intelligence.
The price of OpenAI o1 mini is not high, with a per-token price lower than GPT-4o’s. Still, the current pricing carries a premium, since OpenAI is for now the only vendor with a strong-reasoning model. The model size and inference cost of o1 mini are likely comparable to GPT-4o mini’s, yet GPT-4o mini is priced about 30 times lower than o1 mini. As other foundational model companies catch up on reinforcement learning and slow thinking, the cost of strong-reasoning models will only fall further.
AI’s ability to solve problems will have a significant impact on STEM education. In traditional STEM learning, students often rely on textbook answers to determine if they have solved a problem correctly, but AI can provide more detailed feedback. AI can not only give the correct answer but also point out specific error steps in the student’s problem-solving process, helping them understand the core of the problem.
Moreover, AI can demonstrate the problem-solving process through “step-by-step thinking,” allowing students to learn the problem-solving thought process rather than just memorizing formulas and steps. This “step-by-step thinking” ability is particularly evident in complex logical reasoning problems. AI can help students better understand the logical chain by demonstrating each step of the reasoning process, enhancing their logical thinking abilities.
Compared to traditional teaching, AI’s advantage lies in personalized learning paths that adapt to each student’s progress. AI can dynamically generate new practice problems and explain every step in detail, largely solving the individualized-teaching problem that traditional classrooms cannot address given teachers’ limited time and attention.
Hall 2 (Computing Infrastructure): The Computing Power Foundation of AI
Hall 2 of the Alibaba Cloud Yunqi Conference covered a field closely related to AI development but often overlooked: computing infrastructure. If AI models and algorithms are the frontline technologies driving artificial intelligence, computing infrastructure is the solid backing that supports them. As AI models grow more complex, demand for computing power grows exponentially, making computing infrastructure one of the core competitive strengths of the AI era.
The computing technologies showcased at this conference are no longer limited to single hardware or network architectures but revolve around how to improve the efficiency of AI model inference and training, reduce inference costs, and break through computing power bottlenecks through new architectures and technologies. In this process, CXL (Compute Express Link) technology, cloud computing clusters, and confidential computing became the focus of discussion, showcasing the latest advancements and challenges in current AI infrastructure.
CXL Architecture: Efficient Integration of Cloud Resources
CXL (Compute Express Link) is an emerging hardware interconnect technology designed to improve memory sharing efficiency between server nodes. Memory capacity matters greatly in AI training and inference, as large model inference often requires substantial memory resources. In traditional server architectures, memory is confined within a single computing node; when a node’s memory is insufficient, the system must write data to slower external storage, greatly reducing overall computing efficiency.
CXL technology allows different server nodes to share memory resources, breaking the boundaries of traditional server memory usage. At the conference, Alibaba Cloud showcased their latest achievements in CXL technology, connecting multiple computing nodes to a large memory pool through self-developed CXL switches and SCM (Storage Class Memory) memory disks, achieving cross-node memory sharing and greatly improving memory resource utilization.
The CXL application scenarios demonstrated by Alibaba Cloud mainly focus on databases. When processing large-scale data, databases are often memory-limited; when rapid responses to large-scale queries are required, or when user traffic surges at a particular database node, insufficient memory leads to severe performance bottlenecks. Through the CXL architecture, each computing node can share the resources of the entire memory pool and is no longer limited by the memory capacity of a single node.
Non-consensus View: In the AI field, many people habitually focus on high-performance computing devices like GPUs and TPUs, believing that as long as there is enough computing power, AI model training and inference can proceed smoothly. However, memory and data transfer speed have actually become new bottlenecks. The CXL architecture is not only an enhancement of computing power but also a breakthrough in memory resource utilization efficiency. In AI inference scenarios, the CXL architecture can provide large-capacity, low-cost, high-speed memory resources, significantly reducing the storage cost of Prefix KV Cache (intermediate results of model input context), thereby reducing the latency and GPU overhead of the large model Prefill stage, making it a key technology supporting long-text scenarios.
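To see why the KV cache dominates memory in long-text scenarios, here is a back-of-the-envelope estimate; the model configuration is illustrative (roughly a 70B-class model with grouped-query attention), not the specs of any product mentioned above:

```python
# Estimate the Prefix KV Cache footprint of a transformer decoder.
# Illustrative configuration, roughly a 70B-class model with GQA:
num_layers   = 80     # decoder layers
num_kv_heads = 8      # KV heads (grouped-query attention)
head_dim     = 128    # dimension per head
dtype_bytes  = 2      # fp16/bf16

# Each token stores one K and one V vector per layer.
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
print(f"{bytes_per_token / 1024:.0f} KB per token")              # ~320 KB

context_len = 32_000  # a long-document context
cache_gb = bytes_per_token * context_len / 1024**3
print(f"{cache_gb:.1f} GB for one {context_len}-token context")  # ~9.8 GB
```

A handful of concurrent long-context sessions can therefore exceed a single GPU’s HBM, which is exactly where a large, cheaper memory tier such as CXL-attached SCM pays off.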
High-Density Servers
Traditional server clusters often require a large number of hardware devices stacked to provide computing power, leading to increased costs and physical space limitations. Alibaba Cloud’s high-density server nodes integrate two independent computing units (including motherboards, CPUs, memory, network cards, etc.) within a single 2U rack server, reducing overall cabinet space occupation. This means that in the same physical space, Alibaba Cloud’s high-density servers can provide higher computing power.
Cloud-Native and Serverless
The term “cloud-native” is familiar by now. The goal of cloud-native technology is to enable applications to fully leverage the elasticity and distributed architecture of cloud computing through containerization, automated management, microservices architecture, and continuous delivery. Traditional deployments usually run on virtual machines or physical servers, meaning developers must manually scale the system to handle traffic peaks and pay for excess resources during idle times. With cloud-native automatic scaling, the system can dynamically adjust resources based on actual usage, greatly reducing the maintenance difficulty and operating costs of the infrastructure.
During the Yunqi Conference, Alibaba Cloud showcased its cloud-native services and Serverless technology, revealing how these technologies help developers reduce operational costs, improve system elasticity, and their applications in large model inference.
Serverless: The Key to Lowering Operational Thresholds
The biggest feature of Serverless is that developers do not need to manage servers and infrastructure, but instead focus entirely on business logic. Through the Serverless architecture, developers only need to upload application code or container images to the cloud service platform, which will automatically allocate resources dynamically based on the changes in request volume. This approach not only reduces the scaling pressure during peak traffic but also avoids paying for excessive idle resources during off-peak periods. Serverless excels in high scalability, automated operations, and seamless integration with various cloud-native services.
In the presentation, Alibaba Cloud demonstrated their Serverless Application Engine. Applications developed using this engine can automatically scale horizontally according to traffic changes and provide developers with multi-language support. For example, a traditional FastAPI application deployed on a virtual machine usually requires developers to pre-configure resources based on user volume, leading to wasted idle resources. However, with the Serverless Application Engine, developers can package the application as a container image without preparing for traffic peaks in advance. The system will automatically scale application instances as needed and shrink them when traffic decreases.
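A minimal sketch of such an application is shown below; the endpoints are illustrative, and the only platform-specific assumption is that the container reads its listening port from an environment variable:

```python
# app.py -- a minimal FastAPI service; packaged into a container image,
# a Serverless platform can scale instances with traffic (even to zero).
import os

import uvicorn
from fastapi import FastAPI

app = FastAPI()

@app.get("/healthz")
def healthz():
    # Platforms typically probe an endpoint like this before routing traffic.
    return {"status": "ok"}

@app.get("/greet/{name}")
def greet(name: str):
    return {"message": f"Hello, {name}"}

if __name__ == "__main__":
    # Read the port from the environment so the platform can inject it.
    uvicorn.run(app, host="0.0.0.0", port=int(os.environ.get("PORT", 8000)))
```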
This automated scaling capability is crucial for handling traffic fluctuations, allowing development teams to focus on application development and optimization without worrying about scaling and resource management. Additionally, a notable advantage of Serverless is its support for scheduled tasks and background tasks, which is particularly important for applications that need to handle periodic tasks, such as data backup, offline data analysis, and regular report generation.
The Evolution of Serverless Technology: From Functions to Any Application
In the early days of Serverless, application developers had to rely on specific Serverless function frameworks to deploy code in the cloud, which limited their freedom; applications already built on traditional frameworks faced an enormous rewriting workload to be made Serverless. The Serverless technology Alibaba Cloud showcased at this conference is no longer limited to function frameworks and can support any type of application: developers only need to package an existing application image to host it on the Serverless platform, which greatly eases the cloud migration of traditional applications.
Today, Serverless is no longer just a development framework; it has evolved into a universal architecture supporting different language frameworks from Python FastAPI to Node.js and even Java. This versatility makes application architecture migration more flexible, no longer constrained by a specific development paradigm. At the same time, combined with cloud-native technologies, Serverless further promotes automated operations of serverless applications, not only automatically scaling during peak loads but also automatically scaling horizontally and vertically according to different application needs.
Cloud-Native Databases and Message Queues
The concept of cloud-native is applied not only at the application and service level but also widely in key infrastructure such as databases and message queues. Cloud versions of databases such as MongoDB and the Milvus vector database offer stronger scalability and performance optimization than the community editions. For instance, the cloud version of the Milvus vector database performs 10 times better than the open-source version in some scenarios and can scale automatically without worrying about local memory shortages. Similarly, while the open-source version of MongoDB has decent scalability, the cloud-native version can automatically scale instances when facing massive data, eliminating manual configuration and adjustment of storage and computing resources and significantly reducing the operational burden of databases.
In addition to databases, cloud-native message queue services have also become an indispensable part of modern application architectures. In handling large-scale concurrent requests and cross-service message delivery, traditional message queue systems often require developers to manually optimize performance, while cloud-native message queue services provide automatic scaling and high availability guarantees, further simplifying the operational process.
BaiLian Platform: Cloud-Native Large Model API
When using open-source large models, renting a large number of GPUs to handle peak loads incurs significant costs, and dynamically applying for and releasing GPUs from the cloud platform and deploying services also brings high operational costs.
Alibaba Cloud’s BaiLian platform provides large model API services that offer developers stable and scalable inference services. Through the API, developers can easily call multimodal models for inference without worrying about the underlying infrastructure and scaling issues.
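For example, a call through an OpenAI-compatible interface might look like the following sketch; the base URL and model name here are assumptions drawn from public documentation, so consult the platform’s current docs before relying on them:

```python
# A minimal sketch of calling a hosted Qwen model through an
# OpenAI-compatible endpoint (base URL and model name are assumptions).
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

resp = client.chat.completions.create(
    model="qwen-plus",
    messages=[{"role": "user", "content": "Summarize CXL in one sentence."}],
)
print(resp.choices[0].message.content)
```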
We can see that Alibaba Cloud offers a wide variety of models. On its BaiLian platform, 153 out of 186 models are from Alibaba’s own Tongyi series, while others are third-party models, such as Baichuan, ZeroOneWanwu, Moonshot, etc.
The models Alibaba provides through the API span several categories:
- Text generation: old and new versions of the Qwen models, in both open-source and closed-source variants.
- Video understanding and video generation: for example, the EMO portrait video generation model turns static images into dynamic videos, and models like AnimateAnyone generate dancing videos.
- Style transfer: portrait style changes and image retouching.
- Scenario-specific image generation: models designed for posters, anime characters, backgrounds, and other specific scenarios.
- ControlNet: generating artistic text and embedding it into images.
- Speech synthesis and speech recognition: models from the CosyVoice and SenseVoice series.
The newly released Qwen 2.5 model has significantly improved mathematical capabilities. Previously common errors, such as deciding whether 3.1416 or π is larger, have been fixed through SFT (Supervised Fine-Tuning). Qwen 2.5 now compares 3.1416 and π digit by digit instead of judging by intuition: since π = 3.14159…, the two numbers first differ at the fourth decimal place, where 6 > 5, so 3.1416 is larger. Humans also compare numbers digit by digit, so the SFT data essentially teaches the large model the way humans think about such problems. Although this does not match the “slow thinking” capability OpenAI’s o1 model gained through reinforcement learning, SFT, as a quick optimization method, can effectively fix common logical flaws.
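The digit-by-digit procedure is simple enough to write down explicitly, which is part of why it can be taught through SFT data; here is a minimal sketch (assuming both numbers are positive with equal-length integer parts):

```python
from math import pi

def compare_digit_by_digit(a: str, b: str) -> str:
    """Compare two decimal strings position by position.
    Simplifying assumption: same sign and same integer-part length."""
    # Pad the shorter fraction with zeros so positions line up.
    width = max(len(a), len(b))
    a, b = a.ljust(width, "0"), b.ljust(width, "0")
    for da, db in zip(a, b):
        if da != db:
            return ">" if da > db else "<"
    return "="

# '>' : 3.14160... > 3.14159...
print(compare_digit_by_digit("3.1416", f"{pi:.10f}"))
```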
Cost Challenges of Cloud-Native Large Model APIs
The main challenge for enterprises using cloud-native large model APIs today is cost: voice and image model APIs, such as speech recognition, speech synthesis, and image generation services, are often several times more expensive than running the same models on self-deployed GPUs. The latency of voice and image model APIs is also generally high, so most latency-sensitive applications still choose local deployment. These are issues cloud-native large model services need to address.
The cost of text generation large model APIs has recently become more competitive, often lower than the cost of running these models on locally deployed GPUs, so unless there are special needs for model fine-tuning or data security, using cloud-native large model APIs is almost always the better choice.
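A back-of-the-envelope comparison, using entirely hypothetical numbers, shows why: unless a self-hosted GPU runs near full utilization, its per-token cost is hard to push below API prices.

```python
# Back-of-the-envelope cost comparison (all numbers hypothetical).
gpu_hourly_rent = 2.0        # USD per GPU-hour, hypothetical rental price
throughput_tps  = 2_500      # tokens/second the deployment sustains
utilization     = 0.30       # real traffic is bursty; GPUs idle off-peak

tokens_per_hour = throughput_tps * 3600 * utilization
self_hosted_cost_per_m = gpu_hourly_rent / tokens_per_hour * 1e6
print(f"self-hosted: ${self_hosted_cost_per_m:.2f} per 1M tokens")  # ~$0.74

api_price_per_m = 0.3        # USD per 1M tokens, hypothetical API price
print(f"API:         ${api_price_per_m:.2f} per 1M tokens")
```

The API provider can multiplex many tenants onto the same GPUs and keep utilization high, which is the structural reason a text API can undercut a lightly loaded private deployment.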
Confidential Computing: Data Security and Trust Transfer in the AI Era
In the process of AI model training and inference, data security issues are becoming increasingly important, especially in the context of cloud computing, where enterprises face potential risks of data privacy breaches when uploading sensitive data to cloud service platforms for processing. Cloud service providers offering API services also need to prove that they are not “cutting corners” and that they are indeed using flagship large models to provide API services, rather than using smaller models to impersonate large models.
Alibaba Cloud’s confidential computing platform achieves trusted large model inference through TEE technology. Users no longer need to overly rely on the security measures of cloud service providers but instead trust the underlying hardware’s trusted execution environment, which is a significant advancement in security and privacy.
The Efficiency Battle Between Cryptographic Security and TEE
Traditional cryptographic techniques such as secure multi-party computation (MPC) ensure data privacy even in untrusted environments by encrypting data and processing it collaboratively among multiple parties. The technology is particularly suitable for cooperation between multiple entities, such as joint modeling on cross-company data. However, the performance overhead of MPC is too high to meet the inference needs of large-scale AI models in practice: even simple tasks introduce significant computational and communication overhead, making inference very slow.
In contrast, TEE (Trusted Execution Environment) technology in confidential computing protects data processing more efficiently through hardware-level isolation and encryption. TEE relies on encryption modules built into the hardware to ensure that data remains encrypted and strictly access-controlled while in use; even the cloud service provider cannot access the data being processed. Compared to MPC, TEE’s advantage lies in near-native computing performance, meeting the stringent speed and performance requirements of large-scale AI model inference.
At this conference, Alibaba Cloud released an AI inference service based on TEE, which can provide an isolated environment in hardware to ensure that data remains encrypted even when processed in the cloud.
Attestation: Trust Transfer from Cloud Service Providers to Hardware Manufacturers
A core mechanism in confidential computing is attestation, which allows users to verify whether the current computing environment is trustworthy before and after executing AI model inference. This process ensures that AI model inference can only be performed in verified environments.
Traditionally, enterprises had to trust cloud service providers to properly manage and protect data when uploading it to the cloud. However, through attestation technology, users no longer need to trust Alibaba Cloud but instead trust the hardware manufacturers providing the underlying hardware, such as Intel, AMD, or NVIDIA.
Specifically, the attestation process is as follows (a simplified code sketch of the user-side verification follows the list):
- Verify the hardware and software environment: When users submit data and models for inference, they first verify that the hardware provides a genuine trusted execution environment (TEE). This means cloud service providers cannot pass off untrusted hardware as trusted or cut corners, such as disguising lower-performance GPUs as higher-spec ones.
- Execute AI inference: Perform AI inference tasks in a trusted hardware environment. Even Alibaba Cloud administrators cannot access the data in this trusted environment because all data is encrypted during transmission and storage.
- Generate attestation token: After the hardware environment passes verification, the system generates a unique attestation token, which contains a trusted hash value generated by the hardware, proving that the computing environment has not been tampered with.
- User verifies token: Users can independently verify this attestation token to ensure that the inference service runs on trusted hardware. Alibaba Cloud also provides remote verification services to facilitate users in verifying whether confidential computing execution is trustworthy.
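To make the token-verification step concrete, here is a deliberately simplified sketch. Real attestation tokens are signed with a hardware vendor’s certificate chain rather than a shared-key MAC, and every name below is hypothetical:

```python
import hashlib
import hmac
import json

# Golden measurement the user expects for the trusted software stack;
# in reality this comes from the hardware vendor's tooling (hypothetical value).
EXPECTED_MEASUREMENT = "9f2c..."

def verify_attestation_token(token_json: str, vendor_key: bytes, nonce: str) -> bool:
    """Simplified check: measurement matches, nonce is fresh, token is
    authentic. Real TEEs verify vendor-signed certificate chains (X.509),
    not a shared-key MAC as sketched here."""
    token = json.loads(token_json)

    # 1. The reported environment hash must match the expected one.
    if token["measurement"] != EXPECTED_MEASUREMENT:
        return False

    # 2. The nonce binds the token to this request (prevents replay).
    if token["nonce"] != nonce:
        return False

    # 3. Authenticity: recompute the MAC over the token body.
    body = f'{token["measurement"]}|{token["nonce"]}'.encode()
    expected_mac = hmac.new(vendor_key, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected_mac, token["mac"])
```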
Through attestation, users can not only ensure that their data is strictly protected in the cloud but also continuously verify the trustworthiness of the hardware and software environment throughout the data processing process. This mechanism effectively shifts the user’s trust from Alibaba Cloud itself to the underlying hardware manufacturers, ensuring that even if the cloud service provider itself encounters issues, the user’s data remains secure.
Confidential Computing-Based Large Model Inference Service
At this conference, Alibaba Cloud released a confidential computing-based large model inference service, allowing users to upload sensitive data to the cloud and perform encrypted inference through TEE technology, and view the verified inference results on the web. Users can also verify the trustworthiness of the inference results offline. This inference service is not only suitable for standard text generation tasks but also capable of handling more complex multimodal model inference tasks, such as video generation and speech recognition.
Even industries with high data privacy requirements, such as finance, healthcare, and government agencies, can securely use cloud-based AI inference services without needing to trust the cloud service provider. Through hardware-isolated confidential computing, Alibaba Cloud ensures that any sensitive data will not be leaked during the inference process while providing performance comparable to traditional inference services.
Conclusion: Two Bitter Lessons on Foundational Models, Computing Power, and Applications
There is a classic article in the AI field, “The Bitter Lesson”, published by Rich Sutton in 2019, which has been validated in the development of large models. The core viewpoint of this article is that in AI research, general methods that rely on computing power always outperform specific methods that rely on human knowledge. Sutton points out that although researchers tend to inject human knowledge and wisdom into AI systems, history repeatedly proves that over time and with the growth of computing power, simple and scalable methods eventually surpass meticulously designed systems.
Comparing the content of the three exhibition halls at the Yunqi Conference, I found that the relationship between foundational models, computing power, and applications also aligns with the prediction of “The Bitter Lesson”: Foundational models are key to AI applications, and computing power is key to foundational models. This relationship reflects Sutton’s emphasis that in AI development, computing power and scalability are more important than domain-specific expertise.
The Two Bitter Lessons Reflected in the Three Exhibition Halls of the Yunqi Conference
From the three exhibition halls of the Yunqi Conference, we can see the two Bitter Lessons in the AI field: foundational models are key to applications, and computing power is key to foundational models.
Hall 1—The Key Role of Foundational Models
Foundational models are the cornerstone of successful AI applications: almost all the capabilities and performance of general applications depend on the progress and maturity of foundational models. If the foundational model falls short, no amount of application-layer optimization can achieve a real breakthrough. Currently, successful general applications are almost all led by foundational model companies, because these companies grasp a model’s latest progress months or even half a year before release, allowing application development to proceed in parallel. Application companies only recognize the new opportunities once the model is publicly released, by which time they are often already behind.
Hall 2—Computing Power Determines the Evolution of Foundational Models
Computing power is the other decisive factor in AI development, especially for the training and inference of complex foundational models. Only companies with strong computing power can effectively train and optimize foundational models, further driving AI innovation. Without sufficient computing power, it is impossible to support the training of complex models, which explains the significant performance gap between domestic and international AI models. As the exhibits reflected, the shortage of computing power limits the rapid development of high-complexity models domestically, and the foundation for advancing AI remains the expansion and optimization of computing power.
Hall 3—The Capability of Foundational Models Determines Application Capability
AI applications are the joint result of foundational models and industry know-how. Through the numerous vertical field cases displayed in the application hall, we see some application companies that have successfully found product-market fit (PMF) by combining industry data and scenario advantages. For example, AI-driven code editors, design tools, smart wearable devices, cultural tourism guides, music generation, etc., are achieving application innovation by combining with foundational models and leveraging data and industry moats.
Lesson One: Foundational Models are Key to AI Applications
The capability of foundational models determines the upper limit of general application capabilities. The success of general AI applications is often led by foundational model companies, as these models build the underlying support for almost all AI applications. As Rich Sutton described in “The Bitter Lesson,” early AI researchers tended to inject human knowledge into models, attempting to achieve progress by building AI with specific domain understanding. However, over time, the increase in computing power has made general methods, i.e., general foundational models that rely on computing power expansion, the most effective methods. Abandoning manually implanted human knowledge and relying on large-scale data and computing power has instead brought qualitative leaps.
Foundational Model Companies are More Suited for General Applications
Comparing the AI applications displayed in Hall 1 (Foundational Models) and Hall 3 (Applications), a significant characteristic can be observed: many successful general applications are developed by foundational model companies themselves. This is because:
- Foundational Models Dominate Application Capabilities: General applications rely almost entirely on the capabilities and performance of foundational models. Foundational model companies master the capabilities and potential of new models months before release, allowing application development to proceed in parallel. By the time external application companies discover the progress of a foundational model, the model company already has related applications ready and occupies a competitive advantage.
- External Companies Lag Behind Model Release Progress: Application companies usually start developing corresponding applications only after the foundational model is released, leading to significant lag in time and innovation. The capabilities of general applications are fundamentally determined by foundational models, making it difficult for external companies to catch up.
Application Companies Should Combine Industry and Data
Although successful general applications are almost all led by foundational model companies, in specific industries or scenarios, some application companies have found product-market fit (PMF) by combining industry data and expertise (know-how) and achieved success. Here are some typical cases:
- AI-Assisted Programming: By combining AI’s code generation and logical reasoning capabilities, Cursor has greatly improved developers’ programming efficiency, especially in real-time generation and error correction.
- Smart Wearable Devices: Smart wearable devices can become users’ assistants, exploring a larger world with them. The current input and output forms of mobile phones cannot fully meet the needs of AI’s multimodal capabilities. Smart wearable devices will propose new interaction methods, allowing AI assistants to see, hear, and speak, interacting with humans naturally in a multimodal way.
- Educational Applications: AI has already achieved significant success in educational applications, such as language learning applications that can simulate one-on-one conversation scenarios, helping users engage in interactive language learning.
- Assisted Design: AI design tools help designers quickly generate prototypes, improving design efficiency. Although there are still issues with exporting HTML in the generated design drafts, their advantages in initial prototype construction are very obvious.
- Cultural Tourism Guide Applications: These applications integrate government data, scenic spot information, etc., providing timely updated guide services, especially forming data barriers in tourism and cultural scenarios.
- Music Generation: AI can identify tracks by listening to music, create music based on lyrics or humming, and perform AI covers. These technologies are already quite mature and have gained many users.
Lesson Two: Computing Power is Key to Foundational Models
Without sufficient computing power, the training of foundational models will be difficult to achieve. This is the second “Bitter Lesson” in the field of artificial intelligence—computing power determines the upper limit of foundational model capabilities. Historically, from computer chess and Go to speech recognition, it has always been the general methods that can fully utilize computing power that have won.
The Impact of Insufficient Computing Power on Model Training and Inference
After speaking with several domestic companies working on foundational models, I found a significant computing power gap between domestic and international companies, especially in access to high-performance GPU resources.
At the Yunqi Conference, NVIDIA had a fairly large exhibition area but displayed only two Ethernet switches and a BlueField-3 smart network card. All AI-related introductions were videos playing on screens, with no actual GPU exhibits. Visitors in the exhibition area watched the videos on their own, the switches and network cards placed in the center were unattended, and three NVIDIA staff were chatting around the display cabinet. I asked them a few network-related questions, and they said, “Aren’t you all doing AI here? How do you know so much about networks?” I said I used to work on data center networks. When I asked why there were no GPU exhibits this time, they said GPUs are quite sensitive, so they only displayed network equipment.
The gap in computing power has brought issues in both inference and training:
- Inference Stage Latency Issues: The time to first token (TTFT) of many domestic AI models is relatively high, meaning the first token takes a long time to generate, which significantly hurts real-time experience. Domestic inference is noticeably slower than abroad, mainly because the GPUs commonly used domestically are relatively outdated, such as the A800 or even V100, while foreign companies generally use the H100 or A100, which significantly speeds up inference. Sanctions on high-end hardware have created a significant gap in inference performance between domestic and foreign companies.
- Scarcity of Training Resources: The process of training foundational models requires massive computing power support, especially for training large-scale models (such as end-to-end multimodal models and models with strong inference capabilities based on reinforcement learning), where hardware resource limitations are particularly critical. Leading US companies can easily obtain tens of thousands of H100 or A100 level GPUs and build ultra-large-scale clusters for training through NVLink interconnection. Domestically, even obtaining ten thousand A100 or A800 GPUs is already very difficult. This imbalance in computing resources leads to significant disadvantages in the efficiency and effectiveness of large model training domestically. Since the start of the large model battle last year, some large model companies have already fallen behind.
Rapid Development of Small-Scale Models
Despite the computing power limitations in training large-scale models, domestic companies have achieved rapid development in small-scale models and vertical domain models. Since the training of these models requires lower computing power, domestic enterprises can conduct more flexible development and iteration with existing hardware resources:
- Video Generation and Speech Models: In specific scenarios, domestic video generation models, digital human models, and speech recognition, speech synthesis, and music synthesis models have made rapid progress. Although these models are not as complex as multimodal large models overall, they can already meet certain market demands in practical applications and are strongly competitive in short-term commercialization.
- Vertical Domain Models: For example, in fields like healthcare and education, vertical domain applications have lower computing power requirements and strong market adaptability. By focusing on optimization for specific tasks and scenarios, these small-scale models have developed relatively quickly domestically and have already achieved a certain degree of commercialization. For instance, Microsoft Xiaoice could chat, write couplets, compose poems, and guess riddles back in 2016, powered by many small models in vertical domains. If MSRA had concentrated all its computing power on large models back then, even with a prophet bringing back the Transformer and GPT papers from the future, they still could not have created ChatGPT, given the vast difference in computing power between then and now. Similarly, the combined computing power of all foundational model companies in China today might still be less than that of OpenAI alone, so blindly imitating OpenAI’s technical route may not be feasible.
- AI Agent: Using methods like chain of thought and tree of thought for slow thinking to enhance the logical reasoning and complex problem-solving abilities of existing models.
Computing Power Determines the Capability and Innovation of Large Models
As pointed out in “The Bitter Lesson,” general methods that rely on computational power are the fundamental driving force of AI development. For example, OpenAI’s o1 model, trained with a large amount of reinforcement learning computational power, significantly surpasses AI Agents that only use chain of thought and tree of thought methods during the reasoning phase in terms of mathematical and programming abilities. The GPT-4o model, trained with a large amount of multimodal data, may significantly surpass the pipeline composed of speech recognition, text models, and speech synthesis models in terms of multimodal capabilities and response speed.
An interesting observation: given the same computing power, general models often cannot outperform vertical domain models. If two teams within a company compete, one building a general model and the other a specialized model, with equal resources the specialized model usually wins. This is why many large AI companies with deep accumulation have not produced models as strong as OpenAI’s. But an important premise of “The Bitter Lesson” is that computing power keeps getting cheaper. This is also why domestic companies, facing limited computing power, often choose the more conservative route of small-scale vertical models, which can quickly reach state-of-the-art (SOTA) in a specific field; as computing power grows, however, these small vertical models will be surpassed by general models that consume more of it.
Currently, globally, access to computational power remains a core factor affecting AI competitiveness. The top three foundational model companies globally are backed by the top three public clouds: OpenAI is backed by Microsoft Azure, Anthropic by AWS, and Google by Google Cloud. Domestic large model companies also rely on computational power provided by public clouds. Without sufficient computational power, domestic enterprises will continue to face challenges in large model training and inference.
At present, large model training in China still heavily relies on NVIDIA chips. Although Huawei Ascend and other training chips have achieved good performance, the developer ecosystem is not yet complete. Many domestic companies like to focus on performance metrics because they are easier to quantify and make it easier to boost KPIs during internal reporting. However, the ease of use of operator development is relatively subjective and not easy to quantify for evaluation. A poor developer ecosystem can also find many excuses, allowing for prolonged self-deception. Especially when ease of use and performance conflict, internal decision-making within companies often sacrifices ease of use for performance. If domestic training and inference chips can develop a better developer ecosystem, the issue of insufficient computational power can certainly be resolved, and they may even have a cost advantage globally.