(This article is the invited talk I gave at the first Intelligent Agent Networks and Application Innovation Conference on December 20, 2025)

View Slides (HTML), Download PDF Version

Slide Source Code

Abstract

Today’s agent–human interaction is centered on text, but that deviates from natural human cognition. From first principles, the modality humans are best at for output is speech (speaking is roughly three times faster than typing), and the modality humans are best at for input is vision. By vision I mean not walls of text, but intuitive UI.

The first step is achieving real‑time voice interaction. The traditional serial VAD–ASR–LLM–TTS architecture suffers from having to wait for the user to finish speaking before it can start “thinking,” and it cannot output before the thinking is done. With an Interactive ReAct continuous‑thinking mechanism, the agent can listen, think, and speak at the same time: it starts thinking while the user is talking, and keeps deepening its reasoning while it’s speaking itself, making full use of all idle time gaps.

The second step is to expand the observation space and action space on top of real‑time voice. By extending the Observation Space (from voice input to Computer Use–style visual perception) and the Action Space (from voice output to UI generation and computer control), the agent can operate existing computer/phone GUIs while on a call, and generate dynamic UI to interact with the user. One implementation path for generative UI is generating front‑end code; Claude Sonnet 4.5 has already reached the threshold for this. Another path is generating images; Nano Banana Pro is also close to this threshold.

This is exactly the path to realizing Samantha in the movie Her. As an operating system, Samantha needs six core capabilities: real‑time voice conversation with the user, making phone calls and handling tasks on the user’s behalf, operating traditional computers and phones for the user, bridging data across the user’s existing devices and online services, having her own generative UI interface, and possessing powerful long‑term user memory for personalized proactive services.

Part I: The Efficiency Bottleneck of Text‑Based Interaction

Cognitive Mismatch in Today’s Agent Interfaces

Human output modalities

| Modality | Speed | Cognitive load |
|---|---|---|
| Speech | 150 words/minute | Low |
| Typing | 40–50 words/minute | High |
  • Speech is the most natural output modality
  • Typing requires fine motor coordination and visual attention
  • Spoken communication is the foundation of human interaction

Human input modalities

| Modality | Bandwidth | Understanding method |
|---|---|---|
| Vision | High (~10 Mbps) | Pattern recognition |
| Hearing | Medium (~16 kbps) | Sequential processing |
| Reading | Low (~100 bps) | Requires literacy |
  • The visual cortex processes information in parallel
  • Reading is an acquired skill, not innate
  • UI/graphics leverage natural visual processing capabilities

Fundamental insight: Today’s text‑based agent interfaces force humans to use suboptimal modalities for both input and output.

The Optimal Interaction Paradigm

Optimal human output: speech

  • Naturally produces language at conversational speed
  • Minimal cognitive overhead
  • Supports complex expression and subtle nuance
  • Supports real‑time interaction and interruption

Use cases:

  • Task delegation and clarification
  • Real‑time feedback while the agent executes tasks
  • Multi‑turn problem solving

Optimal human input: visual UI

  • Rapid scanning and understanding of information
  • Spatial organization aids memory and navigation
  • Interactive elements support direct manipulation
  • Progressive disclosure manages complexity

Use cases:

  • Result presentation and comparison
  • Interactive data exploration
  • Workflow visualization and status monitoring

Target architecture: Human‑to‑agent communication via real‑time voice + agent‑to‑human communication via generative UI

Part II: Real‑Time Voice Interaction

Typical Voice Agent Architecture

A typical voice agent architecture is divided into three layers:

  1. Perception layer: VAD + ASR

    • Converts continuous signals into discrete events
  2. Thinking layer: LLM

    • Asynchronous processing
  3. Execution layer: TTS

    • Converts discrete commands into continuous actions

Problems with the Traditional VAD + ASR Architecture

Issues with VAD (Voice Activity Detection)

  1. Unavoidable latency: Must wait for 500–800 ms of continuous silence to confirm that the user has finished speaking
  2. Poor interruption detection: Can’t distinguish background noise/music; “uh‑huh” easily triggers false interruptions
  3. Low accuracy in speech detection: Fails in complex acoustic environments; mid‑sentence pauses cause truncation

Issues with ASR (Automatic Speech Recognition)

  1. Low accuracy from lack of context: VAD slices audio into isolated segments; can’t use context for disambiguation; high error rate on emails, names, phone numbers
  2. Lack of world knowledge: Can’t leverage common sense; low accuracy on addresses, brands, technical terms
  3. Pure text output loses acoustic details:
    • Loses emotion: happiness, frustration, excitement
    • Loses paralinguistic information: laughter, sighs, breathing
    • Loses environmental information: noisy, musical, quiet

Streaming Speech Perception Models: A Replacement for VAD + ASR

Multimodal architecture

  1. Audio encoder (from Whisper): Converts audio into audio tokens
  2. Qwen LLM (autoregressive): Processes audio tokens and outputs text + events

Key advantages:

  • Streaming: Real‑time output (non‑batch)
  • Context: Preserves full dialogue history
  • In‑context learning: More accurate recognition of personal info and domain terms
  • World knowledge: Higher accuracy on addresses, brands, and amounts

Rich output: text + acoustic events

In addition to text tokens, it outputs special tokens (acoustic events):

  • <speak_start> <speak_end>: Speech boundaries
  • <interrupt>: Interruption intent
  • <emotion:happy>: Emotion tags
  • <laugh> <sigh>: Paralinguistic information
  • <music>: Environmental sounds
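
As a minimal sketch of how a downstream thinking layer could consume this mixed stream (the token set and the streaming interface here are illustrative assumptions, not the actual model API):

```python
# Minimal sketch: splitting a streaming perception model's output into
# text fragments and acoustic events. The token vocabulary and the input
# generator are illustrative assumptions.

from dataclasses import dataclass
from typing import Iterator

EVENT_TOKENS = {"<speak_start>", "<speak_end>", "<interrupt>",
                "<emotion:happy>", "<laugh>", "<sigh>", "<music>"}

@dataclass
class PerceptionEvent:
    kind: str      # "text" or "acoustic"
    payload: str   # transcribed text fragment or event token

def parse_stream(tokens: Iterator[str]) -> Iterator[PerceptionEvent]:
    """Turn the model's token stream into discrete events."""
    buffer = []
    for tok in tokens:
        if tok in EVENT_TOKENS:
            if buffer:                                  # flush pending text
                yield PerceptionEvent("text", "".join(buffer))
                buffer.clear()
            yield PerceptionEvent("acoustic", tok)
        else:
            buffer.append(tok)
    if buffer:
        yield PerceptionEvent("text", "".join(buffer))
```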

Interactive ReAct: Flexibly Interleaving Observation, Thought, and Action

Traditional ReAct: rigid OTA loop

O₁: "我想把我的 Xfinity 账单降到每月 79 美元"
T₁: (思考 5 秒... 然后被打断,全部丢失)
O₂: "而且我不想删减任何功能"
T₂: (思考 15 秒...)
A₁: "明白了!这是一个包含所有功能的 79 美元套餐..."
  • Fixed loop: Must complete an entire observe–think–act sequence
  • Lost thinking: Can’t think while listening; high latency
  • Rigid: Must wait for complete input before thinking

Interactive ReAct: flexible interleaving OTA

O₁: "我想把我的 Xfinity 账单降到每月 79 美元"
T₁: (快速思考 0.5 秒:用户话未说完,等待)
T₂: (思考 5 秒... 然后被打断)
O₂: "而且我不想删减任何功能"
T₃: (快速思考 0.5 秒:用户想降价到 79 美元)
A₁: "我可以帮你!让我查看可用的套餐"
T₄: (继续思考... 10 秒)
A₂: "明白了!这是一个包含所有功能的 79 美元套餐..."
  • Think while listening: New observations can be inserted at any time, and ongoing thoughts are preserved
  • Think while speaking: Respond quickly, then continue thinking
  • Intelligent turn‑taking: Decides when to speak and when to remain silent

SEAL Thinking Layer: An Interruptible Interactive ReAct Loop

Key insight: LLM thinking is far faster than voice I/O; fully exploit “idle time”

  • LLM processing speed:

    • Input processing: 500+ tokens/second
    • Thinking/output: 100+ tokens/second
  • Voice I/O speed:

    • Voice input/output: only ~5 tokens/second
    • Speed difference: 20–100×

During the “idle time” between observation (voice input) and action (voice output), there is plenty of time for deep thinking: while the user speaks a ten‑second sentence (roughly 50 tokens of speech), the model can already produce on the order of 1,000 tokens of reasoning. A rigid observe–think–act loop cannot use this idle time.

Fast thinking → slow thinking → continuous thinking

  1. Fast response (0.5s): 50 tokens of quick thinking → immediate preliminary response (within 5 seconds)
  2. Deep analysis (after 5s): 500 tokens of slow thinking → generate a more complete answer
  3. Continuous thinking (as needed): If 500 tokens still aren’t enough, keep thinking for another 5 seconds → continue generating answers until both thinking and speaking are done. If multiple rounds of thought are needed, the result is continuous output of current‑round thought summaries, like someone “thinking out loud.”
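
This schedule can be sketched as a simple resumable loop. In the sketch below, `think`, `speak`, and `user_is_speaking` are hypothetical stand‑ins for the real LLM, TTS, and perception interfaces; `think` is assumed to be resumable from a partial thought.

```python
# Minimal sketch of the fast -> slow -> continuous thinking schedule.

def respond(context, think, speak, user_is_speaking):
    # 1. Fast response: a ~50-token thinking budget, then an immediate reply.
    thought = think(context, budget_tokens=50)
    speak(thought.preliminary_reply)

    # 2./3. Deep and continuous thinking: keep extending the same thought in
    #       ~500-token slices, voicing a short summary of each round
    #       ("thinking out loud"), until the answer is complete.
    while not thought.done:
        if user_is_speaking():
            # Preserve the partial thought and yield the floor
            # (see "Think While Listening" below).
            context.append(("interrupted_thought", thought))
            return
        thought = think(context, budget_tokens=500, resume=thought)
        speak(thought.round_summary)
```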

Think While Listening

Handling interruptions in conversation gracefully.

Traditional ReAct: Once the user interrupts, all previous thinking is discarded and you must start over.

Interactive ReAct: Preserves the interrupted thought process, attaches the new user input, and lets the model continue thinking from the point of interruption.

<user>I'd like to switch from my current $109 plan to your new plan...</user>
<think>The user wants to switch plans; they are currently on the $109 plan.
Let me look up the new plan's details...
I need to know: 1) the user's current plan details 2) the new plan's price...<interrupted/></think>
<user>(interrupting) By the way, is the new plan the $79-per-month one?</user>
<think>(continuing the earlier thought) The user has confirmed the new plan is $79.
That saves $30 per month, going from $109 down to $79.
I still need to confirm: 1) the differences in plan features 2) whether there are contract restrictions...</think>
<assistant>Yes, the $79 plan. Let me confirm: your current $109 plan includes...</assistant>

Advantage: A coherent thought process that can quickly adjust strategy based on the latest information.

Speak While Thinking

Uses “filler speech” to buy time for deeper thought, reducing first‑token latency.

Scenario: The user asks a complex question and the agent needs time to think.

Traditional ReAct:

<user>Do you confirm ordering this plan?</user>
* (up to 10 seconds of silence...) *
<assistant>After considering it, I confirm the order.</assistant>

Interactive ReAct:

<user>Do you confirm ordering this plan?</user>
<think> (quick thinking, <0.5 s) </think>
<assistant>Let me confirm: this is the $79-per-month plan, right?</assistant> (preliminary response)
<think> (deep thinking) </think>
<assistant>Yes, this plan is a good deal. I confirm the order.</assistant> (final answer)

Advantage: Greatly improves interaction fluency and avoids awkward long waits.

SEAL Architecture Summary

A unified event‑driven loop that decouples perception, thinking, and execution, achieving truly real‑time and parallel processing.

Perception layer

  • Input: Continuous signals (voice, GUI)
  • Output: Discrete event streams
  • Solves: Latency, unnatural interruptions, and acoustic information loss in traditional speech perception

Thinking Layer

  • Input: Discrete event stream
  • Output: Interleaved thoughts/action commands
  • Solves: The serial bottleneck of traditional ReAct, enabling interruptible, asynchronous listening-while-thinking and thinking-while-speaking

Execution Layer

  • Input: Discrete action commands
  • Output: Continuous signals + feedback events
  • Solves: The “last mile” problem of agents being clumsy and lacking feedback, forming a closed-loop action cycle
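
A minimal sketch of this decoupling, with the three layers running as independent workers connected by queues so that no layer blocks the others (the worker bodies are illustrative stubs, not the actual implementation):

```python
# Sketch of the event-driven loop: perception, thinking, and execution
# communicate only through queues of discrete events and actions.

import queue
import threading

events = queue.Queue()    # perception -> thinking: discrete events
actions = queue.Queue()   # thinking -> execution: action commands

def perception_worker(audio_source, perceive):
    for chunk in audio_source:            # continuous signal in
        for event in perceive(chunk):     # discrete events out
            events.put(event)

def thinking_worker(think):
    context = []
    while True:
        context.append(events.get())      # wait for at least one event
        while not events.empty():         # then drain any backlog
            context.append(events.get())
        for action in think(context):     # interleaved thoughts/actions
            actions.put(action)

def execution_worker(execute):
    while True:
        action = actions.get()            # discrete command in
        feedback = execute(action)        # continuous output + feedback
        events.put(feedback)              # close the action loop

def run(audio_source, perceive, think, execute):
    for target, args in [(perception_worker, (audio_source, perceive)),
                         (thinking_worker, (think,)),
                         (execution_worker, (execute,))]:
        threading.Thread(target=target, args=args, daemon=True).start()
```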

Future Outlook: End-to-End Models

  • Current SEAL Architecture:

    • Perception Layer LLM: audio → text + acoustic events
    • Thinking Layer LLM: text + acoustic events → thoughts + actions
    • Execution Layer LLM: actions → audio
  • Future End-to-End Architecture:

    • Audio encoder: audio → audio tokens
    • Unified LLM: perception + thinking + execution
    • Audio decoder: audio tokens → audio

Part III: Making Phone Calls and Using a Computer at the Same Time

Extended Observation and Action Space

Extended Observation Space

| Traditional | Extended |
|---|---|
| Voice input | Voice input + screen visual perception + application state monitoring + system notifications |

Computer-Use Integration:

  • Real-time screen understanding
  • UI element recognition and tracking
  • Cross-application context awareness

Extended Action Space

| Traditional | Extended |
|---|---|
| Voice output | Voice output + mouse/keyboard operations + UI generation + application control |

Multimodal Output:

  • Voice for human communication
  • GUI operations for task execution
  • Generated UI for result presentation

Target Capability: An agent that can carry on a phone conversation and operate a computer interface at the same time, similar to a human assistant talking on the phone while using a computer.

Multi-Agent Architecture for Concurrent Calling and Computer Use

Architecture Design

Phone Agent:

  • Handles real-time voice conversations
  • Low-latency ASR → LLM → TTS pipeline
  • Extracts key information from the conversation
  • Communicates with the Computer Agent via message passing

Computer Agent:

  • Responsible for GUI operations (browser, apps)
  • Visual understanding and action planning
  • Receives information from the Phone Agent
  • Reports task status and requests additional information

Communication Protocol

{
  "type": "info_collected" | "task_completed" | "error",
  "sender": "phone_agent" | "computer_agent",
  "field": "name" | "phone" | "email" | ...,
  "value": "...",
  "timestamp": "..."
}
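
The same schema, expressed as a small typed structure on the Python side (a sketch; the field names simply mirror the JSON above):

```python
# Sketch of the inter-agent message as a typed structure mirroring the
# JSON schema above.

from dataclasses import dataclass, asdict
from typing import Literal
import json
import time

@dataclass
class AgentMessage:
    type: Literal["info_collected", "task_completed", "error"]
    sender: Literal["phone_agent", "computer_agent"]
    field: str            # e.g. "name", "phone", "email"
    value: str
    timestamp: float

def make_message(msg_type, sender, field, value) -> str:
    """Serialize a message for the other agent's inbox."""
    msg = AgentMessage(msg_type, sender, field, value, time.time())
    return json.dumps(asdict(msg))
```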

Autonomous Orchestration Method

The agent autonomously decides when to spawn a collaborative agent:

  1. Task analysis: The Computer Agent encounters a complex form that requires user information
  2. Capability assessment: Determines that voice interaction is more efficient than text
  3. Agent generation: Calls initiate_phone_call_agent(purpose, required_info)
  4. Parallel execution: Two agents run independently with asynchronous communication

Real-Time Collaboration Pattern

Phone Agent: "May I have your name, please?"
User: "Zhang San"
→ Message: {field: "name", value: "Zhang San"}
Computer Agent: (fills in the name field)
→ Message: {status: "name_filled"}
Phone Agent: "What is your email address?"
...

Key Requirement: Agents must run in truly parallel threads without blocking each other.

Solving Computer-Use Latency: Small Specialized Models

Step-GUI: An Efficient GUI Agent Using Small Models (arxiv:2512.15431)

Performance vs. SOTA Models

| Benchmark | OpenAI CUA | Claude-4.5 | Gemini-2.5 | Step-GUI 8B |
|---|---|---|---|---|
| OSWorld-Verified | 23.0 | 61.4 | - | 48.5 |
| AndroidWorld | - | - | 69.7 | 80.2 |
| ScreenSpot-Pro | 23.4 | - | - | 62.6 |
| OSWorld-G | 36.4 | - | - | 70.0 |

Why Can Small Models Outperform Frontier Models?

Self-Evolution Training Pipeline:

  • Calibrated Step Reward System (CSRS): Converts model trajectories into training signals, >90% accuracy, cost only 1/10–1/100 of human labeling
  • Domain-targeted data: 11.2M mid-train + 1.67M cold-start samples

Core Insight: Frontier models lack domain knowledge (UI conventions of Chinese apps, behavior of local apps). Small models + targeted training can fill these gaps.

AndroidDaily Benchmark (a Chinese mobile-app benchmark proposed by Step-GUI)

Real-world mobile app tasks across 5 scenarios:

  • 🚄 Travel: Buy train tickets on 12306
  • 🎵 Entertainment: Play a playlist on NetEase Cloud Music
  • 🛒 Shopping: Check the shopping cart on Taobao
  • 💬 Social Media: Change privacy settings on Zhihu
  • 🍜 Local Services: Set a Dianping review to be visible only to yourself

AndroidDaily (Static) Results

| Model | Average accuracy |
|---|---|
| Claude-4.5-sonnet | 10.90 |
| Gemini-2.5-Pro Thinking | 43.74 |
| Step-GUI-8B | 89.91 |

Conclusion: Small model + domain data > general-purpose large model. With targeted domain training, Step-GUI-8B achieves 2× Gemini and 8× Claude performance on Chinese mobile apps.

NVIDIA ToolOrchestra: Small Models for Multi-Agent Coordination

Based on NVIDIA research (developer.nvidia.com)

Core Concept

ToolOrchestra trains small orchestration models to supervise and manage larger models and tools according to user preferences for:

  • Speed
  • Cost
  • Accuracy

Key Insight: Small models are not burdened by excessive knowledge and can be trained to capture the essential decision patterns of orchestration.

Training Method

  1. Synthetic data generation: Automatic trajectory generation and verification
  2. Multi-objective RL: Optimize for accuracy, cost, and solution time
  3. Minimal data requirements: Orchestrator-8B uses only 552 synthetic samples

Architectural Advantages

  • Small (8B) model precisely guides larger models
  • Automatically balances capability and cost
  • Supports heterogeneous multi-agent systems
  • Suitable for real-world deployment scenarios

Case Studies of Small Language Models in Agentic AI

Based on “Small Language Models are the Future of Agentic AI” (arxiv:2506.02153)

Core Argument

Position: Small language models (SLMs) are:

  • Powerful enough for specialized tasks
  • Naturally better suited to agentic applications
  • Inevitably more economical for high-frequency calls

Arguments for SLMs in Agent Systems

Capability Argument:

  • Modern SLMs perform strongly on focused tasks
  • Agentic systems involve repetitive, specialized operations
  • General-purpose conversational ability is often unnecessary

Economic Argument:

  • Agent systems make a large number of model calls
  • Cost grows linearly with model size and number of calls
  • SLMs reduce operating costs by 10–100×

Algorithm for Transitioning from LLM to SLM

  1. Identify specialized subtasks in the agent workflow
  2. Curate task-specific training data from LLM outputs
  3. Fine-tune SLMs on specialized data
  4. Validate performance on target metrics
  5. Deploy SLMs for production workloads
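
As a sketch of steps 2–3, the curation stage can be as simple as filtering logged LLM calls by subtask and exporting prompt/response pairs for supervised fine-tuning. The log format, field names, and file paths below are assumptions for illustration, not a prescribed pipeline.

```python
# Sketch of step 2 of the LLM -> SLM transition: curate task-specific
# training data from logged LLM calls. The resulting JSONL pairs feed a
# standard supervised fine-tuning job (step 3).

import json

def curate_sft_data(call_log_path: str, subtask: str, out_path: str) -> int:
    """Keep only successful calls for one specialized subtask."""
    kept = 0
    with open(call_log_path) as src, open(out_path, "w") as dst:
        for line in src:
            record = json.loads(line)
            if record.get("subtask") == subtask and record.get("success"):
                pair = {"prompt": record["prompt"],
                        "response": record["response"]}
                dst.write(json.dumps(pair, ensure_ascii=False) + "\n")
                kept += 1
    return kept

# Example: curate_sft_data("agent_calls.jsonl", "form_filling", "sft_form_filling.jsonl")
```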

Part IV: Generative UI – Web Frontend Code Generation + Image Generation

Path One: Web Frontend Code Generation

Anthropic’s Approach: Claude Artifacts and “Imagine with Claude”

Claude Artifacts

Claude can generate complete frontend code and render it in a sandboxed preview environment:

Supported Output Types:

  • React apps with hooks and components
  • Interactive data visualizations (D3.js, Chart.js)
  • SVG graphics and charts
  • Native HTML/CSS/JavaScript
  • Markdown documents

Workflow:

  1. The user prompt describes the desired interface
  2. The model generates complete frontend code
  3. The code is rendered in a sandboxed preview
  4. The user iterates via conversation

“Imagine with Claude” (Research Preview)

A temporary research preview released with Claude Sonnet 4.5:

Key Characteristics:

  • Claude generates software instantly
  • No prebuilt features
  • No prewritten code
  • Claude creates everything in real time, responding to and adapting to user requests

Technical Demo:

  • Shows what’s possible when a powerful model is combined with the right infrastructure
  • Dynamic software creation without predefined templates
  • Real-time adaptation to user interactions

Watch the demo: youtu.be/dGiqrsv530Y

Google Research’s Approach: Generative UI

“Generative UI: LLMs are Effective UI Generators” (November 2025)

Project page: generativeui.github.io | Research blog: research.google/blog

Abstract

AI models are good at creating content, but their outputs are usually rendered in static, pre‑defined interfaces. In particular, LLM outputs typically arrive as markdown “walls of text”.

Generative UI is a long‑term direction in which models generate not only the content, but also the interface itself.

We show that when correctly prompted and equipped with the right toolset, modern LLMs can robustly generate high‑quality custom UIs for almost any prompt.

Implementation: three main components

  1. Server: exposes endpoints for key tools (image generation, search)
  2. System instructions: carefully designed prompts including goals, planning guidelines, examples
  3. Post‑processor: fixes common issues that cannot be solved through prompting
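
A schematic of how these three components might fit together in a request pipeline. All names here are illustrative; the paper does not publish its implementation as code.

```python
# Schematic sketch of the three-component Generative UI pipeline:
# system instructions + tool server + post-processor. Names are
# illustrative and do not correspond to a published API.

def generate_ui(user_prompt: str, llm, tools, postprocess) -> str:
    """Return renderable HTML for an arbitrary prompt."""
    system_instructions = (
        "You generate a complete, custom web UI for the user's request. "
        "Plan the layout first, then emit self-contained HTML/CSS/JS. "
        "Call the provided tools for images and search results."
    )
    # 1. The model drafts the interface, calling the exposed tools as
    #    needed (e.g. tools["image"](prompt), tools["search"](query)).
    draft_html = llm(system=system_instructions,
                     user=user_prompt,
                     tools=tools)
    # 2. The post-processor fixes recurring issues that prompting alone
    #    cannot eliminate (broken links, missing viewport tags, etc.).
    return postprocess(draft_html)
```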

Evaluation results

When generation speed is not taken into account, the generated UIs are overwhelmingly preferred by human raters over standard LLM markdown outputs.

PAGEN benchmark user preference (ELO scores):

  1. Human expert‑designed websites (highest)
  2. Generative UI: 1710.7 (on par with experts 44% of the time)
  3. Top Google Search results (significant gap)
  4. Standard markdown LLM outputs
  5. Plain text outputs

Emergent capability

This robust Generative UI capability is emergent and marks a substantial improvement over previous models.

Example categories

  • Education: What is a fractal? Probability of rolling 8 with two dice, Ising model, history of timing devices
  • Children’s education: explain speculative decoding to a child, kids’ chemistry experiments, explain slope and tangent using puppies
  • Practical tasks: hosting Thanksgiving, choosing a carpet, how to make a baby mobile
  • Simple queries: what time is it (custom clock interface), green things (visual gallery), dragon fruit (interactive exploration)
  • Games: learn fast typing, clicker game, fashion advisor, memory game, four‑player elemental tic‑tac‑toe, Japanese visual novel, text adventure game

Explore interactive examples: generativeui.github.io

Path Two: Image generation + hybrid architecture

Web front‑end code generation + image generation = optimal combination

Why a hybrid architecture?

Limitations of pure image generation:

  • Image generation models struggle with thousands of words of text
  • Long‑form content (articles, documents, detailed UIs) needs web rendering
  • Pure image outputs lack interactivity and accessibility

Optimal division of labor

| Component | Best method |
|---|---|
| Long‑form text | HTML/CSS rendering |
| Interactive elements | JavaScript/React |
| Visual assets | Image generation |
| Charts, illustrations | Image generation |
| Infographics with text | Hybrid |

The role of image generation in Generative UI

Nano Banana Pro (Gemini 3 Pro Image) supports:

  • Clear text in images for short taglines, headings, posters
  • Multilingual support, enhancing multilingual reasoning
  • Multiple textures, fonts, and calligraphy styles

Best use cases in Generative UI:

  • Hero images and banners with stylized text
  • Product mockups and visual previews
  • Charts and conceptual illustrations
  • Branded assets with consistent style

Architecture:

LLM → HTML/CSS/JS (structure + long text)
→ Image generation API (visual assets)
→ Compose and render interface
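
A minimal sketch of the composition step: the LLM emits HTML with named image placeholders, and a second pass replaces each placeholder with a generated asset. The placeholder convention and the `generate_image` function are assumptions for illustration.

```python
# Sketch of the hybrid composition step: HTML from the LLM carries
# placeholders like {{image: a watercolor banner of ...}}, which are
# replaced by assets from an image-generation call.

import re

PLACEHOLDER = re.compile(r"\{\{image:\s*(.+?)\}\}")

def compose(html_from_llm: str, generate_image) -> str:
    """Replace image placeholders with generated asset URLs."""
    def replace(match: re.Match) -> str:
        prompt = match.group(1)
        url = generate_image(prompt)          # e.g. returns a hosted URL
        return f'<img src="{url}" alt="{prompt}">'
    return PLACEHOLDER.sub(replace, html_from_llm)
```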

Part V: Her – OS‑level assistant

The film “Her” (2013): Samantha’s vision

Film summary

“Her” is a 2013 science‑fiction film directed by Spike Jonze that explores the relationship between a person and an AI operating system.

Plot: Theodore, a lonely writer, develops a relationship with Samantha—an AI with a voice, personality, and the ability to learn and evolve.

Why “Her” matters

Samantha represents the ultimate vision of an AI assistant:

  • Not just responding to commands
  • Truly understanding the user’s life, emotions, and needs
  • Anticipating what the user needs before being asked

Key traits of Samantha

  • 🗣️ Voice‑first: natural, conversational interaction
  • 🌐 Always available: runs in the background, accessible anytime
  • 🧠 Deep user memory: builds a mental model—preferences, habits, values
  • 🎯 Proactive service: anticipates needs without explicit requests
  • Autonomous actions: organizes email, schedules, makes calls
  • 🔄 Real‑time processing: listening, thinking, responding, and acting simultaneously

Core insight: just as we understand friends—not by remembering every conversation, but by building a mental model of who they are.

Core capabilities of an OS‑level assistant

  • 🗣️ Real‑time voice UI: low‑latency natural dialogue with cross‑session context
  • 📞 Calling on your behalf: navigating customer service, gathering information, negotiating
  • 💻 Computer and phone usage: GUI automation across apps, orchestration of cross‑app workflows
  • 🔗 Unified data access: integration with devices and cloud services, intelligent indexing and retrieval
  • Generative UI: dynamic result presentation, interactive exploration and manipulation
  • 🎯 Proactive service: anticipating needs before being asked, personalized value alignment

Personalized value alignment

Analogy: recommender systems

  • 📰 Traditional media: everyone reads the same newspaper and sees the same content
  • 📱 ByteDance/TikTok revolution: everyone sees completely different content
    • “Everyone lives in a different world with different values”
  • Result: personalized products are more human‑centric → users prefer them

Future of AI Agent alignment

Current approach: universal human values

  • LLMs are aligned to “universal” values
  • But do we really have universally agreed‑upon human values?

What AI should do:

  • Not just one universal value set
  • Adapt to each user’s values and preferences
  • Recognize that value differences are huge

Proactive service: the highest level of AI memory

User memory is the core ingredient of proactive service

User memory is not just logging every conversation. Like understanding friends:

  • We don’t remember every sentence they say
  • We build a mental model of who they are
  • Their preferences, habits, and values

Two types of memory:

| Type | Difficulty | Example |
|---|---|---|
| Facts | Simple | Birthday, address, card number |
| Preferences | Complex | Context‑dependent, constantly evolving |

Learning user preferences is much harder than storing factual information:

  • Context‑dependent: paper‑writing style ≠ travel guide style
  • One‑off vs. long‑term: “ordered Sichuan food yesterday” ≠ “likes spicy food”
  • Risk of over‑generalization: AI can easily extrapolate incorrectly
  • Requires fine‑grained evaluation

Three levels of memory capability

Level 1: Basic recall

  • Store and retrieve explicit user information
  • “My membership number is 12345” → reliably recall it
  • Foundation of trustworthiness

Level 2: Cross‑session retrieval

  • Connect information across different conversations
  • Disambiguation: “book a service appointment for my car” → which of the two cars?
  • Understand composite events: “cancel the LA trip” → find flights + hotel
  • Distinguish active contracts from past inquiries

Level 3: Proactive service

  • Anticipate needs without explicit requests
  • Booking an international flight? → check whether the passport is expiring soon
  • Phone broke? → list all protection options (warranty, credit card, carrier insurance)
  • Tax season? → proactively gather all relevant documents

Background GUI Agent: technical challenges

Core problem

Traditional GUI agents require:

  • Foreground application windows
  • Active screen rendering
  • Exclusive access to input devices

This conflicts with:

  • The user simultaneously using the device
  • Battery and resource efficiency

Virtualization approach

Headless browser/app execution:

  • Render apps in a virtual frame buffer
  • The agent interacts with a virtual display

Cloud phone/desktop:

  • Run apps in a cloud environment
  • Stream results to local devices when needed
  • Offload computation from mobile devices

Requirements for fooling apps

Apps must believe they:

  • Are running in the foreground
  • Receive normal user input
  • Render to a real display

Technical mechanisms:

  • Window manager virtualization
  • Input event injection
  • Leveraging accessibility APIs
  • Container/sandbox isolation
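
For the web portion of this problem, a headless browser already provides most of the virtualization for free: the page renders off‑screen and input events are injected programmatically, leaving the user's own screen and input devices untouched. Native desktop and mobile apps need the heavier mechanisms listed above. A sketch using Playwright (the URL and selectors are placeholders):

```python
# Sketch of a background web task in a headless browser: virtual
# rendering plus injected input, with a screenshot as feedback for the
# agent. URL and selectors are placeholders.

from playwright.sync_api import sync_playwright

def fill_form_in_background(url: str, fields: dict[str, str]) -> bytes:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)   # virtual rendering
        page = browser.new_page()
        page.goto(url)
        for selector, value in fields.items():
            page.fill(selector, value)               # injected input events
        screenshot = page.screenshot()               # feedback for the agent
        browser.close()
        return screenshot
```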

Data synchronization

For cloud‑side operations:

  • Application state migration
  • Account credential management
  • Caching and data synchronization

Benefit: operations continue when the device is offline, especially valuable for desktops.

Cross‑device and cross‑service data integration

Unified data access

Example: Gemini integrated with Google Workspace

  • Gmail, Drive, Calendar, Docs
  • Cross‑service querying and retrieval
  • Context‑aware suggestions

Requirements:

  • OAuth and API integrations
  • Permission management
  • Data indexing infrastructure

Indexes for AI Agent retrieval

Local device data:

  • File system indexing
  • App data extraction

Cloud service data:

  • Incremental search index construction

Technical architecture

User data sources
├── Local files
├── Cloud storage (Drive, iCloud, etc.)
├── Email and calendar
├── Messaging apps
└── Browser history/bookmarks

Indexing layer

Retrieval API

Agent query interface
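
A minimal sketch of the indexing and retrieval layers. It is in‑memory and keyword‑based purely for illustration; a real system would use an embedding index with incremental updates, as noted above.

```python
# Sketch of the index layer and agent query interface: documents from any
# source are indexed once, then queried through a single entry point.

from dataclasses import dataclass, field

@dataclass
class Document:
    source: str        # "local_file", "gmail", "calendar", ...
    doc_id: str
    text: str

@dataclass
class Index:
    docs: list[Document] = field(default_factory=list)

    def add(self, doc: Document) -> None:
        self.docs.append(doc)

    def search(self, query: str, k: int = 5) -> list[Document]:
        # Naive keyword scoring stands in for a real embedding index.
        terms = query.lower().split()
        scored = [(sum(t in d.text.lower() for t in terms), d) for d in self.docs]
        ranked = sorted(scored, key=lambda pair: -pair[0])[:k]
        return [doc for score, doc in ranked if score > 0]

def agent_query(index: Index, question: str) -> list[Document]:
    """One retrieval interface, regardless of where the data lives."""
    return index.search(question)
```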

Economic model: revenue sharing with app providers

Ecosystem challenge

Current situation:

  • Agents operate apps on behalf of users
  • Traffic is captured by the agent, not the original app
  • Apps lose ad revenue and engagement metrics
  • Result: apps may block agent access

Example: GUI agents being blocked by some messaging and social apps due to traffic and monetization issues

Why blocking is a problem

  • Degrades user experience
  • Fragments Agent capabilities
  • Arms race between Agents and apps
  • Ultimately harms all parties

Proposed solution: revenue sharing

Principle: Agent companies and app providers must establish profit-sharing mechanisms

Possible models:

  1. Transaction-based sharing

    • Agent facilitates purchase → share revenue
  2. Subscription partnerships

    • Joint premium tiers
    • Feature access agreements
  3. API access fees

    • Formalized Agent API access
    • Usage-based pricing

Strategic imperatives

  • Advancement of Agent technology is inevitable
  • Collaboration benefits both sides
  • Standards are needed for a sustainable ecosystem

Key points

Technical roadmap

Real-time voice interaction:

  • SEAL architecture: streaming, event-driven Agent loop
  • Interactive ReAct: listen while thinking, think while speaking
  • Solve the latency bottleneck of serial processing

Multi-Agent coordination:

  • Concurrent phone and computer use
  • Autonomous Agent generation and orchestration
  • Small models (4–8B) are sufficient for specialized tasks

Generative UI:

  • Web front-end code generation (Claude Artifacts, Google Generative UI)
  • Image generation (Nano Banana Pro)
  • Hybrid architectures combining the two

System architecture

Extended modalities:

  • Observation: voice + visual perception
  • Action: dialogue + UI generation + computer use

OS-level integration:

  • Background/cloud operations without occupying UI
  • Unified data access across devices and services
  • Revenue-sharing models for ecosystem sustainability

Vision

The “Her” paradigm: an OS-level assistant combining:

  • Real-time voice conversation
  • Autonomous task execution
  • Generative interface presentation
  • Seamless integration with the user’s digital life

Future: three stages of AI Agent–environment interaction

Real-time asynchronous interaction with the environment is the foundation of Agents

🗣️ Stage 1: Voice

  • Input: voice
  • Output: voice
  • Data rate: 15–50 tokens/s
  • Latency: <500 ms
  • Challenge: balance between fast and slow thinking
  • Solution: Interactive ReAct

💻 Stage 2: Computer use

  • Input: vision (screenshots)
  • Output: mouse/keyboard actions
  • Data rate: ~2K tokens/frame
  • Latency: <1 second
  • Challenge: precise action execution
  • Solution: VLA models + RL

🤖 Stage 3: Physical world

  • Input: vision + voice + touch
  • Output: voice + joint movements
  • Data rate: ~20K tokens/s
  • Latency: <100 ms
  • Challenge: real-time control
  • Solution: VLA + world models

Key insight: complexity increases (data rate ↑, latency ↓), but architectural solutions can transfer across stages

References

Papers

  1. Yan, H., Wang, J., et al. (2025). Step-GUI Technical Report. arXiv:2512.15431
  2. Belcak, P., Heinrich, G., et al. (2025). Small Language Models are the Future of Agentic AI. arXiv:2506.02153
  3. Google Research. (2025). Generative UI: LLMs are Effective UI Generators.

Technical resources

  1. NVIDIA Research. Train Small Orchestration Agents to Solve Big Problems. NVIDIA Developer Blog
  2. Anthropic. Introducing Claude Sonnet 4.5. anthropic.com
  3. Google Research. Generative UI: A Rich, Custom, Visual Interactive User Experience for Any Prompt. research.google

View full talk slides
