The Next Stop in Agent–Human Interaction: Real‑Time Voice and Generative UI
(This article is based on the invited talk I gave at the first Intelligent Agent Networks and Application Innovation Conference on December 20, 2025.)
View Slides (HTML), Download PDF Version
Abstract
Today’s agent–human interaction is centered on text, but that deviates from natural human cognition. From first principles, the modality humans are best at producing is speech (speaking is about three times faster than typing), and the modality humans are best at consuming is vision; and vision is best served not by walls of text, but by intuitive UI.
The first step is achieving real‑time voice interaction. The traditional serial VAD–ASR–LLM–TTS architecture suffers from having to wait for the user to finish speaking before it can start “thinking,” and it cannot output before the thinking is done. With an Interactive ReAct continuous‑thinking mechanism, the agent can listen, think, and speak at the same time: it starts thinking while the user is talking, and keeps deepening its reasoning while it’s speaking itself, making full use of all idle time gaps.
The second step is to expand the observation space and action space on top of real‑time voice. By extending the Observation Space (from voice input to Computer Use–style visual perception) and the Action Space (from voice output to UI generation and computer control), the agent can operate existing computer/phone GUIs while on a call, and generate dynamic UI to interact with the user. One implementation path for generative UI is generating front‑end code; Claude 4.5 Sonnet has already reached the threshold for this. Another path is generating images; Nano Banana Pro is also close to this threshold.
This is exactly the path toward realizing Samantha from the movie Her. As an operating system, Samantha needs six core capabilities: holding real‑time voice conversations with the user, making phone calls and handling tasks on the user’s behalf, operating traditional computers and phones for the user, bridging data across the user’s existing devices and online services, having her own generative UI, and maintaining powerful long‑term user memory for personalized, proactive service.
Part I: The Efficiency Bottleneck of Text‑Based Interaction
Cognitive Mismatch in Today’s Agent Interfaces
Human output modalities
| Modality | Speed | Cognitive load |
|---|---|---|
| Speech | 150 words/minute | Low |
| Typing | 40–50 words/minute | High |
- Speech is the most natural output modality
- Typing requires fine motor coordination and visual attention
- Spoken communication is the foundation of human interaction
Human input modalities
| Modality | Bandwidth | Understanding method |
|---|---|---|
| Vision | High (~10 Mbps) | Pattern recognition |
| Hearing | Medium (~16 kbps) | Sequential processing |
| Reading | Low (~100 bps) | Requires literacy |
- The visual cortex processes information in parallel
- Reading is an acquired skill, not innate
- UI/graphics leverage natural visual processing capabilities
Fundamental insight: Today’s text‑based agent interfaces force humans to use suboptimal modalities for both input and output.
The Optimal Interaction Paradigm
Optimal human output: speech
- Naturally produces language at conversational speed
- Minimal cognitive overhead
- Supports complex expression and subtle nuance
- Supports real‑time interaction and interruption
Use cases:
- Task delegation and clarification
- Real‑time feedback while the agent executes tasks
- Multi‑turn problem solving
Optimal human input: visual UI
- Rapid scanning and understanding of information
- Spatial organization aids memory and navigation
- Interactive elements support direct manipulation
- Progressive disclosure manages complexity
Use cases:
- Result presentation and comparison
- Interactive data exploration
- Workflow visualization and status monitoring
Target architecture: Human‑to‑agent communication via real‑time voice + agent‑to‑human communication via generative UI
Part II: Real‑Time Voice Interaction
Typical Voice Agent Architecture
A typical voice agent architecture is divided into three layers:
Perception layer: VAD + ASR
- Converts continuous signals into discrete events
Thinking layer: LLM
- Asynchronous processing
Execution layer: TTS
- Converts discrete commands into continuous actions
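To make the serial bottleneck concrete, here is a minimal sketch of this pipeline with stubbed‑out components (the functions and timings are placeholders, not any real API); each stage blocks on the previous one, so the latencies simply add up:

```python
import time

# Stub components for illustration only; real VAD/ASR/LLM/TTS systems would replace them.
def run_vad(audio):
    time.sleep(0.7)              # must wait ~500-800 ms of silence to confirm end of speech
    return audio

def run_asr(audio):
    time.sleep(0.3)              # batch transcription of the finished utterance
    return "I want to lower my bill to $79"

def run_llm(text):
    time.sleep(1.0)              # "thinking" starts only after the full transcript arrives
    return "Let me check which promotions fit your account."

def run_tts(text):
    time.sleep(0.3)              # speech starts only after thinking finishes
    return b"<audio bytes>"

def serial_voice_turn(audio):
    """Traditional VAD -> ASR -> LLM -> TTS: every stage blocks on the previous one."""
    start = time.time()
    reply = run_tts(run_llm(run_asr(run_vad(audio))))
    print(f"first audio after {time.time() - start:.1f}s")   # latencies simply add up
    return reply

serial_voice_turn(b"<user audio>")
```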
Problems with the Traditional VAD + ASR Architecture
Issues with VAD (Voice Activity Detection)
- Unavoidable latency: Must wait for 500–800 ms of continuous silence to confirm that the user has finished speaking
- Poor interruption detection: Can’t distinguish background noise/music; “uh‑huh” easily triggers false interruptions
- Low accuracy in speech detection: Fails in complex acoustic environments; mid‑sentence pauses cause truncation
Issues with ASR (Automatic Speech Recognition)
- Low accuracy from lack of context: VAD slices audio into isolated segments; can’t use context for disambiguation; high error rate on emails, names, phone numbers
- Lack of world knowledge: Can’t leverage common sense; low accuracy on addresses, brands, technical terms
- Pure text output loses acoustic details:
- Loses emotion: happiness, frustration, excitement
- Loses paralinguistic information: laughter, sighs, breathing
- Loses environmental information: noisy, musical, quiet
Streaming Speech Perception Models: A Replacement for VAD + ASR
Multimodal architecture
- Audio encoder (from Whisper): Converts audio into audio tokens
- Qwen LLM (autoregressive): Processes audio tokens and outputs text + events
Key advantages:
- Streaming: Real‑time output (non‑batch)
- Context: Preserves full dialogue history
- In‑context learning: More accurate recognition of personal info and domain terms
- World knowledge: Higher accuracy on addresses, brands, and amounts
Rich output: text + acoustic events
In addition to text tokens, it outputs special tokens (acoustic events):
- `<speak_start>` / `<speak_end>`: Speech boundaries
- `<interrupt>`: Interruption intent
- `<emotion:happy>`: Emotion tags
- `<laugh>` / `<sigh>`: Paralinguistic information
- `<music>`: Environmental sounds
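As a rough illustration of how a downstream consumer might handle such a stream (the event names follow the list above; the framing of the stream itself is an assumption):

```python
import re

# Illustrative consumer of a streamed mix of text tokens and acoustic-event tokens.
EVENT = re.compile(r"<[^>]+>")

def handle_stream(tokens):
    for tok in tokens:
        if EVENT.fullmatch(tok):
            if tok == "<interrupt>":
                print("event: interruption intent -> stop TTS, yield the turn")
            elif tok.startswith("<emotion:"):
                print(f"event: {tok} -> adapt the tone of the reply")
            else:
                print(f"event: {tok}")
        else:
            print(f"text: {tok!r}")   # ordinary transcript text, produced with full dialogue context

handle_stream(["<speak_start>", "Great,", " thanks", " for", " the", " help",
               "<emotion:happy>", "<laugh>", "<speak_end>"])
```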
Interactive ReAct: Flexibly Interleaving Observation, Thought, and Action
Traditional ReAct: rigid OTA loop
```
O₁: "I want to lower my Xfinity bill to $79 per month"
```
- Fixed loop: Must complete an entire observe–think–act sequence
- Lost thinking: Can’t think while listening; high latency
- Rigid: Must wait for complete input before thinking
Interactive ReAct: flexible interleaving OTA
```
O₁: "I want to lower my Xfinity bill to $79 per month"
```
- Think while listening: New observations can be inserted at any time, and ongoing thoughts are preserved
- Think while speaking: Respond quickly, then continue thinking
- Intelligent turning‑point decisions: Decides when to speak and when to remain silent
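As a rough illustration of the control flow (the names and structure are mine, not the SEAL implementation), the loop below folds new observations into the context whenever they arrive and spends idle time on incremental thinking:

```python
import queue

def interactive_react_loop(events, max_idle_steps=20):
    """Interleaved O/T/A loop: observations can arrive at any time without clearing thoughts."""
    context = []                                   # interleaved observe/think/act history
    idle_steps = 0
    while idle_steps < max_idle_steps:
        try:
            obs = events.get(timeout=0.05)         # a new observation can arrive at any moment
            context.append(("O", obs))             # inserted without discarding ongoing thoughts
            idle_steps = 0
        except queue.Empty:
            context.append(("T", "one small step of thinking"))   # use the idle time
            idle_steps += 1
            # turn-taking decision (simplified here: speak a chunk after each thought step)
            context.append(("A", "spoken reply chunk"))
    return context

q = queue.Queue()
q.put("user: I want to lower my Xfinity bill to $79 per month")
for step in interactive_react_loop(q)[:6]:
    print(step)
```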
SEAL Thinking Layer: An Interruptible Interactive ReAct Loop
Key insight: LLM thinking is far faster than voice I/O; fully exploit “idle time”
LLM processing speed:
- Input processing: 500+ tokens/second
- Thinking/output: 100+ tokens/second
Voice I/O speed:
- Voice input/output: only ~5 tokens/second
- Speed difference: 20–100×
During the “idle time” between observation (voice input) and action (voice output), we have plenty of time for deep thinking. A rigid observe–think–act loop cannot use this idle time.
Fast thinking → slow thinking → continuous thinking
- Fast response (0.5s): 50 tokens of quick thinking → immediate preliminary response (within 5 seconds)
- Deep analysis (after 5s): 500 tokens of slow thinking → generate a more complete answer
- Continuous thinking (as needed): If 500 tokens still aren’t enough, keep thinking for another 5 seconds → continue generating answers until both thinking and speaking are done. If multiple rounds of thought are needed, the result is continuous output of current‑round thought summaries, like someone “thinking out loud.”
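A rough sketch of this schedule, with the 50/500‑token budgets taken from the description above and everything else (the generate() stub, the number of rounds) standing in as assumptions:

```python
def generate(prompt, max_tokens):
    return f" <{max_tokens}-token thought>"        # stand-in for a budgeted LLM call

def answer_with_budgets(question, speak, max_rounds=3):
    thought = generate(question, max_tokens=50)                   # fast thinking (~0.5 s)
    speak("preliminary answer" + generate(question + thought, max_tokens=50))
    for _ in range(max_rounds):                                   # continuous thinking, ~5 s per round
        thought += generate(question + thought, max_tokens=500)   # slow, deeper analysis
        speak("thinking-out-loud summary" + generate(question + thought, max_tokens=100))

answer_with_budgets("Can you lower my bill to $79?", speak=print)
```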
Think While Listening
Handling interruptions in conversation gracefully.
Traditional ReAct: Once the user interrupts, all previous thinking is discarded and you must start over.
Interactive ReAct: Preserves the interrupted thought process, attaches the new user input, and lets the model continue thinking from the point of interruption.
```
<user>I want to switch my plan from the current $109 one to your new plan...</user>
```
Advantage: A coherent thought process that can quickly adjust strategy based on the latest information.
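One way to picture this, as a rough sketch (the prompt layout and names are my own illustration, not an exact format): keep the partial thought, append the interrupting utterance, and ask the model to continue from where it left off.

```python
def resume_after_interruption(history, partial_thought, new_user_utterance):
    """Rebuild the prompt so the model resumes rather than restarts after an interruption."""
    return (history
            + [("thought", partial_thought + " ...<interrupted>")]   # keep what was reasoned so far
            + [("user", new_user_utterance)]                         # the interrupting input
            + [("instruction", "Continue from the interrupted thought and update "
                               "the plan with the user's latest request.")])

prompt = resume_after_interruption(
    history=[("user", "I want to switch my plan from the current $109 one...")],
    partial_thought="The $79 promotional plan fits; still need the contract end date",
    new_user_utterance="Actually, can you also keep my unlimited data?")
print(prompt[-2:])
```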
Speak While Thinking
Uses “filler speech” to buy time for deeper thought, reducing first‑token latency.
Scenario: The user asks a complex question and the agent needs time to think.
Traditional ReAct:
```
<user>Do you confirm ordering this plan?</user>
```
Interactive ReAct:
```
<user>Do you confirm ordering this plan?</user>
```
Advantage: Greatly improves interaction fluency and avoids awkward long waits.
SEAL Architecture Summary
A unified event‑driven loop that decouples perception, thinking, and execution, achieving truly real‑time and parallel processing.
Perception layer
- Input: Continuous signals (voice, GUI)
- Output: Discrete event streams
- Solves: Latency, unnatural interruptions, and acoustic information loss in traditional speech perception
Thinking Layer
- Input: Discrete event stream
- Output: Interleaved thoughts/action commands
- Solves: The serial bottleneck of traditional ReAct, enabling interruptible, asynchronous listening-while-thinking and thinking-while-speaking
Execution Layer
- Input: Discrete action commands
- Output: Continuous signals + feedback events
- Solves: The “last mile” problem of agents being clumsy and lacking feedback, forming a closed-loop action cycle
Future Outlook: End-to-End Models
Current SEAL Architecture:
- Perception Layer LLM: audio → text + acoustic events
- Thinking Layer LLM: text + acoustic events → thoughts + actions
- Execution Layer LLM: actions → audio
Future End-to-End Architecture:
- Audio encoder: audio → audio tokens
- Unified LLM: perception + thinking + execution
- Audio decoder: audio tokens → audio
Part III: Making Phone Calls and Using a Computer at the Same Time
Extended Observation and Action Space
Extended Observation Space
| Traditional | Extended |
|---|---|
| Voice input | Voice input |
| | + Screen visual perception |
| | + Application state monitoring |
| | + System notifications |
Computer-Use Integration:
- Real-time screen understanding
- UI element recognition and tracking
- Cross-application context awareness
Extended Action Space
| Traditional | Extended |
|---|---|
| Voice output | Voice output |
| | + Mouse/keyboard operations |
| | + UI generation |
| | + Application control |
Multimodal Output:
- Voice for human communication
- GUI operations for task execution
- Generated UI for result presentation
Target Capability: An agent that can carry on a phone conversation and operate a computer interface at the same time, similar to a human assistant talking on the phone while using a computer.
Multi-Agent Architecture for Concurrent Calling and Computer Use
Architecture Design
Phone Agent:
- Handles real-time voice conversations
- Low-latency ASR → LLM → TTS pipeline
- Extracts key information from the conversation
- Communicates with the Computer Agent via message passing
Computer Agent:
- Responsible for GUI operations (browser, apps)
- Visual understanding and action planning
- Receives information from the Phone Agent
- Reports task status and requests additional information
Communication Protocol
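A minimal sketch of what one such message might look like (the field names are illustrative assumptions, not an actual schema):

```python
import json

# Illustrative message for Phone Agent <-> Computer Agent communication.
message = {
    "from": "phone_agent",
    "to": "computer_agent",
    "type": "info_update",          # e.g. info_update, info_request, status_report
    "payload": {"customer_name": "Alice", "target_price_usd": 79},
    "turn_id": 12,
}
print(json.dumps(message, indent=2))
```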
Autonomous Orchestration Method
The agent autonomously decides when to spawn a collaborative agent:
- Task analysis: The Computer Agent encounters a complex form that requires user information
- Capability assessment: Determines that voice interaction is more efficient than text
- Agent generation: Calls initiate_phone_call_agent(purpose, required_info)
- Parallel execution: The two agents run independently and communicate asynchronously
Real-Time Collaboration Pattern
```
Phone Agent: "May I ask your name?"
```
Key Requirement: Agents must run in truly parallel threads without blocking each other.
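A rough sketch of this orchestration, with initiate_phone_call_agent spawning the Phone Agent as a non‑blocking worker thread (the body is a stand‑in for the real voice pipeline, not the production code):

```python
import threading, queue

inbox = queue.Queue()               # asynchronous channel back to the Computer Agent

def initiate_phone_call_agent(purpose, required_info):
    def run():
        for field in required_info:                       # e.g. ask the user over the phone
            inbox.put({"field": field, "value": f"<spoken answer for {field}>"})
    worker = threading.Thread(target=run, daemon=True)    # truly parallel, non-blocking
    worker.start()
    return worker

worker = initiate_phone_call_agent(
    purpose="collect the details this form needs",
    required_info=["name", "account_number"])
worker.join()
while not inbox.empty():
    print(inbox.get())              # the Computer Agent consumes answers as they arrive
```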
Solving Computer-Use Latency: Small Specialized Models
Step-GUI: An Efficient GUI Agent Using Small Models (arxiv:2512.15431)
Performance vs. SOTA Models
| Benchmark | OpenAI CUA | Claude-4.5 | Gemini-2.5 | Step-GUI 8B |
|---|---|---|---|---|
| OSWorld-Verified | 23.0 | 61.4 | - | 48.5 |
| AndroidWorld | - | - | 69.7 | 80.2 |
| ScreenSpot-Pro | 23.4 | - | - | 62.6 |
| OSWorld-G | 36.4 | - | - | 70.0 |
Why Can Small Models Outperform Frontier Models?
Self-Evolution Training Pipeline:
- Calibrated Step Reward System (CSRS): Converts model trajectories into training signals, >90% accuracy, cost only 1/10–1/100 of human labeling
- Domain-targeted data: 11.2M mid-train + 1.67M cold-start samples
Core Insight: Frontier models lack domain knowledge (UI conventions of Chinese apps, behavior of local apps). Small models + targeted training can fill these gaps.
AndroidDaily Benchmark (a Chinese mobile-app benchmark proposed by Step-GUI)
Real-world mobile app tasks across 5 scenarios:
- 🚄 Travel: Buy train tickets on 12306
- 🎵 Entertainment: Play a playlist on NetEase Cloud Music
- 🛒 Shopping: Check the shopping cart on Taobao
- 💬 Social Media: Change privacy settings on Zhihu
- 🍜 Local Services: Set a Dianping review to be visible only to yourself
AndroidDaily (Static) Results
| Model | Average Accuracy |
|---|---|
| Claude-4.5-sonnet | 10.90 |
| Gemini-2.5-Pro Thinking | 43.74 |
| Step-GUI-8B | 89.91 |
Conclusion: Small model + domain data > general-purpose large model. With targeted domain training, Step-GUI-8B achieves 2× Gemini and 8× Claude performance on Chinese mobile apps.
NVIDIA ToolOrchestra: Small Models for Multi-Agent Coordination
Based on NVIDIA research (developer.nvidia.com)
Core Concept
ToolOrchestra trains small orchestration models to supervise and manage larger models and tools according to user preferences for:
- Speed
- Cost
- Accuracy
Key Insight: Small models are not burdened by excessive knowledge and can be trained to capture the essential decision patterns of orchestration.
Training Method
- Synthetic data generation: Automatic trajectory generation and verification
- Multi-objective RL: Optimize for accuracy, cost, and solution time
- Minimal data requirements: Orchestrator-8B uses only 552 synthetic samples
Architectural Advantages
- Small (8B) model precisely guides larger models
- Automatically balances capability and cost
- Supports heterogeneous multi-agent systems
- Suitable for real-world deployment scenarios
Case Studies of Small Language Models in Agentic AI
Based on “Small Language Models are the Future of Agentic AI” (arxiv:2506.02153)
Core Argument
Position: Small language models (SLMs) are:
- Powerful enough for specialized tasks
- Naturally better suited to agentic applications
- Inevitably more economical for high-frequency calls
Arguments for SLMs in Agent Systems
Capability Argument:
- Modern SLMs perform strongly on focused tasks
- Agentic systems involve repetitive, specialized operations
- General-purpose conversational ability is often unnecessary
Economic Argument:
- Agent systems make a large number of model calls
- Cost grows linearly with model size and number of calls
- SLMs reduce operating costs by 10–100×
Algorithm for Transitioning from LLM to SLM
- Identify specialized subtasks in the agent workflow
- Curate task-specific training data from LLM outputs
- Fine-tune SLMs on specialized data
- Validate performance on target metrics
- Deploy SLMs for production workloads
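A sketch of this recipe as an end‑to‑end outline, with every step stubbed out so it runs (the function names are placeholders, not the paper’s code):

```python
def identify_specialized_subtasks(logs):   return {"form_filling": logs}             # step 1
def curate_training_data(examples):        return [{"x": e, "y": "LLM label"} for e in examples]  # step 2
def finetune_slm(base, data):              return f"{base} fine-tuned on {len(data)} samples"     # step 3
def validate(model):                       return True                               # step 4: target metrics
def deploy(model, task):                   print(f"deploying '{model}' for {task}")  # step 5

logs = ["call transcript 1", "call transcript 2"]
for task, examples in identify_specialized_subtasks(logs).items():
    data = curate_training_data(examples)
    slm = finetune_slm("slm-8b", data)
    if validate(slm):
        deploy(slm, task)
```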
Part IV: Generative UI – Web Frontend Code Generation + Image Generation
Path One: Web Frontend Code Generation
Anthropic’s Approach: Claude Artifacts and “Imagine with Claude”
Claude Artifacts
Claude can generate complete frontend code and render it in a sandboxed preview environment:
Supported Output Types:
- React apps with hooks and components
- Interactive data visualizations (D3.js, Chart.js)
- SVG graphics and charts
- Native HTML/CSS/JavaScript
- Markdown documents
Workflow:
- The user prompt describes the desired interface
- The model generates complete frontend code
- The code is rendered in a sandboxed preview
- The user iterates via conversation
“Imagine with Claude” (Research Preview)
A temporary research preview released with Claude Sonnet 4.5:
Key Characteristics:
- Claude generates software instantly
- No prebuilt features
- No prewritten code
- Claude creates everything in real time, responding to and adapting to user requests
Technical Demo:
- Shows what’s possible when a powerful model is combined with the right infrastructure
- Dynamic software creation without predefined templates
- Real-time adaptation to user interactions
Watch the demo: youtu.be/dGiqrsv530Y
Google Research’s Approach: Generative UI
“Generative UI: LLMs are Effective UI Generators” (November 2025)
Project page: generativeui.github.io | Research blog: research.google/blog
Abstract
AI models are good at creating content, but their outputs are usually rendered in static, predefined interfaces; in particular, LLM outputs are usually markdown “walls of text”.
Generative UI is a long-term commitment where models not only generate content, but also generate the interface itself.
We show that when correctly prompted and equipped with the right toolset, modern LLMs can robustly generate high‑quality custom UIs for almost any prompt.
Implementation: three main components
- Server: exposes endpoints for key tools (image generation, search)
- System instructions: carefully designed prompts including goals, planning guidelines, examples
- Post‑processor: fixes common issues that cannot be solved through prompting
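A minimal sketch of how these three components might fit together (the prompt text, tool names, and post‑processing rules here are illustrative, not the actual system; see generativeui.github.io for the real one):

```python
SYSTEM_INSTRUCTIONS = """You are a UI generator. Plan the page first, then emit one
self-contained HTML document. You may call the tools image_gen(prompt) and
search(query) exposed by the server and reference their results."""

def post_process(html: str) -> str:
    # fix common issues that prompting alone does not reliably solve
    if "<!doctype" not in html.lower():
        html = "<!DOCTYPE html>\n" + html
    return html

def generate_ui(llm_call, user_prompt: str) -> str:
    raw = llm_call(system=SYSTEM_INSTRUCTIONS, user=user_prompt)
    return post_process(raw)

# usage with any chat-completion wrapper standing in for the model:
print(generate_ui(lambda system, user: "<html><body><h1>Fractals</h1></body></html>",
                  "What is a fractal?"))
```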
Evaluation results
When generation speed is ignored, the generated UIs are overwhelmingly preferred by human raters, beating standard LLM markdown outputs.
PAGEN benchmark user preference (ELO scores):
- Human expert‑designed websites (highest)
- Generative UI: 1710.7 (on par with experts 44% of the time)
- Top Google Search results (significant gap)
- Standard markdown LLM outputs
- Plain text outputs
Emergent capability
This robust Generative UI capability is emergent: it represents a substantial improvement over previous models.
Example categories
- Education: What is a fractal? Probability of rolling 8 with two dice, Ising model, history of timing devices
- Children’s education: explain speculative decoding to a child, kids’ chemistry experiments, explain slope and tangent using puppies
- Practical tasks: hosting Thanksgiving, choosing a carpet, how to make a baby mobile
- Simple queries: what time is it (custom clock interface), green things (visual gallery), dragon fruit (interactive exploration)
- Games: learn fast typing, clicker game, fashion advisor, memory game, four‑player elemental tic‑tac‑toe, Japanese visual novel, text adventure game
Explore interactive examples: generativeui.github.io
Path Two: Image generation + hybrid architecture
Web front‑end code generation + image generation = optimal combination
Why a hybrid architecture?
Limitations of pure image generation:
- Image generation models struggle with thousands of words of text
- Long‑form content (articles, documents, detailed UIs) needs web rendering
- Pure image outputs lack interactivity and accessibility
Optimal division of labor
| Component | Best method |
|---|---|
| Long‑form text | HTML/CSS rendering |
| Interactive elements | JavaScript/React |
| Visual assets | Image generation |
| Charts, illustrations | Image generation |
| Infographics with text | Hybrid |
The role of image generation in Generative UI
Nano Banana Pro (Gemini 3 Pro Image) supports:
- Clear text in images for short taglines, headings, posters
- Multilingual support, enhancing multilingual reasoning
- Multiple textures, fonts, and calligraphy styles
Best use cases in Generative UI:
- Hero images and banners with stylized text
- Product mockups and visual previews
- Charts and conceptual illustrations
- Branded assets with consistent style
Architecture:
```
LLM → HTML/CSS/JS (structure + long text)
Image model → visual assets (hero images, charts, illustrations)
Rendered page = generated front-end code + embedded generated images
```
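A rough sketch of the composition step (generate_page and generate_image are placeholders, not real APIs): the LLM emits the page skeleton with an image slot, and the image model fills that slot.

```python
def generate_page(prompt):
    return ("<html><body><h1>Dragon fruit</h1><img src='{HERO}'/>"
            "<p>Long-form text rendered as HTML...</p></body></html>")

def generate_image(prompt):
    return "hero_dragonfruit.png"                      # e.g. produced by an image model

def render_hybrid(prompt):
    page = generate_page(prompt)                       # structure + long text
    hero = generate_image(f"hero image for: {prompt}") # stylized visual asset
    return page.replace("{HERO}", hero)

print(render_hybrid("dragon fruit"))
```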
Part V: Her – OS‑level assistant
The film “Her” (2013): Samantha’s vision
Film summary
“Her” is a 2013 science‑fiction film directed by Spike Jonze that explores the relationship between a person and an AI operating system.
Plot: Theodore, a lonely writer, develops a relationship with Samantha—an AI with a voice, personality, and the ability to learn and evolve.
Why “Her” matters
Samantha represents the ultimate vision of an AI assistant:
- Not just responding to commands
- Truly understanding the user’s life, emotions, and needs
- Anticipating what the user needs before being asked
Key traits of Samantha
- 🗣️ Voice‑first: natural, conversational interaction
- 🌐 Always available: runs in the background, accessible anytime
- 🧠 Deep user memory: builds a mental model—preferences, habits, values
- 🎯 Proactive service: anticipates needs without explicit requests
- ⚡ Autonomous actions: organizes email, schedules, makes calls
- 🔄 Real‑time processing: listening, thinking, responding, and acting simultaneously
Core insight: just as we understand friends—not by remembering every conversation, but by building a mental model of who they are.
Core capabilities of an OS‑level assistant
- 🗣️ Real‑time voice UI: low‑latency natural dialogue with cross‑session context
- 📞 Calling on your behalf: navigating customer service, gathering information, negotiating
- 💻 Computer and phone usage: GUI automation across apps, orchestration of cross‑app workflows
- 🔗 Unified data access: integration with devices and cloud services, intelligent indexing and retrieval
- ✨ Generative UI: dynamic result presentation, interactive exploration and manipulation
- 🎯 Proactive service: anticipating needs before being asked, personalized value alignment
Personalized value alignment
Analogy: recommender systems
- 📰 Traditional media: everyone reads the same newspaper and sees the same content
- 📱 ByteDance/TikTok revolution: everyone sees completely different content
- “Everyone lives in a different world with different values”
- ✅ Result: personalized products are more human‑centric → users prefer them
Future of AI Agent alignment
Current approach: universal human values
- LLMs are aligned to “universal” values
- But do we really have universally agreed‑upon human values?
What AI should do:
- Not just one universal value set
- Adapt to each user’s values and preferences
- Recognize that value differences are huge
Proactive service: the highest level of AI memory
User memory is the core ingredient of proactive service
User memory is not just logging every conversation. Like understanding friends:
- We don’t remember every sentence they say
- We build a mental model of who they are
- Their preferences, habits, and values
Two types of memory:
| Type | Difficulty | Example |
|---|---|---|
| Facts | Simple | Birthday, address, card number |
| Preferences | Complex | Context‑dependent, constantly evolving |
Learning user preferences is much harder than storing factual information:
- Context‑dependent: paper‑writing style ≠ travel guide style
- One‑off vs. long‑term: “ordered Sichuan food yesterday” ≠ “likes spicy food”
- Risk of over‑generalization: AI can easily extrapolate incorrectly
- Requires fine‑grained evaluation
Three levels of memory capability
Level 1: Basic recall
- Store and retrieve explicit user information
- “My membership number is 12345” → reliably recall it
- Foundation of trustworthiness
Level 2: Cross‑session retrieval
- Connect information across different conversations
- Disambiguation: “book a service appointment for my car” → which of the two cars?
- Understand composite events: “cancel the LA trip” → find flights + hotel
- Distinguish active contracts from past inquiries
Level 3: Proactive service
- Anticipate needs without explicit requests
- Booking an international flight? → check whether the passport is expiring soon
- Phone broke? → list all protection options (warranty, credit card, carrier insurance)
- Tax season? → proactively gather all relevant documents
Background GUI Agent: technical challenges
Core problem
Traditional GUI agents require:
- Foreground application windows
- Active screen rendering
- Exclusive access to input devices
This conflicts with:
- The user simultaneously using the device
- Battery and resource efficiency
Virtualization approach
Headless browser/app execution:
- Render apps in a virtual frame buffer
- The agent interacts with a virtual display
Cloud phone/desktop:
- Run apps in a cloud environment
- Stream results to local devices when needed
- Offload computation from mobile devices
Requirements for fooling apps
Apps must believe they:
- Are running in the foreground
- Receive normal user input
- Render to a real display
Technical mechanisms:
- Window manager virtualization
- Input event injection
- Leveraging accessibility APIs
- Container/sandbox isolation
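For web apps specifically, a headless browser already provides this kind of virtual rendering surface and input channel; the sketch below uses Playwright as one possible illustration (the URL is a placeholder, and native mobile apps would still need the virtual‑display and input‑injection mechanisms listed above):

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)   # renders into a virtual frame buffer
    page = browser.new_page()
    page.goto("https://example.com/account")     # the app behaves as if it were in the foreground
    page.screenshot(path="state.png")            # the agent "sees" the state via screenshots
    browser.close()
```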
Data synchronization
For cloud‑side operations:
- Application state migration
- Account credential management
- Caching and data synchronization
Benefit: operations continue when the device is offline, especially valuable for desktops.
Cross‑device and cross‑service data integration
Unified data access
Example: Gemini integrated with Google Workspace
- Gmail, Drive, Calendar, Docs
- Cross‑service querying and retrieval
- Context‑aware suggestions
Requirements:
- OAuth and API integrations
- Permission management
- Data indexing infrastructure
Indexes for AI Agent retrieval
Local device data:
- File system indexing
- App data extraction
Cloud service data:
- Incremental search index construction
Technical architecture
```
User data sources (local device + cloud services)
    ↓ OAuth / API connectors, permission management
Incremental index
    ↓
AI Agent retrieval
```
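As a minimal sketch of the incremental index the agent retrieves from (an in‑memory dict stands in for a real search index with permission checks):

```python
import time

index = {}   # doc_id -> {"source", "text", "updated"}

def upsert(doc_id, source, text):
    index[doc_id] = {"source": source, "text": text, "updated": time.time()}

def search(query):
    return [d for d in index.values() if query.lower() in d["text"].lower()]

upsert("gmail:123", "gmail", "Flight to LA on Friday, confirmation ABC123")
upsert("file:/notes/trip.md", "filesystem", "LA trip: cancel the hotel if the flight moves")
print(search("la trip"))
```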
Economic model: revenue sharing with app providers
Ecosystem challenge
Current situation:
- Agents operate apps on behalf of users
- Traffic is captured by the agent, not the original app
- Apps lose ad revenue and engagement metrics
- Result: apps may block agent access
Example: GUI agents being blocked by some messaging and social apps due to traffic and monetization issues
Why blocking is a problem
- Degrades user experience
- Fragments Agent capabilities
- Arms race between Agents and apps
- Ultimately harms all parties
Proposed solution: revenue sharing
Principle: Agent companies and app providers must establish profit-sharing mechanisms
Possible models:
Transaction-based sharing
- Agent facilitates purchase → share revenue
Subscription partnerships
- Joint premium tiers
- Feature access agreements
API access fees
- Formalized Agent API access
- Usage-based pricing
Strategic imperatives
- Advancement of Agent technology is inevitable
- Collaboration benefits both sides
- Standards are needed for a sustainable ecosystem
Key points
Technical roadmap
Real-time voice interaction:
- SEAL architecture: streaming, event-driven Agent loop
- Interactive ReAct: listen while thinking, think while speaking
- Solve the latency bottleneck of serial processing
Multi-Agent coordination:
- Concurrent phone and computer use
- Autonomous Agent generation and orchestration
- Small models (4–8B) are sufficient for specialized tasks
Generative UI:
- Web front-end code generation (Claude Artifacts, Google Generative UI)
- Image generation (Nano Banana Pro)
- Hybrid architectures combining the two
System architecture
Extended modalities:
- Observation: voice + visual perception
- Action: dialogue + UI generation + computer use
OS-level integration:
- Background/cloud operations without occupying UI
- Unified data access across devices and services
- Revenue-sharing models for ecosystem sustainability
Vision
The “Her” paradigm: an OS-level assistant combining:
- Real-time voice conversation
- Autonomous task execution
- Generative interface presentation
- Seamless integration with the user’s digital life
Future: three stages of AI Agent–environment interaction
Real-time asynchronous interaction with the environment is the foundation of Agents
🗣️ Stage 1: Voice
- Input: voice
- Output: voice
- Data rate: 15–50 tokens/s
- Latency: <500 ms
- Challenge: balance between fast and slow thinking
- Solution: Interactive ReAct
💻 Stage 2: Computer use
- Input: vision (screenshots)
- Output: mouse/keyboard actions
- Data rate: ~2K tokens/frame
- Latency: <1 second
- Challenge: precise action execution
- Solution: VLA models + RL
🤖 Stage 3: Physical world
- Input: vision + voice + touch
- Output: voice + joint movements
- Data rate: ~20K tokens/s
- Latency: <100 ms
- Challenge: real-time control
- Solution: VLA + world models
Key insight: complexity increases (data rate ↑, latency ↓), but architectural solutions can transfer across stages
References
Papers
- Yan, H., Wang, J., et al. (2025). Step-GUI Technical Report. arXiv:2512.15431
- Belcak, P., Heinrich, G., et al. (2025). Small Language Models are the Future of Agentic AI. arXiv:2506.02153
- Google Research. (2025). Generative UI: LLMs are Effective UI Generators.
Technical resources
- NVIDIA Research. Train Small Orchestration Agents to Solve Big Problems. NVIDIA Developer Blog
- Anthropic. Introducing Claude Sonnet 4.5. anthropic.com
- Google Research. Generative UI: A Rich, Custom, Visual Interactive User Experience for Any Prompt. research.google