The Next Stop in Agent–Human Interaction: Real‑Time Voice and Generative UI
(This article is based on the invited talk I gave at the first Intelligent Agent Networks and Application Innovation Conference on December 20, 2025.)
View Slides (HTML), Download PDF Version
Abstract
Today’s agent–human interaction is centered on text, but that deviates from natural human cognition. From first principles, the modality humans are best at producing is speech (speaking is about three times faster than typing), and the modality humans are best at consuming is vision; and vision is best served not by walls of text, but by intuitive UI.
The first step is achieving real‑time voice interaction. The traditional serial VAD–ASR–LLM–TTS architecture suffers from having to wait for the user to finish speaking before it can start “thinking,” and it cannot output before the thinking is done. With an Interactive ReAct continuous‑thinking mechanism, the agent can listen, think, and speak at the same time: it starts thinking while the user is talking, and keeps deepening its reasoning while it’s speaking itself, making full use of all idle time gaps.
The second step is to expand the observation space and action space on top of real‑time voice. By extending the Observation Space (from voice input to Computer Use–style visual perception) and the Action Space (from voice output to UI generation and computer control), the agent can operate existing computer/phone GUIs while on a call, and generate dynamic UI to interact with the user. One implementation path for generative UI is generating front‑end code; Claude 4.5 Sonnet has already reached the threshold for this. Another path is generating images; Nano Banana Pro is also close to this threshold.
This is exactly the path toward realizing Samantha from the movie Her. As an operating system, Samantha needs six core capabilities: holding real‑time voice conversations with the user, making phone calls and handling tasks on the user’s behalf, operating traditional computers and phones for the user, bridging data across the user’s existing devices and online services, having her own generative UI, and maintaining powerful long‑term user memory for personalized, proactive service.
Part I: The Efficiency Bottleneck of Text‑Based Interaction
Cognitive Mismatch in Today’s Agent Interfaces
Human output modalities
| Modality | Speed | Cognitive load |
|---|---|---|
| Speech | 150 words/minute | Low |
| Typing | 40–50 words/minute | High |
- Speech is the most natural output modality
- Typing requires fine motor coordination and visual attention
- Spoken communication is the foundation of human interaction
Human input modalities
| Modality | Bandwidth | Understanding method |
|---|---|---|
| Vision | High (~10 Mbps) | Pattern recognition |
| Hearing | Medium (~16 kbps) | Sequential processing |
| Reading | Low (~100 bps) | Requires literacy |
- The visual cortex processes information in parallel
- Reading is an acquired skill, not innate
- UI/graphics leverage natural visual processing capabilities
Fundamental insight: Today’s text‑based agent interfaces force humans to use suboptimal modalities for both input and output.
The Optimal Interaction Paradigm
Optimal human output: speech
- Naturally produces language at conversational speed
- Minimal cognitive overhead
- Supports complex expression and subtle nuance
- Supports real‑time interaction and interruption
Use cases:
- Task delegation and clarification
- Real‑time feedback while the agent executes tasks
- Multi‑turn problem solving
Optimal human input: visual UI
- Rapid scanning and understanding of information
- Spatial organization aids memory and navigation
- Interactive elements support direct manipulation
- Progressive disclosure manages complexity
Use cases:
- Result presentation and comparison
- Interactive data exploration
- Workflow visualization and status monitoring
Target architecture: Human‑to‑agent communication via real‑time voice + agent‑to‑human communication via generative UI
Part II: Real‑Time Voice Interaction
Typical Voice Agent Architecture
A typical voice agent architecture is divided into three layers:
Perception layer: VAD + ASR
- Converts continuous signals into discrete events
Thinking layer: LLM
- Asynchronous processing
Execution layer: TTS
- Converts discrete commands into continuous actions
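To make the serial bottleneck concrete, here is a minimal sketch of this pipeline with stubbed‑out components (the functions and timings are placeholders, not any real API); each stage blocks on the previous one, so the latencies simply add up:

```python
import time

# Stub components for illustration only; real VAD/ASR/LLM/TTS systems would replace them.
def run_vad(audio):
    time.sleep(0.7)              # must wait ~500-800 ms of silence to confirm end of speech
    return audio

def run_asr(audio):
    time.sleep(0.3)              # batch transcription of the finished utterance
    return "I want to lower my bill to $79"

def run_llm(text):
    time.sleep(1.0)              # "thinking" starts only after the full transcript arrives
    return "Let me check which promotions fit your account."

def run_tts(text):
    time.sleep(0.3)              # speech starts only after thinking finishes
    return b"<audio bytes>"

def serial_voice_turn(audio):
    """Traditional VAD -> ASR -> LLM -> TTS: every stage blocks on the previous one."""
    start = time.time()
    reply = run_tts(run_llm(run_asr(run_vad(audio))))
    print(f"first audio after {time.time() - start:.1f}s")   # latencies simply add up
    return reply

serial_voice_turn(b"<user audio>")
```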
Problems with the Traditional VAD + ASR Architecture
Issues with VAD (Voice Activity Detection)
- Unavoidable latency: Must wait for 500–800 ms of continuous silence to confirm that the user has finished speaking
- Poor interruption detection: Can’t distinguish background noise/music; “uh‑huh” easily triggers false interruptions
- Low accuracy in speech detection: Fails in complex acoustic environments; mid‑sentence pauses cause truncation
Issues with ASR (Automatic Speech Recognition)
- Low accuracy from lack of context: VAD slices audio into isolated segments; can’t use context for disambiguation; high error rate on emails, names, phone numbers
- Lack of world knowledge: Can’t leverage common sense; low accuracy on addresses, brands, technical terms
- Pure text output loses acoustic details:
- Loses emotion: happiness, frustration, excitement
- Loses paralinguistic information: laughter, sighs, breathing
- Loses environmental information: noisy, musical, quiet
Streaming Speech Perception Models: A Replacement for VAD + ASR
Multimodal architecture
- Audio encoder (from Whisper): Converts audio into audio tokens
- Qwen LLM (autoregressive): Processes audio tokens and outputs text + events
Key advantages:
- Streaming: Real‑time output (non‑batch)
- Context: Preserves full dialogue history
- In‑context learning: More accurate recognition of personal info and domain terms
- World knowledge: Higher accuracy on addresses, brands, and amounts
Rich output: text + acoustic events
In addition to text tokens, it outputs special tokens (acoustic events):
- `<speak_start>` / `<speak_end>`: Speech boundaries
- `<interrupt>`: Interruption intent
- `<emotion:happy>`: Emotion tags
- `<laugh>` / `<sigh>`: Paralinguistic information
- `<music>`: Environmental sounds
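As a rough illustration of how a downstream consumer might handle such a stream (the event names follow the list above; the framing of the stream itself is an assumption):

```python
import re

# Illustrative consumer of a streamed mix of text tokens and acoustic-event tokens.
EVENT = re.compile(r"<[^>]+>")

def handle_stream(tokens):
    for tok in tokens:
        if EVENT.fullmatch(tok):
            if tok == "<interrupt>":
                print("event: interruption intent -> stop TTS, yield the turn")
            elif tok.startswith("<emotion:"):
                print(f"event: {tok} -> adapt the tone of the reply")
            else:
                print(f"event: {tok}")
        else:
            print(f"text: {tok!r}")   # ordinary transcript text, produced with full dialogue context

handle_stream(["<speak_start>", "Great,", " thanks", " for", " the", " help",
               "<emotion:happy>", "<laugh>", "<speak_end>"])
```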
Interactive ReAct: Flexibly Interleaving Observation, Thought, and Action
Traditional ReAct: rigid OTA loop
```
O₁: "I want to lower my Xfinity bill to $79 per month"
```
- Fixed loop: Must complete an entire observe–think–act sequence
- Lost thinking: Can’t think while listening; high latency
- Rigid: Must wait for complete input before thinking
Interactive ReAct: flexible interleaving OTA
```
O₁: "I want to lower my Xfinity bill to $79 per month"
```
- Think while listening: New observations can be inserted at any time, and ongoing thoughts are preserved
- Think while speaking: Respond quickly, then continue thinking
- Intelligent turning‑point decisions: Decides when to speak and when to remain silent
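As a rough illustration of the control flow (the names and structure are mine, not the SEAL implementation), the loop below folds new observations into the context whenever they arrive and spends idle time on incremental thinking:

```python
import queue

def interactive_react_loop(events, max_idle_steps=20):
    """Interleaved O/T/A loop: observations can arrive at any time without clearing thoughts."""
    context = []                                   # interleaved observe/think/act history
    idle_steps = 0
    while idle_steps < max_idle_steps:
        try:
            obs = events.get(timeout=0.05)         # a new observation can arrive at any moment
            context.append(("O", obs))             # inserted without discarding ongoing thoughts
            idle_steps = 0
        except queue.Empty:
            context.append(("T", "one small step of thinking"))   # use the idle time
            idle_steps += 1
            # turn-taking decision (simplified here: speak a chunk after each thought step)
            context.append(("A", "spoken reply chunk"))
    return context

q = queue.Queue()
q.put("user: I want to lower my Xfinity bill to $79 per month")
for step in interactive_react_loop(q)[:6]:
    print(step)
```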
SEAL Thinking Layer: An Interruptible Interactive ReAct Loop
Key insight: LLM thinking is far faster than voice I/O; fully exploit “idle time”
LLM processing speed:
- Input processing: 500+ tokens/second
- Thinking/output: 100+ tokens/second
Voice I/O speed:
- Voice input/output: only ~5 tokens/second
- Speed difference: 20–100×
During the “idle time” between observation (voice input) and action (voice output), we have plenty of time for deep thinking. A rigid observe–think–act loop cannot use this idle time.
Fast thinking → slow thinking → continuous thinking
- Fast response (0.5s): 50 tokens of quick thinking → immediate preliminary response (within 5 seconds)
- Deep analysis (after 5s): 500 tokens of slow thinking → generate a more complete answer
- Continuous thinking (as needed): If 500 tokens still aren’t enough, keep thinking for another 5 seconds → continue generating answers until both thinking and speaking are done. If multiple rounds of thought are needed, the result is continuous output of current‑round thought summaries, like someone “thinking out loud.”
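A rough sketch of this schedule, with the 50/500‑token budgets taken from the description above and everything else (the generate() stub, the number of rounds) standing in as assumptions:

```python
def generate(prompt, max_tokens):
    return f" <{max_tokens}-token thought>"        # stand-in for a budgeted LLM call

def answer_with_budgets(question, speak, max_rounds=3):
    thought = generate(question, max_tokens=50)                   # fast thinking (~0.5 s)
    speak("preliminary answer" + generate(question + thought, max_tokens=50))
    for _ in range(max_rounds):                                   # continuous thinking, ~5 s per round
        thought += generate(question + thought, max_tokens=500)   # slow, deeper analysis
        speak("thinking-out-loud summary" + generate(question + thought, max_tokens=100))

answer_with_budgets("Can you lower my bill to $79?", speak=print)
```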
Think While Listening
Handling interruptions in conversation gracefully.
Traditional ReAct: Once the user interrupts, all previous thinking is discarded and you must start over.
Interactive ReAct: Preserves the interrupted thought process, attaches the new user input, and lets the model continue thinking from the point of interruption.
```
<user>I want to switch my plan from the current $109 one to your new plan...</user>
```
Advantage: A coherent thought process that can quickly adjust strategy based on the latest information.
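One way to picture this, as a rough sketch (the prompt layout and names are my own illustration, not an exact format): keep the partial thought, append the interrupting utterance, and ask the model to continue from where it left off.

```python
def resume_after_interruption(history, partial_thought, new_user_utterance):
    """Rebuild the prompt so the model resumes rather than restarts after an interruption."""
    return (history
            + [("thought", partial_thought + " ...<interrupted>")]   # keep what was reasoned so far
            + [("user", new_user_utterance)]                         # the interrupting input
            + [("instruction", "Continue from the interrupted thought and update "
                               "the plan with the user's latest request.")])

prompt = resume_after_interruption(
    history=[("user", "I want to switch my plan from the current $109 one...")],
    partial_thought="The $79 promotional plan fits; still need the contract end date",
    new_user_utterance="Actually, can you also keep my unlimited data?")
print(prompt[-2:])
```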
Speak While Thinking
Uses “filler speech” to buy time for deeper thought, reducing first‑token latency.
Scenario: The user asks a complex question and the agent needs time to think.
Traditional ReAct:
```
<user>Do you confirm ordering this plan?</user>
```
Interactive ReAct:
```
<user>Do you confirm ordering this plan?</user>
```
Advantage: Greatly improves interaction fluency and avoids awkward long waits.
SEAL Architecture Summary
A unified event‑driven loop that decouples perception, thinking, and execution, achieving truly real‑time and parallel processing.
Perception layer
- Input: Continuous signals (voice, GUI)
- Output: Discrete event streams
- Solves: Latency, unnatural interruptions, and acoustic information loss in traditional speech perception
Thinking Layer
- Input: Discrete event stream
- Output: Interleaved thoughts/action commands
- Solves: The serial bottleneck of traditional ReAct, enabling interruptible, asynchronous listening-while-thinking and thinking-while-speaking
Execution Layer
- Input: Discrete action commands
- Output: Continuous signals + feedback events
- Solves: The “last mile” problem of agents being clumsy and lacking feedback, forming a closed-loop action cycle
Future Outlook: End-to-End Models
Current SEAL Architecture:
- Perception Layer LLM: audio → text + acoustic events
- Thinking Layer LLM: text + acoustic events → thoughts + actions
- Execution Layer LLM: actions → audio
Future End-to-End Architecture:
- Audio encoder: audio → audio tokens
- Unified LLM: perception + thinking + execution
- Audio decoder: audio tokens → audio
Part III: Making Phone Calls and Using a Computer at the Same Time
Extended Observation and Action Space
Extended Observation Space
| Traditional | Extended |
|---|---|
| Voice input | Voice input |
| | + Screen visual perception |
| | + Application state monitoring |
| | + System notifications |
Computer-Use Integration:
- Real-time screen understanding
- UI element recognition and tracking
- Cross-application context awareness
Extended Action Space
| Traditional | Extended |
|---|---|
| Voice output | Voice output |
| | + Mouse/keyboard operations |
| | + UI generation |
| | + Application control |
Multimodal Output:
- Voice for human communication
- GUI operations for task execution
- Generated UI for result presentation
Target Capability: An agent that can carry on a phone conversation and operate a computer interface at the same time, similar to a human assistant talking on the phone while using a computer.
Multi-Agent Architecture for Concurrent Calling and Computer Use
Architecture Design
Phone Agent:
- Handles real-time voice conversations
- Low-latency ASR → LLM → TTS pipeline
- Extracts key information from the conversation
- Communicates with the Computer Agent via message passing
Computer Agent:
- Responsible for GUI operations (browser, apps)
- Visual understanding and action planning
- Receives information from the Phone Agent
- Reports task status and requests additional information
Communication Protocol
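A minimal sketch of what one such message might look like (the field names are illustrative assumptions, not an actual schema):

```python
import json

# Illustrative message for Phone Agent <-> Computer Agent communication.
message = {
    "from": "phone_agent",
    "to": "computer_agent",
    "type": "info_update",          # e.g. info_update, info_request, status_report
    "payload": {"customer_name": "Alice", "target_price_usd": 79},
    "turn_id": 12,
}
print(json.dumps(message, indent=2))
```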
Autonomous Orchestration Method
The agent autonomously decides when to spawn a collaborative agent:
- Task analysis: The Computer Agent encounters a complex form that requires user information
- Capability assessment: Determines that voice interaction is more efficient than text
- Agent generation: Calls initiate_phone_call_agent(purpose, required_info)
- Parallel execution: The two agents run independently and communicate asynchronously
Real-Time Collaboration Pattern
```
Phone Agent: "May I ask your name?"
```
Key Requirement: Agents must run in truly parallel threads without blocking each other.
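A rough sketch of this orchestration, with initiate_phone_call_agent spawning the Phone Agent as a non‑blocking worker thread (the body is a stand‑in for the real voice pipeline, not the production code):

```python
import threading, queue

inbox = queue.Queue()               # asynchronous channel back to the Computer Agent

def initiate_phone_call_agent(purpose, required_info):
    def run():
        for field in required_info:                       # e.g. ask the user over the phone
            inbox.put({"field": field, "value": f"<spoken answer for {field}>"})
    worker = threading.Thread(target=run, daemon=True)    # truly parallel, non-blocking
    worker.start()
    return worker

worker = initiate_phone_call_agent(
    purpose="collect the details this form needs",
    required_info=["name", "account_number"])
worker.join()
while not inbox.empty():
    print(inbox.get())              # the Computer Agent consumes answers as they arrive
```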
Solving Computer-Use Latency: Small Specialized Models
Step-GUI: An Efficient GUI Agent Using Small Models (arxiv:2512.15431)
Performance vs. SOTA Models
| Benchmark | OpenAI CUA | Claude-4.5 | Gemini-2.5 | Step-GUI 8B |
|---|---|---|---|---|
| OSWorld-Verified | 23.0 | 61.4 | - | 48.5 |
| AndroidWorld | - | - | 69.7 | 80.2 |
| ScreenSpot-Pro | 23.4 | - | - | 62.6 |
| OSWorld-G | 36.4 | - | - | 70.0 |
Why Can Small Models Outperform Frontier Models?
Self-Evolution Training Pipeline:
- Calibrated Step Reward System (CSRS): Converts model trajectories into training signals, >90% accuracy, cost only 1/10–1/100 of human labeling
- Domain-targeted data: 11.2M mid-train + 1.67M cold-start samples
Core Insight: Frontier models lack domain knowledge (UI conventions of Chinese apps, behavior of local apps). Small models + targeted training can fill these gaps.
AndroidDaily Benchmark (a Chinese mobile-app benchmark proposed by Step-GUI)
Real-world mobile app tasks across 5 scenarios:
- 🚄 Travel: Buy train tickets on 12306
- 🎵 Entertainment: Play a playlist on NetEase Cloud Music
- 🛒 Shopping: Check the shopping cart on Taobao
- 💬 Social Media: Change privacy settings on Zhihu
- 🍜 Local Services: Set a Dianping review to be visible only to yourself
AndroidDaily (Static) Results
| Model | Average Accuracy |
|---|---|
| Claude-4.5-sonnet | 10.90 |
| Gemini-2.5-Pro Thinking | 43.74 |
| Step-GUI-8B | 89.91 |
Conclusion: Small model + domain data > general-purpose large model. With targeted domain training, Step-GUI-8B achieves 2× Gemini and 8× Claude performance on Chinese mobile apps.
NVIDIA ToolOrchestra: Small Models for Multi-Agent Coordination
Based on NVIDIA research (developer.nvidia.com)
Core Concept
ToolOrchestra trains small orchestration models to supervise and manage larger models and tools according to user preferences for:
- Speed
- Cost
- Accuracy
Key Insight: Small models are not burdened by excessive knowledge and can be trained to capture the essential decision patterns of orchestration.
Training Method
- Synthetic data generation: Automatic trajectory generation and verification
- Multi-objective RL: Optimize for accuracy, cost, and solution time
- Minimal data requirements: Orchestrator-8B uses only 552 synthetic samples
Architectural Advantages
- Small (8B) model precisely guides larger models
- Automatically balances capability and cost
- Supports heterogeneous multi-agent systems
- Suitable for real-world deployment scenarios
Case Studies of Small Language Models in Agentic AI
Based on “Small Language Models are the Future of Agentic AI” (arxiv:2506.02153)
Core Argument
Position: Small language models (SLMs) are:
- Powerful enough for specialized tasks
- Naturally better suited to agentic applications
- Inevitably more economical for high-frequency calls
Arguments for SLMs in Agent Systems
Capability Argument:
- Modern SLMs perform strongly on focused tasks
- Agentic systems involve repetitive, specialized operations
- General-purpose conversational ability is often unnecessary
Economic Argument:
- Agent systems make a large number of model calls
- Cost grows linearly with model size and number of calls
- SLMs reduce operating costs by 10–100×
Algorithm for Transitioning from LLM to SLM
- Identify specialized subtasks in the agent workflow
- Curate task-specific training data from LLM outputs
- Fine-tune SLMs on specialized data
- Validate performance on target metrics
- Deploy SLMs for production workloads
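A sketch of this recipe as an end‑to‑end outline, with every step stubbed out so it runs (the function names are placeholders, not the paper’s code):

```python
def identify_specialized_subtasks(logs):   return {"form_filling": logs}             # step 1
def curate_training_data(examples):        return [{"x": e, "y": "LLM label"} for e in examples]  # step 2
def finetune_slm(base, data):              return f"{base} fine-tuned on {len(data)} samples"     # step 3
def validate(model):                       return True                               # step 4: target metrics
def deploy(model, task):                   print(f"deploying '{model}' for {task}")  # step 5

logs = ["call transcript 1", "call transcript 2"]
for task, examples in identify_specialized_subtasks(logs).items():
    data = curate_training_data(examples)
    slm = finetune_slm("slm-8b", data)
    if validate(slm):
        deploy(slm, task)
```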
Part IV: Generative UI – Web Frontend Code Generation + Image Generation
Path One: Web Frontend Code Generation
Anthropic’s Approach: Claude Artifacts and “Imagine with Claude”
Claude Artifacts
Claude can generate complete frontend code and render it in a sandboxed preview environment:
Supported Output Types:
- React apps with hooks and components
- Interactive data visualizations (D3.js, Chart.js)
- SVG graphics and charts
- Native HTML/CSS/JavaScript
- Markdown documents
Workflow:
- The user prompt describes the desired interface
- The model generates complete frontend code
- The code is rendered in a sandboxed preview
- The user iterates via conversation
“Imagine with Claude” (Research Preview)
A temporary research preview released with Claude Sonnet 4.5:
Key Characteristics:
- Claude generates software instantly
- No prebuilt features
- No prewritten code
- Claude creates everything in real time, responding to and adapting to user requests
Technical Demo:
- Shows what’s possible when a powerful model is combined with the right infrastructure
- Dynamic software creation without predefined templates
- Real-time adaptation to user interactions
Watch the demo: youtu.be/dGiqrsv530Y
Google Research’s Approach: Generative UI
“Generative UI: LLMs are Effective UI Generators” (November 2025)
Project page: generativeui.github.io | Research blog: research.google/blog
Abstract
AI models are good at creating content, but their outputs are usually rendered in static, predefined interfaces; in particular, LLM outputs are usually markdown “walls of text”.
Generative UI is a long-term commitment where models not only generate content, but also generate the interface itself.
We show that when correctly prompted and equipped with the right toolset, modern LLMs can robustly generate high‑quality custom UIs for almost any prompt.
Implementation: three main components
- Server: exposes endpoints for key tools (image generation, search)
- System instructions: carefully designed prompts including goals, planning guidelines, examples
- Post‑processor: fixes common issues that cannot be solved through prompting
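A minimal sketch of how these three components might fit together (the prompt text, tool names, and post‑processing rules here are illustrative, not the actual system; see generativeui.github.io for the real one):

```python
SYSTEM_INSTRUCTIONS = """You are a UI generator. Plan the page first, then emit one
self-contained HTML document. You may call the tools image_gen(prompt) and
search(query) exposed by the server and reference their results."""

def post_process(html: str) -> str:
    # fix common issues that prompting alone does not reliably solve
    if "<!doctype" not in html.lower():
        html = "<!DOCTYPE html>\n" + html
    return html

def generate_ui(llm_call, user_prompt: str) -> str:
    raw = llm_call(system=SYSTEM_INSTRUCTIONS, user=user_prompt)
    return post_process(raw)

# usage with any chat-completion wrapper standing in for the model:
print(generate_ui(lambda system, user: "<html><body><h1>Fractals</h1></body></html>",
                  "What is a fractal?"))
```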
Evaluation results
When generation speed is ignored, the generated UIs are overwhelmingly preferred by human raters, beating standard LLM markdown outputs.
PAGEN benchmark user preference (ELO scores):
- Human expert‑designed websites (highest)
- Generative UI: 1710.7 (on par with experts 44% of the time)
- Top Google Search results (significant gap)
- Standard markdown LLM outputs
- Plain text outputs
Emergent capability
This robust Generative UI capability is emergent: it represents a substantial improvement over previous models.
Example categories
- Education: What is a fractal? Probability of rolling 8 with two dice, Ising model, history of timing devices
- Children’s education: explain speculative decoding to a child, kids’ chemistry experiments, explain slope and tangent using puppies
- Practical tasks: hosting Thanksgiving, choosing a carpet, how to make a baby mobile
- Simple queries: what time is it (custom clock interface), green things (visual gallery), dragon fruit (interactive exploration)
- Games: learn fast typing, clicker game, fashion advisor, memory game, four‑player elemental tic‑tac‑toe, Japanese visual novel, text adventure game
Explore interactive examples: generativeui.github.io
Path Two: Image generation + hybrid architecture
Web front‑end code generation + image generation = optimal combination
Why a hybrid architecture?
Limitations of pure image generation:
- Image generation models struggle with thousands of words of text
- Long‑form content (articles, documents, detailed UIs) needs web rendering
- Pure image outputs lack interactivity and accessibility
Optimal division of labor
| Component | Best method |
|---|---|
| Long‑form text | HTML/CSS rendering |
| Interactive elements | JavaScript/React |
| Visual assets | Image generation |
| Charts, illustrations | Image generation |
| Infographics with text | Hybrid |
The role of image generation in Generative UI
Nano Banana Pro (Gemini 3 Pro Image) supports:
- Clear text in images for short taglines, headings, posters
- Multilingual support, enhancing multilingual reasoning
- Multiple textures, fonts, and calligraphy styles
Best use cases in Generative UI:
- Hero images and banners with stylized text
- Product mockups and visual previews
- Charts and conceptual illustrations
- Branded assets with consistent style
Architecture:
```
LLM → HTML/CSS/JS (structure + long text)
Image model → visual assets (hero images, charts, illustrations)
Rendered page = generated front-end code + embedded generated images
```
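A rough sketch of the composition step (generate_page and generate_image are placeholders, not real APIs): the LLM emits the page skeleton with an image slot, and the image model fills that slot.

```python
def generate_page(prompt):
    return ("<html><body><h1>Dragon fruit</h1><img src='{HERO}'/>"
            "<p>Long-form text rendered as HTML...</p></body></html>")

def generate_image(prompt):
    return "hero_dragonfruit.png"                      # e.g. produced by an image model

def render_hybrid(prompt):
    page = generate_page(prompt)                       # structure + long text
    hero = generate_image(f"hero image for: {prompt}") # stylized visual asset
    return page.replace("{HERO}", hero)

print(render_hybrid("dragon fruit"))
```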
Part V: Her – OS‑level assistant
The film “Her” (2013): Samantha’s vision
Film summary
“Her” is a 2013 science‑fiction film directed by Spike Jonze that explores the relationship between a person and an AI operating system.
Plot: Theodore, a lonely writer, develops a relationship with Samantha—an AI with a voice, personality, and the ability to learn and evolve.
Why “Her” matters
Samantha represents the ultimate vision of an AI assistant:
- Not just responding to commands
- Truly understanding the user’s life, emotions, and needs
- Anticipating what the user needs before being asked
Key traits of Samantha
- 🗣️ Voice‑first: natural, conversational interaction
- 🌐 Always available: runs in the background, accessible anytime
- 🧠 Deep user memory: builds a mental model—preferences, habits, values
- 🎯 Proactive service: anticipates needs without explicit requests
- ⚡ Autonomous actions: organizes email, schedules, makes calls
- 🔄 Real‑time processing: listening, thinking, responding, and acting simultaneously
Core insight: just as we understand friends—not by remembering every conversation, but by building a mental model of who they are.
Core capabilities of an OS‑level assistant
- 🗣️ Real‑time voice UI: low‑latency natural dialogue with cross‑session context
- 📞 Calling on your behalf: navigating customer service, gathering information, negotiating
- 💻 Computer and phone usage: GUI automation across apps, orchestration of cross‑app workflows
- 🔗 Unified data access: integration with devices and cloud services, intelligent indexing and retrieval
- ✨ Generative UI: dynamic result presentation, interactive exploration and manipulation
- 🎯 Proactive service: anticipating needs before being asked, personalized value alignment
Personalized value alignment
Analogy: recommender systems
- 📰 Traditional media: everyone reads the same newspaper and sees the same content
- 📱 ByteDance/TikTok revolution: everyone sees completely different content
- “Everyone lives in a different world with different values”
- ✅ Result: personalized products are more human‑centric → users prefer them
Future of AI Agent alignment
Current approach: universal human values
- LLMs are aligned to “universal” values
- But do we really have universally agreed‑upon human values?
What AI should do:
- Not just one universal value set
- Adapt to each user’s values and preferences
- Recognize that value differences are huge
Proactive service: the highest level of AI memory
User memory is the core ingredient of proactive service
User memory is not just logging every conversation. Like understanding friends:
- We don’t remember every sentence they say
- We build a mental model of who they are
- Their preferences, habits, and values
Two types of memory:
| Type | Difficulty | Example |
|---|---|---|
| Facts | Simple | Birthday, address, card number |
| Preferences | Complex | Context‑dependent, constantly evolving |
Learning user preferences is much harder than storing factual information:
- Context‑dependent: paper‑writing style ≠ travel guide style
- One‑off vs. long‑term: “ordered Sichuan food yesterday” ≠ “likes spicy food”
- Risk of over‑generalization: AI can easily extrapolate incorrectly
- Requires fine‑grained evaluation
Three levels of memory capability
Level 1: Basic recall
- Store and retrieve explicit user information
- “My membership number is 12345” → reliably recall it
- Foundation of trustworthiness
Level 2: Cross‑session retrieval
- Connect information across different conversations
- Disambiguation: “book a service appointment for my car” → which of the two cars?
- Understand composite events: “cancel the LA trip” → find flights + hotel
- Distinguish active contracts from past inquiries
Level 3: Proactive service
- Anticipate needs without explicit requests
- Booking an international flight? → check whether the passport is expiring soon
- Phone broke? → list all protection options (warranty, credit card, carrier insurance)
- Tax season? → proactively gather all relevant documents
Background GUI Agent: technical challenges
Core problem
Traditional GUI agents require:
- Foreground application windows
- Active screen rendering
- Exclusive access to input devices
This conflicts with:
- The user simultaneously using the device
- Battery and resource efficiency
Virtualization approach
Headless browser/app execution:
- Render apps in a virtual frame buffer
- The agent interacts with a virtual display
Cloud phone/desktop:
- Run apps in a cloud environment
- Stream results to local devices when needed
- Offload computation from mobile devices
Requirements for fooling apps
Apps must believe they:
- Are running in the foreground
- Receive normal user input
- Render to a real display
Technical mechanisms:
- Window manager virtualization
- Input event injection
- Leveraging accessibility APIs
- Container/sandbox isolation
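For web apps specifically, a headless browser already provides this kind of virtual rendering surface and input channel; the sketch below uses Playwright as one possible illustration (the URL is a placeholder, and native mobile apps would still need the virtual‑display and input‑injection mechanisms listed above):

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)   # renders into a virtual frame buffer
    page = browser.new_page()
    page.goto("https://example.com/account")     # the app behaves as if it were in the foreground
    page.screenshot(path="state.png")            # the agent "sees" the state via screenshots
    browser.close()
```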
Data synchronization
For cloud‑side operations:
- Application state migration
- Account credential management
- Caching and data synchronization
Benefit: operations continue when the device is offline, especially valuable for desktops.
Cross‑device and cross‑service data integration
Unified data access
Example: Gemini integrated with Google Workspace
- Gmail, Drive, Calendar, Docs
- Cross‑service querying and retrieval
- Context‑aware suggestions
Requirements:
- OAuth and API integrations
- Permission management
- Data indexing infrastructure
Indexes for AI Agent retrieval
Local device data:
- File system indexing
- App data extraction
Cloud service data:
- Incremental search index construction
Technical architecture
```
User data sources (local device + cloud services)
    ↓ OAuth / API connectors, permission management
Incremental index
    ↓
AI Agent retrieval
```
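As a minimal sketch of the incremental index the agent retrieves from (an in‑memory dict stands in for a real search index with permission checks):

```python
import time

index = {}   # doc_id -> {"source", "text", "updated"}

def upsert(doc_id, source, text):
    index[doc_id] = {"source": source, "text": text, "updated": time.time()}

def search(query):
    return [d for d in index.values() if query.lower() in d["text"].lower()]

upsert("gmail:123", "gmail", "Flight to LA on Friday, confirmation ABC123")
upsert("file:/notes/trip.md", "filesystem", "LA trip: cancel the hotel if the flight moves")
print(search("la trip"))
```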
Economic model: revenue sharing with app providers
Ecosystem challenge
Current situation:
- Agents operate apps on behalf of users
- Traffic is captured by the agent, not the original app
- Apps lose ad revenue and engagement metrics
- Result: apps may block agent access
Example: GUI agents being blocked by some messaging and social apps due to traffic and monetization issues
Why blocking is a problem
- Degrades user experience
- Fragments Agent capabilities
- Arms race between Agents and apps
- Ultimately harms all parties
Proposed solution: revenue sharing
Principle: Agent companies and app providers must establish profit-sharing mechanisms
Possible models:
Transaction-based sharing
- Agent facilitates purchase → share revenue
Subscription partnerships
- Joint premium tiers
- Feature access agreements
API access fees
- Formalized Agent API access
- Usage-based pricing
Strategic imperatives
- Advancement of Agent technology is inevitable
- Collaboration benefits both sides
- Standards are needed for a sustainable ecosystem
Key points
Technical roadmap
Real-time voice interaction:
- SEAL architecture: streaming, event-driven Agent loop
- Interactive ReAct: listen while thinking, think while speaking
- Solve the latency bottleneck of serial processing
Multi-Agent coordination:
- Concurrent phone and computer use
- Autonomous Agent generation and orchestration
- Small models (4–8B) are sufficient for specialized tasks
Generative UI:
- Web front-end code generation (Claude Artifacts, Google Generative UI)
- Image generation (Nano Banana Pro)
- Hybrid architectures combining the two
System architecture
Extended modalities:
- Observation: voice + visual perception
- Action: dialogue + UI generation + computer use
OS-level integration:
- Background/cloud operations without occupying UI
- Unified data access across devices and services
- Revenue-sharing models for ecosystem sustainability
Vision
The “Her” paradigm: an OS-level assistant combining:
- Real-time voice conversation
- Autonomous task execution
- Generative interface presentation
- Seamless integration with the user’s digital life
Future: three stages of AI Agent–environment interaction
Real-time asynchronous interaction with the environment is the foundation of Agents
🗣️ Stage 1: Voice
- Input: voice
- Output: voice
- Data rate: 15–50 tokens/s
- Latency: <500 ms
- Challenge: balance between fast and slow thinking
- Solution: Interactive ReAct
💻 Stage 2: Computer use
- Input: vision (screenshots)
- Output: mouse/keyboard actions
- Data rate: ~2K tokens/frame
- Latency: <1 second
- Challenge: precise action execution
- Solution: VLA models + RL
🤖 Stage 3: Physical world
- Input: vision + voice + touch
- Output: voice + joint movements
- Data rate: ~20K tokens/s
- Latency: <100 ms
- Challenge: real-time control
- Solution: VLA + world models
Key insight: complexity increases (data rate ↑, latency ↓), but architectural solutions can transfer across stages
References
Papers
- Yan, H., Wang, J., et al. (2025). Step-GUI Technical Report. arXiv:2512.15431
- Belcak, P., Heinrich, G., et al. (2025). Small Language Models are the Future of Agentic AI. arXiv:2506.02153
- Google Research. (2025). Generative UI: LLMs are Effective UI Generators.
Technical resources
- NVIDIA Research. Train Small Orchestration Agents to Solve Big Problems. NVIDIA Developer Blog
- Anthropic. Introducing Claude Sonnet 4.5. anthropic.com
- Google Research. Generative UI: A Rich, Custom, Visual Interactive User Experience for Any Prompt. research.google