Self-Evolving Real-Time Agents: Think While Listening, Speak While Thinking, Learn While Acting
[This post is based on my invited talk at FAISys’25 (The 1st Frontier AI Systems Workshop).]
Hello everyone, I’m honored to speak at FAISys’25 (The 1st Frontier AI Systems Workshop). Today I’m presenting “Self-Evolving Real-Time Agents: Think While Listening, Speak While Thinking, Learn While Acting”.
I’m Co-Founder and Chief Scientist at Pine AI, where we help users handle everyday tasks through AI-powered phone calls and computer automation: bill negotiation, subscription cancellation, complaint filing, compensation claims, and more. We’ve saved consumers over $3 million at a 93% success rate, saving each user an average of 270 minutes.
Learning from experience is the fundamental challenge in machine learning. In practical deployments, today’s autonomous AI agents face two core challenges: interacting with their environment in real time and learning from experience. Today I’ll introduce our technical breakthroughs in both areas.
Two Fundamental Challenges
Challenge I: High Latency in Real-Time Interaction
Real-time voice agents must respond within 1 second like humans, but traditional architectures using reasoning LLMs introduce 2-10 second delays.
VAD (Voice Activity Detection) Challenges:
- Must wait 500-800ms of continuous silence to confirm user finished
- “Uh-huh” mistakenly triggers interruption
- Lost acoustic information (emotions, environment)
ASR (Automatic Speech Recognition) Challenges:
- No context leads to high error rates (emails, names, phone numbers)
- Lack of world knowledge causes transcription errors
LLM Challenges:
- Forced to wait, cannot think while listening
- Cannot speak while thinking (5-10 second silence)
- Poor turn detection (when to speak/stay silent)
Challenge II: Learning from Experience
Models are “intelligent” but not “proficient” — like top graduates lacking real-world experience.
Fixed Models Cannot Learn:
- Cannot learn from successful traces
- Cannot learn from unsuccessful traces
- Parameters frozen after deployment
Big World Hypothesis:
The world is too large to pre-encode all knowledge:
- Business processes are dynamic and non-public
- Verification info varies by company
- Service rules constantly change
- Pre-trained knowledge insufficient for deployment
Part I: Real-Time Agent-Environment Interaction
Agents need real-time interaction with three types of targets:
- Humans: Dialogue and collaboration through real-time voice
- Digital World: Operating computers, browsing web, using mobile devices
- Physical World: Controlling robots, interacting with real environments
Typical Voice Agent Architecture
Traditional architecture has three layers:
Perception Layer: VAD + ASR
- Input: Continuous signals (audio streams)
- Output: Discrete events (`speech_start`, `interrupt`, `laugh`, `speech_fragment`, etc.)
- Function: Transform continuous signals into discrete events
Thinking Layer: LLM
- Input: Discrete event stream (observations and tool results)
- Output: Interleaved thoughts, tool calls, and output sentences
- Function: Asynchronous processing
Execution Layer: TTS
- Input: Discrete events (text)
- Output: Continuous actions (audio stream)
- Function: Transform discrete events into continuous actions
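To make the interfaces between the layers concrete, here is a minimal sketch (my own illustration, not Pine AI’s production code) of the discrete events the perception layer emits and how an asynchronous thinking layer might consume them; the event names follow the list above.

```python
from dataclasses import dataclass, field
from enum import Enum
import time

class EventType(Enum):
    SPEECH_START = "speech_start"
    SPEECH_FRAGMENT = "speech_fragment"
    INTERRUPT = "interrupt"
    LAUGH = "laugh"

@dataclass
class PerceptionEvent:
    type: EventType
    text: str = ""                                  # partial transcript, if any
    timestamp: float = field(default_factory=time.time)

def thinking_layer(event_stream):
    """Consume discrete perception events asynchronously and
    yield text sentences for the TTS execution layer."""
    for event in event_stream:
        if event.type is EventType.SPEECH_FRAGMENT:
            # The LLM can already start reasoning over partial input here.
            yield f"(thinking about: {event.text!r})"
        elif event.type is EventType.INTERRUPT:
            yield "Sorry, go ahead."
```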
Problems with Traditional VAD + ASR Architecture
Three VAD Problems:
- Unavoidable Latency: Must wait 500-800ms of continuous silence to confirm user finished
- Poor Interrupt Detection: Cannot distinguish background noise/music; “Uh-huh” mistakenly triggers interruption
- Low Voice Detection Accuracy: Errors in complex acoustic environments; mid-sentence pauses cause truncation; a “Hello” buried in background noise goes undetected, so the agent fails to respond
Three ASR Problems:
- Low Accuracy Without Context: VAD cuts audio into isolated segments; cannot use context for disambiguation; high errors for emails, names, phone numbers
- Lack of World Knowledge: Cannot leverage common sense; low accuracy for addresses, brands, technical terms, amounts
- Text-Only Output Lacks Acoustic Details:
- Lost emotions: happy, frustrated, excited
- Lost paralinguistic info: laugh, sigh, breath
- Lost environment: noisy, music, quiet
Solution: Streaming Voice Perception Model
We propose a Streaming Voice Perception Model to replace VAD + ASR:
Multimodal Architecture:
- Audio Encoder (from Whisper): Converts audio to audio tokens
- Qwen LLM (autoregressive): Processes audio tokens, outputs text + events
Key Advantages:
- Streaming: Real-time output (not batch)
- Context: Full dialogue history preserved
- In-Context Learning: Better recognition for personal info, domain terms
- World Knowledge: Higher accuracy for addresses, brands, amounts
Rich Output: Text + Acoustic Events
In addition to text tokens, outputs special tokens (acoustic events):
- `<speak_start>` / `<speak_end>`: Speech boundaries
- `<interrupt>`: Interruption intent
- `<emotion:happy>`: Emotion markers
- `<laugh>` / `<sigh>`: Paralinguistic info
- `<music>`: Environmental sounds
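To illustrate how a downstream agent might consume this interleaved output, here is a small sketch; the token names follow the list above, while the parsing logic itself is an assumption for illustration.

```python
import re

# Special tokens emitted by the streaming perception model (see list above).
EVENT_PATTERN = re.compile(r"<[^<>]+>")

def split_stream(tokens):
    """Separate streamed output into transcript text and acoustic events."""
    text_parts, events = [], []
    for tok in tokens:
        if EVENT_PATTERN.fullmatch(tok):
            events.append(tok)
        else:
            text_parts.append(tok)
    return "".join(text_parts), events

# Example: a short streamed chunk with interleaved events.
chunk = ["<speak_start>", "I", " want", " to", " cancel", "<sigh>", "<speak_end>"]
text, events = split_stream(chunk)
# text   -> "I want to cancel"
# events -> ["<speak_start>", "<sigh>", "<speak_end>"]
```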
Interactive ReAct: Flexibly Interleaved Observation-Thinking-Action
Traditional ReAct has a rigid OTA loop (Observation-Thinking-Action):
- Fixed Loop: Must complete entire Observation-Thinking-Action sequence
- Thinking Lost: Cannot think while listening, high latency
- Rigid: Must wait for complete input before thinking
Interactive ReAct enables flexible OTA interleaving:
- Think While Listening: New observations insert anytime, thinking preserved
- Speak While Thinking: Fast response, then continue thinking
- Intelligent Turn Detection: Decide when to speak, when to stay silent
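Below is a highly simplified sketch of the interleaving idea (an illustration, not our production loop): observations from the perception layer land on a queue and are merged into the context between thinking steps, so listening never blocks thinking, and partial utterances can be spoken before thinking finishes.

```python
import queue

def interactive_react(llm_step, observations: queue.Queue, speak):
    """Flexible OTA loop (illustrative sketch).

    llm_step(context) -> (thought, utterance_or_None, done)
    observations: queue filled asynchronously by the perception layer
    speak(text): hands a sentence to the TTS execution layer
    """
    context, done = [], False
    while not done:
        # Think while listening: merge any observations that arrived
        # during the previous thinking step before thinking again.
        while not observations.empty():
            context.append(("observation", observations.get_nowait()))
        thought, utterance, done = llm_step(context)
        context.append(("thought", thought))
        if utterance is not None:            # speak while thinking
            speak(utterance)
            context.append(("said", utterance))
```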
Example Comparison:
Traditional ReAct (total ~20+ seconds):
```
O₁: "I want to lower my Xfinity bill to $79 per month"
```
Interactive ReAct (total ~6 seconds):
```
O₁: "I want to lower my Xfinity bill to $79 per month"
```
Think While Listening
Key Insight: LLM is 20-100x Faster Than Human Speech - Use Gap Time to Think!
LLM Processing Speed:
- Prefill (Input): 1000+ tokens/sec
- Decode (Output): 100 tokens/sec
Human Voice Input/Output Speed:
- Speaking: 5 tokens/sec (text) or 20 tokens/sec (audio tokens)
- LLM is 20-100x faster than humans!
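As a back-of-the-envelope calculation with the rates above (assuming thinking can proceed in parallel with listening), a 10-second user utterance leaves roughly a thousand tokens of silent reasoning budget:

```python
# Rough budget for "thinking while listening", using the rates above.
speak_rate_text = 5        # user speech, text tokens per second
prefill_rate    = 1000     # LLM input processing, tokens per second
decode_rate     = 100      # LLM output generation, tokens per second

utterance_seconds = 10
incoming_tokens   = speak_rate_text * utterance_seconds   # ~50 tokens to prefill
prefill_seconds   = incoming_tokens / prefill_rate        # ~0.05 s
thinking_tokens   = int((utterance_seconds - prefill_seconds) * decode_rate)
print(thinking_tokens)  # ~995 tokens of reasoning before the user even finishes
```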
Example: Interview Agent with Async Tool Calls While Candidate Speaks
```
Candidate: "My previous role involved building distributed systems..."
```
Advantage: Async tools + thinking while listening → no waiting, ultra-fast response
Speak While Thinking
Theory: ⚡ Fast → 🐢 Slow → 🐌 Continuous Thinking Using Filler Speech
Three Phases of Thinking:
- ⚡ Fast (0.5s, 50 tokens): Quick judgment → immediate response
- 🐢 Slow (5s, 500 tokens): Deep analysis → complete answer
- 🐌 Continuous (interleaved thinking and speaking): Keep thinking → keep speaking
Key: Use “filler speech” to maintain conversation flow during deep thinking
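Here is a sketch of that three-phase policy (the filler phrases, timing threshold, and threading model are illustrative assumptions): answer immediately with the fast judgment, and whenever the slow generator has not produced the next sentence in time, bridge the silence with filler speech.

```python
import queue, threading

FILLERS = ["Let me take a quick look at that...", "One moment, checking the details..."]

def respond(fast_think, slow_think, speak, patience=2.0):
    """fast_think() -> short reply; slow_think(out_queue) puts sentences, then None."""
    speak(fast_think())                              # ⚡ fast phase: quick judgment
    sentences: queue.Queue = queue.Queue()
    threading.Thread(target=slow_think, args=(sentences,), daemon=True).start()
    filler_idx = 0
    while True:                                      # 🐢 slow / 🐌 continuous phases
        try:
            sentence = sentences.get(timeout=patience)
        except queue.Empty:
            # Deep thinking is still running: keep the floor with filler speech.
            speak(FILLERS[filler_idx % len(FILLERS)])
            filler_idx += 1
            continue
        if sentence is None:                         # slow_think signals completion
            break
        speak(sentence)
```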
Example: Interview Agent Asking Complex Question
```
Candidate: "I'm ready for the technical question."
```
Result: Question unfolds naturally sentence-by-sentence, no awkward silence
Future: Three Stages of AI Agent-Environment Interaction
Real-time asynchronous interaction with environment is fundamental to agents.
🗣️ Stage 1: Voice
- Input: Voice
- Output: Voice
- Data Rate: 15-50 token/s
- Latency: <500ms
- Challenge: Fast-slow thinking balance
- Solution: Interactive ReAct
💻 Stage 2: Computer Use
- Input: Visual (screenshots)
- Output: Mouse/keyboard actions
- Data Rate: ~2K token/frame
- Latency: <1 second
- Challenge: Precise action execution
- Solution: VLA models + RL
🤖 Stage 3: Physical World
- Input: Vision+Voice+Tactile
- Output: Voice+Joint actions
- Data Rate: ~20K token/s
- Latency: <100ms
- Challenge: Real-time control
- Solution: VLA + World Models
Key Insight: Complexity increases (data rate ↑, latency ↓), but architectural solutions transfer across stages
Part II: Agents Learning from Experience
“We want AI agents that can discover like we can, not which contain what we have discovered.”
— Richard Sutton
Why Agents Must Learn from Experience: From “Intelligent” to “Proficient”
🎓 SOTA Models ≈ Top Graduates
✅ Knowledgeable: Master vast amounts of general knowledge
❌ Lack Experience: Underperform vs. experienced professionals on specialized tasks (e.g., accounting, tax filing)
💼 Real Challenges in Pine AI
🔑 Verification Info
- 1st call: learns credit card last 4 digits required
- 2nd call: should proactively request it
📋 Service Procedures
- 1st cancellation: told to fill online form instead of phone call
- 2nd cancellation: should directly fill online form
🎯 Service Rules
- Which discounts apply? (veterans, 2-year loyalty, etc.)
💰 Price Estimation
- Is $60/month for 3Gbps broadband high or low? Room to negotiate?
Core Problem: Many business processes are dynamic and non-public. Simply improving the base model’s general capabilities cannot solve these “experience-based” problems.
Building Self-Evolving Agents
Making agents learn from experience through three paradigms:
- Paradigm 1: Post-Training
- Paradigm 2: In-Context Learning
- Paradigm 3: Externalized Learning
Method 1: Post-Training - SFT Memorizes, RL Generalizes
📚 Supervised Fine-Tuning (SFT)
✅ Advantages:
- Extremely sample-efficient (thousands suffice)
- Quickly solidifies formats and protocols
- Stable training, fast convergence
❌ Limitations:
- Memorizes surface patterns
- Cliff-like performance degradation on out-of-distribution inputs
- Hard to learn transferable strategies
🎯 Reinforcement Learning (RL)
✅ Advantages:
- Learns transferable policy representations
- Robust in out-of-distribution scenarios
- Discovers new strategies beyond training data
❌ Limitations:
- Low sample efficiency (100x more data and compute)
- High training cost and time
- Requires verifiable reward signals
💡 Engineering Practice: Form Before Function
- SFT Phase: Establish format stability, ensure parseable outputs
- RL Phase: Break through generalization boundaries on stable foundation
- Key Balance: Train SFT until “format stable, capabilities emerging”
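In code, the schedule might look like the following sketch, where `sft_step`, `rl_step`, and the `parse_rate` stopping criterion are placeholders for illustration rather than our actual training stack.

```python
def train_agent(model, sft_batches, rl_episodes, sft_step, rl_step, parse_rate):
    """Two-phase schedule: SFT for format stability, then RL for generalization."""
    # Phase 1 (SFT): stop as soon as outputs are reliably parseable,
    # i.e. "format stable, capabilities emerging"; avoid over-memorization.
    for batch in sft_batches:
        sft_step(model, batch)
        if parse_rate(model) > 0.99:
            break
    # Phase 2 (RL): push generalization boundaries on the stable foundation.
    for episode in rl_episodes:
        rl_step(model, episode)
    return model
```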
Improving Sample Efficiency (I): On-Policy Distillation
Comparing Three Training Approaches:
SFT (Supervised Fine-Tuning)
- Sampling: ❌ Off-policy (teacher’s trajectories)
- Reward: ✅ Dense (token-by-token)
- Problem: Compounding errors in student’s states
RL (Reinforcement Learning)
- Sampling: ✅ On-policy (student’s rollouts)
- Reward: ❌ Sparse (only final outcome)
- Problem: One signal per episode, inefficient
✨ On-Policy Distillation
- Sampling: ✅ On-policy (student’s trajectories)
- Reward: ✅ Dense (teacher grades each token)
- Best of both worlds!
🔧 How It Works
```
# Sample from student
```
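Expanding on that, here is a minimal sketch of one training step as I would implement it with PyTorch; the reverse-KL-per-token loss and the plain-callable model interface are simplifying assumptions, not our exact training code.

```python
import torch
import torch.nn.functional as F

def on_policy_distill_step(student, teacher, prompt_ids, optimizer, max_new_tokens=64):
    """One step of on-policy distillation.

    student/teacher: callables mapping token ids [B, T] -> logits [B, T, V].
    The student samples its own continuation (on-policy), and the teacher
    provides a dense, per-token signal on those student-visited states.
    """
    # 1. Sample a continuation from the *student* (on-policy rollout).
    ids = prompt_ids
    with torch.no_grad():
        for _ in range(max_new_tokens):
            next_logits = student(ids)[:, -1, :]
            next_id = torch.multinomial(F.softmax(next_logits, dim=-1), 1)
            ids = torch.cat([ids, next_id], dim=-1)

    # 2. Teacher grades every generated token (dense reward at each position).
    gen = slice(prompt_ids.shape[1] - 1, ids.shape[1] - 1)  # positions predicting new tokens
    student_logp = F.log_softmax(student(ids)[:, gen, :], dim=-1)
    with torch.no_grad():
        teacher_logp = F.log_softmax(teacher(ids)[:, gen, :], dim=-1)

    # 3. Minimize per-token reverse KL(student || teacher) on student-sampled states.
    loss = (student_logp.exp() * (student_logp - teacher_logp)).sum(-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```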
🎯 Key Benefits:
- 10x more efficient than RL
- Student learns to recover from its own mistakes
- Can reuse training data (multi-epoch)
- Enables continual learning
Improving Sample Efficiency (II): Feedback-Guided Sampling
❌ Traditional GRPO/DAPO
Process:
- Generate N independent rollouts
- Later attempts repeat same errors
Example:
```
Rollout 1: Requires SSN → ❌ Failure
```
✅ Feedback-Guided Sampling
Sequential Process:
- 1st rollout: From original prompt
- 2nd rollout: Prompt + 1st feedback in context
- Nth rollout: Accumulate feedback from N-1 rollouts
Example:
```
Rollout 1: Requires SSN → ❌ Failure
```
📈 Result: More high-quality samples per batch
This is essentially an online learning process:
- Externalized Learning: Feedback accumulated in knowledge base after each rollout
- Online RL: Agent adapts its policy based on accumulated feedback within the batch
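A minimal sketch of feedback-guided sampling is shown below; the `agent`/`judge` interfaces and the way feedback is appended to the prompt are assumptions for illustration. The key point is that rollouts are sequential and each attempt sees the lessons distilled from earlier failures.

```python
def feedback_guided_rollouts(agent, task_prompt, judge, n_rollouts=4):
    """Generate rollouts sequentially, feeding accumulated feedback forward.

    agent(prompt) -> trajectory
    judge(traj)   -> (success: bool, feedback: str), e.g. "Must provide SSN proactively."
    """
    feedback_notes, samples = [], []
    for _ in range(n_rollouts):
        prompt = task_prompt
        if feedback_notes:
            prompt += "\n\nLessons from earlier attempts:\n" + "\n".join(
                f"- {note}" for note in feedback_notes
            )
        traj = agent(prompt)
        success, feedback = judge(traj)
        samples.append((traj, success))
        if not success and feedback:
            feedback_notes.append(feedback)   # externalized learning within the batch
    return samples, feedback_notes
```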
Method 2: In-Context Learning
⚠️ Common Misconception
“With long context, just put all history in and let the model automatically reason”
This is a serious misconception about context capabilities!
🔍 What Context Really Does
- Nature: Retrieval, NOT reasoning engine
- Mechanism: Key-value similarity matching (like RAG)
- ✅ Good at: Finding relevant information
- ❌ Poor at: Statistical aggregation & counting
⚠️ Real Case: Three-Call Limit
- Rule: Max 3 calls to same merchant
- Context: Trajectory has multiple Xfinity calls
- Problem:
- Must scan entire trajectory to count
- Easily miscounts → makes 4th call
- Even if correct, wastes reasoning tokens
- Cost: O(trajectory length) per decision
System Hint: Making Implicit State Explicit
Solution: Pre-aggregate information → Reduce O(n) to O(1) context lookups
💡 How System Hint Works
```
<system_hint>
```
✅ Benefit:
- Complexity: O(n) → O(1)
- Model uses aggregated info directly
- No scanning or counting needed
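A sketch of how such a hint could be assembled is below; the hint format, field names, and trajectory schema are illustrative assumptions. The framework aggregates the per-merchant call count once, so the model reads one line instead of rescanning the whole trajectory.

```python
from collections import Counter

MAX_CALLS_PER_MERCHANT = 3

def build_system_hint(trajectory):
    """Pre-aggregate implicit state from the trajectory into an explicit hint."""
    calls = Counter(
        step["merchant"] for step in trajectory if step.get("action") == "call"
    )
    lines = [
        f"Calls to {m}: {n}/{MAX_CALLS_PER_MERCHANT}"
        + (" (limit reached, do NOT call again)" if n >= MAX_CALLS_PER_MERCHANT else "")
        for m, n in calls.items()
    ]
    return "<system_hint>\n" + "\n".join(lines) + "\n</system_hint>"

# Example: three Xfinity calls already made -> the model gets an O(1) lookup.
traj = [{"action": "call", "merchant": "Xfinity"}] * 3
print(build_system_hint(traj))
```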
📋 Four Types of System Hints
Task Planning

```
TODO: [✅] Call customer service
      [ ] Call retention dept
```

Side-Channel Info

```
[2025-06-25 11:00:20] User message
```

Environment State

```
Current dir: /home/ubuntu
OS: Ubuntu 24.04
```

LLM-Generated Summary

```
Conversation summary:
User wants $79 Xfinity plan with all current features
```
Method 3: Externalized Learning (Knowledge Base)
🚫 NEVER Store Raw Cases Directly in Knowledge Base
Storing raw dialogues/cases without distillation leads to incomplete retrieval and wrong conclusions.
🐱 Case 1: Cat Counting Problem
Scenario: 100 cases: 90 black cats, 10 white cats (each stored as a separate case)
Question: “What’s the ratio?”
❌ Raw Storage Problem:
- Top-k=20 retrieves partial cases only
- Incomplete sample → Wrong inference
✅ Distilled Approach:
1 | "Total 100 cats: |
→ Single retrieval, accurate!
💼 Case 2: Discount Rule Error
Scenario: 3 cases: Veteran John ✅, Doctor Sarah ✅, Teacher Mike ❌
Question: “I’m a nurse, discount?”
❌ Raw Storage Problem:
- “Nurse” ≈ “Doctor” → Retrieves Sarah only
- The John and Mike cases are missed → Wrong inference
✅ Distilled Approach:
1 | "Xfinity discount: ONLY |
→ Complete rule, correct answer!
Active Knowledge Distillation: Compression is Understanding
Core Principle: Invest extra compute now (LLM summarization) → Save reasoning tokens later
💡 Why Distillation?
❌ Raw trajectory (3 calls):
```
10:00 Call Xfinity (billing)
```
✅ After distillation:
1 | "Called Xfinity 3 times (limit)" |
📊 Three Levels of Knowledge Distillation
Statistical Aggregation
- 100 cases → “90% black, 10% white”
- Reduce density, improve retrieval
Rule Distillation
- 3 cases → “Only veterans & doctors”
- Leap from cases to abstract rules
Structured Knowledge Extraction
- RAPTOR: Tree summaries
- GraphRAG: Entity networks
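As a rough sketch of the first two levels (the `llm_summarize` call and the record schema are assumptions; a production system would layer contextual retrieval, RAPTOR, or GraphRAG on top):

```python
from collections import Counter

def distill_cases(cases, llm_summarize):
    """Compress raw cases into retrievable knowledge before storage.

    cases: list of dicts like {"attribute": "veteran", "outcome": "approved"}.
    llm_summarize(prompt) -> str, an LLM call that turns examples into an explicit rule.
    """
    # Level 1: statistical aggregation ("90% black, 10% white").
    counts = Counter(c["attribute"] for c in cases)
    total = sum(counts.values())
    stats = "; ".join(f"{k}: {v}/{total} ({100 * v // total}%)" for k, v in counts.items())

    # Level 2: rule distillation, the leap from individual cases to an abstract rule.
    rule = llm_summarize(
        "State the general eligibility rule implied by these cases:\n"
        + "\n".join(str(c) for c in cases)
    )
    return {"aggregate_stats": stats, "distilled_rule": rule}
```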
Summary: 3 Paradigms of Agent Continual Learning
Paradigm 1: Post-Training
- Core Finding: SFT memorizes, RL generalizes
- SFT: Solidifies formats and protocols, high sample efficiency
- RL: Learns transferable strategies, out-of-distribution robust
Paradigm 2: In-Context Learning
- Core Insight: Context ≠ Memory
- Nature: Attention is similar to RAG
- Methods: System hints, explicit summarization
Paradigm 3: Externalized Learning
3.1 Knowledge Base
- Advantages: Leverages extra compute for knowledge extraction
- Methods: Contextual retrieval, RAPTOR hierarchical summaries
3.2 Tool Generation
- Advantages: Codifies processes, efficient, reliable, composable
- Philosophy: Minimal predefinition + Maximum self-evolution (Alita)
Summary
Part I: Real-Time Interaction
Think While Listening, Speak While Thinking
❌ Problem: Serial architecture: VAD waits → ASR transcribes → LLM thinks → TTS speaks
✅ Solution:
- Perception: Streaming model produces context-aware transcription and acoustic events
- Thinking: Event-driven, can think while listening and speaking
💡 Example: Telecom Plan Query - No Awkward Silence
1 | O: "Should I order this plan?" |
Part II: Learning from Experience
Learn While Acting
❌ Problem:
- Fixed models cannot learn from experience after deployment
- Big world: business processes are dynamic & non-public
✅ Solution:
- Post-Training: Learn from interactions via RL
- In-Context: Aggregate info via system hints
- Externalized: Distill knowledge, generate tools
💡 Example: Credit Card Verification
```
1st call: ❌ Doesn't have last 4 digits of credit card
2nd call: ✅ Proactively requests it from the user upfront
```
“We want AI agents that can discover like we can, not which contain what we have discovered.”
— Richard Sutton
About Pine AI
Pine AI is an AI Agent that makes calls and uses computers to get things done. As your personal assistant, we contact customer service on your behalf to:
- 💰 Lower bills (average 20% savings on telecom, utilities)
- ❌ Cancel subscriptions
- 📋 File complaints
- 💵 Get compensation & refunds
- ✈️ Travel assistance
Results:
- Average Time Saved: 270 min
- Success Rate: 93%
- Saved for Consumers: $3M+