Self-Evolving Real-Time Agents: Think While Listening, Speak While Thinking, Learn While Acting
[This post is based on my invited talk at FAISys’25 (The 1st Frontier AI Systems Workshop).]
Hello everyone, I’m honored to speak at FAISys’25 (The 1st Frontier AI Systems Workshop). Today I’m presenting “Self-Evolving Real-Time Agents: Think While Listening, Speak While Thinking, Learn While Acting”.
I’m Co-Founder and Chief Scientist at Pine AI, where we help users handle everyday tasks through AI-powered phone calls and computer automation: bill negotiation, subscription cancellation, complaint filing, compensation claims, and more. We’ve saved consumers over $3 million at a 93% success rate, saving each user an average of 270 minutes.
Learning from experience is the fundamental challenge in machine learning. In practical deployments, today’s autonomous AI agents face two core challenges: interacting with their environment in real time and learning from experience. Today I’ll introduce our technical breakthroughs in both areas.
Two Fundamental Challenges
Challenge I: High Latency in Real-Time Interaction
Real-time voice agents must respond within 1 second like humans, but traditional architectures using reasoning LLMs introduce 2-10 second delays.
VAD (Voice Activity Detection) Challenges:
- Must wait 500-800ms of continuous silence to confirm user finished
- “Uh-huh” mistakenly triggers interruption
- Lost acoustic information (emotions, environment)
ASR (Automatic Speech Recognition) Challenges:
- No context leads to high error rates (emails, names, phone numbers)
- Lack of world knowledge causes transcription errors
LLM Challenges:
- Forced to wait, cannot think while listening
- Cannot speak while thinking (5-10 second silence)
- Poor turn detection (when to speak/stay silent)
Challenge II: Learning from Experience
Models are “intelligent” but not “proficient” — like top graduates lacking real-world experience.
Fixed Models Cannot Learn:
- Cannot learn from successful traces
- Cannot learn from unsuccessful traces
- Parameters frozen after deployment
Big World Hypothesis:
The world is too large to pre-encode all knowledge:
- Business processes are dynamic and non-public
- Verification info varies by company
- Service rules constantly change
- Pre-trained knowledge insufficient for deployment
Part I: Real-Time Agent-Environment Interaction
Agents need real-time interaction with three types of targets:
- Humans: Dialogue and collaboration through real-time voice
- Digital World: Operating computers, browsing web, using mobile devices
- Physical World: Controlling robots, interacting with real environments
Typical Voice Agent Architecture
Traditional architecture has three layers:
Perception Layer: VAD + ASR
- Input: Continuous signals (audio streams)
- Output: Discrete events (`speech_start`, `interrupt`, `laugh`, `speech_fragment`, etc.)
- Function: Transform continuous signals into discrete events
Thinking Layer: LLM
- Input: Discrete event stream (observations and tool results)
- Output: Interleaved thoughts, tool calls, and output sentences
- Function: Asynchronous processing
Execution Layer: TTS
- Input: Discrete events (text)
- Output: Continuous actions (audio stream)
- Function: Transform discrete events into continuous actions
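To make the interfaces between the layers concrete, here is a minimal sketch (my own illustration, not Pine AI’s production code) of the discrete events the perception layer emits and how an asynchronous thinking layer might consume them; the event names follow the list above.

```python
from dataclasses import dataclass, field
from enum import Enum
import time

class EventType(Enum):
    SPEECH_START = "speech_start"
    SPEECH_FRAGMENT = "speech_fragment"
    INTERRUPT = "interrupt"
    LAUGH = "laugh"

@dataclass
class PerceptionEvent:
    type: EventType
    text: str = ""                                  # partial transcript, if any
    timestamp: float = field(default_factory=time.time)

def thinking_layer(event_stream):
    """Consume discrete perception events asynchronously and
    yield text sentences for the TTS execution layer."""
    for event in event_stream:
        if event.type is EventType.SPEECH_FRAGMENT:
            # The LLM can already start reasoning over partial input here.
            yield f"(thinking about: {event.text!r})"
        elif event.type is EventType.INTERRUPT:
            yield "Sorry, go ahead."
```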
Problems with Traditional VAD + ASR Architecture
Three VAD Problems:
- Unavoidable Latency: Must wait 500-800ms of continuous silence to confirm user finished
- Poor Interrupt Detection: Cannot distinguish background noise/music; “Uh-huh” mistakenly triggers interruption
- Low Voice Detection Accuracy: Errors in complex acoustic environments; mid-sentence pauses cause truncation; a “Hello” buried in background noise goes undetected, so the agent fails to respond
Three ASR Problems:
- Low Accuracy Without Context: VAD cuts audio into isolated segments; cannot use context for disambiguation; high errors for emails, names, phone numbers
- Lack of World Knowledge: Cannot leverage common sense; low accuracy for addresses, brands, technical terms, amounts
- Text-Only Output Lacks Acoustic Details:
- Lost emotions: happy, frustrated, excited
- Lost paralinguistic info: laugh, sigh, breath
- Lost environment: noisy, music, quiet
Solution: Streaming Voice Perception Model
We propose a Streaming Voice Perception Model to replace VAD + ASR:
Multimodal Architecture:
- Audio Encoder (from Whisper): Converts audio to audio tokens
- Qwen LLM (autoregressive): Processes audio tokens, outputs text + events
Key Advantages:
- Streaming: Real-time output (not batch)
- Context: Full dialogue history preserved
- In-Context Learning: Better recognition for personal info, domain terms
- World Knowledge: Higher accuracy for addresses, brands, amounts
Rich Output: Text + Acoustic Events
In addition to text tokens, outputs special tokens (acoustic events):
- `<speak_start>` / `<speak_end>`: Speech boundaries
- `<interrupt>`: Interruption intent
- `<emotion:happy>`: Emotion markers
- `<laugh>` / `<sigh>`: Paralinguistic info
- `<music>`: Environmental sounds
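To illustrate how a downstream agent might consume this interleaved output, here is a small sketch; the token names follow the list above, while the parsing logic itself is an assumption for illustration.

```python
import re

# Special tokens emitted by the streaming perception model (see list above).
EVENT_PATTERN = re.compile(r"<[^<>]+>")

def split_stream(tokens):
    """Separate streamed output into transcript text and acoustic events."""
    text_parts, events = [], []
    for tok in tokens:
        if EVENT_PATTERN.fullmatch(tok):
            events.append(tok)
        else:
            text_parts.append(tok)
    return "".join(text_parts), events

# Example: a short streamed chunk with interleaved events.
chunk = ["<speak_start>", "I", " want", " to", " cancel", "<sigh>", "<speak_end>"]
text, events = split_stream(chunk)
# text   -> "I want to cancel"
# events -> ["<speak_start>", "<sigh>", "<speak_end>"]
```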
Interactive ReAct: Flexibly Interleaved Observation-Thinking-Action
Traditional ReAct has a rigid OTA loop (Observation-Thinking-Action):
- Fixed Loop: Must complete entire Observation-Thinking-Action sequence
- Thinking Lost: Cannot think while listening, high latency
- Rigid: Must wait for complete input before thinking
Interactive ReAct enables flexible OTA interleaving:
- Think While Listening: New observations insert anytime, thinking preserved
- Speak While Thinking: Fast response, then continue thinking
- Intelligent Turn Detection: Decide when to speak, when to stay silent
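Below is a highly simplified sketch of the interleaving idea (an illustration, not our production loop): observations from the perception layer land on a queue and are merged into the context between thinking steps, so listening never blocks thinking, and partial utterances can be spoken before thinking finishes.

```python
import queue

def interactive_react(llm_step, observations: queue.Queue, speak):
    """Flexible OTA loop (illustrative sketch).

    llm_step(context) -> (thought, utterance_or_None, done)
    observations: queue filled asynchronously by the perception layer
    speak(text): hands a sentence to the TTS execution layer
    """
    context, done = [], False
    while not done:
        # Think while listening: merge any observations that arrived
        # during the previous thinking step before thinking again.
        while not observations.empty():
            context.append(("observation", observations.get_nowait()))
        thought, utterance, done = llm_step(context)
        context.append(("thought", thought))
        if utterance is not None:            # speak while thinking
            speak(utterance)
            context.append(("said", utterance))
```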
Example Comparison:
Traditional ReAct (total ~20+ seconds):
```
O₁: "I want to lower my Xfinity bill to $79 per month"
```
Interactive ReAct (total ~6 seconds):
```
O₁: "I want to lower my Xfinity bill to $79 per month"
```
Think While Listening
Key Insight: LLM is 20-100x Faster Than Human Speech - Use Gap Time to Think!
LLM Processing Speed:
- Prefill (Input): 1000+ tokens/sec
- Decode (Output): 100 tokens/sec
Human Voice Input/Output Speed:
- Speaking: 5 tokens/sec (text) or 20 tokens/sec (audio tokens)
- LLM is 20-100x faster than humans!
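As a back-of-the-envelope calculation with the rates above (assuming thinking can proceed in parallel with listening), a 10-second user utterance leaves roughly a thousand tokens of silent reasoning budget:

```python
# Rough budget for "thinking while listening", using the rates above.
speak_rate_text = 5        # user speech, text tokens per second
prefill_rate    = 1000     # LLM input processing, tokens per second
decode_rate     = 100      # LLM output generation, tokens per second

utterance_seconds = 10
incoming_tokens   = speak_rate_text * utterance_seconds   # ~50 tokens to prefill
prefill_seconds   = incoming_tokens / prefill_rate        # ~0.05 s
thinking_tokens   = int((utterance_seconds - prefill_seconds) * decode_rate)
print(thinking_tokens)  # ~995 tokens of reasoning before the user even finishes
```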
Example: Interview Agent with Async Tool Calls While Candidate Speaks
```
Candidate: "My previous role involved building distributed systems..."
```
Advantage: Async tools + thinking while listening → no waiting, ultra-fast response
Speak While Thinking
Theory: ⚡ Fast → 🐢 Slow → 🐌 Continuous Thinking Using Filler Speech
Three Phases of Thinking:
- ⚡ Fast (0.5s, 50 tokens): Quick judgment → immediate response
- 🐢 Slow (5s, 500 tokens): Deep analysis → complete answer
- 🐌 Continuous (interleaved thinking and speaking): Keep thinking → keep speaking
Key: Use “filler speech” to maintain conversation flow during deep thinking
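Here is a sketch of that three-phase policy (the filler phrases, timing threshold, and threading model are illustrative assumptions): answer immediately with the fast judgment, and whenever the slow generator has not produced the next sentence in time, bridge the silence with filler speech.

```python
import queue, threading

FILLERS = ["Let me take a quick look at that...", "One moment, checking the details..."]

def respond(fast_think, slow_think, speak, patience=2.0):
    """fast_think() -> short reply; slow_think(out_queue) puts sentences, then None."""
    speak(fast_think())                              # ⚡ fast phase: quick judgment
    sentences: queue.Queue = queue.Queue()
    threading.Thread(target=slow_think, args=(sentences,), daemon=True).start()
    filler_idx = 0
    while True:                                      # 🐢 slow / 🐌 continuous phases
        try:
            sentence = sentences.get(timeout=patience)
        except queue.Empty:
            # Deep thinking is still running: keep the floor with filler speech.
            speak(FILLERS[filler_idx % len(FILLERS)])
            filler_idx += 1
            continue
        if sentence is None:                         # slow_think signals completion
            break
        speak(sentence)
```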
Example: Interview Agent Asking Complex Question
```
Candidate: "I'm ready for the technical question."
```
Result: Question unfolds naturally sentence-by-sentence, no awkward silence
Future: Three Stages of AI Agent-Environment Interaction
Real-time asynchronous interaction with environment is fundamental to agents.
🗣️ Stage 1: Voice
- Input: Voice
- Output: Voice
- Data Rate: 15-50 token/s
- Latency: <500ms
- Challenge: Fast-slow thinking balance
- Solution: Interactive ReAct
💻 Stage 2: Computer Use
- Input: Visual (screenshots)
- Output: Mouse/keyboard actions
- Data Rate: ~2K token/frame
- Latency: <1 second
- Challenge: Precise action execution
- Solution: VLA models + RL
🤖 Stage 3: Physical World
- Input: Vision+Voice+Tactile
- Output: Voice+Joint actions
- Data Rate: ~20K token/s
- Latency: <100ms
- Challenge: Real-time control
- Solution: VLA + World Models
Key Insight: Complexity increases (data rate ↑, latency ↓), but architectural solutions transfer across stages
Part II: Agents Learning from Experience
“We want AI agents that can discover like we can, not which contain what we have discovered.”
— Richard Sutton
Why Agents Must Learn from Experience: From “Intelligent” to “Proficient”
🎓 SOTA Models ≈ Top Graduates
✅ Knowledgeable: Master vast amounts of general knowledge
❌ Lack Experience: Underperform vs. experienced professionals on specialized tasks (e.g., accounting, tax filing)
💼 Real Challenges in Pine AI
🔑 Verification Info
- 1st call: learns credit card last 4 digits required
- 2nd call: should proactively request it
📋 Service Procedures
- 1st cancellation: told to fill online form instead of phone call
- 2nd cancellation: should directly fill online form
🎯 Service Rules
- Which discounts apply? (veterans, 2-year loyalty, etc.)
💰 Price Estimation
- Is $60/month for 3Gbps broadband high or low? Room to negotiate?
Core Problem: Many business processes are dynamic and non-public. Simply improving the base model’s general capabilities cannot solve these “experience-based” problems.
Building Self-Evolving Agents
Making agents learn from experience through three paradigms:
- Paradigm 1: Post-Training
- Paradigm 2: In-Context Learning
- Paradigm 3: Externalized Learning
Method 1: Post-Training - SFT Memorizes, RL Generalizes
📚 Supervised Fine-Tuning (SFT)
✅ Advantages:
- Extremely sample-efficient (thousands suffice)
- Quickly solidifies formats and protocols
- Stable training, fast convergence
❌ Limitations:
- Memorizes surface patterns
- Cliff-like performance degradation on out-of-distribution inputs
- Hard to learn transferable strategies
🎯 Reinforcement Learning (RL)
✅ Advantages:
- Learns transferable policy representations
- Robust in out-of-distribution scenarios
- Discovers new strategies beyond training data
❌ Limitations:
- Low sample efficiency (100x more data and compute)
- High training cost and time
- Requires verifiable reward signals
💡 Engineering Practice: Form Before Function
- SFT Phase: Establish format stability, ensure parseable outputs
- RL Phase: Break through generalization boundaries on stable foundation
- Key Balance: Train SFT until “format stable, capabilities emerging”
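In code, the schedule might look like the following sketch, where `sft_step`, `rl_step`, and the `parse_rate` stopping criterion are placeholders for illustration rather than our actual training stack.

```python
def train_agent(model, sft_batches, rl_episodes, sft_step, rl_step, parse_rate):
    """Two-phase schedule: SFT for format stability, then RL for generalization."""
    # Phase 1 (SFT): stop as soon as outputs are reliably parseable,
    # i.e. "format stable, capabilities emerging"; avoid over-memorization.
    for batch in sft_batches:
        sft_step(model, batch)
        if parse_rate(model) > 0.99:
            break
    # Phase 2 (RL): push generalization boundaries on the stable foundation.
    for episode in rl_episodes:
        rl_step(model, episode)
    return model
```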
Improving Sample Efficiency (I): On-Policy Distillation
Comparing Three Training Approaches:
SFT (Supervised Fine-Tuning)
- Sampling: ❌ Off-policy (teacher’s trajectories)
- Reward: ✅ Dense (token-by-token)
- Problem: Compounding errors in student’s states
RL (Reinforcement Learning)
- Sampling: ✅ On-policy (student’s rollouts)
- Reward: ❌ Sparse (only final outcome)
- Problem: One signal per episode, inefficient
✨ On-Policy Distillation
- Sampling: ✅ On-policy (student’s trajectories)
- Reward: ✅ Dense (teacher grades each token)
- Best of both worlds!
🔧 How It Works
```
# Sample from student
```
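Expanding on that, here is a minimal sketch of one training step as I would implement it with PyTorch; the reverse-KL-per-token loss and the plain-callable model interface are simplifying assumptions, not our exact training code.

```python
import torch
import torch.nn.functional as F

def on_policy_distill_step(student, teacher, prompt_ids, optimizer, max_new_tokens=64):
    """One step of on-policy distillation.

    student/teacher: callables mapping token ids [B, T] -> logits [B, T, V].
    The student samples its own continuation (on-policy), and the teacher
    provides a dense, per-token signal on those student-visited states.
    """
    # 1. Sample a continuation from the *student* (on-policy rollout).
    ids = prompt_ids
    with torch.no_grad():
        for _ in range(max_new_tokens):
            next_logits = student(ids)[:, -1, :]
            next_id = torch.multinomial(F.softmax(next_logits, dim=-1), 1)
            ids = torch.cat([ids, next_id], dim=-1)

    # 2. Teacher grades every generated token (dense reward at each position).
    gen = slice(prompt_ids.shape[1] - 1, ids.shape[1] - 1)  # positions predicting new tokens
    student_logp = F.log_softmax(student(ids)[:, gen, :], dim=-1)
    with torch.no_grad():
        teacher_logp = F.log_softmax(teacher(ids)[:, gen, :], dim=-1)

    # 3. Minimize per-token reverse KL(student || teacher) on student-sampled states.
    loss = (student_logp.exp() * (student_logp - teacher_logp)).sum(-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```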
🎯 Key Benefits:
- 10x more efficient than RL
- Student learns to recover from its own mistakes
- Can reuse training data (multi-epoch)
- Enables continual learning
Improving Sample Efficiency (II): Feedback-Guided Sampling
❌ Traditional GRPO/DAPO
Process:
- Generate N independent rollouts
- Later attempts repeat same errors
Example:
```
Rollout 1: Requires SSN → ❌ Failure
```
✅ Feedback-Guided Sampling
Sequential Process:
- 1st rollout: From original prompt
- 2nd rollout: Prompt + 1st feedback in context
- Nth rollout: Accumulate feedback from N-1 rollouts
Example:
```
Rollout 1: Requires SSN → ❌ Failure
```
📈 Result: More high-quality samples per batch
This is essentially an online learning process:
- Externalized Learning: Feedback accumulated in knowledge base after each rollout
- Online RL: Agent adapts its policy based on accumulated feedback within the batch
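A minimal sketch of feedback-guided sampling is shown below; the `agent`/`judge` interfaces and the way feedback is appended to the prompt are assumptions for illustration. The key point is that rollouts are sequential and each attempt sees the lessons distilled from earlier failures.

```python
def feedback_guided_rollouts(agent, task_prompt, judge, n_rollouts=4):
    """Generate rollouts sequentially, feeding accumulated feedback forward.

    agent(prompt) -> trajectory
    judge(traj)   -> (success: bool, feedback: str), e.g. "Must provide SSN proactively."
    """
    feedback_notes, samples = [], []
    for _ in range(n_rollouts):
        prompt = task_prompt
        if feedback_notes:
            prompt += "\n\nLessons from earlier attempts:\n" + "\n".join(
                f"- {note}" for note in feedback_notes
            )
        traj = agent(prompt)
        success, feedback = judge(traj)
        samples.append((traj, success))
        if not success and feedback:
            feedback_notes.append(feedback)   # externalized learning within the batch
    return samples, feedback_notes
```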
Method 2: In-Context Learning
⚠️ Common Misconception
“With long context, just put all history in and let the model automatically reason”
This is a serious misconception about context capabilities!
🔍 What Context Really Does
- Nature: Retrieval, NOT reasoning engine
- Mechanism: Key-value similarity matching (like RAG)
- ✅ Good at: Finding relevant information
- ❌ Poor at: Statistical aggregation & counting
⚠️ Real Case: Three-Call Limit
- Rule: Max 3 calls to same merchant
- Context: Trajectory has multiple Xfinity calls
- Problem:
- Must scan entire trajectory to count
- Easily miscounts → makes 4th call
- Even if correct, wastes reasoning tokens
- Cost: O(trajectory length) per decision
System Hint: Making Implicit State Explicit
Solution: Pre-aggregate information → Reduce O(n) to O(1) context lookups
💡 How System Hint Works
```
<system_hint>
```
✅ Benefit:
- Complexity: O(n) → O(1)
- Model uses aggregated info directly
- No scanning or counting needed
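A sketch of how such a hint could be assembled is below; the hint format, field names, and trajectory schema are illustrative assumptions. The framework aggregates the per-merchant call count once, so the model reads one line instead of rescanning the whole trajectory.

```python
from collections import Counter

MAX_CALLS_PER_MERCHANT = 3

def build_system_hint(trajectory):
    """Pre-aggregate implicit state from the trajectory into an explicit hint."""
    calls = Counter(
        step["merchant"] for step in trajectory if step.get("action") == "call"
    )
    lines = [
        f"Calls to {m}: {n}/{MAX_CALLS_PER_MERCHANT}"
        + (" (limit reached, do NOT call again)" if n >= MAX_CALLS_PER_MERCHANT else "")
        for m, n in calls.items()
    ]
    return "<system_hint>\n" + "\n".join(lines) + "\n</system_hint>"

# Example: three Xfinity calls already made -> the model gets an O(1) lookup.
traj = [{"action": "call", "merchant": "Xfinity"}] * 3
print(build_system_hint(traj))
```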
📋 Four Types of System Hints
Task Planning

```
TODO: [✅] Call customer service
      [ ] Call retention dept
```

Side-Channel Info

```
[2025-06-25 11:00:20] User message
```

Environment State

```
Current dir: /home/ubuntu
OS: Ubuntu 24.04
```

LLM-Generated Summary

```
Conversation summary:
User wants $79 Xfinity plan with all current features
```
Method 3: Externalized Learning (Knowledge Base)
🚫 NEVER Store Raw Cases Directly in Knowledge Base
Storing raw dialogues/cases without distillation leads to incomplete retrieval and wrong conclusions.
🐱 Case 1: Cat Counting Problem
Scenario: 100 cases: 90 black cats, 10 white cats (each stored as a separate case)
Question: “What’s the ratio?”
❌ Raw Storage Problem:
- Top-k=20 retrieves partial cases only
- Incomplete sample → Wrong inference
✅ Distilled Approach:
1 | "Total 100 cats: |
→ Single retrieval, accurate!
💼 Case 2: Discount Rule Error
Scenario: 3 cases: Veteran John ✅, Doctor Sarah ✅, Teacher Mike ❌
Question: “I’m a nurse, discount?”
❌ Raw Storage Problem:
- “Nurse” ≈ “Doctor” → Retrieves Sarah only
- The John and Mike cases are missed → Wrong inference
✅ Distilled Approach:
1 | "Xfinity discount: ONLY |
→ Complete rule, correct answer!
Active Knowledge Distillation: Compression is Understanding
Core Principle: Invest extra compute now (LLM summarization) → Save reasoning tokens later
💡 Why Distillation?
❌ Raw trajectory (3 calls):
```
10:00 Call Xfinity (billing)
```
✅ After distillation:
1 | "Called Xfinity 3 times (limit)" |
📊 Three Levels of Knowledge Distillation
Statistical Aggregation
- 100 cases → “90% black, 10% white”
- Reduce density, improve retrieval
Rule Distillation
- 3 cases → “Only veterans & doctors”
- Leap from cases to abstract rules
Structured Knowledge Extraction
- RAPTOR: Tree summaries
- GraphRAG: Entity networks
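As a rough sketch of the first two levels (the `llm_summarize` call and the record schema are assumptions; a production system would layer contextual retrieval, RAPTOR, or GraphRAG on top):

```python
from collections import Counter

def distill_cases(cases, llm_summarize):
    """Compress raw cases into retrievable knowledge before storage.

    cases: list of dicts like {"attribute": "veteran", "outcome": "approved"}.
    llm_summarize(prompt) -> str, an LLM call that turns examples into an explicit rule.
    """
    # Level 1: statistical aggregation ("90% black, 10% white").
    counts = Counter(c["attribute"] for c in cases)
    total = sum(counts.values())
    stats = "; ".join(f"{k}: {v}/{total} ({100 * v // total}%)" for k, v in counts.items())

    # Level 2: rule distillation, the leap from individual cases to an abstract rule.
    rule = llm_summarize(
        "State the general eligibility rule implied by these cases:\n"
        + "\n".join(str(c) for c in cases)
    )
    return {"aggregate_stats": stats, "distilled_rule": rule}
```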
Summary: 3 Paradigms of Agent Continual Learning
Paradigm 1: Post-Training
- Core Finding: SFT memorizes, RL generalizes
- SFT: Solidifies formats and protocols, high sample efficiency
- RL: Learns transferable strategies, out-of-distribution robust
Paradigm 2: In-Context Learning
- Core Insight: Context ≠ Memory
- Nature: Attention is similar to RAG
- Methods: System hints, explicit summarization
Paradigm 3: Externalized Learning
3.1 Knowledge Base
- Advantages: Leverages extra compute for knowledge extraction
- Methods: Contextual retrieval, RAPTOR hierarchical summaries
3.2 Tool Generation
- Advantages: Codifies processes, efficient, reliable, composable
- Philosophy: Minimal predefinition + Maximum self-evolution (Alita)
Summary
Part I: Real-Time Interaction
Think While Listening, Speak While Thinking
❌ Problem: Serial architecture: VAD waits → ASR transcribes → LLM thinks → TTS speaks
✅ Solution:
- Perception: Streaming model produces context-aware transcription and acoustic events
- Thinking: Event-driven, can think while listening and speaking
💡 Example: Telecom Plan Query - No Awkward Silence
1 | O: "Should I order this plan?" |
Part II: Learning from Experience
Learn While Acting
❌ Problem:
- Fixed models cannot learn from experience after deployment
- Big world: business processes are dynamic & non-public
✅ Solution:
- Post-Training: Learn from interactions via RL
- In-Context: Aggregate info via system hints
- Externalized: Distill knowledge, generate tools
💡 Example: Credit Card Verification
```
1st call: ❌ Doesn't have last 4 digits of credit card
2nd call: ✅ Proactively requests it from the user upfront
```
“We want AI agents that can discover like we can, not which contain what we have discovered.”
— Richard Sutton
About Pine AI
Pine AI is an AI Agent that makes calls and uses computers to get things done. As your personal assistant, we contact customer service on your behalf to:
- 💰 Lower bills (average 20% savings on telecom, utilities)
- ❌ Cancel subscriptions
- 📋 File complaints
- 💵 Get compensation & refunds
- ✈️ Travel assistance
Results:
- Average Time Saved: 270 min
- Success Rate: 93%
- Saved for Consumers: $3M+