[This post is based on my invited talk at FAISys’25 (The 1st Frontier AI Systems Workshop).]

View Talk Slides (HTML)

Slides Source Code

Hello everyone, I’m honored to speak at FAISys’25 (The 1st Frontier AI Systems Workshop). Today I’m presenting “Self-Evolving Real-Time Agents: Think While Listening, Speak While Thinking, Learn While Acting”.

I’m Co-Founder and Chief Scientist at Pine AI. Currently, Pine AI helps users handle daily tasks through AI-powered phone calls and computer automation. We assist with bill negotiation, subscription cancellation, complaint filing, compensation claims, and more. We’ve saved over $3 million for consumers with a 93% success rate, saving each user an average of 270 minutes.

Learning from experience is a fundamental challenge in machine learning. In practical applications, today's autonomous AI agents face two core challenges: real-time interaction with their environment and learning from experience. Today I'll introduce our technical breakthroughs in both areas.

Two Fundamental Challenges

Challenge I: High Latency in Real-Time Interaction

Real-time voice agents must respond within 1 second like humans, but traditional architectures using reasoning LLMs introduce 2-10 second delays.

VAD (Voice Activity Detection) Challenges:

  • Must wait 500-800ms of continuous silence to confirm user finished
  • “Uh-huh” mistakenly triggers interruption
  • Lost acoustic information (emotions, environment)

ASR (Automatic Speech Recognition) Challenges:

  • No context leads to high error rates (emails, names, phone numbers)
  • Lack of world knowledge causes transcription errors

LLM Challenges:

  • Forced to wait, cannot think while listening
  • Cannot speak while thinking (5-10 second silence)
  • Poor turn detection (when to speak/stay silent)

Challenge II: Learning from Experience

Models are “intelligent” but not “proficient” — like top graduates lacking real-world experience.

Fixed Models Cannot Learn:

  • Cannot learn from successful traces
  • Cannot learn from unsuccessful traces
  • Parameters frozen after deployment

Big World Hypothesis:
The world is too large to pre-encode all knowledge:

  • Business processes are dynamic and non-public
  • Verification info varies by company
  • Service rules constantly change
  • Pre-trained knowledge insufficient for deployment

Part I: Real-Time Agent-Environment Interaction

Agents need real-time interaction with three types of targets:

  • Humans: Dialogue and collaboration through real-time voice
  • Digital World: Operating computers, browsing web, using mobile devices
  • Physical World: Controlling robots, interacting with real environments

Typical Voice Agent Architecture

Traditional architecture has three layers:

  1. Perception Layer: VAD + ASR

    • Input: Continuous signals (audio streams)
    • Output: Discrete events (speech_start, interrupt, laugh, speech_fragment, etc.)
    • Function: Transform continuous signals into discrete events
  2. Thinking Layer: LLM

    • Input: Discrete event stream (observations and tool results)
    • Output: Interleaved thoughts, tool calls, and output sentences
    • Function: Asynchronous processing
  3. Execution Layer: TTS

    • Input: Discrete events (text)
    • Output: Continuous actions (audio stream)
    • Function: Transform discrete events into continuous actions
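
To make the serial nature of this pipeline concrete, here is a minimal sketch (the vad / asr / llm / tts objects and their method names are illustrative placeholders, not a specific API). Each stage blocks the next, which is where the 2-10 second delays accumulate:

def traditional_turn(audio_stream, dialogue_history):
    # 1. Perception: VAD blocks until 500-800ms of silence confirms end of turn
    segment = vad.wait_for_end_of_speech(audio_stream, silence_ms=700)

    # 2. ASR transcribes the isolated segment, without dialogue context
    user_text = asr.transcribe(segment)

    # 3. Thinking: the LLM only starts reasoning after transcription finishes
    reply_text = llm.respond(dialogue_history + [user_text])

    # 4. Execution: TTS speaks only after the full reply has been generated
    return tts.synthesize(reply_text)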

Problems with Traditional VAD + ASR Architecture

Three VAD Problems:

  1. Unavoidable Latency: Must wait 500-800ms of continuous silence to confirm user finished
  2. Poor Interrupt Detection: Cannot distinguish background noise/music; “Uh-huh” mistakenly triggers interruption
  3. Low Voice Detection Accuracy: Errors in complex acoustic environments; mid-sentence pauses cause truncation; “Hello” in background noise causes unresponsiveness

Three ASR Problems:

  1. Low Accuracy Without Context: VAD cuts audio into isolated segments; cannot use context for disambiguation; high errors for emails, names, phone numbers
  2. Lack of World Knowledge: Cannot leverage common sense; low accuracy for addresses, brands, technical terms, amounts
  3. Text-Only Output Lacks Acoustic Details:
    • Lost emotions: happy, frustrated, excited
    • Lost paralinguistic info: laugh, sigh, breath
    • Lost environment: noisy, music, quiet

Solution: Streaming Voice Perception Model

We propose a Streaming Voice Perception Model to replace VAD + ASR:

Multimodal Architecture:

  1. Audio Encoder (from Whisper): Converts audio to audio tokens
  2. Qwen LLM (autoregressive): Processes audio tokens, outputs text + events

Key Advantages:

  • Streaming: Real-time output (not batch)
  • Context: Full dialogue history preserved
  • In-Context Learning: Better recognition for personal info, domain terms
  • World Knowledge: Higher accuracy for addresses, brands, amounts

Rich Output: Text + Acoustic Events

In addition to text tokens, outputs special tokens (acoustic events):

  • <speak_start> <speak_end>: Speech boundaries
  • <interrupt>: Interruption intent
  • <emotion:happy>: Emotion markers
  • <laugh> <sigh>: Paralinguistic info
  • <music>: Environmental sounds
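
A rough sketch of how a downstream agent might consume this output is shown below; the special token names follow the list above, while the decoding loop itself is illustrative rather than our exact interface:

EVENT_TOKENS = {"<speak_start>", "<speak_end>", "<interrupt>",
                "<emotion:happy>", "<laugh>", "<sigh>", "<music>"}

def perception_events(token_stream):
    # Convert the model's streaming output (text tokens + special tokens)
    # into discrete events for the thinking layer.
    text_buffer = []
    for token in token_stream:            # autoregressive, streaming
        if token in EVENT_TOKENS:
            if text_buffer:               # flush accumulated transcript first
                yield ("speech_fragment", "".join(text_buffer))
                text_buffer.clear()
            yield ("acoustic_event", token)
        else:
            text_buffer.append(token)     # context-aware transcription text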

Interactive ReAct: Flexibly Interleaved Observation-Thinking-Action

Traditional ReAct has a rigid OTA loop (Observation-Thinking-Action):

  • Fixed Loop: Must complete entire Observation-Thinking-Action sequence
  • Thinking Lost: Cannot think while listening, high latency
  • Rigid: Must wait for complete input before thinking

Interactive ReAct enables flexible OTA interleaving:

  • Think While Listening: New observations insert anytime, thinking preserved
  • Speak While Thinking: Fast response, then continue thinking
  • Intelligent Turn Detection: Decide when to speak, when to stay silent

Example Comparison:

Traditional ReAct (total ~20+ seconds):

O₁: "I want to lower my Xfinity bill to $79 per month"
T₁: (thinking 5s... then interrupted, all lost)
O₂: "and I do not want to cut off any features"
T₂: (thinking 15s...)
A₁: "Got it! Here is a $79 plan with all the features..."

Interactive ReAct (total ~6 seconds):

O₁: "I want to lower my Xfinity bill to $79 per month"
T₁: (fast think 0.5s: user utterance incomplete, wait)
T₂: (thinking 5s... then interrupted)
O₂: "and I do not want to cut off any features"
T₃: (fast think 0.5s: user wants to lower bill to $79)
A₁: "I can help you with that! Let me check the available plans"
T₄: (continuing thinking... 10s)
A₂: "Got it! Here is a $79 plan with all the features..."

Think While Listening

Key Insight: LLM is 20-100x Faster Than Human Speech - Use Gap Time to Think!

  • LLM Processing Speed:

    • Prefill (Input): 1000+ tokens/sec
    • Decode (Output): 100 tokens/sec
  • Human Voice Input/Output Speed:

    • Speaking: 5 tokens/sec (text) or 20 tokens/sec (audio tokens)
    • LLM is 20-100x faster than humans!

Example: Interview Agent with Async Tool Calls While Candidate Speaks

Candidate: "My previous role involved building distributed systems..."
Think: Distributed systems - need to assess depth. Let me search...
Tool Call: web_search("candidate distributed systems projects") (async!)

Candidate: "...we handled 10M requests/sec using Kafka and Redis" (speaking while tool runs)
Think: Kafka+Redis is solid for high throughput. Continue listening...

Tool Result: GitHub shows 3 open-source projects, 2K+ stars total
Think: Tool result confirms experience! Integrate with what candidate said...
Assistant: "That's impressive scale! (<0.5s!)
Tell me about your toughest scaling challenge..."

Advantage: Async tools + thinking while listening → no waiting, ultra-fast response
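
The pattern can be sketched as follows (asyncio-based; transcript_events, run_tool, and the agent interface are illustrative assumptions): tool calls are launched as soon as fast thinking decides they are useful, and they run while the user is still speaking:

import asyncio

async def think_while_listening(transcript_events, agent):
    pending_tools = []
    async for fragment in transcript_events:        # user is still speaking
        thought = await agent.fast_think(fragment)  # prefill is cheap: 1000+ tokens/sec
        if thought.tool_call:                       # e.g. web_search(...)
            pending_tools.append(asyncio.create_task(run_tool(thought.tool_call)))
    # By the time the user finishes, tool results are often already available
    results = await asyncio.gather(*pending_tools)
    return await agent.respond(results)             # integrate and reply in <0.5s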

Speak While Thinking

Theory: ⚡ Fast → 🐢 Slow → 🐌 Continuous Thinking Using Filler Speech

Three Phases of Thinking:

  1. ⚡ Fast (0.5s, 50 tokens): Quick judgment → immediate response
  2. 🐢 Slow (5s, 500 tokens): Deep analysis → complete answer
  3. 🐌 Continuous (interleaved thinking and speaking): Keep thinking → keep speaking

Key: Use “filler speech” to maintain conversation flow during deep thinking

Example: Interview Agent Asking Complex Question

Candidate: "I'm ready for the technical question."

Think: Complex question, need to formulate carefully
Assistant: "Let me ask you a system design question." (⚡ 0.5s)

Think: Need to cover scalability, consistency, latency... (🐢 5s)
Assistant: "Imagine you're building a global CDN."

Think: Continue - specify the cache invalidation challenge...
Assistant: "How would you handle cache invalidation across
100+ edge servers when content is updated?"

Result: Question unfolds naturally sentence-by-sentence, no awkward silence
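
A sketch of the fast → slow → continuous pattern is shown below; the token budgets follow the numbers above, and the agent/speaker interfaces are illustrative assumptions:

async def speak_while_thinking(observation, agent, speaker):
    # Phase 1: ⚡ fast think (~0.5s, ~50 tokens): immediate first sentence / filler
    quick = await agent.think(observation, budget_tokens=50)
    await speaker.say(quick.opening_sentence)       # "Let me ask you a system design question."

    # Phase 2: 🐢 slow think (~5s, ~500 tokens): substantive content
    deep = await agent.think(observation, budget_tokens=500)
    await speaker.say(deep.next_sentence)

    # Phase 3: 🐌 continuous: keep interleaving thinking and speaking
    while not deep.finished:
        deep = await agent.continue_thinking(deep)
        await speaker.say(deep.next_sentence)       # question unfolds sentence by sentence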

Future: Three Stages of AI Agent-Environment Interaction

Real-time asynchronous interaction with environment is fundamental to agents.

🗣️ Stage 1: Voice

  • Input: Voice
  • Output: Voice
  • Data Rate: 15-50 token/s
  • Latency: <500ms
  • Challenge: Fast-slow thinking balance
  • Solution: Interactive ReAct

💻 Stage 2: Computer Use

  • Input: Visual (screenshots)
  • Output: Mouse/keyboard actions
  • Data Rate: ~2K token/frame
  • Latency: <1 second
  • Challenge: Precise action execution
  • Solution: VLA models + RL

🤖 Stage 3: Physical World

  • Input: Vision+Voice+Tactile
  • Output: Voice+Joint actions
  • Data Rate: ~20K token/s
  • Latency: <100ms
  • Challenge: Real-time control
  • Solution: VLA + World Models

Key Insight: Complexity increases (data rate ↑, latency ↓), but architectural solutions transfer across stages

Part II: Agents Learning from Experience

“We want AI agents that can discover like we can, not which contain what we have discovered.”
— Richard Sutton

Why Agents Must Learn from Experience: From “Intelligent” to “Proficient”

🎓 SOTA Models ≈ Top Graduates

Knowledgeable: Master vast amounts of general knowledge

Lack Experience: Underperform vs. experienced professionals on specialized tasks (e.g., accounting, tax filing)

💼 Real Challenges in Pine AI

  1. 🔑 Verification Info

    • 1st call: learns credit card last 4 digits required
    • 2nd call: should proactively request it
  2. 📋 Service Procedures

    • 1st cancellation: told to fill online form instead of phone call
    • 2nd cancellation: should directly fill online form
  3. 🎯 Service Rules

    • Which discounts apply? (veterans, 2-year loyalty, etc.)
  4. 💰 Price Estimation

    • Is $60/month for 3Gbps broadband high or low? Room to negotiate?

Core Problem: Many business processes are dynamic and non-public. Simply improving the base model’s general capabilities cannot solve these “experience-based” problems.

Building Self-Evolving Agents

Making agents learn from experience through three paradigms:

  1. Paradigm 1: Post-Training
  2. Paradigm 2: In-Context Learning
  3. Paradigm 3: Externalized Learning

Method 1: Post-Training - SFT Memorizes, RL Generalizes

📚 Supervised Fine-Tuning (SFT)

Advantages:

  • Extremely sample-efficient (thousands suffice)
  • Quickly solidifies formats and protocols
  • Stable training, fast convergence

Limitations:

  • Memorizes surface patterns
  • Cliff-like degradation on out-of-distribution
  • Hard to learn transferable strategies

🎯 Reinforcement Learning (RL)

Advantages:

  • Learns transferable policy representations
  • Robust in out-of-distribution scenarios
  • Discovers new strategies beyond training data

Limitations:

  • Low sample efficiency (100x more data and compute)
  • High training cost and time
  • Requires verifiable reward signals

💡 Engineering Practice: Form Before Function

  • SFT Phase: Establish format stability, ensure parseable outputs
  • RL Phase: Break through generalization boundaries on stable foundation
  • Key Balance: Train SFT until “format stable, capabilities emerging”
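
As a rough sketch of this schedule (thresholds and helper names such as sft_step, rl_step, and format_validity are assumptions for illustration, not our exact recipe):

def train_agent(model, sft_data, rl_env):
    # Phase 1: SFT until outputs are reliably parseable ("format stable, capabilities emerging")
    for batch in sft_data:
        sft_step(model, batch)
        if format_validity(model) > 0.99:    # e.g. fraction of parseable tool calls
            break

    # Phase 2: RL on top of the stable format to push generalization
    for episode in rl_env.episodes():
        trajectory = model.rollout(episode)
        reward = rl_env.verify(trajectory)   # requires a verifiable reward signal
        rl_step(model, trajectory, reward)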

Improving Sample Efficiency (I): On-Policy Distillation

Comparing Three Training Approaches:

SFT (Supervised Fine-Tuning)

  • Sampling: ❌ Off-policy (teacher’s trajectories)
  • Reward: ✅ Dense (token-by-token)
  • Problem: Compounding errors in student’s states

RL (Reinforcement Learning)

  • Sampling: ✅ On-policy (student’s rollouts)
  • Reward: ❌ Sparse (only final outcome)
  • Problem: One signal per episode, inefficient

✨ On-Policy Distillation

  • Sampling: ✅ On-policy (student’s trajectories)
  • Reward: ✅ Dense (teacher grades each token)
  • Best of both worlds!

🔧 How It Works

# Sample a full trajectory from the student (on-policy).
# (student.generate / .logprobs are pseudocode helpers, not a specific API.)
trajectory = student.generate(prompt)

# Teacher grades EVERY token of the student's own rollout
student_logprobs = student.logprobs(prompt, trajectory)  # log p_student(token | ctx)
teacher_logprobs = teacher.logprobs(prompt, trajectory)  # log p_teacher(token | ctx)

# Minimize reverse KL(student || teacher): on the student's own samples this is
# estimated per token as log p_student - log p_teacher, giving a dense signal
loss = (student_logprobs - teacher_logprobs).mean()

🎯 Key Benefits:

  • 10x more efficient than RL
  • Student learns to recover from its own mistakes
  • Can reuse training data (multi-epoch)
  • Enables continual learning

Improving Sample Efficiency (II): Feedback-Guided Sampling

❌ Traditional GRPO/DAPO

Process:

  • Generate N independent rollouts
  • Later attempts repeat same errors

Example:

Rollout 1: Requires SSN → ❌ Failure
Rollout 2: Requires SSN → ❌ Failure
Rollout 3: Requires SSN → ❌ Failure
...(wasting environment feedback)

✅ Feedback-Guided Sampling

Sequential Process:

  • 1st rollout: From original prompt
  • 2nd rollout: Prompt + 1st feedback in context
  • Nth rollout: Accumulate feedback from N-1 rollouts

Example:

Rollout 1: Requires SSN → ❌ Failure
Rollout 2: [Knows SSN] Prepared → ✅ Success
Rollout 3: [Knows SSN] Prepared → ✅ Success
...(rapid adaptation within batch!)

📈 Result: More high-quality samples per batch

This is essentially an online learning process:

  • Externalized Learning: Feedback accumulated in knowledge base after each rollout
  • Online RL: Agent adapts its policy based on accumulated feedback within the batch
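
A minimal sketch of this sequential sampling loop (extract_feedback and the prompt format are illustrative assumptions):

def feedback_guided_rollouts(policy, env, prompt, n_rollouts):
    feedback_notes, samples = [], []
    for _ in range(n_rollouts):
        # 1st rollout sees the original prompt; later rollouts see accumulated feedback
        augmented_prompt = prompt
        if feedback_notes:
            augmented_prompt += "\n\nFeedback from earlier attempts:\n" + "\n".join(feedback_notes)
        trajectory = policy.rollout(env, augmented_prompt)
        samples.append(trajectory)
        note = extract_feedback(trajectory)   # e.g. "This merchant requires the SSN"
        if note:
            feedback_notes.append(note)       # later rollouts avoid repeating the same error
    return samples                            # more high-quality samples per batch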

Method 2: In-Context Learning

⚠️ Common Misconception

“With long context, just put all history in and let the model automatically reason”

This is a serious misconception about context capabilities!

🔍 What Context Really Does

  • Nature: Retrieval, NOT reasoning engine
  • Mechanism: Key-value similarity matching (like RAG)
  • Good at: Finding relevant information
  • Poor at: Statistical aggregation & counting

⚠️ Real Case: Three-Call Limit

  • Rule: Max 3 calls to same merchant
  • Context: Trajectory has multiple Xfinity calls
  • Problem:
    • Must scan entire trajectory to count
    • Easily miscounts → makes 4th call
    • Even if correct, wastes reasoning tokens
  • Cost: O(trajectory length) per decision

System Hint: Making Implicit State Explicit

Solution: Pre-aggregate information → Reduce O(n) to O(1) context lookups

💡 How System Hint Works

<system_hint>
Tool call summary:
- 'phone_call' called 3 times
- Xfinity: 3 times (limit reached)

Constraint check:
- Cannot call Xfinity again
</system_hint>

✅ Benefit:

  • Complexity: O(n) → O(1)
  • Model uses aggregated info directly
  • No scanning or counting needed
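
A sketch of how such a hint can be generated from the tool-call history (the three-call limit and hint format follow the example above; the data layout is an assumption):

from collections import Counter

def build_system_hint(tool_calls, call_limit=3):
    calls_per_merchant = Counter(call["merchant"] for call in tool_calls
                                 if call["tool"] == "phone_call")
    lines = ["<system_hint>", "Tool call summary:"]
    for merchant, count in calls_per_merchant.items():
        note = " (limit reached)" if count >= call_limit else ""
        lines.append(f"- {merchant}: {count} times{note}")
    lines.append("Constraint check:")
    for merchant, count in calls_per_merchant.items():
        if count >= call_limit:
            lines.append(f"- Cannot call {merchant} again")
    lines.append("</system_hint>")
    return "\n".join(lines)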

📋 Four Types of System Hints

  1. Task Planning

    TODO: [✅] Call customer service
    [ ] Call retention dept
  2. Side-Channel Info

    [2025-06-25 11:00:20] User message
  3. Environment State

    Current dir: /home/ubuntu
    OS: Ubuntu 24.04
  4. LLM-Generated Summary

    Conversation summary:
    User wants $79 Xfinity plan with all current features

Method 3: Externalized Learning (Knowledge Base)

🚫 NEVER Store Raw Cases Directly in Knowledge Base

Storing raw dialogues/cases without distillation leads to incomplete retrieval and wrong conclusions.

🐱 Case 1: Cat Counting Problem

Scenario: 100 cases: 90 black cats, 10 white cats (all separate)
Question: “What’s the ratio?”

Raw Storage Problem:

  • Top-k=20 retrieves partial cases only
  • Incomplete sample → Wrong inference

Distilled Approach:

"Total 100 cats:
90 black (90%), 10 white (10%)"

→ Single retrieval, accurate!

💼 Case 2: Discount Rule Error

Scenario: 3 cases: Veteran John ✅, Doctor Sarah ✅, Teacher Mike ❌
Question: “I’m a nurse, discount?”

Raw Storage Problem:

  • “Nurse” ≈ “Doctor” → Retrieves Sarah only
  • John's and Mike's cases missed → Wrong inference

Distilled Approach:

"Xfinity discount: ONLY
veterans & doctors qualify"

→ Complete rule, correct answer!

Active Knowledge Distillation: Compression is Understanding

Core Principle: Invest extra compute now (LLM summarization) → Save reasoning tokens later

💡 Why Distillation?

Raw trajectory (3 calls):

10:00 Call Xfinity (billing)
10:30 Call Xfinity (transfer)
11:00 Call Xfinity (negotiate)

Model must scan O(n) to count

After distillation:

"Called Xfinity 3 times (limit)"

O(1) lookup, instant recognition

📊 Three Levels of Knowledge Distillation

  1. Statistical Aggregation

    • 100 cases → “90% black, 10% white”
    • Reduce density, improve retrieval
  2. Rule Distillation

    • 3 cases → “Only veterans & doctors”
    • Leap from cases to abstract rules
  3. Structured Knowledge Extraction

    • RAPTOR: Tree summaries
    • GraphRAG: Entity networks
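
A minimal sketch of this distillation step before storage (the prompt wording and the llm/knowledge_base interfaces are illustrative assumptions):

DISTILL_PROMPT = """You maintain an agent's knowledge base.
From the raw cases below, produce:
1. Statistical aggregates over ALL cases (counts, ratios).
2. Complete, explicit rules (e.g. "ONLY veterans & doctors qualify").
Do NOT copy the raw cases themselves.

Raw cases:
{cases}"""

def distill_and_store(raw_cases, llm, knowledge_base):
    summary = llm.complete(DISTILL_PROMPT.format(cases="\n".join(raw_cases)))
    knowledge_base.add(summary)   # one compact entry instead of N raw cases
    return summary                # e.g. "Total 100 cats: 90 black (90%), 10 white (10%)"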

Summary: 3 Paradigms of Agent Continual Learning

Paradigm 1: Post-Training

  • Core Finding: SFT memorizes, RL generalizes
  • SFT: Solidifies formats and protocols, high sample efficiency
  • RL: Learns transferable strategies, out-of-distribution robust

Paradigm 2: In-Context Learning

  • Core Insight: Context ≠ Memory
  • Nature: Attention is similar to RAG
  • Methods: System hints, explicit summarization

Paradigm 3: Externalized Learning

3.1 Knowledge Base

  • Advantages: Leverages extra compute for knowledge extraction
  • Methods: Contextual retrieval, RAPTOR hierarchical summaries

3.2 Tool Generation

  • Advantages: Codifies processes, efficient, reliable, composable
  • Philosophy: Minimal predefinition + Maximum self-evolution (Alita)

Summary

Part I: Real-Time Interaction

Think While Listening, Speak While Thinking

Problem: Serial architecture: VAD waits → ASR transcribes → LLM thinks → TTS speaks

Solution:

  • Perception: Streaming model produces context-aware transcription and acoustic events
  • Thinking: Event-driven, can think while listening and speaking

💡 Example: Telecom Plan Query - No Awkward Silence

O: "Should I order this plan?"
T₁: (fast 0.5s) Need more time
A₁: "Let me check the details..."
T₂: (slow 5s) Analyze plan...
A₂: "Yes, saves $30/month!"

Part II: Learning from Experience

Learn While Acting

Problem:

  • Fixed models cannot learn from experience after deployment
  • Big world: business processes are dynamic & non-public

Solution:

  • Post-Training: Learn from interactions via RL
  • In-Context: Aggregate info via system hints
  • Externalized: Distill knowledge, generate tools

💡 Example: Credit Card Verification

1st call: ❌ Doesn't have last 4 digits of credit card
Learn: Store "Xfinity needs last 4..."
2nd call: ✅ Proactively requests it
→ Experience-based improvement with high sample efficiency

“We want AI agents that can discover like we can, not which contain what we have discovered.”
— Richard Sutton


About Pine AI

Pine AI is an AI Agent that makes calls and uses computers to get things done. As your personal assistant, we contact customer service on your behalf to:

  • 💰 Lower bills (average 20% savings on telecom, utilities)
  • ❌ Cancel subscriptions
  • 📋 File complaints
  • 💵 Get compensation & refunds
  • ✈️ Travel assistance

Results:

  • Average Time Saved: 270 min
  • Success Rate: 93%
  • Saved for Consumers: $3M+

🔗 Learn more: 19pine.ai


View Complete Talk Slides
