It was my great honor, at the invitation of Professor Jiaxing Zhang, to give an academic talk titled “Two Clouds over Agents: Real-time Interaction with the Environment and Learning from Experience” at the Lion Rock Artificial Intelligence Laboratory on September 4. Today I’m sharing the slides and video of this talk for your reference and discussion.

Talk materials

Talk summary

In 1900, Lord Kelvin said in a lecture: “The building of physics is almost complete; there are only two small clouds…” Those two small clouds later triggered the revolutions of relativity and quantum mechanics. Today, the AI agent field faces similar “two clouds.”

The first cloud: the challenge of real-time interaction

Current AI agents face severe latency when interacting with the environment in real time:

The dilemma of voice interaction

  • Serial processing vs real-time needs: the agent must wait for the user to finish speaking before it can think, and finish thinking before it can speak
  • Fast vs slow thinking dilemma: deep reasoning takes 10+ seconds (users lose patience), while quick responses are error-prone
  • Technical bottlenecks: every stage of the pipeline (VAD, ASR, LLM reasoning, TTS synthesis) waits on the one before it
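The cost of a serial pipeline can be made concrete with a little arithmetic. A minimal sketch, using assumed per-stage latencies (not measurements from any real system): in a serial pipeline the user waits for the sum of all stages, while in a streaming pipeline the stages overlap and latency is dominated by the slowest stage.

```python
# Illustrative per-stage latencies in milliseconds (assumed values).
STAGES_MS = {"VAD": 300, "ASR": 500, "LLM": 2000, "TTS": 700}

def serial_latency(stages):
    """Serial pipeline: each stage waits for the previous one to finish,
    so response latency is the sum of all stage latencies."""
    return sum(stages.values())

def streaming_latency(stages):
    """Fully streaming pipeline: stages overlap, so latency is dominated
    by the slowest stage (per-stage startup costs ignored here)."""
    return max(stages.values())

print(serial_latency(STAGES_MS))     # 3500 ms before the user hears anything
print(streaming_latency(STAGES_MS))  # 2000 ms when stages overlap
```

The point of the toy model: making any single stage faster helps a serial pipeline only linearly, whereas overlapping the stages changes the shape of the latency entirely.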

The “last mile” challenge in GUI operation

  • Agents operate a computer 3–5× slower than humans
  • Every click requires a fresh screenshot and thinking (3–4 s latency)
  • Moravec’s paradox: the model “knows” what to do but “can’t do it”

Our solution: the SEAL architecture

SEAL (Streaming, Event-driven Agent Loop) is the architecture we propose. It abstracts all interactions as asynchronous event streams:

  1. Perception

    • Convert continuous signals (speech, GUI) into discrete events
    • Streaming speech perception model replaces VAD + ASR
    • Output rich acoustic events (interruptions, emotions, laughter, etc.)
  2. Thinking

    • Interactive ReAct: break the rigid “observe–think–act” loop
    • Realize thinking-while-listening and speaking-while-thinking
    • Fast thinking (0.5 s) → slow thinking (5 s) → continuous thinking
  3. Execution

    • Train end-to-end VLA models
    • Generate natural speech pauses and fillers
    • Produce human-like mouse movement trajectories
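To make the event-driven idea concrete, here is a minimal sketch of such a loop in Python's asyncio. All names and event kinds are illustrative, not the actual SEAL implementation: perception pushes discrete events onto a queue, and the agent reacts to partial input immediately (fast thinking) instead of waiting for a complete observe–think–act cycle.

```python
import asyncio
from dataclasses import dataclass

@dataclass
class Event:
    kind: str      # e.g. "speech_partial", "speech_final", "gui_change"
    payload: str

async def perception(queue):
    """Perception: convert continuous input into discrete events.
    Stubbed here with two canned speech events."""
    for ev in [Event("speech_partial", "I'd like to cancel"),
               Event("speech_final", "I'd like to cancel my subscription")]:
        await queue.put(ev)
    await queue.put(None)  # end-of-stream sentinel

async def agent_loop(queue, log):
    """Thinking: react to events as they arrive, so fast thinking on
    partial input can overlap with the user still speaking."""
    while (ev := await queue.get()) is not None:
        if ev.kind == "speech_partial":
            log.append(f"fast-think on partial: {ev.payload!r}")
        elif ev.kind == "speech_final":
            log.append(f"slow-think and respond to: {ev.payload!r}")

async def main():
    queue, log = asyncio.Queue(), []
    await asyncio.gather(perception(queue), agent_loop(queue, log))
    return log

log = asyncio.run(main())
```

Because perception and thinking are independent coroutines sharing a queue, new events (such as an interruption) can reach the agent mid-thought, which is what the rigid ReAct loop cannot do.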

The second cloud: learning from experience

Today’s agents start from scratch on every task, unable to accumulate domain knowledge or improve task proficiency.

The challenge of moving from “smart” to “skilled”

  • SOTA models ≈ top graduates (knowledgeable but lacking experience)
  • Business processes are dynamic and non-public
  • Improving the base model alone cannot solve the “experience” problem

Three learning paradigms

1. Post-training

  • Method: parameter updates via RL
  • Value: solidify experience into parameters
  • Example: Kimi K2’s Model as Agent

2. In-context Learning

  • Method: leverage the Transformer’s attention mechanism
  • Breakthroughs:
    • DeepSeek MLA: 16× KV cache compression
    • Sparse attention: turn the KV cache into a vector database
    • MiniMax-01: hybrid architecture of linear attention + softmax attention
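Why KV cache compression matters for in-context learning is easiest to see with back-of-the-envelope arithmetic. A minimal sketch with an assumed model shape (not any specific model's real configuration), applying a 16× compression factor of the kind MLA reports:

```python
def kv_cache_bytes(layers, seq_len, heads, head_dim, bytes_per_elem=2):
    """Standard multi-head attention caches one key and one value vector
    per layer, per token, per head (fp16: 2 bytes per element)."""
    return 2 * layers * seq_len * heads * head_dim * bytes_per_elem

# Illustrative shape: 32 layers, 128k context, 32 heads of dim 128 (assumed).
full = kv_cache_bytes(layers=32, seq_len=128_000, heads=32, head_dim=128)
mla = full / 16  # K/V projected into a small shared latent, ~16x smaller

print(f"{full / 2**30:.1f} GiB -> {mla / 2**30:.1f} GiB")  # 62.5 GiB -> 3.9 GiB
```

At these (assumed) dimensions the uncompressed cache alone would not fit on a single accelerator, which is why long-context "memory" depends as much on cache architecture as on model quality.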

3. Externalized Learning (core innovation)

  • Knowledge base: persistent experience storage, no retraining needed

    • Contextual retrieval: add context to each document chunk
    • LLM automated summarization: turn compute into a scalable knowledge base
  • Tool generation: agent self-evolution

    • Smart RPA: summarize repetitive operations into tools (checking the weather reduced from 47 s to 10 s)
    • Automatic diagnosis: automatically triage issues from production logs
    • MCP-Zero: proactive tool discovery, 98% token savings
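The contextual-retrieval step above can be sketched in a few lines. This is an illustration of the technique, not Pine AI's implementation; `summarize` stands in for an LLM call, replaced here by a trivial lambda so the example runs:

```python
def contextualize_chunks(document, chunks, summarize):
    """Contextual retrieval: before indexing, prepend to each chunk a short
    LLM-written note situating it in the whole document, so a chunk like
    "the fee is $20" still retrieves well when seen out of context.

    `summarize(document, chunk)` is a stand-in for an LLM call."""
    return [f"{summarize(document, chunk)}\n\n{chunk}" for chunk in chunks]

# Toy stand-in for the LLM: simply name the source document.
doc = "Acme billing policy"
chunks = ["The late fee is $20.", "Refunds take 5 business days."]
indexed = contextualize_chunks(doc, chunks, lambda d, c: f"[From: {d}]")
print(indexed[0])
```

This is also where "turn compute into a scalable knowledge base" comes from: the summarization cost is paid once at indexing time, and every later retrieval benefits.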

Extending the Scaling Law

“The two methods that seem to scale arbitrarily … are search and learning.” — Rich Sutton, The Bitter Lesson

Externalized learning breaks the limits of model parameters:

  • Search → external knowledge bases and tool repositories
  • Learning → LLMs summarize experience into knowledge and code
  • Extend the boundary of the Scaling Law to the external ecosystem

Key insights

  1. The essence of real-time interaction: not making LLMs faster, but enabling them to “think while listening and speak while thinking” like humans
  2. The essence of learning: not stuffing all knowledge into parameters, but building a reliable external system of knowledge and tools
  3. The future of agents: from containers of knowledge to engines of discovery

Pine AI in practice

At Pine AI we are putting these ideas into practice so that AI agents can:

  • Interact with the world in real time (voice calls, GUI operations)
  • Learn from experience (knowledge accumulation, tool generation)
  • Truly solve problems and get things done for users

If you’re interested in building SOTA autonomous AI agents, you’re welcome to join our Pine AI team. We’re looking for full-stack engineers who enjoy co-programming with AI, love hands-on problem solving, and have solid engineering skills. Contact: boj@19pine.ai
