From Memory to Cognition: How AI Agents Can Deliver Truly Personalized Services
Contents
- 01 | The Importance and Challenges of Memory - Personalization Value · Three Capability Levels
- 02 | Representation of Memory - Notes · JSON Cards
- 03 | Retrieval of Memory - RAG · Context Awareness
- 04 | Evaluation of Memory - Rubric · LLM Judge
- 05 | Frontier Research - ReasoningBank
Starting from personalization needs → Understanding memory challenges → Designing storage schemes → Implementing intelligent retrieval → Scientific evaluation and iteration
Part 1: The Importance and Challenges of Memory
Personalization is a real problem, and the core competitiveness of the future
The evolution of recommender systems
- Traditional media: One People’s Daily, everyone sees the same content
- The ByteDance revolution: Everyone sees different content — “Everyone lives in a different world and has different values”
- Conclusion: Personalized products are more in line with human nature → Users are more willing to use them
The future of AI is the same
There should not be only one Universal Value
- It should adapt to each user’s values and preferences
- Value differences in details are huge
- Personalization is the core competitiveness of AI products
Key insight: Just as recommender systems improve user experience through personalized content, AI Agents also need personalized memory to understand and serve each unique user.
Technical difficulty: Remembering facts vs. learning preferences
Factual Information
Relatively easy
- Birthday, address, card number
- Work information, contact details
- Just remember them, no ambiguity
We are already doing this pretty well
Examples:
- “My membership number is 12345” ✅
- “My birthday is January 1, 1990” ✅
- “I live in Haidian District, Beijing” ✅
User Preference
Very hard, requires solving multiple challenges:
1. Strong context dependence
- User requires academic format when writing papers
- Does not mean travel guides should also be academic
- AI easily over-generalizes preferences
2. One-off behavior vs. long-term preference
- “Yesterday I ordered Sichuan food” ≠ “User likes Sichuan cuisine”
- Might just be a friend’s preference, or a one-time whim
3. Requires extremely fine-grained evaluation
- Trade-offs must be grounded in data and tests
- Cannot rely on gut feeling
Alignment of personalization value
Analogy: Success experience of recommender systems
Traditional approach: Universal human values
- LLMs are aligned to “universal” values
- But do we really have universally agreed human values?
- In details, value differences are huge
What AI should do is
- Not just a single universal value
- Adapt to each user’s values and preferences
- Recognize that value differences are huge
From recommendation to alignment: The evolution of AI
Just as ByteDance believes that “everyone lives in a different world and has different values”, AI Agents also need to:
- Understand individual differences: Each user has unique values and preferences
- Adapt dynamically: Continuously adjust based on user behavior and feedback
- Be context-aware: The same user has different needs in different scenarios
User memory is more than logging conversations
The essence of memory
Just like understanding friends
- We don’t remember every sentence they say
- We build a mental model of who they are
- Their preferences, habits, values
Core analogy: The goal of a user memory system is to build a model of the user that is as concise and powerful as possible, capable of explaining the user’s past behavior and predicting the user’s future needs.
Comparison of two types of memory
| Type | Difficulty | Example |
|---|---|---|
| Facts | Simple | Birthday, address, card number |
| Preferences | Complex | Context-dependent, constantly evolving |
Learning user preferences is much harder than storing factual information
- Context-dependent: Academic writing style ≠ travel guide style
- One-off vs. long-term: “Ordered Sichuan food yesterday” ≠ “Likes spicy food”
- Over-generalization risk: AI easily extrapolates incorrectly
Three levels of memory capability
Level 1: Basic recall
Store and retrieve explicit user information
- “My membership number is 12345” → Accurate recall
- Foundation of reliability
Level 2: Cross-session retrieval
Connect information across different conversations
- Disambiguation: “Schedule maintenance for my car” → Which of the two cars?
- Understanding composite events: “Cancel my trip to Los Angeles” → Find flights + hotel
Level 3: Proactive service
Anticipate needs without explicit requests
- Booking an international flight? → Check if passport is near expiry
- The highest manifestation of intelligence
Our evaluation framework
Based on these three levels, we designed 60 test cases (20 per level), each case containing 1–3 sessions, each session about 50 turns of dialogue with a large amount of factual detail. We use an LLM-as-a-judge + Rubric approach to score the agent’s responses on multiple dimensions.
Level 1: Evaluation of basic recall
Scenario: Bank account setup
Accurately store and retrieve structured information provided by the user in a long conversation.
Test case
- 47-minute conversation about opening a bank account
- Includes name, address, SSN, account numbers, etc.
- A large number of details across 50+ turns of dialogue
Final question: “What’s my checking account number? I need to set up direct deposit.”
Expected answer: Accurately provide account number 4429853327, and preferably also the routing number 123006800
Conversation excerpt
- user: I live at 1847 Maple Street, …
Key: Precisely retrieve a specific account number from 50+ turns of dialogue
Level 2: Cross-session retrieval — Disambiguation scenario
Scenario: Service appointment for multiple cars
The user mentions owning multiple cars in different sessions; when the request is ambiguous, the Agent needs to proactively disambiguate.
Session 1: Adding a new car to insurance
User William Chen adds a 2023 Tesla Model 3 to insurance, existing 2019 Honda Accord already on policy
Session 2: Scheduling car maintenance
A 30K maintenance service for the Honda Accord is scheduled for November 24 at 8AM
Final question: “I need to schedule service for my car.”
Expected behavior: Detect ambiguity, list the status of both cars, and ask which one specifically
Conversation excerpt
# Session 1 - Insurance …
Key: Discover that the user has two cars, Honda already has an appointment, Tesla does not
Level 2: Composite events — Cascading effect of trip cancellation
Scenario: One big event contains many small events
When the user says “Cancel my trip to Los Angeles”, the system needs to understand that “trip” is a composite event containing multiple independent bookings.
Associated bookings that need to be found automatically
- Flight to Los Angeles
- Hotel booking in Los Angeles
- Possible car rental
- Event tickets, restaurant reservations, etc.
Final question: “Cancel my LA trip next week.”
Expected behavior: Automatically associate all related bookings, provide unified cancellation options, and explain the cancellation policies and refund status of each item
Information scattered across three independent sessions
# Session 1 - Flight booking (Delta) …
Key: The three sessions interact with different service providers, but all belong to the same “Los Angeles trip”
Level 2: Overwrite handling — Multiple modifications to an order
Scenario: A constantly modified custom furniture order
The user custom-orders a dining set, but repeatedly changes the requirements during production. The Agent needs to track all changes and keep only the currently valid specifications.
Order change history
- August 20: Ordered walnut dining table + 8 gray chairs + 1 bench
- September 5: Chair color changed to sage green, 2 changed to armchairs
- October 28: Green fabric discontinued, user needs to choose a new color
Final question: “What’s the current status of my dining set order?”
Expected behavior: Return only the latest status: waiting for fabric selection, delivery date pending; do not confuse with historical specifications
Key changes across three sessions
# Session 1 - Initial order (August 20) …
Key: The Agent must recognize that old information has been overwritten, and only the latest session’s status is valid
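The overwrite logic can be sketched as a replay of timestamped change events, so only the latest value of each field survives. The `current_state` helper and field names below are illustrative, not from any real system.

```python
# Hypothetical sketch: resolve an order's current specs by replaying
# timestamped change events; later changes overwrite earlier fields.
def current_state(initial, changes):
    """Apply change dicts in chronological order; later keys win."""
    state = dict(initial)
    for change in sorted(changes, key=lambda c: c["date"]):
        state.update({k: v for k, v in change.items() if k != "date"})
    return state

order = {"table": "walnut", "chairs": "8 gray", "bench": 1}
changes = [
    {"date": "2024-09-05", "chairs": "6 sage green + 2 sage green armchairs"},
    {"date": "2024-10-28", "chairs": "awaiting new fabric choice"},  # green discontinued
]

print(current_state(order, changes))
```

The point is that the memory store keeps the full history (for auditability), but what the agent reports is the resolved latest state.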
Level 3: Proactive service — Passport expiry warning
Scenario: International travel coordination
The user mentions different pieces of information in multiple independent sessions but never connects them. The Agent needs to proactively connect these scattered pieces and reason about potential risks.
Risks the AI needs to infer
- Passport expiry date: February 18, 2025 (mentioned in Session 1)
- Return date: January 22, 2025 (mentioned in Session 2)
- Japan requires that a passport be valid for ≥ 6 months at the time of entry!
Final question: “I’m finalizing my trip to Tokyo in January. Is there anything I need to take care of before I go?”
Expected behavior: Proactively connect passport validity with travel dates and remind the user that the passport may not meet Japan’s entry requirements
Excerpts from three independent sessions
# Session 1 - June: passport renewal address (USPS) …
Key: The relationship between passport and travel is never discussed in any of the three sessions; the AI needs to reason this out by itself!
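A minimal sketch of the check the agent has to perform, using the dates from this scenario; `passport_risk` is a hypothetical helper, and the 6-month rule is approximated with 30-day months.

```python
from datetime import date, timedelta

def passport_risk(expiry: date, travel_date: date, months_required: int = 6) -> bool:
    """True if the passport expires less than `months_required` months
    after the travel date (months approximated as 30 days)."""
    return expiry < travel_date + timedelta(days=30 * months_required)

# Facts scattered across independent sessions:
passport_expiry = date(2025, 2, 18)   # Session 1 (passport renewal)
return_date = date(2025, 1, 22)       # Session 2 (flight booking)

if passport_risk(passport_expiry, return_date):
    print("Warning: passport may not meet Japan's 6-month validity rule.")
```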
Level 3: Proactive Service — Integrating Device Damage Protection
Scenario: Phone screen shattered
The user says “My phone screen just cracked.” The Agent needs to proactively integrate different protection information scattered across sessions and find the best solution.
Protection sources the AI needs to infer
- Manufacturer warranty (Apple 1-year, until Feb 2025)
- Credit card protection (Chase Sapphire, $50 deductible)
- Carrier insurance (user declined, not applicable)
Final question: “My phone screen just cracked. What are my options?”
Expected behavior: Proactively list all protection options, compare costs and processes, and recommend the optimal plan (Chase credit card protection)
Information scattered across multiple independent sessions
# Session 1 - February: phone purchase (Best Buy) …
Key: The AI needs to integrate three sources and infer that Chase protection is the optimal choice ($50 deductible vs Apple $379)
Level 3: Proactive Service — Tax Season Preparation
Scenario: Proactive reminder before tax season
When the user mentions “preparing my taxes” in early January, the Agent should proactively aggregate tax-related information scattered across different conversations throughout the year.
Historical information the AI needs to proactively associate
- February: Mortgage application (interest $31,000, points $7,500)
- June: Stock sale (Apple, capital gains $33,000)
- August: Charitable donation (Microsoft stock $25,200)
- October: Side consulting income ($18,000, has a home office)
Final question: “I’m preparing my taxes. What should I know?”
Expected behavior: Proactively list all tax-related items, remind about required forms, and flag commonly missed deductions
Tax information scattered across year-long conversations
# Session 1 - February: mortgage (First National Bank) …
Key: Four sessions spanning the whole year; the AI needs to proactively aggregate and remind the user to prepare all relevant forms
Level 3: Core Capabilities of Proactive Service
Fundamental differences from Levels 1 and 2
| Level | Trigger mode | Information source |
|---|---|---|
| L1 | User asks directly | Single session |
| L2 | User gives vague request | Multiple sessions |
| L3 | No explicit request needed | Across time and domains |
Three typical scenarios recap
- Passport alert: Flight ticket + passport validity → entry risk
- Device protection: Purchase + credit card + insurance → optimal plan
- Tax prep: Year-long transaction records → complete tax checklist
Key technical challenges
Time span: Need to connect conversations from months or even years ago and identify still-relevant information
Cross-domain: Associate and reason over information from different providers and scenarios
Proactive reasoning: User doesn’t explicitly request it, but the Agent should proactively discover and remind
Priority judgment: Identify truly urgent and important issues, avoiding information overload
Part 2: Representation of Memory
Memory Representation (I): Natural Language
Simple Notes mode
Minimalist design
Each memory is an atomic factual statement:
- "User email: john@example.com"
- “Preferred programming language: Python”
| Advantages | Disadvantages |
|---|---|
| Extremely low cognitive load | Loss of information associations |
| O(1) operational complexity | Semantic fragmentation |
Enhanced Notes mode
Full context preserved
Paragraph-style storage of full background:
“The user works at TechCorp as a senior software engineer, has focused on machine learning for three years, and is currently leading a recommendation system project.”
| Advantages | Disadvantages |
|---|---|
| Semantic completeness | Redundant storage |
| Narrative structure preserved | Complex updates |
Common trait: Uses natural language as the main carrier, suitable for human reading and understanding, but lacks machine-operable structured information.
Memory Representation (II): Structured
JSON Cards mode
Structured organization
Three-layer nesting: category → subcategory → key-value pairs
{ … }
| Advantages | Disadvantages |
|---|---|
| Partial updates | Rigid structure |
| Extensible | Hard to classify multi-dimensional information |
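A minimal illustration of the three-layer nesting, written as a Python dict so the partial-update property is visible; all category and field names are hypothetical.

```python
import json

# Illustrative three-layer JSON Card: category -> subcategory -> key-value pairs.
card = {
    "vehicles": {                       # category
        "tesla_model_3": {              # subcategory (entity)
            "year": 2023,
            "insurance": "active",
        },
        "honda_accord": {
            "year": 2019,
            "service_appointment": "Nov 24, 8AM",
        },
    }
}

# Partial update: change one key without rewriting the rest of the card.
card["vehicles"]["honda_accord"]["service_appointment"] = "completed"
print(json.dumps(card, indent=2))
```

Because each entity lives under its own key, two similar entities (the two cars) never collide, which is exactly what the Level 2 disambiguation scenario requires.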
Advanced JSON Cards mode
Contextual knowledge management
Add metadata fields on top of basic JSON:
- backstory: narrative background of the information source
- person: identity of the subject
- relationship: relationship between subject and user
- timestamp: record timestamp
Example: “Dermatologist Dr. Chen contacted for eczema treatment for 8-year-old daughter Sarah”
→ person: Sarah, relationship: daughter
Common trait: Uses structured data as the main carrier, enabling programmatic operations and precise retrieval, suitable for storing key information that requires disambiguation.
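A sketch of one Advanced JSON Card for the Dr. Chen example above; `subject_of` is a hypothetical helper showing why the metadata matters: without the `person` field, the system would wrongly attribute the eczema treatment to the user.

```python
# Hypothetical Advanced JSON Card: base fact plus metadata fields
# (backstory, person, relationship, timestamp) for disambiguation.
memory = {
    "fact": "Dermatologist Dr. Chen contacted for eczema treatment",
    "backstory": "User sought treatment for their daughter's eczema",
    "person": "Sarah",
    "relationship": "daughter",
    "timestamp": "2024-11-03T10:15:00Z",
}

def subject_of(mem: dict) -> str:
    """Resolve who a memory is about, defaulting to the user."""
    return mem.get("person", "user")

print(subject_of(memory))
```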
Limitations of Knowledge Graphs
The promise of knowledge graphs
Triple representation: entity–relationship–entity
Seemingly powerful
- More flexible information networks
- Suitable for representing complex relationships
- Supports graph queries
Practical issues
Semantic degradation is inevitable
Original expression:
“If it’s still raining next week, I’ll cancel my beach plan and go to the museum instead.”
Knowledge graph representation:
- (me, has plan, beach trip)
- (me, has backup plan, museum trip)
Lost information:
- Conditional relation: “if–then–else”
- Temporal dependency: “if it’s still raining next week”
- Core structure of the decision logic
Limitations in reasoning capability
Good at: structured queries
- Pattern matching
- Path finding
- Find all “plans” related to “me”
Not good at: logical reasoning
- Counterfactuals: “What if it doesn’t rain?”
- Hypothesis testing
- Analogical reasoning
Best practices
Natural language + structured metadata
Store complex information in full, concise natural language, augmented with structured metadata like JSON Cards for indexing and retrieval.
Achieve an optimal balance between information completeness and query efficiency.
Case Study: ChatGPT’s Memory System
Four-layer context architecture
Through reverse engineering, it’s been found that every time ChatGPT receives a message, it injects four layers of context:
1. Session Metadata
Device type, browser, time zone, subscription level, etc.; not retained after the session ends
2. User Memory
User-explicitly-stored long-term facts (e.g., “remember that I am…”), injected on every request
3. Recent Conversations Summary
Lightweight summary of recent conversations (about 15 messages), only includes user messages, not assistant replies
4. Current Session
Full conversation history within a sliding window; older messages are truncated when exceeding token limits
Key design choices
No vector database
Doesn’t use traditional RAG-style vector retrieval; instead uses precomputed lightweight summaries injected directly, trading detailed history for speed and efficiency
Passive memory mechanism
Only stores information when the user explicitly says “remember this” or when the model detects facts that meet OpenAI’s criteria
Simple Notes mode
Each memory is an independent factual statement, lacking structural links between pieces of information
API unavailable
Memory features are not exposed to developers, limiting third-party app integration
Reference: Manthan Gupta, “I Reverse Engineered ChatGPT’s Memory System”
Case Study: Claude’s Memory System
Core differences from ChatGPT
Claude uses a completely different memory architecture: on-demand retrieval rather than precomputed injection.
User Memories
Similar long-term fact storage to ChatGPT, but supports implicit updates—the system periodically updates memories in the background based on conversation content
Rolling Window
About 190k tokens of full message history; older messages are discarded once the limit is exceeded
conversation_search tool
On-demand search of historical conversations by topic or keyword, called only when the model deems it necessary
recent_chat tool
Time-based retrieval of recent conversations, also called on demand
Design philosophy comparison
ChatGPT: precompute + inject
Automatically injects conversation summaries for every request, ensuring basic cross-session continuity, but summaries are lightweight and lack detail
Claude: selective retrieval
Does not automatically inject historical summaries; instead, the model decides when it needs historical context and retrieves it via tool calls
| Dimension | ChatGPT | Claude |
|---|---|---|
| Continuity | Automatically ensured | Depends on model judgment |
| Depth of detail | Shallow | Can go deep on demand |
| Efficiency | Fixed overhead | Consumed on demand |
Reference: Manthan Gupta, “I Reverse Engineered Claude’s Memory System”
Limitations of ChatGPT and Claude Memory Systems
Shared shortcomings
Flat storage
Both lack associations and hierarchical structure between pieces of information, making it hard to represent complex semantic relationships
No disambiguation mechanism
When there are multiple related but distinct entities (such as two cars), there is no effective way to distinguish them
Lack of proactive service
Neither can achieve third-level proactive, anticipatory service
Individual issues
| ChatGPT | Claude |
|---|---|
| Conversation summaries are too brief, losing important details | Relies on the model to decide when to retrieve, which may miss relevant context |
Three-Level Evaluation Framework Comparison
| Level | ChatGPT | Claude |
|---|---|---|
| L1: Basic Recall | ✅ Meets | ✅ Meets |
| L2: Multi-Session Retrieval | ⚠️ Summaries too shallow | ⚠️ Retrieval unstable |
| L3: Proactive Service | ❌ Not implemented | ❌ Not implemented |
Directions for Improvement
- Use Advanced JSON Cards to enhance metadata
- Introduce context-aware automatic extraction
- Build an association graph between memories
- Implement memory-based proactive reasoning
Experiment: Comparison of Four Memory Modes
Experimental Design (projects/week2/user-memory)
Based on the three-level evaluation framework, systematically compare four modes:
| Mode | Simplicity | Expressiveness | Updatability | Applicable Scenarios |
|---|---|---|---|---|
| Simple Notes | ⭐⭐⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐ | Quickly recording temporary information |
| Enhanced Notes | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐ | Scenarios requiring full semantics |
| JSON Cards | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ | Structured information management |
| Advanced JSON | ⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | Key information that requires disambiguation |
Key Findings
There is no “best” mode
The optimal choice depends on the specific scenario, cost budget, and task requirements.
Hybrid use is the trend
Simple Notes for fast recording + Advanced JSON for handling key information
Part 3: Memory Retrieval
Limitations of Traditional RAG
Problem: Flattened Processing Causes Information Loss
Case 1: The Black Cat vs White Cat Counting Problem
There are 100 independent cases in the knowledge base:
- 90 black cats
- 10 white cats
User asks: “What is the ratio of black cats to white cats?”
RAG system dilemma:
- Retrieval is limited by top-k (e.g., k=20)
- Cannot guarantee recalling all cases
- Can only reason based on an incomplete sample
- Result: Incorrect ratio conclusion
Case 2: Xfinity Discount Rules
There are three isolated cases in the knowledge base:
- Veteran John successfully applied for the discount
- Doctor Sarah received the discount
- Teacher Mike was not eligible
User asks: “I am a nurse, can I get the discount?”
RAG system problem:
- “Nurse” is semantically similar to “doctor”
- Tends to retrieve Sarah’s case first
- Incorrectly infers that nurses are also eligible
- Root cause: Fails to recall the complete rule boundary
Core issue: A naive RAG approach—directly throwing raw cases into the knowledge base—is far from sufficient. You must invest compute at the indexing stage to actively distill, abstract, and structure the original knowledge.
Solution: Knowledge Distillation and Structuring
Correct Approach for Case 1
Pre-compute a statistical summary
Compress the 100 individual cases into:
“There are 100 cats in total:
- 90 black cats (90%)
- 10 white cats (10%)”
Result: A single retrieval yields accurate, complete statistical information.
Correct Approach for Case 2
Extract explicit rules
From the three isolated cases, extract:
“Xfinity discounts only apply to:
- Veterans
- Doctors
Other professions are not eligible.”
Result: No matter which profession the user asks about, a single retrieval returns a complete and accurate definition of the rule.
Core principle: Compress the “100 individual cases” into a statistical summary, and distill the “three isolated cases” into explicit rules. Only then can you build a truly efficient and reliable agent knowledge system.
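The index-time distillation for Case 1 takes only a few lines; `distill` is an illustrative helper, and in a real pipeline it would run at ingestion time so the summary is available as a single retrievable chunk.

```python
from collections import Counter

# Sketch of index-time distillation: compress raw cases into one
# retrievable summary instead of storing 100 individual documents.
cases = ["black cat"] * 90 + ["white cat"] * 10

def distill(cases: list[str]) -> str:
    counts = Counter(cases)
    total = len(cases)
    lines = [f"There are {total} cats in total:"]
    for kind, n in counts.most_common():
        lines.append(f"- {n} {kind}s ({100 * n // total}%)")
    return "\n".join(lines)

summary = distill(cases)
print(summary)  # a single chunk now answers any ratio question
```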
Structured Indexing: RAPTOR vs GraphRAG
RAPTOR: Tree-Like Hierarchical Structure
Bottom-up recursive abstraction
- Leaf nodes: Split documents into small text chunks
- Clustering: Group semantically similar chunks
- Summarization: Generate parent nodes for each group
- Recursion: Abstract layer by layer up to the root node
Retrieval process:
- Locate macro concepts from high-level summaries
- Drill down the tree to reach concrete details
- Retrieval path from macro to micro
Strengths: Captures hierarchical structure and abstraction relationships in knowledge.
GraphRAG: Network Association Graph
Entity-relationship modeling
- Extract entities: People, places, concepts, terms
- Extract relationships: Various relations between entities
- Community detection: Clusters of tightly related entities
- Cluster summarization: Generate summaries for communities
Retrieval process:
- Locate core entities
- Traverse relationship edges to find related entities
- Provide context via community analysis
Strengths: Reveals horizontal associations and network structure in knowledge.
Relation between the two: They are not substitutes but complements. The ideal solution combines them to build a three-dimensional knowledge index with both depth and breadth.
Context-Aware Retrieval: Solving Context Loss
Problem: Ambiguity of Isolated Text Chunks
Example chunk:
“The company’s revenue grew by 3% in the second quarter.”
Missing context:
- Which company is “the company”?
- When was the report released?
- Which product line is this related to?
Result: Severe semantic information loss, reduced retrieval accuracy.
Solution: Context Prefix
Anthropic’s context-aware retrieval
Step one: Generate a context prefix for the chunk.
The LLM generates:
“[This passage is excerpted from ACME’s 2025 Q2 financial report, ‘Key Performance Indicators’ section]”
Step two: Concatenate and index
“[This passage is excerpted from ACME’s 2025 Q2 financial report, ‘Key Performance Indicators’ section] The company’s revenue grew by 3% in the second quarter.”
Effect: Combined with BM25, retrieval failure rate drops by 49%; combined with a re-ranker, failure rate drops by up to 67%.
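The indexing step can be sketched as follows, with `generate_prefix` standing in for the real LLM call that writes the context prefix from the full document.

```python
# Sketch of contextual retrieval indexing: prepend an LLM-generated
# context prefix to each chunk before embedding / BM25 indexing.
# `generate_prefix` is a stub for the actual LLM call.
def generate_prefix(document_title: str, section: str) -> str:
    return f"[This passage is excerpted from {document_title}, '{section}' section]"

def contextualize(chunk: str, document_title: str, section: str) -> str:
    """What actually gets embedded and indexed, instead of the bare chunk."""
    return f"{generate_prefix(document_title, section)} {chunk}"

chunk = "The company's revenue grew by 3% in the second quarter."
indexed = contextualize(chunk, "ACME's 2025 Q2 financial report",
                        "Key Performance Indicators")
print(indexed)
```

At query time nothing changes: the same retriever runs, but each candidate chunk now carries the context needed to match "ACME Q2 revenue" style queries.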
Two-Layer Structure of User Memory
📋 JSON Cards (Resident Context)
Structured core facts, a personal cheat sheet
- Passport expires 2025-02 · Tokyo trip
🔍 Context-Aware RAG (On-Demand Retrieval)
Unstructured conversation details, a powerful search engine
- [Context: booking a January flight in November…]
🔗 The two must work together
- JSON Cards provide the factual framework
- LLM reasoning discovers potential associations
- RAG verifies and retrieves conversational evidence
- Proactive service: Passport is about to expire!
JSON Cards tell the agent “what exists”; RAG tells the agent “what the details are.” Both are indispensable.
Agent Memory Architecture
Resident Context
📋 Basic knowledge → System Prompt
User JSON Cards are placed directly into the agent context and can be accessed without tool calls.
Three Retrieval Tools
🔍 search_user_memory
Agentic Search on User Memory
- Backend: Embedding Search → Rerank → return related memories
🔍 search_conversations
Agentic Search on Conversation Summaries
- Backend: Embedding Search → return related historical conversation summaries
📜 load_recent_conversations
Load Last N Conversation Summaries
- Directly load summaries of the last N turns, no semantic search needed
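A toy sketch of the three tools' interfaces, with simple keyword matching standing in for the embedding-search and rerank backends described above.

```python
# Toy backends: in a real system these would be embedding search + rerank.
USER_MEMORIES = ["Passport expires 2025-02", "Owns a 2019 Honda Accord"]
CONVERSATION_SUMMARIES = ["Booked January flight to Tokyo",
                          "Scheduled Honda 30K service"]

def search_user_memory(query: str) -> list[str]:
    """Agentic search over user memories (keyword stub)."""
    words = query.lower().split()
    return [m for m in USER_MEMORIES if any(w in m.lower() for w in words)]

def search_conversations(query: str) -> list[str]:
    """Agentic search over conversation summaries (keyword stub)."""
    words = query.lower().split()
    return [s for s in CONVERSATION_SUMMARIES if any(w in s.lower() for w in words)]

def load_recent_conversations(n: int) -> list[str]:
    """Load the last n summaries directly, no semantic search."""
    return CONVERSATION_SUMMARIES[-n:]

print(search_user_memory("passport"))
```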
Architecture Diagram
(Architecture diagram: JSON Cards resident in the system prompt, with the three retrieval tools underneath)
Design principle: High-frequency information resides in context; long-tail details are retrieved on demand.
Proactive Service: A Natural Result of Storage + Retrieval
Core Insight
Proactive service is not an independent capability layer
Once storage and retrieval are done well, proactive service emerges naturally. It is an emergent result of structured storage working together with intelligent retrieval.
Why Does It Emerge Naturally?
Structured storage (JSON Cards) provides:
- Key facts resident in context (e.g., passport validity)
- Metadata that supports associative reasoning (e.g., timestamps, entity types)
Intelligent retrieval (context-aware RAG) provides:
- On-demand access to historical conversation details
- Automatic connection of related information fragments
Combined, the agent naturally discovers associations like “Tokyo ticket in January → passport expires in February.”
Examples of Proactive Service
International travel alert
JSON Cards store passport information + RAG retrieves flight bookings → automatically detects time conflicts.
Device damage handling
JSON Cards store device and insurance information → automatically list all applicable protection options.
Tax season preparation
JSON Cards store income types + RAG retrieves transaction records → automatically aggregates relevant documents.
Implementation path: There is no need to design a separate mechanism for “proactive service.” Focus on doing storage and retrieval well, and the LLM’s reasoning ability will handle the rest.
Part 4: Evaluating Memory
Why Do We Need Evaluation?
Evaluation Is the Compass of Agent Engineering
Building an agent system involves many design decisions, and these decisions often have no obvious “correct answer.”
Key decision points
- Workflow design: Workflow vs Autonomous mode
- Prompt design: Structured vs rule list
- Memory mode: Simple Notes vs JSON Cards
- Retrieval strategy: Precomputed injection vs on-demand retrieval
Core insight: Some seemingly reasonable designs actually harm performance, while some seemingly trivial details can bring significant gains. Only through rigorous comparative evaluation can these counterintuitive truths be revealed.
Threefold Value of Evaluation
1. Guides design decisions
Without evaluation, we can only rely on intuition, and intuition is often unreliable.
2. Provides improvement signals
Not only tells you “good or bad,” but more importantly reveals “why it is good/bad.”
3. Supports model upgrade decisions
When a new model is released, only by testing it on your own evaluation set can you make data-driven upgrade decisions.
Ablation study methodology: keep all other parts of the system unchanged, modify only one specific component, and observe the impact on overall performance
Basic components of an evaluation environment
Five core elements
1. Dataset
Defines a set of tasks; each task contains an initial state, goal description, and reference solution.
2. Environment state
Maintains all mutable information during task execution (database, file system, conversation history).
3. Tools
The channels through which the agent interacts with the environment; must be functionally complete but avoid over-simplification.
4. Rubric
Defines how to quantify agent performance; this is the most challenging part of evaluation.
5. Interaction protocol
Specifies interaction patterns and termination conditions.
Key principles for human–AI interaction evaluations
Progressive information disclosure
Never expose all of the user's information to the agent up front. Information should be disclosed progressively, as needed, over the course of the conversation.
User simulation
Use another LLM to play the user role, following predefined instructions to:
- Reveal necessary information step by step
- Respond to the agent’s questions
- Issue a termination signal after the task is completed
Dual verification
- Check whether the final database state is correct
- Check whether all necessary key information was output in the conversation
Reference: τ-bench / τ²-bench evaluation frameworks
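The dual-verification step can be sketched as two independent checks that must both pass; the schema and names below are illustrative, not the actual τ-bench harness.

```python
# Sketch of dual verification: correct final database state AND
# required key information present in the conversation transcript.
def verify(db_state: dict, expected_state: dict,
           transcript: str, required_facts: list[str]) -> bool:
    state_ok = all(db_state.get(k) == v for k, v in expected_state.items())
    info_ok = all(fact in transcript for fact in required_facts)
    return state_ok and info_ok

db = {"order_status": "cancelled"}
transcript = "Done. Your refund of $120 will arrive in 5-7 days."
print(verify(db, {"order_status": "cancelled"}, transcript, ["$120"]))
```

Checking only the database misses agents that act correctly but report wrong details; checking only the transcript misses agents that talk correctly but never execute. Both checks are needed.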
Rubric: the basis for LLM judgment
What is a rubric?
A rubric (structured scoring guideline) is the core tool that makes LLM-as-judge evaluation objective, consistent, and interpretable. It is similar to the scoring criteria for exams like the Gaokao, GRE writing, or TOEFL speaking.
Four design principles
Expert-guided: reflects domain expertise and captures the core facts and reasoning steps required for a correct response.
Comprehensive coverage: spans multiple dimensions (accuracy, coherence, completeness) and defines both positive and negative criteria.
Importance weighting: factual correctness must take precedence over stylistic clarity (Essential / Important / Optional / Pitfall).
Self-contained evaluation: each evaluation item is independently operable and does not rely on external context.
Example rubric for evaluating user memory
dimensions: …
Preventing reward hacking: explicitly define negative criteria in the rubric—hallucinations, flattering the user, keyword stuffing, and avoiding the question.
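Importance weighting and negative criteria can be sketched in code: a hypothetical rubric where Essential items dominate, Pitfall items subtract, and a simple `score` function consumes an LLM judge's boolean judgments.

```python
# Hypothetical rubric for the Level 1 bank-account scenario.
RUBRIC = [
    {"criterion": "States the correct account number", "tier": "essential", "weight": 3},
    {"criterion": "Includes the routing number",       "tier": "optional",  "weight": 1},
    {"criterion": "Hallucinates unstated details",     "tier": "pitfall",   "weight": -3},
]

def score(judgments: dict) -> float:
    """`judgments` maps criterion -> bool, as produced by an LLM judge.
    Returns a normalized score in [0, 1]; pitfalls subtract."""
    total = sum(item["weight"] for item in RUBRIC if judgments.get(item["criterion"]))
    max_positive = sum(item["weight"] for item in RUBRIC if item["weight"] > 0)
    return max(total, 0) / max_positive

print(score({"States the correct account number": True,
             "Includes the routing number": True}))
```

Note how the pitfall item makes reward hacking visible: a response that nails the account number but invents extra details scores no better than one that says nothing.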
Evaluation methodology: best practices from Anthropic
Three types of evaluation
Unit tests
Deterministic checks, used to verify format, edge cases, and other scenarios where correctness can be clearly judged.
LLM-as-judge
Use an LLM to assess output quality; combined with a clear rubric, this can achieve a high level of agreement with human judgment.
Human evaluation
Test under real-world conditions to discover the “rough edges” that automated evaluation cannot capture.
Characteristics of good evaluations
- Specific and clear: there is a single correct answer.
- Realistic: reflects problems that real users will actually encounter.
- Diagnosable: simple enough to understand the reason for failure.
- Representative: reflects the end-user experience.
Three dimensions of agent evaluation
Final answer correctness
Did the agent provide the correct final answer? Use an LLM judge to compare with the reference answer.
Tool-usage accuracy
Did the agent choose the correct tools? Were the parameters correct? Could it recover from errors?
Final state correctness (τ-bench)
Did the agent achieve the correct final state? Applicable to tasks with side effects (such as canceling an order).
Evaluation tips
- The more obvious the impact of each system change on the result, the fewer test samples you need.
- Use real tasks: real user scenarios that have clearly correct answers.
- Nothing can perfectly replace human evaluation: repeated testing and gut checks are indispensable.
Reference: Anthropic, “Context Engineering Best Practices” (AWS re:Invent 2025)
Part 5: Frontier Research
Limitations of existing agent memory systems
Problem: agents cannot learn from history
Current status
Existing LLM agents cannot effectively learn from accumulated interaction history when handling continuous task streams. Each task is processed in isolation, causing the system to repeatedly make past mistakes and lose valuable insights.
Root problem: lack of true self-evolution capability—the agent cannot grow stronger over time.
Defects of existing approaches
Two mainstream approaches
Raw trajectory storage
Directly store the interaction process, with no distillation.
Successful workflow logging
Only keep workflows/procedures and ignore failures.
Shared defects
- Cannot extract high-level, transferable reasoning patterns
- Overemphasize success and ignore the valuable lessons of failure
- Passive recording that cannot generate actionable guidance
ReasoningBank: a memory bank of reasoning strategies
Core innovation
Learning from both success and failure
ReasoningBank distills generalizable reasoning strategies from the agent’s self-judged successes and failures, without relying on ground-truth labels.
Difference in memory contents
| Method | Stored content |
|---|---|
| Raw trajectories | Complete interaction sequences |
| Successful workflows | Effective action patterns |
| ReasoningBank | Transferable reasoning strategies |
Closed-loop learning mechanism
1. Retrieve relevant memories
When facing a new task, retrieve semantically relevant reasoning strategies from ReasoningBank.
2. Guide action decisions
Use the retrieved strategies to guide the agent’s interaction process.
3. Analyze new experiences
After the task is finished, the agent self-judges success or failure.
4. Distill and integrate
Extract reasoning strategies from the new experiences and update ReasoningBank.
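The four steps above can be sketched as a toy loop. Everything here is an assumption for illustration: `embed`, `act`, `self_judge`, and `distill` are placeholders, not the paper's actual interfaces.

```python
# Minimal sketch of the closed loop described above (not the paper's API).
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class ReasoningBank:
    def __init__(self, embed):
        self.embed = embed          # text -> vector (placeholder)
        self.memories = []          # list of (vector, strategy_text)

    def retrieve(self, task: str, k: int = 1):
        """Step 1: fetch the k most relevant strategies (quality > quantity)."""
        q = self.embed(task)
        ranked = sorted(self.memories, key=lambda m: -cosine(q, m[0]))
        return [text for _, text in ranked[:k]]

    def integrate(self, strategies):
        """Step 4: store distilled strategies, positive or preventive alike."""
        for s in strategies:
            self.memories.append((self.embed(s), s))

def run_task(bank, task, act, self_judge, distill):
    hints = bank.retrieve(task)                    # 1. retrieve memories
    trajectory = act(task, hints)                  # 2. guide action decisions
    success = self_judge(task, trajectory)         # 3. self-judge the outcome
    bank.integrate(distill(trajectory, success))   # 4. distill and integrate
    return success
```

Note that `integrate` runs whether or not the task succeeded; that single design choice is what lets failures contribute preventive strategies instead of being discarded.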
Why are failure experiences equally important?
Valuable lessons in failure
Misconceptions in traditional thinking
Most memory systems focus only on successful cases, assuming failures are not worth keeping. Yet failure experiences contain critical “preventive” knowledge.
Example: web navigation task
Success teaches you:
“Click the ‘Men’s clothing’ category to find the product.”
Failure teaches you:
“Do not search directly on the homepage; the search box does not support complex queries well.”
Failure experiences provide boundary conditions that successful paths cannot cover.
The value of contrastive signals
Contrastive learning from success vs. failure
When there are both successful and failed examples for the same class of tasks, the agent can discover through comparison:
- Which strategies are effective in specific contexts
- Which seemingly reasonable paths actually fail
- The critical boundaries between success and failure
How ReasoningBank handles this
From successful experiences, it extracts: positive strategies (“doing this works”).
From failed experiences, it extracts: preventive strategies (“avoid doing this”).
Together they form more complete reasoning knowledge.
MaTTS: memory-aware test-time scaling
Depth vs. breadth
Two paths for expanding experience
Breadth scaling
Increase the number of tasks (more users, more scenarios).
Depth scaling (MaTTS)
Conduct more exploration for each task (more attempts, more variants).
Core idea of MaTTS
Allocate more compute to a single task to generate rich, diverse exploratory experiences, providing higher-quality contrastive signals for memory synthesis.
Synergy between memory and scaling
Positive feedback loop
High-quality memory → more effective exploration → richer experience → higher-quality memory
Two scaling modes
Parallel scaling
Generate multiple independent solution paths simultaneously.
Sequential scaling
Adjust the next attempt based on the previous result.
MaTTS establishes memory-driven experiential scaling as a new scaling dimension for agent systems.
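The two scaling modes can be contrasted in a few lines. `attempt` and `refine` are hypothetical stand-ins for one agent rollout; this is a shape sketch, not the paper's implementation:

```python
# Sketch of the two MaTTS scaling modes; `attempt` and `refine` are
# placeholder callables representing one agent rollout.
def parallel_scaling(task, attempt, n=4):
    """Generate n independent solution paths for the same task."""
    return [attempt(task, seed=i) for i in range(n)]

def sequential_scaling(task, attempt, refine, n=4):
    """Condition each attempt on the previous result."""
    trajectories, prev = [], None
    for _ in range(n):
        prev = attempt(task, seed=0) if prev is None else refine(task, prev)
        trajectories.append(prev)
    return trajectories
```

Either way the output is a set of diverse trajectories for one task, which is exactly the raw material the contrastive memory-synthesis step needs.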
Experimental results and key findings
Benchmark results
Three evaluation settings
WebArena (web browsing)
- Complex web-interaction tasks
- Requires multi-step navigation and operations
Mind2Web (web understanding)
- Element recognition on real-world web pages
- Action prediction and execution
SWE-Bench-Verified (software engineering)
- Codebase-level bug fixing
- Requires understanding large codebases
Key metrics
- Effectiveness: up to 34.2% relative improvement
- Efficiency: 16.0% fewer interaction steps
Key findings
Memory quality > quantity
Retrieving 1 relevant memory outperforms retrieving 4. Too many memories may introduce conflicts or noise.
Unique value of failure experiences
Systems that incorporate failure experiences outperform those that learn only from successes.
Emergent behaviors
As memories accumulate, the agent begins to exhibit complex reasoning strategies not seen before.
Synergy of MaTTS and memory
The combination of ReasoningBank + MaTTS performs best, confirming the positive feedback loop between memory and scaling.
Insights from ReasoningBank for user-memory systems
From task memory to user memory
Shared core challenges
ReasoningBank addresses how an agent learns from task interactions; user memory systems address how an agent understands and serves users. They face similar core questions:
- How to distill high-level knowledge from raw data?
- How to retrieve truly relevant information?
- How to enable the memory system to evolve continuously?
Key insight: you cannot simply store raw data; you must invest compute in active distillation, abstraction, and structuring.
Transferable design principles
Principle 1: bidirectional learning
Learn not only from users’ positive feedback (preferences) but also from negative feedback (boundaries).
Principle 2: closed-loop updates
A memory system is not built once and for all; it evolves continuously with interactions.
Principle 3: quality first
The relevance and quality of memories matter more than their quantity.
Principle 4: self-judgment
Use LLM-as-a-judge to automate quality evaluation and reduce reliance on manual labeling.
Summary: The Evolution from Memory to Cognition
Technical Evolution Path
1. Remembering Facts
Simple Notes / JSON Cards
✓ Accurately store structured information
2. Understanding Context
Enhanced Notes / Advanced JSON
✓ Preserve semantic integrity and situational information
3. Cross-Session Association
Structured indexing + context-aware retrieval
✓ Disambiguate and discover composite events
4. Proactive Anticipation
Dual-layer memory architecture + deep reasoning
✓ Provide help without explicit requests
Key Insights
Personalization is a real need
From the success of recommendation systems, personalized products are more in line with human nature. AI Agents also need personalized memory to adapt to each user’s unique values and preferences.
Preference learning is the hard part
Factual information is relatively simple, but learning user preferences faces challenges such as context dependence and over-generalization, requiring fine-grained evaluation and continuous iteration.
Knowledge distillation is critical
You can’t just dump raw data into a knowledge base; you must invest compute in proactive distillation, abstraction, and structuring.
A dual-layer architecture is the optimal solution
Structured core facts (always in context) + context-aware retrieval (on-demand access) strike a balance between completeness and efficiency.
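Read architecturally, the dual-layer idea means core facts are injected unconditionally while everything else is retrieved per query. A naive sketch, using keyword overlap in place of real semantic retrieval; all names here are illustrative:

```python
# Toy dual-layer context assembly: core facts always present, episodic
# memories fetched on demand. Keyword overlap stands in for semantic search.
def build_context(core_facts: dict, episodic: list, query: str, k: int = 2):
    q_words = set(query.lower().split())
    ranked = sorted(episodic,
                    key=lambda m: -len(q_words & set(m.lower().split())))
    return {
        "always_in_context": core_facts,  # small, structured, always loaded
        "retrieved": ranked[:k],          # on-demand, relevance-ranked
    }
```

The split is what balances completeness against efficiency: the always-loaded layer stays small enough to fit every prompt, while the retrieved layer can grow without bound.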
Future Outlook
Technical Challenges
Refinement of preference learning
- Better modeling of context dependence
- Distinguishing one-off behaviors from long-term preferences
- Reducing the risk of over-generalization
Memory compression and organization
- Automatically discovering knowledge hierarchies
- Dynamically adjusting memory structure
- Balancing level of detail and accessibility
Cross-modal memory integration
- Unified representations of text, images, and audio
- Associative retrieval across multimodal information
Application Prospects
Personalized value alignment
- From universal values to individual values
- Dynamically adapting to the evolution of user values
- Achieving true personalization at the level of details
Operating-system-level assistants
- Unified memory across devices and applications
- Long-term, continuous user profile construction
- Truly proactive services
Privacy and transparency
- Complete user control over memory
- Explainable memory management
- Tiered protection for sensitive information
Vision
To build a truly “understanding you” AI assistant that not only remembers what you say, but understands who you are, anticipates your needs, and becomes a trustworthy lifelong companion.
From simple recording to deep understanding, from passive response to proactive service