From Memory to Cognition: How AI Agents Can Deliver Truly Personalized Services
Contents
- 01 | The Importance and Challenges of Memory - Personalization Value · Three Capability Levels
- 02 | Representation of Memory - Notes · JSON Cards
- 03 | Retrieval of Memory - RAG · Context Awareness
- 04 | Evaluation of Memory - Rubric · LLM Judge
- 05 | Frontier Research - ReasoningBank
Starting from personalization needs → Understanding memory challenges → Designing storage schemes → Implementing intelligent retrieval → Scientific evaluation and iteration
Part 1: The Importance and Challenges of Memory
Personalization is a real problem, and the core competitiveness of the future
The evolution of recommender systems
- Traditional media: One People’s Daily, everyone sees the same content
- The ByteDance revolution: Everyone sees different content — “Everyone lives in a different world and has different values”
- Conclusion: Personalized products are more in line with human nature → Users are more willing to use them
The future of AI is the same
There should not be only one Universal Value
- It should adapt to each user’s values and preferences
- Value differences in details are huge
- Personalization is the core competitiveness of AI products
Key insight: Just as recommender systems improve user experience through personalized content, AI Agents also need personalized memory to understand and serve each unique user.
Technical difficulty: Remembering facts vs. learning preferences
Factual Information
Relatively easy
- Birthday, address, card number
- Work information, contact details
- Just remember them, no ambiguity
We are already doing this pretty well
Examples:
- “My membership number is 12345” ✅
- “My birthday is January 1, 1990” ✅
- “I live in Haidian District, Beijing” ✅
User Preference
Very hard, requires solving multiple challenges:
1. Strong context dependence
- User requires academic format when writing papers
- Does not mean travel guides should also be academic
- AI easily over-generalizes preferences
2. One-off behavior vs. long-term preference
- “Yesterday I ordered Sichuan food” ≠ “User likes Sichuan cuisine”
- Might just be a friend’s preference, or a one-time whim
3. Requires extremely fine-grained evaluation
- Trade-offs must be grounded in data and tests
- Cannot rely on gut feeling
Alignment of personalization value
Analogy: Success experience of recommender systems
Traditional approach: Universal human values
- LLMs are aligned to “universal” values
- But do we really have universally agreed human values?
- In details, value differences are huge
What AI should do is
- Not just a single universal value
- Adapt to each user’s values and preferences
- Recognize that value differences are huge
From recommendation to alignment: The evolution of AI
Just as ByteDance believes that “everyone lives in a different world and has different values”, AI Agents also need to:
- Understand individual differences: Each user has unique values and preferences
- Adapt dynamically: Continuously adjust based on user behavior and feedback
- Be context-aware: The same user has different needs in different scenarios
User memory is more than logging conversations
The essence of memory
Just like understanding friends
- We don’t remember every sentence they say
- We build a mental model of who they are
- Their preferences, habits, values
Core analogy: The goal of a user memory system is to build a model of the user that is as concise and powerful as possible, capable of explaining the user’s past behavior and predicting the user’s future needs.
Comparison of two types of memory
| Type | Difficulty | Example |
|---|---|---|
| Facts | Simple | Birthday, address, card number |
| Preferences | Complex | Context-dependent, constantly evolving |
Learning user preferences is much harder than storing factual information
- Context-dependent: Academic writing style ≠ travel guide style
- One-off vs. long-term: “Ordered Sichuan food yesterday” ≠ “Likes spicy food”
- Over-generalization risk: AI easily extrapolates incorrectly
Three levels of memory capability
Level 1: Basic recall
Store and retrieve explicit user information
- “My membership number is 12345” → Accurate recall
- Foundation of reliability
Level 2: Cross-session retrieval
Connect information across different conversations
- Disambiguation: “Schedule maintenance for my car” → Which of the two cars?
- Understanding composite events: “Cancel my trip to Los Angeles” → Find flights + hotel
Level 3: Proactive service
Anticipate needs without explicit requests
- Booking an international flight? → Check if passport is near expiry
- The highest manifestation of intelligence
Our evaluation framework
Based on these three levels, we designed 60 test cases (20 per level), each case containing 1–3 sessions, each session about 50 turns of dialogue with a large amount of factual detail. We use an LLM-as-a-judge + Rubric approach to score the agent’s responses on multiple dimensions.
Level 1: Evaluation of basic recall
Scenario: Bank account setup
Accurately store and retrieve structured information provided by the user in a long conversation.
Test case
- 47-minute conversation about opening a bank account
- Includes name, address, SSN, account numbers, etc.
- A large number of details across 50+ turns of dialogue
Final question: “What’s my checking account number? I need to set up direct deposit.”
Expected answer: Accurately provide account number 4429853327, and preferably also the routing number 123006800
Conversation excerpt
- user: I live at 1847 Maple Street, …
Key: Precisely retrieve a specific account number from 50+ turns of dialogue
Level 2: Cross-session retrieval — Disambiguation scenario
Scenario: Service appointment for multiple cars
The user mentions owning multiple cars in different sessions; when the request is ambiguous, the Agent needs to proactively disambiguate.
Session 1: Adding a new car to insurance
User William Chen adds a 2023 Tesla Model 3 to insurance, existing 2019 Honda Accord already on policy
Session 2: Scheduling car maintenance
A 30K maintenance service for the Honda Accord is scheduled for November 24 at 8AM
Final question: “I need to schedule service for my car.”
Expected behavior: Detect ambiguity, list the status of both cars, and ask which one specifically
Conversation excerpt
# Session 1 - Insurance …
Key: Discover that the user has two cars, Honda already has an appointment, Tesla does not
Level 2: Composite events — Cascading effect of trip cancellation
Scenario: One big event contains many small events
When the user says “Cancel my trip to Los Angeles”, the system needs to understand that “trip” is a composite event containing multiple independent bookings.
Associated bookings that need to be found automatically
- Flight to Los Angeles
- Hotel booking in Los Angeles
- Possible car rental
- Event tickets, restaurant reservations, etc.
Final question: “Cancel my LA trip next week.”
Expected behavior: Automatically associate all related bookings, provide unified cancellation options, and explain the cancellation policies and refund status of each item
Information scattered across three independent sessions
# Session 1 - Flight booking (Delta) …
Key: The three sessions interact with different service providers, but all belong to the same “Los Angeles trip”
Level 2: Overwrite handling — Multiple modifications to an order
Scenario: A constantly modified custom furniture order
The user custom-orders a dining set, but repeatedly changes the requirements during production. The Agent needs to track all changes and keep only the currently valid specifications.
Order change history
- August 20: Ordered walnut dining table + 8 gray chairs + 1 bench
- September 5: Chair color changed to sage green, 2 changed to armchairs
- October 28: Green fabric discontinued, user needs to choose a new color
Final question: “What’s the current status of my dining set order?”
Expected behavior: Return only the latest status: waiting for fabric selection, delivery date pending; do not confuse with historical specifications
Key changes across three sessions
# Session 1 - Initial order (August 20) …
Key: The Agent must recognize that old information has been overwritten, and only the latest session’s status is valid
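The overwrite logic can be sketched as a replay of timestamped change events, so only the latest value of each field survives. The `current_state` helper and field names below are illustrative, not from any real system.

```python
# Hypothetical sketch: resolve an order's current specs by replaying
# timestamped change events; later changes overwrite earlier fields.
def current_state(initial, changes):
    """Apply change dicts in chronological order; later keys win."""
    state = dict(initial)
    for change in sorted(changes, key=lambda c: c["date"]):
        state.update({k: v for k, v in change.items() if k != "date"})
    return state

order = {"table": "walnut", "chairs": "8 gray", "bench": 1}
changes = [
    {"date": "2024-09-05", "chairs": "6 sage green + 2 sage green armchairs"},
    {"date": "2024-10-28", "chairs": "awaiting new fabric choice"},  # green discontinued
]

print(current_state(order, changes))
```

The point is that the memory store keeps the full history (for auditability), but what the agent reports is the resolved latest state.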
Level 3: Proactive service — Passport expiry warning
Scenario: International travel coordination
The user mentions different pieces of information in multiple independent sessions but never connects them. The Agent needs to proactively connect these scattered pieces and reason about potential risks.
Risks the AI needs to infer
- Passport expiry date: February 18, 2025 (mentioned in Session 1)
- Return date: January 22, 2025 (mentioned in Session 2)
- Japan requires that a passport be valid for ≥ 6 months at the time of entry!
Final question: “I’m finalizing my trip to Tokyo in January. Is there anything I need to take care of before I go?”
Expected behavior: Proactively connect passport validity with travel dates and remind the user that the passport may not meet Japan’s entry requirements
Excerpts from three independent sessions
# Session 1 - June: passport renewal address (USPS) …
Key: The relationship between passport and travel is never discussed in any of the three sessions; the AI needs to reason this out by itself!
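A minimal sketch of the check the agent has to perform, using the dates from this scenario; `passport_risk` is a hypothetical helper, and the 6-month rule is approximated with 30-day months.

```python
from datetime import date, timedelta

def passport_risk(expiry: date, travel_date: date, months_required: int = 6) -> bool:
    """True if the passport expires less than `months_required` months
    after the travel date (months approximated as 30 days)."""
    return expiry < travel_date + timedelta(days=30 * months_required)

# Facts scattered across independent sessions:
passport_expiry = date(2025, 2, 18)   # Session 1 (passport renewal)
return_date = date(2025, 1, 22)       # Session 2 (flight booking)

if passport_risk(passport_expiry, return_date):
    print("Warning: passport may not meet Japan's 6-month validity rule.")
```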
Level 3: Proactive Service — Integrating Device Damage Protection
Scenario: Phone screen shattered
The user says “My phone screen just cracked.” The Agent needs to proactively integrate different protection information scattered across sessions and find the best solution.
Protection sources the AI needs to infer
- Manufacturer warranty (Apple 1-year, until Feb 2025)
- Credit card protection (Chase Sapphire, $50 deductible)
- Carrier insurance (user declined, not applicable)
Final question: “My phone screen just cracked. What are my options?”
Expected behavior: Proactively list all protection options, compare costs and processes, and recommend the optimal plan (Chase credit card protection)
Information scattered across multiple independent sessions
# Session 1 - February: phone purchase (Best Buy) …
Key: The AI needs to integrate three sources and infer that Chase protection is the optimal choice ($50 deductible vs Apple $379)
Level 3: Proactive Service — Tax Season Preparation
Scenario: Proactive reminder before tax season
When the user mentions “preparing my taxes” in early January, the Agent should proactively aggregate tax-related information scattered across different conversations throughout the year.
Historical information the AI needs to proactively associate
- February: Mortgage application (interest $31,000, points $7,500)
- June: Stock sale (Apple, capital gains $33,000)
- August: Charitable donation (Microsoft stock $25,200)
- October: Side consulting income ($18,000, has a home office)
Final question: “I’m preparing my taxes. What should I know?”
Expected behavior: Proactively list all tax-related items, remind about required forms, and flag commonly missed deductions
Tax information scattered across year-long conversations
# Session 1 - February: mortgage (First National Bank) …
Key: Four sessions spanning the whole year; the AI needs to proactively aggregate and remind the user to prepare all relevant forms
Level 3: Core Capabilities of Proactive Service
Fundamental differences from Levels 1 and 2
| Level | Trigger mode | Information source |
|---|---|---|
| L1 | User asks directly | Single session |
| L2 | User gives vague request | Multiple sessions |
| L3 | No explicit request needed | Across time and domains |
Three typical scenarios recap
- Passport alert: Flight ticket + passport validity → entry risk
- Device protection: Purchase + credit card + insurance → optimal plan
- Tax prep: Year-long transaction records → complete tax checklist
Key technical challenges
Time span: Need to connect conversations from months or even years ago and identify still-relevant information
Cross-domain: Associate and reason over information from different providers and scenarios
Proactive reasoning: User doesn’t explicitly request it, but the Agent should proactively discover and remind
Priority judgment: Identify truly urgent and important issues, avoiding information overload
Part 2: Representation of Memory
Memory Representation (I): Natural Language
Simple Notes mode
Minimalist design
Each memory is an atomic factual statement:
- "User email: john@example.com"
- “Preferred programming language: Python”
| Advantages | Disadvantages |
|---|---|
| Extremely low cognitive load | Loss of information associations |
| O(1) operational complexity | Semantic fragmentation |
Enhanced Notes mode
Full context preserved
Paragraph-style storage of full background:
“The user works at TechCorp as a senior software engineer, has focused on machine learning for three years, and is currently leading a recommendation system project.”
| Advantages | Disadvantages |
|---|---|
| Semantic completeness | Redundant storage |
| Narrative structure preserved | Complex updates |
Common trait: Uses natural language as the main carrier, suitable for human reading and understanding, but lacks machine-operable structured information.
Memory Representation (II): Structured
JSON Cards mode
Structured organization
Three-layer nesting: category → subcategory → key-value pairs
{ … }
| Advantages | Disadvantages |
|---|---|
| Partial updates | Rigid structure |
| Extensible | Hard to classify multi-dimensional information |
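A minimal illustration of the three-layer nesting, written as a Python dict so the partial-update property is visible; all category and field names are hypothetical.

```python
import json

# Illustrative three-layer JSON Card: category -> subcategory -> key-value pairs.
card = {
    "vehicles": {                       # category
        "tesla_model_3": {              # subcategory (entity)
            "year": 2023,
            "insurance": "active",
        },
        "honda_accord": {
            "year": 2019,
            "service_appointment": "Nov 24, 8AM",
        },
    }
}

# Partial update: change one key without rewriting the rest of the card.
card["vehicles"]["honda_accord"]["service_appointment"] = "completed"
print(json.dumps(card, indent=2))
```

Because each entity lives under its own key, two similar entities (the two cars) never collide, which is exactly what the Level 2 disambiguation scenario requires.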
Advanced JSON Cards mode
Contextual knowledge management
Add metadata fields on top of basic JSON:
- backstory: narrative background of the information source
- person: identity of the subject
- relationship: relationship between subject and user
- timestamp: record timestamp
Example: “Dermatologist Dr. Chen contacted for eczema treatment for 8-year-old daughter Sarah”
→ person: Sarah, relationship: daughter
Common trait: Uses structured data as the main carrier, enabling programmatic operations and precise retrieval, suitable for storing key information that requires disambiguation.
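A sketch of one Advanced JSON Card for the Dr. Chen example above; `subject_of` is a hypothetical helper showing why the metadata matters: without the `person` field, the system would wrongly attribute the eczema treatment to the user.

```python
# Hypothetical Advanced JSON Card: base fact plus metadata fields
# (backstory, person, relationship, timestamp) for disambiguation.
memory = {
    "fact": "Dermatologist Dr. Chen contacted for eczema treatment",
    "backstory": "User sought treatment for their daughter's eczema",
    "person": "Sarah",
    "relationship": "daughter",
    "timestamp": "2024-11-03T10:15:00Z",
}

def subject_of(mem: dict) -> str:
    """Resolve who a memory is about, defaulting to the user."""
    return mem.get("person", "user")

print(subject_of(memory))
```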
Limitations of Knowledge Graphs
The promise of knowledge graphs
Triple representation: entity–relationship–entity
Seemingly powerful
- More flexible information networks
- Suitable for representing complex relationships
- Supports graph queries
Practical issues
Semantic degradation is inevitable
Original expression:
“If it’s still raining next week, I’ll cancel my beach plan and go to the museum instead.”
Knowledge graph representation:
- (me, has plan, beach trip)
- (me, has backup plan, museum trip)
Lost information:
- Conditional relation: “if–then–else”
- Temporal dependency: “if it’s still raining next week”
- Core structure of the decision logic
Limitations in reasoning capability
Good at: structured queries
- Pattern matching
- Path finding
- Find all “plans” related to “me”
Not good at: logical reasoning
- Counterfactuals: “What if it doesn’t rain?”
- Hypothesis testing
- Analogical reasoning
Best practices
Natural language + structured metadata
Store complex information in full, concise natural language, augmented with structured metadata like JSON Cards for indexing and retrieval.
Achieve an optimal balance between information completeness and query efficiency.
Case Study: ChatGPT’s Memory System
Four-layer context architecture
Through reverse engineering, it’s been found that every time ChatGPT receives a message, it injects four layers of context:
1. Session Metadata
Device type, browser, time zone, subscription level, etc.; not retained after the session ends
2. User Memory
User-explicitly-stored long-term facts (e.g., “remember that I am…”), injected on every request
3. Recent Conversations Summary
Lightweight summary of recent conversations (about 15 messages), only includes user messages, not assistant replies
4. Current Session
Full conversation history within a sliding window; older messages are truncated when exceeding token limits
Key design choices
No vector database
Doesn’t use traditional RAG-style vector retrieval; instead uses precomputed lightweight summaries injected directly, trading detailed history for speed and efficiency
Passive memory mechanism
Only stores information when the user explicitly says “remember this” or when the model detects facts that meet OpenAI’s criteria
Simple Notes mode
Each memory is an independent factual statement, lacking structural links between pieces of information
API unavailable
Memory features are not exposed to developers, limiting third-party app integration
Reference: Manthan Gupta, “I Reverse Engineered ChatGPT’s Memory System”
Case Study: Claude’s Memory System
Core differences from ChatGPT
Claude uses a completely different memory architecture: on-demand retrieval rather than precomputed injection.
User Memories
Similar long-term fact storage to ChatGPT, but supports implicit updates—the system periodically updates memories in the background based on conversation content
Rolling Window
About 190k tokens of full message history; older messages are discarded once the limit is exceeded
conversation_search tool
On-demand search of historical conversations by topic or keyword, called only when the model deems it necessary
recent_chat tool
Time-based retrieval of recent conversations, also called on demand
Design philosophy comparison
ChatGPT: precompute + inject
Automatically injects conversation summaries for every request, ensuring basic cross-session continuity, but summaries are lightweight and lack detail
Claude: selective retrieval
Does not automatically inject historical summaries; instead, the model decides when it needs historical context and retrieves it via tool calls
| Dimension | ChatGPT | Claude |
|---|---|---|
| Continuity | Automatically ensured | Depends on model judgment |
| Depth of detail | Shallow | Can go deep on demand |
| Efficiency | Fixed overhead | Consumed on demand |
Reference: Manthan Gupta, “I Reverse Engineered Claude’s Memory System”
Limitations of ChatGPT and Claude Memory Systems
Shared shortcomings
Flat storage
Both lack associations and hierarchical structure between pieces of information, making it hard to represent complex semantic relationships
No disambiguation mechanism
When there are multiple related but distinct entities (such as two cars), there is no effective way to distinguish them
Lack of proactive service
Neither can achieve third-level proactive, anticipatory service
Individual issues
| ChatGPT | Claude |
|---|---|
| Conversation summaries are too brief, losing important details | Relies on the model to decide when to retrieve, which may miss relevant context |
Three-Level Evaluation Framework Comparison
| Level | ChatGPT | Claude |
|---|---|---|
| L1: Basic Recall | ✅ Meets | ✅ Meets |
| L2: Multi-Session Retrieval | ⚠️ Summaries too shallow | ⚠️ Retrieval unstable |
| L3: Proactive Service | ❌ Not implemented | ❌ Not implemented |
Directions for Improvement
- Use Advanced JSON Cards to enhance metadata
- Introduce context-aware automatic extraction
- Build an association graph between memories
- Implement memory-based proactive reasoning
Experiment: Comparison of Four Memory Modes
Experimental Design (projects/week2/user-memory)
Based on the three-level evaluation framework, systematically compare four modes:
| Mode | Simplicity | Expressiveness | Updatability | Applicable Scenarios |
|---|---|---|---|---|
| Simple Notes | ⭐⭐⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐ | Quickly recording temporary information |
| Enhanced Notes | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐ | Scenarios requiring full semantics |
| JSON Cards | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ | Structured information management |
| Advanced JSON | ⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | Key information that requires disambiguation |
Key Findings
There is no “best” mode
The optimal choice depends on the specific scenario, cost budget, and task requirements.
Hybrid use is the trend
Simple Notes for fast recording + Advanced JSON for handling key information
Part 3: Memory Retrieval
Limitations of Traditional RAG
Problem: Flattened Processing Causes Information Loss
Case 1: The Black Cat vs White Cat Counting Problem
There are 100 independent cases in the knowledge base:
- 90 black cats
- 10 white cats
User asks: “What is the ratio of black cats to white cats?”
RAG system dilemma:
- Retrieval is limited by top-k (e.g., k=20)
- Cannot guarantee recalling all cases
- Can only reason based on an incomplete sample
- Result: Incorrect ratio conclusion
Case 2: Xfinity Discount Rules
There are three isolated cases in the knowledge base:
- Veteran John successfully applied for the discount
- Doctor Sarah received the discount
- Teacher Mike was not eligible
User asks: “I am a nurse, can I get the discount?”
RAG system problem:
- “Nurse” is semantically similar to “doctor”
- Tends to retrieve Sarah’s case first
- Incorrectly infers that nurses are also eligible
- Root cause: Fails to recall the complete rule boundary
Core issue: A naive RAG approach—directly throwing raw cases into the knowledge base—is far from sufficient. You must invest compute at the indexing stage to actively distill, abstract, and structure the original knowledge.
Solution: Knowledge Distillation and Structuring
Correct Approach for Case 1
Pre-compute a statistical summary
Compress the 100 individual cases into:
“There are 100 cats in total:
- 90 black cats (90%)
- 10 white cats (10%)”
Result: A single retrieval yields accurate, complete statistical information.
Correct Approach for Case 2
Extract explicit rules
From the three isolated cases, extract:
“Xfinity discounts only apply to:
- Veterans
- Doctors
Other professions are not eligible.”
Result: No matter which profession the user asks about, a single retrieval returns a complete and accurate definition of the rule.
Core principle: Compress the “100 individual cases” into a statistical summary, and distill the “three isolated cases” into explicit rules. Only then can you build a truly efficient and reliable agent knowledge system.
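The index-time distillation for Case 1 takes only a few lines; `distill` is an illustrative helper, and in a real pipeline it would run at ingestion time so the summary is available as a single retrievable chunk.

```python
from collections import Counter

# Sketch of index-time distillation: compress raw cases into one
# retrievable summary instead of storing 100 individual documents.
cases = ["black cat"] * 90 + ["white cat"] * 10

def distill(cases: list[str]) -> str:
    counts = Counter(cases)
    total = len(cases)
    lines = [f"There are {total} cats in total:"]
    for kind, n in counts.most_common():
        lines.append(f"- {n} {kind}s ({100 * n // total}%)")
    return "\n".join(lines)

summary = distill(cases)
print(summary)  # a single chunk now answers any ratio question
```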
Structured Indexing: RAPTOR vs GraphRAG
RAPTOR: Tree-Like Hierarchical Structure
Bottom-up recursive abstraction
- Leaf nodes: Split documents into small text chunks
- Clustering: Group semantically similar chunks
- Summarization: Generate parent nodes for each group
- Recursion: Abstract layer by layer up to the root node
Retrieval process:
- Locate macro concepts from high-level summaries
- Drill down the tree to reach concrete details
- Retrieval path from macro to micro
Strengths: Captures hierarchical structure and abstraction relationships in knowledge.
GraphRAG: Network Association Graph
Entity-relationship modeling
- Extract entities: People, places, concepts, terms
- Extract relationships: Various relations between entities
- Community detection: Clusters of tightly related entities
- Cluster summarization: Generate summaries for communities
Retrieval process:
- Locate core entities
- Traverse relationship edges to find related entities
- Provide context via community analysis
Strengths: Reveals horizontal associations and network structure in knowledge.
Relation between the two: They are not substitutes but complements. The ideal solution combines them to build a three-dimensional knowledge index with both depth and breadth.
Context-Aware Retrieval: Solving Context Loss
Problem: Ambiguity of Isolated Text Chunks
Example chunk:
“The company’s revenue grew by 3% in the second quarter.”
Missing context:
- Which company is “the company”?
- When was the report released?
- Which product line is this related to?
Result: Severe semantic information loss, reduced retrieval accuracy.
Solution: Context Prefix
Anthropic’s context-aware retrieval
Step one: Generate a context prefix for the chunk.
The LLM generates:
“[This passage is excerpted from ACME’s 2025 Q2 financial report, ‘Key Performance Indicators’ section]”
Step two: Concatenate and index
“[This passage is excerpted from ACME’s 2025 Q2 financial report, ‘Key Performance Indicators’ section] The company’s revenue grew by 3% in the second quarter.”
Effect: Combined with BM25, retrieval failure rate drops by 49%; combined with a re-ranker, failure rate drops by up to 67%.
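The indexing step can be sketched as follows, with `generate_prefix` standing in for the real LLM call that writes the context prefix from the full document.

```python
# Sketch of contextual retrieval indexing: prepend an LLM-generated
# context prefix to each chunk before embedding / BM25 indexing.
# `generate_prefix` is a stub for the actual LLM call.
def generate_prefix(document_title: str, section: str) -> str:
    return f"[This passage is excerpted from {document_title}, '{section}' section]"

def contextualize(chunk: str, document_title: str, section: str) -> str:
    """What actually gets embedded and indexed, instead of the bare chunk."""
    return f"{generate_prefix(document_title, section)} {chunk}"

chunk = "The company's revenue grew by 3% in the second quarter."
indexed = contextualize(chunk, "ACME's 2025 Q2 financial report",
                        "Key Performance Indicators")
print(indexed)
```

At query time nothing changes: the same retriever runs, but each candidate chunk now carries the context needed to match "ACME Q2 revenue" style queries.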
Two-Layer Structure of User Memory
📋 JSON Cards (Resident Context)
Structured core facts, a personal cheat sheet
- Passport expires 2025-02 · Tokyo trip
🔍 Context-Aware RAG (On-Demand Retrieval)
Unstructured conversation details, a powerful search engine
- [Context: booking a January flight in November…]
🔗 The two must work together
- JSON Cards provide the factual framework
- LLM reasoning discovers potential associations
- RAG verifies and retrieves conversational evidence
- Proactive service: Passport is about to expire!
JSON Cards tell the agent “what exists”; RAG tells the agent “what the details are.” Both are indispensable.
Agent Memory Architecture
Resident Context
📋 Basic knowledge → System Prompt
User JSON Cards are placed directly into the agent context and can be accessed without tool calls.
Three Retrieval Tools
🔍 search_user_memory
Agentic Search on User Memory
- Backend: Embedding Search → Rerank → return related memories
🔍 search_conversations
Agentic Search on Conversation Summaries
- Backend: Embedding Search → return related historical conversation summaries
📜 load_recent_conversations
Load Last N Conversation Summaries
- Directly load summaries of the last N turns, no semantic search needed
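A toy sketch of the three tools' interfaces, with simple keyword matching standing in for the embedding-search and rerank backends described above.

```python
# Toy backends: in a real system these would be embedding search + rerank.
USER_MEMORIES = ["Passport expires 2025-02", "Owns a 2019 Honda Accord"]
CONVERSATION_SUMMARIES = ["Booked January flight to Tokyo",
                          "Scheduled Honda 30K service"]

def search_user_memory(query: str) -> list[str]:
    """Agentic search over user memories (keyword stub)."""
    words = query.lower().split()
    return [m for m in USER_MEMORIES if any(w in m.lower() for w in words)]

def search_conversations(query: str) -> list[str]:
    """Agentic search over conversation summaries (keyword stub)."""
    words = query.lower().split()
    return [s for s in CONVERSATION_SUMMARIES if any(w in s.lower() for w in words)]

def load_recent_conversations(n: int) -> list[str]:
    """Load the last n summaries directly, no semantic search."""
    return CONVERSATION_SUMMARIES[-n:]

print(search_user_memory("passport"))
```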
Architecture Diagram
(Architecture diagram: JSON Cards resident in the system prompt, with the three retrieval tools underneath)
Design principle: High-frequency information resides in context; long-tail details are retrieved on demand.
Proactive Service: A Natural Result of Storage + Retrieval
Core Insight
Proactive service is not an independent capability layer
Once storage and retrieval are done well, proactive service emerges naturally. It is an emergent result of structured storage working together with intelligent retrieval.
Why Does It Emerge Naturally?
Structured storage (JSON Cards) provides:
- Key facts resident in context (e.g., passport validity)
- Metadata that supports associative reasoning (e.g., timestamps, entity types)
Intelligent retrieval (context-aware RAG) provides:
- On-demand access to historical conversation details
- Automatic connection of related information fragments
Combined, the agent naturally discovers associations like “Tokyo ticket in January → passport expires in February.”
Examples of Proactive Service
International travel alert
JSON Cards store passport information + RAG retrieves flight bookings → automatically detects time conflicts.
Device damage handling
JSON Cards store device and insurance information → automatically list all applicable protection options.
Tax season preparation
JSON Cards store income types + RAG retrieves transaction records → automatically aggregates relevant documents.
Implementation path: There is no need to design a separate mechanism for “proactive service.” Focus on doing storage and retrieval well, and the LLM’s reasoning ability will handle the rest.
Part 4: Evaluating Memory
Why Do We Need Evaluation?
Evaluation Is the Compass of Agent Engineering
Building an agent system involves many design decisions, and these decisions often have no obvious “correct answer.”
Key decision points
- Workflow design: Workflow vs Autonomous mode
- Prompt design: Structured vs rule list
- Memory mode: Simple Notes vs JSON Cards
- Retrieval strategy: Precomputed injection vs on-demand retrieval
Core insight: Some seemingly reasonable designs actually harm performance, while some seemingly trivial details can bring significant gains. Only through rigorous comparative evaluation can these counterintuitive truths be revealed.
Threefold Value of Evaluation
1. Guides design decisions
Without evaluation, we can only rely on intuition, and intuition is often unreliable.
2. Provides improvement signals
Not only tells you “good or bad,” but more importantly reveals “why it is good/bad.”
3. Supports model upgrade decisions
When a new model is released, only by testing it on your own evaluation set can you make data-driven upgrade decisions.
Ablation study methodology: keep all other parts of the system unchanged, modify only one specific component, and observe the impact on overall performance
Basic components of an evaluation environment
Five core elements
1. Dataset
Defines a set of tasks; each task contains an initial state, goal description, and reference solution.
2. Environment state
Maintains all mutable information during task execution (database, file system, conversation history).
3. Tools
The channels through which the agent interacts with the environment; must be functionally complete but avoid over-simplification.
4. Rubric
Defines how to quantify agent performance; this is the most challenging part of evaluation.
5. Interaction protocol
Specifies interaction patterns and termination conditions.
Key principles for human–AI interaction evaluations
Progressive information disclosure
Never expose all of the user's information to the agent up front. Information should be disclosed progressively, as needed, over the course of the conversation.
User simulation
Use another LLM to play the user role, following predefined instructions to:
- Reveal necessary information step by step
- Respond to the agent’s questions
- Issue a termination signal after the task is completed
Dual verification
- Check whether the final database state is correct
- Check whether all necessary key information was output in the conversation
Reference: τ-bench / τ²-bench evaluation frameworks
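The dual-verification step can be sketched as two independent checks that must both pass; the schema and names below are illustrative, not the actual τ-bench harness.

```python
# Sketch of dual verification: correct final database state AND
# required key information present in the conversation transcript.
def verify(db_state: dict, expected_state: dict,
           transcript: str, required_facts: list[str]) -> bool:
    state_ok = all(db_state.get(k) == v for k, v in expected_state.items())
    info_ok = all(fact in transcript for fact in required_facts)
    return state_ok and info_ok

db = {"order_status": "cancelled"}
transcript = "Done. Your refund of $120 will arrive in 5-7 days."
print(verify(db, {"order_status": "cancelled"}, transcript, ["$120"]))
```

Checking only the database misses agents that act correctly but report wrong details; checking only the transcript misses agents that talk correctly but never execute. Both checks are needed.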
Rubric: the basis for LLM judgment
What is a rubric?
A rubric (structured scoring guideline) is the core tool that makes LLM-as-judge evaluation objective, consistent, and interpretable. It is similar to the scoring criteria for exams like the Gaokao, GRE writing, or TOEFL speaking.
Four design principles
Expert-guided: reflects domain expertise and captures the core facts and reasoning steps required for a correct response.
Comprehensive coverage: spans multiple dimensions (accuracy, coherence, completeness) and defines both positive and negative criteria.
Importance weighting: factual correctness must take precedence over stylistic clarity (Essential / Important / Optional / Pitfall).
Self-contained evaluation: each evaluation item is independently operable and does not rely on external context.
Example rubric for evaluating user memory
dimensions: …
Preventing reward hacking: explicitly define negative criteria in the rubric—hallucinations, flattering the user, keyword stuffing, and avoiding the question.
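Importance weighting and negative criteria can be sketched in code: a hypothetical rubric where Essential items dominate, Pitfall items subtract, and a simple `score` function consumes an LLM judge's boolean judgments.

```python
# Hypothetical rubric for the Level 1 bank-account scenario.
RUBRIC = [
    {"criterion": "States the correct account number", "tier": "essential", "weight": 3},
    {"criterion": "Includes the routing number",       "tier": "optional",  "weight": 1},
    {"criterion": "Hallucinates unstated details",     "tier": "pitfall",   "weight": -3},
]

def score(judgments: dict) -> float:
    """`judgments` maps criterion -> bool, as produced by an LLM judge.
    Returns a normalized score in [0, 1]; pitfalls subtract."""
    total = sum(item["weight"] for item in RUBRIC if judgments.get(item["criterion"]))
    max_positive = sum(item["weight"] for item in RUBRIC if item["weight"] > 0)
    return max(total, 0) / max_positive

print(score({"States the correct account number": True,
             "Includes the routing number": True}))
```

Note how the pitfall item makes reward hacking visible: a response that nails the account number but invents extra details scores no better than one that says nothing.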
Evaluation methodology: best practices from Anthropic
Three types of evaluation
Unit tests
Deterministic checks, used to verify format, edge cases, and other scenarios where correctness can be clearly judged.
LLM-as-judge
Use an LLM to assess output quality; combined with a clear rubric, this can achieve a high level of agreement with human judgment.
Human evaluation
Test under real-world conditions to discover the “rough edges” that automated evaluation cannot capture.
Characteristics of good evaluations
- Specific and clear: there is a single correct answer.
- Realistic: reflects problems that real users will actually encounter.
- Diagnosable: simple enough to understand the reason for failure.
- Representative: reflects the end-user experience.
Three dimensions of agent evaluation
Final answer correctness
Did the agent provide the correct final answer? Use an LLM judge to compare with the reference answer.
Tool-usage accuracy
Did the agent choose the correct tools? Were the parameters correct? Could it recover from errors?
Final state correctness (τ-bench)
Did the agent achieve the correct final state? Applicable to tasks with side effects (such as canceling an order).
Evaluation tips
- The more obvious the impact of each system change on the result, the fewer test samples you need.
- Use real tasks: real user scenarios that have clearly correct answers.
- Nothing can perfectly replace human evaluation: repeated testing and gut checks are indispensable.
Reference: Anthropic, “Context Engineering Best Practices” (AWS re:Invent 2025)
Part 5: Frontier Research
Limitations of existing agent memory systems
Problem: agents cannot learn from history
Current status
Existing LLM agents cannot effectively learn from accumulated interaction history when handling continuous task streams. Each task is processed in isolation, causing the system to repeatedly make past mistakes and lose valuable insights.
Root problem: lack of true self-evolution capability—the agent cannot grow stronger over time.
Defects of existing approaches
Two mainstream approaches
Raw trajectory storage
Directly store the interaction process, with no distillation.
Successful workflow logging
Only keep workflows/procedures and ignore failures.
Shared defects
- Cannot extract high-level, transferable reasoning patterns
- Overemphasize success and ignore the valuable lessons of failure
- Passive recording that cannot generate actionable guidance
ReasoningBank: a memory bank of reasoning strategies
Core innovation
Learning from both success and failure
ReasoningBank distills generalizable reasoning strategies from the agent’s self-judged successes and failures, without relying on ground-truth labels.
Difference in memory contents
| Method | Stored content |
|---|---|
| Raw trajectories | Complete interaction sequences |
| Successful workflows | Effective action patterns |
| ReasoningBank | Transferable reasoning strategies |
Closed-loop learning mechanism
1. Retrieve relevant memories
When facing a new task, retrieve semantically relevant reasoning strategies from ReasoningBank.
2. Guide action decisions
Use the retrieved strategies to guide the agent’s interaction process.
3. Analyze new experiences
After the task is finished, the agent self-judges success or failure.
4. Distill and integrate
Extract reasoning strategies from the new experiences and update ReasoningBank.
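The four steps above can be sketched as a toy loop. Everything here is an assumption for illustration: `embed`, `act`, `self_judge`, and `distill` are placeholders, not the paper's actual interfaces.

```python
# Minimal sketch of the closed loop described above (not the paper's API).
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class ReasoningBank:
    def __init__(self, embed):
        self.embed = embed          # text -> vector (placeholder)
        self.memories = []          # list of (vector, strategy_text)

    def retrieve(self, task: str, k: int = 1):
        """Step 1: fetch the k most relevant strategies (quality > quantity)."""
        q = self.embed(task)
        ranked = sorted(self.memories, key=lambda m: -cosine(q, m[0]))
        return [text for _, text in ranked[:k]]

    def integrate(self, strategies):
        """Step 4: store distilled strategies, positive or preventive alike."""
        for s in strategies:
            self.memories.append((self.embed(s), s))

def run_task(bank, task, act, self_judge, distill):
    hints = bank.retrieve(task)                    # 1. retrieve memories
    trajectory = act(task, hints)                  # 2. guide action decisions
    success = self_judge(task, trajectory)         # 3. self-judge the outcome
    bank.integrate(distill(trajectory, success))   # 4. distill and integrate
    return success
```

Note that `integrate` runs whether or not the task succeeded; that single design choice is what lets failures contribute preventive strategies instead of being discarded.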
Why are failure experiences equally important?
Valuable lessons in failure
Misconceptions in traditional thinking
Most memory systems focus only on successful cases, assuming failures are not worth keeping. Yet failure experiences contain critical “preventive” knowledge.
Example: web navigation task
Success teaches you:
“Click the ‘Men’s clothing’ category to find the product.”
Failure teaches you:
“Do not search directly on the homepage; the search box does not support complex queries well.”
Failure experiences provide boundary conditions that successful paths cannot cover.
The value of contrastive signals
Contrastive learning from success vs. failure
When there are both successful and failed examples for the same class of tasks, the agent can discover through comparison:
- Which strategies are effective in specific contexts
- Which seemingly reasonable paths actually fail
- The critical boundaries between success and failure
How ReasoningBank handles this
From successful experiences, it extracts: positive strategies (“doing this works”).
From failed experiences, it extracts: preventive strategies (“avoid doing this”).
Together they form more complete reasoning knowledge.
MaTTS: memory-aware test-time scaling
Depth vs. breadth
Two paths for expanding experience
Breadth scaling
Increase the number of tasks (more users, more scenarios).
Depth scaling (MaTTS)
Conduct more exploration for each task (more attempts, more variants).
Core idea of MaTTS
Allocate more compute to a single task to generate rich, diverse exploratory experiences, providing higher-quality contrastive signals for memory synthesis.
Synergy between memory and scaling
Positive feedback loop
High-quality memory → more effective exploration → richer experience → higher-quality memory
Two scaling modes
Parallel scaling
Generate multiple independent solution paths simultaneously.
Sequential scaling
Adjust the next attempt based on the previous result.
MaTTS establishes memory-driven experiential scaling as a new scaling dimension for agent systems.
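The two scaling modes can be contrasted in a few lines. `attempt` and `refine` are hypothetical stand-ins for one agent rollout; this is a shape sketch, not the paper's implementation:

```python
# Sketch of the two MaTTS scaling modes; `attempt` and `refine` are
# placeholder callables representing one agent rollout.
def parallel_scaling(task, attempt, n=4):
    """Generate n independent solution paths for the same task."""
    return [attempt(task, seed=i) for i in range(n)]

def sequential_scaling(task, attempt, refine, n=4):
    """Condition each attempt on the previous result."""
    trajectories, prev = [], None
    for _ in range(n):
        prev = attempt(task, seed=0) if prev is None else refine(task, prev)
        trajectories.append(prev)
    return trajectories
```

Either way the output is a set of diverse trajectories for one task, which is exactly the raw material the contrastive memory-synthesis step needs.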
Experimental results and key findings
Benchmark results
Three evaluation settings
WebArena (web browsing)
- Complex web-interaction tasks
- Requires multi-step navigation and operations
Mind2Web (web understanding)
- Element recognition on real-world web pages
- Action prediction and execution
SWE-Bench-Verified (software engineering)
- Codebase-level bug fixing
- Requires understanding large codebases
Key metrics
- Effectiveness: up to 34.2% relative improvement
- Efficiency: 16.0% fewer interaction steps
Key findings
Memory quality > quantity
Retrieving 1 relevant memory outperforms retrieving 4. Too many memories may introduce conflicts or noise.
Unique value of failure experiences
Systems that incorporate failure experiences outperform those that learn only from successes.
Emergent behaviors
As memories accumulate, the agent begins to exhibit complex reasoning strategies not seen before.
Synergy of MaTTS and memory
The combination of ReasoningBank + MaTTS performs best, confirming the positive feedback loop between memory and scaling.
Insights from ReasoningBank for user-memory systems
From task memory to user memory
Shared core challenges
ReasoningBank addresses how an agent learns from task interactions; user memory systems address how an agent understands and serves users. They face similar core questions:
- How to distill high-level knowledge from raw data?
- How to retrieve truly relevant information?
- How to enable the memory system to evolve continuously?
Key insight: you cannot simply store raw data; you must invest compute in active distillation, abstraction, and structuring.
Transferable design principles
Principle 1: bidirectional learning
Learn not only from users’ positive feedback (preferences) but also from negative feedback (boundaries).
Principle 2: closed-loop updates
A memory system is not built once and for all; it evolves continuously with interactions.
Principle 3: quality first
The relevance and quality of memories matter more than their quantity.
Principle 4: self-judgment
Use LLM-as-a-judge to automate quality evaluation and reduce reliance on manual labeling.
Summary: The Evolution from Memory to Cognition
Technical Evolution Path
1. Remembering Facts
Simple Notes / JSON Cards
✓ Accurately store structured information
2. Understanding Context
Enhanced Notes / Advanced JSON
✓ Preserve semantic integrity and situational information
3. Cross-Session Association
Structured indexing + context-aware retrieval
✓ Disambiguate and discover composite events
4. Proactive Anticipation
Dual-layer memory architecture + deep reasoning
✓ Provide help without explicit requests
Key Insights
Personalization is a real need
From the success of recommendation systems, personalized products are more in line with human nature. AI Agents also need personalized memory to adapt to each user’s unique values and preferences.
Preference learning is the hard part
Factual information is relatively simple, but learning user preferences faces challenges such as context dependence and over-generalization, requiring fine-grained evaluation and continuous iteration.
Knowledge distillation is critical
You can’t just dump raw data into a knowledge base; you must invest compute in proactive distillation, abstraction, and structuring.
A dual-layer architecture is the optimal solution
Structured core facts (always in context) + context-aware retrieval (on-demand access) strike a balance between completeness and efficiency.
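Read architecturally, the dual-layer idea means core facts are injected unconditionally while everything else is retrieved per query. A naive sketch, using keyword overlap in place of real semantic retrieval; all names here are illustrative:

```python
# Toy dual-layer context assembly: core facts always present, episodic
# memories fetched on demand. Keyword overlap stands in for semantic search.
def build_context(core_facts: dict, episodic: list, query: str, k: int = 2):
    q_words = set(query.lower().split())
    ranked = sorted(episodic,
                    key=lambda m: -len(q_words & set(m.lower().split())))
    return {
        "always_in_context": core_facts,  # small, structured, always loaded
        "retrieved": ranked[:k],          # on-demand, relevance-ranked
    }
```

The split is what balances completeness against efficiency: the always-loaded layer stays small enough to fit every prompt, while the retrieved layer can grow without bound.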
Future Outlook
Technical Challenges
Refinement of preference learning
- Better modeling of context dependence
- Distinguishing one-off behaviors from long-term preferences
- Reducing the risk of over-generalization
Memory compression and organization
- Automatically discovering knowledge hierarchies
- Dynamically adjusting memory structure
- Balancing level of detail and accessibility
Cross-modal memory integration
- Unified representations of text, images, and audio
- Associative retrieval across multimodal information
Application Prospects
Personalized value alignment
- From universal values to individual values
- Dynamically adapting to the evolution of user values
- Achieving true personalization at the level of details
Operating-system-level assistants
- Unified memory across devices and applications
- Long-term, continuous user profile construction
- Truly proactive services
Privacy and transparency
- Complete user control over memory
- Explainable memory management
- Tiered protection for sensitive information
Vision
To build a truly “understanding you” AI assistant that not only remembers what you say, but understands who you are, anticipates your needs, and becomes a trustworthy lifelong companion.
From simple recording to deep understanding, from passive response to proactive service