View Slides (HTML), Download PDF Version

Slides Source Code

Contents

  • 01 | The Importance and Challenges of Memory - Personalization Value · Three Capability Levels
  • 02 | Representation of Memory - Notes · JSON Cards
  • 03 | Retrieval of Memory - RAG · Context Awareness
  • 04 | Evaluation of Memory - Rubric · LLM Judge
  • 05 | Frontier Research - ReasoningBank

Starting from personalization needs → Understanding memory challenges → Designing storage schemes → Implementing intelligent retrieval → Scientific evaluation and iteration

Part 1: The Importance and Challenges of Memory

Personalization is a real need, and the core competitive advantage of the future

The evolution of recommender systems

  • Traditional media: One People’s Daily, everyone sees the same content
  • The ByteDance revolution: Everyone sees different content — “Everyone lives in a different world and has different values”
  • Conclusion: Personalized products are more in line with human nature → Users are more willing to use them

The future of AI is the same

There should not be only one Universal Value

  • It should adapt to each user’s values and preferences
  • Value differences in details are huge
  • Personalization is the core competitive advantage of AI products

Key insight: Just as recommender systems improve user experience through personalized content, AI Agents also need personalized memory to understand and serve each unique user.

Technical difficulty: Remembering facts vs. learning preferences

Factual Information

Relatively easy

  • Birthday, address, card number
  • Work information, contact details
  • Just remember them, no ambiguity

We are already doing this pretty well

Examples:

  • “My membership number is 12345” ✅
  • “My birthday is January 1, 1990” ✅
  • “I live in Haidian District, Beijing” ✅

User Preference

Very hard, requires solving multiple challenges:

1. Strong context dependence

  • User requires academic format when writing papers
  • Does not mean travel guides should also be academic
  • AI easily over-generalizes preferences

2. One-off behavior vs. long-term preference

  • “Yesterday I ordered Sichuan food” ≠ “User likes Sichuan cuisine”
  • Might just be a friend’s preference, or a one-time whim

3. Requires extremely fine-grained evaluation

  • Requires data and tests to get the balance right
  • Cannot rely on gut feeling

Alignment of personalization value

Analogy: the success of recommender systems

Traditional approach: Universal human values

  • LLMs are aligned to “universal” values
  • But do we really have universally agreed human values?
  • In details, value differences are huge

What AI should do is

  • Not just a single universal value
  • Adapt to each user’s values and preferences
  • Recognize that value differences are huge

From recommendation to alignment: The evolution of AI

Just as ByteDance believes that “everyone lives in a different world and has different values”, AI Agents also need to:

  1. Understand individual differences: Each user has unique values and preferences
  2. Adapt dynamically: Continuously adjust based on user behavior and feedback
  3. Be context-aware: The same user has different needs in different scenarios

User memory is more than logging conversations

The essence of memory

Just like understanding friends

  • We don’t remember every sentence they say
  • We build a mental model of who they are
  • Their preferences, habits, values

Core analogy: The goal of a user memory system is to build a model of the user that is as concise and powerful as possible, capable of explaining the user’s past behavior and predicting the user’s future needs.

Comparison of two types of memory

Type        | Difficulty | Example
Facts       | Simple     | Birthday, address, card number
Preferences | Complex    | Context-dependent, constantly evolving

Learning user preferences is much harder than storing factual information

  • Context-dependent: Academic writing style ≠ travel guide style
  • One-off vs. long-term: “Ordered Sichuan food yesterday” ≠ “Likes spicy food”
  • Over-generalization risk: AI easily extrapolates incorrectly

Three levels of memory capability

Level 1: Basic recall

Store and retrieve explicit user information

  • “My membership number is 12345” → Accurate recall
  • Foundation of reliability

Level 2: Cross-session retrieval

Connect information across different conversations

  • Disambiguation: “Schedule maintenance for my car” → Which of the two cars?
  • Understanding composite events: “Cancel my trip to Los Angeles” → Find flights + hotel

Level 3: Proactive service

Anticipate needs without explicit requests

  • Booking an international flight? → Check if passport is near expiry
  • The highest manifestation of intelligence

Our evaluation framework

Based on these three levels, we designed 60 test cases (20 per level), each case containing 1–3 sessions, each session about 50 turns of dialogue with a large amount of factual detail. We use an LLM-as-a-judge + Rubric approach to score the agent’s responses on multiple dimensions.
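
To make one of these concrete, a single test case can be sketched as a plain data record (a minimal sketch; the field names are illustrative, not the framework's actual schema):

test_case = {
    "level": 2,
    "scenario": "multi_car_disambiguation",
    "sessions": [                     # 1-3 sessions, each ~50 turns
        {"id": "insurance", "turns": [...]},
        {"id": "maintenance", "turns": [...]},
    ],
    "final_question": "I need to schedule service for my car.",
    "rubric": {                       # dimensions scored by an LLM judge
        "detects_ambiguity": "essential",
        "lists_both_cars": "important",
    },
}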

Level 1: Evaluation of basic recall

Scenario: Bank account setup

Accurately store and retrieve structured information provided by the user in a long conversation.

Test case

  • 47-minute conversation about opening a bank account
  • Includes name, address, SSN, account numbers, etc.
  • A large number of details across 50+ turns of dialogue

Final question: “What’s my checking account number? I need to set up direct deposit.”

Expected answer: Accurately provide account number 4429853327, and preferably also the routing number 123006800

Conversation excerpt

- user: I live at 1847 Maple Street, 
Apartment 3B, Portland, Oregon.
- assistant: Thank you. Phone number?
- user: My cell is 503-555-8924.
...
- assistant: Your new checking account
number is 4429853327.
- user: Let me write that down...
4429853327, right?
- assistant: Correct. And your savings
account is 4429853328.
...
- user: Can I use PIN 4827?
- assistant: Yes, 4827 is set as your PIN.
...
- assistant: Your online banking username
is MRobertson503.

Key: Precisely retrieve a specific account number from 50+ turns of dialogue

Level 2: Cross-session retrieval — Disambiguation scenario

Scenario: Service appointment for multiple cars

The user mentions owning multiple cars in different sessions; when the request is ambiguous, the Agent needs to proactively disambiguate.

Session 1: Adding a new car to insurance
User William Chen adds a 2023 Tesla Model 3 to insurance, existing 2019 Honda Accord already on policy

Session 2: Scheduling car maintenance
A 30K maintenance service for the Honda Accord is scheduled for November 24 at 8AM

Final question: “I need to schedule service for my car.”

Expected behavior: Detect ambiguity, list the status of both cars, and ask which one specifically

Conversation excerpt

# Session 1 - Insurance
- user: I just bought a 2023 Tesla Model 3.
- assistant: Is this replacing the Honda?
- user: It's an addition. I'm keeping the
Honda for my wife to drive.
...
- assistant: Honda is SF-789234501-01,
Tesla is SF-789234501-02.

# Session 2 - Maintenance appointment
- user: I need an oil change and the
30,000 mile service.
- assistant: What vehicle?
- user: It's a 2019 Honda Accord.
...
- assistant: Friday Nov 24th at 8 AM.
Confirmation: FS-447291.

Key: Discover that the user has two cars, Honda already has an appointment, Tesla does not

Level 2: Composite events — Cascading effect of trip cancellation

Scenario: One big event contains many small events

When the user says “Cancel my trip to Los Angeles”, the system needs to understand that “trip” is a composite event containing multiple independent bookings.

Associated bookings that need to be found automatically

  • Flight to Los Angeles
  • Hotel booking in Los Angeles
  • Possible car rental
  • Event tickets, restaurant reservations, etc.

Final question: “Cancel my LA trip next week.”

Expected behavior: Automatically associate all related bookings, provide unified cancellation options, and explain the cancellation policies and refund status of each item

Information scattered across three independent sessions

# Session 1 - Flight booking (Delta)
- Flight: DL 1234
- Dates: Dec 20 - Dec 23
- Destination: Los Angeles
- Confirmation: DELTA-ABC123

# Session 2 - Hotel booking (Marriott)
- Hotel: Marriott Downtown LA
- Check-in: Dec 20
- Check-out: Dec 23
- Confirmation: MAR-789456

# Session 3 - Car rental (Hertz)
- Company: Hertz
- Pickup: LAX, Dec 20, 3PM
- Return: LAX, Dec 23, 12PM
- Confirmation: HERTZ-456789

Key: The three sessions interact with different service providers, but all belong to the same “Los Angeles trip”

Level 2: Overwrite handling — Multiple modifications to an order

Scenario: A constantly modified custom furniture order

The user custom-orders a dining set, but repeatedly changes the requirements during production. The Agent needs to track all changes and keep only the currently valid specifications.

Order change history

  • August 20: Ordered walnut dining table + 8 gray chairs + 1 bench
  • September 5: Chair color changed to sage green, 2 changed to armchairs
  • October 28: Green fabric discontinued, user needs to choose a new color

Final question: “What’s the current status of my dining set order?”

Expected behavior: Return only the latest status: waiting for fabric selection, delivery date pending; do not confuse with historical specifications

Key changes across three sessions

# Session 1 - Initial order (Aug 20)
- Table: walnut "Hamilton", 72" ($4,100)
- Chairs: 8 standard chairs, gray ($3,400)
- Bench: 1 matching bench ($650)
- Delivery: Nov 5

# Session 2 - Design change (Sep 5)
- Chair color: gray → sage green
- Chair type: 2 upgraded to armchairs (+$200)
- Delivery: Nov 5 → Nov 12

# Session 3 - Fabric issue (Oct 28)
- Problem: sage green fabric discontinued!
- Status: waiting for the customer to choose from new samples
- Delivery: TBD (depends on selection time)

Key: The Agent must recognize that old information has been overwritten, and only the latest session’s status is valid

Level 3: Proactive service — Passport expiry warning

Scenario: International travel coordination

The user mentions different pieces of information in multiple independent sessions but never connects them. The Agent needs to proactively connect these scattered pieces and reason about potential risks.

Risks the AI needs to infer

  • Passport expiry date: February 18, 2025 (mentioned in Session 1)
  • Return date: January 22, 2025 (mentioned in Session 2)
  • Japan requires that a passport be valid for ≥ 6 months at the time of entry!

Final question: “I’m finalizing my trip to Tokyo in January. Is there anything I need to take care of before I go?”

Expected behavior: Proactively connect passport validity with travel dates and remind the user that the passport may not meet Japan’s entry requirements

Excerpts from three independent sessions

# Session 1 - June, passport address update (USPS)
- user: I need to update my address on file.
- assistant: I'll update that. I see your
passport expires Feb 18, 2025.
- user: Thanks, I'll deal with renewal later.
# (passport expiry mentioned only in passing; travel not discussed)

# Session 2 - November, flight booking (Delta)
- user: I want to book Tokyo, Jan 15-22.
- assistant: Great! I found flights for you.
- user: Book the 2pm departure please.
- assistant: Done. Confirmation: DELTA-JMK892
# (only the flight was booked; the passport never came up)

# Session 3 - October, credit card (Chase)
- user: Will my Sapphire Reserve work abroad?
- assistant: Yes, no foreign transaction fees.
Trip insurance covers purchases.

Key: The relationship between passport and travel is never discussed in any of the three sessions; the AI needs to reason this out by itself!

Level 3: Proactive Service — Integrating Device Damage Protection

Scenario: Phone screen shattered

The user says “My phone screen just cracked.” The Agent needs to proactively integrate different protection information scattered across sessions and find the best solution.

Protection sources the AI needs to infer

  • Manufacturer warranty (Apple 1-year, until Feb 2025)
  • Credit card protection (Chase Sapphire, $50 deductible)
  • Carrier insurance (user declined, not applicable)

Final question: “My phone screen just cracked. What are my options?”

Expected behavior: Proactively list all protection options, compare costs and processes, and recommend the optimal plan (Chase credit card protection)

Information scattered across multiple independent sessions

# Session 1 - February, phone purchase (Best Buy)
- user: I'll use my Chase Sapphire Reserve.
- assistant: That card extends warranties
and has purchase protection.
- Phone: iPhone 14 Pro, $1,099
# (manufacturer warranty: until Feb 2025)

# Session 2 - February, phone activation (Verizon)
- user: No thanks, I don't need insurance.
- assistant: You can add it later if needed.
# (user declined Verizon Mobile Protect)

# Session 3 - August, billing inquiry (Chase)
- user: Do I have phone protection?
- assistant: Yes, up to $800 per claim,
$50 deductible. Must pay bill
with this card monthly.
# (confirms the credit card protection is still active)

Key: The AI needs to integrate three sources and infer that Chase protection is the optimal choice ($50 deductible vs Apple $379)

Level 3: Proactive Service — Tax Season Preparation

Scenario: Proactive reminder before tax season

When the user mentions “preparing my taxes” in early January, the Agent should proactively aggregate tax-related information scattered across different conversations throughout the year.

Historical information the AI needs to proactively associate

  • February: Mortgage application (interest $31,000, points $7,500)
  • June: Stock sale (Apple, capital gains $33,000)
  • August: Charitable donation (Microsoft stock $25,200)
  • October: Side consulting income ($18,000, has a home office)

Final question: “I’m preparing my taxes. What should I know?”

Expected behavior: Proactively list all tax-related items, remind about required forms, and flag commonly missed deductions

Tax information scattered across year-long conversations

# Session 1 - February, mortgage (First National Bank)
- Loan: $500,000 at 6.75% interest
- Interest: ~$31,000/year (deductible)
- Points: $7,500 (deductible)

# Session 2 - June, stocks (Charles Schwab)
- Sold: 300 Apple shares for $55,590
- Cost basis: $22,500 → capital gains $33,000

# Session 3 - August, donation (United Way)
- Donated: 72 Microsoft shares, worth $25,200
- Avoided capital gains tax: $4,374

# Session 4 - October, side business (SBDC)
- Income: $18,000; home office 6%
- Self-employment tax: ~$1,683

Key: Four sessions spanning the whole year; the AI needs to proactively aggregate and remind the user to prepare all relevant forms

Level 3: Core Capabilities of Proactive Service

Fundamental differences from Levels 1 and 2

Level | Trigger mode             | Information source
L1    | User asks directly       | Single session
L2    | User gives vague request | Multiple sessions
L3    | No need to ask           | Across time and domains

Three typical scenarios recap

  • Passport alert: Flight ticket + passport validity → entry risk
  • Device protection: Purchase + credit card + insurance → optimal plan
  • Tax prep: Year-long transaction records → complete tax checklist

Key technical challenges

Time span: Need to connect conversations from months or even years ago and identify still-relevant information

Cross-domain: Associate and reason over information from different providers and scenarios

Proactive reasoning: User doesn’t explicitly request it, but the Agent should proactively discover and remind

Priority judgment: Identify truly urgent and important issues, avoiding information overload

Part 2: Representation of Memory

Memory Representation (I): Natural Language

Simple Notes mode

Minimalist design

Each memory is an atomic factual statement:

  • “User email: john@example.com”
  • “Preferred programming language: Python”
Advantages                   | Disadvantages
Extremely low cognitive load | Loss of information associations
O(1) operational complexity  | Semantic fragmentation

Enhanced Notes mode

Full context preserved

Paragraph-style storage of full background:

“The user works at TechCorp as a senior software engineer, has focused on machine learning for three years, and is currently leading a recommendation system project.”

Advantages                    | Disadvantages
Semantic completeness         | Redundant storage
Narrative structure preserved | Complex updates

Common trait: Uses natural language as the main carrier, suitable for human reading and understanding, but lacks machine-operable structured information.

Memory Representation (II): Structured

JSON Cards mode

Structured organization

Three-layer nesting: category → subcategory → key-value pairs

{
  "personal": {
    "contact": {"email": "john@example.com"}
  },
  "work": {
    "position": {"title": "Senior Engineer"}
  }
}
Advantages      | Disadvantages
Partial updates | Rigid structure
Extensible      | Hard to classify multi-dimensional information

Advanced JSON Cards mode

Contextual knowledge management

Add metadata fields on top of basic JSON:

  • backstory: narrative background of the information source
  • person: identity of the subject
  • relationship: relationship between subject and user
  • timestamp: record timestamp

Example: “Dermatologist Dr. Chen contacted for eczema treatment for 8-year-old daughter Sarah”
→ person: Sarah, relationship: daughter
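
A sketch of the resulting card (shown as a Python dict; the metadata fields follow the list above, and everything else is illustrative):

memory_card = {
    "category": "health",
    "subcategory": "providers",
    "content": {"dermatologist": "Dr. Chen"},
    # Metadata fields that enable disambiguation and precise retrieval:
    "backstory": "Contacted for eczema treatment for 8-year-old daughter Sarah",
    "person": "Sarah",
    "relationship": "daughter",
    "timestamp": "2025-10-16T09:30:00Z",
}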

Common trait: Uses structured data as the main carrier, enabling programmatic operations and precise retrieval, suitable for storing key information that requires disambiguation.

Limitations of Knowledge Graphs

The promise of knowledge graphs

Triple representation: entity–relationship–entity

Seemingly powerful

  • More flexible information networks
  • Suitable for representing complex relationships
  • Supports graph queries

Practical issues

Semantic degradation is inevitable

Original expression:
“If it’s still raining next week, I’ll cancel my beach plan and go to the museum instead.”

Knowledge graph representation:

  • (me, has plan, beach trip)
  • (me, has backup plan, museum trip)

Lost information:

  • Conditional relation: “if–then–else”
  • Temporal dependency: “if it’s still raining next week”
  • Core structure of the decision logic

Limitations in reasoning capability

Good at: structured queries

  • Pattern matching
  • Path finding
  • Find all “plans” related to “me”

Not good at: logical reasoning

  • Counterfactuals: “What if it doesn’t rain?”
  • Hypothesis testing
  • Analogical reasoning

Best practices

Natural language + structured metadata

Store complex information in full, concise natural language, augmented with structured metadata like JSON Cards for indexing and retrieval.

Achieve an optimal balance between information completeness and query efficiency.
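
For instance, the rain-contingent plan from the knowledge-graph example could be stored like this (a sketch; the field names are ours):

memory_record = {
    # Full natural language preserves the if-then-else and temporal structure
    # that the triple representation lost.
    "text": ("If it's still raining next week, the user will cancel the beach "
             "plan and go to the museum instead."),
    # JSON-Cards-style metadata, used only for indexing and retrieval.
    "metadata": {
        "type": "conditional_plan",
        "entities": ["beach trip", "museum trip"],
        "time_scope": "next week",
    },
}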

Case Study: ChatGPT’s Memory System

Four-layer context architecture

Through reverse engineering, it’s been found that every time ChatGPT receives a message, it injects four layers of context:

1. Session Metadata
Device type, browser, time zone, subscription level, etc.; not retained after the session ends

2. User Memory
User-explicitly-stored long-term facts (e.g., “remember that I am…”), injected on every request

3. Recent Conversations Summary
Lightweight summary of recent conversations (about 15 messages), only includes user messages, not assistant replies

4. Current Session
Full conversation history within a sliding window; older messages are truncated when exceeding token limits
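
Conceptually, the injection step amounts to something like the sketch below (our own reconstruction for illustration; none of these names come from OpenAI):

def build_context(session_meta: str, user_memory: str,
                  recent_summary: str, current_session: str) -> str:
    """Concatenate the four layers into the context sent with every message."""
    return "\n\n".join([
        f"[Session metadata]\n{session_meta}",        # discarded when the session ends
        f"[User memory]\n{user_memory}",              # injected on every request
        f"[Recent conversations]\n{recent_summary}",  # lightweight, user messages only
        f"[Current session]\n{current_session}",      # sliding window, truncated as needed
    ])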

Key design choices

No vector database
Doesn’t use traditional RAG-style vector retrieval; instead uses precomputed lightweight summaries injected directly, trading detailed history for speed and efficiency

Passive memory mechanism
Only stores information when the user explicitly says “remember this” or when the model detects facts that meet OpenAI’s criteria

Simple Notes mode
Each memory is an independent factual statement, lacking structural links between pieces of information

API unavailable
Memory features are not exposed to developers, limiting third-party app integration

Reference: Manthan Gupta, “I Reverse Engineered ChatGPT’s Memory System”

Case Study: Claude’s Memory System

Core differences from ChatGPT

Claude uses a completely different memory architecture: on-demand retrieval rather than precomputed injection.

User Memories
Similar long-term fact storage to ChatGPT, but supports implicit updates—the system periodically updates memories in the background based on conversation content

Rolling Window
About 190k tokens of full message history; older messages are discarded once exceeded

conversation_search tool
On-demand search of historical conversations by topic or keyword, called only when the model deems it necessary

recent_chat tool
Time-based retrieval of recent conversations, also called on demand

Design philosophy comparison

ChatGPT: precompute + inject
Automatically injects conversation summaries for every request, ensuring basic cross-session continuity, but summaries are lightweight and lack detail

Claude: selective retrieval
Does not automatically inject historical summaries; instead, the model decides when it needs historical context and retrieves it via tool calls

Dimension       | ChatGPT               | Claude
Continuity      | Automatically ensured | Depends on model judgment
Depth of detail | Shallow               | Can go deep on demand
Efficiency      | Fixed overhead        | Consumed on demand

Reference: Manthan Gupta, “I Reverse Engineered Claude’s Memory System”

Limitations of ChatGPT and Claude Memory Systems

Shared shortcomings

Flat storage
Both lack associations and hierarchical structure between pieces of information, making it hard to represent complex semantic relationships

No disambiguation mechanism
When there are multiple related but distinct entities (such as two cars), there is no effective way to distinguish them

Lack of proactive service
Neither can achieve third-level proactive, anticipatory service

Individual issues

ChatGPT: Conversation summaries are too brief, losing important details
Claude: Relies on the model to decide when to retrieve, which may miss relevant context

Three-Level Evaluation Framework Comparison

Level                       | ChatGPT                  | Claude
L1: Basic Recall            | ✅ Meets                 | ✅ Meets
L2: Cross-Session Retrieval | ⚠️ Summaries too shallow | ⚠️ Retrieval unstable
L3: Proactive Service       | ❌ Not implemented       | ❌ Not implemented

Directions for Improvement

  • Use Advanced JSON Cards to enhance metadata
  • Introduce context-aware automatic extraction
  • Build an association graph between memories
  • Implement memory-based proactive reasoning

Experiment: Comparison of Four Memory Modes

Experimental Design (projects/week2/user-memory)

Based on the three-level evaluation framework, systematically compare four modes:

Mode           | Simplicity | Expressiveness | Updatability | Applicable Scenarios
Simple Notes   | ⭐⭐⭐⭐⭐ | ⭐⭐           | ⭐⭐⭐⭐     | Quickly recording temporary information
Enhanced Notes | ⭐⭐⭐     | ⭐⭐⭐⭐⭐     | ⭐⭐         | Scenarios requiring full semantics
JSON Cards     | ⭐⭐⭐     | ⭐⭐⭐         | ⭐⭐⭐⭐     | Structured information management
Advanced JSON  | ⭐⭐       | ⭐⭐⭐⭐⭐     | ⭐⭐⭐       | Key information that requires disambiguation

Key Findings

There is no “best” mode
The optimal choice depends on the specific scenario, cost budget, and task requirements.

Hybrid use is the trend
Simple Notes for fast recording + Advanced JSON for handling key information

Part 3: Memory Retrieval

Limitations of Traditional RAG

Problem: Flattened Processing Causes Information Loss

Case 1: The Black Cat vs White Cat Counting Problem

There are 100 independent cases in the knowledge base:

  • 90 black cats
  • 10 white cats

User asks: “What is the ratio of black cats to white cats?”

RAG system dilemma:

  • Retrieval is limited by top-k (e.g., k=20)
  • Cannot guarantee recalling all cases
  • Can only reason based on an incomplete sample
  • Result: Incorrect ratio conclusion

Case 2: Xfinity Discount Rules

There are three isolated cases in the knowledge base:

  • Veteran John successfully applied for the discount
  • Doctor Sarah received the discount
  • Teacher Mike was not eligible

User asks: “I am a nurse, can I get the discount?”

RAG system problem:

  • “Nurse” is semantically similar to “doctor”
  • Tends to retrieve Sarah’s case first
  • Incorrectly infers that nurses are also eligible
  • Root cause: Fails to recall the complete rule boundary

Core issue: A naive RAG approach—directly throwing raw cases into the knowledge base—is far from sufficient. You must invest compute at the indexing stage to actively distill, abstract, and structure the original knowledge.

Solution: Knowledge Distillation and Structuring

Correct Approach for Case 1

Pre-compute a statistical summary

Compress the 100 individual cases into:

“There are 100 cats in total:

  • 90 black cats (90%)
  • 10 white cats (10%)”

Result: A single retrieval yields accurate, complete statistical information.

Correct Approach for Case 2

Extract explicit rules

From the three isolated cases, extract:

“Xfinity discounts only apply to:

  • Veterans
  • Doctors

Other professions are not eligible.”

Result: No matter which profession the user asks about, a single retrieval returns a complete and accurate definition of the rule.

Core principle: Compress the “100 individual cases” into a statistical summary, and distill the “three isolated cases” into explicit rules. Only then can you build a truly efficient and reliable agent knowledge system.
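
As a sketch of what "investing compute at the indexing stage" means in the simplest case, the black/white cat statistics can be precomputed in a few lines (a toy illustration, not a production pipeline):

from collections import Counter

def distill_counts(cases: list[dict]) -> str:
    """Compress individual cases into one retrievable statistical summary."""
    counts = Counter(case["color"] for case in cases)
    total = sum(counts.values())
    lines = [f"There are {total} cats in total:"]
    for color, n in counts.most_common():
        lines.append(f"- {n} {color} cats ({n / total:.0%})")
    return "\n".join(lines)

cases = [{"color": "black"}] * 90 + [{"color": "white"}] * 10
print(distill_counts(cases))   # one chunk now answers the ratio question exactly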

Structured Indexing: RAPTOR vs GraphRAG

RAPTOR: Tree-Like Hierarchical Structure

Bottom-up recursive abstraction

  1. Leaf nodes: Split documents into small text chunks
  2. Clustering: Group semantically similar chunks
  3. Summarization: Generate parent nodes for each group
  4. Recursion: Abstract layer by layer up to the root node

Retrieval process:

  • Locate macro concepts from high-level summaries
  • Drill down the tree to reach concrete details
  • Retrieval path from macro to micro

Strengths: Captures hierarchical structure and abstraction relationships in knowledge.
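
A minimal sketch of one round of this bottom-up step, using k-means in place of RAPTOR's soft clustering; embed and summarize stand in for an embedding model and an LLM summarizer:

import numpy as np
from sklearn.cluster import KMeans

def build_parent_level(chunks, embed, summarize, n_clusters):
    """Cluster sibling chunks and summarize each cluster into a parent node."""
    vectors = np.array([embed(c) for c in chunks])
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(vectors)
    parents = []
    for k in range(n_clusters):
        group = [c for c, label in zip(chunks, labels) if label == k]
        parents.append(summarize(group))   # LLM-written summary in practice
    return parents   # recurse on the parents until a single root remains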

GraphRAG: Network Association Graph

Entity-relationship modeling

  1. Extract entities: People, places, concepts, terms
  2. Extract relationships: Various relations between entities
  3. Community detection: Clusters of tightly related entities
  4. Cluster summarization: Generate summaries for communities

Retrieval process:

  • Locate core entities
  • Traverse relationship edges to find related entities
  • Provide context via community analysis

Strengths: Reveals horizontal associations and network structure in knowledge.

Relation between the two: They are not substitutes but complements. The ideal solution combines them to build a three-dimensional knowledge index with both depth and breadth.

Context-Aware Retrieval: Solving Context Loss

Problem: Ambiguity of Isolated Text Chunks

Example chunk:

“The company’s revenue grew by 3% in the second quarter.”

Missing context:

  • Which company is “the company”?
  • When was the report released?
  • Which product line is this related to?

Result: Severe semantic information loss, reduced retrieval accuracy.

Solution: Context Prefix

Anthropic’s context-aware retrieval

Step one: Generate a context prefix for the chunk.

The LLM generates:

“[This passage is excerpted from ACME’s 2025 Q2 financial report, ‘Key Performance Indicators’ section]”

Step two: Concatenate and index

“[This passage is excerpted from ACME’s 2025 Q2 financial report, ‘Key Performance Indicators’ section] The company’s revenue grew by 3% in the second quarter.”

Effect: Combined with BM25, retrieval failure rate drops by 49%; combined with a re-ranker, failure rate drops by up to 67%.
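
A sketch of the two steps, with llm standing in for any completion call (the prompt here paraphrases Anthropic's published description rather than quoting it):

def contextualize_chunk(llm, document: str, chunk: str) -> str:
    """Step 1: generate a situating prefix. Step 2: prepend it before indexing."""
    prefix = llm(
        f"<document>\n{document}\n</document>\n"
        f"Here is a chunk from the document above:\n<chunk>\n{chunk}\n</chunk>\n"
        "Write a short context that situates this chunk within the whole document."
    )
    # Index the concatenation with both embeddings and BM25.
    return f"[{prefix.strip()}] {chunk}"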

Two-Layer Structure of User Memory

📋 JSON Cards (Resident Context)
Structured core facts, a personal cheat sheet

  • Passport expires 2025-02 · Tokyo trip

🔍 Context-Aware RAG (On-Demand Retrieval)
Unstructured conversation details, a powerful search engine

  • [Context: booking a January flight in November…]

🔗 The two must work together

  1. JSON Cards provide the factual framework
  2. LLM reasoning discovers potential associations
  3. RAG verifies and retrieves conversational evidence
  4. Proactive service: Passport is about to expire!

JSON Cards tell the agent “what exists”; RAG tells the agent “what the details are.” Both are indispensable.

Agent Memory Architecture

Resident Context

📋 Basic knowledge → System Prompt
User JSON Cards are placed directly into the agent context and can be accessed without tool calls.

Three Retrieval Tools

🔍 search_user_memory
Agentic Search on User Memory

  • Backend: Embedding Search → Rerank → return related memories

🔍 search_conversations
Agentic Search on Conversation Summaries

  • Backend: Embedding Search → return related historical conversation summaries

📜 load_recent_conversations
Load Last N Conversation Summaries

  • Directly load summaries of the last N turns, no semantic search needed
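
A sketch of the three tools' backends; the store and reranker objects and their methods are assumed interfaces, not a specific library:

def search_user_memory(query, store, reranker, top_k=5):
    """Embedding search over memory chunks, then rerank."""
    candidates = store.similarity_search(query, k=top_k * 4)
    return reranker.rerank(query, candidates)[:top_k]

def search_conversations(query, store, top_k=5):
    """Embedding search over conversation summaries (no rerank step)."""
    return store.similarity_search(query, k=top_k)

def load_recent_conversations(summaries, n=10):
    """Return the last N summaries directly; no semantic search needed."""
    return summaries[-n:]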

Architecture Diagram

┌─────────────────────────────────┐
│ Agent                           │
│ ┌─────────────────────────────┐ │
│ │ System Prompt               │ │
│ │   + User JSON Cards         │ │
│ └─────────────────────────────┘ │
│                                 │
│ Tools:                          │
│ ├─ search_user_memory()         │
│ │    └→ Embedding → Rerank      │
│ ├─ search_conversations()       │
│ │    └→ Embedding Search        │
│ └─ load_recent_conversations()  │
│      └→ Last N summaries        │
└─────────────────────────────────┘
                │
                ▼
┌─────────────────────────────────┐
│ Memory Store (Vector DB)        │
│ • User memories (chunked)       │
│ • Conversation summaries        │
└─────────────────────────────────┘

Design principle: High-frequency information resides in context; long-tail details are retrieved on demand.

Proactive Service: A Natural Result of Storage + Retrieval

Core Insight

Proactive service is not an independent capability layer
Once storage and retrieval are done well, proactive service emerges naturally. It is an emergent result of structured storage working together with intelligent retrieval.

Why Does It Emerge Naturally?

Structured storage (JSON Cards) provides:

  • Key facts resident in context (e.g., passport validity)
  • Metadata that supports associative reasoning (e.g., timestamps, entity types)

Intelligent retrieval (context-aware RAG) provides:

  • On-demand access to historical conversation details
  • Automatic connection of related information fragments

Combined, the agent naturally discovers associations like “Tokyo ticket in January → passport expires in February.”

Examples of Proactive Service

International travel alert
JSON Cards store passport information + RAG retrieves flight bookings → automatically detects time conflicts.

Device damage handling
JSON Cards store device and insurance information → automatically list all applicable protection options.

Tax season preparation
JSON Cards store income types + RAG retrieves transaction records → automatically aggregates relevant documents.

Implementation path: There is no need to design a separate mechanism for “proactive service.” Focus on doing storage and retrieval well, and the LLM’s reasoning ability will handle the rest.

Part 4: Evaluating Memory

Why Do We Need Evaluation?

Evaluation Is the Compass of Agent Engineering

Building an agent system involves many design decisions, and these decisions often have no obvious “correct answer.”

Key decision points

  • Workflow design: Workflow vs Autonomous mode
  • Prompt design: Structured vs rule list
  • Memory mode: Simple Notes vs JSON Cards
  • Retrieval strategy: Precomputed injection vs on-demand retrieval

Core insight: Some seemingly reasonable designs actually harm performance, while some seemingly trivial details can bring significant gains. Only through rigorous comparative evaluation can these counterintuitive truths be revealed.

Threefold Value of Evaluation

1. Guides design decisions
Without evaluation, we can only rely on intuition, and intuition is often unreliable.

2. Provides improvement signals
Not only tells you “good or bad,” but more importantly reveals “why it is good/bad.”

3. Supports model upgrade decisions
When a new model is released, only by testing it on your own evaluation set can you make data-driven upgrade decisions.

Ablation study methodology: keep all other parts of the system unchanged, modify only one specific component, and observe the impact on overall performance

Basic components of an evaluation environment

Five core elements

1. Dataset
Defines a set of tasks; each task contains an initial state, goal description, and reference solution.

2. Environment state
Maintains all mutable information during task execution (database, file system, conversation history).

3. Tools
The channels through which the agent interacts with the environment; must be functionally complete but avoid over-simplification.

4. Rubric
Defines how to quantify agent performance; this is the most challenging part of evaluation.

5. Interaction protocol
Specifies interaction patterns and termination conditions.

Key principles for human–AI interaction evaluations

Progressive information disclosure
You must never expose all the information the user has to the agent at the very beginning. Information should be disclosed as needed and progressively throughout the conversation.

User simulation
Use another LLM to play the user role, following predefined instructions to:

  • Reveal necessary information step by step
  • Respond to the agent’s questions
  • Issue a termination signal after the task is completed

Dual verification

  • Check whether the final database state is correct
  • Check whether all necessary key information was output in the conversation

Reference: τ-bench / τ²-bench evaluation frameworks
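
A sketch of how these principles compose into one evaluation episode (all objects and method names are assumed, in the spirit of τ-bench rather than its actual API):

def run_episode(agent, user_sim, env, max_turns=60):
    """A simulated user reveals info progressively; finish with dual verification."""
    message = user_sim.first_message()
    for _ in range(max_turns):
        reply = agent.respond(message, tools=env.tools)
        message = user_sim.respond(reply)     # discloses information only as needed
        if user_sim.is_done(message):         # termination signal from the user sim
            break
    return {
        "final_state_ok": env.check_final_state(),           # database state correct?
        "key_info_ok": env.check_outputs(agent.transcript),  # key info surfaced?
    }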

Rubric: the basis for LLM judgment

What is a rubric?

A rubric (structured scoring guideline) is the core tool that makes LLM-as-judge evaluation objective, consistent, and interpretable. It is similar to the scoring criteria for exams like the Gaokao, GRE writing, or TOEFL speaking.

Four design principles

Expert-guided: reflects domain expertise and captures the core facts and reasoning steps required for a correct response.

Comprehensive coverage: spans multiple dimensions (accuracy, coherence, completeness) and defines both positive and negative criteria.

Importance weighting: factual correctness must take precedence over stylistic clarity (Essential / Important / Optional / Pitfall).

Self-contained evaluation: each evaluation item is independently operable and does not rely on external context.

Example rubric for evaluating user memory

dimensions:
  factual_precision:
    weight: essential
    levels:
      - 4: All facts completely correct
      - 3: Key facts correct, minor deviations in details
      - 2: Some facts correct
      - 1: Major factual errors

  factual_recall:
    weight: important
    levels:
      - 4: All relevant information provided
      - 3: Main information provided
      - 2: Some key information missing

  hallucination:
    weight: veto  # automatic fail
    description: Any fabricated information results in immediate failure

Preventing reward hacking: explicitly define negative criteria in the rubric—hallucinations, flattering the user, keyword stuffing, and avoiding the question.
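
One way such a rubric can be aggregated, as a sketch (the weights and normalization are illustrative choices, not a prescribed scheme):

WEIGHTS = {"essential": 3.0, "important": 2.0, "optional": 1.0}

def aggregate(judgments: dict[str, int], rubric: dict[str, str]) -> float:
    """Combine per-dimension judge levels (1-4) using weights and a veto."""
    if judgments.get("hallucination", 0) > 0:   # veto: fabrication fails outright
        return 0.0
    total = max_total = 0.0
    for dim, level in judgments.items():
        weight = WEIGHTS.get(rubric.get(dim, ""), 0.0)  # veto dims carry no weight
        total += weight * level
        max_total += weight * 4                 # 4 is the best level
    return total / max_total if max_total else 0.0

rubric = {"factual_precision": "essential", "factual_recall": "important",
          "hallucination": "veto"}
print(aggregate({"factual_precision": 4, "factual_recall": 3,
                 "hallucination": 0}, rubric))  # 0.9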

Evaluation methodology: best practices from Anthropic

Three types of evaluation

Unit tests
Deterministic checks, used to verify format, edge cases, and other scenarios where correctness can be clearly judged.

LLM-as-judge
Use an LLM to assess output quality; combined with a clear rubric, this can achieve a high level of agreement with human judgment.

Human evaluation
Test under real-world conditions to discover the “rough edges” that automated evaluation cannot capture.

Characteristics of good evaluations

  • Specific and clear: there is a single correct answer.
  • Realistic: reflects problems that real users will actually encounter.
  • Diagnosable: simple enough to understand the reason for failure.
  • Representative: reflects the end-user experience.

Three dimensions of agent evaluation

Final answer correctness
Did the agent provide the correct final answer? Use an LLM judge to compare with the reference answer.

Tool-usage accuracy
Did the agent choose the correct tools? Were the parameters correct? Could it recover from errors?

Final state correctness (τ-bench)
Did the agent achieve the correct final state? Applicable to tasks with side effects (such as canceling an order).

Evaluation tips

  • The more obvious the impact of each system change on the result, the fewer test samples you need.
  • Use real tasks: real user scenarios that have clearly correct answers.
  • Nothing can perfectly replace human evaluation: repeated testing and gut checks are indispensable.

Reference: Anthropic, “Context Engineering Best Practices” (AWS re:Invent 2025)

Part 5: Frontier Research

Limitations of existing agent memory systems

Problem: agents cannot learn from history

Current status

Existing LLM agents cannot effectively learn from accumulated interaction history when handling continuous task streams. Each task is processed in isolation, causing the system to repeatedly make past mistakes and lose valuable insights.

Root problem: lack of true self-evolution capability—the agent cannot grow stronger over time.

Defects of existing approaches

Two mainstream approaches

Raw trajectory storage
Directly store the interaction process, with no distillation.

Successful workflow logging
Only keep workflows/procedures and ignore failures.

Shared defects

  • Cannot extract high-level, transferable reasoning patterns
  • Overemphasize success and ignore the valuable lessons of failure
  • Passive recording that cannot generate actionable guidance

ReasoningBank: a memory bank of reasoning strategies

Core innovation

Learning from both success and failure

ReasoningBank distills generalizable reasoning strategies from the agent’s self-judged successes and failures, without relying on ground-truth labels.

Difference in memory contents

Method               | Stored content
Raw trajectories     | Complete interaction sequences
Successful workflows | Effective action patterns
ReasoningBank        | Transferable reasoning strategies

Closed-loop learning mechanism

1. Retrieve relevant memories
When facing a new task, retrieve semantically relevant reasoning strategies from ReasoningBank.

2. Guide action decisions
Use the retrieved strategies to guide the agent’s interaction process.

3. Analyze new experiences
After the task is finished, the agent self-judges success or failure.

4. Distill and integrate
Extract reasoning strategies from the new experiences and update ReasoningBank.
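
In code, the loop might look like this sketch (bank, agent, and judge are assumed interfaces summarizing the paper's description, not its implementation):

def reasoningbank_step(bank, agent, judge, task):
    """One closed-loop iteration: retrieve → act → self-judge → distill → integrate."""
    strategies = bank.retrieve(task.description, k=1)   # quality over quantity
    trajectory = agent.run(task, guidance=strategies)
    success = judge.self_assess(trajectory)             # no ground-truth labels
    # Successes yield positive strategies; failures yield preventive ones.
    new_strategies = judge.distill(trajectory, success)
    bank.integrate(new_strategies)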

Why are failure experiences equally important?

Valuable lessons in failure

Misconceptions in traditional thinking

Most memory systems focus only on successful cases, assuming failures are not worth keeping. Yet failure experiences contain critical “preventive” knowledge.

Example: web navigation task

Success teaches you:
“Click the ‘Men’s clothing’ category to find the product.”

Failure teaches you:
“Do not search directly on the homepage; the search box does not support complex queries well.”

Failure experiences provide boundary conditions that successful paths cannot cover.

The value of contrastive signals

Contrastive learning from success vs. failure

When there are both successful and failed examples for the same class of tasks, the agent can discover through comparison:

  • Which strategies are effective in specific contexts
  • Which seemingly reasonable paths actually fail
  • The critical boundaries between success and failure

How ReasoningBank handles this

From successful experiences, it extracts: positive strategies (“doing this works”).

From failed experiences, it extracts: preventive strategies (“avoid doing this”).

Together they form more complete reasoning knowledge.

MaTTS: memory-aware test-time scaling

Depth vs. breadth

Two paths for expanding experience

Breadth scaling
Increase the number of tasks (more users, more scenarios).

Depth scaling (MaTTS)
Conduct more exploration for each task (more attempts, more variants).

Core idea of MaTTS

Allocate more compute to a single task to generate rich, diverse exploratory experiences, providing higher-quality contrastive signals for memory synthesis.

Synergy between memory and scaling

Positive feedback loop

High-quality memory → more effective exploration → richer experience
        ↑                                                ↓
        └────────── stronger memory synthesis ───────────┘

Two scaling modes

Parallel scaling
Generate multiple independent solution paths simultaneously.

Sequential scaling
Adjust the next attempt based on the previous result.

MaTTS establishes memory-driven experiential scaling as a new scaling dimension for agent systems.
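
As a sketch of the two modes (assumed interfaces again; the contrastive distillation step is the part MaTTS adds):

def matts_parallel(bank, agent, judge, task, k=5):
    """Parallel scaling: k independent rollouts provide contrastive signals."""
    trajectories = [agent.run(task, seed=i) for i in range(k)]
    outcomes = [judge.self_assess(t) for t in trajectories]
    bank.integrate(judge.distill_contrastive(trajectories, outcomes))

def matts_sequential(bank, agent, judge, task, k=5):
    """Sequential scaling: each attempt is conditioned on the previous result."""
    feedback = None
    for _ in range(k):
        trajectory = agent.run(task, feedback=feedback)
        feedback = judge.self_assess(trajectory)
    bank.integrate(judge.distill(trajectory, feedback))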

Experimental results and key findings

Benchmark results

Three evaluation settings

WebArena (web browsing)

  • Complex web-interaction tasks
  • Requires multi-step navigation and operations

Mind2Web (web understanding)

  • Element recognition on real-world web pages
  • Action prediction and execution

SWE-Bench-Verified (software engineering)

  • Codebase-level bug fixing
  • Requires understanding large codebases

Key metrics

  • Effectiveness: up to 34.2% relative improvement
  • Efficiency: 16.0% fewer interaction steps

Key findings

Memory quality > quantity
Retrieving 1 relevant memory outperforms retrieving 4. Too many memories may introduce conflicts or noise.

Unique value of failure experiences
Systems that incorporate failure experiences outperform those that learn only from successes.

Emergent behaviors
As memories accumulate, the agent begins to exhibit complex reasoning strategies not seen before.

Synergy of MaTTS and memory
The combination of ReasoningBank + MaTTS performs best, confirming the positive feedback loop between memory and scaling.

Insights from ReasoningBank for user-memory systems

From task memory to user memory

Shared core challenges

ReasoningBank addresses how an agent learns from task interactions; user memory systems address how an agent understands and serves users. They face similar core questions:

  • How to distill high-level knowledge from raw data?
  • How to retrieve truly relevant information?
  • How to enable the memory system to evolve continuously?

Key insight: you cannot simply store raw data; you must invest compute in active distillation, abstraction, and structuring.

Transferable design principles

Principle 1: bidirectional learning
Learn not only from users’ positive feedback (preferences) but also from negative feedback (boundaries).

Principle 2: closed-loop updates
A memory system is not built once and for all; it evolves continuously with interactions.

Principle 3: quality first
The relevance and quality of memories matter more than their quantity.

Principle 4: self-judgment
Use LLM-as-a-judge to automate quality evaluation and reduce reliance on manual labeling.

Summary: The Evolution from Memory to Cognition

Technical Evolution Path

1. Remembering Facts
Simple Notes / JSON Cards
✓ Accurately store structured information

2. Understanding Context
Enhanced Notes / Advanced JSON
✓ Preserve semantic integrity and situational information

3. Cross-Session Association
Structured indexing + context-aware retrieval
✓ Disambiguate and discover composite events

4. Proactive Anticipation
Dual-layer memory architecture + deep reasoning
✓ Provide help without explicit requests

Key Insights

Personalization is a real need
From the success of recommendation systems, personalized products are more in line with human nature. AI Agents also need personalized memory to adapt to each user’s unique values and preferences.

Preference learning is the hard part
Factual information is relatively simple, but learning user preferences faces challenges such as context dependence and over-generalization, requiring fine-grained evaluation and continuous iteration.

Knowledge distillation is critical
You can’t just dump raw data into a knowledge base; you must invest compute in proactive distillation, abstraction, and structuring.

A dual-layer architecture is the optimal solution
Structured core facts (always in context) + context-aware retrieval (on-demand access) strike a balance between completeness and efficiency.

Future Outlook

Technical Challenges

Refinement of preference learning

  • Better modeling of context dependence
  • Distinguishing one-off behaviors from long-term preferences
  • Reducing the risk of over-generalization

Memory compression and organization

  • Automatically discovering knowledge hierarchies
  • Dynamically adjusting memory structure
  • Balancing level of detail and accessibility

Cross-modal memory integration

  • Unified representations of text, images, and audio
  • Associative retrieval across multimodal information

Application Prospects

Personalized value alignment

  • From universal values to individual values
  • Dynamically adapting to the evolution of user values
  • Achieving true personalization at the level of details

Operating-system-level assistants

  • Unified memory across devices and applications
  • Long-term, continuous user profile construction
  • Truly proactive services

Privacy and transparency

  • Complete user control over memory
  • Explainable memory management
  • Tiered protection for sensitive information

Vision

To build a truly “understanding you” AI assistant that not only remembers what you say, but understands who you are, anticipates your needs, and becomes a trustworthy lifelong companion.


From simple recording to deep understanding, from passive response to proactive service
