(This article is organized from Anthropic team talks and in-depth discussions during AWS re:Invent 2025)

View Slides (HTML), Download PDF Version (note these slides are not official Anthropic material; I reconstructed them from photos and recordings)

Slides Source Code

Contents

Claude is already smart enough—intelligence is not the bottleneck, context is. Every organization has unique workflows, standards, and knowledge systems, and Claude does not inherently know any of these. This post compiles Anthropic’s best practices for Context Engineering, covering Skills, Agent SDK, MCP, evaluation systems and other core topics to help you build more efficient AI applications.

  • 01 | Skills system - Let Claude master organization-specific knowledge
  • 02 | Context Engineering framework - Four pillars for optimizing token utility
  • 03 | Context Window & Context Rot - Understand context limits and degradation
  • 04 | Tool design best practices - Elements of powerful tools
  • 05 | Claude Agent SDK - A framework for production-ready agents
  • 06 | Sub-agent configuration best practices - Automatic invocation and permissions
  • 07 | MCP (Model Context Protocol) - A standardized protocol for tool integration
  • 08 | Evaluations - Why evaluation matters and best practices
  • 09 | Lessons from building Coding Agents - What we learned from Claude Code
  • 10 | Ecosystem collaboration - How Prompts, MCP, Skills, and Subagents work together

Core question: Why do we need Context Engineering?

Claude does not know:

  • How your team structures reports
  • Your brand guidelines and templates
  • Your compliance processes
  • Your data analysis methodology

Current solutions all have limitations:

  • Prompts are ad-hoc instructions
  • Custom agents require infrastructure building
  • Context management is challenging

Part 1: Skills System

What are Skills?

Skills are folders of instructions, scripts, and resources that Claude can dynamically discover and load. You can think of them as “professional knowledge packs” that improve organization-wide productivity through consistent, high-quality outputs.

Two types of Skills

01 General capability enhancement
Things Claude cannot yet do well out of the box, such as creating PDF, Excel, and PowerPoint files.

02 Organization/industry/personal workflows and best practices
For example, Anthropic’s brand style guidelines.

How Skills work

A Skill is a directory containing an SKILL.md file:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
## pdf/SKILL.md

### YAML Frontmatter

name: pdf
description: Comprehensive PDF toolkit for extracting text and tables,
merging/splitting documents, and filling out forms.

### Markdown

## Overview

This guide covers essential PDF processing operations using Python libraries
and command-line tools. For detailed examples, see `/reference.md`.
If you need to fill out a PDF form, read `/form.md` and follow its instructions.

## Quick Start

from pypdf import PdfReader, PdfWriter
reader = PdfReader('document.pdf')
print(f"Pages: {len(reader.pages)}")

Key design points:

  • Metadata: Name and description at the top of the file
  • Preloading: Agents preload the names and descriptions of installed Skills into the system prompt
  • Efficiency: Claude only reads more content when needed

Skills packaging extra content

More complex Skills can reference additional context files:

1
2
3
4
pdf/
├── SKILL.md # 主入口文件
├── reference.md # PDF 处理高级参考
└── forms.md # PDF 表单填写指南
  • Discovery: Claude navigates and discovers details as needed
  • Executable Scripts: For operations better handled by traditional code, this is more token-efficient and provides deterministic reliability when needed

Progressive Disclosure

Complex Skills can reference additional context:

Main file anthropic/brand_styling/SKILL.md:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
## Overview

This skill provides Anthropic's official brand identity resources for
PowerPoint presentations. It includes a pre-branded template
and tools to apply Anthropic styling to existing presentations.

## Colors

- Dark: '#141413' - Primary text and dark backgrounds
- Light: '#faf9f5' - Light backgrounds and text on dark
- Light Gray: '#c8c6dc' - Subtle backgrounds

## Workflows

When creating presentations, read `./slide-decks.md`
When creating professional documents, read `./docs.md`

Reference file slide-decks.md:

1
2
3
4
5
6
7
8
9
10
11
## Anthropic Slide Decks

- Intro/outro slides
- background color: '#141413'
- foreground color: oat

- Section slides:
- background color: '#da7857'
- foreground color: '#141413'

... and so on ...

Reference file docs.md:

1
2
3
4
5
6
## Documents

* every document should start with a title, a list of authors, and the creation date
* if you use tabs in GDocs, make sure the main doc is titled as such

... and so on ...

In this way, Claude only reads slide-decks.md when creating presentations, and only reads docs.md when creating documents, achieving on-demand loading.

Skills are universal across all products

The same Skills format works across all Anthropic products:

Product Best for Features
🌟 Apps Auto-calling, UX Claude creates professional documents and analyses; end users can create, manage, and share custom Skills
🌟 Developer Platform Programmatic distribution Deploy Skills into end-user products via the Code Execution API; use core Skills or build custom ones
💻 Claude Code Developer workflows Use official or custom Skills with automatic invocation

Skills in Claude Code

Install via plugins (from the official Anthropic GitHub repo) or by manually adding to the ~/.claude/skills directory.

  • Automatic invocation: Claude automatically loads Skills when relevant—users approve
  • Different from slash commands: Skills are auto-loaded by Claude; slash commands are explicitly invoked by users
  • Runs in the local development environment
  • Marketplace: Distributed through a plugin marketplace

Skills best practices

Naming and description

  • Use gerund-style names: processing-pdfs, analyzing-spreadsheets, testing-code
  • Avoid vague names (helper, utils) or inconsistent patterns
  • Include both what it does and when to use it
  • Use declarative phrasing: “Processes Excel files and generates reports”
  • Avoid: “I can help you…” or “You can use this to…”

File organization

  • Keep the main SKILL.md under 500 lines
  • Split content into separate files as you approach that limit
  • Keep references at a single depth level from SKILL.md—avoid nested file references
  • Use directory structure for longer files (>100 lines)

Content quality

  • Use consistent terminology
  • Show concrete input/output pairs, just like regular prompts
  • Examples should match the behavior you want to encourage

Skill application examples

Based on the Claude Agent SDK you can build many specialized agents:

🔒 Code Security Agent 📝 Code Review Agent 📄 Contract Review Agent 📊 Meeting Summary Agent
💰 Financial Reporting Agent ✉️ Email Automation Agent 📑 Invoice Processing Agent

Part 2: Context Engineering Framework

Core concept

Context Engineering is the discipline of optimizing token utility to deal with the inherent constraints of LLMs

Four pillars

1. System prompt

  • Minimal, precise instructions using clear, simple, direct language—“say less, mean more”
  • Structured sections
  • Appropriate level of abstraction (not too rigid, not too vague)

2. Tools

  • Self-contained (i.e., independent), non-overlapping, and purpose-specific—“every tool must justify its existence”
  • Explicit parameters & concise, distinct descriptions
  • Clear success/failure modes

3. Data retrieval

  • Just-in-time context (JIT Context)—“load what you need when you need it”
  • Balance between preloading and dynamic fetching (agents can fetch autonomously)
  • Carefully designed retrieval tools—don’t send the whole library, send a librarian

4. Long-horizon optimizations

  • History compression strategies
  • Structured note-taking systems
  • Use sub-agent architectures where appropriate

Data Retrieval Paradigm Shift

Old approach: Preload (traditional RAG) - pre-load all potentially relevant data

New approach: Just-In-Time retrieval

Strategy Description Example
Lightweight identifiers Pass IDs instead of full objects; the agent requests details when needed user_id: "12345" → agent calls get_user() → full profile
Progressive disclosure Start from summaries; the agent drills down as needed File list → file metadata → file contents
Autonomous exploration Give discovery tools instead of data dumps; the agent navigates the information space search_docs() + read_doc(detail_level) vs loading all documents

Three Strategies for Long-Running Tasks

1. Compaction

  • Periodically summarize intermediate steps and/or compress history
  • Reset the context with the compressed summary, keeping only key information
  • Tradeoff: slight loss of detail in exchange for continuous operation
  • Example: “User wants X, tried Y, learned Z” vs full conversation

2. Structured memory/notes

  • The agent maintains explicit memory artifacts (i.e., external persistent storage)
  • Store “working notes” in structured form: decisions, learnings, state
  • Retrieve on demand instead of keeping everything in context
  • Examples: decision logs, key-findings documents

3. Sub-agent architecture

  • Decompose complex tasks into specialized agents
  • Each sub-agent has a focused, clean, narrow context
  • The main agent coordinates and synthesizes results
  • Example: a code-review agent spawning a documentation-check sub-agent

Part 3: Context Window and Context Rot

Context Window Limits

  • All frontier models have a maximum number of tokens they can process in a single interaction
  • Anthropic’s context window is 200k tokens

The Context Rot Problem

As context grows, output quality may degrade

Four main causes:

Type Description
🧪 Context Poisoning Incorrect or outdated information pollutes the context, causing the model to reason from wrong premises
📄 Context Distraction Irrelevant information distracts the model and reduces focus on key information
Context Confusion Similar but distinct information is mixed together, making it hard for the model to distinguish and associate correctly
🔍⚠️ Context Clash Contradictory or inconsistent information appears in the context, and the model doesn’t know which to trust

Key conclusion: All models experience performance degradation with long contexts. (See Chroma technical report: Context-Rot: How Increasing Input Tokens Impacts LLM Performance)

Prompt Caching

  • Prompt caching is a lever for cost and latency
  • The success of prompt caching is highly related to context structure

Effective context construction and maintenance will:

Goal Outcome
Handle context window limits Reliability
Reduce context rot Accuracy
Optimize prompt caching Cost & latency

Part 4: Tool Design Best Practices

Elements of strong tool design

  1. Use simple and accurate tool names

  2. Detailed and well-formatted descriptions—include what the tool returns, how it should be used, etc.

  3. Avoid overly similar tool names or descriptions!

  4. Tools that perform a single action work better—aim for at most one level of nested parameters

  5. Provide examples—expected input/output formats

  6. Be mindful of the format of tool results

  7. Test your tools! Ensure the agent can use them correctly

Example tool definition:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
{
"name": "search_customers",
"description": "Search customer database by name, email, or ID. Returns matching customer records.",
"input_schema": {
"type": "object",
"properties": {
"query": {
"type": "string",
"description": "Search term (name, email, or customer ID)"
},
"max_results": {
"type": "integer",
"default": 10,
"description": "Number of results to return (default: 10, max: 50)"
}
}
},
"required": ["query"]
}

Part 5: Claude Agent SDK

Architecture Overview

The Claude Agent SDK is built on the agent framework that powers Claude Code, and provides all the building blocks needed to construct production-ready agents.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
Application / Platform

┌─────────────────────────┐
│ Claude Agent SDK │
├─────────────────────────┤
│ Harness │
├─────────┬───────┬───────┤
│ Tools │Prompts│ File │
│ │ │System │
├─────────┴───────┴───────┤
│ Models │
├─────────┬───────────────┤
│ Claude │ Claude Opus │
│ Sonnet │ │
└─────────┴───────────────┘

Core Capabilities of the SDK

Tools

  • Read/write file operations
  • Code execution
  • Web search
  • MCP servers
  • Skills

Permissions

  • Human approval checkpoints
  • Fine-grained permissions
  • Tool allow/deny lists

Production readiness

  • Session management
  • Error handling
  • Monitoring

Enhancements

  • Subagents
  • Web search
  • Research mode
  • Auto compacting
  • Multi-stream
  • Memory

Design Philosophy of the Agent SDK

Claude Code: Delegate everyday development work to Claude

By giving Claude access to a user’s computer (via a terminal), it can write code like a programmer:

  • Find files
  • Write and edit files
  • Test
  • Debug
  • Iteratively perform actions

Claude Agent SDK: Extend Claude Code to build custom agents

The principles of Claude Code can be extended to general agents. By giving Claude the same tools, agents can:

  • Read CSV files
  • Search the web
  • Build visualizations
  • And more

Core design principle: The Claude Agent SDK gives your agents a computer, so they can work like humans.

Claude Code Tool Suite

Tool Description Requires Permission
Agent Run sub-agents to handle complex multi-step tasks No
Bash Execute shell commands in your environment Yes
Edit Make targeted edits to specific files Yes
Glob Find files based on pattern matching No
Grep Search for patterns in file contents No
LS List files and directories No
MultiEdit Perform multiple edits atomically on a single file Yes
NotebookEdit Modify Jupyter notebook cells Yes
NotebookRead Read and display Jupyter notebook contents No
Read Read file contents No
TodoRead Read the task list for the current session No
TodoWrite Create and manage structured task lists No
WebFetch Fetch content from specified URLs Yes
WebSearch Perform web search with domain filtering Yes
Write Create or overwrite files Yes

Characteristics of a Strong Agent Framework

  • Does not over-script or over-scaffold the model
  • Allows tuning of all key system parts (context engineering)
  • Leverages all model capabilities (extended thinking, interleaved thinking, parallel tool calls, etc.)
  • Provides access to memory
  • Enables multi-agent setups where valuable
  • Has a robust agent permission system

Part 6: Sub-Agent Configuration Best Practices

The Description field is critical for auto-invocation

  • Make descriptions specific and action-oriented
  • Use “PROACTIVELY” or “MUST BE USED” to encourage automatic delegation
  • Example: "Use PROACTIVELY when code changes might impact performance. MUST BE USED for optimization tasks."

Tool Permissions

  • Restrict tools to what each sub-agent needs
  • Example: a code reviewer gets Read, Grep, Glob but not Write or Edit

Model Selection

  • Use inherit to match the main conversation for consistency
  • Specify sonnet, opus, or haiku according to specific needs
  • If omitted, defaults to sonnet

Native Sub-Agent Orchestration Best Practices

  • Manage context limits within the agent framework
  • When the context window is cleared, consider restarting rather than compressing
  • Prompt around early compaction
  • Be prescriptive about how to start
  • Provide verification tools
  • As autonomous task duration increases, Claude needs to verify correctness without constant human feedback

Best Practices for Research Tasks

To get the best research results:

  • Provide clear success criteria—spell out what counts as a successful answer
  • Encourage multi-source validation—verify information across multiple sources
  • Use a structured approach for complex research—proceed step by step and methodically

Part 7: MCP (Model Context Protocol)

What is MCP?

1
2
3
4
5
6
7
8
9
10
11
12
                AI applications                    Data sources and tools

Chat interface ←→ Data and file systems
Claude Desktop, ┌─────────┐ PostgreSQL, SQLite, GDrive
LibreChat │ MCP │
←→ │Standardized│ ←→ Development tools
IDEs and code editors │ protocol │ Git, Sentry, etc.
Claude Code, Goose └─────────┘
←→ ←→ Productivity tools
Other AI applications Slack, Google Maps, etc.
Sire, Superinterface Bidirectional
data flow

Where MCP Is Heading

The latest specification (June 2025) focuses on structured tool outputs, OAuth authorization, mechanisms for server-initiated interaction requests, and security best practices.

Future directions:

  1. Asynchronous operations
  2. Statelessness and scalability
  3. Server identity and discovery

Part 8: Evaluations

Types of evaluations

Evaluation Type Description Purpose Examples
Intelligence benchmarks Evaluate a model’s general intelligence Compare with other models, model release decisions MMLU, GPQA
Capability benchmarks Evaluate the model’s performance on general capability domains Compare with other models, model release positioning MATH, HumanEval, SWE-Bench
Behavioral evaluations Quantify how common specific model behaviors are Monitor and improve model behavior Refusal rate, hallucinations, “Certainly!” frequency
Safety evaluations Use threat analysis and red-teaming to evaluate the experience of bad actors Understand the risks of feature or product launches Computer Use, browser-use red-teaming
Product evaluations Evaluate the model’s ability to perform tasks within specific product features Product launch decisions and iteration Artifacts, data features, multimodal PDF

Characteristics of good evaluations

  • Measure performance as well as regressions
  • User-centered, covering the full range of expected user behaviors
  • Consider edge cases and risks
  • Recognized by multiple stakeholders (such as Legal)
  • Have (rough) targets
  • If your feature is common in other AI products, there may be benchmarks available to use or adapt
  • Negative examples are extremely important — they define the boundaries of the feature and ensure it doesn’t over-trigger

Excellent evaluations: can be scored objectively and procedurally

Evaluation process

  1. Establish a baseline: For positive and negative examples, run prompts using the current production configuration and record the outputs as the baseline

  2. Outline expected behavior: For positive and negative examples, explain “What is the expected behavior after applying the prompt changes?”

  3. [Optional] Scoring: Build a scorer that checks model output against expected behavior (exact match, regex, or model-based)

Automation and iteration

The faster evaluations run, the easier it is to iterate.

Tool options:

  • Anthropic Console
  • Custom scripts or notebooks
  • Custom tools

Two dimensions of iteration:

  • Feature iteration: change system prompts, tool definitions, system flows
  • Evaluation iteration: you may see behaviors you don’t like and didn’t anticipate, and need to add use cases or prompts into the evaluation to test them

Agent evaluation examples

1. Answer Accuracy

The LLM judges the correctness of the Agent’s answer

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
User: How many employees started in 2023 and are still active?

Agent: query_employees(hired_year=2023, status="active")

Tools: {"count": 47, "departments": {"eng": 23, "sales": 15, "ops": 9}}

Agent: 47 employees who started in 2023 are still active.
They're distributed across: Engineering (23), Sales (15), and Operations (9).

LLM Judge:
Evaluation: CORRECT
- Accurately reported the total count (47)
- Properly filtered by hire year and status
- Added helpful department breakdown
Score: 10/10

✅ Pass: Answer verified as correct

2. Tool Use Accuracy

Evaluate correct tool selection and parameters

1
2
3
4
5
6
7
8
9
10
11
User: Book a flight to Paris tomorrow morning

Agent: search_flights(destination="Paris", date="tomorrow")

Tools: {"error": "Invalid date format"}

Agent: search_flights(destination="Paris", departure_date="2024-05-19", time_range="06:00-12:00")

Tools: {"flights": [{"flight_number": "AF123", ...}]}

✅ Pass: Recovered from parameter error

3. t-bench

Evaluate whether the Agent reaches the correct final state

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
User: Change Flight

Agent: get_reservation_details(JX9010)

Tools: {...}

Agent: cancel_reservation(JX9010)

Tools: {"status": "cancelled"}

Agent: JX9010 is basic economy and cannot be changed.
But since it was within 24h, I cancelled it and you can book a new one.
Do you want me to do it?

User: in that case just cancel it

Agent: cancel_reservation(JX9010)

Tools: {...}

✅ Pass: Agent reached correct final state

Tips for evaluating Agent systems

  • The larger the effect size, the smaller the sample size you need: at the start you only need a few test cases, and each system change will produce significant, noticeable impacts

  • Use real tasks: try to evaluate research systems on tasks real users might do, preferably with clearly correct answers that can be found using available tools

  • LLM-as-judge with a rubric is very powerful: LLMs are now strong enough to be excellent evaluators of outputs if given clear rubrics that align with human judgment

  • Nothing perfectly replaces human evaluation: nothing beats repeatedly testing and sanity-checking the system yourself, and testing with real users — humans spot rough edges!

Part 9: Lessons from building a Coding Agent

What we learned

  • Everything is a file
  • Bash is the ultimate tool
  • Most tool calls are just code
  • Agentic Search > RAG

What Agents still need

  • Memory
  • Sub Agents & Collaboration
  • Dynamic Tool Calls
  • Code Generation & Execution
  • Web Search
  • Agentic Search
  • Long Running Tasks

Part 10: Ecosystem synergies

How Prompts, MCP, Skills, and Subagents work together

Feature Prompts MCP Skills Subagents
What they provide On-the-fly instructions Tool connections Procedural knowledge Task delegation
Persistence Single conversation Persistent connection Across conversations Across sessions
What they contain Natural language Tool definitions Instructions + code + resources Full Agent logic
When they load Every turn Always available Dynamically on demand When called
Can contain code No Yes Yes Yes
Best suited for Quick requests Data access Specialized knowledge Specialized tasks

Example Agent workflow

  1. MCP connects to Google Drive and GitHub
  2. Skills provide analysis frameworks (such as competitive analysis methodologies)
  3. Subagents execute specialized tasks in parallel (market researcher, technical analyst)
  4. Prompts refine and provide specific context

Matching the right tool to the use case

  • Simple procedural knowledge that needs repeated use → Skill
  • Need access to external data sources → MCP
  • Need independent execution and independent context → Subagent
  • Complex workflows → Combination of all three

Conclusion

Context Engineering is the core discipline for building effective AI applications. By using Skills, MCP, and Subagents appropriately and following best practices for tool design and evaluation, you can fully unlock Claude’s potential and build truly production-ready Agent systems.

Remember: Claude is already smart enough; the key to making it successful is giving it the right context.

Comments

2025-12-20