Claude's Context Engineering Secrets: Best Practices Learned from Anthropic

(This article is organized from Anthropic team talks and in-depth discussions during AWS re:Invent 2025)

View Slides (HTML), Download PDF Version (note these slides are not official Anthropic material; I reconstructed them from photos and recordings)

Slides Source Code

Claude is already smart enough—intelligence is not the bottleneck, context is. Every organization has unique workflows, standards, and knowledge systems, and Claude does not inherently know any of these. This post compiles Anthropic’s best practices for Context Engineering, covering Skills, Agent SDK, MCP, evaluation systems and other core topics to help you build more efficient AI applications.

01 | Skills system - Let Claude master organization-specific knowledge
02 | Context Engineering framework - Four pillars for optimizing token utility
03 | Context Window & Context Rot - Understand context limits and degradation
04 | Tool design best practices - Elements of powerful tools
05 | Claude Agent SDK - A framework for production-ready agents
06 | Sub-agent configuration best practices - Automatic invocation and permissions
07 | MCP (Model Context Protocol) - A standardized protocol for tool integration
08 | Evaluations - Why evaluation matters and best practices
09 | Lessons from building Coding Agents - What we learned from Claude Code
10 | Ecosystem collaboration - How Prompts, MCP, Skills, and Subagents work together

Core question: Why do we need Context Engineering?

Claude does not know:

How your team structures reports
Your brand guidelines and templates
Your compliance processes
Your data analysis methodology

Current solutions all have limitations:

Prompts are ad-hoc instructions
Custom agents require infrastructure building
Context management is challenging

Part 1: Skills System

What are Skills?

Skills are folders of instructions, scripts, and resources that Claude can dynamically discover and load. You can think of them as “professional knowledge packs” that improve organization-wide productivity through consistent, high-quality outputs.

Two types of Skills

01 General capability enhancement
Things Claude cannot yet do well out of the box, such as creating PDF, Excel, and PowerPoint files.

02 Organization/industry/personal workflows and best practices
For example, Anthropic’s brand style guidelines.

How Skills work

A Skill is a directory containing an SKILL.md file:

## pdf/SKILL.md

### YAML Frontmatter

name: pdf
description: Comprehensive PDF toolkit for extracting text and tables,
merging/splitting documents, and filling out forms.

### Markdown

## Overview

This guide covers essential PDF processing operations using Python libraries
and command-line tools. For detailed examples, see `/reference.md`. 
If you need to fill out a PDF form, read `/form.md` and follow its instructions.

## Quick Start

from pypdf import PdfReader, PdfWriter
reader = PdfReader('document.pdf')
print(f"Pages: {len(reader.pages)}")

Key design points:

Metadata: Name and description at the top of the file
Preloading: Agents preload the names and descriptions of installed Skills into the system prompt
Efficiency: Claude only reads more content when needed

Skills packaging extra content

More complex Skills can reference additional context files:

pdf/
├── SKILL.md          # 主入口文件
├── reference.md      # PDF 处理高级参考
└── forms.md          # PDF 表单填写指南

Discovery: Claude navigates and discovers details as needed
Executable Scripts: For operations better handled by traditional code, this is more token-efficient and provides deterministic reliability when needed

Progressive Disclosure

Complex Skills can reference additional context:

Main file anthropic/brand_styling/SKILL.md:

## Overview

This skill provides Anthropic's official brand identity resources for
PowerPoint presentations. It includes a pre-branded template
and tools to apply Anthropic styling to existing presentations.

## Colors

- Dark: '#141413' - Primary text and dark backgrounds
- Light: '#faf9f5' - Light backgrounds and text on dark
- Light Gray: '#c8c6dc' - Subtle backgrounds

## Workflows

When creating presentations, read `./slide-decks.md`
When creating professional documents, read `./docs.md`

Reference file slide-decks.md:

## Anthropic Slide Decks

- Intro/outro slides
  - background color: '#141413'
  - foreground color: oat

- Section slides:
  - background color: '#da7857'
  - foreground color: '#141413'

... and so on ...

Reference file docs.md:

## Documents

* every document should start with a title, a list of authors, and the creation date
* if you use tabs in GDocs, make sure the main doc is titled as such

... and so on ...

In this way, Claude only reads slide-decks.md when creating presentations, and only reads docs.md when creating documents, achieving on-demand loading.

Skills are universal across all products

The same Skills format works across all Anthropic products:

Product	Best for	Features
🌟 Apps	Auto-calling, UX	Claude creates professional documents and analyses; end users can create, manage, and share custom Skills
🌟 Developer Platform	Programmatic distribution	Deploy Skills into end-user products via the Code Execution API; use core Skills or build custom ones
💻 Claude Code	Developer workflows	Use official or custom Skills with automatic invocation

Skills in Claude Code

Install via plugins (from the official Anthropic GitHub repo) or by manually adding to the ~/.claude/skills directory.

Automatic invocation: Claude automatically loads Skills when relevant—users approve
Different from slash commands: Skills are auto-loaded by Claude; slash commands are explicitly invoked by users
Runs in the local development environment
Marketplace: Distributed through a plugin marketplace

Skills best practices

Naming and description

Use gerund-style names: processing-pdfs, analyzing-spreadsheets, testing-code
Avoid vague names (helper, utils) or inconsistent patterns
Include both what it does and when to use it
Use declarative phrasing: “Processes Excel files and generates reports”
Avoid: “I can help you…” or “You can use this to…”

File organization

Keep the main SKILL.md under 500 lines
Split content into separate files as you approach that limit
Keep references at a single depth level from SKILL.md—avoid nested file references
Use directory structure for longer files (>100 lines)

Content quality

Use consistent terminology
Show concrete input/output pairs, just like regular prompts
Examples should match the behavior you want to encourage

Skill application examples

Based on the Claude Agent SDK you can build many specialized agents:

🔒 Code Security Agent	📝 Code Review Agent	📄 Contract Review Agent	📊 Meeting Summary Agent
💰 Financial Reporting Agent	✉️ Email Automation Agent	📑 Invoice Processing Agent	…

Part 2: Context Engineering Framework

Core concept

Context Engineering is the discipline of optimizing token utility to deal with the inherent constraints of LLMs

Four pillars

1. System prompt

Minimal, precise instructions using clear, simple, direct language—“say less, mean more”
Structured sections
Appropriate level of abstraction (not too rigid, not too vague)

2. Tools

Self-contained (i.e., independent), non-overlapping, and purpose-specific—“every tool must justify its existence”
Explicit parameters & concise, distinct descriptions
Clear success/failure modes

3. Data retrieval

Just-in-time context (JIT Context)—“load what you need when you need it”
Balance between preloading and dynamic fetching (agents can fetch autonomously)
Carefully designed retrieval tools—don’t send the whole library, send a librarian

4. Long-horizon optimizations

History compression strategies
Structured note-taking systems
Use sub-agent architectures where appropriate

Data Retrieval Paradigm Shift

Old approach: Preload (traditional RAG) - pre-load all potentially relevant data

New approach: Just-In-Time retrieval

Strategy	Description	Example
Lightweight identifiers	Pass IDs instead of full objects; the agent requests details when needed	`user_id: "12345"` → agent calls `get_user()` → full profile
Progressive disclosure	Start from summaries; the agent drills down as needed	File list → file metadata → file contents
Autonomous exploration	Give discovery tools instead of data dumps; the agent navigates the information space	`search_docs()` + `read_doc(detail_level)` vs loading all documents

Three Strategies for Long-Running Tasks

1. Compaction

Periodically summarize intermediate steps and/or compress history
Reset the context with the compressed summary, keeping only key information
Tradeoff: slight loss of detail in exchange for continuous operation
Example: “User wants X, tried Y, learned Z” vs full conversation

2. Structured memory/notes

The agent maintains explicit memory artifacts (i.e., external persistent storage)
Store “working notes” in structured form: decisions, learnings, state
Retrieve on demand instead of keeping everything in context
Examples: decision logs, key-findings documents

3. Sub-agent architecture

Decompose complex tasks into specialized agents
Each sub-agent has a focused, clean, narrow context
The main agent coordinates and synthesizes results
Example: a code-review agent spawning a documentation-check sub-agent

Part 3: Context Window and Context Rot

Context Window Limits

All frontier models have a maximum number of tokens they can process in a single interaction
Anthropic’s context window is 200k tokens

The Context Rot Problem

As context grows, output quality may degrade

Four main causes:

Type	Description
🧪 Context Poisoning	Incorrect or outdated information pollutes the context, causing the model to reason from wrong premises
📄 Context Distraction	Irrelevant information distracts the model and reduces focus on key information
❓ Context Confusion	Similar but distinct information is mixed together, making it hard for the model to distinguish and associate correctly
🔍⚠️ Context Clash	Contradictory or inconsistent information appears in the context, and the model doesn’t know which to trust

Key conclusion: All models experience performance degradation with long contexts. (See Chroma technical report: Context-Rot: How Increasing Input Tokens Impacts LLM Performance)

Prompt Caching

Prompt caching is a lever for cost and latency
The success of prompt caching is highly related to context structure

Effective context construction and maintenance will:

Goal	Outcome
Handle context window limits	→ Reliability
Reduce context rot	→ Accuracy
Optimize prompt caching	→ Cost & latency

Part 4: Tool Design Best Practices

Elements of strong tool design

Use simple and accurate tool names
Detailed and well-formatted descriptions—include what the tool returns, how it should be used, etc.
Avoid overly similar tool names or descriptions!
Tools that perform a single action work better—aim for at most one level of nested parameters
Provide examples—expected input/output formats
Be mindful of the format of tool results
Test your tools! Ensure the agent can use them correctly

Example tool definition:

{
  "name": "search_customers",
  "description": "Search customer database by name, email, or ID. Returns matching customer records.",
  "input_schema": {
    "type": "object",
    "properties": {
      "query": {
        "type": "string",
        "description": "Search term (name, email, or customer ID)"
      },
      "max_results": {
        "type": "integer",
        "default": 10,
        "description": "Number of results to return (default: 10, max: 50)"
      }
    }
  },
  "required": ["query"]
}

Part 5: Claude Agent SDK

Architecture Overview

The Claude Agent SDK is built on the agent framework that powers Claude Code, and provides all the building blocks needed to construct production-ready agents.

Application / Platform
    ↓
┌─────────────────────────┐
│   Claude Agent SDK      │
├─────────────────────────┤
│       Harness           │
├─────────┬───────┬───────┤
│ Tools   │Prompts│ File  │
│         │       │System │
├─────────┴───────┴───────┤
│       Models            │
├─────────┬───────────────┤
│ Claude  │  Claude Opus  │
│ Sonnet  │               │
└─────────┴───────────────┘

Core Capabilities of the SDK

Tools

Read/write file operations
Code execution
Web search
MCP servers
Skills

Permissions

Human approval checkpoints
Fine-grained permissions
Tool allow/deny lists

Production readiness

Session management
Error handling
Monitoring

Enhancements

Subagents
Web search
Research mode
Auto compacting
Multi-stream
Memory

Design Philosophy of the Agent SDK

Claude Code: Delegate everyday development work to Claude

By giving Claude access to a user’s computer (via a terminal), it can write code like a programmer:

Find files
Write and edit files
Test
Debug
Iteratively perform actions

Claude Agent SDK: Extend Claude Code to build custom agents

The principles of Claude Code can be extended to general agents. By giving Claude the same tools, agents can:

Read CSV files
Search the web
Build visualizations
And more

Core design principle: The Claude Agent SDK gives your agents a computer, so they can work like humans.

Claude Code Tool Suite

Tool	Description	Requires Permission
Agent	Run sub-agents to handle complex multi-step tasks	No
Bash	Execute shell commands in your environment	Yes
Edit	Make targeted edits to specific files	Yes
Glob	Find files based on pattern matching	No
Grep	Search for patterns in file contents	No
LS	List files and directories	No
MultiEdit	Perform multiple edits atomically on a single file	Yes
NotebookEdit	Modify Jupyter notebook cells	Yes
NotebookRead	Read and display Jupyter notebook contents	No
Read	Read file contents	No
TodoRead	Read the task list for the current session	No
TodoWrite	Create and manage structured task lists	No
WebFetch	Fetch content from specified URLs	Yes
WebSearch	Perform web search with domain filtering	Yes
Write	Create or overwrite files	Yes

Characteristics of a Strong Agent Framework

Does not over-script or over-scaffold the model
Allows tuning of all key system parts (context engineering)
Leverages all model capabilities (extended thinking, interleaved thinking, parallel tool calls, etc.)
Provides access to memory
Enables multi-agent setups where valuable
Has a robust agent permission system

Part 6: Sub-Agent Configuration Best Practices

The Description field is critical for auto-invocation

Make descriptions specific and action-oriented
Use “PROACTIVELY” or “MUST BE USED” to encourage automatic delegation
Example: "Use PROACTIVELY when code changes might impact performance. MUST BE USED for optimization tasks."

Tool Permissions

Restrict tools to what each sub-agent needs
Example: a code reviewer gets Read, Grep, Glob but not Write or Edit

Model Selection

Use inherit to match the main conversation for consistency
Specify sonnet, opus, or haiku according to specific needs
If omitted, defaults to sonnet

Native Sub-Agent Orchestration Best Practices

Manage context limits within the agent framework
When the context window is cleared, consider restarting rather than compressing
Prompt around early compaction
Be prescriptive about how to start
Provide verification tools
As autonomous task duration increases, Claude needs to verify correctness without constant human feedback

Best Practices for Research Tasks

To get the best research results:

Provide clear success criteria—spell out what counts as a successful answer
Encourage multi-source validation—verify information across multiple sources
Use a structured approach for complex research—proceed step by step and methodically

Part 7: MCP (Model Context Protocol)

What is MCP?

                AI applications                    Data sources and tools
                
Chat interface          ←→                          Data and file systems
Claude Desktop,             ┌─────────┐            PostgreSQL, SQLite, GDrive
LibreChat                   │   MCP   │
                        ←→  │Standardized│  ←→      Development tools
IDEs and code editors       │ protocol │            Git, Sentry, etc.
Claude Code, Goose          └─────────┘
                        ←→              ←→          Productivity tools
Other AI applications                               Slack, Google Maps, etc.
Sire, Superinterface        Bidirectional
                            data flow

Where MCP Is Heading

The latest specification (June 2025) focuses on structured tool outputs, OAuth authorization, mechanisms for server-initiated interaction requests, and security best practices.

Future directions:

Asynchronous operations
Statelessness and scalability
Server identity and discovery

Part 8: Evaluations

Types of evaluations

Evaluation Type	Description	Purpose	Examples
Intelligence benchmarks	Evaluate a model’s general intelligence	Compare with other models, model release decisions	MMLU, GPQA
Capability benchmarks	Evaluate the model’s performance on general capability domains	Compare with other models, model release positioning	MATH, HumanEval, SWE-Bench
Behavioral evaluations	Quantify how common specific model behaviors are	Monitor and improve model behavior	Refusal rate, hallucinations, “Certainly!” frequency
Safety evaluations	Use threat analysis and red-teaming to evaluate the experience of bad actors	Understand the risks of feature or product launches	Computer Use, browser-use red-teaming
Product evaluations	Evaluate the model’s ability to perform tasks within specific product features	Product launch decisions and iteration	Artifacts, data features, multimodal PDF

Characteristics of good evaluations

Measure performance as well as regressions
User-centered, covering the full range of expected user behaviors
Consider edge cases and risks
Recognized by multiple stakeholders (such as Legal)
Have (rough) targets
If your feature is common in other AI products, there may be benchmarks available to use or adapt
Negative examples are extremely important — they define the boundaries of the feature and ensure it doesn’t over-trigger

Excellent evaluations: can be scored objectively and procedurally

Evaluation process

Establish a baseline: For positive and negative examples, run prompts using the current production configuration and record the outputs as the baseline
Outline expected behavior: For positive and negative examples, explain “What is the expected behavior after applying the prompt changes?”
[Optional] Scoring: Build a scorer that checks model output against expected behavior (exact match, regex, or model-based)

Automation and iteration

The faster evaluations run, the easier it is to iterate.

Tool options:

Anthropic Console
Custom scripts or notebooks
Custom tools

Two dimensions of iteration:

Feature iteration: change system prompts, tool definitions, system flows
Evaluation iteration: you may see behaviors you don’t like and didn’t anticipate, and need to add use cases or prompts into the evaluation to test them

Agent evaluation examples

1. Answer Accuracy

The LLM judges the correctness of the Agent’s answer

User: How many employees started in 2023 and are still active?

Agent: query_employees(hired_year=2023, status="active")

Tools: {"count": 47, "departments": {"eng": 23, "sales": 15, "ops": 9}}

Agent: 47 employees who started in 2023 are still active. 
       They're distributed across: Engineering (23), Sales (15), and Operations (9).

LLM Judge:
Evaluation: CORRECT 
- Accurately reported the total count (47) 
- Properly filtered by hire year and status 
- Added helpful department breakdown
Score: 10/10

✅ Pass: Answer verified as correct

2. Tool Use Accuracy

Evaluate correct tool selection and parameters

User: Book a flight to Paris tomorrow morning

Agent: search_flights(destination="Paris", date="tomorrow")

Tools: {"error": "Invalid date format"}

Agent: search_flights(destination="Paris", departure_date="2024-05-19", time_range="06:00-12:00")

Tools: {"flights": [{"flight_number": "AF123", ...}]}

✅ Pass: Recovered from parameter error

3. t-bench

Evaluate whether the Agent reaches the correct final state

User: Change Flight

Agent: get_reservation_details(JX9010)

Tools: {...}

Agent: cancel_reservation(JX9010)

Tools: {"status": "cancelled"}

Agent: JX9010 is basic economy and cannot be changed. 
       But since it was within 24h, I cancelled it and you can book a new one. 
       Do you want me to do it?

User: in that case just cancel it

Agent: cancel_reservation(JX9010)

Tools: {...}

✅ Pass: Agent reached correct final state

Tips for evaluating Agent systems

The larger the effect size, the smaller the sample size you need: at the start you only need a few test cases, and each system change will produce significant, noticeable impacts
Use real tasks: try to evaluate research systems on tasks real users might do, preferably with clearly correct answers that can be found using available tools
LLM-as-judge with a rubric is very powerful: LLMs are now strong enough to be excellent evaluators of outputs if given clear rubrics that align with human judgment
Nothing perfectly replaces human evaluation: nothing beats repeatedly testing and sanity-checking the system yourself, and testing with real users — humans spot rough edges!

Part 9: Lessons from building a Coding Agent

What we learned

Everything is a file
Bash is the ultimate tool
Most tool calls are just code
Agentic Search > RAG

What Agents still need

Memory
Sub Agents & Collaboration
Dynamic Tool Calls
Code Generation & Execution
Web Search
Agentic Search
Long Running Tasks

Part 10: Ecosystem synergies

How Prompts, MCP, Skills, and Subagents work together

Feature	Prompts	MCP	Skills	Subagents
What they provide	On-the-fly instructions	Tool connections	Procedural knowledge	Task delegation
Persistence	Single conversation	Persistent connection	Across conversations	Across sessions
What they contain	Natural language	Tool definitions	Instructions + code + resources	Full Agent logic
When they load	Every turn	Always available	Dynamically on demand	When called
Can contain code	No	Yes	Yes	Yes
Best suited for	Quick requests	Data access	Specialized knowledge	Specialized tasks

Example Agent workflow

MCP connects to Google Drive and GitHub
Skills provide analysis frameworks (such as competitive analysis methodologies)
Subagents execute specialized tasks in parallel (market researcher, technical analyst)
Prompts refine and provide specific context

Matching the right tool to the use case

Simple procedural knowledge that needs repeated use → Skill
Need access to external data sources → MCP
Need independent execution and independent context → Subagent
Complex workflows → Combination of all three

Conclusion

Context Engineering is the core discipline for building effective AI applications. By using Skills, MCP, and Subagents appropriately and following best practices for tool design and evaluation, you can fully unlock Claude’s potential and build truly production-ready Agent systems.

Remember: Claude is already smart enough; the key to making it successful is giving it the right context.