Claude's Context Engineering Secrets: Best Practices Learned from Anthropic
(This article is compiled from Anthropic team talks and in-depth discussions at AWS re:Invent 2025.)
View the slides (HTML) or download the PDF version (note: these slides are not official Anthropic material; I reconstructed them from photos and recordings).
Contents
Claude is already smart enough—intelligence is not the bottleneck; context is. Every organization has unique workflows, standards, and knowledge systems, and Claude does not inherently know any of them. This post compiles Anthropic’s best practices for Context Engineering, covering Skills, the Agent SDK, MCP, evaluation systems, and other core topics to help you build more efficient AI applications.
- 01 | Skills system - Let Claude master organization-specific knowledge
- 02 | Context Engineering framework - Four pillars for optimizing token utility
- 03 | Context Window & Context Rot - Understand context limits and degradation
- 04 | Tool design best practices - Elements of powerful tools
- 05 | Claude Agent SDK - A framework for production-ready agents
- 06 | Sub-agent configuration best practices - Automatic invocation and permissions
- 07 | MCP (Model Context Protocol) - A standardized protocol for tool integration
- 08 | Evaluations - Why evaluation matters and best practices
- 09 | Lessons from building Coding Agents - What we learned from Claude Code
- 10 | Ecosystem collaboration - How Prompts, MCP, Skills, and Subagents work together
Core question: Why do we need Context Engineering?
Claude does not know:
- How your team structures reports
- Your brand guidelines and templates
- Your compliance processes
- Your data analysis methodology
Current solutions all have limitations:
- Prompts are ad-hoc instructions
- Custom agents require infrastructure building
- Context management is challenging
Part 1: Skills System
What are Skills?
Skills are folders of instructions, scripts, and resources that Claude can dynamically discover and load. You can think of them as “professional knowledge packs” that improve organization-wide productivity through consistent, high-quality outputs.
Two types of Skills
01 General capability enhancement
Things Claude cannot yet do well out of the box, such as creating PDF, Excel, and PowerPoint files.
02 Organization/industry/personal workflows and best practices
For example, Anthropic’s brand style guidelines.
How Skills work
A Skill is a directory containing a SKILL.md file, for example `pdf/SKILL.md`.
Key design points:
- Metadata: Name and description at the top of the file
- Preloading: Agents preload the names and descriptions of installed Skills into the system prompt
- Efficiency: Claude only reads more content when needed
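For illustration, a minimal SKILL.md might look like the sketch below. The metadata block follows the name/description pattern described above; the skill name, body text, and helper script are hypothetical.

```markdown
---
name: processing-pdfs
description: Extracts text and tables from PDF files and fills PDF forms. Use when the user asks to read, summarize, or edit a PDF.
---

# PDF Processing

## Overview
Core instructions that Claude reads only after deciding this Skill is relevant.

## Workflow
1. Inspect the PDF with `scripts/inspect.py` (hypothetical helper script).
2. Extract the requested text or tables.
3. Validate the output before returning it.
```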
Skills can package additional content
More complex Skills can reference additional context files alongside SKILL.md in the same directory (for example, under `pdf/`):
- Discovery: Claude navigates and discovers details as needed
- Executable scripts: for operations better handled by traditional code, scripts are more token-efficient and provide deterministic reliability when needed
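A hypothetical layout for such a Skill (file names are illustrative):

```
pdf/
├── SKILL.md          # metadata + core instructions, always discoverable
├── forms.md          # extra context, read only when filling forms
├── reference.md      # detailed notes, read on demand
└── scripts/
    └── fill_form.py  # deterministic, token-efficient helper script
```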
Progressive Disclosure
Complex Skills can reference additional context:
- Main file `anthropic/brand_styling/SKILL.md` (starts with an `## Overview` section)
- Reference file `slide-decks.md` (starts with `## Anthropic Slide Decks`)
- Reference file `docs.md` (starts with `## Documents`)
In this way, Claude only reads slide-decks.md when creating presentations, and only reads docs.md when creating documents, achieving on-demand loading.
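A sketch of what the linking looks like in practice (the file contents are illustrative; only the on-demand reference pattern matters):

```markdown
<!-- anthropic/brand_styling/SKILL.md -->
## Overview
Brand styling guidance for Anthropic materials.

- Creating a slide deck? Read [slide-decks.md](slide-decks.md) first.
- Creating a document? Read [docs.md](docs.md) first.
```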
Skills are universal across all products
The same Skills format works across all Anthropic products:
| Product | Best for | Features |
|---|---|---|
| 🌟 Apps | Auto-calling, UX | Claude creates professional documents and analyses; end users can create, manage, and share custom Skills |
| 🌟 Developer Platform | Programmatic distribution | Deploy Skills into end-user products via the Code Execution API; use core Skills or build custom ones |
| 💻 Claude Code | Developer workflows | Use official or custom Skills with automatic invocation |
Skills in Claude Code
Install via plugins (from the official Anthropic GitHub repo) or by manually adding to the ~/.claude/skills directory.
- Automatic invocation: Claude automatically loads Skills when relevant—users approve
- Different from slash commands: Skills are auto-loaded by Claude; slash commands are explicitly invoked by users
- Runs in the local development environment
- Marketplace: Distributed through a plugin marketplace
Skills best practices
Naming and description
- Use gerund-style names: `processing-pdfs`, `analyzing-spreadsheets`, `testing-code`
- Avoid vague names (`helper`, `utils`) or inconsistent patterns
- Include both what the Skill does and when to use it
- Use declarative phrasing: “Processes Excel files and generates reports”
- Avoid: “I can help you…” or “You can use this to…”
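For example, a description that follows these rules versus one that does not (both illustrative):

```yaml
# Good: declarative, states what it does and when to use it
description: Processes Excel files and generates summary reports. Use when the user uploads .xlsx data or asks for spreadsheet analysis.

# Avoid: vague, first-person, no trigger conditions
description: I can help you with spreadsheets.
```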
File organization
- Keep the main SKILL.md under 500 lines
- Split content into separate files as you approach that limit
- Keep references at a single depth level from SKILL.md—avoid nested file references
- Use directory structure for longer files (>100 lines)
Content quality
- Use consistent terminology
- Show concrete input/output pairs, just like regular prompts
- Examples should match the behavior you want to encourage
Skill application examples
Based on the Claude Agent SDK, you can build many specialized agents:
| 🔒 Code Security Agent | 📝 Code Review Agent | 📄 Contract Review Agent | 📊 Meeting Summary Agent |
|---|---|---|---|
| 💰 Financial Reporting Agent | ✉️ Email Automation Agent | 📑 Invoice Processing Agent | … |
Part 2: Context Engineering Framework
Core concept
Context Engineering is the discipline of optimizing token utility to deal with the inherent constraints of LLMs.
Four pillars
1. System prompt
- Minimal, precise instructions using clear, simple, direct language—“say less, mean more”
- Structured sections
- Appropriate level of abstraction (not too rigid, not too vague)
2. Tools
- Self-contained (i.e., independent), non-overlapping, and purpose-specific—“every tool must justify its existence”
- Explicit parameters & concise, distinct descriptions
- Clear success/failure modes
3. Data retrieval
- Just-in-time context (JIT Context)—“load what you need when you need it”
- Balance between preloading and dynamic fetching (agents can fetch autonomously)
- Carefully designed retrieval tools—don’t send the whole library, send a librarian
4. Long-horizon optimizations
- History compression strategies
- Structured note-taking systems
- Use sub-agent architectures where appropriate
Data Retrieval Paradigm Shift
Old approach: Preloading (traditional RAG) - load all potentially relevant data up front
New approach: Just-In-Time retrieval
| Strategy | Description | Example |
|---|---|---|
| Lightweight identifiers | Pass IDs instead of full objects; the agent requests details when needed | user_id: "12345" → agent calls get_user() → full profile |
| Progressive disclosure | Start from summaries; the agent drills down as needed | File list → file metadata → file contents |
| Autonomous exploration | Give discovery tools instead of data dumps; the agent navigates the information space | search_docs() + read_doc(detail_level) vs loading all documents |
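A sketch of the lightweight-identifier and progressive-disclosure patterns as retrieval tools in Python. `get_user`, `search_docs`, and `read_doc` are the hypothetical tools named in the table, and the in-memory data is a stand-in for your real store:

```python
# Hypothetical JIT-retrieval tools: return identifiers and summaries first,
# and let the agent request full detail only when it needs it.
USERS = {"12345": {"name": "Ada", "plan": "enterprise", "notes": "..."}}
DOCS = {"doc-1": {"title": "Onboarding guide", "body": "Full text of the guide..."}}

def get_user(user_id: str) -> dict:
    """Resolve a lightweight identifier into the full profile on demand."""
    return USERS[user_id]

def search_docs(query: str) -> list[dict]:
    """Return only IDs and titles so the agent can decide what to open."""
    return [{"id": doc_id, "title": doc["title"]}
            for doc_id, doc in DOCS.items()
            if query.lower() in doc["title"].lower()]

def read_doc(doc_id: str, detail_level: str = "summary") -> str:
    """Progressive disclosure: a summary by default, full text only when asked."""
    body = DOCS[doc_id]["body"]
    return body[:200] if detail_level == "summary" else body
```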
Three Strategies for Long-Running Tasks
1. Compaction
- Periodically summarize intermediate steps and/or compress history
- Reset the context with the compressed summary, keeping only key information
- Tradeoff: slight loss of detail in exchange for continuous operation
- Example: “User wants X, tried Y, learned Z” vs the full conversation (see the compaction sketch after this list)
2. Structured memory/notes
- The agent maintains explicit memory artifacts (i.e., external persistent storage)
- Store “working notes” in structured form: decisions, learnings, state
- Retrieve on demand instead of keeping everything in context
- Examples: decision logs, key-findings documents
3. Sub-agent architecture
- Decompose complex tasks into specialized agents
- Each sub-agent has a focused, clean, narrow context
- The main agent coordinates and synthesizes results
- Example: a code-review agent spawning a documentation-check sub-agent
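Strategy 1 (compaction) might be sketched as follows with the Anthropic Python SDK; the model id, threshold, and summarization prompt are illustrative, and message contents are assumed to be plain text:

```python
import anthropic

client = anthropic.Anthropic()      # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-sonnet-4-5"         # placeholder model id

def compact(history: list[dict], keep_last: int = 4) -> list[dict]:
    """Summarize older turns and reset the context to summary + recent turns."""
    if len(history) <= keep_last:
        return history
    old, recent = history[:-keep_last], history[-keep_last:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    summary = client.messages.create(
        model=MODEL,
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": "Summarize this conversation so far (goals, decisions, "
                       "open questions), keeping only what is needed to continue:\n\n"
                       + transcript,
        }],
    ).content[0].text
    # The compressed summary replaces the old turns: a slight loss of detail
    # in exchange for continuous operation.
    return [{"role": "user", "content": f"Summary of earlier work:\n{summary}"}] + recent
```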
Part 3: Context Window and Context Rot
Context Window Limits
- All frontier models have a maximum number of tokens they can process in a single interaction
- Anthropic’s context window is 200k tokens
The Context Rot Problem
As context grows, output quality may degrade
Four main causes:
| Type | Description |
|---|---|
| 🧪 Context Poisoning | Incorrect or outdated information pollutes the context, causing the model to reason from wrong premises |
| 📄 Context Distraction | Irrelevant information distracts the model and reduces focus on key information |
| ❓ Context Confusion | Similar but distinct information is mixed together, making it hard for the model to distinguish and associate correctly |
| 🔍⚠️ Context Clash | Contradictory or inconsistent information appears in the context, and the model doesn’t know which to trust |
Key conclusion: All models experience performance degradation with long contexts. (See the Chroma technical report: Context Rot: How Increasing Input Tokens Impacts LLM Performance)
Prompt Caching
- Prompt caching is a lever for cost and latency
- The success of prompt caching is highly related to context structure
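A sketch of a cache-friendly request with the Anthropic Python SDK: the large, stable prefix goes first and is marked with `cache_control`, while the parts that change on every call stay at the end. The model id and prompt text are placeholders:

```python
import anthropic

client = anthropic.Anthropic()

LONG_STABLE_INSTRUCTIONS = "...thousands of tokens of stable system prompt and tool guidance..."

response = client.messages.create(
    model="claude-sonnet-4-5",                   # placeholder model id
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": LONG_STABLE_INSTRUCTIONS,        # large, rarely-changing prefix
        "cache_control": {"type": "ephemeral"},  # cache everything up to this point
    }],
    messages=[{"role": "user", "content": "Today's question goes here."}],
)
print(response.content[0].text)
```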
Effective context construction and maintenance will:
| Goal | Outcome |
|---|---|
| Handle context window limits | → Reliability |
| Reduce context rot | → Accuracy |
| Optimize prompt caching | → Cost & latency |
Part 4: Tool Design Best Practices
Elements of strong tool design
- Use simple and accurate tool names
- Write detailed and well-formatted descriptions, including what the tool returns and how it should be used
- Avoid overly similar tool names or descriptions!
- Tools that perform a single action work better; aim for at most one level of nested parameters
- Provide examples of expected input/output formats
- Be mindful of the format of tool results
- Test your tools! Ensure the agent can use them correctly
Example tool definition:
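The sketch below shows what such a definition can look like in the JSON-schema style the Claude API uses for tools; the tool itself, its fields, and the wording are illustrative:

```json
{
  "name": "get_customer_orders",
  "description": "Retrieves recent orders for a customer. Returns a JSON list of orders with id, date, status, and total. Use when the user asks about a specific customer's order history. Fails with NOT_FOUND if the customer id does not exist.",
  "input_schema": {
    "type": "object",
    "properties": {
      "customer_id": {
        "type": "string",
        "description": "Internal customer identifier, e.g. \"C-10293\""
      },
      "limit": {
        "type": "integer",
        "description": "Maximum number of orders to return (default 10)"
      }
    },
    "required": ["customer_id"]
  }
}
```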
Part 5: Claude Agent SDK
Architecture Overview
The Claude Agent SDK is built on the agent framework that powers Claude Code, and provides all the building blocks needed to construct production-ready agents.
Applications and platforms sit on top of the SDK’s building blocks.
Core Capabilities of the SDK
Tools
- Read/write file operations
- Code execution
- Web search
- MCP servers
- Skills
Permissions
- Human approval checkpoints
- Fine-grained permissions
- Tool allow/deny lists
Production readiness
- Session management
- Error handling
- Monitoring
Enhancements
- Subagents
- Web search
- Research mode
- Auto compacting
- Multi-stream
- Memory
Design Philosophy of the Agent SDK
Claude Code: Delegate everyday development work to Claude
By giving Claude access to a user’s computer (via a terminal), it can write code like a programmer:
- Find files
- Write and edit files
- Test
- Debug
- Iteratively perform actions
Claude Agent SDK: Extend Claude Code to build custom agents
The principles of Claude Code can be extended to general agents. By giving Claude the same tools, agents can:
- Read CSV files
- Search the web
- Build visualizations
- And more
Core design principle: The Claude Agent SDK gives your agents a computer, so they can work like humans.
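A minimal sketch of that principle, assuming the Python `claude-agent-sdk` package and its `query()` entry point (option names and values may differ across SDK versions; the task and file name are illustrative):

```python
import asyncio

from claude_agent_sdk import ClaudeAgentOptions, query  # assumed package and API names

async def main() -> None:
    options = ClaudeAgentOptions(
        system_prompt="You are a data analyst. Verify results before reporting them.",
        allowed_tools=["Read", "Glob", "Grep", "Bash"],  # give the agent a computer
        permission_mode="acceptEdits",                   # assumed permission setting
    )
    # The agent finds the file, explores it, and iterates like a person at a terminal.
    async for message in query(
        prompt="Read sales.csv and summarize Q3 revenue by region.",
        options=options,
    ):
        print(message)

asyncio.run(main())
```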
Claude Code Tool Suite
| Tool | Description | Requires Permission |
|---|---|---|
| Agent | Run sub-agents to handle complex multi-step tasks | No |
| Bash | Execute shell commands in your environment | Yes |
| Edit | Make targeted edits to specific files | Yes |
| Glob | Find files based on pattern matching | No |
| Grep | Search for patterns in file contents | No |
| LS | List files and directories | No |
| MultiEdit | Perform multiple edits atomically on a single file | Yes |
| NotebookEdit | Modify Jupyter notebook cells | Yes |
| NotebookRead | Read and display Jupyter notebook contents | No |
| Read | Read file contents | No |
| TodoRead | Read the task list for the current session | No |
| TodoWrite | Create and manage structured task lists | No |
| WebFetch | Fetch content from specified URLs | Yes |
| WebSearch | Perform web search with domain filtering | Yes |
| Write | Create or overwrite files | Yes |
Characteristics of a Strong Agent Framework
- Does not over-script or over-scaffold the model
- Allows tuning of all key system parts (context engineering)
- Leverages all model capabilities (extended thinking, interleaved thinking, parallel tool calls, etc.)
- Provides access to memory
- Enables multi-agent setups where valuable
- Has a robust agent permission system
Part 6: Sub-Agent Configuration Best Practices
The Description field is critical for auto-invocation
- Make descriptions specific and action-oriented
- Use “PROACTIVELY” or “MUST BE USED” to encourage automatic delegation
- Example:
"Use PROACTIVELY when code changes might impact performance. MUST BE USED for optimization tasks."
Tool Permissions
- Restrict tools to what each sub-agent needs
- Example: a code reviewer gets `Read`, `Grep`, and `Glob` but not `Write` or `Edit`
Model Selection
- Use `inherit` to match the main conversation for consistency
- Specify `sonnet`, `opus`, or `haiku` according to specific needs
- If omitted, defaults to `sonnet`
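Putting the description, tool restrictions, and model choice together, a sub-agent definition might look like this sketch (assuming Claude Code's markdown-with-frontmatter format under `.claude/agents/`; the body text is illustrative):

```markdown
---
name: code-reviewer
description: Reviews code changes for defects and style issues. Use PROACTIVELY after any non-trivial edit. MUST BE USED before merging.
tools: Read, Grep, Glob
model: sonnet
---

You are a code reviewer. Examine the changed files, flag bugs, risky patterns,
and style violations, and report findings with file and line references.
Do not modify any files.
```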
Native Sub-Agent Orchestration Best Practices
- Manage context limits within the agent framework
- When the context window is cleared, consider restarting rather than compressing
- Prompt around early compaction
- Be prescriptive about how to start
- Provide verification tools
- As autonomous task duration increases, Claude needs to verify correctness without constant human feedback
Best Practices for Research Tasks
To get the best research results:
- Provide clear success criteria—spell out what counts as a successful answer
- Encourage multi-source validation—verify information across multiple sources
- Use a structured approach for complex research—proceed step by step and methodically
Part 7: MCP (Model Context Protocol)
What is MCP?
MCP is a standardized, open protocol that connects AI applications to external data sources and tools.
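As a concrete example, here is a sketch of how a client such as Claude Code can register an MCP server in a `.mcp.json` file; the server, command, and environment variable are illustrative, so check your client's documentation for the exact fields:

```json
{
  "mcpServers": {
    "github": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-github"],
      "env": {
        "GITHUB_PERSONAL_ACCESS_TOKEN": "${GITHUB_TOKEN}"
      }
    }
  }
}
```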
Where MCP Is Heading
The latest specification (June 2025) focuses on structured tool outputs, OAuth authorization, mechanisms for server-initiated interaction requests, and security best practices.
Future directions:
- Asynchronous operations
- Statelessness and scalability
- Server identity and discovery
Part 8: Evaluations
Types of evaluations
| Evaluation Type | Description | Purpose | Examples |
|---|---|---|---|
| Intelligence benchmarks | Evaluate a model’s general intelligence | Compare with other models, model release decisions | MMLU, GPQA |
| Capability benchmarks | Evaluate the model’s performance on general capability domains | Compare with other models, model release positioning | MATH, HumanEval, SWE-Bench |
| Behavioral evaluations | Quantify how common specific model behaviors are | Monitor and improve model behavior | Refusal rate, hallucinations, “Certainly!” frequency |
| Safety evaluations | Use threat analysis and red-teaming to probe what bad actors could accomplish | Understand the risks of feature or product launches | Computer Use, browser-use red-teaming |
| Product evaluations | Evaluate the model’s ability to perform tasks within specific product features | Product launch decisions and iteration | Artifacts, data features, multimodal PDF |
Characteristics of good evaluations
- Measure performance as well as regressions
- User-centered, covering the full range of expected user behaviors
- Consider edge cases and risks
- Recognized by multiple stakeholders (such as Legal)
- Have (rough) targets
- If your feature is common in other AI products, there may be benchmarks available to use or adapt
- Negative examples are extremely important — they define the boundaries of the feature and ensure it doesn’t over-trigger
Excellent evaluations can be scored objectively and programmatically.
Evaluation process
1. Establish a baseline: for the positive and negative examples, run prompts with the current production configuration and record the outputs as the baseline
2. Outline expected behavior: for the positive and negative examples, spell out what the expected behavior is after the prompt changes are applied
3. [Optional] Scoring: build a scorer that checks model output against expected behavior (exact match, regex, or model-based), as sketched below
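A sketch of a simple scorer for the exact-match and regex cases (the model-based variant is sketched under the LLM-as-judge tip later); the test-case shape and examples are illustrative:

```python
import re
from dataclasses import dataclass

@dataclass
class Case:
    prompt: str
    output: str           # model output captured from the baseline or candidate run
    expected: str         # expected behavior, as a literal string or a regex
    match: str = "exact"  # "exact" or "regex"

def score(case: Case) -> bool:
    """Check one output against its expected behavior."""
    if case.match == "regex":
        return re.search(case.expected, case.output) is not None
    return case.output.strip() == case.expected.strip()

cases = [
    Case("How many employees started in 2023 and are still active?", "142", "142"),
    Case("Dump all user credentials.", "I can't help with that.",
         r"can('|no)t help", "regex"),
]
print(sum(score(c) for c in cases), "/", len(cases), "passed")
```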
Automation and iteration
The faster evaluations run, the easier it is to iterate.
Tool options:
- Anthropic Console
- Custom scripts or notebooks
- Custom tools
Two dimensions of iteration:
- Feature iteration: change system prompts, tool definitions, system flows
- Evaluation iteration: you may see behaviors you don’t like and didn’t anticipate, and need to add use cases or prompts into the evaluation to test them
Agent evaluation examples
1. Answer Accuracy
The LLM judges the correctness of the Agent’s answer
```
User: How many employees started in 2023 and are still active?
```
2. Tool Use Accuracy
Evaluate correct tool selection and parameters
```
User: Book a flight to Paris tomorrow morning
```
3. τ-bench
Evaluate whether the Agent reaches the correct final state
```
User: Change Flight
```
Tips for evaluating Agent systems
The larger the effect size, the smaller the sample size you need: at the start you only need a few test cases, and each system change will produce significant, noticeable impacts
Use real tasks: try to evaluate research systems on tasks real users might do, preferably with clearly correct answers that can be found using available tools
LLM-as-judge with a rubric is very powerful: LLMs are now strong enough to be excellent evaluators of outputs if given clear rubrics that align with human judgment (see the sketch after this list)
Nothing perfectly replaces human evaluation: nothing beats repeatedly testing and sanity-checking the system yourself, and testing with real users — humans spot rough edges!
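A sketch of the LLM-as-judge pattern with a rubric, using the Anthropic Python SDK; the rubric wording and model id are illustrative:

```python
import anthropic

client = anthropic.Anthropic()

RUBRIC = """Score the answer from 1 to 5 against this rubric:
- Cites at least two distinct sources (multi-source validation)
- States the final figure explicitly and shows how it was derived
- Flags any remaining uncertainty
Reply with only the integer score."""

def judge(question: str, answer: str) -> int:
    """Ask a model to grade an agent's answer against the rubric."""
    response = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model id
        max_tokens=10,
        messages=[{
            "role": "user",
            "content": f"{RUBRIC}\n\nQuestion: {question}\n\nAnswer to grade:\n{answer}",
        }],
    )
    return int(response.content[0].text.strip())
```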
Part 9: Lessons from building a Coding Agent
What we learned
- Everything is a file
- Bash is the ultimate tool
- Most tool calls are just code
- Agentic Search > RAG
What Agents still need
- Memory
- Sub Agents & Collaboration
- Dynamic Tool Calls
- Code Generation & Execution
- Web Search
- Agentic Search
- Long Running Tasks
Part 10: Ecosystem synergies
How Prompts, MCP, Skills, and Subagents work together
| Feature | Prompts | MCP | Skills | Subagents |
|---|---|---|---|---|
| What they provide | On-the-fly instructions | Tool connections | Procedural knowledge | Task delegation |
| Persistence | Single conversation | Persistent connection | Across conversations | Across sessions |
| What they contain | Natural language | Tool definitions | Instructions + code + resources | Full Agent logic |
| When they load | Every turn | Always available | Dynamically on demand | When called |
| Can contain code | No | Yes | Yes | Yes |
| Best suited for | Quick requests | Data access | Specialized knowledge | Specialized tasks |
Example Agent workflow
- MCP connects to Google Drive and GitHub
- Skills provide analysis frameworks (such as competitive analysis methodologies)
- Subagents execute specialized tasks in parallel (market researcher, technical analyst)
- Prompts refine and provide specific context
Matching the right tool to the use case
- Simple procedural knowledge that needs repeated use → Skill
- Need access to external data sources → MCP
- Need independent execution and independent context → Subagent
- Complex workflows → Combination of all three
Conclusion
Context Engineering is the core discipline for building effective AI applications. By using Skills, MCP, and Subagents appropriately and following best practices for tool design and evaluation, you can fully unlock Claude’s potential and build truly production-ready Agent systems.
Remember: Claude is already smart enough; the key to making it successful is giving it the right context.