A Leak That Explains Claude Code: Harness Is the Key to Making Agents Reliable
On April 1, 2026, the complete source code of Anthropic’s Claude Code was leaked via an npm package. Open the source map and there it is: 1,903 files, 510,000 lines of TypeScript, everything laid out in the open.
Hidden in the source code: a complete pet gacha machine
The first thing people found in the codebase was a full pet system—Buddy. Enter /buddy and you can “hatch” your own dedicated CLI pet: 18 species, 5 rarity tiers (legendary only 1%), 5 random attributes, 6 eye types, 8 hats, 1% shiny rate, and 3-frame ASCII animation. Each user’s pet is deterministically generated from userId + SALT.
Inside a 510k-line production-grade AI Agent, there’s a pet system built with this much care. But if you read the code carefully, there are a few places that really make you think:
Evidence 1: SALT = 'friend-2026-401'—friend + April 1, 2026. The leak date, accurate to the day.
Evidence 2: The teaser window is precisely April 1–7, 2026. The comment says “Sustained Twitter buzz instead of a single UTC-midnight spike”—this doesn’t read like an engineer’s description of an internal feature; it reads like marketing copy.
Evidence 3: All 18 species names are constructed with String.fromCharCode(0x…) (hex encoding). The reason: capybara collided with the internal codename for Anthropic’s next-generation model (it appears in the blacklist excluded-strings.txt). Rather than encode just that one name and make it conspicuous, they hex-encoded all the species names—“so one doesn’t stand out.” But capybara just happens to be the previously leaked name of the new model.
Evidence 4: Using hex encoding everywhere actually ensured that every reverse engineer would go decode them—if the goal was to hide, the effect is exactly the opposite.
Was this leak really a coincidence?
There are three possible interpretations:
- A. Pure coincidence (10%): Buddy was a planned April Fool’s Easter egg, the source map was a configuration mistake that happened to occur on the same day. This requires a lot of coincidence.
- B. The engineering team “accidentally” did it (55%): Someone “accidentally” turned on source maps in that build. Legal sending a DMCA is a genuine stress reaction, but a window of more than ten hours is more than enough for the code to spread globally. The Buddy Easter egg was a pre-planted trigger.
- C. Other possibilities: Completely accidental but later tolerated (20%), or planned by the company (15%).
Regardless of which is true, the outcome is the same: developers worldwide just did a free, in-depth code review and word-of-mouth campaign. This may be the most successful piece of tech marketing in 2026, intentional or not.
The real value: a rare window
The technical value of this leak doesn’t lie in any single clever implementation, but in the rare window it provides: what problems is a large-scale, commercially deployed AI Agent product actually solving at an engineering level? Over the past two years, AI Agents have gone from a paper concept to product reality, but almost all public discussion has been stuck at two extremes—either beginner tutorials about “letting models call tools,” or grand narratives about “AGI is coming.” Almost no one has clearly explained the middle layer.
After reading this codebase, the strongest impression is: the core challenge of Agents isn’t “letting the model call tools,” it’s everything outside the model, prompts, and tools. How do you decide permissions, recover from errors, manage context, keep caches consistent, coordinate parallelism, hide intermediate failures—this engineering is the real barrier between a demo and a production Agent product. And this “everything outside the model” has a formal name: Harness.
Based on the Claude Code source and related analyses, this article systematically breaks down Harness Engineering as an Agent engineering paradigm—what it is, why it matters, how Claude Code implements it, and what we can learn from it.
I. Harness Engineering: the most important Agent engineering paradigm of 2026
The evolution from Prompt to Harness
In recent years, AI engineering practice has followed a clear evolutionary path. First came Prompt Engineering, focused on “what to ask”—optimizing instructions given to the model. Then came Context Engineering, focused on “what to show”—systematically managing the information the model can see. By 2026, the industry is moving toward Harness Engineering, focused on “the whole system”—all the infrastructure around model execution.
The three are nested: Prompt ⊂ Context ⊂ Harness.
The definition of a Harness is straightforward: everything outside the model. How to provide context, how to call tools, how to recover from errors, how to ensure safety, how to share caches, how to coordinate parallelism. As model capabilities become commoditized, the competitive edge is shifting to the engineering practices outside the model.
This isn’t just talk. In Terminal Bench 2.0, LangChain ran an experiment: without changing the model, only changing the Harness, accuracy jumped from 52.8% to 66.5%, and the ranking went from below 30th place straight into the top 5. OpenAI has similar internal cases: with an Agent plus a reasonable Harness, 3 engineers finished about a million lines of code and around 1,500 PRs in 5 months.
Claude Code itself is the best real-world example of Harness Engineering. In those 510k lines of TypeScript, the vast majority of code is not about “letting the model call tools,” but about everything that happens after the tool calls.
The core formula of Agents
To understand Harnesses, you first need to understand the core formula of Agents. An Agent has to solve three things:
| Layer | Depends on | Analogy |
|---|---|---|
| Intelligence — can it figure things out | Model | Brain |
| Capability — can it actually get things done | Environment | Hands and feet |
| Reliability — will it do the wrong thing | Constraints / validation / correction | Reins |
In formula form: Agent = Model + Harness, and the Harness has two main parts: the Environment (letting the Agent act) and constraints/validation/correction (making the Agent act reliably).
In Claude Code’s implementation, the Environment includes 40 built-in tools, five-layer context compression, prompt cache economics, the CLAUDE.md memory system, Dream offline consolidation, Side Query parallel calls, speculative execution, etc. The constraints/validation/correction part includes fail-closed defaults, tool whitelists, shell semantic parsing, an LLM-based permissions classifier, a hook system, circuit breakers, message withholding, model downgrades, and death-spiral protection.
You need both: with only Environment you have an out-of-control genius; with only constraints you have a safely useless system. The rest of the article follows these two threads.
II. Overall architecture: a 1,700-line state machine propped up by while(true)
The surface and reality of the ReAct loop
Almost every Agent framework centers on a ReAct loop: call the model → parse output → execute tools → feed results back into messages → call the model again. In Python pseudocode, this fits in under 10 lines. Claude Code is no exception—its main loop is in query.ts—but that loop is 1,700 lines long.
Where’s the difference? The simple version is a while loop; Claude Code’s version is a state machine with 7 named continue branches. The simple version just throws errors; Claude Code does silent upgrades, multi-round continuation, and message withholding. The simple version executes tools sequentially; Claude Code does streaming parallelism with concurrency safety flags. The simple version has no context management; Claude Code has a five-layer compression pipeline. The simple version has no safety checks; Claude Code has five layers of permission checks plus circuit breakers. The simple version ignores caching; cache economics permeate Claude Code’s entire system.
Every continue branch in this 1,700-line loop carries a semantic transition.reason label: output-limit upgrade, reactive compression retry, context-collapse cleanup, stop-hook blocking, token-budget extension… every “why run another round” has a name, tracing, and dedicated recovery logic. State is destructured at the top of the loop and replaced wholesale at each continue branch—each iteration feels like a new turn. This isn’t a casual while(true); it’s a loop with full state-machine semantics.
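The pattern is easy to sketch. Here is a minimal, hypothetical version of such a loop—all names (LoopState, Transition, runTurn) are illustrative, not Claude Code’s actual identifiers—showing the two key traits described above: every extra round carries a named reason, and state is replaced wholesale on each iteration:

```typescript
// Hypothetical sketch of a ReAct loop with named state transitions.
type TransitionReason =
  | "output-limit-upgrade"
  | "reactive-compression-retry"
  | "context-collapse-cleanup"
  | "done";

interface LoopState {
  messages: string[];
  maxOutputTokens: number;
  turn: number;
}

interface Transition {
  reason: TransitionReason;
  next: LoopState; // the whole state is replaced, never mutated in place
}

// One turn: decide what happens next and return a fresh state.
function runTurn(state: LoopState): Transition {
  if (state.maxOutputTokens < 64_000 && state.turn === 0) {
    // e.g. output hit the limit → silently retry with a bigger budget
    return {
      reason: "output-limit-upgrade",
      next: { ...state, maxOutputTokens: 64_000, turn: state.turn + 1 },
    };
  }
  return { reason: "done", next: { ...state, turn: state.turn + 1 } };
}

function mainLoop(initial: LoopState): TransitionReason[] {
  const trace: TransitionReason[] = [];
  let state = initial;
  while (true) {
    const t = runTurn(state);
    trace.push(t.reason); // every "why run another round" has a name
    state = t.next;       // wholesale replacement: each iteration is a new turn
    if (t.reason === "done") return trace;
  }
}
```

Because every branch records its reason, the trace doubles as telemetry: you can count how often each recovery path fires in production.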
Writing a CLI with React
Claude Code makes a surprising tech choice: the entire terminal UI is rendered with React (via the Ink framework). The source contains 552 .tsx files, and the entry file is cli.tsx—note the suffix.
At first glance this seems counterintuitive, but it makes sense once you consider Claude Code’s UI needs: streaming output, parallel tool execution states, permission confirmation dialogs, file diff previews—these are highly dynamic UI scenarios. In a traditional CLI, even a basic spinner is annoying to implement; Claude Code has to display several kinds of real-time-updating information at once. Only a declarative framework can handle this complexity cleanly.
React has a non-trivial startup cost, but Claude Code uses a clever cold-start optimization: in the cli.tsx entry, the --version flag follows a zero-import path, printing a compile-time inlined version string and exiting, without loading a single React module. Other commands go through separate import() paths, and only when you finally enter the main loop does it load the full React app.
Seven basic tools form a complete capability set
Claude Code has 40 built-in tools, but just 7 core ones can cover almost all tasks:
| Tool | Function |
|---|---|
| Read / Write / Edit | File operations |
| Glob / Grep | File discovery and content search |
| Bash | Shell command execution |
| Agent | Sub-agent orchestration |
Anthropic has a core viewpoint: the file system is the interaction bus for agents. All information is persisted, iterated, and version-controlled through files. Claude Code’s implementation perfectly proves this: memory uses Markdown files, configuration uses CLAUDE.md, tool results are stored on disk, and speculative execution uses an overlay file system.
III. Environment: enabling the agent to actually do things
Environment is the first major component of Harness, responsible for giving the agent sufficient perception, action, and learning capabilities. The engineering effort in Claude Code at this layer far exceeded expectations—prompt cache economics, five-layer context compression, parallel LLM calls, memory systems and offline consolidation; each of these is worth a deep dive on its own.
Prompt cache is not an optimization; it is an architectural constraint
If I could learn only one principle from the Claude Code source, I would choose this one: prompt cache hit rate is an architectural constraint you must consider on day one, not a performance problem to optimize after launch.
Cache boundaries are baked into the physical structure of the system prompt
There is an explicit SYSTEM_PROMPT_DYNAMIC_BOUNDARY marker in the system prompt that physically splits the prompt into two segments: the content before the marker can be cached across users; the content after it contains session-specific information. The comment comes with a capital-letter WARNING: do not remove or reorder this marker, otherwise the caching logic will break.
This means the organization of the system prompt is determined first by cache boundaries, and only second by semantic logic. Most people are used to structuring prompts by semantics, but in production systems, structuring them by cache boundaries may be more important.
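The mechanics are simple to illustrate. In this sketch (the marker name mirrors the one quoted above; the helper functions are assumptions for the example), the prompt is assembled as stable sections, then the boundary, then session sections, and the cacheable prefix is whatever precedes the marker:

```typescript
// Illustrative: ordering a system prompt by cache boundary, not semantics.
const SYSTEM_PROMPT_DYNAMIC_BOUNDARY = "<<<DYNAMIC_BOUNDARY>>>";

function buildSystemPrompt(
  stableSections: string[],   // identical across users → cacheable
  sessionSections: string[],  // session-specific → never cached
): string {
  return [
    ...stableSections,
    SYSTEM_PROMPT_DYNAMIC_BOUNDARY,
    ...sessionSections,
  ].join("\n\n");
}

// The cacheable prefix is everything before the marker.
function cacheablePrefix(prompt: string): string {
  return prompt.split(SYSTEM_PROMPT_DYNAMIC_BOUNDARY)[0];
}
```

Two users with different sessions then produce byte-identical prefixes, which is exactly what a prefix-based prompt cache needs.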
Forked agents sharing the cache
Claude Code often needs to spawn parallel agents outside the main loop—to compress context, extract memories, run speculative execution, and generate summaries. Each time it forks, the system passes in a CacheSafeParams bundle, which includes the system prompt, user context, tool list, message prefix, and thinking configuration. These must be byte-for-byte identical to the parent agent’s request in order to hit the same prompt cache entry.
The code comments are very clear: the API cache key is determined by system prompt, tools, model, messages prefix, and thinking config together. Even maxOutputTokens in the thinking config matters—on older models it changes budget_tokens, which causes cache misses.
At the boundary of every turn, the system saves the current CacheSafeParams into a global slot for all post-turn forks to reuse. There are 9 different fork callers across the codebase, and every one of them goes through this cache-sharing mechanism.
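A minimal sketch of this mechanism, assuming the field names from the description above (the structure itself is an illustration, not the real type):

```typescript
// Forked agents must reuse byte-identical request params to share the
// parent's prompt-cache entry.
interface CacheSafeParams {
  systemPrompt: string;
  tools: string[];
  model: string;
  messagePrefix: string[];
  thinkingConfig: { budgetTokens: number };
}

// Saved at each turn boundary for post-turn forks to reuse.
let lastTurnParams: CacheSafeParams | null = null;

function endTurn(params: CacheSafeParams): void {
  lastTurnParams = params;
}

// Deterministic key: an explicit tuple avoids accidental key-order drift.
function cacheKey(p: CacheSafeParams): string {
  return JSON.stringify([
    p.systemPrompt, p.tools, p.model, p.messagePrefix, p.thinkingConfig.budgetTokens,
  ]);
}

function forkAgent(): CacheSafeParams {
  if (!lastTurnParams) throw new Error("no turn completed yet");
  return lastTurnParams; // reuse the saved bundle, never rebuild it
}
```

Note how even budgetTokens participates in the key, matching the observation that thinking config changes cause cache misses.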
Global impact: every architectural decision makes concessions to “don’t break the cache”
Tool results also give way to caching. Tool outputs above a certain size are stored on disk, and replacement decisions are frozen—the same message must produce exactly the same string at different times, or the cache becomes useless. Message serialization requires deterministic JSON key ordering. Tool definitions are put into the system prompt to keep the cache stable.
The difference between cache hits and misses is not tens of percent; it’s an order of magnitude in cost and latency. This is not micro-optimization—it dictates how messages are serialized, how sub-agents are forked, and how tool results are stored.
Five-layer context compression pipeline
A common misconception is that “context compression means calling an LLM once to summarize the history.” Claude Code’s approach shows that this is far from enough—different types of information have completely different “shelf lives” and require completely different handling.
In reality, it has five compression mechanisms at different granularities, executed in order:
Layer 1: Tool Result Budget — the very first layer. Massive tool outputs are stored on disk, and the model only sees a preview plus file paths. Replacement decisions are frozen to protect caching.
Layer 2: HISTORY_SNIP — the finest-grained pruning. Certain messages are deleted outright, with no summarization at all. For example, a search might return 500 lines but the model only used 3 of them; the remaining 497 lines are pure noise. Summarizing them is a waste of tokens; deleting them directly is the most cost-effective option.
Layer 3: Microcompact — editing at the API cache layer via the cache_edits API. The local message history is left completely untouched; instead, cache-edit instructions attached to the API request perform the compression on the API side.
Layer 4: CONTEXT_COLLAPSE — archiving older conversation turns into summaries, maintaining a structure similar to a git log. Unlike full compression, it preserves structure—what was done in each turn, and what the conclusions were.
Layer 5: Autocompact — final fallback. It first attempts lightweight session memory compression, and only if that is insufficient does it perform full compression.
The core principle: if an earlier layer can handle it, later layers are not triggered. Most of the time, the heaviest Autocompact doesn’t need to run at all.
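The ordering logic can be sketched in a few lines. In this illustration the layer names follow the article, but the token accounting (each layer’s reduction ratio) is invented purely for demonstration:

```typescript
// Layered compaction: cheap layers run first; later layers are skipped
// once the context fits the budget.
interface CompactionLayer {
  name: string;
  apply: (tokens: number) => number; // returns the new token count
}

const layers: CompactionLayer[] = [
  { name: "tool-result-budget", apply: (t) => Math.round(t * 0.7) },
  { name: "history-snip",       apply: (t) => Math.round(t * 0.8) },
  { name: "microcompact",       apply: (t) => Math.round(t * 0.9) },
  { name: "context-collapse",   apply: (t) => Math.round(t * 0.6) },
  { name: "autocompact",        apply: (t) => Math.round(t * 0.3) }, // heaviest, last resort
];

function compact(tokens: number, budget: number): { tokens: number; ran: string[] } {
  const ran: string[] = [];
  for (const layer of layers) {
    if (tokens <= budget) break; // an earlier layer was enough → stop
    tokens = layer.apply(tokens);
    ran.push(layer.name);
  }
  return { tokens, ran };
}
```

With realistic inputs, the expensive tail of the pipeline almost never runs—which is the whole point of the ordering.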
Circuit breaker: preventing infinite spending on compression
Autocompact has a circuit breaker—after 3 consecutive compression failures, it stops retrying. The code comments directly cite internal data: there were once 1,279 sessions with more than 50 consecutive failures, and the most extreme session failed 3,272 times, wasting about 250,000 API calls per day globally. This real-world data led to the constant MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES = 3.
The design takeaway is clear: information with different shelf lives needs different strategies. Intermediate tool outputs may be useless after a few turns (SNIP deletes them directly), conversational structure needs compression but the skeleton must be preserved (COLLAPSE archives it), and global background information needs to be retained long-term. A single compression method cannot handle all scenarios; you need a pipeline.
Side Query: turning “calling the LLM” into a lightweight operation scattered everywhere
Claude Code has an abstraction called sideQuery—a lightweight, non-streaming API wrapper dedicated to auxiliary LLM calls outside the main loop. It is used in at least five scenarios:
| Use case | Model | Description |
|---|---|---|
| Permission classification | Small model | Judge whether tool calls are safe |
| Memory retrieval | Small model | Which CLAUDE.md files are relevant to the current task |
| Tool Use Summary | Haiku | Asynchronously summarize tool operations |
| Agent summary | Small model | Progress reports from sub-agents |
| Prompt suggestions | Small model | Predict the user’s next step |
The design of Tool Use Summary is particularly clever: while the main model is reasoning (5–30 seconds), Haiku simultaneously summarizes the tool operations from the previous turn. This summary completes within 1 second, and the latency is completely hidden inside the main model’s reasoning time. Later, when context compression is needed, this pre-generated summary replaces the original tool outputs, greatly saving tokens.
This reflects an important architectural idea: the main agent loop should not be the only place where LLM calls happen. Permission checks, summary generation, and memory retrieval can be handled in parallel by smaller, faster models. Treat “calling the LLM” as a lightweight operation that can be scattered everywhere, not a heavyweight event that only happens in the main loop.
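The latency-hiding trick is just concurrent dispatch. This sketch uses stand-in functions and millisecond-scale sleeps in place of real model calls (5–30 s of reasoning vs. ~1 s of summarization):

```typescript
// Hiding a side query's latency inside the main model's reasoning time:
// both start together, so total latency is max(main, side), not the sum.
const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

async function mainModelTurn(): Promise<string> {
  await sleep(300); // stands in for the main model's long reasoning
  return "main answer";
}

async function sideQuerySummarize(toolOutput: string): Promise<string> {
  await sleep(150); // a small fast model finishes well within the main call
  return `summary of ${toolOutput.length} chars`;
}

async function turnWithHiddenSummary(toolOutput: string) {
  const start = Date.now();
  const [answer, summary] = await Promise.all([
    mainModelTurn(),
    sideQuerySummarize(toolOutput),
  ]);
  return { answer, summary, elapsed: Date.now() - start };
}
```

Sequential execution would take the sum of the two latencies; concurrent execution takes only the longer one, so the summary is effectively free.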
Memory architecture: why Markdown is better than a vector database
Claude Code’s memory system does not use a vector database; it uses plain Markdown files:
| File | Purpose |
|---|---|
| CLAUDE.md | Project-level memories and conventions |
| AGENTS.md | Agent capabilities and role descriptions |
| memory/YYYY-MM-DD.md | Interaction logs archived by date |
| MEMORY.md | Core facts and user preferences |
This choice seems counterintuitive but is in fact highly pragmatic. Markdown files are transparently editable—you can open them directly to see what the AI has remembered; if it’s wrong you can fix it, which vector databases can’t offer. They naturally support Git version control, so every memory modification is traceable and reversible. And this aligns perfectly with the file-system-as-bus philosophy, integrating naturally with other agent components.
Dream: the agent’s “sleep learning”
In the source code, there is a background memory consolidation system called Dream. It is triggered when: more than 24 hours have passed since the last consolidation, there have been at least 5 sessions in that period, and a file lock can be acquired. The gating logic is ordered by cost from low to high: first check the time, then scan session counts, and only then attempt to acquire the file lock.
Once conditions are met, the system spawns a background forked agent that organizes recent session experiences into long-term memory files. The consolidation process has four stages:
- Orient — scan the memory directory, read the index, and understand the current knowledge state
- Gather — collect new signals, grep recent sessions for information worth remembering
- Consolidate — merge new signals into existing topic files, avoid near-duplicates, convert relative dates to absolute dates, and delete obsolete facts that have been overturned
- Prune — trim the index to keep it concise (index <25KB, each entry <150 characters)
The core instruction of the consolidation prompt is written like this: “You are performing a dream — a reflective pass over your memory files. Synthesize what you’ve learned recently into durable, well-organized memories so that future sessions can orient quickly.”
Human brains consolidate memories during sleep; agents consolidate session memories during idle time. Calling it Dream is indeed fitting. The thresholds are remotely configurable via GrowthBook, allowing consolidation frequency to be tuned for different user cohorts.
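The cost-ordered gating is worth sketching, because it generalizes to any background job. The thresholds below come from the description above; the gate functions are stand-ins:

```typescript
// Cost-ordered gating: check the cheapest condition first and short-circuit,
// so the expensive file-lock attempt only happens when cheap checks passed.
interface DreamGates {
  hoursSinceLast: () => number;  // cheap: read a timestamp
  sessionCount: () => number;    // medium: scan session metadata
  tryAcquireLock: () => boolean; // expensive: filesystem lock
}

// `probes` records which gates were actually evaluated.
function shouldDream(g: DreamGates, probes: string[]): boolean {
  probes.push("time");
  if (g.hoursSinceLast() < 24) return false;
  probes.push("sessions");
  if (g.sessionCount() < 5) return false;
  probes.push("lock");
  return g.tryAcquireLock();
}
```

Most invocations bail out at the first check, so the common case costs one timestamp read.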
IV. Constraints and validation: keeping the Agent from making mistakes
The Environment makes the Agent powerful, but only constraints and validation make it reliable. In Claude Code, this layer actually has more code than the Environment — fail-closed defaults, a multi-layer permission system, tool orchestration, safety boundaries for speculative execution, capability partitioning across multiple Agents, each of which is a hard constraint.
Fail-closed tool defaults
Claude Code’s buildTool() factory function provides default values that repay a close look:

```typescript
// Illustrative shape: the flags shown are the ones the surrounding
// text describes, not a verbatim copy of the leaked constant.
const TOOL_DEFAULTS = {
  isReadOnly: false,        // undeclared → treated as writing → stricter permissions
  isConcurrencySafe: false, // undeclared → runs exclusively
};
```

Every default is the conservative option. A tool that forgets to declare itself read-only is treated as one that writes, so stricter permissions apply; a tool that forgets to declare concurrency safety runs exclusively. Forgetting to declare something is much safer than declaring it incorrectly. This is the fail-closed principle applied at the tool-system level.
The LLM as a permission classifier
Claude Code’s Auto Mode implements a forward-looking pattern: using an LLM to judge whether another LLM’s tool call is safe.
The classifier’s input is carefully constructed: it only extracts tool_use blocks from the conversation history — without the assistant’s free-form text. The reason is straightforward: if the classifier can see the assistant’s natural language output, an attacker could mislead the classifier by getting the main model to output specific text. By only looking at structured tool calls, the attack surface is much smaller.
Each tool also has a toAutoClassifierInput() method that controls what inputs the classifier can see. Read-only tools (Read, Grep, Glob) are whitelisted and skipped directly.
The full chain of five-layer permission checks
From tool call to final decision, there are five layers of checks, each with explicit veto power:
- Static Settings rules — alwaysDeny / alwaysAllow / alwaysAsk for fast pruning
- PreToolUse Hook — user-defined script, exit code 2 means “block”
- Tool attributes themselves — tools with isReadOnly are whitelisted and allowed directly
- LLM Auto-Classifier — sideQuery call that only sees the tool_use block
- Rejection circuit breaker — after 3 consecutive or 20 cumulative rejections, fall back to interactive prompts
The design philosophy is that each layer only handles one concern. Static rules prune fast, hooks support enterprise customization, the LLM handles fuzzy cases that can’t be predefined, and the circuit breaker is a final safeguard against deadlock.
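The chain is a classic short-circuiting pipeline. In this sketch the layer names follow the list above, but the interfaces and the sample rules are illustrative:

```typescript
// Layered permission chain: each layer handles one concern, and any layer
// can short-circuit with a verdict; "pass" defers to the next layer.
type Verdict = "allow" | "deny" | "ask" | "pass";

interface PermissionLayer {
  name: string;
  check: (tool: string) => Verdict;
}

function decide(layers: PermissionLayer[], tool: string): { verdict: Verdict; decidedBy: string } {
  for (const layer of layers) {
    const v = layer.check(tool);
    if (v !== "pass") return { verdict: v, decidedBy: layer.name };
  }
  return { verdict: "ask", decidedBy: "fallback" }; // undecided → ask the user
}

const chain: PermissionLayer[] = [
  { name: "static-rules", check: (t) => (t === "rm-rf" ? "deny" : "pass") },
  { name: "pre-tool-hook", check: () => "pass" }, // user script; exit code 2 → "deny"
  { name: "read-only-whitelist", check: (t) => (["Read", "Grep", "Glob"].includes(t) ? "allow" : "pass") },
  { name: "llm-classifier", check: () => "pass" }, // would be a sideQuery in reality
];
```

Returning which layer decided makes every verdict auditable, mirroring the "explicit veto power" described above.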
Shell safety: semantic parsing beats keyword blacklists
The Bash tool’s safety mechanism is far more than “ban rm -rf /”. In the source code there is a full command semantic parser that type-annotates every flag of every git subcommand, and handles boundary cases such as the -- terminator, UNC paths, and decomposition of compound commands.
A real security incident is documented in the comments: the git diff -S flag was previously marked as none (no argument), but git’s actual behavior is that -S must consume the next argv. An attacker could construct git diff -S -- --output=/tmp/pwned, causing the validator to think -S takes no argument → advance 1 token → see -- and stop checking → --output goes unchecked → arbitrary file write. The fix was to change -S to type string.
This example illustrates a general rule: the granularity of your security mechanisms should match the granularity of your attack surface.
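The fix is easy to demonstrate with a toy validator. This sketch is a simplified stand-in for the real parser, modeled on the `git diff -S` incident: flags carry an arity, and a flag typed `string` consumes the next argv token:

```typescript
// Flag-arity-aware argv validation (simplified stand-in).
type FlagType = "none" | "string";

// After the fix: -S consumes the next argv token.
const DIFF_FLAGS: Record<string, FlagType> = { "-S": "string", "--stat": "none" };

// Returns the tokens that still need checking (e.g. rejecting --output=...).
function unparsedArgs(argv: string[]): string[] {
  const out: string[] = [];
  let afterDoubleDash = false;
  for (let i = 0; i < argv.length; i++) {
    const tok = argv[i];
    if (afterDoubleDash) { out.push(tok); continue; }
    if (tok === "--") { afterDoubleDash = true; continue; }
    const arity = DIFF_FLAGS[tok];
    if (arity === "string") { i++; continue; } // flag consumes the next token
    if (arity === "none") continue;
    out.push(tok); // unknown token → must be checked
  }
  return out;
}
```

With `-S` typed as `string`, the `--` in `git diff -S -- --output=/tmp/pwned` is consumed as `-S`’s argument, so `--output=…` surfaces as an unparsed token and gets checked, instead of hiding behind a phantom separator.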
Streaming execution + concurrency-safety flags
General Agent implementations usually wait for the model to finish speaking before executing tools. Claude Code doesn’t wait — while the model is streaming later content, the earlier tools are already running.
Each tool has an isConcurrencySafe flag. Read-file, grep and other read-only operations can run in parallel, while write-file, bash, and the like require exclusivity. The system dynamically batches: concurrency-safe tools in the same batch run simultaneously; when a non-concurrency-safe tool is encountered, a new serial batch is created. Results are buffered and output in the order received, so there’s no reordering.
There’s also a fine-grained design for Bash error cascading. When a Bash command fails, siblingAbortController immediately terminates all sibling tool processes — but only Bash errors trigger this, because Bash commands often have implicit dependency chains (if mkdir fails, a subsequent cd is pointless). Read/Grep and other independent queries do not cascade on failure. Siblings are aborted, but the parent loop is not — the main loop continues because the model needs to see the error and decide what to do next.
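The batching rule above can be sketched directly. The flag name matches the article; the batching function itself is illustrative:

```typescript
// Dynamic batching by concurrency-safety: consecutive safe tools share a
// parallel batch; an unsafe tool gets a serial batch of its own.
interface ToolCall {
  name: string;
  isConcurrencySafe: boolean;
}

function batchToolCalls(calls: ToolCall[]): ToolCall[][] {
  const batches: ToolCall[][] = [];
  let current: ToolCall[] = [];
  for (const call of calls) {
    if (call.isConcurrencySafe) {
      current.push(call); // joins the running parallel batch
    } else {
      if (current.length) batches.push(current);
      batches.push([call]); // exclusive tool: a batch of one
      current = [];
    }
  }
  if (current.length) batches.push(current);
  return batches;
}
```

Batches run in sequence; calls within a batch run in parallel. This preserves ordering guarantees for writes while still overlapping the cheap reads.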
Speculative execution: start working while the user is still typing
Claude Code has a complete speculative execution system. It doesn’t just “predict the next step” — it actually executes on an overlay filesystem. The flow is:
- While the user is still typing, the system uses Prompt Suggestion to predict the next step.
- It starts a restricted forked Agent.
- The Agent runs on an overlay filesystem: reads check the overlay first, then fall back to the real disk; writes only go to the overlay directory.
- If the user accepts, the overlay is “promoted” to the real disk; if the user rejects, the overlay is simply deleted.
A strict tool whitelist controls what speculation can do: write operations are only allowed for Edit/Write/NotebookEdit and only when permissions allow auto-accept; Bash is only allowed when it passes read-only verification. Speculation is capped at 20 turns or 100 messages, and its transcript is not written into the main session.
The essence of speculative execution is not “how accurate the prediction is,” but “zero cost when the prediction is wrong.” Overlay filesystem + tool whitelist + main-session isolation together ensure that correct predictions save time and incorrect ones cost nothing.
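The overlay mechanism reduces to three operations: shadowed reads, redirected writes, and an atomic promote/discard. In this sketch, in-memory Maps stand in for real directories:

```typescript
// Overlay filesystem for speculative execution: reads prefer the overlay,
// writes only touch the overlay, and the overlay is promoted or discarded.
class OverlayFs {
  private overlay = new Map<string, string>();

  constructor(private real: Map<string, string>) {}

  read(path: string): string | undefined {
    // Check the overlay first, then fall back to the real disk.
    return this.overlay.get(path) ?? this.real.get(path);
  }

  write(path: string, content: string): void {
    this.overlay.set(path, content); // never touches the real disk
  }

  promote(): void {
    // User accepted the speculation → apply overlay to the real disk.
    for (const [p, c] of this.overlay) this.real.set(p, c);
    this.overlay.clear();
  }

  discard(): void {
    this.overlay.clear(); // a wrong prediction costs nothing
  }
}
```

The asymmetry is the point: promote is the only path by which speculation can affect real state, and discard is always safe.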
The essence of multi-Agent: capability partitioning, not role play
Claude Code’s multi-Agent design is not “give each Agent a different role prompt.” The core is explicit partitioning of the tool surface:
| Agent type | Tool surface |
|---|---|
| Main Agent | Full toolset |
| Sub-Agent | Forbidden to recursively create Agents, exit planning mode, etc. |
| Coordinator | Only 3 tools: create / send message to / stop worker |
| Async Agent | Only file read/write, search, shell |
| In-process Teammate | Async + task management + cron |
What an Agent “is” is not determined by its system prompt, but by what it can do. Capability-based identity is harder than role prompts: more rigid, more auditable, and harder to circumvent. Role prompts can be “creatively interpreted” by the model, but if a tool is disabled, it simply cannot be called.
When a Sub-Agent loads MCP tools, it also checks trust boundaries: only Agents with isSourceAdminTrusted can use MCP tools. This encodes the trust boundary explicitly in the code.
V. Correction: what to do when things go wrong
What Agents fear most is not that an operation fails, but that they endlessly retry on a failure and burn through the entire token budget. Claude Code’s error recovery strategy centers on one core principle: do not expose intermediate state until you’re sure recovery is impossible.
Two-step recovery when output hits the limit
When the model output hits max_output_tokens, the system doesn’t immediately raise an error; it performs a two-step recovery:
Step 1: Silently raise the limit. If the limit was the default 8K when it was hit, the system retries with 64K — without adding a meta message, completely transparent to the user, and only once.
Step 2: Multi-turn continuation. If 64K is still not enough, it then injects a meta message asking the model to continue. The wording is very deliberate: “Output token limit hit. Resume directly — no apology, no recap of what you were doing. Pick up mid-thought if that is where the cut happened.” It explicitly tells the model not to apologize, not to recap, just to continue. Up to 3 times.
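The two steps compose into a small control flow. This sketch assumes a generate() callback and result shape invented for the example; the budgets and the three-continuation cap follow the description above:

```typescript
// Two-step output recovery: one silent budget upgrade, then up to three
// continuation turns.
interface GenResult { truncated: boolean; text: string }

function recoverOutput(
  generate: (maxTokens: number, continuation: boolean) => GenResult,
): { text: string; silentUpgrades: number; continuations: number } {
  let res = generate(8_000, false);
  let silentUpgrades = 0;
  if (res.truncated) {
    res = generate(64_000, false); // step 1: invisible to the user, only once
    silentUpgrades = 1;
  }
  let text = res.text;
  let continuations = 0;
  while (res.truncated && continuations < 3) {
    res = generate(64_000, true); // step 2: "resume directly, no recap"
    text += res.text;
    continuations++;
  }
  return { text, silentUpgrades, continuations };
}
```

The ordering matters: the cheap invisible fix is tried before the visible one, which is the same "don't expose intermediate state" principle applied to output limits.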
Message withholding: the key to the recovery loop
During the entire recovery process, error messages are withheld, not sent to external consumers (desktop clients, IDE plugins, remote sessions). Those clients will terminate the session when they see an error field — the recovery loop would still be running, but no one would be listening.
If recovery succeeds, consumers never know anything went wrong. Only when all recovery options are exhausted are the withheld errors released.
The philosophy here is: the boundary of error handling is not a single API request, but the entire recovery loop. Before the loop ends, consumers should not see any intermediate state.
Model fallback
When the primary model is unavailable (overload, service outage), the system automatically switches to a backup model. The tricky part is that different models may have different tool_use formats and signing mechanisms, so on fallback the system must sanitize model-specific content in the message history — discard intermediate results from the old StreamingToolExecutor, strip out old-model-specific signature blocks, and then re-issue the request using the backup model.
Circuit breakers everywhere
| Scenario | Threshold | Notes |
|---|---|---|
| Context compression | 3 consecutive failures | Previously wasted ~250K API calls/day |
| Permission classification | 3 consecutive or 20 cumulative rejections | Fall back to interactive prompts |
| Output hitting limit | Up to 3 continuations | Raise limit first, then continue |
These thresholds were not chosen arbitrarily. The 3-failure threshold for the compression circuit breaker comes from real production data: 1,279 sessions had more than 50 consecutive failures. Ablation-testing infrastructure lets Anthropic quantify the value of each circuit breaker.
Death-spiral protection
If an API call fails → triggers stop hooks → stop hooks also call the API → they fail too → which triggers stop hooks again… that’s a death spiral. Claude Code’s rule is simple: when the API fails, all stop hooks are skipped. It’s better to miss one round of memory extraction and prompt suggestion than to let the system fall into an infinite loop.
VI. Engineering practice: building products with scientific methods
Ablation experiments and A/B testing
In the source code there is an ABLATION_BASELINE flag, with a comment labeled “Harness-science L0 ablation baseline.” When enabled, it turns off 7 features at once: thinking mode, context compression, automatic memory, background tasks, simple mode, interleaved thinking, and a second background task flag. This lets Anthropic run controlled experiments: after disabling a feature, do task completion rate, token consumption, or session length change significantly?
People who’ve done ML research are very familiar with ablation studies. But porting this methodology into product engineering, inside production code, is rare. The closest analogy is ByteDance—ByteDance is famous for being “data-driven,” where almost every product decision goes through A/B testing. Claude Code’s approach is highly similar in spirit: not “ship it because it feels useful,” but “ship it only when the data proves it’s useful.”
The experimentation platform behind this is GrowthBook, an open‑source A/B testing framework. The growthbook.ts file in the source reveals the full experimentation stack: server‑side bucketing (remoteEval: true, the bucketing logic runs on the server, and the client only gets the evaluated result); user targeting (bucketing based on properties like deviceId, organizationUUID, subscriptionType, rateLimitTier, platform, email, etc.); and exposure tracking (each feature’s experimentId + variationId is logged into a 1P event system, supporting attribution analysis for experiment effects). Internal users refresh config every 20 minutes, external users every 6 hours. The eval harness can override any flag via the CLAUDE_INTERNAL_FC_OVERRIDES environment variable, ensuring determinism for automated evaluations.
One notable engineering detail: the ablation experiment code lives in cli.tsx (the entry file) instead of in an initialization function. The reason is that modules like BashTool and AgentTool capture environment variables into module‑level constants at load time, so by the time init() runs it’s already too late. That means the environment variables must be set before any imports happen. And the feature() gate at compile time ensures that this entire block is removed by DCE in external builds—ablation experiments exist only for internal users.
Two‑Layer Feature Flags
Compile‑time flags are implemented using Bun’s feature() macro. At build time they’re replaced with true/false, and the code in the false branch is physically removed—it’s not just skipped at runtime, it disappears from the binary. Security researchers won’t find it even via reverse‑engineering. There are more than 20 compile‑time flags in the source, each corresponding to an unreleased feature.
Runtime flags are implemented with GrowthBook, serving three roles at once: gradual rollout, A/B testing, and an emergency kill switch. All gate names start with tengu_ (the internal codename for the Claude Code project). Reads use the CACHED_MAY_BE_STALE mode—values are read from a disk cache, allowing stale reads and avoiding blocking startup. Values are persisted in ~/.claude.json and shared across processes.
The two layers work together: compile time decides whether a feature exists at all, and runtime decides whether the feature is activated. This system lets Anthropic do gradual rollouts, A/B tests, and emergency rollbacks for any feature without shipping a new version—very closely aligned in capability to ByteDance’s internal experimentation platform.
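The division of labor between the two layers can be sketched in a few lines. All names here are illustrative, not the real API; the compile-time constant stands in for Bun's `feature()` macro:

```typescript
// Sketch of the two-layer gate (names illustrative, not the real API).
// Layer 1: a compile-time constant. A bundler substitutes a literal and
// dead-code-eliminates the false branch, so the feature's code is physically
// absent from external builds.
const FEATURE_EXISTS: boolean = true; // replaced at build time in the real system

// Layer 2: a runtime flag read from a local cache, allowed to be stale so
// startup never blocks on a network fetch.
interface FlagCache { values: Record<string, boolean>; fetchedAt: number }

function isOn(cache: FlagCache, gate: string): boolean {
  // CACHED_MAY_BE_STALE semantics: return whatever is on disk, even if old.
  return cache.values[gate] ?? false;
}

function featureActive(cache: FlagCache, gate: string): boolean {
  if (!FEATURE_EXISTS) return false; // this branch is DCE'd in external builds
  return isOn(cache, gate);          // gradual rollout / kill switch
}

const cache: FlagCache = { values: { tengu_example_gate: true }, fetchedAt: 0 };
console.log(featureActive(cache, "tengu_example_gate")); // true
```

The point of the split: deleting code at compile time is irreversible per build, while the runtime gate can be flipped remotely at any time without shipping a new version.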
Anti‑Distillation: Two Layers of Technical Defense
There is an independent subsystem in the codebase: anti‑distillation, designed to prevent competitors from using Claude’s API outputs to train their own models. The source comments explicitly label two places with “anti-distillation,” corresponding to two different defenses.
The first layer is fake tools injection. In claude.ts, inside getExtraBodyParams, there’s a comment: // Anti-distillation: send fake_tools opt-in for 1P CLI only. The client sends an anti_distillation: ['fake_tools'] directive to the API backend, and the backend uses this to inject fake tool calls into the response. These fake tool calls are mixed into the real output. Systems that try to bulk‑extract training data from API outputs will ingest them as well, effectively poisoning the training set. Attackers must either pay the cost to filter them out (but it’s hard to distinguish real from fake tool calls) or accept that their training data is contaminated. This feature is controlled by both a compile‑time feature flag (ANTI_DISTILLATION_CC) and a runtime GrowthBook remote config flag (tengu_anti_distill_fake_tool_injection), and applies only to 1P CLI users.
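Why the poisoning is hard to filter can be shown with a toy sketch. The injection actually happens server-side; this hypothetical client-visible view only illustrates that fake calls share the exact shape of real ones:

```typescript
// Hypothetical sketch of training-set poisoning via fake tool calls. The
// `real` marker exists only server-side and is stripped before the response
// leaves, so a scraper has no field to filter on.
interface ToolCall { name: string; input: Record<string, unknown>; real?: boolean }

function injectFakeTools(calls: ToolCall[], fakes: ToolCall[]): ToolCall[] {
  const out = [...calls];
  for (const fake of fakes) {
    const pos = Math.floor(Math.random() * (out.length + 1));
    out.splice(pos, 0, fake); // interleave at a random position
  }
  // Strip the marker: what ships is indistinguishable from a real call.
  return out.map(({ name, input }) => ({ name, input }));
}

const poisoned = injectFakeTools(
  [{ name: "Bash", input: { command: "ls" }, real: true }],
  [{ name: "Bash", input: { command: "cat /etc/hosts" } }],
);
console.log(poisoned.length); // 2 calls, identical in shape
```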
The second layer is connector text summarization. In betas.ts, a comment explicitly says // POC: server-side connector-text summarization (anti-distillation). The mechanism is more fine‑grained: the API server caches the free‑form text the model generates between tool calls (i.e., the “reasoning process”), and replaces it with a summary plus a cryptographic signature. The client sends this signed summary back in subsequent requests, and the server uses the signature to restore the original text so the conversation can continue normally. But external observers—whether scraping via the API or via a man‑in‑the‑middle—see only the summary and lose the model’s original chain of thought. This is the same principle as the redaction mechanism for the thinking block: the signature ensures only Anthropic’s server can reconstruct the content; third parties cannot decrypt it. It’s still at the POC stage, with the GrowthBook flag tengu_slate_prism.
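The redact-and-restore round trip can be sketched with an HMAC standing in for the signature scheme. This is an assumption-laden toy, not the actual protocol; the real system's key management, cache, and signature format are unknown:

```typescript
import { createHmac } from "node:crypto";

// Illustrative sketch (not the actual protocol): the server swaps free-form
// reasoning text for a summary plus an HMAC tag, keeps the original in a
// server-side cache keyed by that tag, and restores it when the client echoes
// the tag back. The client never holds the original text.
const SERVER_KEY = "server-only-secret";          // never leaves the server
const serverCache = new Map<string, string>();    // tag -> original text

function redact(original: string, summary: string): { summary: string; tag: string } {
  const tag = createHmac("sha256", SERVER_KEY).update(original).digest("hex");
  serverCache.set(tag, original);
  return { summary, tag };
}

function restore(echoed: { summary: string; tag: string }): string {
  const original = serverCache.get(echoed.tag);
  if (original === undefined) throw new Error("unknown or forged tag");
  // Verify the tag actually signs the cached text before trusting it.
  const expect = createHmac("sha256", SERVER_KEY).update(original).digest("hex");
  if (expect !== echoed.tag) throw new Error("signature mismatch");
  return original;
}

const redacted = redact("full intermediate reasoning text", "short summary");
console.log(restore(redacted)); // server recovers the original; observers saw only the summary
```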
The design logic of the two layers is complementary: fake tools target large‑scale “dragnet” scraping, offering low cost and broad coverage by making the training data unusable through noise; connector text summarization targets more precise reverse‑engineering—so even if an attacker can filter fake tool calls, the model’s intermediate reasoning process remains hidden behind signatures and cannot be extracted. Each layer has its own independent switch and its own GrowthBook‑controlled rollout path.
In addition, there is a streamlined mode in the SDK output layer. Comments in streamlinedTransform.ts describe it as a “distillation-resistant” output format—it retains only text messages and aggregate counts of tool calls, omitting the thinking content and the detailed tool list.
It’s worth noting that these anti‑distillation mechanisms are in tension with the caching system. The source code includes a sticky‑on latch mechanism to resolve this conflict: once a certain beta header (including anti‑distillation‑related headers) is sent for the first time in a session, that header continues to be sent even if the user later disables the corresponding feature. Removing a header would change the request signature, causing cache invalidation on the server side and wasting cache costs of 50,000 to 70,000 tokens. Feature state and protocol state are deliberately decoupled: headers (protocol layer) stay constant to preserve cacheability, while the actual feature control is adjusted dynamically at the request body layer.
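A minimal sketch of the latch idea, with hypothetical names (the real implementation's structure is not shown in the text):

```typescript
// Minimal sketch (hypothetical names) of the sticky-on latch: once a beta
// header has been sent in a session, it keeps being sent even if the feature
// is later disabled, so the request signature -- and with it the server-side
// prompt cache -- stays stable.
class BetaHeaderLatch {
  private sent = new Set<string>();

  headersFor(enabledBetas: string[]): string[] {
    for (const beta of enabledBetas) this.sent.add(beta); // latch on, never off
    return [...this.sent].sort(); // stable order => stable request signature
  }
}

const latch = new BetaHeaderLatch();
console.log(latch.headersFor(["anti-distillation-beta"])); // header sent
console.log(latch.headersFor([]));                         // still sent: cache stays warm
```

The actual on/off decision moves into the request body, where changing a value does not invalidate the cached prefix.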
Privacy Engineering: The Type System as an Audit Tool
The type for analytics data is called AnalyticsMetadata_I_VERIFIED_THIS_IS_NOT_CODE_OR_FILEPATHS. It resolves to never, so no value can be assigned to it directly. Every use must explicitly cast with as, which turns each call site into a statement to code reviewers: "I have verified that this string does not contain sensitive information."
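The pattern can be reproduced in a few lines (simplified type and function names, not the real ones):

```typescript
// Sketch of the pattern with a shortened name: declaring the analytics type
// as `never` means nothing is assignable to it, so every call site is forced
// into an explicit `as` cast -- a visible, greppable attestation in review.
type VerifiedNotCodeOrFilepaths = never;

function trackEvent(name: string, metadata: VerifiedNotCodeOrFilepaths): string {
  return `${name}:${metadata}`;
}

// trackEvent("session_start", "model=sonnet");               // type error
const ok = trackEvent("session_start", "model=sonnet" as never); // reviewer attests
console.log(ok); // "session_start:model=sonnet"
```

The cast is deliberately noisy: a `grep "as never"` (or a lint rule) surfaces every place analytics strings are asserted safe.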
There is a dedicated anonymization function for MCP tool names: user‑defined MCP server names are all replaced with 'mcp_tool', because server names might contain user‑private information. Built‑in tool names are passed through as‑is.
The SDK client consumes streaming output in NDJSON format. Claude Code installs a guard on process.stdout.write: each line of output is first validated via JSON.parse; valid lines are forwarded to stdout, invalid ones are sent to stderr with a [stdout-guard] tag. The guard exists because a single stray print from a dependency could break the entire streaming pipeline.
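The routing rule itself is simple; here is a sketch of the classification step (the actual guard wraps process.stdout.write, which is omitted here):

```typescript
// Sketch of the stdout guard's routing rule: every line must parse as JSON to
// reach stdout; anything else (a stray console.log from a dependency) is
// diverted to stderr with a tag, keeping the NDJSON stream parseable.
function routeLine(line: string): { to: "stdout" | "stderr"; text: string } {
  try {
    JSON.parse(line);
    return { to: "stdout", text: line };
  } catch {
    return { to: "stderr", text: `[stdout-guard] ${line}` };
  }
}

console.log(routeLine('{"type":"message","text":"hi"}').to); // "stdout"
console.log(routeLine("debug: connected to server").to);     // "stderr"
```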
Privacy protection is not an after‑the‑fact “compliance layer”; it’s encoded into the type system and runtime guards, forcing developers to think “could this string leak user information?” at the time they write code.
Leak Prevention Engineering: Undercover Mode and Canary Detection
Claude Code has an internal mechanism called Undercover Mode. When it detects that the current repo is not in an internal allowlist, it automatically activates and strips model codenames, internal project names, and similar information from commit messages and PR descriptions.
The key design of this switch is: there is no force‑OFF. The comment states: “There is NO force-OFF. This guards against model codename leaks — if we’re not confident we’re in an internal repo, we stay undercover.” It can be force‑ON, but not force‑OFF. This is an asymmetric design for security: the cost of leaking internal information is far higher than the occasional cost of extra protection.
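The asymmetry reduces to a one-line decision rule. A hedged sketch (function signature and the tri-state detection value are assumptions for illustration):

```typescript
// Sketch of the asymmetric switch: undercover can be forced ON but never
// forced OFF. It deactivates only when the repo is positively confirmed to be
// on the internal allowlist; any uncertainty defaults to staying undercover.
function undercoverActive(
  repoOnInternalAllowlist: boolean | "unknown",
  forceOn: boolean,
): boolean {
  if (forceOn) return true;                // force-ON exists
  return repoOnInternalAllowlist !== true; // no force-OFF: any doubt => ON
}

console.log(undercoverActive(true, false));      // false: confirmed internal
console.log(undercoverActive("unknown", false)); // true: stay undercover
console.log(undercoverActive(true, true));       // true: force-ON always wins
```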
At the build‑artifact level, Anthropic maintains an excluded-strings.txt blacklist; CI greps the final binaries for internal codenames. Sensitive strings that must ship are therefore assembled at runtime, so the literal never appears in the artifact: for example, the API key prefix sk-ant-api is written as ['sk', 'ant', 'api'].join('-').
The most interesting case appears in the Buddy pet system: capybara happens to be an internal codename for one of Anthropic’s models, so writing it as a raw string would trigger canary detection. To avoid making it stand out, all 18 species names are uniformly encoded in hex. Can the security checks distinguish between “pet names” and “model codenames”? No. So they’re treated identically.
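The encoding is trivially reversible, which is exactly the point the article makes about Evidence 4: anyone reverse-engineering the bundle will decode it in seconds. Using the char codes for "capybara" as stated in the text:

```typescript
// "capybara" written as hex char codes, as the Buddy species names are.
// A `grep capybara` over the source finds nothing, but decoding is one call:
const name = String.fromCharCode(0x63, 0x61, 0x70, 0x79, 0x62, 0x61, 0x72, 0x61);
console.log(name); // "capybara"
```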
7. Six Core Design Principles
Reading the entire codebase, you can extract six design principles that run through everything. The first three are at the Environment level, the latter three are at the constraint/correction level.
Environment Principles
Principle 1: Cache economics is an architectural constraint. Cache hit rate is not an optimization—it determines how messages are serialized, how sub-agents are forked, and how tool results are stored. Draw the cache boundary diagram on day one.
Principle 2: Handle information with different "shelf lives" in different layers. Intermediate tool outputs lose their value after a few rounds (SNIP and delete them), conversation structure needs compression (COLLAPSE and archive), and global background context must be persisted. No single compression method covers all scenarios; you need a pipeline.
Principle 3: Parallelize every LLM call that can be parallelized. sideQuery turns “calling the LLM” into a lightweight operation scattered everywhere. While the main model is reasoning, permission classification, memory retrieval, and summary generation all run in parallel.
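The shape of principle 3 is just structured concurrency. A sketch with trivial stand-ins for the LLM calls (all function names here are illustrative, not the real sideQuery API):

```typescript
// Sketch of the sideQuery idea (hypothetical names): auxiliary LLM calls run
// concurrently with the main turn instead of serially before it.
async function mainModelTurn(): Promise<string> { return "answer"; }   // stand-in
async function classifyPermission(): Promise<string> { return "allow"; } // stand-in
async function retrieveMemory(): Promise<string> { return "notes"; }     // stand-in

async function runTurn(): Promise<string[]> {
  // All three are in flight at once; total latency is roughly the slowest
  // call, not the sum of all three.
  return Promise.all([mainModelTurn(), classifyPermission(), retrieveMemory()]);
}

runTurn().then((r) => console.log(r)); // ["answer", "allow", "notes"]
```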
Constraint/Correction Principles
Principle 4: Put circuit breakers everywhere. The biggest fear for an agent is not failure, but infinite retries on top of failure. Every recovery path has an upper limit—3 compressions, 3 permission attempts, 3 output attempts.
Principle 5: Don’t expose errors too early. Until it’s confirmed that recovery is truly impossible, intermediate errors should not be leaked to the consumer. Silent upgrades, withholding errors, model degradation—“recovery loops” instead of “single retries.”
Principle 6: Safe defaults must be conservative. Tools are assumed unsafe and assumed to perform writes unless declared otherwise. Undercover is enabled by default with no force-OFF. When uncertain, always choose the safer option.
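Principles 4 and 5 compose naturally into one control-flow shape, sketched below (the cap of 3 matches the limits cited above; everything else is illustrative):

```typescript
// Sketch combining principles 4 and 5: retry inside a bounded recovery loop,
// surfacing an error only after the circuit-breaker cap is exhausted.
async function withRecovery<T>(attempt: () => Promise<T>, maxAttempts = 3): Promise<T> {
  let lastError: unknown = undefined;
  for (let i = 0; i < maxAttempts; i++) { // circuit breaker: hard upper limit
    try {
      return await attempt();
    } catch (err) {
      lastError = err;                    // withhold: don't expose the error yet
    }
  }
  throw lastError;                        // only now does the consumer see a failure
}

let calls = 0;
withRecovery(async () => {
  calls++;
  if (calls < 3) throw new Error("transient failure");
  return "ok";
}).then((result) => console.log(result, calls)); // succeeds on the 3rd attempt
```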
The first three make the agent fast and smart, the latter three ensure the agent is stable and safe. Together they form a complete Harness.
8. From Model Commoditization to Ecosystem Moats
What exactly does “model commoditization” mean?
Earlier we mentioned “model capabilities trending toward commoditization.” Commoditization refers to a product that was once scarce and differentiated becoming widely available and mutually interchangeable. When multiple companies can provide base models with similar capabilities, the model itself is no longer the decisive differentiating factor—it turns from a “scarce resource” into “infrastructure,” like electricity or bandwidth.
This trend is indeed happening: the gap between open-source models and closed-source frontier models has shrunk from years to months; the API price for models at the same intelligence level has dropped by more than an order of magnitude over the past year; model training methodologies are copied by the entire industry within months of paper publication.
The escape law in the history of technology
Commoditization is not the destiny of technology. The three most classic historical cases reveal the same pattern:
| Company | Commoditized part | Non-commoditized part | Escape mechanism |
|---|---|---|---|
| NVIDIA | GPU compute (AMD, Google TPU are catching up) | CUDA ecosystem (gross margin ~75%) | Software platform lock-in, extremely high switching cost |
| Cloud providers | VM/storage/bandwidth (the big three are converging in features) | Higher-level PaaS services (AWS operating margin ~30%) | Data gravity + proprietary services |
| Apple | Phone hardware (Android ecosystem is already commoditized) | iOS ecosystem (iPhone gross margin ~45%) | Vertical integration + ecosystem lock-in |
Underlying technical capabilities always tend to commoditize. The only way to escape commoditization is to build an ecosystem, platform, or integrated experience on top of the technology that creates switching costs. NVIDIA sells chips; its moat is CUDA. Apple sells phones; its moat is the iOS ecosystem. Base model companies face the same problem: model APIs are being commoditized; they need to find their own “CUDA” or “iOS.”
Anthropic’s co-optimization flywheel
From Claude Code’s source code, Anthropic’s strategy is not just “build products,” but a co-optimization loop: define a paradigm (MCP, SKILLs, CLAUDE.md) → build products around the paradigm (Claude Code, Artifacts) → products generate usage data (GrowthBook experiments + ablation experiments) → data feeds back into model training → the model performs better under these paradigms → the ecosystem grows further around the paradigm. Structurally, this is isomorphic to CUDA’s flywheel.
A technical detail most people overlook is: SKILLs work by dynamically loading instructions into the middle of the conversation context, rather than the System Prompt. Instructions in the System Prompt follow norms that almost all models have been optimized for, but instruction-following in the middle of the context is a more subtle capability dimension—the classic “Lost in the Middle” problem. If Anthropic knows during training that products will inject SKILLs into the middle of the context, it can specifically optimize training for this scenario. Other model companies, if they don’t build similar products, might not even realize this is a dimension worth optimizing.
This is the true value of paradigm-defining power: not to define a standard that everyone uses, but to define a standard and then ensure your own model performs best under that standard. The standard is open, but the best implementation is closed.
Evidence in the codebase supports this judgment. The GrowthBook experimentation platform tracks the impact of each Harness feature on task completion rate; the ablation experimental infrastructure quantifies the value of individual features; exposure tracking supports attribution analysis. These signals not only guide product iteration, but can also feed back into model training—where the model performs poorly, which instructions it fails to follow, which tool-calling patterns are error-prone. Claude Code is not just a product; it is also a signal source for model training.
Of course, this flywheel is not yet as sticky as CUDA or iOS—SKILLs are essentially Markdown text, with near-zero migration cost; MCP is an open protocol, and competitors can free-ride. But the core question is not “is it strong enough now,” but whether the flywheel can spin fast enough for model-training advantages and product-data advantages to compound over time.
Conclusion: The real distance from demo to production
Many people’s agent development stops at “let the model call tools”—write prompts, provide a few tools, get calls working and call it done. But getting things to work is only the beginning. Claude Code’s source tells us that the real engineering effort lies beyond the model, prompts, and tools:
- Once tools are called, how are permissions decided? → A five-layer system: static rules + hooks + tool properties + LLM classifier + circuit breakers
- What happens when the model makes a mistake? → Withhold intermediate errors + silently upgrade output limits + multi-round continuation + model degradation + only expose after recovery is exhausted
- How do you stop losses when the model acts stupidly? → Every recovery path has a circuit-breaker cap + API errors skip stop hooks to avoid death spirals
- What if the context is too long—how do you compress it? → A five-layer pipeline handling information by “shelf life”
- How do you coordinate multiple tools running at the same time? → Concurrency safety flags + error cascading + execution order guarantees
- How do sub-agents fork and share cache? → CacheSafeParams + byte-level consistency + global slots
- If the user is only halfway through input, can you already start working? → Speculative execution + overlay filesystem + main-session isolation
- What if your model capability is distilled by others? → Two-layer anti-distillation: fake tools for poisoning training sets + connector text signatures to hide the reasoning process
- If build artifacts are distributed, how do you avoid leaking internal information? → String blacklists + compile-time DCE + Undercover Mode
This engineering “beyond the model and tools” is the real distance from demo to production. And that distance is much farther than most people imagine.
This is also why this leak—intentional or not—is so valuable. It reveals a simple but crucial fact: when base model capabilities converge rapidly, the more complete, robust, and refined Harness wins the agent product game. A deeper insight is: the best Harness must be deeply tailored to the specific model’s characteristics (cache economics, thinking mode, instruction-following in the middle of the context), and the best model training in turn needs real usage data generated by the Harness as a signal source. Model × Harness is a multiplicative relationship; only their co-optimization can form a true competitive moat. Claude Code’s 510,000 lines of source are both a practical sample of Harness Engineering and Anthropic’s strategic hedge for building an ecosystem moat.