skills/claude-code-deep-dive/SKILL.md
Deep architectural reference for Claude Code internals -- the query loop, The Prestige (prompt caching illusion), tool orchestration, state management, and cost model. Use when reasoning about Claude Code behavior, optimizing token usage, or debugging cache breaks.
Reverse-engineered from Claude Code source (TypeScript/Bun/Ink). This is the actual architecture, not documentation -- verified against the code.
Every great magic trick consists of three parts:
The Pledge -- the magician shows you something ordinary. A conversation. You type a message, Claude responds. It looks like a continuous session.
The Turn -- the magician makes it disappear. After every response, the server-side Claude instance is gone. No memory. No state. No session. The previous instance is dead.
The Prestige -- the magician brings it back. A brand new Claude instance appears, fed the entire conversation from the beginning. It replays every token, reconstructs the same mental state, and responds as if it was there the whole time.
The audience (you) sees one continuous conversation. Behind the curtain, it's a new clone every turn. The old one is destroyed. You pay for the cloning.
| The Prestige | Claude Code |
|---|---|
| The Pledge | User sees a continuous conversation |
| The Turn | Server-side instance dies (stateless API) |
| The Prestige | New instance cloned from full history replay |
| The cloning machine | Prompt cache (KV tensor checkpoint) |
| Cost of the machine | Cache read tokens (10% of input price) |
| Drowning the original | Server discards all state between calls |
| Angier's diary | Compaction summary (cliff notes for the clone) |
| The audience | The user, who never sees the swap |
This framing is used throughout this document. "The Prestige" = the per-turn full-history replay. "The cloning machine" = prompt caching. "Angier's diary" = compaction.
File: main.tsx
Startup is performance-critical. Three side-effects fire before any imports evaluate:
- profileCheckpoint('main_tsx_entry') -- marks wall-clock entry
- startMdmRawRead() -- fires MDM subprocess (plutil/reg query) in parallel with imports
- startKeychainPrefetch() -- fires macOS keychain reads (OAuth + legacy API key) in parallel

Then:

- --agent, --model, --remote, --bare, etc.)
- init() runs: config validation, env vars, TLS cert setup, graceful shutdown handlers
- launchRepl() dynamically imports App + REPL components and renders via Ink

Feature flags use feature() from bun:bundle for build-time dead code elimination. When false, the entire require() block is stripped from the bundle. This is why imports look like:
const VoiceCommand = feature('VOICE_MODE')
? require('./commands/voice/index.js').default
: null
Not runtime toggling -- compile-time stripping.
Files: state/AppStateStore.ts, state/store.ts, state/AppState.tsx
A single DeepImmutable<> object (~200+ fields) tracking everything:
DeepImmutable<> is a recursive mapped type making everything readonly. Prevents accidental mutation.
Simple external store pattern (like Zustand):
- getState() / setState(prev => newState) / subscribe(listener)
- AppStateProvider for Ink components
- useSyncExternalStore() -- React 18 concurrent-safe

setAppState is no-op for async subagents (createSubagentContext). Infrastructure that outlives a turn uses setAppStateForTasks which always reaches the root store.
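The external store pattern described above can be sketched in a few lines. This is a minimal illustration, not the code in state/store.ts: createStore, Store, and Listener are names chosen for the example, and the reference-equality check before notifying is an assumption.

```typescript
// Minimal external store sketch; createStore, Store and Listener are illustrative names.
type Listener = () => void

interface Store<T> {
  getState(): T
  setState(updater: (prev: T) => T): void
  subscribe(listener: Listener): () => void
}

function createStore<T>(initial: T): Store<T> {
  let state = initial
  const listeners = new Set<Listener>()
  return {
    getState: () => state,
    setState(updater) {
      const next = updater(state)
      if (Object.is(next, state)) return // same reference: nothing to announce (assumption)
      state = next
      listeners.forEach(listener => listener())
    },
    subscribe(listener) {
      listeners.add(listener)
      return () => {
        listeners.delete(listener)
      }
    },
  }
}

// Usage: one update while subscribed, one after unsubscribing.
const store = createStore({ turnCount: 0 })
let notified = 0
const unsubscribe = store.subscribe(() => {
  notified += 1
})
store.setState(prev => ({ ...prev, turnCount: prev.turnCount + 1 }))
unsubscribe()
store.setState(prev => ({ ...prev, turnCount: prev.turnCount + 1 }))
```

A subagent-style context could hand out the same getState with a setState that does nothing, which is one way to picture the no-op behaviour described above.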
Files: query.ts, QueryEngine.ts
query() is an async generator that implements the agentic loop:
User message
-> Build system prompt (CLAUDE.md + git + context)
-> Call Claude API (streaming)
-> Parse assistant response
-> Text blocks: yield to UI for rendering
-> Tool use blocks: dispatch to tool implementations
-> Permission check (canUseTool)
-> Execute tool (BashTool, FileEditTool, etc.)
-> Return tool_result as UserMessage
-> Loop back to API call
-> Stop reason reached -> return Terminal
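The loop above maps naturally onto an async generator. The sketch below shows only that shape, under stated assumptions: callModel and runTool are hypothetical stand-ins for the API call and tool dispatch, the types are simplified, and permission checks, streaming, and compaction are left out.

```typescript
// Shape-only sketch of the agentic loop; callModel and runTool are hypothetical stand-ins.
type ContentBlock =
  | { type: 'text'; text: string }
  | { type: 'tool_use'; id: string; name: string; input: unknown }
type ChatMessage = { role: 'user' | 'assistant'; content: ContentBlock[] | string }
type LoopEvent =
  | { kind: 'text'; text: string }
  | { kind: 'tool_result'; id: string; output: string }

async function* queryLoop(
  messages: ChatMessage[],
  callModel: (history: ChatMessage[]) => Promise<ContentBlock[]>,
  runTool: (name: string, input: unknown) => Promise<string>,
  maxTurns = 10,
): AsyncGenerator<LoopEvent, 'completed' | 'max_turns'> {
  for (let turn = 0; turn < maxTurns; turn++) {
    const blocks = await callModel(messages) // the full history goes out on every call
    messages.push({ role: 'assistant', content: blocks })
    const results: string[] = []
    for (const block of blocks) {
      if (block.type === 'text') {
        yield { kind: 'text', text: block.text }
      } else {
        const output = await runTool(block.name, block.input)
        results.push(output)
        yield { kind: 'tool_result', id: block.id, output }
      }
    }
    if (results.length === 0) return 'completed' // no tool calls, so the turn is over
    // Tool results go back to the model as a user message, then the loop repeats.
    messages.push({ role: 'user', content: results.join('\n') })
  }
  return 'max_turns'
}
```

The generator's return value plays the role of the terminal state in the diagram, while yielded events are what a UI would render as they arrive.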
type State = {
messages: Message[]
toolUseContext: ToolUseContext
autoCompactTracking: AutoCompactTrackingState | undefined
maxOutputTokensRecoveryCount: number
hasAttemptedReactiveCompact: boolean
maxOutputTokensOverride: number | undefined
pendingToolUseSummary: Promise<ToolUseSummaryMessage | null> | undefined
stopHookActive: boolean | undefined
turnCount: number
transition: Continue | undefined // why the previous iteration continued
}
- task_budget)

One QueryEngine per conversation. Each submitMessage() starts a new turn. It:

- canUseTool to track permission denials for SDK reporting
- discoveredSkillNames per turn (prevents unbounded growth in SDK mode)

Files: Tool.ts, tools.ts
Every tool call receives ToolUseContext, containing:
- options: commands, tools, model, MCP clients, agent definitions, thinking config
- abortController: cancellation signal
- readFileState: LRU file content cache (prevents re-reading unchanged files)
- getAppState / setAppState: state access
- setToolJSX: render Ink components during tool execution
- messages: full conversation history
- fileReadingLimits / globLimits: per-tool resource caps
- contentReplacementState: tool result budget tracking
- renderedSystemPrompt: frozen system prompt for fork subagents

getAllBaseTools() returns ~30+ tools. Assembly pipeline:

- getAllBaseTools() -- all possible tools (feature-flagged at build time)
- getTools() -- filters by deny rules, REPL mode, simple mode, isEnabled()
- assembleToolPool() -- merges built-in + MCP tools, deduplicates, sorts by name

toolOrchestration.ts partitions tool calls into batches:

- MAX_TOOL_USE_CONCURRENCY (default 10)

StreamingToolExecutor enables tool execution to begin before the full tool input is streamed. Tools opt in via eager_input_streaming: true on their schema.
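The last step of the assembly pipeline above (merge, deduplicate, sort by name) can be sketched as follows. The Tool shape and the function name are hypothetical, and the rule that a built-in tool wins a name collision is an assumption; the source only says that duplicates are removed. A stable sort order plausibly matters because the tool list is part of the cached prefix.

```typescript
// Hypothetical Tool shape; the real type carries schemas, permissions and more.
type Tool = { name: string; source: 'builtin' | 'mcp' }

function assembleToolPoolSketch(builtIn: Tool[], mcp: Tool[]): Tool[] {
  const byName = new Map<string, Tool>()
  // Assumption: on a name collision the first entry (a built-in tool) is kept.
  for (const tool of builtIn.concat(mcp)) {
    if (!byName.has(tool.name)) byName.set(tool.name, tool)
  }
  // Deterministic ordering by code unit, independent of locale settings.
  return Array.from(byName.values()).sort((a, b) =>
    a.name < b.name ? -1 : a.name > b.name ? 1 : 0,
  )
}

const pool = assembleToolPoolSketch(
  [
    { name: 'Read', source: 'builtin' },
    { name: 'Bash', source: 'builtin' },
  ],
  [
    { name: 'mcp__github__search', source: 'mcp' },
    { name: 'Bash', source: 'mcp' },
  ],
)
```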
File: commands.ts
- local: Runs JS, returns text output (e.g., /compact, /cost)
- local-jsx: Renders Ink UI (e.g., /config, /mcp, /keybindings)
- prompt: Expands to text sent to the model (skills, /review, /commit)

getCommands() merges sources in order (earlier wins on name conflict):

- ~/.claude/skills/)

Filtered by:

- meetsAvailabilityRequirement() -- auth/provider gates (claude-ai vs console)
- isCommandEnabled() -- feature flag / user setting

Skills found during file operations (e.g., reading a SKILL.md in a project) are added to the command list mid-session via getDynamicSkills(). Inserting them triggers clearCommandMemoizationCaches().
- getSkillToolCommands() -- all prompt-type commands the model can invoke (includes skills + legacy commands)
- getSlashCommandToolSkills() -- only user-invocable skills shown in / typeahead

File: hooks/useCanUseTool.tsx, hooks/toolPermission/

- alwaysAllowRules, alwaysDenyRules, alwaysAskRules from settings.json (per source: user, project, enterprise)
- hasPermissionsToUseTool(): Evaluates rules, returns allow/deny/ask
- TRANSCRIPT_CLASSIFIER): Auto-mode classifier evaluates risk from conversation context
- shouldAvoidPermissionPrompts is true (background agents)

Permission modes:

- default: Ask for dangerous operations
- plan: Restricted -- only read operations and plan file edits
- bypassPermissions: Allow everything (YOLO mode, requires trust dialog)
- auto: Classifier-driven approval

DenialTrackingState counts consecutive denials. After a threshold, falls back to prompting even in auto mode. Subagents get localDenialTracking since their setAppState is a no-op.
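Denial tracking as described above amounts to a streak counter. The sketch below is illustrative only: the threshold value, the type, and the function names are assumptions, since the source does not state them, and it assumes any approval resets the streak.

```typescript
// Illustrative threshold; the source does not give the real value.
const DENIAL_THRESHOLD = 3

type DenialTracking = { consecutiveDenials: number }

function recordDecision(state: DenialTracking, decision: 'allow' | 'deny'): DenialTracking {
  // Assumption: an approval resets the streak, so only consecutive denials count.
  return { consecutiveDenials: decision === 'deny' ? state.consecutiveDenials + 1 : 0 }
}

function shouldFallBackToPrompt(state: DenialTracking): boolean {
  return state.consecutiveDenials >= DENIAL_THRESHOLD
}

// Usage: one early denial is forgiven, three in a row trip the fallback.
let tracking: DenialTracking = { consecutiveDenials: 0 }
for (const decision of ['deny', 'allow', 'deny', 'deny', 'deny'] as const) {
  tracking = recordDecision(tracking, decision)
}
```

Returning a fresh object instead of mutating fits a state model where a subagent keeps its own local copy of the tracking state.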
The Claude API is stateless. Every API call sends the entire conversation from scratch -- system prompt, tool definitions, and every prior message. On turn 50, even a 10-token user message triggers a request containing all 49 previous turns. This is The Prestige -- a new clone, built from the full script of every previous performance.
The "cache" is the cloning machine -- a KV tensor checkpoint, not a result cache. The server recognizes "I've computed attention for this exact token prefix before" and skips the transformer forward pass for those tokens. But it still loads the tensors, allocates GPU memory, and attends over cached positions during generation. The clone still needs to be built. The machine just builds it faster.
What Anthropic calls "cache read tokens" is the cost of the cloning machine. 90% off full input price. You still pay 10% for every token of every prior turn, every single time. And like Angier's machine, you pay every performance.
Per-million-token rates (from modelCost.ts):
| Model | Input | Cache Write | Cache Read | Cache Savings |
|-------|-------|-------------|------------|---------------|
| Opus 4.6 | $5.00 | $6.25 | $0.50 | 90% off input |
| Opus 4.6 fast | $30.00 | $37.50 | $3.00 | 90% off input |
| Sonnet 4.6 | $3.00 | $3.75 | $0.30 | 90% off input |
| Opus 4/4.1 | $15.00 | $18.75 | $1.50 | 90% off input |
| Haiku 4.5 | $1.00 | $1.25 | $0.10 | 90% off input |
Cache writes are 1.25x input price. They're the cost of creating the KV checkpoint. Amortized across subsequent reads.
Real-world example: 2.1 billion tokens/week on Haiku, 99.7% cache reads. Without caching: ~$2,120. With caching: ~$220. A stateful session API would reduce this to ~$9.
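The arithmetic behind the real-world example can be checked with a short sketch. It uses the Haiku 4.5 rates from the pricing table and assumes that every token that is not a cache read is billed as a cache write; output tokens are ignored. Under those assumptions the result lands near, but not exactly on, the rounded figures quoted above.

```typescript
// Per-million-token rates copied from the Haiku 4.5 row of the pricing table.
const HAIKU = { input: 1.0, cacheWrite: 1.25, cacheRead: 0.1 }

function weeklyInputCost(totalTokens: number, cacheReadShare: number) {
  const perToken = (ratePerMillion: number) => ratePerMillion / 1e6
  const readTokens = totalTokens * cacheReadShare
  const otherTokens = totalTokens - readTokens
  return {
    // Every token billed at the full input rate.
    withoutCaching: totalTokens * perToken(HAIKU.input),
    // Assumption: all non-read tokens are billed as cache writes.
    withCaching:
      readTokens * perToken(HAIKU.cacheRead) + otherTokens * perToken(HAIKU.cacheWrite),
  }
}

// 2.1 billion tokens per week, 99.7% of them cache reads.
const estimate = weeklyInputCost(2.1e9, 0.997)
```

With these inputs the estimate comes to roughly $2,100 without caching and roughly $217 with it.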
File: services/api/claude.ts
Three placement sites tell the server "cache everything up to here":
A. System prompt (buildSystemPromptBlocks):
splitSysPromptPrefix(systemPrompt).map(block => ({
type: 'text',
text: block.text,
...(enablePromptCaching && block.cacheScope !== null && {
cache_control: getCacheControl({ scope: block.cacheScope, querySource }),
}),
}))
System prompt splits into up to 4 blocks:
- x-anthropic-billing-header) -- no cache scope
- CLI_SYSPROMPT_PREFIXES) -- org scope
- SYSTEM_PROMPT_DYNAMIC_BOUNDARY -- global scope (shared across ALL users)

B. Tool schemas (toolToAPISchema in utils/api.ts):

- cache_control

C. Messages (addCacheBreakpoints):

- cache_control marker per request
- skipCacheWrite fork agents)
- cache_store_int_token_boundaries. Two markers would waste GPU memory by protecting a position nothing will resume from.

function getCacheControl({ scope, querySource }) {
return {
type: 'ephemeral',
...(should1hCacheTTL(querySource) && { ttl: '1h' }),
...(scope === 'global' && { scope }),
}
}
should1hCacheTTL():
- ENABLE_PROMPT_CACHING_1H_BEDROCK env var

Cache scopes:

- scope: 'global'): Static system prompt shared across ALL first-party API users. Only for getAPIProvider() === 'firstParty'. Gated by shouldUseGlobalCacheScope().
- scope: 'org'): Per-organization caching. User-specific content (CLAUDE.md, tool schemas).

When MCP tools are present, global scope on the system prompt is skipped (skipGlobalCacheForSystemPrompt). MCP tools are per-user, so the tool section following the system prompt can't be globally cached. Falls back to org-level caching.
Multiple values are latched (set once, never change within a session) to prevent cache key churn:
| Latched Value | Why | Bootstrap State Function |
|---|---|---|
| 1h TTL eligibility | Overage flip would change TTL | setPromptCache1hEligible() |
| AFK mode beta header | Auto-mode toggle would add/remove beta | setAfkModeHeaderLatched() |
| Fast mode beta header | /fast toggle would add/remove beta | setFastModeHeaderLatched() |
| Cache editing beta header | Feature enable would add beta | setCacheEditingHeaderLatched() |
| Tool schema base | GrowthBook flip would change tool descriptions | toolSchemaCache.ts |
| 1h cache allowlist | GrowthBook disk cache update would change patterns | setPromptCache1hAllowlist() |
| Thinking clear | Prevents thinking mode flips from busting cache | setThinkingClearLatched() |
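Latching as used in the table above means the first observed value is kept for the rest of the session. A generic sketch follows, assuming a helper named createLatch, which is illustrative and not a name from the codebase.

```typescript
// Generic set-once latch; createLatch is an illustrative helper name.
function createLatch<T>() {
  let latched: { value: T } | undefined
  return {
    // The first call fixes the value; later calls return the stored one unchanged.
    resolve(current: T): T {
      if (latched === undefined) latched = { value: current }
      return latched.value
    },
  }
}

// Usage: a mid-session flip of the live value does not reach the request.
const oneHourTtl = createLatch<boolean>()
const atSessionStart = oneHourTtl.resolve(true)
const afterOverageFlip = oneHourTtl.resolve(false)
```

Wrapping the stored value in an object lets the latch hold false or undefined without mistaking them for "not yet set".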
File: services/api/promptCacheBreakDetection.ts
Two-phase detection system:
Phase 1 (pre-call): recordPromptState() snapshots everything that could affect the cache key:
- cache_control -- catches scope/TTL flips)

Phase 2 (post-call): checkResponseForCacheBreak() checks actual cache token response:

- cache-break-XXXX.diff)
- tengu_prompt_cache_break analytics event with full diagnostic payload

Special cases that are NOT cache breaks:

- notifyCompaction(): Resets baseline after compaction / diary rewrite (legitimately reduces prefix -- the clone is reading new cliff notes, not the old script)
- notifyCacheDeletion(): Resets after cache_edits deletions (expected drop)

A beta feature for surgical cache manipulation without full reprocessing:

- cache_reference: tool_use_id is added to tool_result blocks within the cached prefix
- cache_edits: [{ type: 'delete', cache_reference: 'tool_use_id_123' }] blocks are inserted into user messages

Constraints:

- cache_reference must appear "before or on" the last cache_control marker
- cache_edits splicing

Files: cost-tracker.ts, bootstrap/state.ts, utils/modelCost.ts, services/api/logging.ts
API streaming response
|
v
updateUsage() -- merges message_delta usage into per-message total
| (takes max of delta vs accumulated for input/cache tokens,
| because deltas report cumulative not incremental)
v (on message_stop)
accumulateUsage() -- adds message usage to total session usage
| (simple addition across all fields)
v
calculateUSDCost() -- multiplies tokens by per-model cost tiers
| (looks up ModelCosts by canonical model name)
v
addToTotalSessionCost() -- updates:
| - bootstrap/state MODEL_USAGE counters (per-model)
| - OTel counters (cost, tokens by type)
| - Recursive for advisor sub-usage
v
saveCurrentSessionCosts() -- persists to project config for session resume
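The first two stages of the pipeline above differ in one important way: per-message merging keeps the maximum because streamed deltas are cumulative, while session accumulation is plain addition. A minimal sketch, with simplified types and hypothetical function names; the handling of output tokens is an assumption.

```typescript
type Usage = {
  inputTokens: number
  outputTokens: number
  cacheReadInputTokens: number
  cacheCreationInputTokens: number
}

const ZERO_USAGE: Usage = {
  inputTokens: 0,
  outputTokens: 0,
  cacheReadInputTokens: 0,
  cacheCreationInputTokens: 0,
}

// Per-message merge: streamed deltas report cumulative counts, so keep the larger value.
function mergeDelta(current: Usage, delta: Partial<Usage>): Usage {
  const keepMax = (a: number, b: number | undefined) => Math.max(a, b ?? 0)
  return {
    inputTokens: keepMax(current.inputTokens, delta.inputTokens),
    cacheReadInputTokens: keepMax(current.cacheReadInputTokens, delta.cacheReadInputTokens),
    cacheCreationInputTokens: keepMax(current.cacheCreationInputTokens, delta.cacheCreationInputTokens),
    // Assumption: output tokens are cumulative too, so the latest report wins.
    outputTokens: delta.outputTokens ?? current.outputTokens,
  }
}

// Session total: field-by-field addition once a message is complete.
function addToSession(total: Usage, message: Usage): Usage {
  return {
    inputTokens: total.inputTokens + message.inputTokens,
    outputTokens: total.outputTokens + message.outputTokens,
    cacheReadInputTokens: total.cacheReadInputTokens + message.cacheReadInputTokens,
    cacheCreationInputTokens: total.cacheCreationInputTokens + message.cacheCreationInputTokens,
  }
}

let messageUsage = mergeDelta(ZERO_USAGE, { inputTokens: 120, cacheReadInputTokens: 5000, outputTokens: 1 })
messageUsage = mergeDelta(messageUsage, { outputTokens: 85 }) // a later delta that omits input counts
const sessionUsage = addToSession(addToSession(ZERO_USAGE, messageUsage), messageUsage)
```

Summing the deltas instead of taking the maximum would double-count input and cache tokens every time a delta repeats them.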
ModelUsage tracks per model:
{
inputTokens: number
outputTokens: number
cacheReadInputTokens: number
cacheCreationInputTokens: number
webSearchRequests: number
costUSD: number
contextWindow: number
maxOutputTokens: number
}
Accumulated by canonical short name (e.g., claude-opus-4-6 not the full model string).
formatTotalCost() renders:
Total cost: $12.34
Total duration (API): 5m 23s
Total duration (wall): 12m 45s
Total code changes: 42 lines added, 7 lines removed
Usage by model:
claude-opus-4-6: 1,234 input, 5,678 output, 98,765 cache read, 432 cache write ($12.34)
Costs are saved to project config (saveCurrentSessionCosts()) and restored on resume (restoreCostStateForSession()). Only restores if the session ID matches the last saved session.
File: context.ts
getSystemContext() returns:
- gitStatus: Branch, default branch, status (truncated at 2K chars), recent 5 commits, git user name
- cacheBreaker: Optional injection for cache breaking (ant-only debugging)

Skipped in CCR (Claude Code Remote) and when git instructions are disabled.
getUserContext() returns:
- claudeMd: Concatenated CLAUDE.md files from cwd walk up to home + additional directories
- currentDate: Today's date in local ISO format

CLAUDE.md discovery:

- .claude/CLAUDE.md and project-root CLAUDE.md

Disabled by:

- CLAUDE_CODE_DISABLE_CLAUDE_MDS env var
- --bare mode (unless --add-dir explicitly provided)

Files: services/compact/autoCompact.ts, services/compact/compact.ts, services/compact/prompt.ts
Compaction is not compression. It is an amnestic reset -- controlled memory destruction with a handwritten summary left behind for the next clone.
In The Prestige terms: the diary gets too long to carry. Someone writes cliff notes. The next clone reads the cliff notes instead of the full diary. It never actually lived those events -- it just read the summary and pretends it did.
Compaction is not free. The summary API call is itself a full Prestige:
On a 200K context compaction with Opus 4.6: ~$0.15-0.25 per compaction.
effective_context_window = context_window - 20,000 (reserved for summary output)
auto_compact_threshold = effective_context_window - 13,000 (buffer)
For 200K context: triggers at ~167K tokens. For 1M context: triggers at ~967K tokens.
Override with CLAUDE_AUTOCOMPACT_PCT_OVERRIDE (percentage of effective window).
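The two formulas above reduce to a small function. The constant names are made up for this sketch, and the treatment of the percentage override (replacing the fixed buffer with a share of the effective window) is an assumption based on the one-line description, not on the source code.

```typescript
// Constant names are illustrative; the values come from the formulas above.
const SUMMARY_OUTPUT_RESERVE = 20000
const AUTO_COMPACT_BUFFER = 13000

function autoCompactThreshold(contextWindow: number, pctOverride?: number): number {
  const effectiveWindow = contextWindow - SUMMARY_OUTPUT_RESERVE
  if (pctOverride !== undefined) {
    // Assumption: the override replaces the fixed buffer with a share of the effective window.
    return Math.floor((effectiveWindow * pctOverride) / 100)
  }
  return effectiveWindow - AUTO_COMPACT_BUFFER
}

const threshold200k = autoCompactThreshold(200000) // 167000
const threshold1m = autoCompactThreshold(1000000) // 967000
const thresholdWithOverride = autoCompactThreshold(200000, 60) // 108000 under the assumption above
```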
When a compact boundary already exists, PARTIAL_COMPACT_PROMPT only summarizes messages AFTER the last boundary, keeping the previous summary intact. The new clone reads the old cliff notes plus a new addendum, instead of re-summarizing already-summarized content.
After 3 consecutive compaction failures, stops retrying. Prevents wasting API calls when context is irrecoverably over limit (e.g., prompt_too_long errors). BQ data showed 1,279 sessions with 50+ consecutive failures.
Compaction destroys the cached prefix (conversation history is rewritten as a summary). The next Prestige builds a completely different clone -- different token sequence, no cache hit on the old prefix. notifyCompaction() resets the cache break detection baseline so the expected drop in cache reads isn't flagged as a bug.
File: Task.ts, tasks.ts
| Type | Prefix | Description |
|---|---|---|
| local_bash | b | Shell command running in background |
| local_agent | a | Subagent (Agent tool) |
| remote_agent | r | Remote agent (CCR) |
| dream | d | Auto-dream background processing |
| local_workflow | w | Workflow script execution |
| monitor_mcp | m | MCP server monitor |
| in_process_teammate | t | Swarm teammate |
Task IDs: prefix + 8 random chars from [0-9a-z] (36^8 ~ 2.8 trillion combinations, resists brute-force symlink attacks).
pending -> running -> completed | failed | killed
Terminal states checked by isTerminalTaskStatus(). Guards against injecting messages into dead teammates, evicting finished tasks, orphan cleanup.
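The ID format and the terminal-state check above can be sketched together. This assumes the random characters come from a cryptographically secure source (here Node's crypto.randomInt), which the brute-force remark suggests but the source does not confirm; the function names are illustrative.

```typescript
import { randomInt } from 'node:crypto'

type TaskStatus = 'pending' | 'running' | 'completed' | 'failed' | 'killed'

const ID_ALPHABET = '0123456789abcdefghijklmnopqrstuvwxyz'

// Type prefix plus 8 characters drawn uniformly from [0-9a-z].
function generateTaskIdSketch(prefix: string): string {
  let suffix = ''
  for (let i = 0; i < 8; i++) {
    suffix += ID_ALPHABET.charAt(randomInt(ID_ALPHABET.length)) // randomInt(max) is in [0, max)
  }
  return prefix + suffix
}

// completed, failed and killed are final; pending and running are not.
function isTerminalTaskStatusSketch(status: TaskStatus): boolean {
  return status === 'completed' || status === 'failed' || status === 'killed'
}

const bashTaskId = generateTaskIdSketch('b')
```

A caller would check the terminal predicate before sending a message to a teammate or evicting a task, so that work is never routed to something that has already stopped.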
Files: ink.ts, screens/REPL.tsx, replLauncher.tsx
All renders wrapped with ThemeProvider. Exports themed versions of Box/Text (ThemedBox, ThemedText) and design system primitives.
The main interactive screen handles:
80+ React hooks in hooks/ directory managing:
- useTextInput, useVimInput, useArrowKeyHistory, usePasteHandler
- useCanUseTool, useMergedTools, useMergedCommands
- useSettingsChange, useSkillsChange, useDynamicConfig
- useVoice, useRemoteSession, useSwarmInitialization
- useVirtualScroll, useBlink, useElapsedTime, useTerminalSize

MCP tools are prefixed mcp__servername__toolname. They're:
When MCP_SKILLS feature is enabled, prompt-type MCP commands are filtered into skill listings via getMcpSkillCommands().
URL elicitations (-32042 tool call errors) are handled via:
- handleElicitation callback in ToolUseContext (SDK/print mode)

The AgentTool spawns subagents with isolated contexts. Key design:

- createSubagentContext() makes setAppState a no-op (subagent can't corrupt parent state)
- setAppStateForTasks still reaches root store (background tasks outlive the turn)
- renderedSystemPrompt is frozen at fork time (prevents GrowthBook cold->warm divergence)
- skipCacheWrite for fork agents avoids polluting the KV cache with ephemeral branches

Built-in agents: general-purpose, Explore, Plan, statusline-setup, claude-code-guide
Custom agents: loaded from ~/.claude/agents/ directories
When COORDINATOR_MODE is enabled, the main thread becomes a coordinator that only uses AgentTool, TaskStopTool, and SendMessageTool. Workers get the full tool set.
| Metric | Healthy | Unhealthy | What it means in Prestige terms |
|---|---|---|---|
| Cache read ratio | >95% of input | <80% | Cloning machine is working (high) or broken (low) |
| Cache breaks per session | 0-2 | >5 | Cloning machine had to be rebuilt (script changed) |
| Compaction frequency | 0-3 per session | >10 | Diary rewrites -- each one is a full Prestige + output cost |
| Cache write tokens | Small fraction of reads | Approaching reads | Machine is rebuilding every show instead of reusing |
| 1h TTL eligibility | Latched true | Flipping mid-session | Machine's rental agreement is unstable |
| Total cache reads/week | Context-dependent | Billions | The weekly cost of the illusion -- every clone, every turn |
The cloning machine's cost scales with script length x number of performances. To reduce it:
Shorter scripts (compact earlier): CLAUDE_CODE_AUTO_COMPACT_WINDOW=80000 and CLAUDE_AUTOCOMPACT_PCT_OVERRIDE=60 -- rewrite the diary when it hits 48K tokens instead of the default ~967K (1M model). Each subsequent clone reads a shorter script.
Thinner scripts (less context per turn): Trim CLAUDE.md files, remove unused MCP tools, use Read with offset/limit, use Grep with head_limit. Every token saved compounds across all future performances.
Fewer performances (shorter sessions): Start fresh conversations more often. A 50-turn session means 50 Prestiges. Five 10-turn sessions mean 50 Prestiges too, but with much shorter scripts at peak.
Cheaper performers (model choice): Haiku at $0.10/Mtok cache read vs Opus at $0.50/Mtok. But a smarter model that finishes in fewer turns can be cheaper overall -- fewer performances beats cheaper clones.
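The "script length x number of performances" scaling can be made concrete with a toy model. It assumes every turn adds the same number of tokens and that every prior token is re-read from cache on each turn; fixed overhead such as the system prompt and tool schemas is left out, and the 2,000 tokens per turn is an arbitrary illustrative figure.

```typescript
// Total tokens replayed from cache over one session under the toy model above.
function replayedTokens(turns: number, tokensPerTurn: number): number {
  let total = 0
  for (let turn = 1; turn <= turns; turn++) {
    total += (turn - 1) * tokensPerTurn // turn N re-reads the N-1 turns before it
  }
  return total
}

const oneLongSession = replayedTokens(50, 2000) // 2,450,000 tokens replayed
const fiveShortSessions = 5 * replayedTokens(10, 2000) // 450,000 tokens replayed
```

Both cases run 50 turns, but the single long session replays more than five times as many tokens because replay grows roughly with the square of session length.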