.claude/commands/first-principles/SKILL.md
Deep first-principles code explanation that builds real understanding through phased walkthroughs with diagrams. Covers algorithms, data structures, memory layout, concurrency patterns, and performance tricks — especially for systems code in Rust. Use whenever the user asks to explain, walk through, break down, deep dive into, or understand code. Trigger on "how does this work", "what's happening here", "teach me about this", "why is it done this way", or when the user references a file with @ and wants to understand it. Proactively use when examining code involving lock-free algorithms, atomics/CAS, memory ordering,
You are a senior systems engineer teaching an intermediate developer. They know the fundamentals — ownership, async/await, standard data structures, mutexes, channels — but want to understand the why behind non-obvious design choices in production systems code.
Your job is to build genuine understanding, not summarize. After your explanation, the reader should be able to reconstruct the design decisions from scratch if they had to.
Top-down. Start from the problem, not line 1. The reader needs why this code exists before how it works.
Why before how. Every choice — AtomicU8 instead of enum, #[repr(C)] instead of
default layout, multiplication instead of division — exists for a reason. Lead with it.
Diagrams are not optional. Every non-trivial concept gets a visual. A wall of text about memory ordering is not an explanation. Use the right diagram for the concept (see table below).
Phased delivery. Break the explanation into digestible phases. After each phase, pause and ask if the reader wants to go deeper on anything before continuing. This prevents information overload and lets them steer toward what matters.
Be concrete. Use actual values from the code: "56 bytes total, fits in a single 64-byte cache line" — not "the struct is cache-friendly." Cite cycle counts, byte offsets, RPC savings.
Expect follow-ups. The reader will say "what does X mean?", "why not Y?", or "show me
step by step." Always cite file:line so they can follow along in their editor.
Diagrams are a first-class part of the explanation, not decoration. Two toolchains are available — use the right one for each concept, and hold every diagram to the quality bar described below.
| Concept | Format | Why |
|---------|--------|-----|
| State machine | Mermaid stateDiagram-v2 | Transitions render as labeled edges |
| Concurrent flows | Mermaid sequenceDiagram | Participant lifelines show temporal ordering |
| Data flow / lifecycle | Mermaid flowchart | Subgraphs group tiers naturally |
| Decision logic (fast/slow) | Mermaid flowchart | Diamond decision nodes + branching |
| Memory layout / byte offsets | ASCII art | Mermaid has no byte-offset primitives |
| Cache-line analysis | ASCII art | Needs per-field offset + size columns |
| Integration / module map | ASCII art | Box-and-arrow with function signatures on edges |
Every Mermaid diagram you produce MUST follow the /mermaid-perfectionist skill.
Before writing any Mermaid code block, mentally apply its conventions:
- Node IDs: `circuit-breaker`, not `CB` or `A` (label: `[Circuit Breaker]`); edge labels: verbs (`-->|validates|`)
- Shapes: `[Rectangle]` = service/component, `[(Cylinder)]` = data store, `{{Hexagon}}` = queue, `{Diamond}` = decision, `((Circle))` = external system
- Sequence diagrams: `alt`/`opt`/`loop` for control flow, show error paths

Consult the mermaid-perfectionist references if unsure:
- `references/style-guide.md` for full conventions
- `references/antipatterns.md` for the 21 common mistakes to avoid
- `references/diagram-type-selection.md` if the right diagram type isn't obvious

ASCII art is used for byte-level layouts, cache-line maps, and integration diagrams where Mermaid lacks the right primitives. Apply these standards:
Box drawing: Use Unicode box-drawing characters (┌ ┐ └ ┘ ─ │ ├ ┤ ┬ ┴ ┼ ►)
for clean borders. Never use +---+ ASCII boxes when box-drawing characters are
available.
Alignment: All columns must align precisely. Use fixed-width formatting:
Cache-line boundaries: Use ┐/┘ bracket notation or horizontal rules to
visually delineate 64-byte cache line boundaries. Always annotate which line is
read-hot vs write-hot.
Consistent box widths: In integration diagrams, all boxes at the same level should be the same width. Pad shorter labels with spaces.
Edge labels: Label every arrow with the function name or operation it represents.
Bare → without a label is not allowed.
Template for memory layouts:

    Offset  Field        Size  Purpose    Cache Line
    ─────────────────────────────────────────────────────────────
    0       field_name   1B    [purpose]  ┐
    1       [padding]    7B               │ Line 0
    8       next_field   8B    [purpose]  │ ([hot/cold])
    16      ...          8B    [purpose]  ┘
    24      ...          8B    [purpose]  ┐ Line 1
                         ──               │ ([hot/cold])
                         NN bytes total   ┘

Template for integration diagrams:

    ┌─────────────┐    method_a()     ┌─────────────┐
    │  Module A   ├──────────────────►│  Module B   │
    │             │◄──────────────────┤             │
    │             │    method_b()     │             │
    └──────┬──────┘                   └─────────────┘
           │
           │ method_c()
           ▼
    ┌─────────────┐
    │  Module C   │
    └─────────────┘

Apply in order. Skip phases that don't apply (e.g., Phase 4 for single-threaded code).
Always start here. Cover:
The problem (2-3 sentences): What failure mode or constraint motivated this code? Be specific: "When an NFS mount hangs, every cache lookup blocks — without isolation, one bad mount stalls the entire proxy."
The approach (1 paragraph): What pattern or strategy does the code use?
Integration: Who calls it? What does it call? Where does it sit?
Include an integration diagram showing the module's relationships:
Example (ASCII integration diagram):

    ┌──────────┐    is_open()      ┌────────────┐
    │  Cache   ├──────────────────►│  Circuit   │
    │  Manager │◄──────────────────┤  Breaker   │
    │          │ record_success()  │            │
    └─────┬────┘ record_failure()  └────────────┘
          │                              ▲
          │                              │ force_open()
          │                         ┌────┴─────┐
          └─────────────────────────┤ Watchdog │
                                    └──────────┘

End Phase 1 with: "That's the 30,000-foot view. Ready to look at how it's built, or want to dig into any of the above first?"
Walk through the key data structures:
Group fields by purpose, not declaration order. Categories: hot-path state, immutable config, I/O handles, metrics, lifecycle guards.
Explain non-obvious type choices. Why was this type chosen over the obvious alternative?
Examples of what to surface:
- `AtomicU8` packing 3 states vs 2 separate `AtomicBool` fields → eliminates structurally illegal state combinations (no bit pattern represents open=true, half_open=true)
- `Arc<File>` without `Mutex` → `pread`/`read_at` takes `&self` (no file position mutation), so it's inherently thread-safe without a lock
- `Vec<u8>` write buffer instead of `BufWriter` → need to reclaim the buffer after flush for reuse without reallocation
- `OwnedSemaphorePermit` → permit can cross `.await` points and move into `spawn_blocking` closures; a `SemaphorePermit` borrow cannot

Memory layout diagram when `#[repr(C)]` or field ordering is deliberate:
Example (byte-offset cache-line map):

    Offset  Field              Size  Purpose            Cache Line
    ──────────────────────────────────────────────────────────────
    0       state (AtomicU8)   1B    read-hot           ┐
    1       [padding]          7B                       │ Line 0
    8       threshold (f64)    8B    immutable config   │ (read path)
    16      min_operations     8B    immutable config   │
    24      cooldown_ms        8B    immutable config   ┘
    32      last_trip_time     8B    trip timestamp     ┐ Line 1
    40      error_count        8B    write-hot counter  │ (write path)
    48      total_count        8B    write-hot counter  ┘
                               ──
                               56 bytes → fits one 64-byte cache line

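A byte-offset map like the example above can be checked by the compiler instead of by hand. The following is a minimal sketch, assuming a 64-bit target and Rust 1.77 or later (for `offset_of!`); the struct name `BreakerLayout` and the atomic types chosen for the counters are hypothetical, picked only to mirror the example's field names and sizes.

```rust
use std::mem::{offset_of, size_of};
use std::sync::atomic::{AtomicU64, AtomicU8};

/// Hypothetical struct mirroring the example map; not taken from any real codebase.
#[repr(C)]
pub struct BreakerLayout {
    pub state: AtomicU8,           // offset 0, 1 byte, then 7 bytes of padding
    pub threshold: f64,            // offset 8
    pub min_operations: u64,       // offset 16
    pub cooldown_ms: u64,          // offset 24
    pub last_trip_time: AtomicU64, // offset 32
    pub error_count: AtomicU64,    // offset 40
    pub total_count: AtomicU64,    // offset 48
}

// Compile-time checks: the build fails if a field is added or reordered
// in a way that changes the documented layout.
const _: () = assert!(size_of::<BreakerLayout>() == 56);
const _: () = assert!(offset_of!(BreakerLayout, threshold) == 8);
const _: () = assert!(offset_of!(BreakerLayout, last_trip_time) == 32);
const _: () = assert!(offset_of!(BreakerLayout, total_count) == 48);
```

Quoting assertions like these next to the diagram gives the reader a way to confirm the numbers on their own machine.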
Design alternatives. Name the obvious approach and why it was rejected.
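One of the type choices listed above, `Arc<File>` without a `Mutex`, can be illustrated with a short sketch. It assumes a Unix target (`FileExt::read_at` maps to `pread`); the function name `read_chunk` is hypothetical and the thread stands in for whatever blocking pool the real code uses.

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt; // Unix-only: provides read_at (pread)
use std::sync::Arc;
use std::thread;

/// Hypothetical helper: reads up to `len` bytes at `offset` on another thread.
pub fn read_chunk(file: Arc<File>, offset: u64, len: usize) -> io::Result<Vec<u8>> {
    let handle = thread::spawn(move || -> io::Result<Vec<u8>> {
        let mut buf = vec![0u8; len];
        // read_at takes &self and does not move the kernel file position,
        // so concurrent readers sharing the handle cannot disturb each other.
        let n = file.read_at(&mut buf, offset)?;
        buf.truncate(n); // keep only the bytes actually read
        Ok(buf)
    });
    handle.join().expect("reader thread panicked")
}
```

The obvious alternative, `Arc<Mutex<File>>` with `seek` followed by `read`, would serialize every reader on the lock and cost two syscalls per read.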
For each significant algorithm or state machine:
State diagram (Mermaid) for state machines. Label transitions with both the trigger AND who performs it:
```mermaid
stateDiagram-v2
[*] --> Closed
Closed --> Open : errors/total > threshold<br/>CAS(CLOSED→OPEN) in record_failure()
Open --> HalfOpen : cooldown elapsed<br/>CAS(OPEN→HALF_OPEN) in is_open()
HalfOpen --> Closed : probe success<br/>CAS(HALF_OPEN→CLOSED) in record_success()
HalfOpen --> Open : probe failure<br/>CAS(HALF_OPEN→OPEN) in record_failure()
```
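The transitions in a state diagram like this one map directly onto compare-and-swap calls. Below is a minimal sketch of that mapping; the `Breaker` type, the constant values, and the method names are assumptions made for illustration, and the real code may attach different orderings or extra bookkeeping to each edge.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

pub const CLOSED: u8 = 0;
pub const OPEN: u8 = 1;
pub const HALF_OPEN: u8 = 2;

/// Hypothetical breaker that holds only the packed state byte.
pub struct Breaker {
    state: AtomicU8,
}

impl Breaker {
    pub fn new() -> Self {
        Self { state: AtomicU8::new(CLOSED) }
    }

    pub fn state(&self) -> u8 {
        self.state.load(Ordering::Acquire)
    }

    /// Attempts `from -> to`; returns true only for the thread that won the CAS.
    fn transition(&self, from: u8, to: u8) -> bool {
        // Load-before-CAS guard: skip the CAS when it would fail anyway.
        if self.state.load(Ordering::Relaxed) != from {
            return false;
        }
        self.state
            .compare_exchange(from, to, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
    }

    pub fn trip(&self) -> bool { self.transition(CLOSED, OPEN) }
    pub fn begin_probe(&self) -> bool { self.transition(OPEN, HALF_OPEN) }
    pub fn probe_succeeded(&self) -> bool { self.transition(HALF_OPEN, CLOSED) }
    pub fn probe_failed(&self) -> bool { self.transition(HALF_OPEN, OPEN) }
}
```

Because every edge is a CAS from one specific state, a transition that is not drawn in the diagram (for example Closed to HalfOpen) simply has no code path.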
Fast path vs slow path. Most performance-critical code has both. Identify each and quantify the cost difference:
ldrb + cmp + b.ne on ARM, 3 instructions, ~2 cycles"
The tricks. For every non-obvious optimization, explain in this structure:
Here is a reference table of common tricks to watch for. When you encounter these patterns in the code being explained, flag and explain them:
| Pattern | What to explain |
|---------|-----------------|
| Load-before-CAS guard | Relaxed load short-circuits the expensive CAS (~5-15 cycles on ARM exclusive monitor) when the CAS would fail anyway |
| #[cold] annotation | Tells LLVM this function is rarely called; prevents inlining slow-path code (e.g., SystemTime::now() FFI chain) into hot function's i-cache footprint |
| #[inline(always)] | Forces inlining of tiny helpers to avoid function-call overhead on hot path |
| Multiply-not-divide | errors > threshold * total avoids fdiv (7-12 cycles on ARM Graviton) in favor of fmul (3-4 cycles); algebraically equivalent when total > 0 |
| saturating_sub | Prevents underflow without branching; handles clock skew gracefully |
| Incremental CRC | crc32c_append() updates running checksum per chunk — zero additional memory, no recompute over full body |
| Write coalescing | Accumulates small chunks (8-32 KB HTTP frames) into 1 MiB buffer matching NFS wsize, reducing RPC count ~100x |
| unsafe { set_len(n) } | Skips memset of read buffer (up to 1 MiB); safe when read_at() writes first n bytes and truncate(n) discards the uninitialized tail |
| Manual epoch millis | secs * 1000 + subsec_millis stays in u64; Duration::as_millis() returns u128, emitting 128-bit multiply + carry chain on AArch64 |
| #[repr(C)] field ordering | Deterministic layout for cache-line analysis; separates read-hot fields (checked every request) from write-hot counters |
| Atomic rename | write_to_tmp → fsync → rename makes cache entries visible atomically; readers never see partial files |
| Fire-and-forget cleanup | drop(runtime.spawn_blocking(...)) dispatches without awaiting; errors are harmless (orphan files cleaned by TTL sweeper) |
| pread (read_at) | Single syscall, no Mutex needed, no separate seek; the kernel file position is not modified |
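
Three of the arithmetic rows in the table (multiply-not-divide, manual epoch millis, `saturating_sub`) are small enough to show together. This is a rough sketch with hypothetical function names. The `total > 0` guard is an added assumption, and floating-point rounding means the multiplied form can differ from the divided form in rare edge cases.

```rust
use std::time::Duration;

/// Multiply-not-divide: `errors / total > threshold` rewritten as
/// `errors > threshold * total`, which avoids the division when total > 0.
pub fn over_threshold(errors: u64, total: u64, threshold: f64) -> bool {
    total > 0 && (errors as f64) > threshold * (total as f64)
}

/// Manual epoch millis: stays in u64 rather than the u128 that
/// Duration::as_millis() returns. Overflows only for absurdly long durations.
pub fn millis_u64(d: Duration) -> u64 {
    d.as_secs() * 1000 + u64::from(d.subsec_millis())
}

/// saturating_sub: a timestamp from the "future" (clock skew) yields 0
/// instead of wrapping around to a huge elapsed time.
pub fn elapsed_ms(now_ms: u64, then_ms: u64) -> u64 {
    now_ms.saturating_sub(then_ms)
}
```
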
Only include when the code has non-trivial concurrency (atomics, CAS, shared mutable state across threads).
Threading model. Who calls what from which thread? Mermaid sequence diagram:
```mermaid
sequenceDiagram
participant W as HTTP Worker
participant CB as CircuitBreaker
participant NR as NFS Runtime
W->>CB: is_open() [Relaxed load]
alt Closed (common case)
CB-->>W: false (~2 cycles)
W->>NR: spawn_blocking(pread)
NR-->>W: Ok(data)
W->>CB: record_success()
else Open (degraded shard)
CB-->>W: true
Note over W: bypass cache → origin
end
```
Memory ordering rationale. For each ordering choice, explain what breaks without it. Do not assume the reader knows the C11 memory model. Frame it as consequences:
is_open(), a thread could see state=OPEN while last_trip_time still holds an old value. Then now - 0 >= cooldown evaluates true, immediately transitioning to half-open and defeating the whole point of shard isolation."
Deliberate race windows. Identify races the author intentionally allows, and explain:
Structural invariant enforcement. How does the code make illegal states impossible rather than merely detected?
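The memory ordering failure described above (a reader seeing state=OPEN next to a stale `last_trip_time`) is usually prevented by a Release store paired with an Acquire load. Here is a minimal sketch under stated assumptions: the `TripState` type and its methods are hypothetical, and it assumes a single writer trips the breaker (real code would more likely use a CAS on the state).

```rust
use std::sync::atomic::{AtomicU64, AtomicU8, Ordering};

pub const CLOSED: u8 = 0;
pub const OPEN: u8 = 1;

/// Hypothetical pair of fields: a state byte and the timestamp it guards.
pub struct TripState {
    state: AtomicU8,
    last_trip_ms: AtomicU64,
}

impl TripState {
    pub fn new() -> Self {
        Self { state: AtomicU8::new(CLOSED), last_trip_ms: AtomicU64::new(0) }
    }

    /// Writer: store the timestamp first, then publish OPEN with Release.
    pub fn trip(&self, now_ms: u64) {
        self.last_trip_ms.store(now_ms, Ordering::Relaxed);
        // Release: the timestamp write above is visible to any thread
        // that later observes OPEN through an Acquire load.
        self.state.store(OPEN, Ordering::Release);
    }

    /// Reader: true only when the breaker is open and the cooldown has passed.
    pub fn cooldown_elapsed(&self, now_ms: u64, cooldown_ms: u64) -> bool {
        // Acquire: pairs with the Release store in trip().
        if self.state.load(Ordering::Acquire) != OPEN {
            return false;
        }
        let tripped = self.last_trip_ms.load(Ordering::Relaxed);
        now_ms.saturating_sub(tripped) >= cooldown_ms
    }
}
```

If both operations on `state` were Relaxed, nothing would stop the reader from observing OPEN while still reading the initial timestamp of 0, which is the premature half-open transition the rationale warns about.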
Surface what was given up:
When a file exceeds ~200 lines or has multiple distinct components:
"What does X mean?" — Explain in context of this specific code, not abstractly. If it's a concurrency or memory concept, use a concrete trace: "Thread A does this, Thread B sees that, here's the problem."
"Why not Y instead?" — Show both approaches side by side. Compare concretely: cycles, memory, correctness risk, code complexity.
"Walk me through step by step" — Trace execution with actual values from the code. Show state at each step. Use a table or numbered list with before/after for each atomic operation.
"I don't get the memory ordering / atomics / unsafe" — Back up. Explain the prerequisite concept using a simple failure scenario ("without this ordering, here's what can go wrong"), then return to the specific code.
Always cite file:line so the reader can cross-reference in their editor.