skills/golem-powers/architectural-conformance-audit/SKILL.md
Pre-R0 sprint gate that diffs implementation vs SOTA research output verbatim. Surfaces cited counter-examples and architectural mismatches before sprint hooks fire. Triggers: 'before R0', 'architectural audit', 'verify against research'. NOT for per-PR review or post-merge.
npx skillsauth add etanhey/golems architectural-conformance-audit
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Wave evidence (severity 10, 4 corroborating digests): AP1 root cause was architectural ground-truth blindness. Researcher's SOTA output at idx [380] LITERALLY cited Letta-on-FastAPI as a counter-example, yet R0→R5 sprint hooks optimized within the daemon assumption anyway. ~35h misdirected work + 2.5h explicit correction. PR #312 fixed it in code (FastAPI daemon deleted, socket-direct CLI added, merged May 22 11:39Z). See pain-points/_consolidated.md, Pattern 4, for the full chronology. This skill prevents AP1-class recurrence procedurally.
This skill fires at a specific sprint moment: the kickoff of a new R0→R5 sprint OR a large-plan phase that builds atop a research output (whichever fires first in a given build arc). It is NOT a per-PR check; it is a per-sprint gate.
research/*.md or similar).
The audit produces three artifacts before any R0 work begins:
For each cited research output:
research/<topic>-research.md or equivalent).
Output → docs.local/audits/<sprint>/<date>-sota-excerpt.md.
For the current implementation:
Output → docs.local/audits/<sprint>/<date>-impl-map.md.
For each (SOTA excerpt × impl primitive) pair:
The gate rule: if ANY DIVERGE — UNJUSTIFIED exists → SPRINT R0 IS BLOCKED until the divergence is either reconciled (impl changed) or justified (rationale documented + brain_store'd).
Output → docs.local/audits/<sprint>/<date>-conformance-verdict.md.
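The gate rule and the three output paths can be sketched as follows. This is a minimal illustration, not the skill's actual implementation: the `Verdict` enum, `r0_gate`, and `artifact_paths` are hypothetical names, and the date format is whatever the caller passes in.

```python
from enum import Enum


class Verdict(Enum):
    # Hypothetical enum; members mirror the verdict labels this document uses.
    MATCH = "match"
    DIVERGE_JUSTIFIED = "diverge-justified"
    DIVERGE_UNJUSTIFIED = "diverge-unjustified"
    NOT_APPLICABLE = "n/a"


def r0_gate(verdicts):
    """Block R0 if any (SOTA excerpt x impl primitive) pair diverges without justification."""
    if any(v is Verdict.DIVERGE_UNJUSTIFIED for v in verdicts):
        return "R0 BLOCKED"
    return "R0 CLEARED"


def artifact_paths(sprint, date):
    """Build the three audit artifact paths for one sprint and date."""
    base = f"docs.local/audits/{sprint}/{date}"
    kinds = ("sota-excerpt", "impl-map", "conformance-verdict")
    return [f"{base}-{kind}.md" for kind in kinds]
```

Note that one unjustified divergence blocks the sprint no matter how many other pairs match.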
Canonical scan order (most-recent on tie via ls -lat):
ls -lat research/*.md 2>/dev/null
ls -lat docs.local/research/*.md 2>/dev/null
ls -lat ~/Gits/orchestrator/docs.local/research/*.md 2>/dev/null
ls -lat ~/Gits/orchestrator/docs.local/handoffs/**/research/*.md 2>/dev/null
If multiple SOTA outputs conflict, the audit MUST list both and require Etan to pick the canonical source before proceeding. Do NOT auto-canonicalize by date — staleness is the AP1 root cause.
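A sketch of the no-auto-canonicalize rule, assuming the auditor has already judged which SOTA outputs conflict (that judgment is not automated here). `CanonicalSourceRequired` and `resolve_canonical` are hypothetical names introduced for illustration.

```python
from pathlib import Path


class CanonicalSourceRequired(RuntimeError):
    """Raised when conflicting SOTA outputs exist and nobody has picked one."""


def resolve_canonical(conflicting, human_choice=None):
    """Return the canonical SOTA file; never fall back to the most recent one."""
    if not conflicting:
        raise FileNotFoundError("no SOTA research output found")
    if len(conflicting) == 1:
        return conflicting[0]
    if human_choice is None:
        # List every conflicting file so the verdict can reference all of them.
        listing = ", ".join(str(p) for p in conflicting)
        raise CanonicalSourceRequired(f"gate held, pick one of: {listing}")
    if human_choice not in conflicting:
        raise ValueError("choice is not one of the conflicting outputs")
    return human_choice
```

The function deliberately has no date-based branch: modification time can order the listing, but it never decides the winner.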
Read the SOTA output in full. For each architectural claim, extract:
[N] reference).
If extraction grows large or repetitive, follow workflows/extract-claims.md.
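As a starting point for extraction, the sketch below pulls every line that carries a bracketed numeric reference and keeps it verbatim with its line number. It assumes citations appear as `[N]` with N an integer; `SotaClaim` and `extract_cited_lines` are hypothetical names, and deciding which cited lines are architectural claims remains a manual step.

```python
import re
from dataclasses import dataclass

CITATION = re.compile(r"\[(\d+)\]")  # assumed citation shape: [380]


@dataclass(frozen=True)
class SotaClaim:
    line_no: int      # 1-based line number in the research file
    text: str         # the line as written, only outer whitespace trimmed
    citations: tuple  # integer indices found on the line


def extract_cited_lines(research_text):
    """Return one SotaClaim per line that contains at least one [N] reference."""
    claims = []
    for line_no, line in enumerate(research_text.splitlines(), start=1):
        refs = tuple(int(n) for n in CITATION.findall(line))
        if refs:
            claims.append(SotaClaim(line_no, line.strip(), refs))
    return claims
```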
Per architectural primitive in scope:
find src packages -name "daemon.py" -o -name "service.py" -o -name "*.service.ts" | xargs wc -l
grep -rn "^from fastapi\|^import fastapi\|from socketio\|import asyncio" src packages 2>/dev/null
grep -rn "mcp__server__\|@server\.tool\|@server\.resource" src packages 2>/dev/null
For each primitive found: file path + line range; direct vs. transitive (inherited) authorship; first-principles vs. scaffolded. If mapping grows large, follow workflows/map-impl.md.
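The grep pass above can also be expressed in Python so that each hit already carries the file path and line number the impl map needs. This is a sketch under assumptions: the two patterns are adapted from the grep commands (slightly stricter, with a word boundary after `fastapi`), and `PRIMITIVE_PATTERNS`, `scan_text`, and `map_primitives` are hypothetical names.

```python
import re
from pathlib import Path

# Adapted from the grep patterns above; extend per primitive in scope.
PRIMITIVE_PATTERNS = {
    "fastapi": re.compile(r"^(?:from|import) fastapi\b"),
    "mcp-server": re.compile(r"mcp__server__|@server\.(?:tool|resource)"),
}


def scan_text(path, text):
    """Return (primitive, path, line_no, line) for every pattern hit in text."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PRIMITIVE_PATTERNS.items():
            if pattern.search(line):
                hits.append((name, path, line_no, line.strip()))
    return hits


def map_primitives(root, suffixes=(".py", ".ts")):
    """Scan every source file under root and collect primitive hits."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8", errors="replace")
            hits.extend(scan_text(str(path), text))
    return hits
```

Authorship (direct vs. transitive) and first-principles vs. scaffolded still have to be filled in by reading the hits; the scan only locates them.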
For each (SOTA claim, impl primitive):
git log -p --follow <impl-file> and brain_search "<primitive> chosen over <alternative>").
DIVERGE — UNJUSTIFIED → R0 BLOCKED. Surface to Etan + sprint LEAD with verbatim SOTA cite + impl divergence + proposed reconciliation path (change impl OR document rationale).
MATCH | DIVERGE — JUSTIFIED | N/A → R0 CLEARED. brain_store the audit verdict at importance ≥8 with tags [architectural-audit, <sprint>, R0-cleared]. Composes with /brain-store-fallback for transport failures.
There is no --override flag. Per gen-8 decision: document or change impl; those are the only two paths. Footgun risk too high. If override is later deemed necessary, it must brain_store at importance 10 with tag [audit-override] + verbatim rationale.
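The stored verdict can be pictured as a small record. The sketch below only assembles the fields; it does not call brain_store, whose real parameter names are not given in this document, so the keys `content`, `importance`, and `tags` are assumptions, as is the function name `verdict_record`.

```python
def verdict_record(sprint, blocked, verdict_path):
    """Assemble the fields for storing an audit verdict (hypothetical shape)."""
    status = "R0-blocked" if blocked else "R0-cleared"
    return {
        # Point at the verdict file so the stored memory stays traceable.
        "content": f"Architectural conformance audit for {sprint}: {status} ({verdict_path})",
        "importance": 8,  # this document requires importance >= 8
        "tags": ["architectural-audit", sprint, status],
    }
```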
The historical case. Researcher's output at idx [380] existed; multiple agents READ it (researcher, R5 evaluator, Codex workers, Cursor auditors). Nobody DIFFED it against daemon.py. The audit MUST produce a literal pairwise diff, not "I read the research."
Architectural assumptions decay. Each sprint must re-audit. If nothing changed, the audit is fast (re-cite the prior verdict). If something changed, the audit catches it.
SOTA outputs themselves can be wrong or stale. The audit's job is conformance — does impl match SOTA? — NOT validation that SOTA is correct. If SOTA is wrong, that's a separate research-correction sprint (and worth flagging).
Code review reads the diff and checks craft. This audit reads the architecture and checks first-principles alignment. They compose; this fires BEFORE R0, code review fires DURING R3.
If audits routinely come back MATCH for everything with no friction, suspect false-pass. The audit's value comes from catching real DIVERGE cases. If 3 consecutive sprints show all-MATCH, run a meta-audit on the auditor (was it reading the right SOTA? was it checking the right primitives?).
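The three-sprint false-pass check can be sketched as below. It assumes each sprint's audit is recorded as a list of verdict label strings, oldest sprint first, and it treats only the literal label `MATCH` as a match (an `N/A` breaks the streak). `needs_meta_audit` is a hypothetical name.

```python
def needs_meta_audit(sprint_verdicts, streak=3):
    """True when each of the last `streak` sprints returned MATCH for every pair."""
    if len(sprint_verdicts) < streak:
        return False
    recent = sprint_verdicts[-streak:]
    # An empty audit is not counted as all-MATCH.
    return all(
        len(verdicts) > 0 and all(v == "MATCH" for v in verdicts)
        for verdicts in recent
    )
```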
- /never-fabricate: audit verdicts cite specific file paths + line ranges; never-fabricate enforces that those citations are real. This skill is the architectural-level fabrication guard; never-fabricate is the file-level guard.
- /brain-store-fallback (SHIP-2, merged): audit verdicts get stored at importance ≥8; brain-store-fallback handles transport failures during storage. Mandatory composition.
- /coderabbit: composes downstream; coderabbit fires per-PR, this skill fires per-sprint.
- /plan-validate: adjacent skill (general assumption checks); plan-validate is general, this is architecture-specific.
- /large-plan/workflows/scaffold: the audit is a pre-R0 step in scaffold.md so it isn't skipped by oversight.
- /orc: orc invokes this skill at sprint kickoff when Tier-1 triggers fire.

evals/evals.json)

| # | Scenario | Without skill (baseline) | With skill (target) | Assertion |
|---|---|---|---|---|
| 1 | SOTA recommends socket-direct; impl uses FastAPI HTTP (AP1 re-creation) | R5 graded local-optimum 8.85/10; mismatch slips through | DIVERGE — UNJUSTIFIED; R0 blocks until reconciled | Verdict file lists FastAPI primitive with counter-example cite |
| 2 | SOTA recommends X; impl uses X | No-op | MATCH; R0 clears | Verdict shows MATCH; brain_store fires |
| 3 | SOTA silent on primitive Z; impl uses Z | No-op | N/A for Z (doesn't block) | Verdict file shows N/A for Z |
| 4 | Two SOTA outputs conflict | Stale SOTA used silently | Audit lists both; gate held pending Etan pick | Both files referenced; no auto-canonicalize |
| 5 | Mixed MATCH/DIVERGE across multiple primitives | Slips through | Lists all; ANY UNJUSTIFIED blocks | R0 blocked even if 9/10 MATCH |
The AP1 re-creation eval (scenario 1) is load-bearing. Fixture: real-world excerpt from the May 2026 brainlayer-readpath research output + synthetic FastAPI daemon snippet mimicking the deleted PR-α daemon.py.
Smoke test (retrospective): run against the current brainlayer codebase POST-PR #312. Expected: MATCH on socket-direct primitive (the audit retrospectively confirms the fix held). Note: smoke is read-only against ~/Gits/brainlayer/ per cross-repo constraint.
docs.local/audits/<sprint>/<date>-{sota-excerpt,impl-map,conformance-verdict}.md.
brain_store'd at importance ≥8 with tags [architectural-audit, <sprint>, R0-cleared|R0-blocked]. Use /brain-store-fallback if BL transport fails.
/coderabbit, /large-plan/scaffold.
Per consolidated.md Pattern 4 system-fix:
"R5 evaluator skill change: must include 'goal-envelope check' — score against the original parent objective, not the sprint's local optimization."
This skill does NOT modify the R5 evaluator (it's a separate skill change, tracked as a future SHIP-9 candidate). This skill surfaces the parent-objective in the audit so the R5 evaluator has a referenceable target. The two compose; they ship independently.
/brain-store-fallback and report the fallback file path in the verdict.