skills/golem-powers/agada-bench/SKILL.md
Standing BrainLayer quality benchmark. Scores live brain_search against the frozen gold corpus for recall@K, MRR, precision@5, placebo rate, and regression vs baseline. Run after BrainLayer PRs, FM fixes, schema changes, or embedder swaps. Rare build mode adds a new user-domain corpus. NOT for non-BL retrieval systems or rubric edits.
npx skillsauth add etanhey/golems agada-benchInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Score live BrainLayer recall against the frozen 4-domain gold corpus. Run whenever BrainLayer changes (PR merges, FM fixes ship, enrichment updates) to detect regressions or confirm improvements.
MISSION = a regression/improvement scorecard for the current BrainLayer build.
Not "ran some brain_search calls." Not "produced numbers."
Done = bench-report.md exists + per-domain breakdown + regression diff vs prior baseline + brain_store TASK_DONE.
Anything short of a complete scorecard with provenance-tagged hits is a partial run — flag it loudly, don't summarize over it.
/agada-bench runs the benchScore live BL recall against the standing gold. Use this 95% of the time. Triggers: post-PR-merge audit, weekly drift check, post-enrichment-schema-change verification, post-embedder-swap regression sweep.
See workflows/run-bench.md for the 6-wave procedure.
/agada-bench build extends the standing corpusAdd a new user-domain to the frozen gold. Used ~a few times a year when Etan's life adds a new topic worth benchmarking (Health/WHOOP, Family, a new business, etc.). Spawns 3 judges in cmux panes, runs the full 7-phase build pipeline.
See workflows/build-new-domain.md for the build pipeline.
| Is | Isn't |
|---|---|
| The standing scorecard for BrainLayer health. | A one-off query test. Just call brain_search yourself. |
| Anti-placebo (every hit is classified true_hit / echo_fm11 / downstream / uncertain / metadata_gap). | A vanilla recall@K calculator. Provenance matters more than the number. |
| Frozen gold: 4 domains × 67 queries × 198 pairs (53 L2+L3 relevant). | A flexible-rubric eval. Rubric is v1.1 locked. |
| Regression-diff-aware (compare vs prior bench summary JSON). | A trend monitor. Each run produces its summary; the operator chains them by passing --baseline. |
| Operates with Python helper + Claude/Codex session driving live brain_search via MCP. | A pure CLI tool. Python can't reach the MCP; live calls happen in the operator's session. |
# Run-anywhere invocation from a Claude/Codex session with BL MCP
/agada-bench
# Score only a subset of domains (e.g., skip architecture's LOW-POWER noise explicitly)
/agada-bench --domains techgym,freelance,recruiting
# With baseline for regression diff
/agada-bench --baseline ~/Gits/orchestrator/docs.local/audits/2026-05-15-bench-summary.json
# Custom output dir + K values
/agada-bench --output-dir ~/Gits/orchestrator/docs.local/audits/ --k 1,3,5,10,20,50
# Include LOW-POWER architecture in aggregates (default: excluded)
/agada-bench --include-low-power
What the skill does under the hood:
scripts/run-bench.py prepare loads the 4 frozen gold paths + each domain's phase-0b-corpus/corpus.jsonl → emits bench-queries.jsonl with 67 queries + verbatim brain_search args.workflows/run-bench.md and executes Waves 1–6: fires live brain_search per query, classifies each returned chunk for provenance (anti-placebo), appends to bench-results.jsonl.scripts/run-bench.py score reads results, computes recall@K / MRR / precision@5 / placebo rate / regression diff → emits <date>-brainlayer-quality-bench-results.md + <date>-bench-summary.json.# Add a new domain to the standing corpus
/agada-bench build --session ~/.claude/projects/.../<new-session>.jsonl --domain health-whoop
# With 4-way panel (Runs 2-4 historical setup)
/agada-bench build \
--session <jsonl> \
--domain family \
--judges claude,codex,cursor,gemini \
--schema v1.1-3p-1s \
--primary-judges claude,codex,cursor \
--shadow-judge gemini
See workflows/build-new-domain.md for the full 7-phase build pipeline (corpus extract → judge dispatch → liveness gate → crossref → κ → gold-lock → pending-RT).
| Domain | Gold rows | L2+L3 pairs | Power | Source |
|---|---:|---:|---|---|
| techgym | 58 | 33 | HIGH | ~/Gits/orchestrator/docs.local/plans/2026-05-15-agada-bench-4way-judge/phase-3-gold/gold.jsonl |
| freelance | 53 | 13 | HIGH | ~/Gits/orchestrator/docs.local/plans/2026-05-16-agada-run-2-freelance/phase-3-gold/gold.jsonl |
| recruiting | 47 | 6 | BORDERLINE | ~/Gits/orchestrator/docs.local/plans/2026-05-16-agada-run-3-recruiting/phase-3-gold/gold.jsonl |
| architecture | 40 | 1 | LOW-POWER | ~/Gits/orchestrator/docs.local/plans/2026-05-16-agada-run-4-architecture/phase-3-gold/gold.jsonl |
| Total | 198 | 53 | — | 67 unique queries |
LOW-POWER means recall@K is statistically meaningless on that domain (single positive sample). Excluded from cross-domain aggregates by default.
Verbatim brain_search args live in each domain's phase-0b-corpus/corpus.jsonl (sibling dir of gold). See references/prior-runs.md for the full provenance.
| Setting | Default | Why | Override condition |
|---|---|---|---|
| Domains | all 4 | Most representative spread. | --domains to subset; useful when isolating a single-domain regression. |
| LOW-POWER inclusion | excluded | Run 4 (architecture) has L2+L3=1 → recall@K noise dominates. | --include-low-power only when specifically auditing architecture. |
| K values | 1,3,5,10,20,50 | Standard IR eval set. | --k to override. |
| Anti-placebo | mandatory | All BL retrievals must be classified (true_hit / echo_fm11 / downstream / uncertain / metadata_gap). | Cannot disable. Skipping placebo classification = corrupt scorecard. |
| Rubric | v1.1 frozen | Tonight's gold uses v1.1. Don't iterate. | v2 trigger (see references/roadmap-v2.md). |
| brain_store on completion | yes | TASK_DONE chunk with headline metrics. | --no-store for dry-run smokes. |
When extending the standing corpus with a new domain, the same defaults from v1 build apply:
| Setting | Default | Why |
|---|---|---|
| --judges | claude,codex,gemini (3) | W3.2 κ matrix — codex is most independent voter; cursor most redundant with claude (κ=0.811 on Runs 2-4 avg / 0.915 on Run 1 alone). |
| --schema | v1.1-3p (3 primary, no shadow) | Matches the default 3-judge panel; cleaner than the 3-primary-+-1-shadow split. |
| --rubric-version | v1.1 | FM12 = 0/145 in production. |
| --liveness-check | strict (≥95% rows, fail-loud, exit 2) | Closes W3.1's silent-gemini-absent hole from Run 4. |
| --pending-rt-cascade | opus-4-7 | W3.3: 12 of 13 v1.1 pending-RT cases are FM6-PreCompact single-judge outliers; cheap Opus resolves. |
| Tiebreaker | claudeJudge | Cleanest calibration (mean 86.6); 0 hallucinations on Runs 2/3/4. |
See references/judge-panel.md for the full κ rationale.
brain_search ad-hoc.Auto-extraction from session JSONL only has access to what BL renders in the brain_search tool_result text:
rt-0c2e3cb8-), not the full chunk_id. Tonight's hand-curated runs used a live BL DB lookup to resolve prefixes; v1 doesn't.All gated by the same fix: extract-corpus.py needs a live brain_recall(chunk_id_prefix) lookup to resolve prefixes and pull full bodies. That's a v2 task; see references/roadmap-v2.md Phase A.5.
workflows/run-bench.md. Errors here corrupt every downstream metric.Per W1.7's skeptical pass, v1.1 is SHIP. v2 triggers only when the flip conditions in references/roadmap-v2.md fire (FM12 > 0 on any run, second judge dies, etc.). v2 buildout sketches:
brain_recall (extract-corpus.py limitation)This skill is referenced by:
/orc — when sweeping multi-domain regressions.This skill references:
/never-fabricate — anti-placebo classification is mandatory; no "true_hit" without verified chunk_created_iso vs query_asked_at./cmux-agents — only build mode uses cmux pane spawn (for judge dispatch); bench mode runs entirely in the operator's session./superpowers:verification-before-completion — bench-report.md verdict must say GREEN | YELLOW | RED based on actual numbers, not approximations.# Run inside a Claude/Codex session with BL MCP connected
python3 ~/Gits/golems/skills/golem-powers/agada-bench/scripts/run-bench.py prepare \
--gold techgym:~/Gits/orchestrator/docs.local/plans/2026-05-15-agada-bench-4way-judge/phase-3-gold/gold.jsonl \
--gold freelance:~/Gits/orchestrator/docs.local/plans/2026-05-16-agada-run-2-freelance/phase-3-gold/gold.jsonl \
--gold recruiting:~/Gits/orchestrator/docs.local/plans/2026-05-16-agada-run-3-recruiting/phase-3-gold/gold.jsonl \
--gold architecture:~/Gits/orchestrator/docs.local/plans/2026-05-16-agada-run-4-architecture/phase-3-gold/gold.jsonl \
--output ~/Gits/orchestrator/docs.local/audits/$(date +%Y-%m-%d)-bench-queries.jsonl
# Operator: read workflows/run-bench.md, do the 6 waves of brain_search, write bench-results.jsonl
python3 ~/Gits/golems/skills/golem-powers/agada-bench/scripts/run-bench.py score \
--queries ~/Gits/orchestrator/docs.local/audits/$(date +%Y-%m-%d)-bench-queries.jsonl \
--results ~/Gits/orchestrator/docs.local/audits/$(date +%Y-%m-%d)-bench-results.jsonl \
--output ~/Gits/orchestrator/docs.local/audits/$(date +%Y-%m-%d)-brainlayer-quality-bench-results.md \
--json-out ~/Gits/orchestrator/docs.local/audits/$(date +%Y-%m-%d)-bench-summary.json \
--baseline <previous-summary.json>
Or via the top-level driver:
bash ~/Gits/golems/skills/golem-powers/agada-bench/scripts/run-agada.sh
# Add a new domain (run once per new domain — a few times a year)
bash ~/Gits/golems/skills/golem-powers/agada-bench/scripts/run-agada.sh build \
--session <new-domain-session.jsonl> \
--domain <new-domain-name> \
--output ~/Gits/orchestrator/docs.local/plans/2026-MM-DD-agada-<new-domain>/
See references/prior-runs.md for the 4 standing-corpus domains and the dates they were locked.
End SKILL.md. Primary workflow: workflows/run-bench.md. Build workflow: workflows/build-new-domain.md. Per-judge briefs: adapters/. Standing-gold provenance: references/prior-runs.md.
development
Create, edit, and verify golem-powers skills using the standard SKILL.md structure, workflow files, adapters, templates, and eval fixtures. Use for new skills, structural edits, workflows/adapters, and pre-deploy validation. NOT for invoking existing skills, superpowers skills, or skill-creator agent workflows.
testing
Extract structured knowledge from any video source — YouTube URLs or local screen recordings. YouTube → gems workflow (yt-dlp transcript → keyword hotspots → frame extract → brain_digest → structured gems). Screen recordings → QA workflow (reuses /qa-video stalker pipeline). Use when user shares a YouTube link wanting deep extraction with frames, shares a .mov/.mp4 for QA processing, says "extract from video", "video gems", "process this recording", or mentions gem extraction from video content.
testing
Use when running or reviewing any recurring monitor loop for merge queues, worker queues, collab tails, or agent completion. Enforces drive-to-completion ticks: every tick must query live state with `!`, classify whether real progress happened, and then dispatch, verify-and-decrement, or escalate-park. Triggers on: monitor loop, /loop, recurring tick, keep monitoring, silent autonomous, merge gate, blocked review, no-progress loop.
tools
MeHayom freelance client management — daily updates, decision tracking, time logging. Use when drafting Yuval updates, logging scope changes, tracking hours, or any MeHayom client communication. Triggers: 'draft Yuval update', 'client update', 'daily update', 'log decision', 'track time', 'mehayom'.