skills/golem-powers/judge-fleet/SKILL.md
Bulk LLM-judging protocol for fleet-dispatched verdict runs (KG cluster, eval harness). Use when: dispatching or running judge workers (J1/J2/RT), planning bulk-apply from verdict JSONL, or triaging evidence_degraded outputs. Triggers: judge fleet, bulk judge, R3 verdicts, kg-judge, RT gate, evidence_degraded. NOT for: single-item code review, Phoenix view UX (use phoenix-human-view), or non-judge eval pipelines.
Install this skill globally with one command: `npx skillsauth add etanhey/golems judge-fleet`. Works with Claude Code, Cursor, and Windsurf.
Three R3 runs (morning + evening 2026-06-06) proved seven non-negotiables. A generic "judge these N items" dispatch loses artifacts, degrades silently, and bulk-applies refuted merges. This skill encodes what the rerun briefs already harden — so agents don't re-learn from /tmp wipes and DB locks.
NOT for: one-off PR review, Phoenix annotation UX, or skills that don't produce verdict artifacts.
Before fan-out, verify:
1. brain_search (MCP or CLI) responds on one stem; on timeout or lock → HOLD, and do not dispatch judges concurrently with enrichment.
2. The staging directory exists (under eval_results/, not ephemeral).

Evidence: an enrichment-locked DB nullified 308-verdict runs twice (79% morning, 100% evening evidence_degraded).
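A minimal sketch of the pre-flight gate, assuming the caller supplies its own probe. The `probe` callable stands in for one brain_search call on a single stem (its real invocation is not specified here), and `preflight`, `staging`, and `expected_prompts` are hypothetical names. The prompt-count check mirrors the PRE-FLIGHT line of the quick reference.

```python
from pathlib import Path
from typing import Callable


def preflight(probe: Callable[[], None], staging: Path, expected_prompts: int) -> list[str]:
    """Return blocking problems; an empty list means fan-out may proceed."""
    problems = []
    try:
        probe()  # hypothetical: one brain_search call on a single stem
    except Exception as exc:  # timeout or DB lock -> HOLD, do not dispatch
        problems.append(f"HOLD: brain_search probe failed ({exc})")
    if not staging.is_dir():
        problems.append(f"staging dir missing: {staging}")
    prompts_dir = staging / "prompts"
    found = len(list(prompts_dir.iterdir())) if prompts_dir.is_dir() else 0
    if found != expected_prompts:
        problems.append(f"prompt count {found} != expected {expected_prompts}")
    return problems
```

Dispatch only when the returned list is empty; any entry starting with `HOLD` means enrichment may still own the DB.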
All verdicts, sidecars, RT results, and DONE sentinels live under a durable repo path, never /tmp:
eval_results/<campaign>/prompts/
eval_results/<campaign>/verdicts/
eval_results/<campaign>/rt-mandatory/
eval_results/<campaign>/DONE/
Multi-hour judge runs never stage in /tmp — overnight wipes lost 9/21 deliverables.
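The layout above could be created by a small helper along these lines. This is a sketch, not the skill's own tooling: `staging_layout`, `is_under_tmp`, and `ensure_staging` are hypothetical names, and the guard only recognizes the literal /tmp root.

```python
from pathlib import Path

SUBDIRS = ("prompts", "verdicts", "rt-mandatory", "DONE")


def staging_layout(repo_root: Path, campaign: str) -> dict[str, Path]:
    """Map each staging subdirectory name to its path under eval_results/<campaign>/."""
    base = repo_root / "eval_results" / campaign
    return {name: base / name for name in SUBDIRS}


def is_under_tmp(path: Path) -> bool:
    """True if the resolved path is /tmp itself or lives anywhere below it."""
    tmp = Path("/tmp").resolve()
    resolved = path.resolve()
    return resolved == tmp or tmp in resolved.parents


def ensure_staging(repo_root: Path, campaign: str) -> dict[str, Path]:
    """Create the staging tree, refusing ephemeral locations."""
    layout = staging_layout(repo_root, campaign)
    if any(is_under_tmp(p) for p in layout.values()):
        raise ValueError(f"refusing to stage under /tmp: {repo_root}")
    for p in layout.values():
        p.mkdir(parents=True, exist_ok=True)
    return layout
```

Failing loudly at setup time is cheaper than discovering a wiped directory the next morning.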
Each prompt gets individual LLM reasoning with cited evidence.
Forbidden: batch Python with hardcoded TECHNOLOGY_STEMS / lookup-table classifiers, regex-rule refutation scripts masquerading as RT, or collapsing "brain_search retry-once per stem" into 2–6 representative searches per batch.
Scripts for validation (schema check, set-diff coverage) are fine; scripts that produce verdicts are not.
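As an illustration of the allowed kind of script, the sketch below checks schema and set-diff coverage without producing any verdict. The field names in `REQUIRED` and the function name `validate` are assumptions; the real verdict schema is not shown in this skill.

```python
import json

# Hypothetical required fields; replace with the campaign's real schema.
REQUIRED = {"prompt_id", "verdict", "evidence_degraded"}


def validate(lines: list[str], expected_ids: set[str]) -> dict:
    """Schema check plus set-diff coverage over verdict JSONL lines."""
    seen: set[str] = set()
    malformed: list[int] = []
    for lineno, line in enumerate(lines, 1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            malformed.append(lineno)
            continue
        if not isinstance(record, dict) or not REQUIRED <= record.keys():
            malformed.append(lineno)
            continue
        seen.add(record["prompt_id"])
    return {
        "malformed_lines": malformed,
        "missing": sorted(expected_ids - seen),      # prompts with no verdict
        "unexpected": sorted(seen - expected_ids),   # verdicts for unknown prompts
    }
```

The script only reports gaps; it never assigns or rewrites a verdict value.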
Cross-worker completion gates use sentinel files, not terminal grep or chat markers:
eval_results/<campaign>/DONE/J1.done
eval_results/<campaign>/DONE/J2.done
eval_results/<campaign>/DONE/RT_MANDATORY.done
R3_J1_DONE printed only in final chat was never observable to RT — file-count heuristics are a fallback, not the protocol.
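A sketch of the sentinel handshake, assuming workers share the staging filesystem. `mark_done` and `wait_for` are hypothetical helper names; the timestamp written into the sentinel is an illustrative choice, not a documented format.

```python
import time
from pathlib import Path


def mark_done(done_dir: Path, worker: str) -> Path:
    """Write <worker>.done once the worker's verdicts are fully on disk."""
    done_dir.mkdir(parents=True, exist_ok=True)
    sentinel = done_dir / f"{worker}.done"
    stamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
    sentinel.write_text(stamp + "\n")
    return sentinel


def wait_for(done_dir: Path, workers: list[str], timeout_s: float, poll_s: float = 5.0) -> bool:
    """Poll for every worker's sentinel; False if the deadline passes first."""
    deadline = time.monotonic() + timeout_s
    while True:
        if all((done_dir / f"{w}.done").exists() for w in workers):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

RT would call `wait_for(done_dir, ["J1", "J2"], ...)` and proceed only on `True`, so the gate depends on files any worker can observe rather than on another worker's chat output.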
Workers report learnings via append-only writes — one section per worker/batch:
### 2026-06-06T09:35Z J2 (prompts 155-308)
...
Forbidden: concurrent StrReplace on a shared anchor in one collab file (6 workers → repeated anchor-miss retries). Prefer per-worker section files or atomic append to distinct headings.
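One way to get append-only writes is a single `os.write` on a descriptor opened with `O_APPEND`. The sketch assumes a POSIX local filesystem, where each such write lands at the current end of file; network filesystems may not honor this. `append_section` is a hypothetical helper name.

```python
import os
from pathlib import Path


def append_section(collab: Path, heading: str, body: str) -> None:
    """Append one '### heading' section in a single write, never editing existing text."""
    block = f"\n### {heading}\n{body.rstrip()}\n".encode("utf-8")
    fd = os.open(collab, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        os.write(fd, block)  # one small write, positioned at end of file by O_APPEND
    finally:
        os.close(fd)
```

Because no worker searches for or replaces an anchor, there is nothing to miss when six workers write at once; per-worker section files remain the simpler option when the filesystem's append behavior is in doubt.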
evidence_degraded honesty flag

When brain_search fails (DB lock, timeout), every affected verdict MUST:
- set evidence_degraded: true
- be flagged in the collab notes

Bulk-apply MUST treat evidence_degraded verdicts as a filter: do not silently merge degraded evidence as if live memory confirmed it.
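The filter could look like the sketch below. Only the `evidence_degraded` field comes from this skill; the function name `split_degraded` and the choice to treat a missing flag as degraded are assumptions made to stay conservative.

```python
import json


def split_degraded(jsonl_lines: list[str]) -> tuple[list[dict], list[dict]]:
    """Partition verdict records into (live, degraded) before any bulk-apply."""
    live, degraded = [], []
    for line in jsonl_lines:
        if not line.strip():
            continue  # tolerate blank lines in the JSONL file
        record = json.loads(line)
        # Assumption: a record without the flag is treated as degraded.
        if record.get("evidence_degraded", True):
            degraded.append(record)
        else:
            live.append(record)
    return live, degraded
```

Only the `live` list is a candidate for merge; the `degraded` list goes back for re-judging once brain_search is healthy.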
Never bulk-apply straight from judge verdicts.
Run RT on the riskiest subset first (degraded + medium-confidence + merge recommendations), then run a Phase-2 continuous sweep across all verdicts. Re-judge REFUTE entries before any merge. Historical refute rates: ~49% of RT-mandatory stems, ~41% Phase-2.
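The selection and gating logic might be sketched as follows. The record fields `stem`, `confidence`, and `recommendation`, the `CONFIRM` outcome label, and both function names are hypothetical; only `evidence_degraded` and `REFUTE` appear in this skill.

```python
def rt_mandatory(verdicts: list[dict]) -> list[dict]:
    """Riskiest subset: degraded, medium-confidence, or merge-recommending verdicts."""
    return [
        v for v in verdicts
        if v.get("evidence_degraded")
        or v.get("confidence") == "medium"
        or v.get("recommendation") == "merge"
    ]


def blocked_from_merge(verdicts: list[dict], rt_outcomes: dict[str, str]) -> list[str]:
    """Stems that must not be bulk-applied: no RT outcome yet, or RT said REFUTE."""
    return sorted(
        v["stem"] for v in verdicts
        if rt_outcomes.get(v["stem"]) in (None, "REFUTE")
    )
```

A stem leaves the blocked list only after RT has produced a non-REFUTE outcome for it, which keeps refuted merges out of the bulk-apply input.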
PRE-FLIGHT: brain_search probe OK? staging dir exists? prompt count verified?
STAGING: eval_results/<campaign>/ — NEVER /tmp
WORKERS: per-prompt LLM reasoning; validation scripts OK, verdict scripts NOT OK
DONE: write eval_results/<campaign>/DONE/<worker>.done — do NOT rely on chat markers
COLLAB: append-only per-worker sections — no shared-anchor StrReplace
HONESTY: evidence_degraded when brain_search fails — flag in collab
MERGE: RT gate complete; REFUTE re-judged; filter degraded before bulk-apply
| Skill | Relationship |
|---|---|
| /never-fabricate | Read verdict files before claiming counts; no synthesized completion times |
| /cron-payload-discipline | Monitor ticks waiting on judge fleet use live file counts + DONE sentinels, not hardcoded "154/154 done" |
| /cmux-agents | Dispatch briefs must inline absolute staging paths and precondition steps |
| /pr-loop | Skill changes ship through full PR loop with eval scorecard in body |
| /skill-creator | RED/GREEN evals required before merge |
| Don't | Evidence |
|---|---|
| Dispatch judges while enrichment holds DB lock | 100% evidence_degraded, ~500 duplicate judgments |
| Stage verdicts in /tmp | 9/21 deliverables lost to wipe |
| judge_j2.py lookup-table batch classify | Invalid enum + misclassifications; caught only post-hoc |
| RT polls sleep 120 waiting on J1 chat DONE | R3_J1_DONE never in terminal; fragile gating |
| StrReplace shared collab anchor with 6 workers | Anchor-miss retries; luck-dependent no-duplicates |
| Bulk-apply without RT | 34/70 refuted (49%); ~126/308 Phase-2 REFUTE |