Adoption

Agent Skills are supported by leading AI development tools.

VS Code, Gemini CLI, GitHub, Goose, Amp, Cursor, Claude Code, Letta, OpenCode, Claude, OpenAI Codex, Factory

etanhey/judge-fleet

Name: judge-fleet
Author: etanhey

skills/golem-powers/judge-fleet/SKILL.md

npx skillsauth add etanhey/golems judge-fleet


/judge-fleet — bulk LLM-judging protocol

Three R3 runs (morning + evening 2026-06-06) proved seven non-negotiables. A generic "judge these N items" dispatch loses artifacts, degrades silently, and bulk-applies refuted merges. This skill encodes what the rerun briefs already harden — so agents don't re-learn from /tmp wipes and DB locks.

When to use

Dispatching or executing bulk LLM judges (J1, J2, red-team gate, Phase-2 sweep)
Planning fan-out over hundreds of prompt files → verdict JSON/JSONL
Bulk-apply or triage after a judge fleet completes
Dispatcher writing worker briefs for brainlayer eval_results/ campaigns

NOT for: one-off PR review, Phoenix annotation UX, or skills that don't produce verdict artifacts.

The seven rules (in dispatch order)

1. Pre-dispatch precondition validation

Before fan-out, verify:

BrainLayer DB is not enrichment-locked — probe brain_search (MCP or CLI) on one stem; timeout/lock → HOLD, do not dispatch judges concurrently with enrichment
Staging directory exists on disk and is writable (repo path under eval_results/, not ephemeral)
Prompt inventory count matches brief (e.g. 154 + 154 + 70 RT stems)

Evidence: enrichment-locked DB nullified 308-verdict runs twice (79% morning, 100% evening evidence_degraded).
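The three pre-flight checks above can be sketched as a single function that returns a list of blocking problems. This is a minimal sketch under stated assumptions: `probe` is a hypothetical callable wrapping one `brain_search` call (the real MCP/CLI invocation is not shown in this document), and prompt files are assumed to use a `.md` extension.

```python
import os
from pathlib import Path
from typing import Callable, List


def preflight(
    staging_dir: Path,
    prompts_dir: Path,
    expected_prompts: int,
    probe: Callable[[], bool],
) -> List[str]:
    """Return blocking problems; an empty list means dispatch may proceed."""
    problems = []

    # 1. Memory probe. `probe` is a hypothetical wrapper around one
    #    brain_search call that returns False (or raises) on timeout or lock.
    try:
        if not probe():
            problems.append("HOLD: brain_search probe failed (DB may be enrichment-locked)")
    except Exception as exc:
        problems.append(f"HOLD: brain_search probe raised {exc!r}")

    # 2. Staging directory must already exist on disk and be writable.
    if not staging_dir.is_dir():
        problems.append(f"staging dir missing: {staging_dir}")
    elif not os.access(staging_dir, os.W_OK):
        problems.append(f"staging dir not writable: {staging_dir}")

    # 3. Prompt inventory must match the count stated in the brief.
    #    The ".md" extension is an assumption for illustration.
    found = len(list(prompts_dir.glob("*.md"))) if prompts_dir.is_dir() else 0
    if found != expected_prompts:
        problems.append(f"prompt count mismatch: found {found}, brief says {expected_prompts}")

    return problems
```

A dispatcher would fan out only when `preflight(...)` returns an empty list, and would otherwise post the problems and hold.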

2. Durable staging — NEVER `/tmp`

All verdicts, sidecars, RT results, and DONE sentinels live under a durable repo path:

eval_results/<campaign>/prompts/
eval_results/<campaign>/verdicts/
eval_results/<campaign>/rt-mandatory/
eval_results/<campaign>/DONE/

Multi-hour judge runs never stage in /tmp — overnight wipes lost 9/21 deliverables.
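One way to enforce this layout is to derive every staging path from the repo root and refuse anything that resolves under `/tmp`. The sketch below is illustrative only: the function names are hypothetical, and it only computes paths, leaving directory creation to the caller.

```python
from pathlib import Path
from typing import Dict

# Subdirectory names taken from the layout listed above.
SUBDIRS = ("prompts", "verdicts", "rt-mandatory", "DONE")
TEMP_ROOT = Path("/tmp")


def is_ephemeral(path: Path) -> bool:
    """True when the path resolves to /tmp or to anything beneath it."""
    resolved = path.resolve()
    temp = TEMP_ROOT.resolve()
    return resolved == temp or temp in resolved.parents


def staging_paths(repo_root: Path, campaign: str) -> Dict[str, Path]:
    """Build the campaign's staging paths, rejecting ephemeral locations."""
    base = repo_root / "eval_results" / campaign
    if is_ephemeral(base):
        raise ValueError(f"refusing ephemeral staging path: {base}")
    return {name: base / name for name in SUBDIRS}
```

Inlining the resulting absolute paths into worker briefs keeps every worker writing to the same durable location.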

3. Per-prompt reasoning — no compiled judgment scripts

Each prompt gets individual LLM reasoning with cited evidence.

Forbidden: batch Python with hardcoded TECHNOLOGY_STEMS / lookup-table classifiers, regex-rule refutation scripts masquerading as RT, or collapsing "brain_search retry-once per stem" into 2–6 representative searches per batch.

Scripts for validation (schema check, set-diff coverage) are fine; scripts that produce verdicts are not.
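As an illustration of the permitted kind of script, the sketch below checks verdict shape and prompt coverage without deciding any verdict. The field names and enum values are hypothetical, since this document does not list the real verdict schema.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical schema: the real field names and enum values are not given here.
REQUIRED_FIELDS = {"stem": str, "verdict": str, "evidence_degraded": bool}
ALLOWED_VERDICTS = {"MERGE", "KEEP"}


def schema_errors(verdict: dict) -> List[str]:
    """List schema problems for one verdict; an empty list means it is well-formed."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in verdict:
            errors.append(f"missing field: {field}")
        elif not isinstance(verdict[field], expected_type):
            errors.append(f"wrong type for {field}")
    value = verdict.get("verdict")
    if isinstance(value, str) and value not in ALLOWED_VERDICTS:
        errors.append(f"invalid enum: {value}")
    return errors


def coverage_gaps(prompt_stems: Iterable[str], verdict_stems: Iterable[str]) -> Dict[str, Set[str]]:
    """Set-diff coverage: prompts with no verdict, and verdicts with no prompt."""
    prompts, verdicts = set(prompt_stems), set(verdict_stems)
    return {"missing": prompts - verdicts, "unexpected": verdicts - prompts}
```

Both functions only inspect what a judge already wrote, so they stay on the validation side of the line.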

4. Completion via durable DONE sentinel files

Cross-worker completion gates use sentinel files, not terminal grep or chat markers:

eval_results/<campaign>/DONE/J1.done
eval_results/<campaign>/DONE/J2.done
eval_results/<campaign>/DONE/RT_MANDATORY.done

R3_J1_DONE printed only in final chat was never observable to RT — file-count heuristics are a fallback, not the protocol.
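A sentinel gate along these lines might look like the following. It is a sketch, not the skill's actual implementation: the helper names, the optional note written into the file, and the polling interval are all assumptions.

```python
import os
import time
from pathlib import Path
from typing import Iterable


def write_done(done_dir: Path, worker: str, note: str = "") -> Path:
    """Create <worker>.done atomically so a reader never sees a half-written file."""
    done_dir.mkdir(parents=True, exist_ok=True)
    final = done_dir / f"{worker}.done"
    partial = done_dir / f".{worker}.done.partial"
    partial.write_text(note)
    os.replace(partial, final)  # atomic rename within one filesystem
    return final


def wait_for_done(done_dir: Path, workers: Iterable[str], timeout_s: float, poll_s: float = 5.0) -> bool:
    """Poll until every worker's sentinel exists; return False once the deadline passes."""
    names = list(workers)
    deadline = time.monotonic() + timeout_s
    while True:
        if all((done_dir / f"{w}.done").exists() for w in names):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

A red-team worker could then call `wait_for_done(done_dir, ["J1", "J2"], timeout_s=...)` instead of watching a terminal for a chat marker.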

5. Append-only per-worker collab sections

Workers report learnings via append-only writes — one section per worker/batch:

### 2026-06-06T09:35Z J2 (prompts 155-308)
...

Forbidden: concurrent StrReplace on a shared anchor in one collab file (6 workers → repeated anchor-miss retries). Prefer per-worker section files or atomic append to distinct headings.
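The per-worker section file option can be sketched as below. The one-file-per-worker layout and the helper name are assumptions; the heading format follows the example above.

```python
from datetime import datetime, timezone
from pathlib import Path


def append_section(collab_dir: Path, worker: str, scope: str, body: str) -> Path:
    """Append one timestamped section to the worker's own file; never rewrite."""
    collab_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%MZ")
    section = f"### {stamp} {worker} ({scope})\n{body.rstrip()}\n\n"
    path = collab_dir / f"{worker}.md"  # hypothetical layout: one file per worker
    with path.open("a", encoding="utf-8") as fh:  # append mode leaves earlier text untouched
        fh.write(section)
    return path
```

Because each worker owns its file, no two workers ever edit the same anchor, which removes the retry loop described above.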

6. `evidence_degraded` honesty flag

When brain_search fails (DB lock, timeout), every affected verdict MUST:

Set evidence_degraded: true
Cite only packet + on-disk grep evidence (never fabricate memory hits)
Post degradation loudly in collab summary

Bulk-apply MUST filter out evidence_degraded verdicts; do not silently merge degraded evidence as if live memory confirmed it.
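A bulk-apply step could separate degraded verdicts before touching anything, roughly as follows. This sketch assumes verdicts are stored one JSON object per line, and it treats a missing flag as degraded, which is a conservative design choice rather than something this document specifies.

```python
import json
from typing import Iterable, List, Tuple


def split_by_evidence(lines: Iterable[str]) -> Tuple[List[dict], List[dict]]:
    """Split verdict JSONL lines into (live, degraded)."""
    live, degraded = [], []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        verdict = json.loads(line)
        # Assumption: a missing flag counts as degraded, since absence of the
        # flag is not proof that live memory evidence was available.
        if verdict.get("evidence_degraded", True):
            degraded.append(verdict)
        else:
            live.append(verdict)
    return live, degraded
```

Only the `live` list would continue toward the red-team gate and merge; the `degraded` list is reported and re-judged once memory search is healthy again.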

7. Mandatory red-team gate before bulk-apply

Never bulk-apply straight from judge verdicts.

Run RT on the riskiest subset first (degraded + medium-confidence + merge recommendations), then run a Phase-2 continuous sweep across all verdicts. Re-judge REFUTE entries before any merge. Historical refute rates: ~49% of RT-mandatory stems, ~41% in Phase-2.
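The selection and gating logic might be expressed like this. It is a sketch with hypothetical field names (`confidence`, `verdict`) and a hypothetical `CONFIRM` label for red-team results that are not refuted; only `REFUTE` and `evidence_degraded` appear in this document.

```python
from typing import Dict, List


def rt_mandatory(verdicts: List[dict]) -> List[dict]:
    """Pick the riskiest subset: degraded, medium-confidence, or merge recommendations."""
    return [
        v for v in verdicts
        if v.get("evidence_degraded")
        or v.get("confidence") == "medium"
        or v.get("verdict") == "MERGE"
    ]


def mergeable(verdicts: List[dict], rt_results: Dict[str, str]) -> List[dict]:
    """Keep only merge recommendations that RT reviewed and did not refute."""
    return [
        v for v in verdicts
        if v.get("verdict") == "MERGE"
        and not v.get("evidence_degraded")
        and rt_results.get(v["stem"]) == "CONFIRM"  # hypothetical RT label
    ]
```

A stem with no red-team result is never merged under this sketch, so skipping the gate cannot silently pass verdicts through.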

Dispatcher checklist (copy into briefs)

PRE-FLIGHT: brain_search probe OK? staging dir exists? prompt count verified?
STAGING: eval_results/<campaign>/ — NEVER /tmp
WORKERS: per-prompt LLM reasoning; validation scripts OK, verdict scripts NOT OK
DONE: write eval_results/<campaign>/DONE/<worker>.done — do NOT rely on chat markers
COLLAB: append-only per-worker sections — no shared-anchor StrReplace
HONESTY: evidence_degraded when brain_search fails — flag in collab
MERGE: RT gate complete; REFUTE re-judged; filter degraded before bulk-apply

Integration

| Skill | Relationship |
|---|---|
| /never-fabricate | Read verdict files before claiming counts; no synthesized completion times |
| /cron-payload-discipline | Monitor ticks waiting on judge fleet use live file counts + DONE sentinels, not hardcoded "154/154 done" |
| /cmux-agents | Dispatch briefs must inline absolute staging paths and precondition steps |
| /pr-loop | Skill changes ship through full PR loop with eval scorecard in body |
| /skill-creator | RED/GREEN evals required before merge |

Anti-patterns

| Don't | Evidence |
|---|---|
| Dispatch judges while enrichment holds DB lock | 100% evidence_degraded, ~500 duplicate judgments |
| Stage verdicts in /tmp | 9/21 deliverables lost to wipe |
| judge_j2.py lookup-table batch classify | Invalid enum + misclassifications; caught only post-hoc |
| RT polls sleep 120 waiting on J1 chat DONE | R3_J1_DONE never in terminal; fragile gating |
| StrReplace shared collab anchor with 6 workers | Anchor-miss retries; luck-dependent no-duplicates |
| Bulk-apply without RT | 34/70 refuted (49%); ~126/308 Phase-2 REFUTE |


Bulk LLM-judging protocol for fleet-dispatched verdict runs (KG cluster, eval harness). Use when: dispatching or running judge workers (J1/J2/RT), planning bulk-apply from verdict JSONL, or triaging evidence_degraded outputs. Triggers: judge fleet, bulk judge, R3 verdicts, kg-judge, RT gate, evidence_degraded. NOT for: single-item code review, Phoenix view UX (use phoenix-human-view), or non-judge eval pipelines.

3 stars

development

Updated Jun 7, 2026


npx skillsauth add etanhey/golems judge-fleet

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | % |
|---|---|---|---|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Jun 7, 2026, 3:06 AM · 14.4s · 3 files scanned


Related Skills

etanhey/phoenix-human-view

tools

The human-eval UX contract for Phoenix views: turn-by-turn scrollable replay (not a scorecard), hide-but-copyable IDs, collapsed thinking, identity chips, tool filters, tiny frozen starter datasets, mark-wrong-in-thread, mobile-first. Use when: building or reviewing ANY Phoenix/eval view, annotation UI, session replay, or human-grading surface. Triggers: phoenix view, eval UI, annotation view, session replay, human eval UX, grading interface. NOT for: Phoenix data pipelines/ingest (capture scripts have their own specs).

3 · SKILL.md · Updated Jun 7, 2026

etanhey/mac-systems

tools

macOS systems specialist — AppKit NSPanel architecture, launchd services, socket activation, MCP bridge resilience, syspolicyd, and high-frequency SwiftUI dashboards. Use when building menu-bar apps, LaunchAgents, debugging syspolicyd/Gatekeeper/TCC, resilient UDS/MCP bridges, or SwiftUI dashboards at 10Hz+.

3 · SKILL.md · Updated Jun 7, 2026

etanhey/fleet-wrap

development

Quiet-down protocol for sprint close: when the fleet wraps, delete ALL polling crons and monitors, send ONE final dashboard + ONE message, then go SILENT. Use when: fleet wraps, all workers done, overnight queue exhausted, sprint close, Etan asleep/away with nothing approved left. Triggers: fleet wrap, wrap the fleet, stand down, going quiet, sprint close. NOT for: mid-sprint monitoring (keep your loops), spawning a successor (use /session-handoff first).

3 · SKILL.md · Updated Jun 7, 2026

etanhey/drive-usage

development

Brain Drive filing discipline — where every artifact goes + how to name it. Use WHENEVER touching Google Drive / Brain Drive: uploading, creating folders, saving research prompts/results, audits, plans, transcripts, dashboards, or when about to leave a durable artifact in docs.local/. Teaches the numbered folder model (01_STANDARDS / 02_GROUNDING / 03_RESEARCH / 04_INGEST / 06_ARCHIVE), date-prefixed naming, and the rule: FILE durable artifacts in the right Drive folder — docs.local/ is cache-only. NOT for querying Drive via Gemini (use /braindrive) or web research (use /gemini-research); for >100KB heavy archival defer to /google-drive-archive.

3 · SKILL.md · Updated Jun 7, 2026

Install manually

# Clone the repo
git clone https://github.com/etanhey/golems.git

# Copy into Claude Code skills folder (global); create the folder first if it does not exist
mkdir -p ~/.claude/skills
cp -r golems/skills/golem-powers/judge-fleet ~/.claude/skills/

Repository

etanhey/golems

3 stars

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT
