plugins/workflow/skills/evaluator/SKILL.md
Grade implementation work against bead acceptance criteria using a separate judge agent. Use after subagent work passes mechanical gates, as a pre-merge check, or on-demand to evaluate existing features. The evaluator is NOT the orchestrator and NOT the implementer — it only judges. Integrates with browser-qa for runtime verification when CDT MCP is available.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):
npx skillsauth add rbergman/dark-matter-marketplace evaluator
Separate the agent doing work from the agent judging it. This is more tractable than making one agent self-critical.
The orchestrator calls the evaluator in these situations:
- `evaluate <bead-id>` to test an existing feature against its criteria
- `/dm-work:post-merge` runs evaluator against closed beads

Skip evaluator when:
The intent review and evaluator have complementary scope:
If no runtime testing is possible, the evaluator's value over intent review is minimal. Skip it.
# Use model="haiku" for code-only evaluation with simple criteria (no browser-qa)
Task(subagent_type="general-purpose", model="opus", description="Evaluate against acceptance criteria", prompt="
ROLE: Evaluator. You judge work against acceptance criteria. You do NOT implement or fix.
BEAD: <id>
ACCEPTANCE CRITERIA (from bead --design field):
<numbered list of criteria>
CODE DIFF:
<git diff output or summary of changes>
EVALUATION PROCESS:
1. Classify each criterion:
   - RUNTIME: requires browser interaction to verify ("user can...", "page shows...", "form validates...")
   - CODE: verifiable from code inspection ("function exists", "type is correct", "test passes")
2. If browser-qa available (CDT MCP connected, app running at <url>):
   - Activate dm-work:browser-qa
   - For each RUNTIME criterion: navigate, interact, assert
   - For each CODE criterion: inspect the diff
3. If browser-qa NOT available:
   - For each CODE criterion: inspect the diff
   - For each RUNTIME criterion: mark UNTESTABLE with reason
   - If ALL criteria are UNTESTABLE: return early with overall: SKIP
4. Grade each criterion: PASS / FAIL / UNTESTABLE
   - PASS: criterion is satisfied (code or runtime evidence)
   - FAIL: criterion is not satisfied (describe what's wrong)
   - UNTESTABLE: cannot verify without runtime / missing prerequisite
SKILLS: dm-work:browser-qa (if CDT MCP available)
OUTPUT FORMAT (JSON to stdout):
{
  \"bead_id\": \"<id>\",
  \"criteria_results\": [
    {
      \"criterion\": 1,
      \"text\": \"User can navigate to /settings\",
      \"type\": \"RUNTIME\",
      \"result\": \"PASS\",
      \"detail\": \"Navigated to /settings, page loads with profile form visible\"
    },
    {
      \"criterion\": 2,
      \"text\": \"Email validates client-side\",
      \"type\": \"RUNTIME\",
      \"result\": \"FAIL\",
      \"detail\": \"Entered invalid email 'notanemail', no validation error shown\"
    }
  ],
  \"overall\": \"FAIL\",
  \"pass_count\": 1,
  \"fail_count\": 1,
  \"untestable_count\": 0,
  \"summary\": \"1/2 criteria pass. Email validation missing on client side.\"
}
RULES:
- Judge ONLY against the listed acceptance criteria. Do not invent requirements.
- PASS means the criterion is satisfied, not that the code is perfect.
- Report what you observed, not what you assumed.
- If a criterion is ambiguous, grade it and note the ambiguity in detail.
- Do NOT modify code, commit, or close beads.
")
The orchestrator processes evaluator output:
- overall: PASS → proceed to merge
- overall: SKIP → all criteria untestable, proceed (evaluator adds no value here)
- overall: FAIL →
  - 50% failures: likely a spec problem, so escalate to user, don't iterate
bd create --title="Eval: <failed criterion>" --type=bug --priority=2
bd dep add <new-bead> discovered-from:<parent-bead>
Circuit breaker: If evaluator fails twice on the same criterion after rework, escalate to user. Don't loop.
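The handling rules above, including the circuit breaker, can be sketched as a single decision function. This is an illustrative Python sketch, not the orchestrator's actual code: `next_action`, its action labels, and the `prior_failures` bookkeeping are hypothetical. It also assumes that the spec-problem threshold means strictly more than 50% of all criteria failing, which the text does not spell out.

```python
def next_action(report, prior_failures=None, spec_threshold=0.5):
    """Map an evaluator report to an orchestrator action.

    prior_failures is a hypothetical dict of {criterion number: number of
    earlier evaluations in which that criterion failed}.
    Returns (action, criterion numbers the action applies to).
    """
    prior_failures = prior_failures or {}
    if report["overall"] in ("PASS", "SKIP"):
        return ("proceed", [])

    results = report["criteria_results"]
    failed = [r["criterion"] for r in results if r["result"] == "FAIL"]

    # Circuit breaker: a criterion that already failed once and fails again
    # after rework goes to the user instead of another iteration.
    repeated = [c for c in failed if prior_failures.get(c, 0) >= 1]
    if repeated:
        return ("escalate", repeated)

    # Assumption: "strictly more than half of all criteria" marks a spec problem.
    if results and len(failed) / len(results) > spec_threshold:
        return ("escalate", failed)

    # Otherwise file one bug bead per failed criterion (the bd commands above).
    return ("file_beads", failed)
```

Keeping the repeat-failure check ahead of the ratio check means a stubborn single criterion escalates even when most of the bead passes.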
Not all projects use browser-qa. The evaluator should adapt:
| Project type | Verification method | Evaluator behavior |
|-------------|--------------------|--------------------|
| Standard web app | browser-qa (CDT MCP) | Full runtime evaluation |
| WebGL / Canvas game | Manual screenshots + human verification | Mark runtime criteria UNTESTABLE; take screenshots if CDT available for visual reference, but can't assert on canvas content |
| Native iOS/Android | Maestro or platform-specific tools | Mark runtime criteria UNTESTABLE unless project has automated UI test tooling wired |
| CLI tool | Bash execution + output assertion | Code-only evaluation; test commands via bash, not browser |
| API / backend | curl / httpie + response assertion | Code-only for endpoints; evaluate_script or direct API calls |
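One way to read the table is as a lookup from project type to how RUNTIME criteria get handled. The Python sketch below is illustrative only: the function `runtime_plan`, the project-type keys, and the returned labels are hypothetical names, not part of the skill.

```python
def runtime_plan(project_type, cdt_available=False, ui_tooling=False):
    """Pick how RUNTIME criteria are handled for a project type.

    All keys and return labels are hypothetical shorthand for the table rows.
    """
    if project_type == "web":
        # Standard web app: full runtime evaluation needs CDT MCP.
        return "browser-qa" if cdt_available else "untestable"
    if project_type == "canvas":
        # Screenshots may help a human, but canvas content can't be asserted.
        return "untestable"
    if project_type == "native":
        # Only testable when automated UI tooling (e.g. Maestro) is wired.
        return "platform-tools" if ui_tooling else "untestable"
    if project_type == "cli":
        return "bash"  # run commands and assert on output
    if project_type == "api":
        return "api-calls"  # curl / httpie style request and response checks
    raise ValueError(f"unknown project type: {project_type}")
```

For example, `runtime_plan("web")` without CDT returns `"untestable"`, matching the prompt's rule that RUNTIME criteria are marked UNTESTABLE when browser-qa is not available.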
When runtime verification isn't possible, the evaluator should:
| Component | How evaluator connects |
|-----------|----------------------|
| Orchestrator | Calls evaluator as Step 1.5 in post-subagent verification |
| Browser-qa | Evaluator activates browser-qa skill for standard web apps |
| Beads | Reads acceptance criteria from bead; files new beads for failures |
| Sprint contracts | Acceptance criteria in bead ARE the sprint contract |
| Post-merge review | Post-merge command uses evaluator for closed beads |
| Intent review | Complementary: intent checks code coverage, evaluator checks behavior |