Adoption

Agent Skills are supported by leading AI development tools.

VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory

rbergman/evaluator

Name: evaluator
Author: rbergman

plugins/workflow/skills/evaluator/SKILL.md

npx skillsauth add rbergman/dark-matter-marketplace evaluator

Clean

TrivyContainer and dependency vulnerability scanner

Clean

SemgrepStatic code analysis for vulnerabilities

Clean

mcp-scan (Snyk)Model Context Protocol security validation

Skipped

Snyk (dep)Open source security scanning

Skipped

Socket.devSupply chain security analysis

Skipped

VirusTotalMulti-engine malware detection

Skipped

CrowdStrikeAdvanced threat intelligence

Skipped

OSV-ScannerOpen Source Vulnerability database check

Skipped

OWASP Dep-Check

Evaluator Protocol

Separate the agent doing work from the agent judging it. This is more tractable than making one agent self-critical.

When to Invoke

The orchestrator calls the evaluator in these situations:

Post-subagent, pre-merge — after implementation passes mechanical gates AND either:
- Browser-qa is available (CDT MCP connected + app running) — runtime evaluation
- Acceptance criteria require runtime testing ("user can...", "page shows...", "form validates...")
On-demand — evaluate <bead-id> to test an existing feature against its criteria
Post-merge — /dm-work:post-merge runs evaluator against closed beads

Skip evaluator when:

XS/S tasks without acceptance criteria
Intent review returned COVERAGE: full, DRIFT: none, GAPS: none AND no browser-qa available
Bead has no acceptance criteria (report to orchestrator, don't run empty evaluation)
Task has no bead (ad-hoc work)

The intent review and evaluator have complementary scope:

Intent review checks CODE COVERAGE — does the diff contain the right changes?
Evaluator checks BEHAVIORAL CORRECTNESS — does the running app satisfy each criterion?

If no runtime testing is possible, the evaluator's value over intent review is minimal. Skip it.

Evaluator Agent Template

Task(subagent_type="general-purpose", model="opus", description="Evaluate against acceptance criteria", prompt="
# Use model="haiku" for code-only evaluation with simple criteria (no browser-qa)
ROLE: Evaluator. You judge work against acceptance criteria. You do NOT implement or fix.

BEAD: <id>
ACCEPTANCE CRITERIA (from bead --design field):
<numbered list of criteria>

CODE DIFF:
<git diff output or summary of changes>

EVALUATION PROCESS:

1. Classify each criterion:
   - RUNTIME: requires browser interaction to verify ("user can...", "page shows...", "form validates...")
   - CODE: verifiable from code inspection ("function exists", "type is correct", "test passes")

2. If browser-qa available (CDT MCP connected, app running at <url>):
   - Activate dm-work:browser-qa
   - For each RUNTIME criterion: navigate, interact, assert
   - For each CODE criterion: inspect the diff

3. If browser-qa NOT available:
   - For each CODE criterion: inspect the diff
   - For each RUNTIME criterion: mark UNTESTABLE with reason
   - If ALL criteria are UNTESTABLE: return early with overall: SKIP

4. Grade each criterion: PASS / FAIL / UNTESTABLE
   - PASS: criterion is satisfied (code or runtime evidence)
   - FAIL: criterion is not satisfied (describe what's wrong)
   - UNTESTABLE: cannot verify without runtime / missing prerequisite

SKILLS: dm-work:browser-qa (if CDT MCP available)

OUTPUT FORMAT (JSON to stdout):
{
  \"bead_id\": \"<id>\",
  \"criteria_results\": [
    {
      \"criterion\": 1,
      \"text\": \"User can navigate to /settings\",
      \"type\": \"RUNTIME\",
      \"result\": \"PASS\",
      \"detail\": \"Navigated to /settings, page loads with profile form visible\"
    },
    {
      \"criterion\": 2,
      \"text\": \"Email validates client-side\",
      \"type\": \"RUNTIME\",
      \"result\": \"FAIL\",
      \"detail\": \"Entered invalid email 'notanemail', no validation error shown\"
    }
  ],
  \"overall\": \"FAIL\",
  \"pass_count\": 1,
  \"fail_count\": 1,
  \"untestable_count\": 0,
  \"summary\": \"1/2 criteria pass. Email validation missing on client side.\"
}

RULES:
- Judge ONLY against the listed acceptance criteria. Do not invent requirements.
- PASS means the criterion is satisfied, not that the code is perfect.
- Report what you observed, not what you assumed.
- If a criterion is ambiguous, grade it and note the ambiguity in detail.
- Do NOT modify code, commit, or close beads.
")

Handling Evaluator Results

The orchestrator processes evaluator output:

overall: PASS → proceed to merge overall: SKIP → all criteria untestable, proceed (evaluator adds no value here) overall: FAIL →

Check fail count vs total:
- 1-2 failures: send FAIL details back to original subagent for targeted fix
- 50% failures: likely a spec problem — escalate to user, don't iterate
Check if failures are criteria bugs (criterion is impossible/ambiguous):
- If so, update the bead criteria, don't blame the implementation

Create beads for persistent failures:

bd create --title="Eval: <failed criterion>" --type=bug --priority=2
bd dep add <new-bead> discovered-from:<parent-bead>

Circuit breaker: If evaluator fails twice on the same criterion after rework, escalate to user. Don't loop.

Cost and Timing

Evaluator adds ~1-2 minutes per invocation (code-only) or ~2-4 minutes (with browser-qa)
Skip aggressively when not needed (see skip conditions above)
Use haiku model for code-only evaluation if criteria are simple
Use opus for browser-qa evaluation (needs to drive CDT tools effectively)

Platform-Specific Verification

Not all projects use browser-qa. The evaluator should adapt:

| Project type | Verification method | Evaluator behavior | |-------------|--------------------|--------------------| | Standard web app | browser-qa (CDT MCP) | Full runtime evaluation | | WebGL / Canvas game | Manual screenshots + human verification | Mark runtime criteria UNTESTABLE; take screenshots if CDT available for visual reference, but can't assert on canvas content | | Native iOS/Android | Maestro or platform-specific tools | Mark runtime criteria UNTESTABLE unless project has automated UI test tooling wired | | CLI tool | Bash execution + output assertion | Code-only evaluation; test commands via bash, not browser | | API / backend | curl / httpie + response assertion | Code-only for endpoints; evaluate_script or direct API calls |

When runtime verification isn't possible, the evaluator should:

Grade all code-verifiable criteria normally
Mark platform-specific criteria as UNTESTABLE with the reason and recommended verification method
Include a note: "Manual verification recommended for: [list criteria]"

Integration Points

| Component | How evaluator connects | |-----------|----------------------| | Orchestrator | Calls evaluator as Step 1.5 in post-subagent verification | | Browser-qa | Evaluator activates browser-qa skill for standard web apps | | Beads | Reads acceptance criteria from bead; files new beads for failures | | Sprint contracts | Acceptance criteria in bead ARE the sprint contract | | Post-merge review | Post-merge command uses evaluator for closed beads | | Intent review | Complementary: intent checks code coverage, evaluator checks behavior |

rbergman/evaluator

plugins/workflow/skills/evaluator/SKILL.md

Grade implementation work against bead acceptance criteria using a separate judge agent. Use after subagent work passes mechanical gates, as a pre-merge check, or on-demand to evaluate existing features. The evaluator is NOT the orchestrator and NOT the implementer — it only judges. Integrates with browser-qa for runtime verification when CDT MCP is available.

3 stars

tools

Updated Apr 22, 2026

$ install --global

skillsauth

npx skillsauth add rbergman/dark-matter-marketplace evaluator

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners Passed

Scanners in report

Clean

TrivyContainer and dependency vulnerability scanner

95%

Clean

SemgrepStatic code analysis for vulnerabilities

95%

Clean

mcp-scan (Snyk)Model Context Protocol security validation

95%

Skipped

Snyk (dep)Open source security scanning

50%

Skipped

Socket.devSupply chain security analysis

50%

Skipped

VirusTotalMulti-engine malware detection

50%

Skipped

CrowdStrikeAdvanced threat intelligence

50%

Skipped

OSV-ScannerOpen Source Vulnerability database check

50%

Skipped

OWASP Dep-Check

50%

Last scanned: Apr 22, 2026, 3:30 PM51.7s1 file scanned

SKILL.md

name:: evaluator
description:: Grade implementation work against bead acceptance criteria using a separate judge agent. Use after subagent work passes mechanical gates, as a pre-merge check, or on-demand to evaluate existing features. The evaluator is NOT the orchestrator and NOT the implementer — it only judges. Integrates with browser-qa for runtime verification when CDT MCP is available.

Evaluator Protocol

Separate the agent doing work from the agent judging it. This is more tractable than making one agent self-critical.

When to Invoke

The orchestrator calls the evaluator in these situations:

Post-subagent, pre-merge — after implementation passes mechanical gates AND either:
- Browser-qa is available (CDT MCP connected + app running) — runtime evaluation
- Acceptance criteria require runtime testing ("user can...", "page shows...", "form validates...")
On-demand — evaluate <bead-id> to test an existing feature against its criteria
Post-merge — /dm-work:post-merge runs evaluator against closed beads

Skip evaluator when:

XS/S tasks without acceptance criteria
Intent review returned COVERAGE: full, DRIFT: none, GAPS: none AND no browser-qa available
Bead has no acceptance criteria (report to orchestrator, don't run empty evaluation)
Task has no bead (ad-hoc work)

The intent review and evaluator have complementary scope:

Intent review checks CODE COVERAGE — does the diff contain the right changes?
Evaluator checks BEHAVIORAL CORRECTNESS — does the running app satisfy each criterion?

If no runtime testing is possible, the evaluator's value over intent review is minimal. Skip it.

Evaluator Agent Template

Task(subagent_type="general-purpose", model="opus", description="Evaluate against acceptance criteria", prompt="
# Use model="haiku" for code-only evaluation with simple criteria (no browser-qa)
ROLE: Evaluator. You judge work against acceptance criteria. You do NOT implement or fix.

BEAD: <id>
ACCEPTANCE CRITERIA (from bead --design field):
<numbered list of criteria>

CODE DIFF:
<git diff output or summary of changes>

EVALUATION PROCESS:

1. Classify each criterion:
   - RUNTIME: requires browser interaction to verify ("user can...", "page shows...", "form validates...")
   - CODE: verifiable from code inspection ("function exists", "type is correct", "test passes")

2. If browser-qa available (CDT MCP connected, app running at <url>):
   - Activate dm-work:browser-qa
   - For each RUNTIME criterion: navigate, interact, assert
   - For each CODE criterion: inspect the diff

3. If browser-qa NOT available:
   - For each CODE criterion: inspect the diff
   - For each RUNTIME criterion: mark UNTESTABLE with reason
   - If ALL criteria are UNTESTABLE: return early with overall: SKIP

4. Grade each criterion: PASS / FAIL / UNTESTABLE
   - PASS: criterion is satisfied (code or runtime evidence)
   - FAIL: criterion is not satisfied (describe what's wrong)
   - UNTESTABLE: cannot verify without runtime / missing prerequisite

SKILLS: dm-work:browser-qa (if CDT MCP available)

OUTPUT FORMAT (JSON to stdout):
{
  \"bead_id\": \"<id>\",
  \"criteria_results\": [
    {
      \"criterion\": 1,
      \"text\": \"User can navigate to /settings\",
      \"type\": \"RUNTIME\",
      \"result\": \"PASS\",
      \"detail\": \"Navigated to /settings, page loads with profile form visible\"
    },
    {
      \"criterion\": 2,
      \"text\": \"Email validates client-side\",
      \"type\": \"RUNTIME\",
      \"result\": \"FAIL\",
      \"detail\": \"Entered invalid email 'notanemail', no validation error shown\"
    }
  ],
  \"overall\": \"FAIL\",
  \"pass_count\": 1,
  \"fail_count\": 1,
  \"untestable_count\": 0,
  \"summary\": \"1/2 criteria pass. Email validation missing on client side.\"
}

RULES:
- Judge ONLY against the listed acceptance criteria. Do not invent requirements.
- PASS means the criterion is satisfied, not that the code is perfect.
- Report what you observed, not what you assumed.
- If a criterion is ambiguous, grade it and note the ambiguity in detail.
- Do NOT modify code, commit, or close beads.
")

Handling Evaluator Results

The orchestrator processes evaluator output:

overall: PASS → proceed to merge overall: SKIP → all criteria untestable, proceed (evaluator adds no value here) overall: FAIL →

Check fail count vs total:
- 1-2 failures: send FAIL details back to original subagent for targeted fix
- 50% failures: likely a spec problem — escalate to user, don't iterate
Check if failures are criteria bugs (criterion is impossible/ambiguous):
- If so, update the bead criteria, don't blame the implementation

Create beads for persistent failures:

bd create --title="Eval: <failed criterion>" --type=bug --priority=2
bd dep add <new-bead> discovered-from:<parent-bead>

Circuit breaker: If evaluator fails twice on the same criterion after rework, escalate to user. Don't loop.

Cost and Timing

Evaluator adds ~1-2 minutes per invocation (code-only) or ~2-4 minutes (with browser-qa)
Skip aggressively when not needed (see skip conditions above)
Use haiku model for code-only evaluation if criteria are simple
Use opus for browser-qa evaluation (needs to drive CDT tools effectively)

Platform-Specific Verification

Not all projects use browser-qa. The evaluator should adapt:

When runtime verification isn't possible, the evaluator should:

Grade all code-verifiable criteria normally
Mark platform-specific criteria as UNTESTABLE with the reason and recommended verification method
Include a note: "Manual verification recommended for: [list criteria]"

Integration Points

Related Skills

rbergman/repo-init

development

VerifiedTrustedCommunity

Initialize a new repository with standard scaffolding - git, gitignore, AGENTS.md, justfile, mise, beads, and timbers. Use when starting a new project or setting up an existing repo for Claude Code workflows.

5SKILL.mdUpdated Apr 22, 2026

rbergman/lead

data-ai

VerifiedTrustedCommunity

Activate at session start when using Agent Teams for complex multi-agent work. Establishes team lead role with delegation protocols, teammate spawning, model selection, and beads integration. You coordinate the team; teammates implement.

5SKILL.mdUpdated Apr 22, 2026

rbergman/worktrees

data-ai

VerifiedTrustedCommunity

Use when creating a worktree, setting up a worktree, starting feature work that needs isolation, or before executing implementation plans. Covers git worktree creation under .worktrees/, gitignore setup, beads integration, and merge guardrails.

4SKILL.mdUpdated Apr 22, 2026

rbergman/subagent

data-ai

VerifiedTrustedCommunity

Activate when you are a delegated subagent (not the orchestrator). Establishes subagent protocol with terse returns, details to history/, file ownership boundaries, and escalation rules. You implement; orchestrator reviews and commits.

4SKILL.mdUpdated Apr 22, 2026

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Need help? View full Cowork setup guide →

Install manually

Choose your platform

# Clone the repo
git clone https://github.com/rbergman/dark-matter-marketplace.git

# Copy into Claude Code skills folder (global)
cp -r dark-matter-marketplace/plugins/workflow/skills/evaluator ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

rbergman/dark-matter-marketplace

3 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT