skills/assistant/eval-agent-md/SKILL.md
Behavioral compliance testing for any CLAUDE.md or agent definition file. Auto-generates test scenarios from your rules, runs them via LLM-as-judge scoring, and reports a compliance score with per-rule pass/fail breakdown. Optionally improves failing rules via automated mutation loop. Use when: (1) testing whether your CLAUDE.md rules are actually followed, (2) evaluating an agent definition for role-boundary compliance, (3) dogfooding a skill's own SKILL.md. Triggers on: "eval", "compliance test", "test my CLAUDE.md", "check rules", "behavioral test", "/eval-agent-md". Do not trigger for: editing or writing CLAUDE.md rules, general code review, adding linting config, or any task that is not explicitly about testing behavioral compliance.
npx skillsauth add ravnhq/ai-toolkit eval-agent-md
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
--holistic)
claude -p with LLM-as-judge scoring
Always run scripts with uv run --script, never python, never python3, never a bare script name. The scripts declare their own dependencies via inline # /// script metadata; uv run --script resolves all dependencies automatically, so no pip install is ever required. Invoking with python or python3 will fail with import errors because the dependencies are not installed in the system environment.
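For readers unfamiliar with inline script metadata, here is a minimal sketch of the PEP 723 header format that uv run --script reads. The script is hypothetical and not one of the bundled ones. Its dependency list is left empty so the sketch also runs standalone; a real script would list its third-party packages there.

```python
# /// script
# requires-python = ">=3.9"
# dependencies = []
# ///
"""Hypothetical example script showing a PEP 723 inline metadata header."""


def describe_invocation(script_path: str) -> str:
    """Return the command form this skill requires for running a script."""
    return f"uv run --script {script_path}"


if __name__ == "__main__":
    # The path is illustrative; substitute the real [SKILL_DIR] location.
    print(describe_invocation("scripts/generate-scenarios.py"))
```

Because the metadata lives in comments, the file remains ordinary Python. What python or python3 skips is the dependency resolution that uv performs from that header.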
This skill runs long operations (30s-5min per step). Always keep the user informed:
Find the target file to test. Priority order:
0. If user passed --self, target is [SKILL_DIR]/SKILL.md — skip to confirmation below
/eval-agent-md ./CLAUDE.md), use that
~/.claude/CLAUDE.md (user global)
Read the file and confirm with the user: "I found [filename] at [path] ([N] lines). Testing this file." Wait for user acknowledgment before proceeding to Step 2.
Tell the user: "Generating test scenarios from [filename]... this calls claude -p --model sonnet and takes 30-60 seconds on average."
Before running, mention whether this is a warm or cold generation run:
Run the scenario generator script bundled with this skill. IMPORTANT: Do NOT capture output — run via the Bash tool so the user sees progress lines in real time:
uv run --script [SKILL_DIR]/scripts/generate-scenarios.py [TARGET_FILE]
# For SKILL.md files, add --skill for workflow-aware scenarios:
# uv run --script [SKILL_DIR]/scripts/generate-scenarios.py --skill [TARGET_FILE]
# For self-testing (implies --skill):
# uv run --script [SKILL_DIR]/scripts/generate-scenarios.py --self
# To also generate integration scenarios (multi-rule interaction tests):
# uv run --script [SKILL_DIR]/scripts/generate-scenarios.py --holistic [TARGET_FILE]
The script auto-detects the repository name from git and saves to /tmp/eval-agent-md-<repo>-scenarios.yaml (e.g., /tmp/eval-agent-md-my-project-scenarios.yaml). Override with --repo-name NAME or -o PATH.
It also reuses an exact-input scenario cache by default; pass --no-scenario-cache to force fresh generation. --no-cache remains as a compatibility alias.
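The default output naming described above can be sketched as follows. The function name and parameters are hypothetical; the real script's internals, including how it detects the repository name from git, are not shown in this document.

```python
from pathlib import PurePosixPath
from typing import Optional


def default_scenarios_path(repo_name: str, override: Optional[str] = None) -> str:
    """Build the scenarios file path for a repository.

    `override` stands in for an explicit -o PATH; when given, it wins.
    Otherwise the path follows /tmp/eval-agent-md-<repo>-scenarios.yaml.
    """
    if override:
        return override
    return str(PurePosixPath("/tmp") / f"eval-agent-md-{repo_name}-scenarios.yaml")


if __name__ == "__main__":
    print(default_scenarios_path("my-project"))
```

Passing the same path to --scenarios-file in the later steps keeps generation and evaluation pointed at one file.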
After generation, read the output file and show the user a summary:
Ask the user: "Generated [N] test scenarios. Ready to run? (Or edit/skip any?)"
Validation gate: If the output file is missing or contains 0 scenarios, do not proceed. Tell the user: "Scenario generation produced no scenarios. Check that the target file has clearly structured rules (headings, numbered items, or labeled sections)." Then stop.
Tell the user: "Running [N] scenarios x [runs] run(s) against [model]... each scenario calls claude -p twice (subject + judge), so this takes a few minutes. You'll see per-scenario results as they complete."
Also summarize the work budget before starting:
Tip: --effort low --runs 3 costs roughly the same as --effort high --runs 1 and gives majority-vote reliability — a practical default for regular compliance checks.
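To make the work budget and the majority vote concrete, here is a small sketch. It assumes two claude -p calls per scenario run (subject plus judge, as stated above) with no cache hits, and it treats a tie as FAIL, which is an assumption about the real script's behavior.

```python
from typing import Sequence


def total_claude_calls(scenarios: int, runs: int) -> int:
    """Upper bound on claude -p calls: subject + judge for every run."""
    return scenarios * runs * 2


def majority_verdict(verdicts: Sequence[str]) -> str:
    """Return PASS only when strictly more than half of the runs passed."""
    passes = sum(1 for v in verdicts if v == "PASS")
    return "PASS" if passes * 2 > len(verdicts) else "FAIL"


if __name__ == "__main__":
    print(total_claude_calls(scenarios=10, runs=3))
    print(majority_verdict(["PASS", "FAIL", "PASS"]))
```

With --runs 1 a single noisy sample decides the verdict; with --runs 3 one stray failure is outvoted.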
IMPORTANT: Do NOT capture output — run via the Bash tool so the user sees per-scenario progress ([1/N] scenario_id... PASS/FAIL (Xs)) in real time:
uv run --script [SKILL_DIR]/scripts/eval-behavioral.py \
--scenarios-file /tmp/eval-agent-md-<repo>-scenarios.yaml \
--claude-md [TARGET_FILE] \
--runs 1 \
--model sonnet
Options the user can control:
- --runs N: runs per scenario for majority vote (default: 1, recommend 3 for reliability)
- --model MODEL: model for test subject (default: sonnet)
- --compare-models: run across haiku/sonnet/opus and show comparison matrix
- --workers N: opt into higher concurrency than the safe default
- --no-judge-cache: force fresh judge verdicts instead of reusing exact-input cache entries
- --no-subject-cache: force fresh subject responses instead of exact-input cache reuse

Results now include multi-dimensional metrics: per-scenario response size (char count, word count) alongside timing and cache stats. This enables better A/B comparison during mutation testing.
Validation gate: If all scenarios return an error or null verdict (e.g., script crash, missing model), do not print a compliance report. Tell the user: "All scenarios failed to produce a verdict — the run may have crashed. Check the output above for errors before interpreting results." Then stop.
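The gate above amounts to a single check over the run results. In this sketch the result shape (a list of mappings with a "verdict" key) is hypothetical; the actual output format of eval-behavioral.py is not documented here.

```python
from typing import Mapping, Optional, Sequence


def run_produced_no_verdicts(results: Sequence[Mapping[str, Optional[str]]]) -> bool:
    """True when no scenario yielded a usable PASS/FAIL verdict.

    An empty result list also counts as "no verdicts", so a crashed run
    never reaches the compliance report.
    """
    return all(r.get("verdict") not in ("PASS", "FAIL") for r in results)


if __name__ == "__main__":
    crashed = [{"verdict": None}, {"verdict": "error"}]
    print(run_produced_no_verdicts(crashed))
```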
Print a compliance report:
## Compliance Report — [filename]
### Per-rule: 8/10 (80%)
| Scenario | Rule | Verdict | Evidence |
|----------|------|---------|----------|
| gate1_think | GATE-1 | PASS | Lists assumptions before code |
| ... | ... | ... | ... |
### Integration: 3/5 (60%) ← only shown with --holistic
| Scenario | Rules Tested | Verdict | Evidence |
|----------|--------------|---------|----------|
| integration_gate1_tdd | GATE-1, TDD | PASS | Assumptions before test before impl |
| ... | ... | ... | ... |
### Combined: 11/15 (73%) [per-rule: 8/10, integration: 3/5]
### Failing Rules
- [rule]: [what went wrong] — suggested fix: [brief suggestion]
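The score lines in the report follow simple arithmetic: passed over total, with the combined figure summing both groups. A minimal sketch follows, assuming percentages are rounded to the nearest integer. The sample report is consistent with that, but the real script's rounding rule is not stated.

```python
def format_score(passed: int, total: int) -> str:
    """Format a score as 'passed/total (pct%)'; an empty suite scores 0%."""
    pct = round(100 * passed / total) if total else 0
    return f"{passed}/{total} ({pct}%)"


def combined_line(rule_pass: int, rule_total: int,
                  integ_pass: int, integ_total: int) -> str:
    """Build the Combined heading from per-rule and integration counts."""
    score = format_score(rule_pass + integ_pass, rule_total + integ_total)
    return (f"### Combined: {score} "
            f"[per-rule: {rule_pass}/{rule_total}, "
            f"integration: {integ_pass}/{integ_total}]")


if __name__ == "__main__":
    print(format_score(8, 10))
    print(combined_line(8, 10, 3, 5))
```

Note that the combined percentage is computed from the pooled counts (11 of 15), not by averaging the 80% and 60% figures.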
If the user says "improve", "fix", or passed --improve:
Tell the user: "Starting mutation loop (dry-run) — this iteratively generates wording fixes for failing rules and A/B tests them. Each iteration takes 1-2 minutes."
For performance, explain that scoped mutation checks now reuse the baseline already computed for the current content and only re-evaluate the mutated candidate for the targeted scenario before any full-suite validation.
IMPORTANT: Do NOT capture output — run via the Bash tool so the user sees iteration progress in real time:
uv run --script [SKILL_DIR]/scripts/mutate-loop.py \
--target [TARGET_FILE] \
--scenarios-file /tmp/eval-agent-md-<repo>-scenarios.yaml \
--max-iterations 3 \
--runs 3 \
--model sonnet
This is always dry-run by default. Show the user each suggested mutation and ask before applying.
The mutation loop includes three safety guardrails (disable with --no-boundary-check):
--- markers)

When a mutation produces delta=0 (equal correctness), the --neutral-strategy flag controls the decision:
- revert (default): discard neutral mutations
- keep: keep neutral mutations
- size: keep only if the mutated response is shorter (efficiency win)

Parse the user's /eval-agent-md invocation for these common options:
- [path]: target file (positional, e.g., /eval-agent-md ./CLAUDE.md)
- --improve: run mutation loop after testing
- --runs N: runs per scenario (default: 1, recommend 3 for reliability)
- --model MODEL: model for test subject (default: sonnet)
- --self: test this skill's own SKILL.md (implies --skill)
- --skill / --agent: hint the target type for better scenario generation
- --holistic: also generate integration scenarios that test multiple rules interacting (priority ordering, conflict resolution, cumulative compliance)
- --coverage: report rule coverage after scenario generation (shows tested vs untested rules)
- --effort LEVEL: effort for subject calls, one of low / medium / high (default: high). Lower effort reduces cost and latency.
- --gen-effort LEVEL: effort for scenario generation, one of low / medium / high (default: medium). Use high for complex or densely-ruled files.
- --save-reference PATH: save scenarios to a stable reference directory for deterministic test suites

See references/script-reference.md for the full flag reference (caching, workers, compare-models, timeouts).
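The --neutral-strategy choices described earlier for the mutation loop (revert, keep, size) can be sketched as a small decision function. The function and parameter names are hypothetical. Keeping a mutation when delta is positive and reverting when it is negative is an assumption; the document only specifies the delta=0 case.

```python
def decide_mutation(delta: int, baseline_size: int, mutated_size: int,
                    neutral_strategy: str = "revert") -> str:
    """Return 'keep' or 'revert' for a candidate mutation.

    delta is mutated correctness minus baseline correctness; the sizes
    are response sizes (for example, character counts).
    """
    if delta > 0:   # assumed: an improvement is kept
        return "keep"
    if delta < 0:   # assumed: a regression is reverted
        return "revert"
    # delta == 0: the neutral strategy decides.
    if neutral_strategy == "revert":
        return "revert"
    if neutral_strategy == "keep":
        return "keep"
    if neutral_strategy == "size":
        return "keep" if mutated_size < baseline_size else "revert"
    raise ValueError(f"unknown neutral strategy: {neutral_strategy!r}")


if __name__ == "__main__":
    print(decide_mutation(0, baseline_size=900, mutated_size=700,
                          neutral_strategy="size"))
```

The size strategy is where the per-scenario response-size metrics come into play: a mutation that is equally correct but yields a shorter response counts as a win.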
User: "Run compliance tests against my CLAUDE.md to check if all rules are being followed."
Expected behavior: Begin Step 1 immediately without asking for confirmation — locate the CLAUDE.md, confirm it with the user (filename, path, line count), then proceed through the full workflow: generate scenarios → run behavioral tests → report compliance score with per-rule pass/fail breakdown. Do not pause to ask permission or clarify intent before starting.
User: "Add a new linting rule to our ESLint config."
Expected behavior: Do not use this skill. Choose a more relevant skill or proceed directly.
User: "Help me write a new CLAUDE.md rule that enforces conventional commits."
Expected behavior: Do not use this skill. The user is authoring rules, not testing whether existing rules are followed. Proceed directly without invoking the eval workflow.
User: "Test my CLAUDE.md and check if the rules hold even when Claude is being fast and lazy."
Expected behavior: Immediately run with --effort low --runs 3 — do not ask which file to use first, use the default file resolution (Step 1 priority order). Explain that low effort is a stricter bar for critical rules — if a rule fails at low effort, it means compliance relies on Claude being in careful mode, which is a fragility worth fixing.
generate-scenarios.py exits with non-zero status or produces empty output.
--runs 1) is susceptible to LLM variance. The model may not follow rules consistently in a single sample.
--runs 3 for majority-vote scoring to reduce noise.
No such file or directory when running skill scripts.
chmod +x on the scripts in the scripts/ directory.

- references/script-reference.md: all flags, caching strategy, performance notes
- references/scenario-format.md: YAML schema and field rules for manually reviewing or editing generated scenarios before running
- assets/report-template.md: structured compliance report format with a Next Steps checklist