skills/evals/SKILL.md
Run and create evals for testing agent behavior. Use when the user wants to create or run an eval.
npx skillsauth add adriancooney/evals evals
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Evals are markdown files matching *.eval.md. Use glob to find them:
**/*.eval.md
A common pattern is to collect evals in an evals/ directory.
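As an illustration of the discovery step, here is a minimal Python sketch of the same glob. The function name `find_evals` is hypothetical and not part of the skill; an agent would normally use its own glob tool instead.

```python
from pathlib import Path


def find_evals(root: str = ".") -> list[Path]:
    """Return every *.eval.md file under root, sorted for stable ordering."""
    # Path.rglob("*.eval.md") walks root recursively, like the pattern **/*.eval.md
    return sorted(p for p in Path(root).rglob("*.eval.md") if p.is_file())
```

Sorting is a design choice here so that repeated runs report evals in the same order.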
An eval file contains a prompt and an expectation:
# Eval Title
<prompt>
Instructions for the agent to execute.
</prompt>
<expectation>
Success criteria - describe what must be true for the eval to pass.
</expectation>
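The file layout above can be read mechanically. The sketch below is one possible parser, assuming the title is the first `# ` heading and that each tag appears once; `Eval`, `parse_eval`, and `_section` are hypothetical names used only for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Eval:
    title: str
    prompt: str
    expectation: str


def _section(text: str, tag: str) -> str:
    """Return the trimmed text between <tag> and </tag>, or raise if absent."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    if match is None:
        raise ValueError(f"missing <{tag}> section")
    return match.group(1).strip()


def parse_eval(text: str) -> Eval:
    """Split an eval file into its title, prompt, and expectation."""
    # Assumption: the title is the first line that starts with "# ".
    title_match = re.search(r"^#[ \t]+(.+)$", text, flags=re.MULTILINE)
    title = title_match.group(1).strip() if title_match else ""
    return Eval(title, _section(text, "prompt"), _section(text, "expectation"))
```

Raising on a missing section keeps a malformed eval from being silently treated as a pass.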
Execute the <prompt> content in a subagent.
Evaluate the outcome against the <expectation> using LLM judgment.
Report SUCCESS or FAIL with reasoning.
When running multiple evals, spawn all subagents in parallel. Report aggregate results at the end.
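The parallel-run and aggregate-report behavior could look roughly like the following sketch. It is not how the skill spawns subagents: `run_one` is a hypothetical callable standing in for "spawn a subagent and judge the outcome", and the per-eval report format is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_all(paths: list[str], run_one: Callable[[str], tuple[bool, str]]) -> str:
    """Run every eval concurrently and return an aggregate report.

    run_one(path) is a stand-in for the subagent; it returns (passed, reasoning).
    """
    # One worker per eval so all of them start in parallel (at least 1 worker).
    with ThreadPoolExecutor(max_workers=max(1, len(paths))) as pool:
        outcomes = list(pool.map(run_one, paths))  # map preserves input order

    lines = [
        f"{'SUCCESS' if passed else 'FAIL'} {path}: {reasoning}"
        for path, (passed, reasoning) in zip(paths, outcomes)
    ]
    # Assumption: the CI line is the bare "eval result: pass" / "eval result: fail".
    all_passed = all(passed for passed, _ in outcomes)
    lines.append("eval result: pass" if all_passed else "eval result: fail")
    return "\n".join(lines)
```

A single failing eval flips the final line to `eval result: fail`, matching the "one or more evals failed" rule.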
Always end output with exactly one of these lines for CI parsing:
eval result: pass — all evals passed
eval result: fail — one or more evals failed
IMPORTANT: The subagent must only test and observe. It must NOT attempt to fix, modify, or change anything to make the expectation pass. The subagent executes the prompt, observes the outcome, and reports whether the expectation was met. If the expectation fails, report FAIL; do not try to make it pass.
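On the CI side, the final line can be turned into an exit code. The sketch below is one way to do that, matching on the `eval result: pass` and `eval result: fail` prefixes so that it works whether or not a description follows; `ci_exit_code` and the exit code 2 for a missing verdict are assumptions, not part of the skill.

```python
def ci_exit_code(output: str) -> int:
    """Map eval output to an exit code: 0 for pass, 1 for fail, 2 for no verdict."""
    # Only the last non-empty line counts as the verdict.
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    last = lines[-1] if lines else ""
    if last.startswith("eval result: pass"):
        return 0
    if last.startswith("eval result: fail"):
        return 1
    return 2  # hypothetical convention: the run ended without a verdict line
```

Treating a missing verdict as its own nonzero code keeps a crashed or truncated run from being mistaken for a pass.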
Run a single eval:
/eval run <path-to-eval.eval.md>
Run all evals:
/eval run-all
Create an eval:
/eval create <name>
Gather the prompt and the expectation from the user.
Write the eval to <name>.eval.md in the current directory.
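For illustration, writing the file might look like the sketch below. `create_eval` is a hypothetical helper, and deriving the title from the name and refusing to overwrite an existing file are assumptions rather than documented behavior.

```python
from pathlib import Path


def create_eval(name: str, prompt: str, expectation: str, directory: str = ".") -> Path:
    """Write <name>.eval.md in directory using the prompt/expectation layout."""
    # Assumption: a readable title is derived from the name, e.g. login-flow -> Login Flow.
    title = name.replace("-", " ").replace("_", " ").title()
    body = (
        f"# {title}\n"
        "<prompt>\n"
        f"{prompt.strip()}\n"
        "</prompt>\n"
        "<expectation>\n"
        f"{expectation.strip()}\n"
        "</expectation>\n"
    )
    path = Path(directory) / f"{name}.eval.md"
    if path.exists():
        # Assumption: never silently overwrite an existing eval.
        raise FileExistsError(f"{path} already exists")
    path.write_text(body, encoding="utf-8")
    return path
```

The resulting file matches the `*.eval.md` pattern, so it is picked up by the glob used to find evals.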
When creating an eval, try to make it self-contained and reproducible. This isn't critical, but it helps.
If you see an opportunity to improve isolation but need clarification, ask the user.