.claude/skills/run-evals/SKILL.md
Run the recommendation engine evaluation suite, analyze results, and produce a clear summary with comparisons to previous runs. Use when the user wants to test the recommendation engine quality.
npx skillsauth add stopitdan/recommendagame run-evals
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
You are running the boredgame.lol recommendation engine evaluation suite. This is a comprehensive test system with 3,000+ eval cases across 16 categories that tests whether the engine returns the right games for user queries.
Eval cases: evals/evals/cases.json (currently ~3,028 cases)
Run results: evals/runs/ (JSON) and evals/logs/ (human-readable)
Ensure the dev server is running on localhost:1337. If not, start it:
npm run dev
Wait for it to be ready before proceeding.
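One way to wait for the server is to poll it until it answers. The following is a minimal sketch, not part of the skill itself: the DEV_URL variable, the 30-attempt limit, the WAIT_DELAY variable, and the use of curl are all assumptions chosen for illustration.

```bash
# Sketch only: DEV_URL, the attempt limit, and WAIT_DELAY are assumptions,
# not settings defined by the project.
DEV_URL="${DEV_URL:-http://localhost:1337}"

# probe: succeeds as soon as the server returns any HTTP response.
probe() {
  curl --silent --output /dev/null --max-time 2 "$DEV_URL"
}

# wait_for_server [MAX_ATTEMPTS] [PROBE_CMD]
# Returns 0 once PROBE_CMD succeeds, 1 if all attempts fail.
wait_for_server() {
  local max_attempts="${1:-30}"
  local probe_cmd="${2:-probe}"
  local attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$probe_cmd"; then
      return 0
    fi
    sleep "${WAIT_DELAY:-1}"
    attempt=$((attempt + 1))
  done
  return 1
}

# Example: wait_for_server 30 || echo "dev server did not come up" >&2
```

The probe is passed in by name so the waiting logic can be exercised without a live server.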
Run the eval suite based on user args:
source .env.local && npx tsx evals/runner.ts --concurrency=5
source .env.local && npx tsx evals/runner.ts --quick --concurrency=8
source .env.local && npx tsx evals/runner.ts --category=$ARGUMENTS --concurrency=5
source .env.local && npx tsx evals/runner.ts $ARGUMENTS
If the user passed arguments like --quick or --category=mechanic-focused, pass them through.
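The choice between these commands could be sketched as a small helper that prints the runner invocation for a given set of user arguments. The function name runner_command is hypothetical, and treating a bare word as a category name is an assumption made here, not a rule stated by the skill.

```bash
# runner_command [ARGS...]
# Prints the runner command matching the user's arguments.
# Assumption: a bare word (no leading --) is treated as a category name.
runner_command() {
  local base="npx tsx evals/runner.ts"
  if [ "$#" -eq 0 ]; then
    # No arguments: default run.
    printf '%s\n' "$base --concurrency=5"
  elif [ "$#" -eq 1 ] && [ "$1" = "--quick" ]; then
    # Quick run uses the higher concurrency shown in the source.
    printf '%s\n' "$base --quick --concurrency=8"
  else
    case "$1" in
      --*) printf '%s\n' "$base $*" ;;                        # flags: pass through unchanged
      *)   printf '%s\n' "$base --category=$1 --concurrency=5" ;;  # bare word: category
    esac
  fi
}

# Example: runner_command --category=mechanic-focused
```

The helper only prints the command, so it can be inspected before anything is executed.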
After the run completes, show the summary:
source .env.local && npx tsx evals/summary.ts
Compare with previous run (if one exists):
source .env.local && npx tsx evals/compare-runs.ts
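A guard for the "if one exists" condition might look like the sketch below. It assumes each run writes exactly one JSON file into evals/runs/, which the source does not confirm. The helper names count_runs and has_previous_run are hypothetical.

```bash
# count_runs DIR: prints how many *.json files DIR contains (0 if DIR is missing).
count_runs() {
  local dir="$1"
  local count=0
  local file
  for file in "$dir"/*.json; do
    # An unmatched glob stays literal, so check that the file really exists.
    [ -e "$file" ] && count=$((count + 1))
  done
  printf '%s\n' "$count"
}

# has_previous_run DIR: succeeds when at least two run files exist,
# i.e. the run that just finished plus one earlier run (assumption: one file per run).
has_previous_run() {
  [ "$(count_runs "$1")" -ge 2 ]
}

# Example: has_previous_run evals/runs && echo "previous run available"
```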
Run failure analysis:
source .env.local && npx tsx evals/analyze-failures.ts
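The three post-run commands above can be chained so that a failure stops the sequence. This is a sketch under assumptions: run_steps is a hypothetical helper, and the source does not say whether compare-runs.ts exits non-zero when no previous run exists.

```bash
# run_steps CMD...
# Runs each command string with bash, in order, and stops at the first failure.
run_steps() {
  local step
  for step in "$@"; do
    printf '==> %s\n' "$step"
    if ! bash -c "$step"; then
      printf 'step failed: %s\n' "$step" >&2
      return 1
    fi
  done
}

# Example, using the three commands from this section:
# run_steps \
#   "source .env.local && npx tsx evals/summary.ts" \
#   "source .env.local && npx tsx evals/compare-runs.ts" \
#   "source .env.local && npx tsx evals/analyze-failures.ts"
```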
Present results to the user in a clear, organized format.
If the dev server is not running, run npm run dev first.
See evals/RECOMMENDATIONS.md if the user asks what to fix.
See evals/EVAL-OVERVIEW.md for the full system documentation.