.claude/skills/increase-evals/SKILL.md
Generate more eval test cases to expand coverage. Targets 5,000+ total cases. Use when the user wants more test cases for the recommendation engine evaluation suite.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):
npx skillsauth add stopitdan/recommendagame increase-evals
You are expanding the boredgame.lol recommendation engine eval suite. The goal is to have 5,000+ diverse, high-quality test cases that cover every conceivable query a real user might type.
Current category counts in evals/cases.json:

| Category | Count | Notes |
|----------|-------|-------|
| mechanic-focused | 530 | Good coverage |
| multi-constraint | 384 | Good |
| theme-focused | 356 | Good |
| video-game | 262 | Good |
| similar-to | 212 | Good |
| mood-vibe | 189 | Could use more |
| player-count | 177 | Good |
| time-constraint | 164 | Good |
| free-text-intent | 159 | Could use more |
| edge-case | 153 | Good |
| negative-preference | 123 | Could use more |
| designer-search | 116 | Could use more |
| complexity | 112 | Could use more |
| real-user-feedback | 78 | NEEDS MORE -- these are the most valuable |
| regression | 9 | NEEDS MORE -- known failure probes |
| party-game | 4 | NEEDS MUCH MORE |
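As a quick sanity check on the numbers above, the following sketch sums the per-category counts and computes the distance to the 5,000 goal. The counts are copied from the table; the `THIN_FLOOR` cutoff of 100 is an illustrative assumption, not something the eval suite defines.

```typescript
// Category counts copied from the table above.
const counts: Record<string, number> = {
  "mechanic-focused": 530,
  "multi-constraint": 384,
  "theme-focused": 356,
  "video-game": 262,
  "similar-to": 212,
  "mood-vibe": 189,
  "player-count": 177,
  "time-constraint": 164,
  "free-text-intent": 159,
  "edge-case": 153,
  "negative-preference": 123,
  "designer-search": 116,
  "complexity": 112,
  "real-user-feedback": 78,
  "regression": 9,
  "party-game": 4,
};

const TARGET = 5000; // the "5,000+" goal stated in this skill
const THIN_FLOOR = 100; // hypothetical cutoff for "thin" categories

// Sum all categories and compute how many cases are still missing.
const total = Object.values(counts).reduce((sum, n) => sum + n, 0);
const shortfall = Math.max(0, TARGET - total);

// Categories with fewer cases than the illustrative floor.
const thin = Object.entries(counts)
  .filter(([, n]) => n < THIN_FLOOR)
  .map(([name]) => name);

console.log(`total=${total} shortfall=${shortfall}`); // total=3028 shortfall=1972
console.log(`thin categories: ${thin.join(", ")}`);
```

With this cutoff, the thin categories are exactly the three the table flags as needing more: real-user-feedback, regression, and party-game.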
The generate-massive.ts script has 7 additional batch definitions (batches 24-30) that haven't been run yet. These target:
source .env.local && npx tsx evals/generate-massive.ts
This loads the existing 3,028 cases and adds the new batches on top. It uses GPT-4o-mini with a 120s timeout and a batch size of 30.
After generation, verify the count:
node -e "console.log(JSON.parse(require('fs').readFileSync('evals/cases.json')).length)"
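The one-liner above reports only the total. To see where the new cases landed, a per-category tally may be more useful. This is a minimal sketch that assumes evals/cases.json is a JSON array and that each case carries a string `category` field; that field name is an assumption, so adjust it to the actual case schema.

```typescript
import { existsSync, readFileSync } from "node:fs";

// Assumed shape: only the field this sketch needs. The real schema may differ.
interface EvalCase {
  category?: string;
}

// Count cases per category; cases without a category are grouped together.
function countByCategory(cases: EvalCase[]): Map<string, number> {
  const tally = new Map<string, number>();
  for (const c of cases) {
    const key = c.category ?? "(uncategorized)";
    tally.set(key, (tally.get(key) ?? 0) + 1);
  }
  return tally;
}

const casesPath = "evals/cases.json";
if (existsSync(casesPath)) {
  const cases = JSON.parse(readFileSync(casesPath, "utf8")) as EvalCase[];
  // Print thinnest categories first so gaps are easy to spot.
  const rows = [...countByCategory(cases).entries()].sort((a, b) => a[1] - b[1]);
  for (const [category, n] of rows) {
    console.log(`${String(n).padStart(5)}  ${category}`);
  }
  console.log(`${String(cases.length).padStart(5)}  total`);
}
```

Saved under a name of your choosing (for example a hypothetical evals/count-by-category.ts), it can be run with npx tsx from the repository root.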
The most impactful categories to expand:
Edit evals/generate-cases.ts to add cases with the addCase() function, then regenerate:
npx tsx evals/generate-cases.ts
If the user said /increase-evals 500 or /increase-evals party-game:
- `generate-massive.ts` (adjust batch counts)
- Verify the count: `node -e "console.log(JSON.parse(require('fs').readFileSync('evals/cases.json')).length)"`
- Run a quick eval: `source .env.local && npx tsx evals/runner.ts --quick --no-judge`
- Add a note to `evals/EVAL-WORKLOG.md` about what was added and why
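The two invocation forms mentioned above (`/increase-evals 500` and `/increase-evals party-game`) can be told apart by whether the argument is a plain number. The sketch below is one hypothetical way to classify it; `parseIncreaseEvalsArg` and the `Request` type are illustrative names, not part of the repository. Whether a number means "add this many" or "reach this total" is not specified here, so the sketch just carries the raw amount.

```typescript
// Hypothetical request shapes for the three ways the skill may be invoked.
type Request =
  | { kind: "default" }                       // no argument: run the pending batches
  | { kind: "count"; amount: number }         // e.g. "/increase-evals 500"
  | { kind: "category"; category: string };   // e.g. "/increase-evals party-game"

function parseIncreaseEvalsArg(arg: string | undefined): Request {
  const trimmed = (arg ?? "").trim();
  if (trimmed === "") {
    return { kind: "default" };
  }
  // Digits only means a case count; anything else is treated as a category name.
  if (/^\d+$/.test(trimmed)) {
    return { kind: "count", amount: Number(trimmed) };
  }
  return { kind: "category", category: trimmed.toLowerCase() };
}

console.log(parseIncreaseEvalsArg("500"));        // { kind: 'count', amount: 500 }
console.log(parseIncreaseEvalsArg("party-game")); // { kind: 'category', category: 'party-game' }
```

A category result would still need checking against the category names that actually appear in evals/cases.json before adjusting any batch counts.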