.claude/skills/enhance-evals/SKILL.md
Improve the eval system quality -- add better cases, fix broken cases, improve the LLM judge, add new metrics, or refine pass/fail criteria based on eval run data. Use when the user wants to make the eval system itself better.
You are improving the boredgame.lol recommendation engine evaluation system. The user wants the eval system itself to be better -- more accurate tests, better failure detection, smarter judging, more useful output.
- evals/runner.ts (626 lines) - Core runner: parallel execution, LLM judge, constraint checker, regression tracking, persistent logging
- evals/types.ts (218 lines) - Type definitions for cases, results, runs, metrics
- evals/metrics.ts (109 lines) - NDCG@K, Precision@K, MRR, Hit Rate calculations
- evals/llm-judge.ts (105 lines) - GPT-4o-mini rates overall result quality 0-10
- evals/constraint-checker.ts (110 lines) - Detects player count, time, complexity, game type violations
- evals/compare-runs.ts (139 lines) - Side-by-side run comparison with regression detection
- evals/summary.ts (109 lines) - Run summary viewer with history mode
- evals/analyze-failures.ts (254 lines) - Failure pattern categorization + most-missing-game tracking
- evals/generate-cases.ts - 130 hand-curated base cases
- evals/generate-expanded-cases.ts - +177 systematic variations
- evals/generate-massive.ts - LLM-generated thousands (GPT-4o-mini, batched)
- evals/cases.json - Currently ~3,028 cases across 16 categories

Pass/fail criteria may be too strict: A case fails if ANY ideal game (relevance >= 2) is missing. But the results might be excellent games the eval case didn't anticipate. The LLM judge catches this (7.14/10 average vs 68.4% pass rate means many "failing" cases have good results).
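To make the strictness concrete, here is a minimal sketch of the rule as described above, next to one possible softer signal. The `IdealGame` shape and both function names are hypothetical; the real types live in evals/types.ts and may differ, and the recall variant is an illustration, not current behavior.

```typescript
// Hypothetical shape; the real definition in evals/types.ts may differ.
interface IdealGame {
  name: string;
  relevance: number; // graded relevance; >= 2 is treated as "must appear"
}

// Strict rule as described above: every ideal game with relevance >= 2
// must be present in the returned list, otherwise the case fails.
function passesStrict(ideal: IdealGame[], returned: string[]): boolean {
  const got = new Set(returned);
  return ideal.filter((g) => g.relevance >= 2).every((g) => got.has(g.name));
}

// Softer signal (an assumption, not the current criterion): the fraction of
// required games that were found, so near-misses are visible.
function requiredRecall(ideal: IdealGame[], returned: string[]): number {
  const required = ideal.filter((g) => g.relevance >= 2);
  if (required.length === 0) return 1; // nothing required, nothing missed
  const got = new Set(returned);
  return required.filter((g) => got.has(g.name)).length / required.length;
}
```

A case that returns one of two required games fails the strict rule but scores 0.5 on recall, which could be reported alongside the judge score before deciding whether the case itself needs fixing.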
idealGames are sometimes wrong: Some generated cases reference games that don't exist in the DB, or use slightly wrong names (e.g., "Castles of Burgundy" vs "The Castles of Burgundy").
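A name check along these lines could flag such cases. This is a sketch under assumptions: `normalizeGameName` and `findUnknownGames` are hypothetical helpers, the DB names are assumed to be available as a plain string array, and the normalization (lowercase, drop a leading article, collapse punctuation) only handles ASCII titles.

```typescript
// Normalize a title so that "Castles of Burgundy" and
// "The Castles of Burgundy" compare equal. ASCII-only by design;
// titles with accented characters would need a gentler rule.
function normalizeGameName(name: string): string {
  return name
    .trim()
    .toLowerCase()
    .replace(/^(the|a|an)\s+/, "") // drop a leading English article
    .replace(/[^a-z0-9]+/g, " ") // collapse punctuation and whitespace
    .trim();
}

// Return the ideal-game names that match nothing in the DB after normalization.
function findUnknownGames(idealNames: string[], dbNames: string[]): string[] {
  const known = new Set(dbNames.map(normalizeGameName));
  return idealNames.filter((n) => !known.has(normalizeGameName(n)));
}
```

Names that survive this filter are candidates for manual review: they are either typos or games that are genuinely absent from the DB.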
LLM judge uses 0-10 scale: Research (arxiv 2411.15594) shows 0-2 scales with per-dimension scoring are more reliable. Current judge gives a single holistic score.
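One way a per-dimension 0-2 judge could be wired up is sketched below. The dimension names (`relevance`, `constraintFit`, `variety`) and the JSON reply format are assumptions for illustration; they are not taken from evals/llm-judge.ts.

```typescript
// Hypothetical dimensions; the real judge may use different ones.
const DIMENSIONS = ["relevance", "constraintFit", "variety"] as const;
type Dimension = (typeof DIMENSIONS)[number];
type JudgeScores = Record<Dimension, 0 | 1 | 2>;

// Parse and validate a judge reply that is assumed to be a JSON object
// with one 0/1/2 score per dimension. Throws on anything else.
function parseJudgeScores(raw: string): JudgeScores {
  const parsed: unknown = JSON.parse(raw);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("judge reply is not a JSON object");
  }
  const obj = parsed as Record<string, unknown>;
  const scores = {} as JudgeScores;
  for (const dim of DIMENSIONS) {
    const value = obj[dim];
    if (value !== 0 && value !== 1 && value !== 2) {
      throw new Error(`invalid score for ${dim}`);
    }
    scores[dim] = value as 0 | 1 | 2;
  }
  return scores;
}

// Collapse the per-dimension scores into a single value in [0, 1].
function overallScore(scores: JudgeScores): number {
  const sum = DIMENSIONS.reduce((acc, dim) => acc + scores[dim], 0);
  return sum / (2 * DIMENSIONS.length);
}
```

Keeping the per-dimension values in the result, rather than only the collapsed number, makes it possible to see which dimension drives a low score.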
No serendipity metric: Research shows users want accuracy + novelty. We don't measure whether recommendations include unexpected-but-delightful games.
No familiarity balance metric: Optimal is 20-30% recognizable games, 70-80% discovery. We don't track this.
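A familiarity metric could start as simply as the sketch below. How "recognizable" is defined is an open question; here it is assumed to be membership in a hand-maintained set of well-known titles, and `familiarShare` and `isBalanced` are hypothetical names.

```typescript
// Share of a recommendation list that appears in a set of well-known games.
// The wellKnown set is an assumption: it could come from a popularity
// ranking or a curated list.
function familiarShare(recommended: string[], wellKnown: Set<string>): number {
  if (recommended.length === 0) return 0;
  const hits = recommended.filter((name) => wellKnown.has(name)).length;
  return hits / recommended.length;
}

// Check the share against the 20-30% target band mentioned above.
function isBalanced(share: number, min = 0.2, max = 0.3): boolean {
  return share >= min && share <= max;
}
```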
Catalog coverage is only 0.5%: Only ~400-500 unique games appear across all recommendations. Could indicate test cases are too similar or engine has narrow candidate fetching.
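Catalog coverage is the number of distinct recommended games divided by the catalog size. A minimal sketch, assuming each run exposes its recommendations as lists of game names and that the catalog size is known (`catalogCoverage` is a hypothetical helper):

```typescript
// Fraction of the catalog that appears at least once across all
// recommendation lists in a run.
function catalogCoverage(recommendationLists: string[][], catalogSize: number): number {
  if (catalogSize <= 0) throw new Error("catalogSize must be positive");
  const unique = new Set<string>();
  for (const list of recommendationLists) {
    for (const name of list) unique.add(name);
  }
  return unique.size / catalogSize;
}
```

Computing this per category as well as overall would help separate the two explanations above: similar test cases would show low coverage within a category, while narrow candidate fetching would show it everywhere.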
Generated cases may have quality issues: LLM-generated cases (GPT-4o-mini at temp 0.9) may include unrealistic queries, wrong game names, or contradictory constraints.
No confidence intervals: We report point estimates without standard errors. With 3,000+ cases we should use statistical significance testing.
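For pass rates, a Wilson score interval is one reasonable choice because it behaves sensibly for small categories and for rates near 0% or 100%. The sketch below is an illustration, not the project's method; `wilsonInterval` is a hypothetical helper, and z = 1.96 corresponds to a 95% interval.

```typescript
// Wilson score interval for a binomial proportion (passed out of total).
function wilsonInterval(
  passed: number,
  total: number,
  z = 1.96,
): { low: number; high: number } {
  if (total <= 0) throw new Error("total must be positive");
  const p = passed / total;
  const z2 = z * z;
  const denom = 1 + z2 / total;
  const center = (p + z2 / (2 * total)) / denom;
  const half =
    (z * Math.sqrt((p * (1 - p)) / total + z2 / (4 * total * total))) / denom;
  return { low: center - half, high: center + half };
}
```

For a category with 4 cases and 2 passes, this gives an interval of roughly 0.15 to 0.85, which makes it clear that a pass rate from so few cases says very little.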
From docs/research/recommendation-eval-methodology.md and evals/RECOMMENDATIONS.md:
| Category | Cases | Pass Rate | Notes |
|----------|-------|-----------|-------|
| mechanic-focused | 530 | ~32% | Weakest. BGG mechanic alias gap is the root cause. |
| multi-constraint | 384 | ~36% | Combined constraints are hard to satisfy. |
| theme-focused | 356 | ~86% | Strong. Theme matching works well. |
| video-game | 262 | ~100% | Perfect. No cross-contamination. |
| similar-to | 212 | ~83% | Good. "Like X" queries work. |
| mood-vibe | 189 | ~29% | Weak. Missing Patchwork, Jaipur for chill. |
| player-count | 177 | ~60% | Moderate. Some constraint violations. |
| time-constraint | 164 | ~50% | Moderate. Time violations persist. |
| free-text-intent | 159 | ~73% | Good. Natural language decent. |
| edge-case | 153 | ~100% | Perfect. Handles garbage gracefully. |
| negative-preference | 123 | ~73% | Good. Respects exclusions. |
| designer-search | 116 | ~42% | Weak. Non-designer games mixed in. |
| complexity | 112 | ~44% | Moderate. Misses gateway games. |
| real-user-feedback | 78 | ~33% | Weak. BGG user issues persist. |
| regression | 9 | ~89% | Good. Past bugs mostly fixed. |
| party-game | 4 | ~50% | Too few cases. |
Based on what the user asks (or $ARGUMENTS), pick the right enhancement:
- … evals/cases.json and look for cases with wrong game names, unrealistic queries, or contradictory constraints
- (… scripts/validate-eval-cases.ts)
- (… generate-cases.ts, generate-expanded-cases.ts, or generate-massive.ts) and regenerate
- … evals/metrics.ts and evals/types.ts
- … computeCaseMetrics() and computeAggregateMetrics() in the runner
- … evals/llm-judge.ts
- … generate-cases.ts for the weakest categories
- … npm run eval:generate-massive to fill gaps with LLM-generated cases
- … evals/summary.ts, evals/compare-runs.ts, evals/analyze-failures.ts
- … source .env.local && npx tsx evals/runner.ts --quick --no-judge
- … evals/EVAL-WORKLOG.md