stopitdan/enhance-evals

Name: enhance-evals
Author: stopitdan

.claude/skills/enhance-evals/SKILL.md

npx skillsauth add stopitdan/recommendagame enhance-evals

Enhance the Eval System

You are improving the boredgame.lol recommendation engine evaluation system. The user wants the eval system itself to be better -- more accurate tests, better failure detection, smarter judging, more useful output.

Full Context You MUST Know

System Architecture

evals/runner.ts (626 lines) - Core runner: parallel execution, LLM judge, constraint checker, regression tracking, persistent logging
evals/types.ts (218 lines) - Type definitions for cases, results, runs, metrics
evals/metrics.ts (109 lines) - NDCG@K, Precision@K, MRR, Hit Rate calculations
evals/llm-judge.ts (105 lines) - GPT-4o-mini rates overall result quality 0-10
evals/constraint-checker.ts (110 lines) - Detects player count, time, complexity, game type violations
evals/compare-runs.ts (139 lines) - Side-by-side run comparison with regression detection
evals/summary.ts (109 lines) - Run summary viewer with history mode
evals/analyze-failures.ts (254 lines) - Failure pattern categorization + most-missing-game tracking
evals/generate-cases.ts - 130 hand-curated base cases
evals/generate-expanded-cases.ts - +177 systematic variations
evals/generate-massive.ts - LLM-generated thousands (GPT-4o-mini, batched)
evals/cases.json - Currently ~3,028 cases across 16 categories

Known Weaknesses in the Eval System

Pass/fail criteria may be too strict: A case fails if ANY ideal game (relevance >= 2) is missing. But the results might be excellent games the eval case didn't anticipate. The LLM judge catches this (7.14/10 average vs 68.4% pass rate means many "failing" cases have good results).
idealGames are sometimes wrong: Some generated cases reference games that don't exist in the DB, or use slightly wrong names (e.g., "Castles of Burgundy" vs "The Castles of Burgundy").
LLM judge uses 0-10 scale: Research (arxiv 2411.15594) shows 0-2 scales with per-dimension scoring are more reliable. Current judge gives a single holistic score.
No serendipity metric: Research shows users want accuracy + novelty. We don't measure whether recommendations include unexpected-but-delightful games.
No familiarity balance metric: Optimal is 20-30% recognizable games, 70-80% discovery. We don't track this.
Catalog coverage is only 0.5%: Only ~400-500 unique games appear across all recommendations. Could indicate test cases are too similar or engine has narrow candidate fetching.
Generated cases may have quality issues: LLM-generated cases (GPT-4o-mini at temp 0.9) may include unrealistic queries, wrong game names, or contradictory constraints.
No confidence intervals: We report point estimates without standard errors. With 3,000+ cases we should use statistical significance testing.
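
The missing confidence intervals noted above could be covered for the pass rate with a binomial interval. The following is a minimal sketch, assuming each case is an independent pass/fail trial. `wilsonInterval` is a hypothetical helper (not an existing function in evals/metrics.ts) that computes a Wilson score interval at roughly 95% (z = 1.96). The pass counts in the example are illustrative, derived from the approximate category figures.

```typescript
// Hypothetical helper: Wilson score interval for a pass rate.
// Assumes each eval case is an independent pass/fail trial.
interface Interval {
  rate: number;  // observed pass rate
  lower: number; // lower bound of the interval
  upper: number; // upper bound of the interval
}

function wilsonInterval(passed: number, total: number, z = 1.96): Interval {
  if (total <= 0) {
    return { rate: 0, lower: 0, upper: 0 };
  }
  const p = passed / total;
  const z2 = z * z;
  const denom = 1 + z2 / total;
  const center = (p + z2 / (2 * total)) / denom;
  const halfWidth =
    (z * Math.sqrt((p * (1 - p)) / total + z2 / (4 * total * total))) / denom;
  return {
    rate: p,
    lower: Math.max(0, center - halfWidth),
    upper: Math.min(1, center + halfWidth),
  };
}

// With only 4 cases (as in party-game), a 50% pass rate says very little.
const small = wilsonInterval(2, 4);
// With 530 cases (as in mechanic-focused), ~32% is a much tighter estimate.
const large = wilsonInterval(170, 530);
console.log(small, large);
```

The interval for the 4-case category spans most of the 0-1 range, which is one way to make "too few cases" visible in a report.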

Research-Backed Improvements to Consider

From docs/research/recommendation-eval-methodology.md and evals/RECOMMENDATIONS.md:

Pairwise LLM comparison instead of absolute scoring (more reliable)
Per-dimension scoring: Rate mechanic match, theme match, constraint satisfaction separately
Chain-of-thought reasoning in LLM judge (explain BEFORE scoring)
Trust buster detection: Flag obviously wrong individual results (e.g., Chess for a party game query)
Constraint violation breakdown by type (time vs player count vs complexity)
Serendipity@K: relevant AND dissimilar to obvious matches
Familiarity-discovery ratio: % of results that are well-known vs obscure
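
As a rough illustration of Serendipity@K from the list above, the sketch below counts top-K results that are relevant and not in a set of "obvious" matches. This is a simplification under stated assumptions: `serendipityAtK` is a hypothetical function, and "dissimilar to obvious matches" is approximated by set membership rather than a similarity score. The game names and sets are illustrative only.

```typescript
// Hypothetical sketch of Serendipity@K.
// "Obvious" is approximated by membership in a caller-supplied set; a real
// implementation might use a similarity or popularity score instead.
function serendipityAtK(
  ranked: string[],
  relevant: Set<string>,
  obvious: Set<string>,
  k: number,
): number {
  if (k <= 0) return 0;
  const topK = ranked.slice(0, k);
  const hits = topK.filter((name) => relevant.has(name) && !obvious.has(name));
  return hits.length / k;
}

// Patchwork and Jaipur are relevant and non-obvious; Catan is relevant but
// obvious; Chess is not relevant. Two of four results count.
const score = serendipityAtK(
  ["Catan", "Patchwork", "Jaipur", "Chess"],
  new Set<string>(["Catan", "Patchwork", "Jaipur"]),
  new Set<string>(["Catan"]),
  4,
);
console.log(score); // 0.5
```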

The 16 Categories and Their Current Performance

| Category | Cases | Pass Rate | Notes |
|----------|-------|-----------|-------|
| mechanic-focused | 530 | ~32% | Weakest. BGG mechanic alias gap is the root cause. |
| multi-constraint | 384 | ~36% | Combined constraints are hard to satisfy. |
| theme-focused | 356 | ~86% | Strong. Theme matching works well. |
| video-game | 262 | ~100% | Perfect. No cross-contamination. |
| similar-to | 212 | ~83% | Good. "Like X" queries work. |
| mood-vibe | 189 | ~29% | Weak. Missing Patchwork, Jaipur for chill. |
| player-count | 177 | ~60% | Moderate. Some constraint violations. |
| time-constraint | 164 | ~50% | Moderate. Time violations persist. |
| free-text-intent | 159 | ~73% | Good. Natural language decent. |
| edge-case | 153 | ~100% | Perfect. Handles garbage gracefully. |
| negative-preference | 123 | ~73% | Good. Respects exclusions. |
| designer-search | 116 | ~42% | Weak. Non-designer games mixed in. |
| complexity | 112 | ~44% | Moderate. Misses gateway games. |
| real-user-feedback | 78 | ~33% | Weak. BGG user issues persist. |
| regression | 9 | ~89% | Good. Past bugs mostly fixed. |
| party-game | 4 | ~50% | Too few cases. |

What To Do

Based on what the user asks (or $ARGUMENTS), pick the right enhancement:

If they want to fix broken/inaccurate test cases:

Read evals/cases.json and look for cases with wrong game names, unrealistic queries, or contradictory constraints
Cross-reference idealGames against the actual database (query Supabase or run the validation script at scripts/validate-eval-cases.ts)
Fix the cases in the appropriate generator (generate-cases.ts, generate-expanded-cases.ts, or generate-massive.ts) and regenerate
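
The cross-referencing step can be sketched as a name check that tolerates the "Castles of Burgundy" vs "The Castles of Burgundy" kind of mismatch. This is only an illustration: `normalizeTitle` and `findUnknownGames` are hypothetical helpers, the catalog is passed in as a plain array rather than queried from Supabase, and the normalization assumes ASCII titles.

```typescript
// Hypothetical helper: normalize a game title for comparison.
// Note: non-ASCII characters are dropped, which is an assumption that may not
// suit every catalog.
function normalizeTitle(title: string): string {
  return title
    .trim()
    .toLowerCase()
    .replace(/^(the|a|an)\s+/, "") // drop a leading article
    .replace(/[^a-z0-9]+/g, " ")   // collapse punctuation and whitespace
    .trim();
}

// Returns idealGames entries with no catalog match, even after normalization.
function findUnknownGames(idealGames: string[], catalog: string[]): string[] {
  const known = new Set(catalog.map(normalizeTitle));
  return idealGames.filter((name) => !known.has(normalizeTitle(name)));
}

const unknown = findUnknownGames(
  ["Castles of Burgundy", "Jaipur", "Not A Real Game"],
  ["The Castles of Burgundy", "Jaipur", "Patchwork"],
);
console.log(unknown); // ["Not A Real Game"]
```

Entries that only match after normalization are candidates for a name fix in the generator, while entries that never match likely reference games missing from the database.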

If they want better metrics:

Read evals/metrics.ts and evals/types.ts
Add new metric calculations (serendipity, familiarity balance, catalog coverage per run)
Update computeCaseMetrics() and computeAggregateMetrics() in the runner
Update the report formatter to display new metrics
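
Two of the metrics named above, familiarity balance and catalog coverage, might look like the following. This is a sketch under assumptions: the `RecommendedGame` shape and the `wellKnown` flag are hypothetical (the real types live in evals/types.ts and may differ), and the catalog size and flags in the example are made up for illustration.

```typescript
// Hypothetical shape; the real result type in evals/types.ts may differ.
interface RecommendedGame {
  name: string;
  wellKnown: boolean; // assumption: some popularity signal is available
}

// Share of results that are well-known (0..1); the remainder is "discovery".
function familiarityRatio(results: RecommendedGame[]): number {
  if (results.length === 0) return 0;
  return results.filter((g) => g.wellKnown).length / results.length;
}

// Share of the catalog that appears at least once across all cases in a run.
function catalogCoverage(
  runResults: RecommendedGame[][],
  catalogSize: number,
): number {
  if (catalogSize <= 0) return 0;
  const seen = new Set<string>();
  for (const results of runResults) {
    for (const game of results) seen.add(game.name);
  }
  return seen.size / catalogSize;
}

// Illustrative data: two cases, three unique games, a catalog of 600 games.
const firstCase: RecommendedGame[] = [
  { name: "Patchwork", wellKnown: true },
  { name: "Jaipur", wellKnown: false },
];
const secondCase: RecommendedGame[] = [
  { name: "Patchwork", wellKnown: true },
  { name: "Chess", wellKnown: true },
];
console.log(familiarityRatio(firstCase)); // 0.5
console.log(catalogCoverage([firstCase, secondCase], 600));
```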

If they want a better LLM judge:

Read evals/llm-judge.ts
Implement per-dimension scoring (mechanic match, theme match, constraint satisfaction as separate 0-2 scores)
Add chain-of-thought reasoning requirement
Consider pairwise comparison mode for A/B testing system versions
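
One possible shape for the per-dimension judge is sketched below, assuming the model is asked to reply in JSON with its reasoning first and three 0-2 scores after. Every name here (`JudgeVerdict`, `buildJudgePrompt`, `parseVerdict`, `overallScore`) is hypothetical and does not reflect the current contents of evals/llm-judge.ts. The actual model call is left out.

```typescript
type DimensionScore = 0 | 1 | 2;

// Hypothetical verdict shape: reasoning comes first so the model explains
// before it scores.
interface JudgeVerdict {
  reasoning: string;
  mechanicMatch: DimensionScore;
  themeMatch: DimensionScore;
  constraintSatisfaction: DimensionScore;
}

function buildJudgePrompt(query: string, results: string[]): string {
  return [
    "You are judging board game recommendations.",
    `Query: ${query}`,
    `Results: ${results.join(", ")}`,
    "First explain your reasoning, then score each dimension from 0 to 2",
    "(0 = poor, 1 = partial, 2 = good).",
    'Reply as JSON: {"reasoning": string, "mechanicMatch": 0|1|2,',
    '"themeMatch": 0|1|2, "constraintSatisfaction": 0|1|2}',
  ].join("\n");
}

function isDimensionScore(value: unknown): value is DimensionScore {
  return value === 0 || value === 1 || value === 2;
}

// Returns null when the reply is not valid JSON or has out-of-range scores.
function parseVerdict(raw: string): JudgeVerdict | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;
  const { reasoning, mechanicMatch, themeMatch, constraintSatisfaction } =
    data as Record<string, unknown>;
  if (
    typeof reasoning !== "string" ||
    !isDimensionScore(mechanicMatch) ||
    !isDimensionScore(themeMatch) ||
    !isDimensionScore(constraintSatisfaction)
  ) {
    return null;
  }
  return { reasoning, mechanicMatch, themeMatch, constraintSatisfaction };
}

// Normalized 0..1 summary, kept alongside (not instead of) the dimensions.
function overallScore(v: JudgeVerdict): number {
  return (v.mechanicMatch + v.themeMatch + v.constraintSatisfaction) / 6;
}
```

Rejecting malformed replies instead of coercing them keeps a bad judge response from silently becoming a score.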

If they want more/better test cases:

Identify which categories are underrepresented (party-game has only 4 cases!)
Add hand-curated cases in generate-cases.ts for the weakest categories
Run npm run eval:generate-massive to fill gaps with LLM-generated cases
Validate new cases against the database
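
Finding underrepresented categories can be as simple as thresholding the case counts. The sketch below uses the counts from the category table. The `underrepresented` helper and the threshold of 100 cases are assumptions chosen for illustration, not values from the eval system.

```typescript
// Case counts copied from the category performance table.
const caseCounts: Record<string, number> = {
  "mechanic-focused": 530,
  "multi-constraint": 384,
  "theme-focused": 356,
  "video-game": 262,
  "similar-to": 212,
  "mood-vibe": 189,
  "player-count": 177,
  "time-constraint": 164,
  "free-text-intent": 159,
  "edge-case": 153,
  "negative-preference": 123,
  "designer-search": 116,
  "complexity": 112,
  "real-user-feedback": 78,
  "regression": 9,
  "party-game": 4,
};

// Hypothetical helper: categories below minCases, smallest first.
function underrepresented(
  counts: Record<string, number>,
  minCases: number,
): string[] {
  return Object.entries(counts)
    .filter(([, n]) => n < minCases)
    .sort((a, b) => a[1] - b[1])
    .map(([name]) => name);
}

console.log(underrepresented(caseCounts, 100));
// ["party-game", "regression", "real-user-feedback"]
```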

If they want better reporting/visualization:

Read evals/summary.ts, evals/compare-runs.ts, evals/analyze-failures.ts
Add new views (e.g., trend over time, per-game analysis, category drill-down)
Consider adding an HTML report generator for richer visualization

Important Rules

READ the relevant files thoroughly before making changes
ALWAYS regenerate cases.json after modifying generators (run the appropriate generate script)
ALWAYS run a quick eval after changes to verify nothing broke: source .env.local && npx tsx evals/runner.ts --quick --no-judge
Document what you changed in evals/EVAL-WORKLOG.md
Do NOT change engine code -- this skill is about the eval system only

Improve the eval system quality -- add better cases, fix broken cases, improve the LLM judge, add new metrics, or refine pass/fail criteria based on eval run data. Use when the user wants to make the eval system itself better.

testing

Updated Apr 16, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add stopitdan/recommendagame enhance-evals

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Clean: Trivy - Container and dependency vulnerability scanner (95%)
Clean: Semgrep - Static code analysis for vulnerabilities (95%)
Clean: mcp-scan (Snyk) - Model Context Protocol security validation (95%)
Skipped: Snyk (dep) - Open source security scanning (50%)
Skipped: Socket.dev - Supply chain security analysis (50%)
Skipped: VirusTotal - Multi-engine malware detection (50%)
Skipped: CrowdStrike - Advanced threat intelligence (50%)
Skipped: OSV-Scanner - Open Source Vulnerability database check (50%)
Skipped: OWASP Dep-Check (50%)

Last scanned: Apr 16, 2026, 2:19 PM (31.2s, 1 file scanned)

SKILL.md

name: enhance-evals
description: Improve the eval system quality -- add better cases, fix broken cases, improve the LLM judge, add new metrics, or refine pass/fail criteria based on eval run data. Use when the user wants to make the eval system itself better.
disable-model-invocation: true
argument-hint: [area to improve]
allowed-tools: Bash Read Write Edit Glob Grep Agent
effort: high

Install manually

# Clone the repo
git clone https://github.com/stopitdan/recommendagame.git

# Copy into the Claude Code skills folder (global), creating it if needed
mkdir -p ~/.claude/skills
cp -r recommendagame/.claude/skills/enhance-evals ~/.claude/skills/

Repository

stopitdan/recommendagame

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT