Adoption

Agent Skills are supported by leading AI development tools.

VS Code, Gemini CLI, GitHub, Goose, Amp, Cursor, Claude Code, Letta, OpenCode, Claude, OpenAI Codex, Factory

stopitdan/increase-evals

Name: increase-evals
Author: stopitdan

.claude/skills/increase-evals/SKILL.md

npx skillsauth add stopitdan/recommendagame increase-evals

Increase Eval Test Cases

You are expanding the boredgame.lol recommendation engine eval suite. The goal is to have 5,000+ diverse, high-quality test cases that cover every conceivable query a real user might type.

Current State

3,028 cases currently in evals/cases.json
Target: 5,000 cases minimum
Gap: ~1,972 more needed

Current Category Distribution

| Category | Count | Notes |
|----------|-------|-------|
| mechanic-focused | 530 | Good coverage |
| multi-constraint | 384 | Good |
| theme-focused | 356 | Good |
| video-game | 262 | Good |
| similar-to | 212 | Good |
| mood-vibe | 189 | Could use more |
| player-count | 177 | Good |
| time-constraint | 164 | Good |
| free-text-intent | 159 | Could use more |
| edge-case | 153 | Good |
| negative-preference | 123 | Could use more |
| designer-search | 116 | Could use more |
| complexity | 112 | Could use more |
| real-user-feedback | 78 | NEEDS MORE -- these are the most valuable |
| regression | 9 | NEEDS MORE -- known failure probes |
| party-game | 4 | NEEDS MUCH MORE |
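The distribution above can be recomputed from the case list whenever new cases land. The following is a minimal sketch, assuming each entry in evals/cases.json carries a single `category` string; the `EvalCase` interface and both function names are illustrative and not taken from the repository.

```typescript
// Assumed minimal shape of one entry in evals/cases.json; the real schema may differ.
interface EvalCase {
  query: string;
  category: string; // assumption: one category label per case
}

// Tally how many cases fall into each category.
function countByCategory(cases: EvalCase[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const c of cases) {
    counts[c.category] = (counts[c.category] || 0) + 1;
  }
  return counts;
}

// Return categories with fewer than `minCount` cases, smallest first.
function weakCategories(counts: Record<string, number>, minCount: number): string[] {
  return Object.keys(counts)
    .filter((cat) => counts[cat] < minCount)
    .sort((a, b) => counts[a] - counts[b]);
}
```

With the counts from the table and a threshold of 100, `weakCategories` would return party-game, regression, and real-user-feedback, in that order.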

What To Do

Option 1: Run the LLM generator (fastest way to add ~2,000 cases)

The generate-massive.ts script has 7 additional batch definitions (batches 24-30) that haven't been run yet. These target:

Mechanic combo queries (150)
Theme + mechanic combos (150)
Known failure mode probes / regression tests (100)
Party and social scenarios (100)
Specific video game queries (150)
Diverse natural language (150)
Final diverse fill (200)

source .env.local && npx tsx evals/generate-massive.ts

This loads the existing 3,028 cases and adds the new batches on top. It uses GPT-4o-mini with a 120s timeout and a batch size of 30.
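It can help to check the arithmetic before running the generator. The per-batch targets listed above sum to 1,000, which is less than the ~1,972-case gap, so a further pass may be needed. The sketch below works that out; it assumes "batch size 30" means at most 30 cases per LLM call, which the source does not state explicitly.

```typescript
// Per-batch targets copied from the list above (batches 24-30).
const batchTargets: number[] = [150, 150, 100, 100, 150, 150, 200];

const CURRENT_CASES = 3028; // from "Current State"
const TARGET_CASES = 5000;
const BATCH_SIZE = 30; // assumption: cases requested per LLM call

function sum(values: number[]): number {
  return values.reduce((acc, v) => acc + v, 0);
}

// Number of LLM calls if each call yields at most `batchSize` cases.
function callsNeeded(targets: number[], batchSize: number): number {
  return sum(targets.map((t) => Math.ceil(t / batchSize)));
}

const plannedNewCases = sum(batchTargets); // 1000
const projectedTotal = CURRENT_CASES + plannedNewCases; // 4028
const remainingGap = Math.max(0, TARGET_CASES - projectedTotal); // 972
const llmCalls = callsNeeded(batchTargets, BATCH_SIZE); // 35
```

Under these assumptions, the seven batches alone would leave the suite 972 cases short of 5,000, so Option 2 or additional batch definitions would have to cover the rest.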

After generation, verify the count:

node -e "console.log(JSON.parse(require('fs').readFileSync('evals/cases.json')).length)"

Option 2: Add hand-curated cases for weak categories

The most impactful categories to expand:

party-game (only 4 cases!) -- Add 50+ cases covering different group sizes, settings, audiences
regression (only 9 cases) -- Add 50+ cases probing every known failure mode
real-user-feedback (78 cases) -- Add more messy, real-world style queries
designer-search (116 cases) -- Add more designers, varied phrasings

Edit evals/generate-cases.ts to add cases with the addCase() function, then regenerate:

npx tsx evals/generate-cases.ts
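The real `addCase()` signature lives in evals/generate-cases.ts and is not shown here, so the following is only a sketch of what hand-curated entries might look like. `addCaseSketch`, the `NewEvalCase` interface, and the `query` and `category` field names are hypothetical stand-ins; only `shouldInclude` and `shouldNotInclude` are named in this skill.

```typescript
// Hypothetical case shape; field names other than shouldInclude and
// shouldNotInclude are assumptions, not the repository's schema.
interface NewEvalCase {
  query: string;
  category: string;
  shouldInclude: string[];
  shouldNotInclude: string[];
}

const pendingCases: NewEvalCase[] = [];

// Stand-in for the real addCase(), whose signature may differ.
function addCaseSketch(c: NewEvalCase): void {
  pendingCases.push(c);
}

// A party-game case: messy phrasing, no forced positive match.
addCaseSketch({
  query: "need somthing loud and dumb for 8 ppl at a bday party",
  category: "party-game",
  shouldInclude: [], // most cases should have 0-1 entries
  shouldNotInclude: ["Chess"], // clearly wrong for a party
});

// A similar-to case with exactly one obvious match.
addCaseSketch({
  query: "quick 2 player card game, like 7 wonders but just for us two",
  category: "similar-to",
  shouldInclude: ["7 Wonders Duel"],
  shouldNotInclude: ["Twilight Imperium"], // clearly wrong for a quick game
});
```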

Option 3: User specifies a count or category

If the user said /increase-evals 500 or /increase-evals party-game:

For a number: generate that many additional cases via generate-massive.ts (adjust batch counts)
For a category: focus generation on that specific category
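One way to tell the two argument forms apart is to treat an all-digit argument as a count and anything else as a category name. This is a sketch under that assumption; `parseSkillArgument` and the `SkillArgument` type are illustrative names, and the skill itself does not prescribe a parsing rule.

```typescript
type SkillArgument =
  | { kind: "count"; count: number }
  | { kind: "category"; category: string }
  | { kind: "none" };

// Classify the text that follows /increase-evals.
function parseSkillArgument(raw: string): SkillArgument {
  const trimmed = raw.trim();
  if (trimmed === "") {
    return { kind: "none" };
  }
  // All digits: interpret as the number of cases to add.
  if (/^\d+$/.test(trimmed)) {
    const count = parseInt(trimmed, 10);
    return count > 0 ? { kind: "count", count } : { kind: "none" };
  }
  // Anything else: interpret as a category such as "party-game".
  return { kind: "category", category: trimmed.toLowerCase() };
}
```

A category result would still need to be checked against the known category names before generation is focused on it.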

Quality Rules for New Cases

Queries must feel like real humans typed them -- messy, informal, with personality. Not "Please recommend a worker placement game with medium complexity" but "whats a good WP game thats not too brainy"
shouldInclude max 2 games -- only the absolute most obvious matches. Most cases should have 0-1.
Use exact BGG game names -- "7 Wonders Duel" not "7 wonders duel"
shouldNotInclude should be clearly wrong -- UNO for strategy, Chess for party games, Twilight Imperium for quick games
~20% of queries should have typos/casual language -- "dekc biulder", "workar playsment", "somthing chill"
Every query must be unique -- no duplicates or trivial rephrasing
Include constraints where natural -- "for 2 players", "under 30 minutes", "not too complex"
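Two of the rules above are mechanical enough to check automatically: the cap of two `shouldInclude` games and query uniqueness. The sketch below assumes a case shape with `query`, `shouldInclude`, and `shouldNotInclude` fields; `findRuleViolations` is a hypothetical helper. It only catches duplicates that differ in case or punctuation, so trivial rephrasings still need a human or LLM review.

```typescript
// Assumed case shape; the real schema in evals/cases.json may differ.
interface CandidateCase {
  query: string;
  shouldInclude: string[];
  shouldNotInclude: string[];
}

// Lowercase and collapse punctuation so near-identical queries compare equal.
function normalizeQuery(q: string): string {
  return q.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();
}

// Return one message per violated rule; an empty array means no issues found.
function findRuleViolations(cases: CandidateCase[]): string[] {
  const problems: string[] = [];
  const firstSeenAt: Record<string, number> = {};
  cases.forEach((c, i) => {
    if (c.shouldInclude.length > 2) {
      problems.push(`case ${i}: shouldInclude has more than 2 games`);
    }
    const key = normalizeQuery(c.query);
    if (Object.prototype.hasOwnProperty.call(firstSeenAt, key)) {
      problems.push(`case ${i}: duplicate query of case ${firstSeenAt[key]}`);
    } else {
      firstSeenAt[key] = i;
    }
  });
  return problems;
}
```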

After Adding Cases

Verify case count: node -e "console.log(JSON.parse(require('fs').readFileSync('evals/cases.json')).length)"
Run a quick eval to check: source .env.local && npx tsx evals/runner.ts --quick --no-judge
Update the worklog: Add a note to evals/EVAL-WORKLOG.md about what was added and why
Tell the user exactly how many cases were added and the new total
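The final report is simple arithmetic on the case count before and after the change. A minimal sketch follows; `summarizeAddition` and its message format are hypothetical, not a format the skill requires.

```typescript
// Build a one-line summary from the case counts before and after generation.
function summarizeAddition(before: number, after: number, target: number): string {
  const added = after - before;
  const remaining = Math.max(0, target - after);
  return `Added ${added} cases; new total ${after} (${remaining} short of ${target}).`;
}
```

For example, `summarizeAddition(3028, 4028, 5000)` yields "Added 1000 cases; new total 4028 (972 short of 5000)."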

Generate more eval test cases to expand coverage. Targets 5,000+ total cases. Use when the user wants more test cases for the recommendation engine evaluation suite.

testing

Updated Apr 16, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add stopitdan/recommendagame increase-evals

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Status | Scanner | Description | Percentage |
|--------|---------|-------------|------------|
| Clean | Trivy | Container and dependency vulnerability scanner | 95% |
| Clean | Semgrep | Static code analysis for vulnerabilities | 95% |
| Clean | mcp-scan (Snyk) | Model Context Protocol security validation | 95% |
| Skipped | Snyk (dep) | Open source security scanning | 50% |
| Skipped | Socket.dev | Supply chain security analysis | 50% |
| Skipped | VirusTotal | Multi-engine malware detection | 50% |
| Skipped | CrowdStrike | Advanced threat intelligence | 50% |
| Skipped | OSV-Scanner | Open Source Vulnerability database check | 50% |
| Skipped | OWASP Dep-Check | | 50% |

Last scanned: Apr 16, 2026, 2:19 PM (31.7s, 1 file scanned)

SKILL.md

name: increase-evals
description: Generate more eval test cases to expand coverage. Targets 5,000+ total cases. Use when the user wants more test cases for the recommendation engine evaluation suite.
disable-model-invocation: true
argument-hint: [count | category]
allowed-tools: Bash Read Write Edit Glob Grep

Install manually

# Clone the repo
git clone https://github.com/stopitdan/recommendagame.git

# Copy into Claude Code skills folder (global)
cp -r recommendagame/.claude/skills/increase-evals ~/.claude/skills/

Claude Code Skills: official skills path docs.

Repository

stopitdan/recommendagame

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT