Adoption

Agent Skills are supported by leading AI development tools.

VS Code, Gemini CLI, GitHub, Goose, Amp, Cursor, Claude Code, Letta, OpenCode, Claude, OpenAI Codex, Factory

stopitdan/run-evals

Name: run-evals
Author: stopitdan

.claude/skills/run-evals/SKILL.md

npx skillsauth add stopitdan/recommendagame run-evals

Run Recommendation Engine Evals

You are running the boredgame.lol recommendation engine evaluation suite. This is a comprehensive test system with 3,000+ eval cases across 16 categories that tests whether the engine returns the right games for user queries.

Context You MUST Know

- The eval system lives in evals/
- Cases are in evals/cases.json (currently ~3,028 cases)
- Results are persisted in evals/runs/ (JSON) and evals/logs/ (human-readable)
- Each run automatically compares against the previous run for regression detection
- The LLM judge (GPT-4o-mini) scores each result set 0-10
- A case PASSES if: no ideal games (relevance >= 2) are missing from top 10, no anti-games appear, no constraint violations in top 5, and API returns results
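
The four pass conditions above can be read as a single predicate. The sketch below is illustrative only: the `EvalCase` and `EvalResult` shapes and their field names are hypothetical, not taken from the code in evals/, and it assumes anti-games are checked against the whole returned list because the rule gives no cutoff for them.

```typescript
// Hypothetical shapes; the real types in evals/ may differ.
interface EvalCase {
  idealGames: { name: string; relevance: number }[]; // relevance >= 2 counts as "ideal"
  antiGames: string[];
}

interface EvalResult {
  rankedGames: string[];          // engine output, best match first
  constraintViolations: string[]; // returned games that break a stated constraint
}

function casePasses(c: EvalCase, r: EvalResult): boolean {
  // The API must return results at all.
  if (r.rankedGames.length === 0) return false;

  const returned = new Set(r.rankedGames);
  const top10 = new Set(r.rankedGames.slice(0, 10));
  const top5 = new Set(r.rankedGames.slice(0, 5));

  // No ideal game (relevance >= 2) may be missing from the top 10.
  const missingIdeal = c.idealGames.some(
    (g) => g.relevance >= 2 && !top10.has(g.name)
  );
  // No anti-game may appear (assumption: anywhere in the returned list).
  const hasAntiGame = c.antiGames.some((name) => returned.has(name));
  // No constraint violation may sit in the top 5.
  const violationInTop5 = r.constraintViolations.some((name) => top5.has(name));

  return !missingIdeal && !hasAntiGame && !violationInTop5;
}
```

Because every condition must hold, a single missing ideal game is enough to fail a case even when the rest of the ranking is good.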

Current Baseline (from 307-case run on 2026-04-05)

- Pass Rate: 68.4% (210/307)
- LLM Judge: 7.14/10
- NDCG@10: 0.9855
- Constraint Violations: 1.0%
- Weakest categories: mechanic-focused (32%), mood-vibe (29%), designer-search (42%)
- Strongest: edge-case (100%), video-game (100%), theme-focused (86%)
- #1 failure mode: missing famous games (89 of 97 failures)
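
The baseline figures are internally consistent: 210/307 rounds to 68.4%, and 307 - 210 gives the 97 failures. For readers unfamiliar with NDCG@10, here is a minimal sketch of the textbook definition using linear gain. Whether the eval suite uses this gain function, or how it picks the ideal ordering, is not stated in the skill, so treat the function names and details as assumptions.

```typescript
// Discounted cumulative gain over the first k relevance grades (linear gain).
function dcg(rels: number[], k: number): number {
  return rels
    .slice(0, k)
    .reduce((sum, rel, i) => sum + rel / Math.log2(i + 2), 0);
}

// rankedRels: relevance grades of the returned games, in rank order.
// judgedRels: all known relevance grades for the case, in any order.
function ndcgAtK(rankedRels: number[], judgedRels: number[], k = 10): number {
  const ideal = [...judgedRels].sort((a, b) => b - a); // best possible ordering
  const idcg = dcg(ideal, k);
  return idcg === 0 ? 0 : dcg(rankedRels, k) / idcg;
}
```

A ranking that already lists the judged games in descending relevance scores 1; any other order scores lower.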

What To Do

Ensure the dev server is running on localhost:1337. If not, start it:
```
npm run dev
```
Wait for it to be ready before proceeding.

Run the eval suite based on user args:
- Default (full suite with judge): source .env.local && npx tsx evals/runner.ts --concurrency=5
- Quick (50 cases, no judge): source .env.local && npx tsx evals/runner.ts --quick --concurrency=8
- Specific category: source .env.local && npx tsx evals/runner.ts --category=$ARGUMENTS --concurrency=5
- With limit: source .env.local && npx tsx evals/runner.ts $ARGUMENTS
If the user passed arguments like --quick or --category=mechanic-focused, pass them through.
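
The mapping from user arguments to a runner invocation can be sketched as a small helper. `buildRunnerCommand` is a hypothetical name and not part of the repository. It only reproduces the four command lines listed above, and it assumes that an explicit --concurrency from the user should win over the defaults.

```typescript
const RUNNER = "source .env.local && npx tsx evals/runner.ts";

// Hypothetical helper: picks the concurrency used in the listed commands
// and passes every user argument through unchanged.
function buildRunnerCommand(args: string[]): string {
  const hasOwnConcurrency = args.some((a) => a.startsWith("--concurrency="));

  let concurrency: number | null = null;
  if (!hasOwnConcurrency) {
    if (args.length === 0) concurrency = 5;                       // default run
    else if (args.some((a) => a === "--quick")) concurrency = 8;  // quick run
    else if (args.some((a) => a.startsWith("--category="))) concurrency = 5;
    // Anything else (for example a limit) is passed through as given.
  }

  const parts = [RUNNER, ...args];
  if (concurrency !== null) parts.push(`--concurrency=${concurrency}`);
  return parts.join(" ");
}
```

The sketch does no shell quoting, so it is only suitable for the simple flag-style arguments shown in the list.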

After the run completes, show the summary:

source .env.local && npx tsx evals/summary.ts

Compare with previous run (if one exists):

source .env.local && npx tsx evals/compare-runs.ts

Run failure analysis:

source .env.local && npx tsx evals/analyze-failures.ts

Present results to the user in a clear, organized format:
- Overall pass rate and LLM judge score
- Category breakdown (worst to best)
- Regression comparison (what improved, what regressed)
- Top 10 worst cases with details
- Most commonly missing games
- Specific constraint violations found
- Concrete recommendations for what to fix next
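
Two items in the list above, the worst-to-best category breakdown and the regression comparison, come down to sorting by pass rate and diffing two runs. The sketch below assumes a hypothetical `CategoryStats` shape. The actual JSON layout in evals/runs/ is not described in the skill, and evals/compare-runs.ts may already produce this output.

```typescript
// Hypothetical per-category summary; field names are illustrative.
interface CategoryStats {
  category: string;
  passed: number;
  total: number;
}

interface CategoryDelta {
  category: string;
  delta: number; // change in pass rate, current minus previous
}

function passRate(s: CategoryStats): number {
  return s.total === 0 ? 0 : s.passed / s.total;
}

// Category breakdown, lowest pass rate first.
function worstToBest(stats: CategoryStats[]): CategoryStats[] {
  return [...stats].sort((a, b) => passRate(a) - passRate(b));
}

// Pass-rate change per category present in both runs, most regressed first.
function compareRuns(prev: CategoryStats[], curr: CategoryStats[]): CategoryDelta[] {
  const before = new Map<string, number>();
  for (const s of prev) before.set(s.category, passRate(s));

  const deltas: CategoryDelta[] = [];
  for (const s of curr) {
    const old = before.get(s.category);
    if (old !== undefined) {
      deltas.push({ category: s.category, delta: passRate(s) - old });
    }
  }
  return deltas.sort((a, b) => a.delta - b.delta);
}
```

Negative deltas are regressions and positive deltas are improvements. Categories that exist in only one run are skipped rather than treated as a change.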

Important Rules

- ALWAYS source .env.local before running eval scripts
- NEVER modify engine code during this skill -- only observe and report
- If the server is not running or requests fail, tell the user to start npm run dev first
- If many cases show "api-error", the server is likely overloaded -- suggest reducing concurrency
- The 307-case definitive baseline from 2026-04-05T04-41-09 is the comparison point
- Read evals/RECOMMENDATIONS.md if the user asks what to fix
- Read evals/EVAL-OVERVIEW.md for the full system documentation
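
The "api-error" rule above leaves "many" undefined. One way to make it concrete is shown below as a sketch only: the 20% threshold, the halving step, and the `suggestConcurrency` name are all assumptions chosen for illustration, not values from the eval system.

```typescript
// Hypothetical heuristic: if the share of "api-error" cases exceeds the
// threshold, suggest halving concurrency (never below 1).
function suggestConcurrency(
  statuses: string[],
  current: number,
  threshold = 0.2 // assumed cutoff for "many"
): number {
  if (statuses.length === 0) return current;
  const apiErrors = statuses.filter((s) => s === "api-error").length;
  const share = apiErrors / statuses.length;
  return share > threshold ? Math.max(1, Math.floor(current / 2)) : current;
}
```

For instance, a run at concurrency 8 in which half the cases report "api-error" would yield a suggestion of 4 under these assumptions.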

Run the recommendation engine evaluation suite, analyze results, and produce a clear summary with comparisons to previous runs. Use when the user wants to test the recommendation engine quality.

testing

Updated Apr 16, 2026

npx skillsauth add stopitdan/recommendagame run-evals

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

- Trivy: Container and dependency vulnerability scanner. Status: Clean (95%)
- Semgrep: Static code analysis for vulnerabilities. Status: Clean (95%)
- mcp-scan (Snyk): Model Context Protocol security validation. Status: Clean (95%)
- Snyk (dep): Open source security scanning. Status: Skipped (50%)
- Socket.dev: Supply chain security analysis. Status: Skipped (50%)
- VirusTotal: Multi-engine malware detection. Status: Skipped (50%)
- CrowdStrike: Advanced threat intelligence. Status: Skipped (50%)
- OSV-Scanner: Open Source Vulnerability database check. Status: Skipped (50%)
- OWASP Dep-Check. Status: Skipped (50%)

Last scanned: Apr 16, 2026, 2:18 PM (11.1s, 1 file scanned)

SKILL.md

name: run-evals
description: Run the recommendation engine evaluation suite, analyze results, and produce a clear summary with comparisons to previous runs. Use when the user wants to test the recommendation engine quality.
disable-model-invocation: true
argument-hint: [--quick | --category=X | --limit=N]
allowed-tools: Bash Read Glob Grep

Related Skills

stopitdan/increase-evals

testing

Generate more eval test cases to expand coverage. Targets 5,000+ total cases. Use when the user wants more test cases for the recommendation engine evaluation suite.

SKILL.md | Updated Apr 16, 2026

stopitdan/enhance-evals

testing

Improve the eval system quality -- add better cases, fix broken cases, improve the LLM judge, add new metrics, or refine pass/fail criteria based on eval run data. Use when the user wants to make the eval system itself better.

SKILL.md | Updated Apr 16, 2026

steipete/skill-creator

testing

Create, edit, improve, or audit AgentSkills. Use when creating a new skill from scratch or when asked to improve, review, audit, tidy up, or clean up an existing skill or SKILL.md file. Also use when editing or restructuring a skill directory (moving files to references/ or scripts/, removing stale content, validating against the AgentSkills spec). Triggers on phrases like "create a skill", "author a skill", "tidy up a skill", "improve this skill", "review the skill", "clean up the skill", "audit the skill".

356,423 | SKILL.md | Updated Apr 13, 2026

steipete/healthcheck

testing

Host security hardening and risk-tolerance configuration for OpenClaw deployments. Use when a user asks for security audits, firewall/SSH/update hardening, risk posture, exposure review, OpenClaw cron scheduling for periodic checks, or version status checks on a machine running OpenClaw (laptop, workstation, Pi, VPS).

356,423 | SKILL.md | Updated Apr 13, 2026

Install manually

# Clone the repo
git clone https://github.com/stopitdan/recommendagame.git

# Copy into Claude Code skills folder (global)
cp -r recommendagame/.claude/skills/run-evals ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

stopitdan/recommendagame

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT