
chelch5/skill-eval-runner

Name: skill-eval-runner
Author: chelch5

01-package-scaffolding/skill-eval-runner/SKILL.md

npx skillsauth add chelch5/skilllibrary skill-eval-runner


Skill Eval Runner

Executes a skill's eval suite and produces a structured quality report.

Procedure

1. Locate test suite

Check for eval files in the skill directory:

evals/triggers.yaml — trigger accuracy tests
evals/outputs.yaml — behavior correctness tests
evals/baselines.yaml — baseline comparison tests

Note which exist and which are missing.
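The lookup in this step can be sketched as follows. This is a minimal illustration, assuming the three files sit directly under an `evals/` folder inside the skill directory; the function name `locate_eval_files` is hypothetical and not part of the skill.

```python
from pathlib import Path

# File names are taken from the step above; everything else is illustrative.
EVAL_FILES = ("triggers.yaml", "outputs.yaml", "baselines.yaml")


def locate_eval_files(skill_dir):
    """Return (present, missing) lists of eval file names under <skill_dir>/evals."""
    evals_dir = Path(skill_dir) / "evals"
    present = [name for name in EVAL_FILES if (evals_dir / name).is_file()]
    missing = [name for name in EVAL_FILES if name not in present]
    return present, missing
```

If `present` comes back empty, no eval suite exists and nothing can be evaluated.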

2. Run trigger tests

For each case in triggers.yaml:

Positive cases: Does the skill fire? (expected: yes)
Negative cases: Does the skill NOT fire? (expected: no)

Record: prompt, expected, actual, Pass/Fail.
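One way to hold the recorded fields is a small record type. The sketch below is an assumption about shape only: the class name `TriggerResult` and the confusion-matrix labels are hypothetical, chosen so that each case can later be counted as a true or false positive or negative.

```python
from dataclasses import dataclass


@dataclass
class TriggerResult:
    prompt: str
    expected: bool  # True for a positive case (the skill should fire)
    actual: bool    # True if the skill actually fired

    @property
    def passed(self):
        """A case passes when the observed behavior matches the expectation."""
        return self.expected == self.actual

    @property
    def outcome(self):
        """Confusion-matrix label for this case: TP, FP, FN, or TN."""
        if self.actual:
            return "TP" if self.expected else "FP"
        return "FN" if self.expected else "TN"
```

A negative case where the skill fired anyway is recorded as `TriggerResult(prompt, expected=False, actual=True)`, which fails and counts as a false positive.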

3. Run output tests

For each case in outputs.yaml:

Run the skill with the given input
Check against expected_sections, required_patterns, forbidden_patterns

Record: test name, checks passed/total, Pass/Fail.
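The three check types can be sketched as one function that returns the passed and total counts for a case. The matching rules here are assumptions, since the skill does not define them: a section counts as present if its text appears anywhere in the output, and patterns are treated as regular expressions. The function name `check_output` is hypothetical.

```python
import re


def check_output(output, expected_sections=(), required_patterns=(), forbidden_patterns=()):
    """Return (passed, total) over all checks for one output test case."""
    checks = []
    # Assumption: a section is present if its heading text occurs in the output.
    checks += [section in output for section in expected_sections]
    # Assumption: patterns are regular expressions searched anywhere in the output.
    checks += [re.search(p, output) is not None for p in required_patterns]
    checks += [re.search(p, output) is None for p in forbidden_patterns]
    return sum(checks), len(checks)
```

Whether a case needs every check to pass, or only a share of them, is left to the suite author; the counts above support either policy.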

4. Run baseline comparison

For each case in baselines.yaml:

Run with skill active vs without skill
Does the skill add value? (Yes if the skill's output includes two or more elements that the baseline output lacks.)
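The "2+ elements" rule can be sketched as a comparison of the two outputs. This assumes each baseline case lists the elements to look for and that an element is detected by plain substring match; both are illustrative choices, and `skill_adds_value` is a hypothetical name.

```python
def skill_adds_value(skill_output, baseline_output, elements, threshold=2):
    """Return (adds_value, gained), where gained lists the elements found only with the skill."""
    # Assumption: an element is "present" if its text occurs in the output.
    gained = [e for e in elements if e in skill_output and e not in baseline_output]
    return len(gained) >= threshold, gained
```

Returning the list of gained elements alongside the Yes/No answer makes it easier to explain the result in the report.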

5. Aggregate results

Calculate:

Trigger precision: TP / (TP + FP)
Trigger recall: TP / (TP + FN)
Output pass rate: passed / total checks
Baseline win: Yes/No
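As a worked sketch of the two trigger formulas: the function name `precision_recall` is hypothetical, and returning `None` when a denominator is zero is an assumption, since the skill does not say how to handle that case.

```python
def precision_recall(tp, fp, fn):
    """Compute trigger precision and recall from confusion-matrix counts."""
    # Precision: of the times the skill fired, how often it should have.
    precision = tp / (tp + fp) if (tp + fp) else None
    # Recall: of the times the skill should have fired, how often it did.
    recall = tp / (tp + fn) if (tp + fn) else None
    return precision, recall
```

For example, 8 correct firings, 2 spurious firings, and no missed firings give a precision of 8 / 10 = 80% and a recall of 8 / 8 = 100%.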

6. Issue verdict

| Verdict | Criteria |
|---------|----------|
| Pass | All rates ≥80% AND baseline win |
| Pass with issues | Any rate 60-79% |
| Fail | Any rate <60% OR baseline lose |
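The verdict criteria can be sketched as a small function. The order of evaluation is an assumption: Fail is checked first, then Pass, and anything left over is treated as Pass with issues, which also covers rates that fall between 79% and 80%. The function name `verdict` is hypothetical, and rates are passed as fractions between 0 and 1.

```python
def verdict(rates, baseline_win):
    """Map the aggregate rates (precision, recall, output pass rate) to a verdict."""
    rates = list(rates)
    # Fail takes priority: a baseline loss or any rate below 60%.
    if not baseline_win or any(r < 0.60 for r in rates):
        return "Fail"
    # Pass requires every rate at 80% or above, plus the baseline win checked above.
    if all(r >= 0.80 for r in rates):
        return "Pass"
    # Otherwise at least one rate sits in the middle band.
    return "Pass with issues"
```

For instance, `verdict([0.9, 0.7, 1.0], baseline_win=True)` yields "Pass with issues" because one rate sits in the 60-79% band.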

Output contract

## Eval Report: [skill-name]
Date: [YYYY-MM-DD]

### Trigger Tests
| Prompt | Type | Expected | Actual | Result |
|--------|------|----------|--------|--------|
Precision: X%  Recall: Y%

### Output Tests
| Test | Checks Passed | Result |
|------|--------------|--------|
Pass rate: X%

### Baseline Comparison
Skill adds value: [Yes/No]

### Verdict: [Pass | Pass with issues | Fail]
Issues: [list or "None"]

Failure handling

Silent skill failure (no output): Record Fail, halt remaining tests for that case
Missing test files: Report which are missing, run available ones, note incomplete coverage
Flaky results: Run 3x, use majority result, note flakiness in report
All tests missing: Cannot evaluate — report "No eval suite found" and recommend building one
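The flaky-result rule (run 3x, use the majority) can be sketched as below. It assumes a test case can be wrapped in a zero-argument callable that returns its result; `majority_result` and `run_case` are hypothetical names.

```python
from collections import Counter


def majority_result(run_case, runs=3):
    """Run a case several times; return (majority_result, flaky)."""
    results = [run_case() for _ in range(runs)]
    # most_common(1) returns the single most frequent result and its count.
    majority, _ = Counter(results).most_common(1)[0]
    # The case is flaky if the runs did not all agree.
    flaky = len(set(results)) > 1
    return majority, flaky
```

With three runs and a Pass/Fail result there is always a strict majority; the `flaky` flag is what gets noted in the report.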

References

Anthropic eval overview: https://docs.anthropic.com/en/docs/test-and-evaluate/eval-overview
OpenAI evals: https://developers.openai.com/docs/evals/getting-started

Run trigger tests, behavior tests, and baseline comparisons for a skill's eval suite, then produce a structured quality verdict. Use when a skill has been modified and needs regression testing, when CI/pre-release validation requires documented eval results, or when measuring quality before catalog inclusion. Do not use when no evals exist yet (build them first) or for manual evaluation without test files.

development

Updated Apr 20, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add chelch5/skilllibrary skill-eval-runner

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Reported % |
|---------|-------------|--------|------------|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Apr 20, 2026, 3:19 AM (16.8s, 11 files scanned)

SKILL.md

name: skill-eval-runner
description: Run trigger tests, behavior tests, and baseline comparisons for a skill's eval suite, then produce a structured quality verdict. Use when a skill has been modified and needs regression testing, when CI/pre-release validation requires documented eval results, or when measuring quality before catalog inclusion. Do not use when no evals exist yet (build them first) or for manual evaluation without test files.
license: Apache-2.0
clients: [openai-codex, gemini-cli, opencode, github-copilot]

Related Skills

chelch5/context-intelligence (testing, updated Apr 20, 2026)
Manages context window budgets, loading strategies, and compaction techniques for AI-assisted coding sessions. Trigger on 'context window', 'what to load', 'context management', 'context overflow', 'token budget'. DO NOT USE for loading specific project docs into agent context (use project-context) or prompt wording and optimization (use prompt-crafting).

chelch5/auth-patterns (development, updated Apr 20, 2026)
Implements authentication, session, token, and authorization patterns for the current stack. Trigger on 'add auth', 'JWT', 'OAuth', 'login endpoint', 'session management', 'API key auth'. DO NOT USE for OWASP hardening checklists (use security-hardening), threat modeling (use security-threat-model), or secret rotation/storage (use security-best-practices).

chelch5/api-schema (tools, updated Apr 20, 2026)
Defines request/response shapes, versioning, validation, and compatibility rules for API-first work. Trigger on 'design API', 'OpenAPI spec', 'REST schema', 'API versioning', 'generate client SDK'. DO NOT USE for GraphQL schemas, gRPC/protobuf definitions (use stack-standards), auth endpoint logic (use auth-patterns), or external API client wrappers (use external-api-client).

chelch5/ticket-pack-builder (development, updated Apr 20, 2026)
Create a repo-local ticket system with an index, machine-readable manifest, board, and individual ticket files. Use when a repo needs task decomposition that autonomous agents can follow without re-planning the whole project each session. Do not use for executing tickets (use ticket-execution) or quick fixes that don't warrant formal tickets.

Install manually

# Clone the repo
git clone https://github.com/chelch5/skilllibrary.git

# Make sure the global Claude Code skills folder exists, then copy the skill into it
mkdir -p ~/.claude/skills
cp -r skilllibrary/01-package-scaffolding/skill-eval-runner ~/.claude/skills/

Repository: chelch5/skilllibrary

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT