Adoption

Agent Skills are supported by leading AI development tools.

VS Code, Gemini CLI, GitHub, Goose, Amp, Cursor, Claude Code, Letta, OpenCode, Claude, OpenAI Codex, Factory

pskoett/eval-creator

Name: eval-creator
Author: pskoett

plugin/skills/eval-creator/SKILL.md

npx skillsauth add pskoett/pskoett-ai-skills eval-creator

Eval Creator

Turns promoted learnings into permanent eval cases. Runs regression checks to verify promoted rules hold. This is the outer loop's regress-test step.

The blog says: "If a failure taught you something important, it should become a permanent test case. Otherwise the knowledge is still fragile."

When to Use

- After harness-updater promotes a pattern: create an eval for it
- On cadence: run all evals to check for regression
- Before major releases: verify the harness is holding
- When a promoted rule seems to have stopped working: diagnose with a targeted eval run

Eval Directory Structure

.evals/
  EVAL_INDEX.md          # Index of all eval cases with status
  cases/
    eval-YYYYMMDD-001.md # Individual eval case
    eval-YYYYMMDD-002.md
    ...
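
The layout above can be scaffolded with a few lines of Python. This is a minimal sketch, not part of the skill: the helper names `init_evals` and `next_eval_id` are hypothetical, and the numbering rule (next free three-digit suffix for the day) is an assumption based on the `eval-YYYYMMDD-NNN` naming shown in the tree.

```python
from datetime import date
from pathlib import Path

# Header rows copied from the Eval Index Format section of this document.
INDEX_HEADER = (
    "# Eval Index\n\n"
    "| ID | Pattern-Key | Rule Summary | Last Run | Result | Created |\n"
    "|----|-------------|-------------|----------|--------|---------|\n"
)


def init_evals(root: Path) -> Path:
    """Create .evals/cases/ and an empty EVAL_INDEX.md if they are missing."""
    evals = root / ".evals"
    (evals / "cases").mkdir(parents=True, exist_ok=True)
    index = evals / "EVAL_INDEX.md"
    if not index.exists():
        index.write_text(INDEX_HEADER, encoding="utf-8")
    return evals


def next_eval_id(evals: Path, today: date) -> str:
    """Return the next free eval-YYYYMMDD-NNN id for the given day (assumed rule)."""
    prefix = f"eval-{today:%Y%m%d}-"
    used = [
        int(p.stem[len(prefix):])
        for p in (evals / "cases").glob(f"{prefix}*.md")
        if p.stem[len(prefix):].isdigit()
    ]
    return f"{prefix}{max(used, default=0) + 1:03d}"
```

Calling `init_evals(Path("."))` twice is harmless, since existing directories and an existing index are left untouched.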

Creating an Eval Case

Input

From harness-updater or manually:

- Pattern-Key of the promoted learning
- The rule that was added to the project instruction files (CLAUDE.md, AGENTS.md, .github/copilot-instructions.md)
- What to test (the assertion)
- Verification method

Eval Case Format

---
id: eval-YYYYMMDD-NNN
pattern-key: [from learning]
source: [LRN-YYYYMMDD-001, ERR-YYYYMMDD-003]
promoted-rule: "[the rule text in project instruction files]"
promoted-to: CLAUDE.md  # or AGENTS.md, .github/copilot-instructions.md, or equivalent
created: YYYY-MM-DD
last-run: YYYY-MM-DD
last-result: pass | fail | skip
---

## What This Tests

[One sentence: what failure this eval prevents from recurring]

## Precondition

[What must be true for this eval to be runnable]
- File X exists
- Project uses framework Y
- etc.

## Verification Method

[One of: grep-check, command-check, file-check, rule-check; skill-check and script-check are also available as extensions]

### grep-check
Search for a pattern that should (or should not) exist:

target: src/**/*.ts
pattern: "hardcoded-secret-pattern"
expect: not_found


### command-check
Run a command and check the exit code or output:

command: npm run typecheck
expect_exit: 0


### file-check
Verify a file or section exists:

target: CLAUDE.md  # or AGENTS.md, .github/copilot-instructions.md
section: "## Verification"
expect: exists


### rule-check
Verify a rule exists in an instruction file:

target: CLAUDE.md  # or AGENTS.md, .github/copilot-instructions.md
contains: "[the promoted rule text or key phrase]"
expect: found


## Expected Result

**Pass:** [What "good" looks like]
**Fail:** [What regression looks like]

## Recovery Action

If this eval fails:
1. [Specific step to diagnose]
2. [Specific step to fix]
3. Re-run this eval to verify
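
To make the template above machine-readable, the frontmatter can be split from the markdown body with a small parser. The sketch below is illustrative only: `parse_eval_case`, `missing_fields`, and the `REQUIRED` tuple are hypothetical names, the choice of required fields is an assumption, and the parser handles only flat `key: value` lines rather than full YAML.

```python
def parse_eval_case(text: str) -> tuple[dict[str, str], str]:
    """Split an eval case into (frontmatter fields, markdown body)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("eval case must start with a '---' frontmatter line")
    try:
        end = next(i for i in range(1, len(lines)) if lines[i].strip() == "---")
    except StopIteration:
        raise ValueError("frontmatter is not closed with '---'") from None

    fields: dict[str, str] = {}
    for line in lines[1:end]:
        if not line.strip():
            continue
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a 'key: value' line: {line!r}")
        # Drop a trailing "  # comment" and surrounding quotes; values stay strings.
        value = value.split("  #", 1)[0].strip().strip('"')
        fields[key.strip()] = value
    return fields, "\n".join(lines[end + 1:])


# Assumption: these are the fields an eval cannot be run or indexed without.
REQUIRED = ("id", "pattern-key", "promoted-rule", "promoted-to", "created")


def missing_fields(fields: dict[str, str]) -> list[str]:
    """List required frontmatter keys that are absent or empty."""
    return [key for key in REQUIRED if not fields.get(key)]
```

A real project would probably use a YAML library instead; the hand-rolled version only keeps the example free of third-party dependencies.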

Running Evals

Run All

Read .evals/EVAL_INDEX.md, iterate through all cases, execute each verification method.

Run by Pattern-Key

Filter to evals matching a specific pattern.

Run by Area

Filter to evals whose source files match an area (frontend, backend, etc.).
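
The two filters can be pictured as simple list comprehensions over loaded cases. This is a hedged sketch: the eval case format defines no area field, so the `areas` tag on the hypothetical `EvalCase` class is an assumption about how "area" might be recorded, and pattern keys are compared by exact match.

```python
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    id: str
    pattern_key: str
    # Hypothetical tag, e.g. {"frontend"}; not part of the documented format.
    areas: set[str] = field(default_factory=set)


def by_pattern_key(cases: list[EvalCase], key: str) -> list[EvalCase]:
    """Keep evals whose pattern-key equals the requested key."""
    return [case for case in cases if case.pattern_key == key]


def by_area(cases: list[EvalCase], area: str) -> list[EvalCase]:
    """Keep evals tagged with the requested area."""
    return [case for case in cases if area in case.areas]
```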

Execution

For each eval case:

1. Check the precondition; if it is not met, mark the eval as skip.
2. Execute the verification method:
   - grep-check: Use the Grep tool to search target files for the pattern
   - command-check: Run the command via Bash, check exit code and/or output
   - file-check: Use Read/Glob to verify file/section existence
   - rule-check: Read the target file, search for the expected content
   - skill-check: Run quick_validate.py on a skill directory (see Skill Validation below)
   - script-check: Run a custom mcp-script by name (see Custom Verification Methods)
3. Compare the result to the expected value.
4. Update last-run and last-result in the eval case file.
5. Update EVAL_INDEX.md with the result.
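
Outside an agent session, the four built-in methods in the steps above could be approximated roughly as follows. This is a minimal sketch and not the skill's implementation: the skill drives the agent's Grep, Bash, and Read tools, whereas the function names and signatures here (`grep_check`, `command_check`, `file_check`, `rule_check`, `evaluate`) are hypothetical.

```python
import re
import subprocess
from pathlib import Path
from typing import Callable, Optional


def grep_check(root: Path, target_glob: str, pattern: str, expect: str) -> bool:
    """Search files matching target_glob; expect is 'found' or 'not_found'."""
    if expect not in ("found", "not_found"):
        raise ValueError(f"unknown expectation: {expect!r}")
    regex = re.compile(pattern)
    found = any(
        regex.search(path.read_text(encoding="utf-8", errors="ignore"))
        for path in root.glob(target_glob)
        if path.is_file()
    )
    return found == (expect == "found")


def command_check(command: list[str], expect_exit: int, cwd: Path) -> bool:
    """Run a command (already split into arguments) and compare its exit code."""
    proc = subprocess.run(command, cwd=cwd, capture_output=True)
    return proc.returncode == expect_exit


def file_check(root: Path, target: str, section: Optional[str] = None) -> bool:
    """Verify the file exists and, if given, that a heading line equals section."""
    path = root / target
    if not path.is_file():
        return False
    if section is None:
        return True
    text = path.read_text(encoding="utf-8")
    return any(line.strip() == section for line in text.splitlines())


def rule_check(root: Path, target: str, contains: str) -> bool:
    """Verify the instruction file contains the promoted rule text."""
    path = root / target
    return path.is_file() and contains in path.read_text(encoding="utf-8")


def evaluate(precondition_met: bool, check: Callable[[], bool]) -> str:
    """Map a precondition and a check onto the documented pass/fail/skip states."""
    if not precondition_met:
        return "skip"
    return "pass" if check() else "fail"
```

For a command written as a single string, such as `npm run typecheck`, `shlex.split` from the standard library is one way to obtain the argument list.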

Regression Report

## Eval Run: YYYY-MM-DD

**Total:** N evals
**Passed:** N
**Failed:** N
**Skipped:** N

### Failures

#### eval-YYYYMMDD-001 — [pattern-key]
- **What regressed:** [description]
- **Expected:** [X]
- **Got:** [Y]
- **Recovery action:** [from eval case]

### Summary
[All green / N regressions need attention]
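
A report in the shape above can be generated from a list of results. The sketch below is illustrative: the `EvalResult` class and `render_report` function are hypothetical names, and the failure heading is built with the same dash character the template uses (written as the escape `\u2014`).

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    eval_id: str
    pattern_key: str
    status: str          # "pass", "fail", or "skip"
    regressed: str = ""
    expected: str = ""
    got: str = ""
    recovery: str = ""


def render_report(run_date: str, results: list[EvalResult]) -> str:
    """Render the regression report as markdown text."""

    def count(status: str) -> int:
        return sum(r.status == status for r in results)

    failures = [r for r in results if r.status == "fail"]
    lines = [
        f"## Eval Run: {run_date}",
        "",
        f"**Total:** {len(results)} evals",
        f"**Passed:** {count('pass')}",
        f"**Failed:** {count('fail')}",
        f"**Skipped:** {count('skip')}",
        "",
    ]
    if failures:
        lines += ["### Failures", ""]
        for r in failures:
            lines += [
                f"#### {r.eval_id} \u2014 {r.pattern_key}",
                f"- **What regressed:** {r.regressed}",
                f"- **Expected:** {r.expected}",
                f"- **Got:** {r.got}",
                f"- **Recovery action:** {r.recovery}",
                "",
            ]
    summary = "All green" if not failures else f"{len(failures)} regressions need attention"
    lines += ["### Summary", summary]
    return "\n".join(lines) + "\n"
```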

Eval Index Format

.evals/EVAL_INDEX.md:

# Eval Index

| ID | Pattern-Key | Rule Summary | Last Run | Result | Created |
|----|-------------|-------------|----------|--------|---------|
| eval-YYYYMMDD-001 | auth-middleware-lock | Run migrations on test DB first | YYYY-MM-DD | pass | YYYY-MM-DD |
| eval-YYYYMMDD-002 | pnpm-not-npm | Use pnpm in this repo | YYYY-MM-DD | fail | YYYY-MM-DD |
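
Updating one row of this table after a run can be done with plain string handling. This is a sketch under assumptions: `update_index_row` is a hypothetical helper, it expects exactly the six columns shown above, and it does not handle a literal `|` inside the Rule Summary cell.

```python
def update_index_row(index_text: str, eval_id: str, last_run: str, result: str) -> str:
    """Return the index text with Last Run and Result replaced for one eval."""
    updated = []
    found = False
    for line in index_text.splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) == 6 and cells[0] == eval_id:
            # Columns: ID, Pattern-Key, Rule Summary, Last Run, Result, Created
            cells[3], cells[4] = last_run, result
            line = "| " + " | ".join(cells) + " |"
            found = True
        updated.append(line)
    if not found:
        raise KeyError(f"{eval_id} is not listed in the index")
    return "\n".join(updated) + "\n"
```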

Integration

Upstream

- harness-updater flags eval candidates after promoting a pattern
- learning-aggregator identifies patterns with clear pass/fail conditions

Downstream

- Regression failures feed back into self-improvement as new error entries
- Persistent failures may indicate the promoted rule needs refinement → feed back to harness-updater

Scheduled Use

For projects with a CI pipeline, eval-creator can run as a scheduled check:

- Weekly: run all evals
- Per-PR: run evals related to changed files
- Post-promotion: run the newly created eval immediately
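
The text does not say how "related to changed files" is decided for the per-PR case. One possible reading, sketched below, is to match each changed path against the `target` of an eval; the `related_evals` function and the id-to-target mapping are hypothetical.

```python
from fnmatch import fnmatchcase


def related_evals(changed_files: list[str], eval_targets: dict[str, str]) -> list[str]:
    """Return ids of evals whose target glob matches any changed file."""
    return sorted(
        eval_id
        for eval_id, target in eval_targets.items()
        if any(fnmatchcase(path, target) for path in changed_files)
    )
```

Note that `fnmatchcase` lets `*` cross directory separators and treats `**` like `*`, so `src/**/*.ts` matches `src/api/routes.ts` but not a file directly under `src/`. A stricter glob implementation may be preferable in practice.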

Custom Verification Methods (mcp-scripts)

Beyond the four built-in methods (grep-check, command-check, file-check, rule-check), projects can define custom verification tools as mcp-scripts for complex assertions that the built-ins can't express.

Example — an eval that verifies a promoted auth rule is enforced:

# In gh-aw workflow config
mcp-scripts:
  check-auth-middleware:
    lang: javascript
    description: "Verify all /admin routes have auth middleware"
    run: |
      const routes = require('./src/routes/admin');
      const unprotected = routes.filter(r => !r.auth);
      if (unprotected.length) {
        console.error('Unprotected admin routes:', unprotected.map(r => r.path));
        process.exit(1);
      }

Reference the script in an eval case as verification_method: script-check with the mcp-script name. This is an extension point — the built-in methods cover most cases, but mcp-scripts handle project-specific behavioral assertions.

Persistence

Eval cases live in .evals/ in the working directory. The skill does not integrate with external memory backends in interactive sessions. For CI-side durable storage, see eval-creator-ci, which can optionally back its run history with gh-aw's repo-memory.

Skill Validation (skill-check)

The Anthropic /skill-creator skill includes two validation systems that eval-creator can use:

Structural validation via `quick_validate.py`

The skill-check verification method runs the skill-creator's quick_validate.py script on a skill directory. It checks:

- SKILL.md exists with valid YAML frontmatter
- Only allowed frontmatter keys (name, description, license, allowed-tools, metadata, compatibility)
- Name is kebab-case, max 64 chars, no leading/trailing/consecutive hyphens
- Description has no angle brackets, max 1024 chars
- Compatibility field max 500 chars if present
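
To make these rules concrete, here is a sketch that applies them to an already-parsed frontmatter dictionary. It is not the code of `quick_validate.py`; `frontmatter_problems` is a hypothetical function, and reading "kebab-case" as lowercase letters and digits separated by single hyphens is an assumption.

```python
import re

ALLOWED_KEYS = {"name", "description", "license", "allowed-tools", "metadata", "compatibility"}
# Lowercase alphanumeric groups joined by single hyphens: this rules out
# leading, trailing, and consecutive hyphens in one pattern.
KEBAB = re.compile(r"[a-z0-9]+(-[a-z0-9]+)*")


def frontmatter_problems(frontmatter: dict[str, str]) -> list[str]:
    """Return human-readable rule violations; an empty list means valid."""
    problems = []
    extra = sorted(set(frontmatter) - ALLOWED_KEYS)
    if extra:
        problems.append(f"unexpected keys: {', '.join(extra)}")

    name = frontmatter.get("name", "")
    if not KEBAB.fullmatch(name):
        problems.append("name is not kebab-case")
    if len(name) > 64:
        problems.append("name is longer than 64 characters")

    description = frontmatter.get("description", "")
    if "<" in description or ">" in description:
        problems.append("description contains angle brackets")
    if len(description) > 1024:
        problems.append("description is longer than 1024 characters")

    compatibility = frontmatter.get("compatibility")
    if compatibility is not None and len(compatibility) > 500:
        problems.append("compatibility is longer than 500 characters")
    return problems
```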

Eval case example:

---
id: eval-YYYYMMDD-NNN
pattern-key: skill-quality.verify-gate
verification_method: skill-check
target: skills/verify-gate
expect: valid
---

## What This Tests
Verify that the verify-gate skill passes structural validation after harness updates.

Execution: python .claude/skills/skill-creator/scripts/quick_validate.py <target>. Exit 0 = pass, exit 1 = fail.
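
Wrapped in Python, the invocation above and its exit-code mapping might look like this. The `skill_check` function name is hypothetical, and treating every non-zero exit code (not only 1) as a failure is an assumption.

```python
import subprocess
import sys
from pathlib import Path


def skill_check(validator: Path, target: Path) -> str:
    """Run the validator script on a skill directory and map its exit code."""
    proc = subprocess.run(
        [sys.executable, str(validator), str(target)],
        capture_output=True,
        text=True,
    )
    # Exit 0 = pass; anything else is reported as fail (assumption for codes > 1).
    return "pass" if proc.returncode == 0 else "fail"
```

With the path from the text, a call would be `skill_check(Path(".claude/skills/skill-creator/scripts/quick_validate.py"), Path("skills/verify-gate"))`.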

Behavioral validation via `run_eval.py`

For deeper validation, the skill-creator's run_eval.py tests whether a skill's description causes Claude to invoke it for given queries. This is useful when harness-updater modifies a skill's description or the outer loop creates a new skill — the eval verifies the skill still triggers correctly.

This requires Claude CLI access and is expensive. Use it for high-value skills only, not as a routine CI check.

When to create skill-check evals

Two scenarios connect the outer loop to skill validation:

- Harness-updater modifies a skill: When a promoted rule is inserted into a SKILL.md (rather than a project instruction file), create a skill-check eval to verify the skill remains structurally valid after the edit.
- Self-improvement identifies a skill gap: When learning-aggregator classifies a pattern as skill_gap and recommends "create a new skill", the new skill should pass quick_validate.py before being committed. Create a skill-check eval for it that persists as a regression test.

This closes the loop: failure → learning → new/updated skill → eval verifies skill quality → regression prevents quality drift.

What This Skill Does NOT Do

- Does not fix regressions (reports them for the agent or human to fix)
- Does not promote learnings (that's harness-updater)
- Does not analyze patterns (that's learning-aggregator)
- Does not replace project test suites; evals test the harness, not the code

[Beta] Creates permanent eval cases from promoted learnings and runs regression checks against them. Turns failures into test cases that prevent silent regression. This is the outer loop's regress-test step. Use when a learning is promoted and has a clear pass/fail condition, or on cadence to verify promoted rules still hold.

196 stars

testing

Updated Jun 2, 2026

npx skillsauth add pskoett/pskoett-ai-skills eval-creator

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Purpose | Status | % |
|---------|---------|--------|---|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Jun 2, 2026, 5:14 AM · 228.7s · 1 file scanned

SKILL.md

name: eval-creator
description: [Beta] Creates permanent eval cases from promoted learnings and runs regression checks against them. Turns failures into test cases that prevent silent regression. This is the outer loop's regress-test step. Use when a learning is promoted and has a clear pass/fail condition, or on cadence to verify promoted rules still hold.
user-invocable: true
argument-hint: [create | run | list] [--pattern-key KEY]

Related Skills

pskoett/agent-teams-simplify-and-harden

development

Implementation + audit loop using parallel agent teams with structured simplify, harden, and document passes. Spawns implementation agents to do the work, then audit agents to find complexity, security gaps, and spec deviations, then loops until code compiles cleanly, all tests pass, and auditors find zero issues or the loop cap is reached. Use when: implementing features from a spec or plan, hardening existing code, fixing a batch of issues, or any multi-file task that benefits from a build-verify-fix cycle.

211 · SKILL.md · Updated Apr 21, 2026

pskoett/self-healing

tools

Active runtime recovery for coding agents: when something breaks mid-task, diagnose the root cause, write a fix, VERIFY by re-running the broken thing, then file a `HEAL-` entry to `.learnings/HEALS.md` with proof. Use whenever a command, test, build, or lint fails or exits non-zero; on missing tooling, dependency/lockfile mismatch, wrong runtime version, venv or permission errors, port conflicts, dirty git state, or a missing `.env`; when the agent needs a helper or one-off script that doesn't exist yet; when an external API, tool, or MCP errors or rate-limits; or when a test flakes. Search `HEALS.md` by `Pattern-Key` first — most heals are recurrences, so increment `Recurrence-Count` instead of duplicating. Verify is mandatory: mark `pending-verify` honestly if sandboxed, `abandoned` if the fix can't be made to work. Pairs with `self-improvement` (which promotes recurring heals to durable memory) but owns the verify-before-persist discipline self-improvement doesn't.

199 · SKILL.md · Updated Jun 2, 2026

pskoett/control-session-orchestrator

development

Control-plane workflow for coordinating multi-agent, multi-session project work from a single Codex, GitHub Copilot, or agent-app control session. Use this skill whenever the user asks to orchestrate agents, create or steer worker sessions, run a workflow-like effort, fan out audits/research/migrations, coordinate parallel implementation streams, monitor other project sessions, or compare this control-session pattern to Claude Code dynamic workflows. This skill is especially relevant when the current session can spawn persistent project sessions and those sessions can spawn their own subagents, creating a two-level orchestration hierarchy.

199 · SKILL.md · Updated Jun 2, 2026

Install manually

# Clone the repo
git clone https://github.com/pskoett/pskoett-ai-skills.git

# Create the global Claude Code skills folder if it does not exist yet
mkdir -p ~/.claude/skills

# Copy the skill into it
cp -r pskoett-ai-skills/plugin/skills/eval-creator ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

pskoett/pskoett-ai-skills

196 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT
