Adoption

Agent Skills are supported by leading AI development tools.

VS Code, Gemini CLI, GitHub, Goose, Amp, Cursor, Claude Code, Letta, OpenCode, Claude, OpenAI Codex, Factory

lidge-jun/eval-harness

Name: eval-harness
Author: lidge-jun

eval-harness/SKILL.md

npx skillsauth add lidge-jun/cli-jaw-skills eval-harness

Eval Harness

Evaluation framework for agent-assisted workflows. Define expected behavior before implementation, run evals continuously, and track regressions with pass@k metrics.

When to Use

Defining pass/fail criteria for agent task completion
Measuring agent reliability with pass@k metrics
Creating regression test suites for prompt or agent changes
Benchmarking agent performance across model versions

Eval Types

Capability Evals

Test whether an agent can accomplish something new:

[CAPABILITY EVAL: feature-name]
Task: Description of what the agent should accomplish
Success Criteria:
  - [ ] Criterion 1
  - [ ] Criterion 2
Expected Output: Description of expected result

Regression Evals

Verify existing functionality remains intact:

[REGRESSION EVAL: feature-name]
Baseline: SHA or checkpoint name
Tests:
  - existing-test-1: PASS/FAIL
  - existing-test-2: PASS/FAIL
Result: X/Y passed (previously Y/Y)
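
The Result line above can be produced mechanically once the baseline and current outcomes are recorded. The following is a minimal sketch, not part of the skill itself: the function names and the dict-of-booleans shape (test name mapped to PASS as `True`) are assumptions made for illustration.

```python
def summarize_regression(baseline: dict[str, bool], current: dict[str, bool]) -> str:
    """Format a Result line comparing the current run against the baseline."""
    tests = sorted(baseline)
    # A test missing from the current run counts as a failure.
    passed_now = sum(1 for name in tests if current.get(name, False))
    passed_before = sum(1 for name in tests if baseline[name])
    total = len(tests)
    return f"Result: {passed_now}/{total} passed (previously {passed_before}/{total})"


def regressions(baseline: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Return tests that passed at the baseline but fail (or are missing) now."""
    return sorted(
        name for name, ok in baseline.items() if ok and not current.get(name, False)
    )
```

Listing the regressed test names alongside the X/Y count makes the report actionable without re-reading the run log.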

Grader Types

| Type | Use when | Example |
|------|----------|---------|
| Code grader | Deterministic assertions | grep -q "export function handleAuth" src/auth.ts |
| Rule grader | Regex/schema constraints | JSON schema validation, pattern matching |
| Model grader | Open-ended output quality | LLM-as-judge with 1–5 rubric |
| Human grader | Ambiguous or security-sensitive | Manual review with risk level flag |

Prefer code graders where possible — deterministic results are more reliable than probabilistic ones.
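
As an illustration of the two deterministic grader types, here is a minimal Python sketch. It is an assumption-laden example rather than the skill's own implementation: `code_grader` mirrors a `grep -q` style check, and `rule_grader` stands in for schema or pattern validation using only the standard library. Both function names and the `patterns` parameter are hypothetical.

```python
import json
import re
from pathlib import Path


def code_grader(path: str, needle: str) -> bool:
    """Deterministic check: PASS if the file exists and contains `needle`."""
    file = Path(path)
    return file.is_file() and needle in file.read_text(encoding="utf-8")


def rule_grader(output: str, patterns: dict[str, str]) -> bool:
    """Rule check: `output` must be a JSON object in which every key listed in
    `patterns` is present with a string value fully matching its regex."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(
        isinstance(data.get(key), str) and re.fullmatch(rx, data[key]) is not None
        for key, rx in patterns.items()
    )
```

Both functions return the same verdict for the same input every time, which is the property that makes them suitable for release gates.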

Metrics

pass@k

"At least one success in k attempts"

pass@1: First attempt success rate
pass@3: Success within 3 attempts
Typical target: pass@3 ≥ 90%
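
The definition above can be computed directly from recorded attempts. The sketch below is illustrative only; the skill does not say how pass@k should be calculated, so all function names are hypothetical. `pass_at_k_estimate` uses the combinatorial estimator 1 - C(n-c, k) / C(n, k) that is common in code-generation benchmarks when n > k samples are available, and its use here is an assumption.

```python
from math import comb


def pass_at_k_observed(attempts: list[bool], k: int) -> bool:
    """True if at least one of the first k recorded attempts succeeded."""
    return any(attempts[:k])


def pass_at_k_rate(tasks: dict[str, list[bool]], k: int) -> float:
    """Fraction of tasks with at least one success in their first k attempts."""
    return sum(pass_at_k_observed(a, k) for a in tasks.values()) / len(tasks)


def pass_at_k_estimate(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples with c successes: 1 - C(n-c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k failures exist, giving an estimate of 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The observed rate is enough for a small suite; the estimator is worth the extra samples only when the pass rate needs to be compared across models or prompt versions.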

pass^k

"All k trials succeed"

pass^3: 3 consecutive successes
Use for release-critical paths (target: 1.00)
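
To make the contrast with pass@k concrete, here is a minimal sketch under the assumption that trials are independent with a fixed per-trial success rate. The function names are hypothetical and not taken from the skill.

```python
def pass_hat_k_observed(trials: list[bool], k: int) -> bool:
    """True only if at least k trials were recorded and the first k all succeeded."""
    return len(trials) >= k and all(trials[:k])


def pass_hat_k_from_rate(p: float, k: int) -> float:
    """Probability that k independent trials all succeed at per-trial rate p."""
    return p ** k
```

Under that independence assumption, a 90% per-trial success rate gives pass^3 of about 0.729 while pass@3 is about 0.999, which is why pass^k is the stricter gate for release-critical paths.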

Eval Workflow

1. Define (before coding)

## EVAL DEFINITION: feature-xyz

### Capability Evals
1. Can create new user account
2. Can validate email format

### Regression Evals
1. Existing login still works
2. Session management unchanged

### Success Metrics
- pass@3 ≥ 90% for capability evals
- pass^3 = 100% for regression evals

2. Implement

Write code to pass the defined evals.

3. Evaluate

Run each eval, record PASS/FAIL per item.
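
One way to sketch this step: treat each eval as a zero-argument callable that returns a boolean, run it k times, and keep every verdict. The `run_evals` name, the callable interface, and the choice to record an exception as FAIL are all assumptions for illustration.

```python
from typing import Callable


def run_evals(evals: dict[str, Callable[[], bool]], k: int = 3) -> dict[str, list[bool]]:
    """Run each eval k times and record PASS (True) / FAIL (False) per trial.

    A grader that raises is recorded as FAIL rather than aborting the run.
    """
    results: dict[str, list[bool]] = {}
    for name, grader in evals.items():
        trials: list[bool] = []
        for _ in range(k):
            try:
                trials.append(bool(grader()))
            except Exception:
                trials.append(False)
        results[name] = trials
    return results
```

Keeping the full per-trial list, rather than a single verdict, is what lets both pass@k and pass^k be derived from the same run.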

4. Report

EVAL REPORT: feature-xyz
========================
Capability:  2/2 passed (pass@3: 100%)
Regression:  2/2 passed (pass^3: 100%)
Overall:     READY FOR REVIEW
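
A report in this shape can be generated from per-trial results. The sketch below is one possible rendering, not the skill's implementation: `format_report`, the `NOT READY` label, and the default thresholds (taken from the success metrics in the example definition) are assumptions.

```python
def format_report(
    feature: str,
    capability: dict[str, list[bool]],
    regression: dict[str, list[bool]],
    k: int = 3,
    capability_target: float = 0.90,
) -> str:
    """Render an eval report; both result dicts are assumed to be non-empty."""
    # Capability items pass if any of the first k trials succeeded (pass@k).
    cap_ok = [any(t[:k]) for t in capability.values()]
    # Regression items pass only if the first k trials all succeeded (pass^k).
    reg_ok = [len(t) >= k and all(t[:k]) for t in regression.values()]
    cap_rate = sum(cap_ok) / len(cap_ok)
    reg_rate = sum(reg_ok) / len(reg_ok)
    ready = cap_rate >= capability_target and reg_rate == 1.0
    title = f"EVAL REPORT: {feature}"
    return "\n".join([
        title,
        "=" * len(title),
        f"Capability:  {sum(cap_ok)}/{len(cap_ok)} passed (pass@{k}: {cap_rate:.0%})",
        f"Regression:  {sum(reg_ok)}/{len(reg_ok)} passed (pass^{k}: {reg_rate:.0%})",
        f"Overall:     {'READY FOR REVIEW' if ready else 'NOT READY'}",
    ])
```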

Eval Storage

evals/
  feature-xyz.md      # Eval definition
  feature-xyz.log     # Run history
  baseline.json       # Regression baselines
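
The layout above does not fix a file format for the run history or the baseline. As a sketch under that caveat, the example below assumes one JSON object per line in the `.log` file and a flat JSON object in `baseline.json`; both formats and both function names are hypothetical.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_run(evals_dir: Path, feature: str, results: dict[str, bool]) -> Path:
    """Append one JSON line per run to <evals_dir>/<feature>.log."""
    evals_dir.mkdir(parents=True, exist_ok=True)
    log_path = evals_dir / f"{feature}.log"
    entry = {"ts": datetime.now(timezone.utc).isoformat(), "results": results}
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry, sort_keys=True) + "\n")
    return log_path


def load_baseline(evals_dir: Path) -> dict:
    """Read <evals_dir>/baseline.json, or return an empty dict if it is absent."""
    path = evals_dir / "baseline.json"
    return json.loads(path.read_text(encoding="utf-8")) if path.is_file() else {}
```

An append-only log keeps earlier runs intact, so pass-rate trends can be reconstructed later from the same file.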

Best Practices

Define evals before coding — forces clear thinking about success criteria
Run evals frequently — catch regressions early
Track pass@k over time — monitor reliability trends
Prefer code graders — deterministic beats probabilistic
Keep human review for security — security checks benefit from manual judgment
Keep evals fast — slow evals get skipped
Version evals with code — evals are first-class artifacts

Pitfalls

Overfitting prompts to known eval examples
Measuring only happy-path outputs
Ignoring cost and latency drift while chasing pass rates
Allowing flaky graders in release gates
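
The last pitfall can be screened for before a grader is admitted to a release gate. A minimal sketch, assuming a grader is a callable that takes the agent output and returns a boolean (the `is_flaky` name and `repeats` parameter are hypothetical): grade the same output several times and flag any disagreement.

```python
from typing import Callable


def is_flaky(grader: Callable[[str], bool], output: str, repeats: int = 5) -> bool:
    """A grader is flaky if it gives different verdicts for the same output."""
    verdicts = {bool(grader(output)) for _ in range(repeats)}
    return len(verdicts) > 1
```

A consistent result over a handful of repeats does not prove a model grader is stable, but a single disagreement is enough to keep it out of the gate.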

Evaluation framework for agent sessions implementing eval-driven development (EDD) principles.

4 stars

development

Updated Apr 24, 2026

npx skillsauth add lidge-jun/cli-jaw-skills eval-harness

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

- Trivy: Container and dependency vulnerability scanner. Clean (95%)
- Semgrep: Static code analysis for vulnerabilities. Clean (95%)
- mcp-scan (Snyk): Model Context Protocol security validation. Clean (95%)
- Snyk (dep): Open source security scanning. Skipped (50%)
- Socket.dev: Supply chain security analysis. Skipped (50%)
- VirusTotal: Multi-engine malware detection. Skipped (50%)
- CrowdStrike: Advanced threat intelligence. Skipped (50%)
- OSV-Scanner: Open Source Vulnerability database check. Skipped (50%)
- OWASP Dep-Check: Skipped (50%)

Last scanned: Apr 24, 2026, 5:37 PM | 139.5s | 1 file scanned

SKILL.md

name: eval-harness
description: Evaluation framework for agent sessions implementing eval-driven development (EDD) principles.

Related Skills

lidge-jun/codex-imagegen

tools

Verified | Trusted | Community

Use only on the Codex CLI for native image generation or image editing without an API key. Save final PNG files under ~/.cli-jaw/uploads, report web-ready absolute-path markdown, and send to Telegram or Discord only when explicitly requested.

5 | SKILL.md | Updated Jul 10, 2026

lidge-jun/repo-map

tools

Verified | Trusted | Community

Ranked repository structure map via `cli-jaw map`. Use for codebase overview, structure map, symbol overview, unfamiliar codebase exploration, architecture orientation. Triggers: repo map, structure map, codebase overview, 와꾸, project structure, unfamiliar code.

5 | SKILL.md | Updated Jul 7, 2026

lidge-jun/design

tools

Verified | Trusted | Community

cli-jaw Design workspace: create, preview, run, and export design pages from the right sidebar. Covers panel UX, direct-write workflow, artifact lifecycle, wireframe generation, design system, and Open Design adapter.

5 | SKILL.md | Updated Jul 5, 2026

lidge-jun/dev-devops

development

Verified | Trusted | Community

MUST USE for infrastructure and delivery work: container builds, deploy pipelines, Kubernetes, Infrastructure as Code, SRE foundations, edge/serverless, ML infrastructure. Triggers: Dockerfile, K8s manifests, CI/CD pipeline, Terraform/IaC, release/deploy, devops/infra/deploy or release_cd task_tags.

5 | SKILL.md | Updated Jun 19, 2026

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Install manually

# Clone the repo
git clone https://github.com/lidge-jun/cli-jaw-skills.git

# Copy into Claude Code skills folder (global), creating it first if needed
mkdir -p ~/.claude/skills
cp -r cli-jaw-skills/eval-harness ~/.claude/skills/

Repository

lidge-jun/cli-jaw-skills

4 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT