kitfunso/eval-driven-dev

Name: eval-driven-dev
Author: kitfunso

skills/eval-driven-dev/SKILL.md

npx skillsauth add kitfunso/omniskill eval-driven-dev

Eval-Driven Development

Evals are unit tests for AI development. Define success criteria BEFORE implementing, measure AFTER.

Two Eval Types

Capability Evals — Does the new thing work?

[CAPABILITY EVAL: feature-name]
Success Criteria:
  - [ ] Criterion 1 (objectively verifiable)
  - [ ] Criterion 2
  - [ ] Criterion 3

Regression Evals — Did we break anything?

[REGRESSION EVAL: feature-name]
Existing Behavior:
  - test-1: PASS/FAIL
  - test-2: PASS/FAIL
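The two templates above can also be filled in programmatically. The following is a minimal sketch, not part of the skill itself: the class names `CapabilityEval` and `RegressionEval`, the `render` method, and the example feature name `login` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CapabilityEval:
    """Hypothetical container for a capability eval: criterion text -> met?"""
    feature: str
    criteria: dict[str, bool] = field(default_factory=dict)

    def render(self) -> str:
        # Mirrors the [CAPABILITY EVAL: ...] checklist template.
        lines = [f"[CAPABILITY EVAL: {self.feature}]", "Success Criteria:"]
        for text, met in self.criteria.items():
            lines.append(f"  - [{'x' if met else ' '}] {text}")
        return "\n".join(lines)


@dataclass
class RegressionEval:
    """Hypothetical container for a regression eval: test name -> passed?"""
    feature: str
    results: dict[str, bool] = field(default_factory=dict)

    def render(self) -> str:
        # Mirrors the [REGRESSION EVAL: ...] template.
        lines = [f"[REGRESSION EVAL: {self.feature}]", "Existing Behavior:"]
        for name, passed in self.results.items():
            lines.append(f"  - {name}: {'PASS' if passed else 'FAIL'}")
        return "\n".join(lines)


if __name__ == "__main__":
    print(CapabilityEval("login", {"Returns a session token": True}).render())
    print(RegressionEval("login", {"test-1": True, "test-2": False}).render())
```

Keeping the criteria as plain data makes it easy to write them down before implementation and fill in the booleans afterwards.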

Grader Types

| Type | When | How |
|------|------|-----|
| Code-based | Deterministic outputs | grep patterns, run tests, build checks, file existence |
| Model-based | Open-ended outputs | Claude evaluates against rubric (use separate context) |
| Human-based | Subjective quality | Flag for manual review |
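As an illustration of the code-based row, the checks it lists (file existence, grep-style patterns, running tests or builds) can each be reduced to a function that returns a boolean. This is a sketch under assumptions: the helper names `file_exists`, `matches_pattern`, and `command_succeeds` are hypothetical and do not come from the skill.

```python
import re
import subprocess
import sys
from pathlib import Path


def file_exists(path: str) -> bool:
    """File-existence check: True if `path` is a regular file."""
    return Path(path).is_file()


def matches_pattern(path: str, pattern: str) -> bool:
    """Grep-style check: True if the regex `pattern` occurs anywhere in the file."""
    text = Path(path).read_text(encoding="utf-8")
    return re.search(pattern, text) is not None


def command_succeeds(argv: list[str]) -> bool:
    """Test/build check: True if the command exits with status 0."""
    return subprocess.run(argv, capture_output=True).returncode == 0


if __name__ == "__main__":
    # Stand-in for a real test or build command.
    print(command_succeeds([sys.executable, "-c", "print('ok')"]))
```

Because each check is deterministic, the same grader can be rerun after every change without a model or a human in the loop.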

Metrics

pass@k — "At least one success in k attempts." Target: pass@3 > 90%
pass^k — "All k trials succeed." Higher bar for critical paths.
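The source defines both metrics only informally. One common way to estimate them from n recorded trials with c successes is the combinatorial estimator sketched below; using this particular estimator is an assumption, and the function names `pass_at_k` and `pass_hat_k` are hypothetical.

```python
from math import comb


def _validate(n: int, c: int, k: int) -> None:
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k attempts succeeds) from n trials, c successes.

    Computed as 1 minus the chance that k trials drawn without replacement
    are all failures.
    """
    _validate(n, c, k)
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k attempts succeed) from n trials, c successes.

    Computed as the chance that k trials drawn without replacement
    are all successes.
    """
    _validate(n, c, k)
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    # 10 recorded trials, 5 of them successful.
    print(pass_at_k(10, 5, 1))   # 0.5
    print(pass_hat_k(10, 5, 1))  # 0.5
```

For k = 1 the two metrics coincide with the plain success rate; as k grows, pass@k rises and pass^k falls, which is why pass^k is the stricter bar.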

Workflow

1. Define evals BEFORE coding (what does success look like?)
2. Implement the feature
3. Evaluate: run evals, collect pass rates
4. Iterate: fix failures, re-evaluate
5. Report: document final pass rates with the change
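The workflow steps above can be sketched as a small loop. Everything here is illustrative: `run_evals`, `failing`, the `greet` feature, and its two criteria are hypothetical names, and a real project would plug in its own checks.

```python
from typing import Callable


def run_evals(evals: dict[str, Callable[[], bool]], trials: int = 3) -> dict[str, float]:
    """Run each eval `trials` times and return its pass rate (0.0 to 1.0)."""
    rates = {}
    for name, check in evals.items():
        passes = sum(1 for _ in range(trials) if check())
        rates[name] = passes / trials
    return rates


def failing(rates: dict[str, float], threshold: float = 1.0) -> list[str]:
    """Names of evals whose pass rate is below `threshold`."""
    return [name for name, rate in rates.items() if rate < threshold]


# Define evals BEFORE coding (hypothetical criteria for a greeting feature).
evals = {
    "returns_greeting": lambda: greet("Ada") == "Hello, Ada!",
    "handles_empty_name": lambda: greet("") == "Hello!",
}


# Implement the feature.
def greet(name: str) -> str:
    return f"Hello, {name}!" if name else "Hello!"


# Evaluate, then iterate on whatever `failing` reports, then record the rates.
rates = run_evals(evals)
print(rates)
print(failing(rates))
```

Writing the `evals` dictionary before `greet` exists is the point of the exercise: the criteria are fixed first, and the implementation is judged against them afterwards.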

Key Insight

The reviewer/grader should be a SEPARATE context from the implementer. Author bias means the implementer will rate their own work too favorably. Use a subagent or separate pass for grading.

Eval-Driven Development (EDD) — treating evals as unit tests for AI-assisted development. Use when building or modifying AI workflows, agent pipelines, prompt engineering, or any feature where output quality needs measurement. Define success criteria BEFORE implementing.

development

Updated Apr 22, 2026

npx skillsauth add kitfunso/omniskill eval-driven-dev

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Reported % |
|---------|-------------|--------|------------|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Apr 22, 2026, 10:37 PM · 332.4s · 2 files scanned

Install manually

# Clone the repo
git clone https://github.com/kitfunso/omniskill.git

# Make sure the global Claude Code skills folder exists
mkdir -p ~/.claude/skills

# Copy the skill into the Claude Code skills folder (global)
cp -r omniskill/skills/eval-driven-dev ~/.claude/skills/

Repository

kitfunso/omniskill

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT