skills/eval-driven-dev/SKILL.md
Eval-Driven Development (EDD) — treating evals as unit tests for AI-assisted development. Use when building or modifying AI workflows, agent pipelines, prompt engineering, or any feature where output quality needs measurement. Define success criteria BEFORE implementing.
npx skillsauth add kitfunso/omniskill eval-driven-dev
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Evals are unit tests for AI development. Define success criteria BEFORE implementing, measure AFTER.
[CAPABILITY EVAL: feature-name]
Success Criteria:
- [ ] Criterion 1 (objectively verifiable)
- [ ] Criterion 2
- [ ] Criterion 3
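One way to make the capability template executable is to pair each criterion with a check function and render the results in the same checklist layout. The following is a minimal sketch, not part of the skill itself: the `Criterion` and `CapabilityEval` classes, the `summarize-ticket` feature, and its three criteria are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Criterion:
    description: str
    check: Callable[[str], bool]  # returns True when the output meets the criterion


@dataclass
class CapabilityEval:
    feature: str
    criteria: list[Criterion] = field(default_factory=list)

    def run(self, output: str) -> list[tuple[str, bool]]:
        """Evaluate every criterion against one output."""
        return [(c.description, c.check(output)) for c in self.criteria]

    def report(self, output: str) -> str:
        """Render results in the checklist layout of the template."""
        lines = [f"[CAPABILITY EVAL: {self.feature}]", "Success Criteria:"]
        for description, passed in self.run(output):
            mark = "x" if passed else " "
            lines.append(f"- [{mark}] {description}")
        return "\n".join(lines)


# Hypothetical feature and criteria, written before any implementation exists.
summary_eval = CapabilityEval(
    feature="summarize-ticket",
    criteria=[
        Criterion("Output is at most 50 words", lambda out: len(out.split()) <= 50),
        Criterion("Output mentions the ticket ID", lambda out: "TICKET-" in out),
        Criterion("Output has no TODO markers", lambda out: "TODO" not in out),
    ],
)

print(summary_eval.report("TICKET-42: login fails after password reset."))
```

Because the criteria are defined up front as plain functions, the same eval object can be rerun unchanged after each implementation attempt.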
[REGRESSION EVAL: feature-name]
Existing Behavior:
- test-1: PASS/FAIL
- test-2: PASS/FAIL
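The regression template can be filled in by a small runner that executes each existing test and records PASS or FAIL. This is a sketch under assumptions: `run_regression_eval`, the `slugify` function, and the two test functions are hypothetical stand-ins for whatever behavior already exists in a real project.

```python
from typing import Callable


def run_regression_eval(
    feature: str, tests: dict[str, Callable[[], None]]
) -> tuple[str, bool]:
    """Run each named test; a test passes if it returns without raising."""
    lines = [f"[REGRESSION EVAL: {feature}]", "Existing Behavior:"]
    all_passed = True
    for name, test in tests.items():
        try:
            test()
            status = "PASS"
        except Exception:  # includes AssertionError from failed asserts
            status = "FAIL"
            all_passed = False
        lines.append(f"- {name}: {status}")
    return "\n".join(lines), all_passed


# Hypothetical existing behavior that must keep working after a change.
def slugify(title: str) -> str:
    return "-".join(title.lower().split())


def test_lowercases() -> None:
    assert slugify("Hello World") == "hello-world"


def test_collapses_whitespace() -> None:
    assert slugify("a   b") == "a-b"


report, ok = run_regression_eval(
    "slugify",
    {"test-1": test_lowercases, "test-2": test_collapses_whitespace},
)
print(report)
```

In practice an existing test runner would likely replace the hand-rolled loop; the point is only that the regression list is produced by running tests, not by recollection.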
| Type | When | How |
|------|------|-----|
| Code-based | Deterministic outputs | grep patterns, run tests, build checks, file existence |
| Model-based | Open-ended outputs | Claude evaluates against rubric (use separate context) |
| Human-based | Subjective quality | Flag for manual review |
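For the code-based row, two of the listed checks (grep patterns and file existence) can be sketched with the Python standard library. The helper names `grade_pattern` and `grade_file_exists`, the `report.md` file, and its contents are assumptions made for this example.

```python
import re
import tempfile
from pathlib import Path


def grade_pattern(text: str, pattern: str) -> bool:
    """Grep-style check: does the regular expression match anywhere in the text?"""
    return re.search(pattern, text) is not None


def grade_file_exists(root: Path, relative_path: str) -> bool:
    """File-existence check relative to a project root."""
    return (root / relative_path).is_file()


# Hypothetical artifact produced by the feature under evaluation.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "report.md").write_text("## Summary\nAll checks passed.\n")
    results = {
        "report file exists": grade_file_exists(root, "report.md"),
        "report has a Summary heading": grade_pattern(
            (root / "report.md").read_text(), r"(?m)^## Summary$"
        ),
    }

for name, passed in results.items():
    print(f"{name}: {'PASS' if passed else 'FAIL'}")
```

Checks like these suit deterministic outputs because they give the same verdict on every run.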
The reviewer/grader should be a SEPARATE context from the implementer. Author bias means the implementer will rate their own work too favorably. Use a subagent or separate pass for grading.
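The separation can be enforced structurally: build the grader's prompt from the rubric and the final output only, so nothing from the implementer's context can leak in. The sketch below assumes a hypothetical `ModelCall` callable that sends one prompt to a fresh context (for example a subagent); no real model API is implied, and `stub_grader` is a stand-in so the example runs offline.

```python
from typing import Callable

# Hypothetical type: sends one prompt to a fresh model context and returns its reply.
ModelCall = Callable[[str], str]


def build_grader_prompt(rubric: str, output: str) -> str:
    """Build a prompt containing only the rubric and the output under review."""
    return (
        "Grade the output against the rubric. Reply PASS or FAIL.\n"
        f"Rubric:\n{rubric}\n"
        f"Output:\n{output}\n"
    )


def grade_in_separate_context(output: str, rubric: str, grader: ModelCall) -> bool:
    """Grade with a separate context; the implementer's history is never passed in."""
    reply = grader(build_grader_prompt(rubric, output))
    return reply.strip().upper().startswith("PASS")


# Stand-in grader so the sketch runs offline; a real one would call a model.
def stub_grader(prompt: str) -> str:
    output_section = prompt.split("Output:\n", 1)[1]
    return "PASS" if "refund" in output_section.lower() else "FAIL"


implementer_notes = "I am confident this draft is excellent."  # stays with the implementer
draft = "You can request a refund within 30 days."
print(grade_in_separate_context(draft, "Mentions the refund policy", stub_grader))
```

Keeping the implementer's notes out of the function signature entirely means the grader cannot be swayed by the author's own assessment.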