eval-harness/SKILL.md
Evaluation framework for agent sessions implementing eval-driven development (EDD) principles.
npx skillsauth add lidge-jun/cli-jaw-skills eval-harness
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Evaluation framework for agent-assisted workflows. Define expected behavior before implementation, run evals continuously, and track regressions with pass@k metrics.
Test whether an agent can accomplish something new:
[CAPABILITY EVAL: feature-name]
Task: Description of what the agent should accomplish
Success Criteria:
- [ ] Criterion 1
- [ ] Criterion 2
Expected Output: Description of expected result
Verify existing functionality remains intact:
[REGRESSION EVAL: feature-name]
Baseline: SHA or checkpoint name
Tests:
- existing-test-1: PASS/FAIL
- existing-test-2: PASS/FAIL
Result: X/Y passed (previously Y/Y)
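The regression template above can be filled in mechanically once per-test results are available. The following is a minimal sketch, assuming results are stored as a mapping from test name to a pass flag; the function names and the mapping shape are illustrative and not part of the skill itself.

```python
from typing import Dict, List


def regression_summary(baseline: Dict[str, bool], current: Dict[str, bool]) -> str:
    """Format current results against a baseline in the template's Result line."""
    total = len(baseline)
    # A test that is missing from the current run is counted as a failure.
    passed = sum(1 for name in baseline if current.get(name, False))
    previously = sum(1 for ok in baseline.values() if ok)
    return f"Result: {passed}/{total} passed (previously {previously}/{total})"


def regressions(baseline: Dict[str, bool], current: Dict[str, bool]) -> List[str]:
    """Names of tests that passed at the baseline but do not pass now."""
    return [name for name, ok in baseline.items() if ok and not current.get(name, False)]


# Hypothetical data matching the template's test names.
baseline = {"existing-test-1": True, "existing-test-2": True}
current = {"existing-test-1": True, "existing-test-2": False}

print(regression_summary(baseline, current))  # Result: 1/2 passed (previously 2/2)
print(regressions(baseline, current))         # ['existing-test-2']
```

Treating a missing test as a failure is a deliberate choice in this sketch: a silently deleted test should show up as a regression rather than shrink the denominator.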
| Type | Use when | Example |
|------|----------|---------|
| Code grader | Deterministic assertions | grep -q "export function handleAuth" src/auth.ts |
| Rule grader | Regex/schema constraints | JSON schema validation, pattern matching |
| Model grader | Open-ended output quality | LLM-as-judge with 1–5 rubric |
| Human grader | Ambiguous or security-sensitive | Manual review with risk level flag |
Prefer code graders where possible — deterministic results are more reliable than probabilistic ones.
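To make the first two rows of the table concrete, here is a small sketch of a code grader and a rule grader. It assumes the graders run in Python rather than in the shell; `code_grader` and `rule_grader` are hypothetical names, and the substring check only approximates the `grep -q` example for patterns without regex metacharacters.

```python
import re
import tempfile
from pathlib import Path


def code_grader(path: Path, needle: str) -> bool:
    """Deterministic assertion: the file exists and contains the exact string."""
    return path.is_file() and needle in path.read_text(encoding="utf-8")


def rule_grader(output: str, pattern: str) -> bool:
    """Rule constraint: the whole output must match the regular expression."""
    return re.fullmatch(pattern, output) is not None


# Demo against a throwaway file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    auth = Path(tmp) / "auth.ts"
    auth.write_text("export function handleAuth(req: Request) {}\n", encoding="utf-8")
    code_ok = code_grader(auth, "export function handleAuth")
    code_missing = code_grader(Path(tmp) / "missing.ts", "export function handleAuth")

rule_ok = rule_grader("user-42", r"user-\d+")
rule_bad = rule_grader("user-abc", r"user-\d+")

print(code_ok, code_missing, rule_ok, rule_bad)  # True False True False
```

Both graders return the same answer on every run for the same input, which is the property that makes them preferable to a model grader when they are expressive enough.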
"At least one success in k attempts"
"All k trials succeed"
## EVAL DEFINITION: feature-xyz
### Capability Evals
1. Can create new user account
2. Can validate email format
### Regression Evals
1. Existing login still works
2. Session management unchanged
### Success Metrics
- pass@3 ≥ 90% for capability evals
- pass^3 = 100% for regression evals
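The pass@3 and pass^3 thresholds in the definition above can be computed from recorded trials. This sketch assumes each eval is run k times and the outcomes are kept as a list of booleans per eval; the eval names and helper functions are illustrative only.

```python
from typing import Callable, Dict, List


def pass_at_k(trials: List[bool], k: int) -> bool:
    """pass@k: at least one of the first k attempts succeeded."""
    if len(trials) < k:
        raise ValueError(f"need {k} trials, got {len(trials)}")
    return any(trials[:k])


def pass_pow_k(trials: List[bool], k: int) -> bool:
    """pass^k: every one of the first k attempts succeeded."""
    if len(trials) < k:
        raise ValueError(f"need {k} trials, got {len(trials)}")
    return all(trials[:k])


def rate(results: Dict[str, List[bool]], k: int,
         metric: Callable[[List[bool], int], bool]) -> float:
    """Fraction of evals that satisfy the given metric."""
    return sum(metric(trials, k) for trials in results.values()) / len(results)


# Hypothetical trial outcomes: three attempts per eval.
results = {
    "create-account": [False, True, True],
    "validate-email": [True, True, True],
}

print(rate(results, 3, pass_at_k))   # 1.0
print(rate(results, 3, pass_pow_k))  # 0.5
```

If attempts were independent with per-attempt success probability p, pass@k would be 1 - (1 - p)^k and pass^k would be p^k, so pass^k is always the stricter bar. That is consistent with reserving it for regression evals, where a single flaky failure matters.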
Write code to pass the defined evals.
Run each eval, record PASS/FAIL per item.
EVAL REPORT: feature-xyz
========================
Capability: 2/2 passed (pass@3: 100%)
Regression: 2/2 passed (pass^3: 100%)
Overall: READY FOR REVIEW
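A report in the format above could be rendered from the aggregated numbers. The sketch below assumes the thresholds from the success metrics (pass@k of at least 90% for capability evals, pass^k of 100% for regression evals); the `NEEDS WORK` label for a failing run is hypothetical, since the source only shows the passing case.

```python
def render_report(feature: str,
                  cap_passed: int, cap_total: int, cap_rate: float,
                  reg_passed: int, reg_total: int, reg_rate: float,
                  k: int = 3) -> str:
    """Render an eval report; rates are fractions between 0.0 and 1.0."""
    title = f"EVAL REPORT: {feature}"
    ready = cap_rate >= 0.90 and reg_rate >= 1.0
    # "NEEDS WORK" is an assumed label; the source only shows the ready state.
    overall = "READY FOR REVIEW" if ready else "NEEDS WORK"
    return "\n".join([
        title,
        "=" * len(title),
        f"Capability: {cap_passed}/{cap_total} passed (pass@{k}: {cap_rate:.0%})",
        f"Regression: {reg_passed}/{reg_total} passed (pass^{k}: {reg_rate:.0%})",
        f"Overall: {overall}",
    ])


report = render_report("feature-xyz", 2, 2, 1.0, 2, 2, 1.0)
print(report)
```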
evals/
  feature-xyz.md    # Eval definition
  feature-xyz.log   # Run history
  baseline.json     # Regression baselines
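One way to maintain this layout is to append each report to the feature's log and rewrite the baseline file after a run. The source does not specify the log or baseline formats, so the timestamped log entries and the name-to-boolean JSON mapping below are assumptions, as is the `record_run` helper.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def record_run(evals_dir: Path, feature: str, report: str, baseline: dict) -> None:
    """Append a report to <feature>.log and overwrite baseline.json."""
    evals_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).isoformat()
    # Append mode keeps earlier runs, so the log doubles as run history.
    with (evals_dir / f"{feature}.log").open("a", encoding="utf-8") as log:
        log.write(f"--- {stamp} ---\n{report.rstrip()}\n\n")
    (evals_dir / "baseline.json").write_text(
        json.dumps(baseline, indent=2, sort_keys=True) + "\n", encoding="utf-8"
    )


# Demo in a temporary directory so nothing is written to the real project.
with tempfile.TemporaryDirectory() as tmp:
    evals_dir = Path(tmp) / "evals"
    for _ in range(2):
        record_run(evals_dir, "feature-xyz", "Overall: READY FOR REVIEW",
                   {"existing-test-1": True})
    log_text = (evals_dir / "feature-xyz.log").read_text(encoding="utf-8")
    saved = json.loads((evals_dir / "baseline.json").read_text(encoding="utf-8"))

print(log_text.count("READY FOR REVIEW"), saved)  # 2 {'existing-test-1': True}
```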