skills/eval-harness/SKILL.md
--- name: eval-harness description: Eval-Driven Development (EDD) framework for AI development. Use when implementing AI features, prompt engineering, or LLM integration. Keywords: eval, evaluation, test, AI, LLM, prompt, benchmark, quality, EDD. --- # Eval Harness Skill (Eval-Driven Development) ## Purpose Apply EDD (Eval-Driven Development) methodology when developing AI features: "Define evaluation first, then implement to pass the eval." **Core Principle**: Eval is the unit test of AI dev
npx skillsauth add excatt/superclaude-plusplus skills/eval-harnessInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Apply EDD (Eval-Driven Development) methodology when developing AI features: "Define evaluation first, then implement to pass the eval."
Core Principle: Eval is the unit test of AI development → Define success criteria first → Implement → Evaluate → Iterate
/eval, evaluation setup, benchmarkTraditional Dev AI Dev (EDD)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Unit Test Eval
Expected Output Success Criteria
assert(result == x) score >= threshold
Deterministic Probabilistic
100% pass required pass@k metric
| Type | Purpose | Example | |------|---------|---------| | Capability Eval | Test new feature | "Does summary include key points?" | | Regression Eval | Maintain existing | "Does classification accuracy hold?" | | Safety Eval | Verify safety | "Does it reject harmful content?" |
/eval define <feature-name>
📝 Eval Definition
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Feature: document-summarizer
Type: Capability
Success Criteria:
□ Summary length < 30% of original
□ Contains 80%+ key keywords
□ Grammatically correct sentences
□ No hallucinations
Test Cases:
1. Short news article (200 words)
2. Long tech doc (2000 words)
3. Structured report (with tables)
4. Multilingual doc (EN/KR mixed)
Grading Method: Model-based + Code-based hybrid
Target: pass@3 > 90%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Implement feature according to eval definition
/eval run <feature-name>
🧪 Running Evals...
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Feature: document-summarizer
Test Cases: 4
Trials per case: 3
Results:
┌─────────────────┬─────────┬─────────┬─────────┐
│ Test Case │ Trial 1 │ Trial 2 │ Trial 3 │
├─────────────────┼─────────┼─────────┼─────────┤
│ Short news │ ✅ PASS │ ✅ PASS │ ✅ PASS │
│ Long tech doc │ ✅ PASS │ ❌ FAIL │ ✅ PASS │
│ Structured │ ✅ PASS │ ✅ PASS │ ✅ PASS │
│ Multilingual │ ❌ FAIL │ ✅ PASS │ ✅ PASS │
└─────────────────┴─────────┴─────────┴─────────┘
Metrics:
- pass@1: 75% (3/4)
- pass@3: 100% (4/4) ✅
- Average score: 0.87
Status: ✅ PASSED (pass@3 > 90%)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
/eval report
📊 Eval Report: document-summarizer
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Version: 1.0.0
Date: 2025-01-26
Baseline: v0.9.0
Performance:
┌────────────────┬─────────┬─────────┬────────┐
│ Metric │ Target │ Actual │ Status │
├────────────────┼─────────┼─────────┼────────┤
│ pass@3 │ > 90% │ 100% │ ✅ │
│ Avg Score │ > 0.8 │ 0.87 │ ✅ │
│ Latency (p50) │ < 2s │ 1.2s │ ✅ │
│ Latency (p99) │ < 5s │ 3.8s │ ✅ │
└────────────────┴─────────┴─────────┴────────┘
Regression Check:
- Classification: ✅ No regression
- Entity extraction: ✅ No regression
- Sentiment: ⚠️ -2% (within tolerance)
Recommendation: Ready for production
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# Length verification
def check_length(summary, original):
return len(summary) < len(original) * 0.3
# Keyword inclusion verification
def check_keywords(summary, keywords):
found = sum(1 for k in keywords if k in summary)
return found / len(keywords) >= 0.8
# Build success verification
def check_build():
return subprocess.run(["npm", "run", "build"]).returncode == 0
JUDGE_PROMPT = """
Evaluate the following summary:
Original: {original}
Summary: {summary}
Evaluation criteria:
1. Key information inclusion (1-5)
2. Conciseness (1-5)
3. Accuracy (1-5)
4. Readability (1-5)
Provide scores and reasons in JSON format.
"""
def model_grade(original, summary):
response = llm.generate(JUDGE_PROMPT.format(...))
scores = json.loads(response)
return sum(scores.values()) / 20 # Normalize to 0-1
human_review:
required_for:
- safety_critical_decisions
- edge_cases
- low_confidence_results
interface:
- show: [input, output, criteria]
- collect: [pass/fail, score, comments]
pass@k: At least 1 success in k trials
pass@1 = Single trial success rate (strict)
pass@3 = At least 1 success in 3 trials (common)
pass@5 = At least 1 success in 5 trials (lenient)
pass^k: All k trials must succeed
Use for safety-critical features:
- Harmful content filtering
- PII detection
- Security validation
score >= threshold
Continuous score evaluation:
- Quality score
- Similarity score
- Confidence score
.claude/evals/
├── capability/
│ ├── document-summarizer.eval.md
│ └── code-generator.eval.md
├── regression/
│ ├── classification.eval.md
│ └── extraction.eval.md
├── safety/
│ └── content-filter.eval.md
└── baselines/
└── v1.0.0.json
---
name: document-summarizer
type: capability
version: 1.0.0
created: 2025-01-26
---
# Document Summarizer Eval
## Success Criteria
- [ ] Summary length < 30% of original
- [ ] Contains 80%+ key concepts
- [ ] Grammatically correct
- [ ] No hallucinations
## Test Cases
### Case 1: Short News Article
**Input**: [news_article.txt]
**Expected**: Summary with main event, actors, outcome
**Grader**: model-based
### Case 2: Technical Documentation
**Input**: [tech_doc.md]
**Expected**: Summary with key concepts, no code
**Grader**: hybrid (code + model)
## Grading Configuration
```yaml
method: hybrid
code_checks:
- length_ratio: 0.3
- keyword_coverage: 0.8
model_judge:
criteria: [accuracy, completeness, clarity]
threshold: 0.8
---
## Integration
### With `/verify`
/verify → General code quality /eval → AI feature quality
Both must pass for PR Ready
### With `/feature-planner`
AI feature phase structure:
### With CI/CD
```yaml
# .github/workflows/eval.yml
eval:
runs-on: ubuntu-latest
steps:
- name: Run Evals
run: claude eval run --all
- name: Check Regression
run: claude eval regression --baseline v1.0.0
- name: Upload Report
uses: actions/upload-artifact@v3
with:
name: eval-report
path: .claude/evals/reports/
| Command | Description |
|---------|-------------|
| /eval define <name> | Define new evaluation |
| /eval run <name> | Run evaluation |
| /eval run --all | Run all evaluations |
| /eval report | Generate eval report |
| /eval regression | Run regression tests |
| /eval baseline <version> | Save baseline |
❌ "Result should be good" → Not measurable
❌ Single test case → Overfitting risk
❌ pass@1 > 99% target → Unrealistic
❌ Everything model-based → Cost and consistency issues
testing
사용자 계획을 기존 도메인 모델에 대해 stress-test하는 인터뷰 세션. 용어를 날카롭게 다듬고, 결정이 굳어질 때마다 CONTEXT.md(도메인 어휘 사전)와 ADR을 인라인으로 갱신한다. 새 기능 요구사항 탐색은 `/brainstorm`을, 기존 도메인 모델·용어와의 정합성 점검은 이 스킬을 사용한다.
development
# Excel (XLSX) Spreadsheet Skill Claude Code supports comprehensive spreadsheet operations through the **xlsx** skill, enabling creation, editing, and analysis of Excel files (.xlsx, .xlsm, .csv, .tsv). ## Trigger - When user needs Excel spreadsheet creation or editing - Financial modeling or data analysis required - Spreadsheet formulas and calculations needed - Data import from CSV/TSV files ## Core Capabilities **Primary functions include:** - Creating new spreadsheets with formulas and f
tools
Generate structured implementation workflows from PRDs and feature requirements
development
실시간 통신 설계 가이드를 실행합니다.