docs/ko-KR/skills/eval-harness/SKILL.md
평가 주도 개발(EDD) 원칙을 구현하는 Claude Code 세션용 공식 평가 프레임워크
npx skillsauth add SiniyaYousuf/everything_claudecode eval-harnessInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Claude Code 세션을 위한 공식 평가 프레임워크로, 평가 주도 개발(EDD) 원칙을 구현합니다.
평가 주도 개발은 평가를 "AI 개발의 단위 테스트"로 취급합니다:
Claude가 이전에 할 수 없었던 것을 할 수 있는지 테스트:
[CAPABILITY EVAL: feature-name]
Task: Description of what Claude should accomplish
Success Criteria:
- [ ] Criterion 1
- [ ] Criterion 2
- [ ] Criterion 3
Expected Output: Description of expected result
변경 사항이 기존 기능을 손상시키지 않는지 확인:
[REGRESSION EVAL: feature-name]
Baseline: SHA or checkpoint name
Tests:
- existing-test-1: PASS/FAIL
- existing-test-2: PASS/FAIL
- existing-test-3: PASS/FAIL
Result: X/Y passed (previously Y/Y)
코드를 사용한 결정론적 검사:
# Check if file contains expected pattern
grep -q "export function handleAuth" src/auth.ts && echo "PASS" || echo "FAIL"
# Check if tests pass
npm test -- --testPathPattern="auth" && echo "PASS" || echo "FAIL"
# Check if build succeeds
npm run build && echo "PASS" || echo "FAIL"
Claude를 사용하여 개방형 출력 평가:
[MODEL GRADER PROMPT]
Evaluate the following code change:
1. Does it solve the stated problem?
2. Is it well-structured?
3. Are edge cases handled?
4. Is error handling appropriate?
Score: 1-5 (1=poor, 5=excellent)
Reasoning: [explanation]
수동 검토 플래그:
[HUMAN REVIEW REQUIRED]
Change: Description of what changed
Reason: Why human review is needed
Risk Level: LOW/MEDIUM/HIGH
"k번 시도 중 최소 한 번 성공"
"k번 시행 모두 성공"
## EVAL DEFINITION: feature-xyz
### Capability Evals
1. Can create new user account
2. Can validate email format
3. Can hash password securely
### Regression Evals
1. Existing login still works
2. Session management unchanged
3. Logout flow intact
### Success Metrics
- pass@3 > 90% for capability evals
- pass^3 = 100% for regression evals
정의된 평가를 통과하기 위한 코드 작성.
# Run capability evals
[Run each capability eval, record PASS/FAIL]
# Run regression evals
npm test -- --testPathPattern="existing"
# Generate report
EVAL REPORT: feature-xyz
========================
Capability Evals:
create-user: PASS (pass@1)
validate-email: PASS (pass@2)
hash-password: PASS (pass@1)
Overall: 3/3 passed
Regression Evals:
login-flow: PASS
session-mgmt: PASS
logout-flow: PASS
Overall: 3/3 passed
Metrics:
pass@1: 67% (2/3)
pass@3: 100% (3/3)
Status: READY FOR REVIEW
/eval define feature-name
.claude/evals/feature-name.md에 평가 정의 파일 생성
/eval check feature-name
현재 평가를 실행하고 상태 보고
/eval report feature-name
전체 평가 보고서 생성
프로젝트에 평가 저장:
.claude/
evals/
feature-xyz.md # 평가 정의
feature-xyz.log # 평가 실행 이력
baseline.json # 회귀 베이스라인
## EVAL: add-authentication
### Phase 1: 정의 (10분)
Capability Evals:
- [ ] User can register with email/password
- [ ] User can login with valid credentials
- [ ] Invalid credentials rejected with proper error
- [ ] Sessions persist across page reloads
- [ ] Logout clears session
Regression Evals:
- [ ] Public routes still accessible
- [ ] API responses unchanged
- [ ] Database schema compatible
### Phase 2: 구현 (가변)
[Write code]
### Phase 3: 평가
Run: /eval check add-authentication
### Phase 4: 보고서
EVAL REPORT: add-authentication
==============================
Capability: 5/5 passed (pass@3: 100%)
Regression: 3/3 passed (pass^3: 100%)
Status: SHIP IT
행동 품질을 단위 테스트만으로 포착할 수 없을 때 제품 평가를 사용하세요.
pass@1: 직접 신뢰성pass@3: 제어된 재시도 하에서의 실용적 신뢰성pass^3: 안정성 테스트 (3회 모두 통과해야 함)권장 임계값:
.claude/evals/<feature>.md 정의.claude/evals/<feature>.log 실행 이력docs/releases/<version>/eval-summary.md 릴리스 스냅샷development
X/Twitter API integration for posting tweets, threads, reading timelines, search, and analytics. Covers OAuth auth patterns, rate limits, and platform-native content posting. Use when the user wants to interact with X programmatically.
documentation
Translate visa application documents (images) to English and create a bilingual PDF with original and translation
tools
See, Understand, Act on video and audio. See- ingest from local files, URLs, RTSP/live feeds, or live record desktop; return realtime context and playable stream links. Understand- extract frames, build visual/semantic/temporal indexes, and search moments with timestamps and auto-clips. Act- transcode and normalize (codec, fps, resolution, aspect ratio), perform timeline edits (subtitles, text/image overlays, branding, audio overlays, dubbing, translation), generate media assets (image, audio, video), and create real time alerts for events from live streams or desktop capture.
development
AI-assisted video editing workflows for cutting, structuring, and augmenting real footage. Covers the full pipeline from raw capture through FFmpeg, Remotion, ElevenLabs, fal.ai, and final polish in Descript or CapCut. Use when the user wants to edit video, cut footage, create vlogs, or build video content.