Tool/everything-claude-code/docs/zh-CN/skills/eval-harness/SKILL.md
A formal evaluation framework for Claude Code sessions that implements eval-driven development (EDD) principles.
Eval-driven development treats evals as the "unit tests of AI development".
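Tests whether Claude can do something it could not do before:
[CAPABILITY EVAL: feature-name]
Task: Description of what Claude should accomplish
Success criteria:
- [ ] Criterion 1
- [ ] Criterion 2
- [ ] Criterion 3
Expected output: Description of the expected result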
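Ensures that changes do not break existing functionality:
[REGRESSION EVAL: feature-name]
Baseline: SHA or checkpoint name
Tests:
- existing-test-1: PASS/FAIL
- existing-test-2: PASS/FAIL
- existing-test-3: PASS/FAIL
Result: X/Y passed (previously Y/Y)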
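The regression template above amounts to comparing current test outcomes against a recorded baseline. A minimal Python sketch of that comparison, assuming results are held as plain name-to-boolean mappings; the function names and test names are hypothetical and not part of the skill:

```python
def regression_summary(baseline: dict[str, bool], current: dict[str, bool]) -> str:
    """Summarize current results against a baseline in the X/Y form of the template."""
    total = len(baseline)
    # A test that is missing from the current run is counted as failing.
    passed = sum(1 for name in baseline if current.get(name, False))
    previously = sum(1 for ok in baseline.values() if ok)
    return f"Result: {passed}/{total} passed (previously {previously}/{total})"


def regressions(baseline: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Names of tests that passed at the baseline but do not pass now."""
    return sorted(
        name for name, ok in baseline.items() if ok and not current.get(name, False)
    )
```

Listing the regressed test names separately keeps the one-line summary short while still pointing at what broke.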
Use code for deterministic checks:
# Check if file contains expected pattern
grep -q "export function handleAuth" src/auth.ts && echo "PASS" || echo "FAIL"
# Check if tests pass
npm test -- --testPathPattern="auth" && echo "PASS" || echo "FAIL"
# Check if build succeeds
npm run build && echo "PASS" || echo "FAIL"
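The same exit-status convention can be wrapped in a small helper when checks are driven from a script rather than typed by hand. A hedged Python sketch, where `grade_command` and `grade_file_contains` are hypothetical names and the commands passed in are whatever the project actually uses:

```python
import subprocess
import sys
from pathlib import Path


def grade_command(argv: list[str]) -> str:
    """Run a command and map its exit status to PASS/FAIL."""
    try:
        completed = subprocess.run(argv, capture_output=True)
    except FileNotFoundError:
        # The executable itself is missing, so the check cannot pass.
        return "FAIL"
    return "PASS" if completed.returncode == 0 else "FAIL"


def grade_file_contains(path: str, needle: str) -> str:
    """PASS when the file exists and contains the literal string `needle`."""
    file = Path(path)
    if not file.is_file():
        return "FAIL"
    return "PASS" if needle in file.read_text(encoding="utf-8") else "FAIL"


if __name__ == "__main__":
    # Stand-in commands; a real project might pass ["npm", "run", "build"] instead.
    print(grade_command([sys.executable, "-c", "raise SystemExit(0)"]))
    print(grade_command([sys.executable, "-c", "raise SystemExit(1)"]))
```

Passing the command as an argument list avoids shell quoting issues, at the cost of not supporting pipes or `&&` chains.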
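Use Claude to evaluate open-ended outputs:
[MODEL GRADER PROMPT]
Evaluate the following code change:
1. Does it solve the stated problem?
2. Is it well structured?
3. Are edge cases handled?
4. Is error handling appropriate?
Score: 1-5 (1 = poor, 5 = excellent)
Reasoning: [explanation]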
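A grader reply in this shape can be turned into structured data before it is logged. The sketch below assumes the model answers with a `Score:` line and a `Reasoning:` line as the prompt requests; the `Grade` type and `parse_grade` function are hypothetical, and real replies may need more tolerant parsing:

```python
import re
from dataclasses import dataclass


@dataclass
class Grade:
    score: int
    reasoning: str


def parse_grade(reply: str) -> Grade:
    """Extract 'Score: N' (1-5) and 'Reasoning: ...' from a grader reply."""
    score_match = re.search(r"^Score:\s*([1-5])\b", reply, flags=re.MULTILINE)
    if score_match is None:
        raise ValueError("no score between 1 and 5 found in grader reply")
    # Reasoning may span several lines, so capture through the end of the reply.
    reasoning_match = re.search(
        r"^Reasoning:\s*(.+)", reply, flags=re.MULTILINE | re.DOTALL
    )
    reasoning = reasoning_match.group(1).strip() if reasoning_match else ""
    return Grade(score=int(score_match.group(1)), reasoning=reasoning)
```

Rejecting replies without a valid score, rather than defaulting to some value, keeps a malformed grader response from silently counting as a pass or a fail.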
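Flag for manual review:
[HUMAN REVIEW REQUIRED]
Change: Description of what changed
Reason: Why human review is needed
Risk level: LOW/MEDIUM/HIGH
pass@k: "at least one success in k attempts"
pass^k: "all k trials succeed"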
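## EVAL DEFINITION: feature-xyz
### Capability Evals
1. Can create a new user account
2. Can validate email format
3. Can hash passwords securely
### Regression Evals
1. Existing login still works
2. Session management is unchanged
3. Logout flow is intact
### Success Metrics
- pass@3 > 90% for capability evals
- pass^3 = 100% for regression evals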
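The two metrics and the thresholds in the definition above can be checked mechanically once each eval has a list of trial outcomes. A minimal Python sketch under that assumption; the data layout (eval name mapped to a list of booleans, one per trial) and all function names are hypothetical:

```python
def _check_k(trials: list[bool], k: int) -> None:
    if not 1 <= k <= len(trials):
        raise ValueError(f"k={k} requires between 1 and {len(trials)} recorded trials")


def pass_at_k(trials: list[bool], k: int) -> bool:
    """pass@k: at least one of the first k trials succeeded."""
    _check_k(trials, k)
    return any(trials[:k])


def pass_hat_k(trials: list[bool], k: int) -> bool:
    """pass^k: every one of the first k trials succeeded."""
    _check_k(trials, k)
    return all(trials[:k])


def meets_success_metrics(
    capability: dict[str, list[bool]],
    regression: dict[str, list[bool]],
    k: int = 3,
) -> bool:
    """Apply the thresholds above: capability pass@k > 90%, regression pass^k = 100%."""
    if not capability or not regression:
        raise ValueError("both eval groups need at least one eval")
    capability_rate = sum(pass_at_k(t, k) for t in capability.values()) / len(capability)
    regression_ok = all(pass_hat_k(t, k) for t in regression.values())
    return capability_rate > 0.90 and regression_ok
```

With only three capability evals, a threshold above 90% effectively requires all three to pass, since two out of three is about 67%.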
Write code to pass the defined evals.
# Run capability evals
[Run each capability eval, record PASS/FAIL]
# Run regression evals
npm test -- --testPathPattern="existing"
# Generate report
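EVAL REPORT: feature-xyz
========================
Capability evals:
create user: PASS (pass@1)
validate email: PASS (pass@2)
hash password: PASS (pass@1)
Total: 3/3 passed
Regression evals:
login flow: PASS
session management: PASS
logout flow: PASS
Total: 3/3 passed
Metrics:
pass@1: 67% (2/3)
pass@3: 100% (3/3)
Status: READY FOR REVIEW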
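The metric lines in the report follow from the attempt on which each capability eval first passed. A short Python sketch of that arithmetic, using the three evals from the report as sample data; the function names and the mapping of eval name to first passing attempt are assumptions made for illustration:

```python
from typing import Optional


def pass_at(first_success: dict[str, Optional[int]], k: int) -> tuple[int, int]:
    """Count evals whose first passing attempt was within k tries (None = never passed)."""
    passed = sum(
        1 for attempt in first_success.values() if attempt is not None and attempt <= k
    )
    return passed, len(first_success)


def format_metric(label: str, passed: int, total: int) -> str:
    """Render a metric line as a rounded percentage plus the raw fraction."""
    return f"{label}: {round(100 * passed / total)}% ({passed}/{total})"


if __name__ == "__main__":
    # First passing attempt per capability eval, as in the report above.
    attempts = {"create user": 1, "validate email": 2, "hash password": 1}
    print(format_metric("pass@1", *pass_at(attempts, 1)))
    print(format_metric("pass@3", *pass_at(attempts, 3)))
```

Two of the three evals passed on the first attempt, which gives the 67% figure, and all three passed within three attempts.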
/eval define feature-name
Creates the eval definition file at .claude/evals/feature-name.md
/eval check feature-name
Runs the current evals and reports status
/eval report feature-name
Generates the full eval report
Store evals in the project:
.claude/
  evals/
    feature-xyz.md      # Eval definition
    feature-xyz.log     # Eval run history
    baseline.json       # Regression baselines
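A script that maintains this layout mostly needs to append to the run history and read the baseline file. The Python sketch below assumes a one-line-per-run log format and a JSON object in `baseline.json`; the skill does not specify either format, so both are illustrative, as are the function names:

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def append_run(evals_dir: Path, feature: str, status: str) -> Path:
    """Append one timestamped line to <feature>.log, creating the directory if needed."""
    evals_dir.mkdir(parents=True, exist_ok=True)
    log_path = evals_dir / f"{feature}.log"
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with log_path.open("a", encoding="utf-8") as log:
        log.write(f"{stamp} {status}\n")
    return log_path


def load_baseline(evals_dir: Path) -> dict:
    """Read baseline.json, returning an empty mapping when it does not exist yet."""
    baseline_path = evals_dir / "baseline.json"
    if not baseline_path.is_file():
        return {}
    return json.loads(baseline_path.read_text(encoding="utf-8"))


if __name__ == "__main__":
    # Demonstrate against a throwaway directory instead of a real project.
    with tempfile.TemporaryDirectory() as root:
        evals = Path(root) / ".claude" / "evals"
        print(append_run(evals, "feature-xyz", "capability 3/3, regression 3/3").name)
        print(load_baseline(evals))
```

Appending rather than rewriting the log preserves earlier runs, which is what makes it usable as a history.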
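## EVAL: add-authentication
### Phase 1: Define (10 minutes)
Capability evals:
- [ ] User can register with email/password
- [ ] User can log in with valid credentials
- [ ] Invalid credentials are rejected with an appropriate error
- [ ] Session persists across page reloads
- [ ] Logout clears the session
Regression evals:
- [ ] Public routes remain accessible
- [ ] API responses are unchanged
- [ ] Database schema is compatible
### Phase 2: Implement (time varies)
[Write code]
### Phase 3: Evaluate
Run: /eval check add-authentication
### Phase 4: Report
EVAL REPORT: add-authentication
==============================
Capability: 5/5 passed (pass@3: 100%)
Regression: 3/3 passed (pass^3: 100%)
Status: READY TO SHIP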
Use product evals when unit tests alone cannot capture behavioral quality.
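- pass@1: direct reliability
- pass@3: practical reliability under controlled retries
- pass^3: stability test (all 3 runs must pass)

Recommended thresholds:

- .claude/evals/<feature>.md: definition
- .claude/evals/<feature>.log: run history
- docs/releases/<version>/eval-summary.md: release snapshot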