drixxodev/agent-eval

Name: agent-eval
Author: drixxodev

docs/zh-CN/skills/agent-eval/SKILL.md

npx skillsauth add drixxodev/drixxodev-ai agent-eval

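Agent Eval Skill

A lightweight CLI tool for head-to-head comparison of coding agents on reproducible tasks. Most "which coding agent is best?" comparisons rest on gut feeling; this tool makes them systematic.

When to Use

Comparing coding agents (Claude Code, Aider, Codex, etc.) on your own codebase
Measuring agent performance before adopting a new tool or model
Running regression checks when an agent updates its model or tooling
Making data-backed agent selection decisions for a team

Installation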

# pinned to v0.1.0 — latest stable commit
pip install git+https://github.com/joaquinhuigomez/agent-eval.git@6d062a2f5cda6ea443bf5d458d361892c04e749b

Core Concepts

YAML Task Definitions

Define tasks declaratively. Each task specifies what to do, which files to modify, and how to judge success:

name: add-retry-logic
description: Add exponential backoff retry to the HTTP client
repo: ./my-project
files:
  - src/http_client.py
prompt: |
  Add retry logic with exponential backoff to all HTTP requests.
  Max 3 retries. Initial delay 1s, max delay 30s.
judge:
  - type: pytest
    command: pytest tests/test_http_client.py -v
  - type: grep
    pattern: "exponential_backoff|retry"
    files: src/http_client.py
commit: "abc1234"  # pin to specific commit for reproducibility
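
As a rough illustration of the shape of a task definition, the sketch below checks a task that has already been parsed into a Python dict (for example by a YAML parser). The function name `validate_task` and the list of required keys are assumptions inferred from the example above, not agent-eval's actual schema or API.

```python
# Hypothetical validator; the required keys are inferred from the example task,
# not taken from agent-eval's real schema.
REQUIRED_KEYS = ("name", "repo", "prompt", "judge")


def validate_task(task: dict) -> list:
    """Return a list of problems; an empty list means the task looks usable."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in task]
    if "judge" in task:
        judges = task["judge"]
        if not isinstance(judges, list) or not judges:
            problems.append("judge must be a non-empty list")
        else:
            for index, judge in enumerate(judges):
                if not isinstance(judge, dict) or "type" not in judge:
                    problems.append(f"judge[{index}] has no type")
    if "commit" not in task:
        problems.append("commit is not pinned; results may not be reproducible")
    return problems


# The example task from above, as it would look after YAML parsing.
example_task = {
    "name": "add-retry-logic",
    "repo": "./my-project",
    "files": ["src/http_client.py"],
    "prompt": "Add retry logic with exponential backoff to all HTTP requests.\n",
    "judge": [
        {"type": "pytest", "command": "pytest tests/test_http_client.py -v"},
        {"type": "grep", "pattern": "exponential_backoff|retry", "files": "src/http_client.py"},
    ],
    "commit": "abc1234",
}
print(validate_task(example_task))  # prints []
```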

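Git Worktree Isolation

Each agent run gets its own git worktree, with no Docker required. This provides reproducible isolation, so agents cannot interfere with one another or corrupt the base repository.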
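
The isolation described above can be approximated with plain `git worktree` commands. The sketch below is an assumption about how such isolation might be wired up, not agent-eval's own code; the helper names and the example directory are hypothetical, while `git worktree add --detach` and `git worktree remove --force` are standard git subcommands.

```python
import subprocess
from pathlib import Path


def worktree_add_command(worktree_dir: Path, commit: str) -> list:
    """Build the git command that checks out `commit` into a new detached worktree."""
    return ["git", "worktree", "add", "--detach", str(worktree_dir), commit]


def worktree_remove_command(worktree_dir: Path) -> list:
    """Build the git command that deletes the worktree once the run is finished."""
    return ["git", "worktree", "remove", "--force", str(worktree_dir)]


def create_worktree(repo: Path, worktree_dir: Path, commit: str) -> None:
    """Run the add command inside the base repository; raises if git fails."""
    subprocess.run(worktree_add_command(worktree_dir, commit), cwd=repo, check=True)


# Hypothetical location for one run of one agent.
run_dir = Path("worktrees") / "claude-code-run-1"
add_cmd = worktree_add_command(run_dir, "abc1234")
remove_cmd = worktree_remove_command(run_dir)
print(" ".join(add_cmd))
```

Because every run starts from the same pinned commit in a separate directory, edits made by one agent never appear in another agent's checkout.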

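Metrics Collected

| Metric | What it measures |
|--------|------------------|
| Pass rate | Did the agent's code pass the judges? |
| Cost | API spend per task (when available) |
| Time | Wall-clock seconds to completion |
| Consistency | Pass rate across repeated runs (e.g., 3/3 = 100%) |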
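
To make the metrics concrete, here is a minimal sketch of how they could be aggregated over repeated runs. The `RunResult` record and `summarize` function are hypothetical names, and the sample numbers are invented for illustration; the tool's real data model may differ.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class RunResult:
    passed: bool                      # did every judge pass?
    seconds: float                    # wall-clock time for the run
    cost_usd: Optional[float] = None  # None when the agent reports no cost


def summarize(runs: list) -> dict:
    """Aggregate repeated runs of one agent on one task."""
    if not runs:
        raise ValueError("at least one run is required")
    passes = sum(1 for run in runs if run.passed)
    costs = [run.cost_usd for run in runs if run.cost_usd is not None]
    return {
        "pass_rate": f"{passes}/{len(runs)}",
        "consistency_pct": round(100 * passes / len(runs)),
        "mean_seconds": mean(run.seconds for run in runs),
        "mean_cost_usd": mean(costs) if costs else None,
    }


# Invented sample data: three runs, one failure, one run without cost data.
sample_runs = [
    RunResult(passed=True, seconds=40.0, cost_usd=0.08),
    RunResult(passed=False, seconds=38.0, cost_usd=0.08),
    RunResult(passed=True, seconds=36.0),
]
summary = summarize(sample_runs)
print(summary["pass_rate"], summary["consistency_pct"])  # prints: 2/3 67
```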

Workflow

1. Define Tasks

Create a tasks/ directory containing YAML files, one file per task:

mkdir tasks
# Write task definitions (see template above)

2. Run Agents

Execute the agents against your tasks:

agent-eval run --task tasks/add-retry-logic.yaml --agent claude-code --agent aider --runs 3

Each run:

Creates a fresh git worktree from the specified commit
Hands the prompt to the agent
Runs the judge criteria
Records pass/fail, cost, and time
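
The four steps above can be sketched as a single function with the moving parts injected as callables. This is an illustrative outline under stated assumptions, not the tool's implementation: `run_once` and its parameters are hypothetical, and cost is left out because it depends on what each agent reports.

```python
import time
from typing import Callable, List


def run_once(
    make_workspace: Callable[[], str],
    run_agent: Callable[[str, str], None],
    judges: List[Callable[[str], bool]],
    prompt: str,
) -> dict:
    """Execute one evaluation run and return its record."""
    workspace = make_workspace()                         # 1. fresh worktree at the pinned commit
    start = time.perf_counter()
    run_agent(workspace, prompt)                         # 2. hand the prompt to the agent
    seconds = time.perf_counter() - start
    passed = all(judge(workspace) for judge in judges)   # 3. run every judge
    return {"workspace": workspace, "passed": passed, "seconds": seconds}  # 4. record


# Stand-ins for a real worktree factory, agent, and judges.
agent_calls = []
record = run_once(
    make_workspace=lambda: "worktrees/demo",
    run_agent=lambda workspace, prompt: agent_calls.append((workspace, prompt)),
    judges=[lambda workspace: True, lambda workspace: workspace.endswith("demo")],
    prompt="Add retry logic",
)
print(record["passed"])  # prints: True
```

Timing only the agent call keeps judge runtime, such as a slow test suite, out of the time metric; whether the real tool does the same is not stated in the source.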

3. Compare Results

Generate a comparison report:

agent-eval report --format table

Task: add-retry-logic (3 runs each)
┌──────────────┬───────────┬────────┬────────┬─────────────┐
│ Agent        │ Pass Rate │ Cost   │ Time   │ Consistency │
├──────────────┼───────────┼────────┼────────┼─────────────┤
│ claude-code  │ 3/3       │ $0.12  │ 45s    │ 100%        │
│ aider        │ 2/3       │ $0.08  │ 38s    │  67%        │
└──────────────┴───────────┴────────┴────────┴─────────────┘

Judge Types

Code-Based (Deterministic)

judge:
  - type: pytest
    command: pytest tests/ -v
  - type: command
    command: npm run build
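
A deterministic judge of this kind usually reduces to "run a command and check the exit status". The sketch below shows that idea; `command_judge` is a hypothetical helper, and how agent-eval itself interprets `type: pytest` or `type: command` is an assumption.

```python
import subprocess
import sys


def command_judge(command: list, cwd: str = ".") -> bool:
    """Pass when the command exits with status 0, as test and build tools do on success."""
    completed = subprocess.run(command, cwd=cwd, capture_output=True, text=True)
    return completed.returncode == 0


# Two tiny stand-in commands, so the example runs without pytest or npm installed.
ok = command_judge([sys.executable, "-c", "raise SystemExit(0)"])
failed = command_judge([sys.executable, "-c", "raise SystemExit(1)"])
print(ok, failed)  # prints: True False
```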

Pattern-Based

judge:
  - type: grep
    pattern: "class.*Retry"
    files: src/**/*.py
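
A pattern judge can be approximated by searching every file matched by a glob for a regular expression. The following is a minimal sketch with a hypothetical `grep_judge` helper; it assumes the judge passes when at least one file matches, which the source does not spell out.

```python
import re
import tempfile
from pathlib import Path


def grep_judge(root: Path, file_glob: str, pattern: str) -> bool:
    """Pass when any file under `root` matching `file_glob` contains `pattern`."""
    regex = re.compile(pattern)
    for path in root.glob(file_glob):
        if path.is_file() and regex.search(path.read_text(encoding="utf-8", errors="ignore")):
            return True
    return False


# Build a throwaway source tree to exercise the judge.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "src" / "pkg").mkdir(parents=True)
    (root / "src" / "pkg" / "retry.py").write_text("class HttpRetry:\n    pass\n", encoding="utf-8")
    found = grep_judge(root, "src/**/*.py", "class.*Retry")
    missing = grep_judge(root, "src/**/*.py", "class.*Backoff")
print(found, missing)  # prints: True False
```

A pattern match only shows that certain text exists, not that the code works, which is why it pairs well with a test-based judge.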

Model-Based (LLM as Judge)

judge:
  - type: llm
    prompt: |
      Does this implementation correctly handle exponential backoff?
      Check for: max retries, increasing delays, jitter.
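
One way to turn a free-form model reply into a pass/fail result is to ask for a verdict on the first line and parse only that. The sketch below assumes this convention; `llm_judge`, `VERDICT_INSTRUCTION`, and the injected `ask_model` callable are all hypothetical, and no particular model API is implied.

```python
from typing import Callable

# Assumed convention: the model is told to lead with an explicit verdict.
VERDICT_INSTRUCTION = "Answer with PASS or FAIL on the first line, then explain briefly."


def llm_judge(ask_model: Callable[[str], str], rubric: str, diff: str) -> bool:
    """Send the rubric and the agent's changes to a model and parse its verdict."""
    prompt = f"{rubric}\n\n{VERDICT_INSTRUCTION}\n\nChanges under review:\n{diff}"
    reply = ask_model(prompt)
    lines = reply.strip().splitlines()
    first_line = lines[0].strip().upper() if lines else ""
    return first_line.startswith("PASS")  # anything else, including no reply, fails


rubric = "Does this implementation correctly handle exponential backoff?"
# Fake models stand in for a real client so the example is self-contained.
accepted = llm_judge(lambda prompt: "PASS\nDelays double and are capped.", rubric, "+ retry()")
rejected = llm_judge(lambda prompt: "FAIL\nNo jitter.", rubric, "+ retry()")
print(accepted, rejected)  # prints: True False
```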

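Best Practices

Start with 3-5 tasks that represent your real workload, not toy examples
Run at least 3 trials per agent to capture variance; agents are non-deterministic
Pin the commit in your task YAML so results stay reproducible across days and weeks
Include at least one deterministic judge (tests, build) per task; LLM judges add noise
Track cost alongside pass rate; an agent with a 95% pass rate at 10x the cost may not be the right choice
Version-control your task definitions; they are test fixtures and should be treated as code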
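
The cost-versus-pass-rate trade-off can be made explicit with a single derived number, cost per passing run. This is one possible way to weigh the two, not a metric the tool is known to report; the function name and the sample figures are invented for illustration.

```python
def cost_per_pass(total_cost_usd: float, passes: int) -> float:
    """Total spend divided by the number of passing runs; infinite if nothing passed."""
    if passes == 0:
        return float("inf")
    return total_cost_usd / passes


# Invented figures: agent A passes 19 of 20 runs (95%) but costs 10x more than agent B.
agent_a = cost_per_pass(total_cost_usd=20.0, passes=19)
agent_b = cost_per_pass(total_cost_usd=2.0, passes=17)
print(f"A: ${agent_a:.2f} per pass, B: ${agent_b:.2f} per pass")
```

With these made-up numbers, agent B delivers a passing run for roughly a ninth of the cost despite its lower pass rate, which is the situation the best-practice note warns about.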

Links

Repository: github.com/joaquinhuigomez/agent-eval

Head-to-head comparison of coding agents (Claude Code, Aider, Codex, etc.) on custom tasks, with pass rate, cost, time, and consistency metrics

1 star

development

Updated May 28, 2026

npx skillsauth add drixxodev/drixxodev-ai agent-eval

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Percentage |
|---------|-------------|--------|------------|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Mar 29, 2026, 5:51 AM (37.6s, 1 file scanned)

SKILL.md

name: agent-eval
description: Head-to-head comparison of coding agents (Claude Code, Aider, Codex, etc.) on custom tasks, with pass rate, cost, time, and consistency metrics
origin: ECC
tools: Read, Write, Edit, Bash, Grep, Glob

Install manually

# Clone the repo
git clone https://github.com/drixxodev/drixxodev-ai.git

# Copy into Claude Code skills folder (global)
cp -r drixxodev-ai/docs/zh-CN/skills/agent-eval ~/.claude/skills/

Repository

drixxodev/drixxodev-ai

1 star

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT