Adoption

Agent Skills are supported by leading AI development tools.

VS Code, Gemini CLI, GitHub, Goose, Amp, Cursor, Claude Code, Letta, OpenCode, Claude, OpenAI Codex, Factory

affaan-m/agent-eval

Name: agent-eval
Author: affaan-m

docs/ja-JP/skills/agent-eval/SKILL.md

npx skillsauth add affaan-m/everything-claude-code agent-eval


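Agent Eval Skill

A lightweight CLI tool for comparing coding agents head-to-head on reproducible tasks. Every "which coding agent is best?" comparison tends to run on gut feeling; this tool makes the comparison systematic.

When to Activate

Comparing coding agents (Claude Code, Aider, Codex, etc.) on your own codebase
Measuring agent performance before adopting a new tool or model
Running regression checks when an agent updates its model or tooling
Giving your team data-backed input for agent selection decisions

Installation

Note: Install agent-eval from its repository after reviewing the source.

Core Concepts

YAML Task Definition

Define tasks declaratively. Each task specifies what to do, which files to touch, and how to judge success: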

name: add-retry-logic
description: Add exponential backoff retry to the HTTP client
repo: ./my-project
files:
  - src/http_client.py
prompt: |
  Add retry logic with exponential backoff to all HTTP requests.
  Max 3 retries. Initial delay 1s, max delay 30s.
judge:
  - type: pytest
    command: pytest tests/test_http_client.py -v
  - type: grep
    pattern: "exponential_backoff|retry"
    files: src/http_client.py
commit: "abc1234"  # Pin to a specific commit for reproducibility
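A task file like the one above could be checked before any agent runs. The following is a minimal sketch, not agent-eval's own loader: it assumes the YAML has already been parsed into a plain dict by whatever parser you use, and the `Task`, `Judge`, `parse_task`, and `REQUIRED_KEYS` names (including which keys count as required) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Judge:
    type: str                                   # e.g. "pytest", "grep", "command", "llm"
    options: dict = field(default_factory=dict)  # every other key of the judge entry


@dataclass
class Task:
    name: str
    repo: str
    prompt: str
    commit: str
    files: list
    judges: list


# Assumption: these keys are treated as mandatory for this sketch only.
REQUIRED_KEYS = ("name", "repo", "prompt", "judge", "commit")


def parse_task(raw: dict) -> Task:
    """Validate an already-parsed task mapping and convert it to a Task."""
    missing = [key for key in REQUIRED_KEYS if key not in raw]
    if missing:
        raise ValueError(f"task is missing required keys: {missing}")
    if not raw["judge"]:
        raise ValueError("task needs at least one judge")
    judges = []
    for entry in raw["judge"]:
        options = {k: v for k, v in entry.items() if k != "type"}
        judges.append(Judge(type=entry["type"], options=options))
    return Task(
        name=raw["name"],
        repo=raw["repo"],
        prompt=raw["prompt"],
        commit=str(raw["commit"]),
        files=list(raw.get("files", [])),
        judges=judges,
    )
```

Failing fast on a malformed task keeps a typo in a YAML key from showing up later as a misleading "agent failed" result.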

Git Worktree Isolation

Each agent run gets its own git worktree, with no Docker required. This isolates runs for reproducibility, so agents cannot interfere with each other or corrupt the base repository.
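The isolation described here can be approximated with plain `git worktree` commands. This is a sketch under stated assumptions, not the tool's actual implementation: the helper names are hypothetical, and it assumes `git` is on the PATH and the pinned commit exists in the base repository.

```python
import subprocess
from pathlib import Path


def worktree_add_command(repo: Path, dest: Path, commit: str) -> list:
    """Build the git command that checks out `commit` into a new detached worktree."""
    return ["git", "-C", str(repo), "worktree", "add", "--detach", str(dest), commit]


def worktree_remove_command(repo: Path, dest: Path) -> list:
    """Build the git command that deletes the worktree, even if it has local changes."""
    return ["git", "-C", str(repo), "worktree", "remove", "--force", str(dest)]


def run_in_worktree(repo: Path, dest: Path, commit: str, action):
    """Create the worktree, call action(dest), and always clean up afterwards."""
    subprocess.run(worktree_add_command(repo, dest, commit), check=True)
    try:
        return action(dest)
    finally:
        subprocess.run(worktree_remove_command(repo, dest), check=True)
```

Because each run works in its own directory at a detached HEAD, whatever an agent edits never touches the base checkout or another agent's files.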

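Metrics Collected

| Metric | What It Measures |
|--------|-----------------|
| Pass rate | Did the agent produce code that passes the judges? |
| Cost | API spend per task (when available) |
| Time | Wall-clock seconds to completion |
| Consistency | Pass rate across repeated runs (e.g., 3/3 = 100%) |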
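As a rough illustration of how these four metrics relate, the sketch below aggregates a list of per-run records. The `RunResult` and `summarize` names are hypothetical, and averaging cost and time across runs is an assumption; the source does not say how agent-eval aggregates them.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class RunResult:
    passed: bool
    seconds: float
    cost_usd: Optional[float] = None  # None when the agent reports no cost


def summarize(runs: list) -> dict:
    """Aggregate repeated runs of one agent on one task."""
    if not runs:
        raise ValueError("need at least one run")
    passes = sum(1 for run in runs if run.passed)
    costs = [run.cost_usd for run in runs if run.cost_usd is not None]
    return {
        "pass_rate": f"{passes}/{len(runs)}",
        "consistency_pct": round(100 * passes / len(runs)),
        "mean_seconds": mean(run.seconds for run in runs),
        "mean_cost_usd": mean(costs) if costs else None,
    }
```

With two passes out of three runs this yields a pass rate of `2/3` and a consistency of 67, the same relationship as in the 3/3 = 100% example in the table.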

Workflow

1. Define Tasks

Create a tasks/ directory with one YAML file per task:

mkdir tasks
# Write your task definitions (see the template above)

2. Run Agents

Run agents against a task:

agent-eval run --task tasks/add-retry-logic.yaml --agent claude-code --agent aider --runs 3

Each run:

Creates a fresh git worktree from the specified commit
Hands the prompt to the agent
Runs the judge criteria
Records pass/fail, cost, and time
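The four steps above can be sketched as one function with the environment-specific parts injected as callables. Everything here is hypothetical scaffolding: `run_once`, `make_worktree`, `run_agent`, and the rule that a run passes only when every judge passes are assumptions for illustration, not the tool's documented behavior.

```python
import time


def run_once(prompt, make_worktree, run_agent, judges):
    """Execute one agent run and return its record.

    make_worktree() -> path of a fresh worktree at the pinned commit
    run_agent(prompt, workdir) -> cost in USD, or None if unknown
    judges -> callables taking the workdir and returning True on pass
    """
    workdir = make_worktree()                       # step 1: isolated checkout
    started = time.perf_counter()
    cost = run_agent(prompt, workdir)               # step 2: agent edits the worktree
    seconds = time.perf_counter() - started         # wall-clock time of the agent only
    passed = all(judge(workdir) for judge in judges)  # step 3: stops at the first failure
    return {"passed": passed, "cost_usd": cost, "seconds": seconds}  # step 4
```

Passing the agent and judges in as callables keeps the sequencing testable without launching a real agent.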

3. Compare Results

Generate a comparison report:

agent-eval report --format table

Task: add-retry-logic (3 runs each)
┌──────────────┬───────────┬────────┬────────┬─────────────┐
│ Agent        │ Pass Rate │ Cost   │ Time   │ Consistency │
├──────────────┼───────────┼────────┼────────┼─────────────┤
│ claude-code  │ 3/3       │ $0.12  │ 45s    │ 100%        │
│ aider        │ 2/3       │ $0.08  │ 38s    │  67%        │
└──────────────┴───────────┴────────┴────────┴─────────────┘

Judge Types

Code-Based (Deterministic)

judge:
  - type: pytest
    command: pytest tests/ -v
  - type: command
    command: npm run build
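A deterministic judge of this kind usually comes down to "run the command in the worktree and check the exit code". The sketch below shows that idea; `command_judge` and its timeout default are hypothetical, and treating a missing executable or a timeout as a failure is an assumption.

```python
import shlex
import subprocess


def command_judge(command: str, cwd: str, timeout: float = 600.0) -> bool:
    """Return True when the command exits with status 0 inside cwd."""
    try:
        completed = subprocess.run(
            shlex.split(command),   # split without invoking a shell
            cwd=cwd,
            capture_output=True,    # keep test or build output off the console
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return False                # hung or missing tool counts as a failed judge
    return completed.returncode == 0
```

For example, `command_judge("pytest tests/ -v", cwd=worktree_path)` would pass only when the test suite exits cleanly, where `worktree_path` stands for the directory of the agent's run.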

Pattern-Based

judge:
  - type: grep
    pattern: "class.*Retry"
    files: src/**/*.py
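A pattern judge can be pictured as a regex search over the files a glob selects. This sketch assumes the judge passes when at least one matching file contains the pattern; the source does not state whether agent-eval requires one match or a match in every file, and `grep_judge` is a hypothetical name.

```python
import re
from pathlib import Path


def grep_judge(root: Path, pattern: str, files_glob: str) -> bool:
    """Pass if any file under root selected by files_glob contains a regex match."""
    regex = re.compile(pattern)
    for path in root.glob(files_glob):   # "**" also matches nested directories
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        if regex.search(text):
            return True
    return False
```

A pattern check like this is cheap and deterministic, but it only shows that certain text exists, so it works best next to a test-based judge rather than in place of one.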

Model-Based (LLM-as-judge)

judge:
  - type: llm
    prompt: |
      Does this implementation correctly handle exponential backoff?
      Check for: max retries, increasing delays, jitter.
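One way to tame the noise of a model-based judge is to ask several times and take the majority. The sketch below is purely illustrative: `llm_judge`, the `ask_model` callable, the PASS/FAIL reply convention, and the voting scheme are all assumptions, and no real model API is called.

```python
def llm_judge(rubric: str, diff: str, ask_model, votes: int = 3) -> bool:
    """Ask the model `votes` times and pass on a strict majority of PASS replies.

    ask_model is any callable that takes the question text and returns the reply text.
    """
    question = (
        f"{rubric}\n\n"
        "Answer PASS or FAIL on the first line, then explain.\n\n"
        f"{diff}"
    )
    passes = 0
    for _ in range(votes):
        lines = ask_model(question).strip().splitlines()
        first_line = lines[0] if lines else ""   # an empty reply counts as a failure
        if first_line.strip().upper().startswith("PASS"):
            passes += 1
    return passes * 2 > votes
```

Injecting `ask_model` keeps the judge independent of any particular provider and lets the voting logic be tested with canned replies.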

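Best Practices

Start with 3-5 tasks that represent your real workload, not toy examples
Run at least 3 trials per agent: agents are non-deterministic, so capture the variance
Pin the commit in the task YAML so results stay reproducible across days and weeks
Include at least one deterministic judge per task (tests, build), since LLM judges add noise
Track cost alongside pass rate: an agent that reaches 95% at 10x the cost may not be the right choice
Version your task definitions: they are test fixtures, so treat them as code

Links

Repository: github.com/joaquinhuigomez/agent-eval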


Compare coding agents (Claude Code, Aider, Codex, etc.) head-to-head on custom tasks, measuring pass rate, cost, time, and consistency.

185,766 stars

development

Updated May 18, 2026


npx skillsauth add affaan-m/everything-claude-code agent-eval

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.


Trivy: container and dependency vulnerability scanner. Status: Clean (95%)
Semgrep: static code analysis for vulnerabilities. Status: Clean (95%)
mcp-scan (Snyk): Model Context Protocol security validation. Status: Clean (95%)
Snyk (dep): open source security scanning. Status: Skipped (50%)
Socket.dev: supply chain security analysis. Status: Skipped (50%)
VirusTotal: multi-engine malware detection. Status: Skipped (50%)
CrowdStrike: advanced threat intelligence. Status: Skipped (50%)
OSV-Scanner: Open Source Vulnerability database check. Status: Skipped (50%)
OWASP Dep-Check. Status: Skipped (50%)

Last scanned: May 18, 2026, 4:00 AM. Duration: 157.5s. 1 file scanned.

SKILL.md

name: agent-eval
description: Compare coding agents (Claude Code, Aider, Codex, etc.) head-to-head on custom tasks, measuring pass rate, cost, time, and consistency.
origin: ECC
tools: Read, Write, Edit, Bash, Grep, Glob




Install manually


# Clone the repo
git clone https://github.com/affaan-m/everything-claude-code.git

# Copy into Claude Code skills folder (global)
cp -r everything-claude-code/docs/ja-JP/skills/agent-eval ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

affaan-m/everything-claude-code

185,766 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT