Adoption

Agent Skills are supported by leading AI development tools.

VS Code, Gemini CLI, GitHub, Goose, Amp, Cursor, Claude Code, Letta, OpenCode, Claude, OpenAI Codex, Factory

takmczk/agent-eval

Name: agent-eval
Author: takmczk

.github/skills/agent-eval/SKILL.md

npx skillsauth add takmczk/copilot-cli-ecc agent-eval

Agent Eval Skill

A lightweight CLI tool for comparing coding agents head-to-head on reproducible tasks. Every "which coding agent is best?" comparison runs on vibes — this tool systematizes it.

When to Activate

Comparing coding agents (Claude Code, Aider, Codex, etc.) on your own codebase
Measuring agent performance before adopting a new tool or model
Running regression checks when an agent updates its model or tooling
Producing data-backed agent selection decisions for a team

Installation

Note: Install agent-eval from its repository after reviewing the source.

Core Concepts

YAML Task Definitions

Define tasks declaratively. Each task specifies what to do, which files to touch, and how to judge success:

name: add-retry-logic
description: Add exponential backoff retry to the HTTP client
repo: ./my-project
files:
  - src/http_client.py
prompt: |
  Add retry logic with exponential backoff to all HTTP requests.
  Max 3 retries. Initial delay 1s, max delay 30s.
judge:
  - type: pytest
    command: pytest tests/test_http_client.py -v
  - type: grep
    pattern: "exponential_backoff|retry"
    files: src/http_client.py
commit: "abc1234"  # pin to specific commit for reproducibility
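
One way to picture the structure above is as a small typed record. The sketch below is an assumption about how a loader might check a parsed task, not agent-eval's actual code: the `Task` and `Judge` classes, the `task_from_dict` helper, and the choice of which keys count as required are all hypothetical. It takes a plain dict (what a YAML parser would return for the file above), so it runs with the standard library alone.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Judge:
    type: str                                    # e.g. "pytest", "grep", "command", "llm"
    options: dict = field(default_factory=dict)  # remaining keys: command, pattern, files, prompt


@dataclass
class Task:
    name: str
    repo: str
    prompt: str
    judges: list
    files: list = field(default_factory=list)
    commit: Optional[str] = None                 # pinned commit, if any


# Hypothetical choice: treat these keys as mandatory.
REQUIRED_KEYS = ("name", "repo", "prompt", "judge")


def task_from_dict(raw: dict) -> Task:
    """Validate a parsed task definition and convert it to a Task."""
    missing = [key for key in REQUIRED_KEYS if key not in raw]
    if missing:
        raise ValueError(f"task is missing required keys: {missing}")

    judges = []
    for entry in raw["judge"]:
        if "type" not in entry:
            raise ValueError("every judge entry needs a 'type'")
        options = {k: v for k, v in entry.items() if k != "type"}
        judges.append(Judge(type=entry["type"], options=options))
    if not judges:
        raise ValueError("a task needs at least one judge")

    return Task(
        name=raw["name"],
        repo=raw["repo"],
        prompt=raw["prompt"],
        judges=judges,
        files=list(raw.get("files", [])),
        commit=raw.get("commit"),
    )
```

Failing early on a missing `judge` or `prompt` key is cheaper than discovering the problem after an agent has already spent API budget on the task.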

Git Worktree Isolation

Each agent run gets its own git worktree, so no Docker is required. The worktree isolates runs from one another: agents cannot interfere with each other or corrupt the base repo, and each run starts from the same commit, which keeps results reproducible.
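
A minimal sketch of this isolation step, assuming the tool shells out to `git worktree`. The function names are hypothetical and this is not agent-eval's own code; only the `git worktree add --detach` and `git worktree remove --force` commands are standard git.

```python
import subprocess
from pathlib import Path


def worktree_add_command(worktree_dir: Path, commit: str) -> list:
    """Build the git command that checks out `commit` into a new worktree.

    --detach avoids creating a branch, so repeated runs do not collide on names.
    """
    return ["git", "worktree", "add", "--detach", str(worktree_dir), commit]


def create_worktree(repo: Path, worktree_dir: Path, commit: str) -> Path:
    """Create an isolated checkout of `commit` for a single agent run."""
    subprocess.run(worktree_add_command(worktree_dir, commit), cwd=repo, check=True)
    return worktree_dir


def remove_worktree(repo: Path, worktree_dir: Path) -> None:
    """Discard the worktree after the run, including any uncommitted agent edits."""
    subprocess.run(
        ["git", "worktree", "remove", "--force", str(worktree_dir)],
        cwd=repo,
        check=True,
    )
```

Because each worktree has its own working directory and index, an agent that leaves the tree in a broken state affects only its own run.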

Metrics Collected

| Metric | What It Measures |
|--------|-----------------|
| Pass rate | Did the agent produce code that passes the judge? |
| Cost | API spend per task (when available) |
| Time | Wall-clock seconds to completion |
| Consistency | Pass rate across repeated runs (e.g., 3/3 = 100%) |
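
As an illustration of how these metrics could be derived from individual runs, here is a small sketch. The `RunResult` record and `summarize` function are hypothetical, and averaging cost and time across runs is an assumption; the source does not say whether the tool reports averages or totals.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunResult:
    passed: bool      # did every judge pass?
    cost_usd: float   # API spend for this run
    seconds: float    # wall-clock duration


def summarize(runs: list) -> dict:
    """Aggregate repeated runs of one agent on one task."""
    if not runs:
        raise ValueError("need at least one run to summarize")
    passes = sum(1 for run in runs if run.passed)
    return {
        "pass_rate": f"{passes}/{len(runs)}",
        "consistency_pct": round(100 * passes / len(runs)),
        "mean_cost_usd": round(mean(run.cost_usd for run in runs), 2),
        "mean_seconds": round(mean(run.seconds for run in runs), 1),
    }
```

With two passing runs out of three, `consistency_pct` comes out as 67, matching the "e.g., 3/3 = 100%" convention in the table.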

Workflow

1. Define Tasks

Create a tasks/ directory with YAML files, one per task:

mkdir tasks
# Write task definitions (see template above)

2. Run Agents

Execute agents against your tasks:

agent-eval run --task tasks/add-retry-logic.yaml --agent claude-code --agent aider --runs 3

Each run:

Creates a fresh git worktree from the specified commit
Hands the prompt to the agent
Runs the judge criteria
Records pass/fail, cost, and time

3. Compare Results

Generate a comparison report:

agent-eval report --format table

Task: add-retry-logic (3 runs each)
┌──────────────┬───────────┬────────┬────────┬─────────────┐
│ Agent        │ Pass Rate │ Cost   │ Time   │ Consistency │
├──────────────┼───────────┼────────┼────────┼─────────────┤
│ claude-code  │ 3/3       │ $0.12  │ 45s    │ 100%        │
│ aider        │ 2/3       │ $0.08  │ 38s    │  67%        │
└──────────────┴───────────┴────────┴────────┴─────────────┘

Judge Types

Code-Based (deterministic)

judge:
  - type: pytest
    command: pytest tests/ -v
  - type: command
    command: npm run build
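
A deterministic judge of this kind can be thought of as "run the command, pass if it exits with status 0". The sketch below assumes exactly that rule; `run_command_judge` and its timeout default are hypothetical and not taken from agent-eval.

```python
import shlex
import subprocess


def run_command_judge(command: str, cwd: str = ".", timeout: float = 600.0) -> bool:
    """Return True when `command` exits with status 0 inside `cwd`.

    A missing executable or a timeout counts as a failed judgment
    rather than an error in the harness.
    """
    try:
        completed = subprocess.run(
            shlex.split(command),   # no shell, so the command is not re-interpreted
            cwd=cwd,
            capture_output=True,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return False
    return completed.returncode == 0
```

Called as `run_command_judge("pytest tests/ -v", cwd=worktree_path)`, this gives the same verdict on every run for the same code, which is what makes this judge type a useful anchor next to noisier ones.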

Pattern-Based

judge:
  - type: grep
    pattern: "class.*Retry"
    files: src/**/*.py
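
A pattern judge can be sketched as a regular-expression search over the files matched by the glob. The version below assumes the judge passes as soon as any matching file contains the pattern; that rule and the `run_grep_judge` name are assumptions, not documented behavior.

```python
import re
from pathlib import Path


def run_grep_judge(pattern: str, files: str, root: str = ".") -> bool:
    """Return True if any file matching the `files` glob contains `pattern`."""
    regex = re.compile(pattern)
    for path in Path(root).glob(files):      # "**" also matches zero directories
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        if regex.search(text):
            return True
    return False
```

A match only shows that the expected identifier exists, not that the code behaves correctly, so this judge is best paired with a test-based one.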

Model-Based (LLM-as-judge)

judge:
  - type: llm
    prompt: |
      Does this implementation correctly handle exponential backoff?
      Check for: max retries, increasing delays, jitter.

Best Practices

Start with 3-5 tasks that represent your real workload, not toy examples
Run at least 3 trials per agent to capture variance — agents are non-deterministic
Pin the commit in your task YAML so results are reproducible across days/weeks
Include at least one deterministic judge (tests, build) per task — LLM judges add noise
Track cost alongside pass rate — a 95% agent at 10x the cost may not be the right choice
Version your task definitions — they are test fixtures, treat them as code
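
To make the cost-versus-pass-rate trade-off concrete, one option is to compare agents on expected spend per passing result. The sketch below assumes the Cost column in the sample report is the average spend per run; the source does not confirm this, and `cost_per_pass` is a hypothetical helper.

```python
def cost_per_pass(cost_per_run_usd: float, passes: int, runs: int) -> float:
    """Total spend across all runs divided by the number of passing runs."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    if passes == 0:
        return float("inf")   # the agent never produced a passing result
    return cost_per_run_usd * runs / passes


# Figures from the sample report: claude-code 3/3 at $0.12, aider 2/3 at $0.08.
claude_code = cost_per_pass(0.12, passes=3, runs=3)
aider = cost_per_pass(0.08, passes=2, runs=3)
```

Under that assumption both agents work out to roughly $0.12 per passing result, so the cheaper per-run price is offset by the failed run.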

Head-to-head comparison of coding agents (Claude Code, Aider, Codex, etc.) on custom tasks with pass rate, cost, time, and consistency metrics

1 star

development

Updated Apr 15, 2026

$ install --global

skillsauth

npx skillsauth add takmczk/copilot-cli-ecc agent-eval

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Reported % |
|---------|-------------|--------|------------|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Mar 29, 2026, 7:12 AM (36.6s, 1 file scanned)

SKILL.md

name: agent-eval
description: Head-to-head comparison of coding agents (Claude Code, Aider, Codex, etc.) on custom tasks with pass rate, cost, time, and consistency metrics
origin: ECC
tools: Read, Write, Edit, Bash, Grep, Glob

Related Skills

takmczk/x-api

development

Verified, Trusted, Community

X/Twitter API integration for posting tweets, threads, reading timelines, search, and analytics. Covers OAuth auth patterns, rate limits, and platform-native content posting. Use when the user wants to interact with X programmatically.

1 | SKILL.md | Updated Apr 15, 2026

takmczk/visa-doc-translate

documentation

Verified, Trusted, Community

Translate visa application documents (images) to English and create a bilingual PDF with original and translation

1 | SKILL.md | Updated Apr 15, 2026

takmczk/videodb

tools

Verified, Trusted, Community

See, Understand, Act on video and audio. See: ingest from local files, URLs, RTSP/live feeds, or live desktop recording; return real-time context and playable stream links. Understand: extract frames, build visual/semantic/temporal indexes, and search moments with timestamps and auto-clips. Act: transcode and normalize (codec, fps, resolution, aspect ratio), perform timeline edits (subtitles, text/image overlays, branding, audio overlays, dubbing, translation), generate media assets (image, audio, video), and create real-time alerts for events from live streams or desktop capture.

1 | SKILL.md | Updated Apr 15, 2026

takmczk/video-editing

development

Verified, Trusted, Community

AI-assisted video editing workflows for cutting, structuring, and augmenting real footage. Covers the full pipeline from raw capture through FFmpeg, Remotion, ElevenLabs, fal.ai, and final polish in Descript or CapCut. Use when the user wants to edit video, cut footage, create vlogs, or build video content.

1 | SKILL.md | Updated Apr 15, 2026

Install manually

# Clone the repo
git clone https://github.com/takmczk/copilot-cli-ecc.git

# Copy into Claude Code skills folder (global)
cp -r copilot-cli-ecc/.github/skills/agent-eval ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

takmczk/copilot-cli-ecc

1 star

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT
