.github/skills/agent-eval/SKILL.md
Head-to-head comparison of coding agents (Claude Code, Aider, Codex, etc.) on custom tasks with pass rate, cost, time, and consistency metrics
npx skillsauth add takmczk/copilot-cli-ecc agent-evalInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
A lightweight CLI tool for comparing coding agents head-to-head on reproducible tasks. Every "which coding agent is best?" comparison runs on vibes — this tool systematizes it.
Note: Install agent-eval from its repository after reviewing the source.
Define tasks declaratively. Each task specifies what to do, which files to touch, and how to judge success:
name: add-retry-logic
description: Add exponential backoff retry to the HTTP client
repo: ./my-project
files:
- src/http_client.py
prompt: |
Add retry logic with exponential backoff to all HTTP requests.
Max 3 retries. Initial delay 1s, max delay 30s.
judge:
- type: pytest
command: pytest tests/test_http_client.py -v
- type: grep
pattern: "exponential_backoff|retry"
files: src/http_client.py
commit: "abc1234" # pin to specific commit for reproducibility
Each agent run gets its own git worktree — no Docker required. This provides reproducibility isolation so agents cannot interfere with each other or corrupt the base repo.
| Metric | What It Measures | |--------|-----------------| | Pass rate | Did the agent produce code that passes the judge? | | Cost | API spend per task (when available) | | Time | Wall-clock seconds to completion | | Consistency | Pass rate across repeated runs (e.g., 3/3 = 100%) |
Create a tasks/ directory with YAML files, one per task:
mkdir tasks
# Write task definitions (see template above)
Execute agents against your tasks:
agent-eval run --task tasks/add-retry-logic.yaml --agent claude-code --agent aider --runs 3
Each run:
Generate a comparison report:
agent-eval report --format table
Task: add-retry-logic (3 runs each)
┌──────────────┬───────────┬────────┬────────┬─────────────┐
│ Agent │ Pass Rate │ Cost │ Time │ Consistency │
├──────────────┼───────────┼────────┼────────┼─────────────┤
│ claude-code │ 3/3 │ $0.12 │ 45s │ 100% │
│ aider │ 2/3 │ $0.08 │ 38s │ 67% │
└──────────────┴───────────┴────────┴────────┴─────────────┘
judge:
- type: pytest
command: pytest tests/ -v
- type: command
command: npm run build
judge:
- type: grep
pattern: "class.*Retry"
files: src/**/*.py
judge:
- type: llm
prompt: |
Does this implementation correctly handle exponential backoff?
Check for: max retries, increasing delays, jitter.
development
X/Twitter API integration for posting tweets, threads, reading timelines, search, and analytics. Covers OAuth auth patterns, rate limits, and platform-native content posting. Use when the user wants to interact with X programmatically.
documentation
Translate visa application documents (images) to English and create a bilingual PDF with original and translation
tools
See, Understand, Act on video and audio. See- ingest from local files, URLs, RTSP/live feeds, or live record desktop; return realtime context and playable stream links. Understand- extract frames, build visual/semantic/temporal indexes, and search moments with timestamps and auto-clips. Act- transcode and normalize (codec, fps, resolution, aspect ratio), perform timeline edits (subtitles, text/image overlays, branding, audio overlays, dubbing, translation), generate media assets (image, audio, video), and create real time alerts for events from live streams or desktop capture.
development
AI-assisted video editing workflows for cutting, structuring, and augmenting real footage. Covers the full pipeline from raw capture through FFmpeg, Remotion, ElevenLabs, fal.ai, and final polish in Descript or CapCut. Use when the user wants to edit video, cut footage, create vlogs, or build video content.