skills/agents-md-evals/SKILL.md
Evaluate and optimize AGENTS.md/CLAUDE.md instruction files through A/B testing. Use when the user says "eval my agents.md", "test my claude.md", "optimize my instructions", "which rules matter", "trim my agents file", or wants to know which instructions in their global config actually change model behavior vs. are dead weight.
npx skillsauth add vltansky/skills agents-md-evals
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Evaluate whether rules in instruction files actually change model behavior, identify non-discriminating rules (model already does this by default), and optimize the file for maximum impact per token.
Most AGENTS.md/CLAUDE.md files contain rules the model already follows without being told. These waste context tokens every conversation. This skill runs controlled A/B tests — with the instruction file vs. without it — to identify which rules earn their place and which can be cut.
Well-structured codebases make most instruction rules redundant. The model explores existing code — reads package.json, scans existing components, follows import patterns — and matches conventions automatically. In empirical testing on a 755-line CLAUDE.md across a monorepo with 3 frontend stacks, 25 of 26 assertions passed identically with or without the instruction file. The model discovered React patterns, shadcn/ui conventions, Recharts usage, Hebrew labels, Firestore helpers, Zod validation, and serverTimestamp — all from existing code.
The only assertion that discriminated was pure domain knowledge the codebase couldn't teach: parallel components in separate directories that must always be changed together.
This means your CLAUDE.md is probably 80-95% redundant. The eval process will reveal exactly which rules survive.
Read ALL instruction files in scope. For project-level evals, find every instruction file recursively:
find {project-root} -name "CLAUDE.md" -o -name "AGENTS.md" | grep -v '/node_modules/' | grep -v '/\.git/'
For global evals, check all config directories:
ls -la ~/.claude/CLAUDE.md ~/.codex/AGENTS.md ~/.cursor/AGENTS.md 2>/dev/null
Categorize each rule:
| Category | Examples | Likely discriminating? |
|----------|---------|----------------------|
| Coding style | type over interface, early returns, no React.FC | Low — models default to modern patterns |
| Response format | Checklist style, tldr footer, separators | Low-Medium — some are model defaults |
| Workflow | TDD, validation steps, branch naming | High — models don't do these unprompted |
| Decision style | Single recommendation, epistemic separation | High — models default to option menus |
| Safety guardrails | No force-push, no rm -rf, no dev server | Can't A/B test safely — keep by default |
| Identity/preferences | Author name, email, no emojis | Keep — low cost, high consequence if wrong |
| Domain-specific | Parallel component parity, project-specific helpers | Keep — can't be known without instruction |
Rules in the last three categories should be kept regardless of eval results. Focus eval effort on coding style, response format, workflow, and decision style.
The most effective eval prompts come from analyzing real project history. Generic prompts ("write a React component") test patterns the model discovers from the codebase anyway.
Spawn a subagent to analyze recent commits:
Analyze the last 50-100 commits in {project-root} using git log.
Look for:
1. Patterns that appear in commits but aren't obvious from code structure alone
2. Mistakes that were fixed (these reveal non-obvious conventions)
3. Multi-file changes that suggest coupling rules
4. Domain-specific patterns (naming, helpers, config)
For each pattern, write a realistic eval prompt that would test whether
the model knows about it WITHOUT being told. Save to {workspace}/evals/commit-analysis.md
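One way to surface the "multi-file changes that suggest coupling rules" item is to count which files change in the same commit. The sketch below is an illustration under assumptions: `co_change_pairs` and `recent_log` are hypothetical helper names, and the `@@` commit delimiter is an arbitrary choice passed to `git log --pretty=format:`.

```python
import subprocess
from collections import Counter
from itertools import combinations


def co_change_pairs(log_text, min_count=2):
    """Count file pairs that appear together in the same commit.

    Expects `git log --name-only --pretty=format:@@%h` output: a line
    starting with '@@' opens each commit, followed by its changed paths.
    """
    pairs = Counter()
    for chunk in log_text.split("@@")[1:]:
        # First line of the chunk is the commit hash; the rest are paths.
        files = sorted({ln.strip() for ln in chunk.splitlines()[1:] if ln.strip()})
        pairs.update(combinations(files, 2))
    return [(pair, n) for pair, n in pairs.most_common() if n >= min_count]


def recent_log(project_root, limit=100):
    """Fetch the last `limit` commits (requires git on PATH)."""
    return subprocess.run(
        ["git", "-C", project_root, "log", f"-{limit}",
         "--name-only", "--pretty=format:@@%h"],
        check=True, capture_output=True, text=True,
    ).stdout


# Hand-written sample log so the example runs without a repository.
sample = (
    "@@a1\nweb/Table.tsx\nmobile/Table.tsx\n\n"
    "@@b2\nweb/Table.tsx\nmobile/Table.tsx\nREADME.md\n\n"
    "@@c3\nREADME.md\n"
)
coupled = co_change_pairs(sample)
print(coupled)
```

Pairs that recur across many commits are candidates for a coupling rule, and for an eval prompt that touches only one side of the pair.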
Each eval needs 3-6 assertions. Mix automated (grep-able) and manual:
- "check": "companyCollection": regex against output

A discriminating assertion:
If an assertion passes in both conditions, the rule it tests is dead weight.
Save prompts to {workspace}/evals/evals.json. Present to the user before running.
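As a rough sketch of how a grep-able assertion could be scored across the two conditions: the `Assertion` shape, the `classify` helper, and the "regression" and "failed-both" labels are assumptions for illustration, not the schema in `references/schemas.md`.

```python
import re
from dataclasses import dataclass


@dataclass
class Assertion:
    text: str   # human-readable description
    check: str  # regex searched for in the agent's output


def passes(assertion, output):
    """True when the assertion's regex matches anywhere in the output."""
    return re.search(assertion.check, output) is not None


def classify(assertion, with_output, baseline_output):
    """Label an assertion by comparing the with-file run to the clean baseline."""
    with_ok = passes(assertion, with_output)
    base_ok = passes(assertion, baseline_output)
    if with_ok and not base_ok:
        return "discriminating"      # the rule changed behavior
    if with_ok and base_ok:
        return "non-discriminating"  # passes in both conditions: dead weight
    if base_ok:
        return "regression"          # only the baseline passed
    return "failed-both"             # neither run satisfied it


a = Assertion("Uses the companyCollection helper", r"companyCollection")
label = classify(a, "const ref = companyCollection(db)", "const ref = collection(db, 'x')")
print(label)  # prints "discriminating"
```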
This is the most error-prone step. Claude Code auto-loads CLAUDE.md/AGENTS.md into every conversation, including subagents. A contaminated baseline produces false results — you'll think rules are "non-discriminating" when really both runs had them.
This protocol was refined through 3 failed attempts. Follow it exactly.
Phase 1: Discover all instruction and memory files
# Find ALL instruction files in the project (monorepos have them in packages)
find {project-root} -name "CLAUDE.md" -o -name "AGENTS.md" \
| grep -v '/node_modules/' | grep -v '/\.git/' | sort
Also check user-level config and memory files — these are auto-loaded too:
ls -la ~/.claude/CLAUDE.md ~/.codex/AGENTS.md ~/.cursor/AGENTS.md 2>/dev/null
ls -d ~/.claude/projects/*/memory/ 2>/dev/null
Phase 2: Move ALL files to /tmp/ (not rename, not .bak)
Renaming to .bak in the same directory is weak — the model might still discover it. Move to /tmp/ with directory structure preserved. Include memory files — there is no CLI flag to disable memory loading.
BACKUP_DIR="/tmp/{project-name}-md-backup-$(date +%s)"
mkdir -p "$BACKUP_DIR"
# For each instruction file found in Phase 1:
for f in {list of files}; do
REL_PATH="${f#{project-root}/}"
mkdir -p "$BACKUP_DIR/$(dirname "$REL_PATH")"
mv "$f" "$BACKUP_DIR/$REL_PATH"
done
# User-level files: ASK before touching. These are outside the project.
# "I found global instruction files and memory dirs that will contaminate baselines.
# OK to temporarily rename them? I'll restore immediately after."
# Only proceed with explicit user consent:
for f in ~/.claude/CLAUDE.md ~/.codex/AGENTS.md ~/.cursor/AGENTS.md; do
[ -f "$f" ] && mv "$f" "${f}.bak"
done
find ~/.claude/projects -type d -name "memory" -exec sh -c 'mv "$1" "${1}_bak"' _ {} \; 2>/dev/null
# If the user declines, note that baselines may be partially contaminated by global rules.
Phase 3: Verify removal
# Must return zero results:
find {project-root} -name "CLAUDE.md" -o -name "AGENTS.md" \
| grep -v '/node_modules/' | grep -v '/\.git/' | wc -l
If ANY files remain, do not proceed.
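The discovery and verification steps can also be scripted, which makes the "zero results" gate harder to skip. This is a minimal sketch; `find_instruction_files` and `assert_clean` are hypothetical names, and the demo builds a throwaway directory tree rather than touching a real project.

```python
import os
import tempfile
from pathlib import Path

INSTRUCTION_NAMES = {"CLAUDE.md", "AGENTS.md"}
SKIP_DIRS = {"node_modules", ".git"}


def find_instruction_files(project_root):
    """Return every CLAUDE.md / AGENTS.md under project_root, sorted."""
    found = []
    for dirpath, dirnames, filenames in os.walk(project_root):
        # Prune skipped directories in place so os.walk never descends into them.
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        found.extend(Path(dirpath) / n for n in filenames if n in INSTRUCTION_NAMES)
    return sorted(found)


def assert_clean(project_root):
    """Raise if any instruction file is still present (the Phase 3 gate)."""
    remaining = find_instruction_files(project_root)
    if remaining:
        raise RuntimeError(f"{len(remaining)} instruction file(s) remain: {remaining}")


# Demo on a temporary tree: the copy under node_modules is ignored.
with tempfile.TemporaryDirectory() as root:
    for rel in ("CLAUDE.md", "packages/api/AGENTS.md", "node_modules/pkg/AGENTS.md"):
        path = Path(root, rel)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text("rules")
    found = [p.relative_to(root).as_posix() for p in find_instruction_files(root)]

print(found)
```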
Phase 4: Spawn all clean baseline agents
Launch all baseline agents NOW. They start in a world with zero instruction files.
Phase 5: WAIT for ALL baseline agents to complete
DO NOT restore files until every single baseline agent has finished and you have saved its timing data. This is critical — restoring early creates a race condition where agents that haven't fully initialized yet may pick up the restored files.
Phase 6: Restore all files (ALWAYS — even if baselines errored)
Restoration is non-negotiable. If agents fail, time out, or you hit any error — restore FIRST, debug second. Leaving user config renamed breaks all future Claude sessions.
# Only after ALL baselines are done (or failed):
find "$BACKUP_DIR" -type f | while IFS= read -r f; do
REL_PATH="${f#"$BACKUP_DIR"/}"
mv "$f" "{project-root}/$REL_PATH"
done
# Restore user-level files (if they were renamed with user consent):
for f in ~/.claude/CLAUDE.md.bak ~/.codex/AGENTS.md.bak ~/.cursor/AGENTS.md.bak; do
[ -f "$f" ] && mv "$f" "${f%.bak}"
done
# Restore memory dirs:
find ~/.claude/projects -type d -name "memory_bak" -exec sh -c 'mv "$1" "${1%_bak}"' _ {} \; 2>/dev/null
Phase 7: Git checkout
Baseline agents may have modified project files during their runs:
cd {project-root} && git checkout .
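The "restore FIRST, debug second" requirement maps naturally onto a `try`/`finally`. The sketch below is one possible shape, not the skill's own tooling: `instruction_files_hidden` is a hypothetical context manager, it handles project-level files only, and the baseline agents would be launched and awaited inside the `with` body.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def instruction_files_hidden(project_root, files):
    """Move files into a temp backup dir, yield, and always move them back."""
    root = Path(project_root)
    backup = Path(tempfile.mkdtemp(prefix="md-backup-"))
    moved = []  # (original, backup_copy) pairs, recorded as each move succeeds
    try:
        for f in map(Path, files):
            dest = backup / f.relative_to(root)
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(f), str(dest))
            moved.append((f, dest))
        yield backup  # launch baseline agents and wait for ALL of them here
    finally:
        # Runs even if the body raised, so config is never left missing.
        for original, dest in moved:
            shutil.move(str(dest), str(original))
        shutil.rmtree(backup, ignore_errors=True)


# Demo: simulate a baseline failure and confirm the file still comes back.
with tempfile.TemporaryDirectory() as project:
    target = Path(project, "CLAUDE.md")
    target.write_text("rules")
    try:
        with instruction_files_hidden(project, [target]):
            hidden_during_run = not target.exists()
            raise RuntimeError("simulated baseline failure")
    except RuntimeError:
        pass
    restored = target.read_text() == "rules"

print(hidden_during_run, restored)  # prints "True True"
```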
For with-skill runs, all instruction files are on disk (auto-loaded). Additionally have the agent read the root file explicitly:
You are being evaluated. Read the file {path-to-instruction-file} first,
then follow ALL instructions in it exactly.
Your task: {eval prompt}
Save outputs to {workspace}/iteration-{N}/{eval-name}/with_skill/outputs/
Launch ALL runs — all with-skill and all clean-baseline agents — but in two batches:
Within each batch, launch all agents in a single turn for parallelism.
Each agent should save:
- outputs/
- outputs/response.md

Capture total_tokens and duration_ms from each task notification into timing.json.
After all runs complete:
For each output, evaluate assertions. Prefer scripted checks over eyeballing:
# Automated assertion check example
grep -cE 'companyCollection' outputs/response.md
Save results to grading.json in each run directory. The eval viewer expects this exact schema:
{
  "eval_id": 6,
  "eval_name": "descriptive-name",
  "config": "with_skill",
  "expectations": [
    { "text": "Assertion description", "passed": true, "evidence": "How you verified" }
  ]
}
The expectations array MUST use fields text, passed, and evidence — not name/met/details.
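A short sketch of producing that payload from regex checks, so the `text` / `passed` / `evidence` field names stay consistent. The `grade` helper and the sample checks are hypothetical; only the field names come from the schema above.

```python
import json
import re


def grade(eval_id, eval_name, config, response_text, checks):
    """Build a grading.json payload from (description, regex) pairs."""
    expectations = []
    for text, pattern in checks:
        match = re.search(pattern, response_text)
        if match:
            evidence = f"regex {pattern!r} matched {match.group(0)!r}"
        else:
            evidence = f"regex {pattern!r} found no match"
        expectations.append(
            {"text": text, "passed": match is not None, "evidence": evidence}
        )
    return {
        "eval_id": eval_id,
        "eval_name": eval_name,
        "config": config,
        "expectations": expectations,
    }


grading = grade(
    6, "descriptive-name", "with_skill",
    "const ref = companyCollection(db)",
    [
        ("Uses the companyCollection helper", r"companyCollection"),
        ("Calls serverTimestamp", r"serverTimestamp\("),
    ],
)
print(json.dumps(grading, indent=2))
```

Writing the result with `json.dump` into each run directory keeps manual and scripted grading in the same file format.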
Create benchmark.json. The eval viewer expects this exact schema:
{
  "metadata": { "skill_name": "agents-md", ... },
  "runs": [
    {
      "eval_id": 6,
      "eval_name": "descriptive-name",
      "configuration": "with_skill",
      "run_number": 1,
      "result": {
        "pass_rate": 0.833,
        "passed": 5,
        "failed": 1,
        "total": 6,
        "time_seconds": 162.5,
        "tokens": 76318,
        "tool_calls": 0,
        "errors": 0
      },
      "expectations": [
        { "text": "...", "passed": true, "evidence": "..." }
      ]
    }
  ],
  "run_summary": {
    "with_skill": {
      "pass_rate": { "mean": 0.793, "stddev": 0.124, "min": 0.667, "max": 1.0 },
      "time_seconds": { "mean": 169.5, "stddev": 31.5, ... },
      "tokens": { "mean": 69800, "stddev": 12502, ... }
    },
    "without_skill": { ... },
    "delta": { "pass_rate": "+0.067", "time_seconds": "-13.9", "tokens": "+13460" }
  }
}
Critical: configuration must be "with_skill" or "without_skill" — the viewer keys on these exact strings. Put each with_skill entry before its without_skill counterpart.
Aggregate with the bundled script:
python {this-skill-path}/scripts/aggregate_benchmark.py {workspace}/iteration-N --skill-name agents-md
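The internals of the bundled script are not shown here. As a rough illustration of what the `run_summary` fields mean, the sketch below computes them from a `runs` list. It assumes sample standard deviation and a fixed three-decimal delta format, either of which may differ from the real script; `summarize` and `run_summary` are hypothetical names.

```python
from statistics import mean, stdev


def summarize(values):
    """mean/stddev/min/max for one metric; stddev is 0.0 for a single run."""
    return {
        "mean": round(mean(values), 3),
        "stddev": round(stdev(values), 3) if len(values) > 1 else 0.0,
        "min": min(values),
        "max": max(values),
    }


def run_summary(runs):
    """Group runs by configuration, then compute per-metric stats and deltas."""
    metrics = ("pass_rate", "time_seconds", "tokens")
    summary = {}
    for config in ("with_skill", "without_skill"):
        results = [r["result"] for r in runs if r["configuration"] == config]
        summary[config] = {m: summarize([res[m] for res in results]) for m in metrics}
    # Delta is with_skill mean minus without_skill mean, as a signed string.
    summary["delta"] = {
        m: f'{summary["with_skill"][m]["mean"] - summary["without_skill"][m]["mean"]:+.3f}'
        for m in metrics
    }
    return summary


# Made-up numbers, purely to exercise the function.
runs = [
    {"configuration": "with_skill",
     "result": {"pass_rate": 1.0, "time_seconds": 160.0, "tokens": 70000}},
    {"configuration": "with_skill",
     "result": {"pass_rate": 0.5, "time_seconds": 180.0, "tokens": 72000}},
    {"configuration": "without_skill",
     "result": {"pass_rate": 0.5, "time_seconds": 150.0, "tokens": 60000}},
    {"configuration": "without_skill",
     "result": {"pass_rate": 0.5, "time_seconds": 170.0, "tokens": 58000}},
]
summary = run_summary(runs)
print(summary["delta"])
```

Note that this sketch raises `statistics.StatisticsError` if either configuration has no runs, which is a reasonable failure mode when a batch was skipped.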
For each assertion, classify it:
Due to the codebase-teaches-patterns effect, most project-specific rules will be non-discriminating. The model reads existing files and matches conventions. This is the expected result, not a test failure. Common non-discriminating patterns:
- package.json
- tsconfig.json and existing files

Rules that discriminate tend to be:
For each surviving rule, apply this filter — it must pass at least ONE criterion:
| Criterion | Question |
|-----------|----------|
| Specificity | Does it describe a project-unique decision? |
| Behavior change | Would the agent do something different without it? |
| Surprise | Would it surprise a competent dev joining the project? |
| Pain | Did it come from a real mistake or recurring PR feedback? |
| Frequency | Does it apply to nearly every task AND cost few tokens? |
| No feedback loop | Can't be enforced by tests, types, linters, or CI? |
If a rule fails all six AND is non-discriminating in A/B tests, it's dead weight.
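The dead-weight test is a conjunction, which is easy to get backwards when reviewing many rules. A minimal sketch, with hypothetical criterion keys and function name:

```python
CRITERIA = (
    "specificity", "behavior_change", "surprise",
    "pain", "frequency", "no_feedback_loop",
)


def is_dead_weight(criteria_met, discriminating):
    """Dead weight = fails all six criteria AND showed no A/B difference."""
    unknown = set(criteria_met) - set(CRITERIA)
    if unknown:
        raise ValueError(f"unknown criteria: {sorted(unknown)}")
    return not criteria_met and not discriminating


print(is_dead_weight(set(), discriminating=False))     # prints "True"
print(is_dead_weight({"pain"}, discriminating=False))  # prints "False"
```

A rule that passes even one criterion, or that discriminated in testing, survives this check.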
Don't delete based on A/B tests:
Calculate: how many tokens does the instruction file add per conversation vs. behavioral improvement? A 755-line file that produces +3.8% improvement is a poor trade. A 50-line file with the same improvement is excellent.
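One way to make that trade-off concrete is tokens spent per percentage point of pass-rate gain. The function name and the token counts below are illustrative assumptions (the source gives line counts, not token counts); only the 25-of-26 pass pattern comes from the text.

```python
def tokens_per_point(file_tokens, with_rate, without_rate):
    """Context tokens spent per percentage point of pass-rate improvement."""
    gain_points = (with_rate - without_rate) * 100
    if gain_points <= 0:
        return float("inf")  # no measurable benefit
    return file_tokens / gain_points


# Same improvement (one assertion out of 26), two hypothetical file sizes.
big = tokens_per_point(7550, 1.0, 25 / 26)   # assumed token count for a long file
small = tokens_per_point(500, 1.0, 25 / 26)  # assumed token count for a short file
print(round(big), round(small))
```

Lower is better: the shorter file buys the same behavioral gain for a fraction of the per-conversation cost.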
Once you know which rules survive, restructure the file:
- docs/: detailed conventions, architecture notes, testing patterns
- packages/api/AGENTS.md for API-only conventions

python {this-skill-path}/eval-viewer/generate_review.py \
{workspace}/iteration-N \
--skill-name "agents-md" \
--benchmark {workspace}/iteration-N/benchmark.json \
--static /tmp/agents-md-eval-review.html
The viewer defaults to "with skill"/"without skill" labels. For agents.md evals, post-process the HTML to show "with agents.md"/"without agents.md":
# Post-process the generated viewer so labels read "agents.md" instead of "skill".
path = "/tmp/agents-md-eval-review.html"
with open(path) as f:
    html = f.read()

# Badge labels: map the two known configs, fall back to the default formatting.
LABEL_MAP = '{"with_skill": "with agents.md", "without_skill": "without agents.md"}[config] || config.replace(/_/g, " ");'
html = html.replace(
    'badge.textContent = config.replace(/_/g, " ");',
    'badge.textContent = ' + LABEL_MAP
)

# Also fix benchmark table labels (labelA, labelB, configLabel).
# Each is declared as `const <var> = <src>` followed by the same title-casing chain.
TITLE_CASE = '.replace(/_/g, " ").replace(/\\b\\w/g, c => c.toUpperCase());'
for var, src in [('labelA', 'configA'), ('labelB', 'configB'), ('configLabel', 'config')]:
    orig = f'const {var} = {src}{TITLE_CASE}'
    repl = (
        f'const {var} = {src} === "with_skill" ? "With agents.md" : '
        f'{src} === "without_skill" ? "Without agents.md" : {src}{TITLE_CASE}'
    )
    html = html.replace(orig, repl)

with open(path, "w") as f:
    f.write(html)
For iteration 2+, pass --previous-workspace to show diffs.
After the user reviews:
iteration-{N+1}/ to verify:
Keep iterating until:
{workspace}/
├── evals/
│ ├── evals.json (initial eval prompts)
│ ├── evals-v2.json (commit-derived prompts)
│ └── commit-analysis.md (git log findings)
├── iteration-1/
│ ├── {eval-name}/
│ │ ├── eval_metadata.json
│ │ ├── with_skill/
│ │ │ ├── outputs/
│ │ │ ├── timing.json
│ │ │ └── grading.json
│ │ └── clean_baseline/
│ │ ├── outputs/
│ │ ├── timing.json
│ │ └── grading.json
│ ├── benchmark.json
│ └── benchmark.md
├── iteration-2/
│ └── ...
└── evals.json
Contaminated baselines: The #1 failure mode. Claude Code auto-loads CLAUDE.md for ALL conversations including subagents. Simply telling an agent "don't read CLAUDE.md" does nothing — the file must be physically absent.
Restoring files too early: Spawning agents then immediately restoring creates a race condition. Wait for ALL agents to finish.
Missing nested instruction files: Monorepos have package-level CLAUDE.md files (e.g., packages/functions/CLAUDE.md). Move ALL of them, not just the root one.
Generic eval prompts: "Write a React component" tests patterns the model discovers from package.json anyway. Use commit-derived prompts that target non-obvious domain knowledge.
Wrong benchmark schema: The eval viewer expects a specific JSON structure with runs[] array and run_summary with mean/stddev. Custom schemas produce empty viewers.
Expecting high deltas: Due to the codebase-teaches-patterns effect, +3-10% is a realistic delta for a well-structured codebase. +30%+ deltas suggest contaminated baselines.
All bundled within this skill — no external dependencies:
- eval-viewer/generate_review.py: generates the HTML review viewer
- scripts/aggregate_benchmark.py: aggregates grading into benchmark.json
- agents/grader.md: prompt for grading subagent
- agents/analyzer.md: prompt for analysis subagent
- references/schemas.md: JSON schemas for evals, grading, benchmark
- references/eval-categories.md: rule categories, empirical findings, commit-derived eval guide