skills/agent-evaluation/SKILL.md
Use when testing or benchmarking LLM agents before deployment. Use when designing eval suites for agent correctness, reliability, or safety. Use when setting up LLM-as-judge scoring pipelines. Use when debugging why an agent 'feels worse' but metrics don't show it. Use when choosing between human evaluation, automated rubrics, or A/B testing. NEVER for evaluating Claude Code skills specifically — use skill-judge for that.
Think like a quality engineer who has seen agents ace benchmarks and then fail spectacularly in production. The fundamental challenge: LLM agents are non-deterministic. The same input can produce a different output on every run, so traditional testing (assert output == expected) breaks immediately.
Before designing any eval:
What do you need to evaluate?
│
├─ Does the agent produce correct answers?
│  └─ Functional Correctness Evals
│     Method: Reference-based for factual tasks, rubric-based for open-ended
│     Trap: "Correct" often has no single answer. Don't use exact match
│           unless the task is truly deterministic (math, code output).
│
├─ Does the agent behave safely regardless of input?
│  └─ Behavioral Invariant Tests
│     Method: Define rules that must ALWAYS hold (no PII leakage, stays on
│             topic, no hallucinated citations). Test with adversarial inputs.
│     Key: These are binary pass/fail, which makes them the easiest to automate.
│
├─ Is the agent consistent across runs?
│  └─ Reliability/Consistency Evals
│     Method: Run same input 5-10 times. Measure variance in quality scores.
│     Trap: High average + high variance = unreliable agent. Users hate
│           inconsistency more than consistently mediocre quality.
│
├─ Where does the agent break?
│  └─ Capability Boundary Testing
│     Method: Progressively harder inputs in each capability dimension.
│             Find the cliff where quality drops off. Document it.
│     Key: Know your limits. Don't promise what you can't deliver.
│
└─ Is version B better than version A?
   └─ Comparative Evaluation (A/B)
      Method: Same test set, both versions, LLM-as-judge or human.
      Trap: Need statistical significance. 50 test cases minimum.
      Online A/B: real users, real tasks. Gold standard but slow.
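The "need statistical significance" trap in the A/B branch can be made concrete with a paired comparison. The sketch below is one possible approach, not a method the skill prescribes: it assumes each test case has already been scored for both versions (by a judge or a human) and applies an exact two-sided sign test. The names `sign_test_p_value` and `compare_versions` are hypothetical.

```python
from math import comb
from typing import Sequence, Tuple


def sign_test_p_value(wins_a: int, wins_b: int) -> float:
    """Exact two-sided sign test. Ties must already be excluded."""
    n = wins_a + wins_b
    if n == 0:
        return 1.0  # no decisive cases, so no evidence either way
    k = min(wins_a, wins_b)
    # Chance of a split at least this lopsided if both versions were equally good.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def compare_versions(
    scores_a: Sequence[float], scores_b: Sequence[float]
) -> Tuple[int, int, float]:
    """Count per-case wins on the same test set and return (wins_a, wins_b, p)."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both versions must be scored on the same test cases")
    wins_a = sum(a > b for a, b in zip(scores_a, scores_b))
    wins_b = sum(b > a for a, b in zip(scores_a, scores_b))
    return wins_a, wins_b, sign_test_p_value(wins_a, wins_b)
```

With 50 decisive cases, a 28 to 22 split is well within chance under this test, while a 35 to 15 split is not, which illustrates why small test sets rarely settle an A/B question.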
| Method | Best For | Cost | Pitfall |
|--------|----------|------|---------|
| Exact match | Math, code output, classification | Free | Fails on semantically equivalent answers ("NYC" vs "New York City") |
| Rubric-based scoring | Open-ended tasks with clear criteria | Low | Rubric quality determines eval quality, so invest in rubric design |
| LLM-as-judge | Fast automated eval at scale | Medium | Has systematic biases (see below). Calibrate against human judges. |
| Human evaluation | High-stakes, subjective quality | High | Slow, expensive, but ground truth. Use to calibrate automated evals. |
| A/B testing (online) | Production comparison of two versions | High | Requires real traffic. Need weeks for statistical significance. |
LLM judges are powerful but have systematic biases that silently corrupt your eval results:
| Bias | What Happens | Fix |
|------|-------------|-----|
| Position bias | Judge prefers the first response shown | Randomize order, run twice with swapped positions, average scores |
| Verbosity bias | Judge rates longer responses higher regardless of quality | Include "conciseness" in rubric, penalize unnecessary padding |
| Self-enhancement | Judge prefers outputs from same model family | Use a different model family as judge when possible |
| Sycophancy | Judge agrees with confident-sounding but wrong answers | Include factual verification criteria in rubric |
| Format bias | Judge prefers markdown, bullet points, structured output | Normalize formatting before judging, or score content separately from format |
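The position-bias fix in the table (run twice with swapped positions, average scores) might look like the sketch below. The `Judge` signature is an assumption made for illustration: a callable that takes the prompt and two responses in display order and returns one score per slot. `toy_judge` is a stand-in with a deliberate first-position bonus, not a real model call.

```python
from typing import Callable, Tuple

# Assumed (hypothetical) judge interface:
# (prompt, first_response, second_response) -> (score_first, score_second)
Judge = Callable[[str, str, str], Tuple[float, float]]


def debiased_pairwise_scores(
    judge: Judge, prompt: str, resp_a: str, resp_b: str
) -> Tuple[float, float]:
    """Score A and B in both display orders and average each response's scores."""
    a_first, b_second = judge(prompt, resp_a, resp_b)
    b_first, a_second = judge(prompt, resp_b, resp_a)
    return (a_first + a_second) / 2, (b_first + b_second) / 2


def toy_judge(prompt: str, first: str, second: str) -> Tuple[float, float]:
    """Toy stand-in: scores by length and adds a +1 bonus to whatever is shown first."""
    return len(first) + 1.0, float(len(second))


# Two equally long responses: one ordering alone would favor whichever came first.
print(toy_judge("q", "aaaa", "bbbb"))                            # (5.0, 4.0)
print(debiased_pairwise_scores(toy_judge, "q", "aaaa", "bbbb"))  # (4.5, 4.5)
```

Averaging over both orders cancels a constant first-position bonus. It does not remove the other biases in the table, which need their own rubric or normalization fixes.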
LLM-as-judge calibration protocol:
Every LLM eval must account for output variance. A test that passes 7/10 times is NOT a passing test — it's a test revealing an unreliable behavior.
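A minimal sketch of treating a test as a pass rate rather than a single boolean. The `check` callable stands in for "run the agent once and apply the assertion", and the default `required=1.0` threshold is an assumption chosen to match the 7/10 example; pick a threshold that fits your own risk tolerance.

```python
from typing import Callable


def pass_rate(check: Callable[[], bool], runs: int = 10) -> float:
    """Run the same check `runs` times and return the fraction that passed."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    passes = sum(1 for _ in range(runs) if check())
    return passes / runs


def classify(rate: float, required: float = 1.0) -> str:
    """Label a test from its pass rate. `required` is an assumed threshold."""
    if rate >= required:
        return "pass"
    if rate == 0.0:
        return "fail"
    return "flaky"  # passes sometimes: an unreliable behavior, not a passing test


# Deterministic stand-in for a non-deterministic agent: 7 passes, then 3 failures.
outcomes = iter([True] * 7 + [False] * 3)
rate = pass_rate(lambda: next(outcomes), runs=10)
print(rate, classify(rate))  # 0.7 flaky
```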
Statistical evaluation protocol:
Flaky test triage:
Development Cycle:
  Code change → Regression suite (fast, ~50 tests, 5 runs each)
              → If pass: Capability suite (deep, ~200 tests, 10 runs each)
              → If pass: Deploy to staging

Production:
  Every request → Log inputs/outputs
                → Sample 5% for automated quality scoring
                → Weekly human review of low-scoring samples
                → Monthly full eval suite re-run (detect drift)
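The "sample 5%" step can be sketched with a hash of the request ID, so that the same request is always either in or out of the sample regardless of which worker handles it. This is one possible design, assuming each logged request carries a stable string ID; `should_sample` is a hypothetical helper name.

```python
import hashlib


def should_sample(request_id: str, rate: float = 0.05) -> bool:
    """Deterministically select roughly `rate` of requests for quality scoring."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the hash to one of 10,000 buckets and keep the lowest `rate` share of them.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < round(rate * 10_000)


sampled = [rid for rid in (f"req-{i}" for i in range(1000)) if should_sample(rid)]
print(len(sampled))  # roughly 50 of 1000
```

Hash-based selection avoids shared random state and makes the sample reproducible when logs are reprocessed; a plain random draw per request would also work if reproducibility does not matter.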
"When a measure becomes a target, it ceases to be a good measure."
This is the single most dangerous failure mode in agent evaluation. Examples:
Defense:
| Rationalization | When It Appears | Why It's Wrong |
|----------------|-----------------|----------------|
| "We'll add evals later, let's ship first" | MVP pressure | Shipping without evals means you can't detect regressions. The first user bug report is your eval suite telling you it's too late. |
| "The agent works fine when I test it manually" | Developer testing | You're testing happy paths with well-formed inputs. Manual testing catches <20% of production failures. |
| "Our benchmark score is 92%, we're good" | After benchmarking | Benchmark-production gap is real. Agents scoring 90%+ on benchmarks often achieve <50% on real-world tasks. Always eval with production-like inputs. |
| "Single-run test passed, ship it" | Fast iteration | Non-deterministic agent + single run = coin flip. A test that passes once might fail 40% of the time. Run N times. |
| "We'll use GPT-4 to judge our outputs" | Setting up automated eval | LLM-as-judge has systematic biases. Without calibration against human judges (rho > 0.8), your automated scores may not correlate with actual quality. |
When setting up agent evaluation from scratch: