skills/agentic-evaluation/SKILL.md
Evaluate agent definition files (.agent.md, system prompts) against industry best practices and produce a structured report with scores, findings, and actionable rewrites. Use this skill whenever the user asks to review, audit, evaluate, score, or improve an agent file — even if they just say "check this agent" or "is this agent any good". Also triggers for requests like "what's wrong with this agent", "review my prompt", or "grade this system prompt".
`npx skillsauth add maestria-co/ai-playbook agentic-evaluation`

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Evaluate a provided agent file against current industry standards and produce a structured report with dimension scores, findings, and concrete rewrite suggestions.
This skill is self-contained — read the agent file, apply the evaluation dimensions, and render the report. Do not delegate to other agents.
Does: Evaluate .agent.md files, system prompts, and agent definition files for quality, efficiency, and adherence to best practices.
Does not: Rewrite the agent (suggest rewrites only), evaluate skill files (SKILL.md), or evaluate runtime agent behavior/logs.
Detect the platform from file format. This determines token thresholds.
| Platform | Signals |
| ----------------- | ---------------------------------------------------------- |
| GitHub Copilot | .agent.md extension, #tool-name references, markdown |
| Claude | System prompt format, XML tags, JSON-schema tool defs |
| Unknown | Flag it, evaluate against both standards where they differ |
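The signals in the table above can be turned into a rough heuristic. The sketch below is illustrative only: `detect_platform` is a hypothetical helper, the regular expressions are assumptions about what a `#tool-name` reference or an XML tag looks like, and the `"input_schema"` check is an assumed marker for JSON-schema tool definitions.

```python
import re

# Hypothetical heuristic: count signals per platform and pick the stronger side.
TOOL_REF = re.compile(r"(?<![#\w])#[A-Za-z][\w-]*")        # e.g. "#search", not "# Heading"
XML_BLOCK = re.compile(r"<([A-Za-z_][\w-]*)>[\s\S]*?</\1>")  # e.g. "<role>...</role>"


def detect_platform(filename: str, text: str) -> str:
    """Return "GitHub Copilot", "Claude", or "Unknown" based on file signals."""
    copilot_signals = 0
    claude_signals = 0

    # GitHub Copilot signals: .agent.md extension, #tool-name references.
    if filename.endswith(".agent.md"):
        copilot_signals += 1
    if TOOL_REF.search(text):
        copilot_signals += 1

    # Claude signals: XML tags, JSON-schema tool definitions (assumed marker).
    if XML_BLOCK.search(text):
        claude_signals += 1
    if '"input_schema"' in text:
        claude_signals += 1

    if copilot_signals > claude_signals:
        return "GitHub Copilot"
    if claude_signals > copilot_signals:
        return "Claude"
    # No signals, or a tie: flag it and evaluate against both standards.
    return "Unknown"
```

A tie deliberately falls through to `"Unknown"`, matching the table's instruction to flag the file and evaluate it against both standards where they differ.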
Estimate token count: character_count ÷ 4.
Break down by section — the model consuming this agent file pays a cost for every token before the user even speaks, so bloat directly degrades every interaction.
| Platform       | Green          | Yellow            | Red            |
| -------------- | -------------- | ----------------- | -------------- |
| GitHub Copilot | < 800 tokens   | 800–1500 tokens   | > 1500 tokens  |
| Claude         | < 1500 tokens  | 1500–3000 tokens  | > 3000 tokens  |
Flag: repeated instructions saying the same thing differently, filler ("always make sure to", "it is important that"), content that belongs in a skill or context file.
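As a minimal sketch, the token estimate and the threshold table can be expressed as follows. The function names are hypothetical, and the per-section breakdown assumes sections are introduced by markdown headings (`# Title`), which may not hold for every agent file.

```python
# Upper bounds taken from the threshold table: (green_max, yellow_max) in tokens.
THRESHOLDS = {
    "GitHub Copilot": (800, 1500),
    "Claude": (1500, 3000),
}


def estimate_tokens(text: str) -> int:
    """Rough estimate: character count divided by 4."""
    return len(text) // 4


def rate_size(platform: str, tokens: int) -> str:
    """Map a token count to Green / Yellow / Red for the given platform."""
    green_max, yellow_max = THRESHOLDS[platform]
    if tokens < green_max:
        return "Green"
    if tokens <= yellow_max:
        return "Yellow"
    return "Red"


def tokens_by_section(text: str) -> dict[str, int]:
    """Estimate tokens per section, assuming markdown headings mark sections."""
    sections: dict[str, list[str]] = {"(preamble)": []}
    current = "(preamble)"
    for line in text.splitlines():
        if line.startswith("#") and line.lstrip("#").startswith(" "):
            current = line.lstrip("#").strip()
            sections.setdefault(current, [])
        else:
            sections[current].append(line)
    return {name: estimate_tokens("\n".join(lines)) for name, lines in sections.items()}
```

For example, a 4,000-character Copilot agent file estimates to 1,000 tokens and rates Yellow, while the same file rates Green as a Claude system prompt.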
Score each dimension independently using the criteria below.
A model follows an agent better when it can summarize the agent's job in one sentence. Vague scope = unpredictable behavior.
Check:
Score: CLEAR / VAGUE / UNDEFINED
Ambiguous instructions are the #1 cause of inconsistent agent behavior. "When appropriate" means something different every time — explicit if/then rules don't.
Check:
Score: PRECISE / AMBIGUOUS / CONTRADICTORY
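A simple phrase scan can surface candidate ambiguities for manual review. This is only a sketch: `find_vague_phrases` is a hypothetical helper, and apart from "when appropriate" the phrase list is an assumption to extend or trim. It flags wording only; it does not decide the PRECISE / AMBIGUOUS / CONTRADICTORY score.

```python
# "when appropriate" comes from the text above; the other phrases are assumed examples.
VAGUE_PHRASES = ["when appropriate", "as needed", "if necessary", "where possible"]


def find_vague_phrases(text: str) -> list[tuple[int, str]]:
    """Return (line_number, phrase) pairs for each vague phrase found."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        lowered = line.lower()
        for phrase in VAGUE_PHRASES:
            if phrase in lowered:
                hits.append((line_no, phrase))
    return hits
```

Each hit is a prompt to rewrite the instruction as an explicit if/then rule, for example replacing "ask the user when appropriate" with "if the change touches more than one file, ask the user before editing".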
When decision logic and output formatting mix together, both get worse. Clean separation means each concern lives in exactly one place.
Check:
Score: CLEAN / BOUNDARY EROSION / MIXED
Every token the agent consumes before the user speaks is a token unavailable for the actual conversation. Front-loading context the agent may not need on every task wastes the most constrained resource.
Check:
| Pre-user budget usage | Assessment  |
| --------------------- | ----------- |
| < 30%                 | Efficient   |
| 30–60%                | Monitor     |
| > 60%                 | Inefficient |
Score: EFFICIENT / MONITOR / INEFFICIENT
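The budget table can be applied mechanically once a budget is chosen. In the sketch below, `assess_budget` is a hypothetical helper and `budget_tokens` is an assumption: the text above does not define the budget, so the caller supplies whatever token allowance they treat as the pre-user limit.

```python
def assess_budget(pre_user_tokens: int, budget_tokens: int) -> tuple[float, str]:
    """Return (usage_percent, score) for tokens consumed before the user speaks."""
    if budget_tokens <= 0:
        raise ValueError("budget_tokens must be positive")
    usage = 100 * pre_user_tokens / budget_tokens
    if usage < 30:
        score = "EFFICIENT"
    elif usage <= 60:
        score = "MONITOR"
    else:
        score = "INEFFICIENT"
    return usage, score
```

With an assumed 8,000-token budget, a 1,200-token agent file uses 15% and scores EFFICIENT; at 5,600 tokens it uses 70% and scores INEFFICIENT.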
Unnecessary sequential tool calls, missing loop limits, and no fast-path for simple tasks all add latency that compounds across interactions.
Check:
Score: OPTIMIZED / IMPROVABLE / INEFFICIENT
These standards come from established patterns in reliable agent design. They exist because agents that follow them are measurably more reliable, predictable, and safe.
| Standard                     | What to check                                                  |
| ---------------------------- | -------------------------------------------------------------- |
| Single clear purpose         | Agent does one thing well                                      |
| Minimal footprint            | Requests only necessary permissions and context                |
| Explicit stopping conditions | Agent knows when it is done                                    |
| Human-readable reasoning     | Agent logs or explains its decisions                           |
| Safe defaults                | Errs toward doing less and confirming when uncertain           |
| Narrow tool definitions      | Each tool does one thing                                       |
| No prompt injection risk     | Does not blindly execute instructions from tool outputs        |
| Graceful degradation         | Handles missing context or tool failures without hallucinating |
| Deterministic routing        | Same input produces same routing decision                      |
| Output contracts             | Produces structured, predictable output                        |
Score: ALIGNED / PARTIALLY ALIGNED / MISALIGNED
Read the output template from references/output-template.md and fill it in with your findings. The template provides the exact structure — follow it.
If the reference file is unavailable, use this structure:
Every finding must include a concrete fix — not just "improve this" but the specific change to make. Include before/after rewrite suggestions for the most impactful issues.