skills/agent/spec-test-eval/SKILL.md
--- name: spec-test-eval description: A structured framework for evaluating LLM and agent outputs using three layers: specs (what correct looks like), test functions (pass/fail checks), and evaluation functions (quality scoring). triggers: - When you need to assess, grade, judge, or critique the quality of LLM or agent outputs - When designing evaluation criteria or building rubrics for prompt or model testing - When reviewing or comparing behavior or outputs from one or more agents or mod
npx skillsauth add thimslugga/agent-skills skills/agent/spec-test-evalInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
A disciplined approach to evaluating LLM/agent outputs. The core idea: you cannot meaningfully evaluate anything without first defining what "good" means. This framework enforces that discipline through three layers, applied in order.
The spec is the upstream source of truth. Without it, tests check arbitrary things and scores measure nothing meaningful.
Before evaluating any output, define the spec by answering:
Write specs as plain declarative statements. Each statement should be independently verifiable.
Example spec for a "summarize this article" task:
1. Output is 2-4 sentences.
2. Captures the article's primary claim.
3. Mentions at least one piece of supporting evidence from the article.
4. Does not introduce information absent from the source.
5. Uses neutral tone (no editorializing).
Common failure mode: Skipping the spec and jumping straight to "does this look good?" That is vibes-based evaluation. It is unreliable, inconsistent, and impossible to improve systematically because there is nothing written down to revise.
Test functions are binary checks derived from the spec. Each spec statement becomes one or more tests. A test takes the output (and optionally the input/context) and returns pass or fail.
For each spec statement, ask: "Can I check this with a yes/no question?"
From the summary spec above:
Test 1: Is the output 2-4 sentences? -> pass/fail
Test 2: Does it state the article's primary claim? -> pass/fail
Test 3: Does it cite supporting evidence? -> pass/fail
Test 4: Does it contain only source-grounded info? -> pass/fail
Test 5: Is the tone neutral? -> pass/fail
Tests can be:
Aggregation: Report the pass rate across all tests. An output that passes 5/5 tests meets the spec. An output that passes 3/5 partially meets it, and you know exactly which criteria it failed.
Common failure mode: Writing tests that do not trace back to a spec statement. If a test exists, it should be because the spec requires it. Orphan tests create confusion about what you are actually measuring.
Evaluation functions return a score on a scale, capturing quality gradients that binary tests miss. Use them when "pass" is not granular enough -- when you need to rank outputs, track improvement over time, or optimize.
An eval function scores along a dimension. Dimensions come from the spec but measure degree rather than compliance:
Dimension: Faithfulness to source (0-5)
0 = Entirely fabricated
1 = Mostly fabricated, one or two accurate details
2 = Mix of accurate and fabricated claims
3 = Mostly accurate, minor unsupported inferences
4 = Accurate, with trivial imprecision
5 = Fully faithful, every claim traceable to the source
Dimension: Conciseness (0-3)
0 = Severely over or under length, misses the point
1 = Roughly right length but padded or omits key info
2 = Appropriate length, minor fluff
3 = Tight, nothing wasted, nothing missing
Anchoring is critical. A scale without anchor descriptions is just a number. Define what each score level means in concrete terms. Without anchors, a "3 out of 5" from one evaluation and a "3 out of 5" from another are not comparable.
Composite scores: You can combine dimension scores into an overall score. Weight dimensions by importance. A weighted sum is fine; do not overcomplicate it.
Common failure mode: Using eval functions without first having tests. If an output fails basic spec compliance (hallucinates facts, wrong format), a nuanced quality score is meaningless. Tests are the gate; eval functions are the gradient beyond the gate.
Use the same spec, tests, and eval functions across all candidates. This is the whole point of the framework -- it forces apples-to-apples comparison.
The framework doubles as a feedback loop:
This is how you avoid blind prompt tweaking. Each iteration targets a specific deficiency identified by the framework, and the framework measures whether the fix worked.
If multiple people (or LLM calls) are applying the same eval function, calibrate first:
This matters especially when using an LLM as a judge -- if the scoring prompt is ambiguous, the LLM's scores will drift across runs.
Spec = "What should it do?" -> Declarative statements
Test = "Did it do that?" -> Pass/fail per statement
Eval = "How well did it do that?" -> Score per dimension
Order matters: Spec first. Tests gate. Eval scores what passes.
No spec = no meaningful evaluation.
No tests = no minimum bar.
No anchored scale = no reliable scoring.
development
Use this skill whenever the user wants to create, read, edit, or manipulate Word documents (.docx files). Triggers include: any mention of 'Word doc', 'word document', '.docx', or requests to produce professional documents with formatting like tables of contents, headings, page numbers, or letterheads. Also use when extracting or reorganizing content from .docx files, inserting or replacing images in documents, performing find-and-replace in Word files, working with tracked changes or comments, or converting content into a polished Word document. If the user asks for a 'report', 'memo', 'letter', 'template', or similar deliverable as a Word or .docx file, use this skill. Do NOT use for PDFs, spreadsheets, Google Docs, or general coding tasks unrelated to document generation.
development
Documentation templates and structure guidelines. README, API docs, code comments, and AI-friendly documentation.
development
Use when the task involves reading, creating, or editing `.docx` documents, especially when formatting or layout fidelity matters; prefer `python-docx` plus the bundled `scripts/render_docx.py` for visual checks.
documentation
Create a README.md file for the project