skills/surrogate-verifier/SKILL.md
Generates structured test assertions and failure diagnostics for skill packages from a definition and task prompt. Triggers on: "verify this skill", "generate assertions", "surrogate verification", "diagnose skill failure". NOT for code review; use pr-review for that.
npx skillsauth add mathews-tom/armory surrogate-verifier

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Generate structured test assertions and failure diagnostics for skill packages through information-isolated verification. The verifier operates without access to the skill generator's reasoning — it sees only the skill definition, a task prompt, and the output artifacts. This isolation prevents confirmation bias and is the single largest contributor to skill quality in co-evolutionary generation (+30pp per EvoSkills).
| File | Contents | Load When |
| ----------------------------------- | ---------------------------------------------------------------- | ------------------------------ |
| references/assertion-patterns.md | Assertion catalog by skill category with weight guidance | Always |
| references/diagnostic-templates.md | Failure diagnostic templates with root-cause categories | When producing failure reports |
Information isolation is the most critical constraint. Violating it degrades verification quality.
The verifier MUST NOT access:
The verifier receives ONLY:
- SKILL.md content (the definition file)
- scripts/eval_assertions.py (when diagnosing)

Implementation: When invoked by the test-engineer agent, this skill MUST be loaded into a separate Agent spawn using isolation: "worktree" or, at minimum, a fresh session with no shared context. The invoking agent passes artifacts as explicit text, not as conversation references.
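One way to honor the "explicit text" rule is to assemble the verifier's input from plain strings only. The following is a minimal sketch, not part of the skill: `build_verifier_input` and its parameter names are hypothetical, and the three inputs simply follow the description above (skill definition, task prompt, output artifacts).

```python
def build_verifier_input(skill_md, task_prompt, output_artifacts):
    """Assemble the verifier's input from explicit text only (hypothetical helper).

    output_artifacts maps an artifact path to its text content.
    """
    # Refuse empty inputs so the verifier never runs on missing context.
    for name, value in (("skill_md", skill_md), ("task_prompt", task_prompt)):
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"{name} must be non-empty text")

    parts = ["# SKILL.md", skill_md.strip(), "# Task prompt", task_prompt.strip()]
    # Sort artifacts so the same inputs always produce the same text.
    for path, content in sorted(output_artifacts.items()):
        parts += [f"# Artifact: {path}", content.strip()]
    return "\n\n".join(parts)


text = build_verifier_input("skill def", "do task", {"b.md": "B", "a.md": "A"})
```

Because the function accepts only strings, nothing from the invoking agent's conversation can reach the verifier by reference.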
Generate assertions for a skill given its definition and task prompts.
Read the SKILL.md definition and extract:
For each task prompt, generate 5-10 assertions covering these dimensions:
| Dimension | Assertion Types to Use | Purpose |
| --------------------- | ------------------------------- | ------------------------------------------ |
| Output completeness | contains, matches_regex | All claimed sections/components present |
| Format compliance | output_format, contains | Output matches declared structure |
| Factual signals | contains, not_contains | Key domain terms present, hallmarks absent |
| Tool usage | calls_tool | Expected tools were invoked |
| Negative constraints | not_contains | Forbidden patterns absent |
Weight assignment:
See references/assertion-patterns.md for category-specific assertion catalogs.
Produce assertions in the evals/cases.yaml schema format:
    assertions:
      - type: contains
        target: "## Scalability"
        weight: 1.0
      - type: output_format
        target: markdown_table
        weight: 0.8
      - type: not_contains
        target: "TODO"
        weight: 0.3
      - type: calls_tool
        target: Read
        weight: 0.5
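To illustrate how weighted assertions like these might be scored, here is a minimal sketch. It is not the behavior of `scripts/eval_assertions.py`, which this document does not show. The `check` and `weighted_score` names are hypothetical, the normalized weighted pass fraction is an assumed aggregation rule, and `output_format` is left out because its semantics are not defined here.

```python
import re


def check(assertion, output, tools_called):
    """Return True if one assertion passes (assumed semantics per type)."""
    kind, target = assertion["type"], assertion["target"]
    if kind == "contains":
        return target in output
    if kind == "not_contains":
        return target not in output
    if kind == "matches_regex":
        return re.search(target, output) is not None
    if kind == "calls_tool":
        return target in tools_called
    raise ValueError(f"unsupported assertion type: {kind}")


def weighted_score(assertions, output, tools_called):
    """Fraction of the total weight carried by passing assertions."""
    total = sum(a["weight"] for a in assertions)
    if total == 0:
        return 0.0
    passed = sum(a["weight"] for a in assertions if check(a, output, tools_called))
    return passed / total


assertions = [
    {"type": "contains", "target": "## Scalability", "weight": 1.0},
    {"type": "not_contains", "target": "TODO", "weight": 0.3},
    {"type": "calls_tool", "target": "Read", "weight": 0.5},
]
# The output contains a forbidden "TODO", so only the 0.3-weight assertion fails.
score = weighted_score(assertions, "## Scalability\nTODO: add numbers", {"Read"})
```

Under this assumed rule, a failed low-weight assertion lowers the score only slightly, which matches the intent of giving stylistic checks such as `not_contains: "TODO"` a small weight.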
Context cap: Do not consume more than 70% of the available context window. If the skill definition is very long, focus assertion generation on the workflow phases and output format sections. Summarize rather than quote verbatim.
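The 70% cap can be checked before loading a long skill definition. The sketch below is only a rough guard: the four-characters-per-token ratio is a heuristic assumption rather than a real tokenizer, and `within_context_cap` is a hypothetical helper name.

```python
CONTEXT_CAP = 0.70       # the cap stated above
CHARS_PER_TOKEN = 4      # rough heuristic (assumption); real tokenizers differ


def estimate_tokens(text):
    """Approximate token count from character length."""
    return len(text) // CHARS_PER_TOKEN


def within_context_cap(text, context_window_tokens):
    """True if the text's estimated tokens fit within the 70% budget."""
    return estimate_tokens(text) <= CONTEXT_CAP * context_window_tokens
```

If the check fails, the guidance above applies: focus on the workflow phases and output format sections, and summarize rather than quote.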
When an oracle returns fail, produce a structured diagnostic explaining why.
- SKILL.md (same as Mode 1)

Categorize each failed assertion into a root-cause category:
| Category | Signal | Severity |
| --------------------- | -------------------------------------------------------------- | ---------- |
| Missing capability | contains assertion failed for a claimed feature | HIGH |
| Format mismatch | output_format assertion failed | HIGH |
| Incomplete output | Multiple contains assertions failed in the same section | MEDIUM |
| Hallucinated content | not_contains assertion failed (forbidden pattern present) | HIGH |
| Wrong tool usage | calls_tool assertion failed | MEDIUM |
| Partial success | Some assertions in a group pass, others fail | LOW |
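The type-driven rows of this table can be sketched as a simple lookup. This is an illustrative simplification: `categorize` is a hypothetical helper, the "Incomplete output" and "Partial success" rows need section or group information that the assertion schema shown here does not carry, and the fallback for unlisted types is a placeholder that does not come from the table.

```python
# Rows of the table that depend only on the failed assertion's type.
CATEGORY_BY_TYPE = {
    "contains": ("Missing capability", "HIGH"),
    "output_format": ("Format mismatch", "HIGH"),
    "not_contains": ("Hallucinated content", "HIGH"),
    "calls_tool": ("Wrong tool usage", "MEDIUM"),
}

# Placeholder for types the table does not list (an assumption, not from the source).
FALLBACK = ("Uncategorized", "UNKNOWN")


def categorize(failed_assertions):
    """Attach a (category, severity) pair to each failed assertion."""
    results = []
    for assertion in failed_assertions:
        category, severity = CATEGORY_BY_TYPE.get(assertion["type"], FALLBACK)
        results.append({**assertion, "category": category, "severity": severity})
    return results


categorized = categorize([
    {"type": "not_contains", "target": "TODO"},
    {"type": "calls_tool", "target": "Read"},
])
```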
For each failed assertion:
- SKILL.md that promises the missing capability

For each root cause, produce a concrete, actionable fix:
Produce a structured diagnostic string:

    DIAGNOSTIC: [skill-name] failed on [task-prompt-summary]
    FAILED ASSERTIONS (N/M):
    1. [SEVERITY] type=contains target="..." — Missing capability: [explanation]
    2. [SEVERITY] type=output_format target="..." — Format mismatch: [explanation]
    ROOT CAUSES:
    - [category]: [specific explanation with SKILL.md section reference]
    REMEDIATION:
    1. [Concrete change to SKILL.md with exact section and wording]
    2. [Concrete change to workflow with step numbers]
See references/diagnostic-templates.md for worked examples per root-cause category.
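The diagnostic string above could be rendered along these lines. This is a sketch under stated assumptions: `render_diagnostic` and the dictionary keys (`severity`, `type`, `target`, `category`, `explanation`) are hypothetical names chosen to mirror the template's placeholders, not an interface defined by the skill.

```python
def render_diagnostic(skill_name, task_summary, failed, total, root_causes, remediation):
    """Render the diagnostic template from already-categorized failures.

    failed: list of dicts with severity, type, target, category, explanation.
    root_causes: list of (category, explanation) pairs.
    remediation: list of concrete fix descriptions.
    """
    lines = [
        f"DIAGNOSTIC: {skill_name} failed on {task_summary}",
        f"FAILED ASSERTIONS ({len(failed)}/{total}):",
    ]
    for i, item in enumerate(failed, start=1):
        # \u2014 is the dash separator used in the template above.
        lines.append(
            f'{i}. [{item["severity"]}] type={item["type"]} target="{item["target"]}" '
            f'\u2014 {item["category"]}: {item["explanation"]}'
        )
    lines.append("ROOT CAUSES:")
    lines += [f"- {category}: {explanation}" for category, explanation in root_causes]
    lines.append("REMEDIATION:")
    lines += [f"{i}. {fix}" for i, fix in enumerate(remediation, start=1)]
    return "\n".join(lines)


report = render_diagnostic(
    "demo-skill",
    "summarize design doc",
    failed=[{
        "severity": "HIGH", "type": "contains", "target": "## Scalability",
        "category": "Missing capability", "explanation": "section absent from output",
    }],
    total=4,
    root_causes=[("Missing capability", "output format section promises a Scalability heading")],
    remediation=["Add an explicit step that writes the Scalability section"],
)
```

Keeping the renderer separate from categorization means that, when the context cap is exceeded, the explanations can be shortened while the failed assertion list is preserved.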
Per EvoSkills Algorithm 1:
The verifier does not track its own budget — the test-engineer agent manages iteration limits.
- scripts/run_evals.py (the oracle).
- scripts/eval_assertions.py.

| Error                         | Resolution                                                      |
| ----------------------------- | --------------------------------------------------------------- |
| Skill definition too large | Summarize to workflow phases + output format sections only |
| No assertions generatable | Return empty assertions list with warning; skill may be too vague |
| Ambiguous output format | Default to contains assertions; avoid output_format checks |
| Context cap exceeded | Truncate diagnostic detail; preserve failed assertion list |