skills/package-evaluator/SKILL.md
Evaluates Claude Code package quality across 6 dimensions for all 7 package types, producing scored audit reports. Triggers on: "evaluate package", "audit agent quality", "score this hook", "package audit", "skill quality check". NOT for LLM prompts, use prompt-lab.
npx skillsauth add mathews-tom/armory package-evaluatorInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Packages that do not activate on relevant queries waste the entire investment in writing them. A skill can have deep, well-structured content and still deliver zero value if its frontmatter description lacks the trigger phrases users actually type. An agent without a decision tree produces inconsistent results. A hook without a handler script is inert. Quality evaluation catches trigger gaps, missing sections, structural deficiencies, and shallow content before deployment — turning a package from a static document into a reliable tool.
| File | Contents |
| --------------------------------- | --------------------------------------------------------------------------------------------------------------------------- |
| references/evaluation-rubric.md | Detailed 1-5 scoring criteria per dimension, weight justifications, type-specific criteria, worked examples for calibration |
Two modes, selected by input:
| Input | Mode | | ------------------------------------------------------- | ---------------------------------------------- | | Path to a specific package directory or definition file | Quick Audit | | "all", "every package", no path specified | Full Audit | | "--type agents" or type filter | Full Audit filtered to one type | | Multiple specific paths | Quick Audit for each, then comparative summary |
Six dimensions, each scored 1-5. Weighted sum determines overall percentage.
Evaluates the YAML frontmatter block for completeness and discoverability.
Signals:
name field present and non-emptydescription field present and non-emptyScoring constraints: A description under 100 characters caps this dimension at 2/5.
A missing name or description field caps at 1/5.
Evaluates whether the package activates on the queries users actually type.
Signals:
Scoring constraints: Fewer than 3 distinct trigger phrases caps at 2/5. Zero trigger phrases in the description caps at 1/5.
Evaluates whether the package contains the sections needed to function reliably.
Signals:
Scoring constraints: A package with no workflow section caps at 2/5. A package with a workflow but no error handling or output format caps at 3/5.
Evaluates the substantive quality of the package's guidance — whether it provides enough detail for an agent to execute well without human intervention.
Signals:
Scoring constraints: A package consisting only of bare commands with no explanatory context caps at 2/5. Reference files count toward this dimension only if they contain substantive guidance (checklists, rubrics, criteria), not just link collections.
Evaluates internal consistency and structural integrity.
Signals:
name field in frontmatter exactly../other-package/) in the definition
file or reference files. Packages must be standalone — all referenced files must live
within the package's own directory. Shared content should use the _templates/ sync
system to maintain local copies.Scoring constraints: A name mismatch between directory and frontmatter is a CRITICAL
finding and caps at 1/5. Cross-package ../ references are a CRITICAL finding and cap
at 1/5 — they break standalone packaging. Missing referenced files cap at 2/5.
Evaluates adherence to the repository's contribution guidelines.
Signals:
Scoring constraints: Any single violation caps at 3/5. Multiple violations cap at 2/5. Invalid YAML that prevents parsing caps at 1/5.
In addition to the 6 shared dimensions, each package type has type-specific quality signals that influence D3 (Structural Completeness) and D4 (Content Depth) scoring.
When evaluating an AGENT.md:
D3 additions:
model field specified (opus/sonnet/haiku)color field specifiedmetadata.category specifiedmetadata.execution_phase specified (pre-write/post-write/pre-commit)metadata.language_targets specifiedD4 additions:
Scoring impact: An agent without a decision tree caps D4 at 2/5. An agent without model/category/phase metadata caps D3 at 3/5.
When evaluating a HOOK.md:
D3 additions:
hook.events list specified (PreToolUse, PostToolUse, Stop, etc.)hook.handler.type specified (command/python-module)hook.handler.command specifiedhandler.sh or equivalent handler script exists in directoryD4 additions:
Scoring impact: A hook without a handler script caps D4 at 1/5. A hook without documented events caps D3 at 2/5.
When evaluating a RULE.md:
D3 additions:
metadata.scope specified (global/project)metadata.applies_to.languages specified if scope is language-specificD4 additions:
Scoring impact: A rule with only vague principles and no concrete requirements caps D4 at 2/5.
When evaluating a COMMAND.md:
D3 additions:
command.syntax specified with argument documentationcommand.handler specified (inline/command)D4 additions:
Scoring impact: A command without a step-by-step workflow caps D4 at 2/5.
When evaluating a UTILITY.md:
D3 additions:
utility.runtime specified (python/node/shell)utility.entry_point specified and the file existsutility.executable specifiedD4 additions:
Scoring impact: A utility without a working entry_point script caps D4 at 1/5.
When evaluating a PRESET.md:
D3 additions:
preset.packages specified with at least one type sectionpreset.compatibility.platforms specifiedD4 additions:
Scoring impact: A preset referencing non-existent packages is a CRITICAL finding.
| Severity | Criteria | Score Impact | | -------- | ------------------------------------------------------------------------------------------------------------------------------ | ----------------------------------------- | | CRITICAL | Package cannot activate or breaks on load — missing frontmatter, name mismatch, invalid YAML, non-existent preset references | Caps overall score at 40% | | HIGH | Significant trigger gap or missing core section — no workflow, no error handling, zero trigger phrases, missing handler script | Caps affected dimension at 3/5 | | MEDIUM | Weak coverage, shallow content, few trigger synonyms, missing type-specific metadata | Dimension needs improvement but functions | | LOW | Minor polish — formatting inconsistencies, slightly short description, missing calibration rules | Fix when convenient |
SKILL.md = skill, AGENT.md = agent, HOOK.md = hook, RULE.md = rule,
COMMAND.md = command, UTILITY.md = utility, PRESET.md = preset.
If the path points to a definition file directly, use its parent directory.skills/, agents/, hooks/,
rules/, commands/, utilities/, and presets/ that contain a recognized
definition file. If a --type filter is specified, restrict to that type's directory.For each package under evaluation:
name and description fields. If YAML parsing
fails, record a CRITICAL finding and score D1 and D6 as 1/5.references/ directory for existence and contents. Verify every file
referenced in the definition file body exists on disk.../ patterns pointing outside the package directory). Flag any as CRITICAL D5
findings.references/evaluation-rubric.md.references/evaluation-rubric.md.Overall% = (sum of dimension_score x weight) / 5 x 100.| Range | Verdict | | --------- | ---------- | | 90-100% | Exemplary | | 80-89% | Strong | | 70-79% | Adequate | | 60-69% | Needs Work | | Below 60% | Deficient |
Generate the structured output using the appropriate template below.
## Package Audit: {package-name} ({type})
| Dimension | Score | Weight | Weighted | Key Finding |
|-----------|-------|--------|----------|-------------|
| D1: Frontmatter Quality | X/5 | 20% | X.XXX | ... |
| D2: Trigger Coverage | X/5 | 18% | X.XXX | ... |
| D3: Structural Completeness | X/5 | 20% | X.XXX | ... |
| D4: Content Depth | X/5 | 22% | X.XXX | ... |
| D5: Consistency & Integrity | X/5 | 12% | X.XXX | ... |
| D6: CONTRIBUTING Compliance | X/5 | 8% | X.XXX | ... |
**Overall: XX% — {Verdict}**
### Type-Specific Findings
[Findings from type-specific signal evaluation, if any.]
### Findings
[Severity-sorted list. Each entry includes dimension tag, severity, description,
evidence, and recommendation.]
- **[CRITICAL] D5:** ...
- **[HIGH] D2:** ...
- **[MEDIUM] D4:** ...
- **[LOW] D3:** ...
### Score Calculation
D1: {score} x 0.20 = {result}
D2: {score} x 0.18 = {result}
D3: {score} x 0.20 = {result}
D4: {score} x 0.22 = {result}
D5: {score} x 0.12 = {result}
D6: {score} x 0.08 = {result}
Sum = {weighted_sum}
Overall = {weighted_sum} / 5 x 100 = {percentage}% — {Verdict}
## Package Repository Audit
| Package | Type | Overall | Verdict | Worst Dimension | Top Issue |
|---------|------|---------|---------|-----------------|-----------|
| {name} | {type} | XX% | {verdict} | {dimension} | {issue} |
| ... | ... | ... | ... | ... | ... |
### Per-Package Summaries
[Condensed Quick Audit for each package: scorecard table, overall score, top 3 findings.
Omit the full Score Calculation section in condensed mode.]
| Problem | Cause | Fix |
| ----------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------- |
| Definition file not found in directory | Path incorrect or file missing | Report as CRITICAL; do not attempt evaluation; surface the path and stop |
| Unknown definition file | Directory has no recognized definition file (SKILL.md, AGENT.md, HOOK.md, RULE.md, COMMAND.md, UTILITY.md, PRESET.md) | Skip directory; report as warning in Full Audit; report as CRITICAL in Quick Audit |
| YAML frontmatter parse failure | Invalid YAML syntax (unclosed quotes, bad indentation) | Report as CRITICAL finding; score D1 and D6 as 1/5; continue evaluating the body content where parseable |
| references/ directory missing | Package has no reference files | Not an error — score D5 normally; check only that any files referenced in the definition file body actually exist on disk |
| references/ exists but referenced file is absent | File path in definition file body doesn't resolve | Record as a CRITICAL D5 finding; missing referenced files cap D5 at 2/5 |
| Empty definition file (zero bytes or whitespace only) | File created but never populated | Treat as CRITICAL; score all dimensions 1/5; overall verdict: Deficient |
| references/evaluation-rubric.md not found | Evaluator's own reference file missing | Note the irony; evaluate using the criteria inline in this SKILL.md; flag D5 as a CRITICAL finding |
| Handler script missing for hook | HOOK.md references a handler that doesn't exist | Record as CRITICAL D4 finding; cap D4 at 1/5 |
| Preset references non-existent package | PRESET.md lists a package not in the manifest | Record as CRITICAL finding; caps overall at 40% |
testing
Create, review, and restyle data visualizations using Edward Tufte principles: high data-ink ratio, direct labels, range-frame axes, small multiples, accessible color, responsive charts, and honest comparisons. Triggers on: "create a chart", "style this chart", "review this graph", "Tufte chart", "data visualization", "Recharts", "Plotly", "matplotlib", "Chart.js", "ECharts", "D3". Use when generating or critiquing charts, dashboards, sparklines, and data tables.
testing
Manages dependent branch stacks and stacked pull requests using safe Git topology rules. Triggers on: "create stacked PRs", "publish this stack", "sync my PR stack", "rebase this stack", "merge the stack", "retarget child PRs", "split this branch into stacked PRs", "validate this stack", "cleanup stacked branches". Use when local branches or one source branch need to become a dependency-ordered PR stack with correct parent bases, validation, synchronization, merge order, and cleanup.
development
Scaffolds per-repository agent context so coding agents share the same issue tracker rules, triage label vocabulary, domain glossary, ADR layout, and handoff conventions. Triggers on: "set up project context", "configure agent docs", "create CONTEXT.md", "setup agent workflow", "agent issue tracker setup", "triage labels", "domain glossary for agents". Use when a repo needs durable context files before planning, triage, debugging, TDD, architecture review, or multi-agent implementation.
testing
Produces phased task boards from feature requests: dependency-mapped work items, parallelization flags, risk flags, edge cases, test matrices. Triggers on: "decompose this feature", "task breakdown with dependencies", "phased implementation plan", "work breakdown structure". NOT for effort estimates, use estimate-calibrator.