skills/skill-grader/SKILL.md
Evaluates Claude Agent Skills on 10 quality axes with letter grades (A+ through F) and specific improvement recommendations. Use when auditing a skill, comparing skills, prioritizing improvements, or performing quality control on a skill library. Activate on "grade skill", "evaluate skill", "skill quality", "skill audit", "skill review", "rate skill". NOT for creating skills (use skill-architect), grading code quality, or evaluating non-skill documents.
npx skillsauth add curiositech/windags-skills skill-graderInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Structured evaluation rubric for Claude Agent Skills. Produces letter grades (A+ through F) on 10 axes plus an overall grade, with specific improvement recommendations for each axis.
Designed for sub-agents and non-expert reviewers who need a mechanical, repeatable process for assessing skill quality without deep domain expertise.
✅ Use for:
❌ NOT for:
skill-architect)flowchart TD
A[Read SKILL.md + all files] --> B[Score each of 10 axes]
B --> C[Assign letter grade per axis]
C --> D[Compute overall grade]
D --> E[Write improvement recommendations]
E --> F[Produce grading report]
Does the description follow [What] [When] [Keywords]. NOT for [Exclusions]?
| Grade | Criteria | |-------|----------| | A | Specific verb+noun, domain keywords users would type, 2-5 explicit NOT exclusions, 25-50 words | | B | Has keywords and NOT clause, but slightly vague or missing synonym coverage | | C | Too generic, missing NOT clause, or >100 words of process detail | | D | Single vague sentence ("helps with X") or name/description mismatch | | F | Missing or empty description |
Is the skill narrowly focused on one expertise type, or a catch-all?
| Grade | Criteria | |-------|----------| | A | One clear expertise domain, "When to Use" and "NOT for" sections both present and specific | | B | Mostly focused, minor boundary ambiguity | | C | Covers 2-3 related but distinct domains, should probably be split | | D | Catch-all skill ("helps with anything related to X") | | F | No scope boundaries defined at all |
Does the skill follow the three-layer architecture (metadata → SKILL.md → references)?
| Grade | Criteria | |-------|----------| | A | SKILL.md <300 lines, heavy content in references, reference index in SKILL.md with 1-line descriptions | | B | SKILL.md <500 lines, some references used, index present | | C | SKILL.md >500 lines, or all content inlined with no references | | D | SKILL.md >800 lines, or references exist but aren't indexed in SKILL.md | | F | Single massive file with no structure |
Does the skill encode expert knowledge that prevents common mistakes?
| Grade | Criteria | |-------|----------| | A | 3+ anti-patterns with Novice/Expert/Timeline template, LLM-mistake notes | | B | 1-2 anti-patterns with clear explanation | | C | Anti-patterns mentioned but no structured template | | D | No anti-patterns, just positive instructions | | F | Contains advice that IS an anti-pattern (outdated, harmful) |
Does the skill include working tools (scripts, MCPs, subagents)?
| Grade | Criteria | |-------|----------| | A | Working scripts with CLI interface, error handling, dependency docs; OR valid "no tools needed" justification | | B | Scripts exist and work but lack error handling or docs | | C | Scripts referenced but are templates/pseudocode | | D | Phantom tools (SKILL.md references files that don't exist) | | F | References non-existent tools AND no acknowledgment |
Note: Not every skill needs tools. A pure decision-tree skill can score A if tools aren't applicable.
Would the skill activate correctly on relevant queries and stay silent on irrelevant ones?
| Grade | Criteria | |-------|----------| | A | Description has specific keywords matching user language, clear NOT clause, no obvious false-positive vectors | | B | Good keywords, minor false-positive risk | | C | Generic keywords that overlap with other skills | | D | No specific keywords, or NOT clause contradicts intended use | | F | Description would cause constant false activation |
Does the skill use Mermaid diagrams, code examples, and tables effectively?
| Grade | Criteria | |-------|----------| | A | Decision trees as Mermaid flowcharts, tables for comparisons, code examples for concrete patterns | | B | Some diagrams or tables, but key decision trees still in prose | | C | Tables used but no Mermaid diagrams for processes | | D | Prose-only, no visual structure | | F | Wall of text with no formatting aids |
Does the skill define what it produces in a format consumable by other agents?
| Grade | Criteria | |-------|----------| | A | Explicit output format (JSON schema, markdown template, or structured sections), subagent-consumable | | B | Output format implied but not explicitly documented | | C | No output format, but content is structured enough to infer | | D | Unstructured prose output expected | | F | N/A (pure reference skill) — exempt from this axis |
Does the skill track when knowledge was current and what has changed?
| Grade | Criteria | |-------|----------| | A | Timelines in anti-patterns, "as of [date]" markers, CHANGELOG with dates | | B | Some temporal context, CHANGELOG exists | | C | No dates on knowledge, but CHANGELOG exists | | D | No temporal context anywhere, knowledge could be stale | | F | Contains demonstrably outdated advice without warning |
README, CHANGELOG, and reference organization.
| Grade | Criteria | |-------|----------| | A | README with quick start, CHANGELOG with dated versions, references well-organized with clear filenames | | B | README and CHANGELOG exist, references present | | C | SKILL.md is the only file, but it's well-structured | | D | No README, no CHANGELOG, disorganized references | | F | SKILL.md is the only file and it's poorly structured |
| Letter | Score Range | Meaning | |--------|------------|---------| | A+ | 97-100 | Exemplary — sets the standard | | A | 93-96 | Excellent — minor improvements possible | | A- | 90-92 | Very good — a few small gaps | | B+ | 87-89 | Good — notable room for improvement | | B | 83-86 | Solid — several areas need work | | B- | 80-82 | Above average — meaningful gaps | | C+ | 77-79 | Average — significant improvements needed | | C | 73-76 | Below average — major gaps | | C- | 70-72 | Barely adequate | | D+ | 67-69 | Poor — fundamental issues | | D | 63-66 | Very poor — needs major rework | | D- | 60-62 | Near-failing quality | | F | <60 | Failing — start over |
Axes 1 (Description) and 2 (Scope) carry 2x weight. All others carry 1x weight. If Axis 8 (Output Contracts) is marked exempt, remove it from the calculation.
Overall = (2×Axis1 + 2×Axis2 + Axis3 + Axis4 + Axis5 + Axis6 + Axis7 + Axis8 + Axis9 + Axis10) / 12
Convert the numeric average to a letter grade using the scale above.
Produce this exact structure:
# Skill Grading Report: [skill-name]
**Graded**: [date]
**Overall Grade**: [letter] ([score]/100)
## Axis Grades
| # | Axis | Grade | Score | Key Finding |
|---|------|-------|-------|-------------|
| 1 | Description Quality | [grade] | [score] | [1-line finding] |
| 2 | Scope Discipline | [grade] | [score] | [1-line finding] |
| 3 | Progressive Disclosure | [grade] | [score] | [1-line finding] |
| 4 | Anti-Pattern Coverage | [grade] | [score] | [1-line finding] |
| 5 | Self-Contained Tools | [grade] | [score] | [1-line finding] |
| 6 | Activation Precision | [grade] | [score] | [1-line finding] |
| 7 | Visual Artifacts | [grade] | [score] | [1-line finding] |
| 8 | Output Contracts | [grade] | [score] | [1-line finding] |
| 9 | Temporal Awareness | [grade] | [score] | [1-line finding] |
| 10 | Documentation Quality | [grade] | [score] | [1-line finding] |
## Top 3 Improvements (Highest Impact)
1. **[Axis]: [Specific action]** — [Why this matters, expected grade improvement]
2. **[Axis]: [Specific action]** — [Why this matters, expected grade improvement]
3. **[Axis]: [Specific action]** — [Why this matters, expected grade improvement]
## Detailed Notes
### [Axis name] ([grade])
[2-3 sentences of specific feedback with examples from the skill]
[Repeat for each axis scoring below B+]
For rapid triage across many skills, produce only:
| Skill | Overall | Desc | Scope | Disc | Anti | Tools | Activ | Visual | Output | Temp | Docs |
|-------|---------|------|-------|------|------|-------|-------|--------|--------|------|------|
| [name] | [grade] | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Wrong: Giving B+ because "it's pretty good" without checking criteria. Right: Match observations to the rubric table literally. If the description lacks a NOT clause, it cannot score above C on Axis 1.
Wrong: Grading a pure decision-tree skill poorly on Axis 5 (tools) because it has no scripts. Right: Mark Axis 5 as "A — tools not applicable for this skill type."
Wrong: Scoring Axis 5 as B because scripts are "referenced."
Right: Actually check if every referenced file exists. If scripts/validate.py is mentioned but doesn't exist, that's D.
This skill produces:
tools
Building resilient distributed systems with circuit breakers, retries with full-jitter exponential backoff, retry budgets (per-request 3-attempt + per-client 10% ratio per Google SRE), deadline propagation, and the cascading-failure math (4 layers × 3 retries = 64x amplification). Grounded in Resilience4j, Microsoft Cloud Patterns, AWS Architecture Blog (Marc Brooker), and Google SRE Book.
testing
Designing HTTP cache headers that work correctly across browsers, CDNs, and shared proxies — `Cache-Control` directives per RFC 9111, `stale-while-revalidate` and `stale-if-error` per RFC 5861, the Vary header for varying responses, and surrogate keys for tag-based purging. Grounded in IETF RFCs and Cloudflare/Fastly docs.
development
Use when designing or fixing a Content Security Policy on a real site, choosing between nonce-based and hash-based CSP, adding strict-dynamic, debugging "Refused to execute inline script" errors, deploying CSP in report-only mode first, configuring report-to / report-uri, or auditing an existing policy for unsafe-inline / unsafe-eval / wildcards. Triggers: "CSP blocks legitimate inline script", strict-dynamic, nonce-{RANDOM}, sha256-{HASH}, object-src none, base-uri none, frame-ancestors, Trusted Types, X-Content-Security-Policy obsolete, report-only vs enforced. NOT for general HTTP security headers (HSTS, COOP/COEP), Trusted Types deep dive, CORS configuration, or building a WAF.
tools
Choosing and operating an HTTP API versioning strategy that doesn't break clients — Stripe's date-based pinned versions, the Deprecation/Sunset header pair (RFC 9745 + RFC 8594), URI vs header vs media-type approaches, and the version-transformer pattern. Grounded in Stripe's published architecture and IETF RFCs.