skills/skill-judge/SKILL.md
MANDATORY when evaluating, auditing, scoring, or reviewing Agent Skills (SKILL.md files) for quality. MANDATORY when certifying skill reference companions under ~/.claude/skills/**/references/*.md (runs Pointer-Only Validator per SoT-Reference Discipline, universal-protocols.md). Use when comparing skills, running skill quality audits, benchmarking skill effectiveness, or deciding which skills need optimization. Use when a user says 'score this skill', 'evaluate this skill', 'audit my skills', 'how good is this skill'. CRITICAL: refuses certification of PROSE-class companions; escalates BORDERLINE classifications to AI-judge pass. Do NOT use for: ultimate-skill-creator (creating new skills), skill-creator (lightweight skill creation), superpowers:writing-skills (auto-triggered writing workflows).
Install this skill globally with one command: `npx skillsauth add sharkitect-solutions/sharkitect-claude-toolkit skill-judge`. Works with Claude Code, Cursor, and Windsurf.
Evaluate Agent Skills against official specifications and patterns derived from 100+ evaluations across 110 skills.
| File | Load When | Do NOT Load |
|---|---|---|
| references/common-failures.md | First evaluation of session, unfamiliar with the 9 failure patterns, need quick reference checklist | Already evaluated 2+ skills this session (patterns memorized) |
| references/scoring-calibration.md | Unsure about D1 for a domain, need companion impact data for D5, verifying sub-grade thresholds, checking for scoring errors | Confident in calibration from recent evaluations |
| references/dimension-examples.md | Need concrete high/low examples for a specific dimension, calibrating for unfamiliar skill type | Already calibrated from recent similar evaluations |
| references/pointer-validator.md | Evaluating a reference companion file under skills/**/references/*.md — need H4 heuristic thresholds, citation patterns, or escalation criteria | Evaluating a SKILL.md file (not a reference companion) |
| Request | This Skill | Use Instead |
|---|---|---|
| "Score/evaluate this skill" | YES | - |
| "Audit all my skills for quality" | YES | - |
| "Compare these two skills" | YES | - |
| "Which skills need optimization?" | YES | - |
| "Create a new skill" | NO | ultimate-skill-creator |
| "Write/improve skill content" | NO | superpowers:writing-skills |
| "Review this code for quality" | NO | clean-code or code-reviewer agent |
| "Evaluate this agent's performance" | NO | agent-evaluation |
Good Skill = Expert-only Knowledge - What Claude Already Knows
A Skill's value is its knowledge delta -- the gap between what it provides and what the model already knows. When a Skill explains basics or standard library usage, it wastes context tokens -- a shared resource with system prompts, conversation history, and user requests.
| Type | Definition | Treatment |
|------|------------|-----------|
| Expert | Claude genuinely doesn't know this | Must keep -- this is the Skill's value |
| Activation | Claude knows but may not think of | Keep if brief -- serves as reminder |
| Redundant | Claude definitely knows this | Delete -- wastes tokens |
The art of Skill design: maximize Expert content, use Activation sparingly, eliminate Redundant ruthlessly. Good Skill: >70% Expert, <20% Activation, <10% Redundant.
| Concept | Essence | Function |
|---------|---------|----------|
| Tool | What model CAN do | Execute actions (bash, read, write, WebSearch) |
| Skill | What model KNOWS how to do | Guide decisions (expert patterns, anti-patterns, decision trees) |
General Agent + Excellent Skill = Domain Expert Agent
Does the Skill add genuine expert knowledge Claude doesn't have?
| Score | Criteria |
|-------|----------|
| 0-5 | Explains basics Claude knows (tutorials, standard library usage, generic best practices) |
| 6-10 | Mixed: some expert knowledge diluted by obvious content |
| 11-15 | Mostly expert knowledge with minimal redundancy |
| 16-20 | Pure knowledge delta -- every paragraph earns its tokens |
Red flags (instant <=5): "What is X" sections, step-by-step standard tutorials, common library docs, generic advice ("write clean code").
Green flags (high delta): Decision trees for non-obvious choices, trade-offs only experts know, real-world edge cases, "NEVER do X because [non-obvious reason]", domain-specific thinking frameworks.
Evaluation: For each section ask "Does Claude already know this?" Mark as [E]xpert, [A]ctivation, or [R]edundant. Calculate ratio.
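The ratio calculation can be sketched as follows. This is a minimal illustration, not part of the skill itself: it assumes each section has already been marked with a single letter and that sections are weighted equally (weighting by line or token count would be an equally reasonable choice). The function names `ear_ratio` and `meets_good_skill_target` are hypothetical.

```python
from collections import Counter


def ear_ratio(labels):
    """Return the share of Expert, Activation, and Redundant sections.

    `labels` is a list of "E", "A", or "R" markings, one per section.
    Equal weighting per section is an assumption of this sketch.
    """
    counts = Counter(labels)
    total = sum(counts[k] for k in "EAR")
    if total == 0:
        raise ValueError("no labelled sections")
    return {k: counts[k] / total for k in "EAR"}


def meets_good_skill_target(ratio):
    """Check the stated target: >70% Expert, <20% Activation, <10% Redundant."""
    return ratio["E"] > 0.70 and ratio["A"] < 0.20 and ratio["R"] < 0.10


# Hypothetical markings for a 20-section skill.
labels = ["E"] * 16 + ["A"] * 3 + ["R"] * 1
ratio = ear_ratio(labels)
print(ratio, meets_good_skill_target(ratio))
```

The ratio only informs the score; the evidence-based judgment in the rubric above still decides it.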
Does the Skill transfer expert thinking patterns AND domain-specific procedures?
| Score | Criteria |
|-------|----------|
| 0-3 | Only generic procedures Claude already knows |
| 4-7 | Has domain procedures but lacks thinking frameworks |
| 8-11 | Good balance: thinking patterns + domain-specific workflows |
| 12-15 | Expert-level: shapes thinking AND provides novel procedures |
Valuable: "Before [action], ask yourself..." thinking frameworks. Domain-specific workflows with non-obvious ordering or critical steps. Redundant: Generic steps (open, edit, save). Standard programming patterns.
Does the Skill have effective NEVER lists with specific, named patterns?
| Score | Criteria |
|-------|----------|
| 0-3 | No anti-patterns mentioned |
| 4-7 | Generic warnings ("avoid errors", "be careful") |
| 8-11 | Specific NEVER list with some reasoning |
| 12-15 | Named anti-patterns with WHY -- things only experience teaches |
The test: Would an expert say "I learned this the hard way"? Named patterns ("The Autopilot", "The Status Novel") with what-happens, why-it-fails, and fix columns score highest.
Does the Skill follow format requirements? Special focus on description quality.
| Score | Criteria |
|-------|----------|
| 0-5 | Missing frontmatter or invalid format |
| 6-10 | Has frontmatter but description is vague or incomplete |
| 11-13 | Valid frontmatter, trigger conditions present but weak exclusions or no Scope Boundary |
| 14-15 | Perfect: trigger conditions + exclusions in description, Scope Boundary table in body |
Description must contain: Trigger conditions ("Use when..."), specific phrases in quotes, searchable keywords, exclusions ("Do NOT use for: [skill] ([purpose])"). Description must NEVER summarize workflow or content -- this is the cardinal CSO rule.
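A rough pre-check for the mechanical parts of this requirement might look like the sketch below. It is an assumption-laden illustration: the function name `check_description` and the regular expressions are hypothetical, and it only detects surface markers. Whether the keywords are searchable, or whether the description summarizes workflow, still needs human or model judgment.

```python
import re


def check_description(description):
    """Report which required description elements appear to be present.

    Surface-level heuristics only (assumed patterns, not an official check).
    """
    return {
        # Trigger conditions phrased as "Use when ..."
        "trigger_conditions": bool(re.search(r"\bUse when\b", description)),
        # Specific phrases in quotes; the lookarounds skip apostrophes inside words.
        "quoted_phrases": bool(
            re.search(r"(?<!\w)'[^']+'(?!\w)|\"[^\"]+\"", description)
        ),
        # Exclusions phrased as "Do NOT use for: ..."
        "exclusions": bool(re.search(r"Do NOT use for:", description)),
    }


desc = (
    "Use when a user says 'score this skill'. "
    "Do NOT use for: skill-creator (lightweight skill creation)."
)
print(check_description(desc))
```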
Why description is critical: It's the ONLY thing Claude reads when deciding whether to load a skill. Poor description = invisible skill regardless of content quality.
Does the Skill implement proper content layering?
| Score | Criteria |
|-------|----------|
| 0-5 | Everything in SKILL.md (>500 lines) or skeleton companions with no real content |
| 6-10 | Has companions but no explicit loading triggers |
| 11-13 | Good layering with File Index including Load When triggers |
| 14-15 | Perfect: File Index with Load When + Do NOT Load, triggers embedded in workflow |
With companion files: Check File Index -- 3 columns (File, Load When, Do NOT Load)? Placed early in SKILL.md? Real content in companions (not skeletons)?
Simple Skills (<100 lines, no companions): Score based on conciseness and self-containment. Max ~7/15 for well-organized single-file skills.
Is specificity appropriate for the task's fragility?
| Score | Criteria |
|-------|----------|
| 0-5 | Severely mismatched (rigid scripts for creative tasks, vague for fragile operations) |
| 6-10 | Partially appropriate |
| 11-13 | Good calibration for most scenarios |
| 14-15 | Perfect freedom calibration throughout |

| Task Type | Should Have | Why |
|-----------|-------------|-----|
| Creative/Design | High freedom (principles, not steps) | Multiple valid approaches |
| Code review/judgment | Medium freedom (priorities + judgment) | Principles exist but context varies |
| File format operations | Low freedom (exact scripts, no deviation) | One wrong byte corrupts file |

The test: "If the Agent makes a mistake, what's the consequence?" High consequence = low freedom.
Does the Skill follow an established design pattern?
| Pattern | ~Lines | When to Use |
|---------|--------|-------------|
| Mindset | ~50 | Creative tasks requiring taste |
| Navigation | ~30 | Multiple distinct sub-scenarios |
| Philosophy | ~150 | Art/creation requiring originality |
| Process | ~200 | Complex multi-step projects |
| Tool | ~300 | Precise operations on formats |

| Score | Criteria |
|-------|----------|
| 0-3 | No recognizable pattern, chaotic structure |
| 4-6 | Partially follows a pattern |
| 7-8 | Clear pattern with minor deviations |
| 9-10 | Masterful application of appropriate pattern |
Can an Agent actually use this Skill effectively?
| Score | Criteria |
|-------|----------|
| 0-5 | Confusing, incomplete, contradictory guidance |
| 6-10 | Usable but with noticeable gaps |
| 11-13 | Clear guidance for common cases |
| 14-15 | Comprehensive: decision trees + fallbacks + edge cases |
Check for: Decision trees for multi-path scenarios, working code examples, error handling and fallbacks, edge case coverage, immediate actionability.
| Grade | Score | Meaning |
|-------|-------|---------|
| A | 108-120 | Excellent -- production expert skill |
| A- | 104-107 | Very strong with minor polish needed |
| B+ | 100-103 | Strong, passes quality gate comfortably |
| B | 96-99 | Good, passes quality gate |
| C+ | 90-95 | Near quality gate, targeted fixes can reach B |
| C | 80-89 | Adequate, clear improvement path |
| C- | 70-79 | Below average, significant gaps |
| D+ | 60-69 | Poor, needs fundamental restructuring |
| F | <60 | Needs complete redesign |
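The grade bands above translate directly into a lookup. The sketch below is illustrative only; the names `GRADE_BANDS` and `grade` are hypothetical, and the band floors are copied from the table.

```python
# (floor, grade) pairs taken from the grade table, highest band first.
GRADE_BANDS = [
    (108, "A"),
    (104, "A-"),
    (100, "B+"),
    (96, "B"),
    (90, "C+"),
    (80, "C"),
    (70, "C-"),
    (60, "D+"),
]


def grade(total):
    """Map a total score out of 120 to its letter grade."""
    if not 0 <= total <= 120:
        raise ValueError(f"total must be between 0 and 120, got {total}")
    for floor, letter in GRADE_BANDS:
        if total >= floor:
            return letter
    return "F"  # anything below 60


print(grade(98))  # B
```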
| Rationalization | When It Appears | Why It's Wrong |
|---|---|---|
| "It's long and thorough, so it must be good" | Skill is 500+ lines | Length often means redundancy. A 43-line mindset skill can outperform an 800-line dump |
| "It has nice formatting and tables" | Well-structured layout | Formatting wraps content, doesn't create knowledge delta. Score content, not presentation |
| "It covers all the basics well" | Clear fundamentals | Covering basics = covering what Claude already knows. Redundant, not valuable |
| "The procedures are detailed" | Step-by-step tutorials | Detailed GENERIC procedures waste tokens. Only domain-specific procedures have value |
| "It has code examples" | Multiple snippets | Code Claude could generate itself is redundant. Only non-obvious patterns add value |
| "The description is clear" | Readable description | Readability isn't the metric. Does it have triggers, WHEN scenarios, and exclusions? |
| "It's better than nothing" | Comparing to baseline | A bad skill wastes context and may teach wrong patterns. Sometimes no skill IS better |
Read SKILL.md completely. Mark each section [E]xpert, [A]ctivation, or [R]edundant. Calculate E:A:R ratio.
Check: frontmatter validity, total lines, companion files and sizes, pattern identification, loading triggers presence.
For each of 8 dimensions: find specific evidence (quote relevant lines), assign score with one-line justification.
Total = D1 + D2 + D3 + D4 + D5 + D6 + D7 + D8 (Max = 120)
MANDATORY verification: Add in pairs -- (D1+D2) + (D3+D4) + (D5+D6) + (D7+D8) = Total. Arithmetic errors are the most common evaluation failure across 100+ evaluations.
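The pairwise check can be illustrated with a short sketch. The per-dimension maxima below are an assumption read off the top band of each rubric table in order (20, 15, 15, 15, 15, 15, 10, 15), which sum to the stated maximum of 120. The names `MAX_SCORES` and `check_total` and the sample scores are hypothetical.

```python
# Assumed maxima, taken from the top band of each rubric table in order.
MAX_SCORES = {
    "D1": 20, "D2": 15, "D3": 15, "D4": 15,
    "D5": 15, "D6": 15, "D7": 10, "D8": 15,
}
PAIRS = [("D1", "D2"), ("D3", "D4"), ("D5", "D6"), ("D7", "D8")]


def check_total(scores, claimed_total):
    """Verify a claimed total by re-adding the dimension scores in pairs."""
    for dim, cap in MAX_SCORES.items():
        if not 0 <= scores[dim] <= cap:
            raise ValueError(f"{dim}={scores[dim]} is outside 0-{cap}")
    pair_sums = [scores[a] + scores[b] for a, b in PAIRS]
    return sum(pair_sums) == claimed_total


# Hypothetical evaluation: pairs are 28 + 25 + 25 + 20 = 98.
scores = {"D1": 16, "D2": 12, "D3": 12, "D4": 13,
          "D5": 13, "D6": 12, "D7": 8, "D8": 12}
print(check_total(scores, 98))  # True
print(check_total(scores, 99))  # False
```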
Include: Total/120 with grade, dimension table with scores and notes, critical issues, top 3 improvements with specific guidance.
For skill reference companions under ~/.claude/skills/**/references/*.md, invoke the H4 hybrid pointer-vs-prose validator from ~/.claude/scripts/skill_judge_pointer_validator.py. POINTER passes certification; PROSE refuses; BORDERLINE escalates to AI-judge pass with K1 SoT loaded.
See references/pointer-validator.md.
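The three outcomes map to three certification actions, which can be sketched as below. This does not reproduce the interface of `skill_judge_pointer_validator.py`, which is not shown here; the `Verdict` enum and `certification_action` function are hypothetical names used only to illustrate the routing.

```python
from enum import Enum


class Verdict(Enum):
    """Classification outcomes named in the text (hypothetical type)."""
    POINTER = "POINTER"
    PROSE = "PROSE"
    BORDERLINE = "BORDERLINE"


def certification_action(verdict):
    """Route a validator verdict to the certification action described above."""
    actions = {
        Verdict.POINTER: "certify",
        Verdict.PROSE: "refuse",
        Verdict.BORDERLINE: "escalate to AI-judge pass",
    }
    return actions[verdict]


print(certification_action(Verdict("BORDERLINE")))  # escalate to AI-judge pass
```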
"Would an expert in this domain say: 'Yes, this captures knowledge that took me years to learn'?"
If yes -- the Skill has genuine value. If no -- it's compressing what Claude already knows.