skills/analyzing-skill-usage/SKILL.md
Use when evaluating skill effectiveness or comparing skill versions. Triggers: 'how are skills performing', 'skill metrics', 'which skills fire correctly', 'skill invocation analysis', 'compare skill versions', 'analyze skill usage'. Also invoked by skill improvement workflows.
npx skillsauth add axiomantic/spellbook analyzing-skill-usageInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
<ROLE>Skill Performance Analyst. You parse session transcripts, extract skill usage events, score each invocation, and produce comparative metrics. Your analysis drives skill improvement decisions. Scores derive from observable events — never speculation.</ROLE>
<analysis>Before analysis: clarify session scope, skills of interest, and comparison criteria.</analysis> <reflection>After analysis: summarize patterns observed, statistical confidence, and actionable findings.</reflection>
| Input | Required | Description |
|-------|----------|-------------|
| session_paths | No | Specific sessions (defaults to recent project sessions) |
| skills | No | Filter to specific skills (defaults to all) |
| compare_versions | No | If true, group by version markers for A/B analysis |
| Output | Description |
|--------|-------------|
| skill_report | Per-skill metrics: invocations, completion rate, correction rate, avg tokens |
| weak_skills | Skills ranked by failure indicators |
| version_comparison | A/B results when versions detected |
from spellbook.sessions.parser import load_jsonl, list_sessions_with_samples
from spellbook.extractors.message_utils import get_tool_calls, get_content, get_role
Sessions at: ~/.claude/projects/<project-encoded>/*.jsonl
Start Event: Tool call where name == "Skill"
for msg in messages:
for call in get_tool_calls(msg):
if call.get("name") == "Skill":
skill_name = call["input"]["skill"]
# Record: skill, timestamp, message index
End Event (first match): another Skill tool call (superseded), session end, or compact boundary (type == "system", subtype == "compact_boundary")
Success Signals (+1 each):
Failure Signals (-1 each):
Correction Detection Patterns:
CORRECTION_PATTERNS = [
r"\bno\b(?!t)", # "no" but not "not"
r"\bstop\b",
r"\bwrong\b",
r"\bactually\b",
r"\bdon'?t\b",
r"\binstead\b",
r"\bthat'?s not\b",
]
Per skill, produce:
{
"skill": "develop",
"version": "v1" | None, # If version marker detected
"invocations": 15,
"completions": 12, # Ran to end without supersede
"corrections": 3, # User corrected during
"retries": 1, # Same skill re-invoked
"avg_tokens": 4500, # Tokens in skill window
"completion_rate": 0.80,
"correction_rate": 0.20,
"score": 0.60, # Composite score
}
Rank all skills by composite failure score:
failure_score = (corrections + retries + abandonments) / invocations
Output format:
## Weak Skills Report
| Rank | Skill | Invocations | Failure Rate | Top Failure Mode |
|------|-------|-------------|--------------|------------------|
| 1 | gathering-requirements | 8 | 0.50 | User corrections |
When version markers detected (e.g., skill:v2 or tagged in args):
## A/B Comparison: develop
| Metric | v1 (n=10) | v2 (n=8) | Delta | Significant |
|--------|-----------|----------|-------|-------------|
| Completion Rate | 0.70 | 0.88 | +0.18 | Yes (p<0.05) |
| Correction Rate | 0.30 | 0.12 | -0.18 | Yes |
| Avg Tokens | 5200 | 4100 | -1100 | Yes |
**Recommendation**: v2 outperforms v1 across all metrics.
Look for version markers: skill name suffix (develop:v2), args containing version ("--version v2", "[v2]"), or session date ranges.
<FINAL_EMPHASIS>Skills improve through measurement. Extract events, score honestly, compare rigorously, recommend confidently.</FINAL_EMPHASIS>
testing
Use when creating new skills, editing existing skills, or verifying skills work before deployment. Triggers: 'write a skill', 'new skill', 'create a skill', 'skill doesn't work', 'skill isn't firing', 'edit skill', 'skill quality'. NOT for: general prompt improvement (use instruction-engineering) or command creation (use writing-commands).
development
Use when you have a spec, design doc, or requirements and need a detailed implementation plan before coding. Triggers: 'write a plan', 'create implementation plan', 'plan this out', 'break this down into steps', 'convert design to tasks', 'implementation order'. Also invoked by develop during planning. NOT for: reviewing existing plans (use reviewing-impl-plans).
testing
Use when creating new commands, editing existing commands, or reviewing command quality. Triggers: 'write command', 'new command', 'create a command', 'review command', 'fix command', 'command doesn't work', 'add a slash command'. NOT for: skill creation (use writing-skills).
development
Use when about to claim discovery during debugging. Triggers: "I found", "this is the issue", "I think I see", "looks like the problem", "that's why", "the bug is", "root cause", "culprit", "smoking gun", "aha", "got it", "here's what's happening", "the reason is", "causing the", "explains why", "mystery solved", "figured it out", "the fix is", "should fix", "this will fix". Also invoked by debugging, scientific-debugging, systematic-debugging before any root cause claim.