claude/skills/skill-improver/SKILL.md
Audit and improve agent and skill definitions for better calibration, tool scoping, context management, activation quality, and output format. Use when the user says "improve this skill", "audit this agent", "optimize this agent", "review agent definition", "fix trigger rate", "skill not activating", or invokes /skill-improver with a path. Also trigger when an agent is producing poor results and needs prompt tuning, or when a skill isn't triggering reliably. Covers: calibrated confidence scoring, tool scoping, sub-agent delegation, fork vs inline, context budgets, and activation optimization. Do NOT use for creating new skills from scratch — use /skill-creator for that.
Audit agent and skill definitions against best practices, then produce scored improvement recommendations. This skill eats its own cooking — it uses the same 4-step calibrated confidence scoring it recommends for others.
LLM agent prompts have predictable failure modes: over-broad tool access causes the model to waste time on irrelevant actions, unbounded output pollutes the orchestrator's context window, and self-reported confidence scores are pattern matching on rubric descriptions rather than calibrated probabilities.
This skill codifies what we've learned into a repeatable audit.
A path to an agent definition (agents/*.md) or skill definition (skills/*/SKILL.md).
If no path is given, ask.
This skill runs inline (no context: fork) at opus tier — set explicitly in frontmatter.
Determine the type: agent (has tools:/disallowedTools: frontmatter) or skill (has name:/description: in SKILL.md frontmatter).
Before auditing, gather empirical data from session logs. This step is best-effort: skip it if the database doesn't exist or queries return empty.
Run ingestion to ensure fresh data:
python3 ~/Dev/dotfiles/claude/skills/session-analytics/scripts/ingest.py
Spawn three parallel sub-agents (all sonnet, read-only):
| Agent | Type | Prompt |
|-------|------|--------|
| Usage | skill-analytics-usage | "Analyze usage patterns for skill: {name}" |
| Tools | skill-analytics-tools | "Analyze tool patterns for skill: {name}. Declared tools: {tools list from frontmatter}" |
| Friction | skill-analytics-friction | "Analyze friction patterns for skill: {name}" |
Collect their structured findings for use in Dimension 7.
If ingestion fails (duckdb not installed, no JSONL logs), skip to Phase 2 and omit Dimension 7 from the report. Never block the audit on analytics.
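The best-effort rule above can be sketched as a wrapper that never raises. This is a minimal illustration, not the skill's actual implementation: `run_ingestion` and `dimensions_to_audit` are hypothetical helper names, the timeout is an arbitrary choice, and the script path is copied from the ingestion command above.

```python
import subprocess
import sys
from pathlib import Path

# Path copied from the ingestion command in this document.
INGEST_SCRIPT = Path.home() / "Dev/dotfiles/claude/skills/session-analytics/scripts/ingest.py"


def run_ingestion(cmd=None, timeout_s=120):
    """Return True if ingestion succeeded, False on any failure. Never raises."""
    if cmd is None:
        cmd = [sys.executable, str(INGEST_SCRIPT)]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except (OSError, subprocess.TimeoutExpired):
        # Missing interpreter or script launcher, or a hung ingest: treat as "no analytics".
        return False
    return result.returncode == 0


def dimensions_to_audit(analytics_ok):
    """Dimension 7 (usage analytics) is only audited when ingestion worked."""
    return list(range(1, 8)) if analytics_ok else list(range(1, 7))
```

A caller would run `dimensions_to_audit(run_ingestion())` and carry on with the static audit either way.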
Evaluate the definition against each dimension below. For each finding, use the 4-step scoring process (Phase 3) before including it in the report.
Agents that make judgments (review, triage, audit) need calibrated scoring. The pattern that works:
Step 1: Classify claim type — Each category gets a base score and hard cap. Category priors predict accuracy better than the model's self-assessed number. Style nits cap at 60; bugs start at 50 and can reach 100.
Step 2: Evidence grounding — Modifiers based on verification quality. LSP-verified (+20-25), grep-confirmed (+20), specific file:line (+15), generic observation (-15), misread code (hard cap 0).
Step 3: Context modifiers — Signals that adjust severity. Git hotspot (+10), pre-existing issue (-15), public API boundary (-10), review state (+10).
Step 4: Re-assess borderline items — Items near the surfacing threshold get scored independently a second time. If scores diverge >15 points, the finding is ambiguous — don't surface it. This catches false confidence from pattern matching.
Key insight: ordering between findings is more reliable than absolute magnitude. "A is more important than B" is trustworthy. "A is exactly 82" is not.
Check: Does the agent have scoring? Does it use category priors? Does it ground evidence? Does it re-assess borderline items? Is there a surfacing threshold?
Reference implementations:
- claude/agents/fromage-age.md: review findings
- claude/agents/fromage-fort.md: PR comment triage
- claude/agents/ricotta-reducer.md: simplification audit
Three-tier model. Every agent should fall into one of these tiers:
| Tier | Tools | Use case | Frontmatter |
|------|-------|----------|-------------|
| Read-only | Grep, Glob, Read, Bash | Reviewers, auditors, explorers | disallowedTools: [Edit, Write, NotebookEdit] |
| Write-scoped | + Edit, Write | Implementers, fixers | Exclude tools not needed (WebSearch, LSP, etc.) |
| Focused sub-agent | 2-4 tools max | Pipeline sub-tasks | Disallow 5-8 unused tools explicitly |
Over-broad tool access degrades behavior in two ways: models waste tokens
considering irrelevant tools, and they're more likely to take irreversible
actions when stuck if write tools are available — even when a read-only path
exists. Prose constraints ("this is a read-only agent") are weaker than hard
disallowedTools blocks. If an agent says "read-only" in its body but doesn't
set disallowedTools, that's a finding.
Skill access (skills: [...]):
Check: Are tools appropriately constrained for the tier? Does the agent have
disallowedTools matching its stated role? Is anything listed that isn't used?
Is anything missing that the agent needs? Do skill delegations match the task?
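The "prose says read-only, frontmatter doesn't enforce it" finding can be checked mechanically. The sketch below assumes the frontmatter has already been parsed into a dict; `unblocked_write_tools` is a hypothetical helper, and the substring test for "read-only" is a deliberately crude heuristic.

```python
# Write-capable tools named in the read-only tier of the table above.
WRITE_TOOLS = {"Edit", "Write", "NotebookEdit"}


def unblocked_write_tools(frontmatter: dict, body: str) -> list[str]:
    """Return write tools left available on an agent whose body claims to be read-only.

    An empty list means either the agent makes no read-only claim or the claim
    is backed by a matching disallowedTools block.
    """
    if "read-only" not in body.lower():
        return []
    blocked = set(frontmatter.get("disallowedTools") or [])
    return sorted(WRITE_TOOLS - blocked)
```

A non-empty result is a TOOLS finding; the returned names are the entries to add to disallowedTools.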
Every token in an agent's context competes with the task at hand. Agents that produce or consume too much context degrade their own performance and their orchestrator's.
Fork vs inline — fork (sub-agent) when:
Run inline when the result is concise, immediately needed, and action-relevant.
Sub-agent delegation — agents that need to choose between strategies (research from multiple sources, review from multiple angles) should spawn parallel sub-agents rather than sequentially trying each approach. The research agent pattern: spawn N focused sub-agents, synthesize results.
Context degradation thresholds (from LLM research):
| Context size | Observed behavior |
|---|---|
| < 20K tokens | Strong instruction following, full recall |
| 20K–60K tokens | Moderate degradation, especially mid-context instructions |
| 60K–100K tokens | Noticeable instruction drift, increased repetition |
| > 100K tokens | Models increasingly ignore early system prompt instructions |
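For an audit script, the thresholds above reduce to a lookup. In this sketch `context_band` is a hypothetical name, and assigning exactly 60K tokens to the higher band is an assumption, since the table lists 60K in two rows.

```python
def context_band(tokens: int) -> str:
    """Map an estimated context size (in tokens) to the observed behavior listed above."""
    if tokens < 20_000:
        return "Strong instruction following, full recall"
    if tokens < 60_000:
        return "Moderate degradation, especially mid-context instructions"
    if tokens <= 100_000:
        # Assumption: the shared 60K boundary belongs to this band.
        return "Noticeable instruction drift, increased repetition"
    return "Models increasingly ignore early system prompt instructions"
```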
Output budgets:
$TMPDIR/), return a pointer.
Model selection. Document the rationale in the agent, not just the choice:
- opus: judgment-heavy tasks (review, architecture, complex reasoning)
- sonnet: implementation, exploration, most general-purpose work
- haiku: focused fetch tasks, simple transforms, token-constrained sub-agents
Frontmatter controls (skills only):
- context: fork runs the skill in an isolated subagent context. Use when the skill reads 30+ files, produces verbose reports, or needs isolated context. Do NOT use on guideline-only skills (no task = subagent returns nothing useful).
- allowed-tools restricts tools to only what the skill needs. Same principle as agent tool scoping but via frontmatter.
Check: Does the agent manage its output size? Should it fork? Does it use sub-agents where parallel work would help? Is the model appropriate and documented? Is the prompt file under 500 lines? Does the agent have a wrap-up signal to prevent runaway execution (e.g., "after ~60 tool calls, wrap up")? For skills: is context: fork appropriate? Are allowed-tools constrained?
Structure: Structured prompts (sections, tables, explicit rules) outperform freeform prose for instruction following. But structure has diminishing returns — a 50-row table of rules gets skimmed the same way a wall of text does.
Why over what: Explaining why a rule exists makes the model better at
edge cases. "Never use find" is brittle. "Use fd instead of find because
fd respects .gitignore and is faster on large repos" transfers to novel situations.
Positive over negative framing: "Use named exports" beats "Don't use default exports." LLMs struggle with negation — positive framing reduces rule violations by ~50% in testing. Flag rules that rely heavily on "don't", "never", "avoid" without providing the positive alternative.
Examples: One good example is worth ten rules. Two examples establish a pattern. Three confirm it. More than three for the same concept yields diminishing returns.
Role framing: A single opening sentence that establishes the agent's identity and purpose ("You are the Age phase — long maturation where cheese develops complex character") is more effective than a paragraph of role description.
Negative constraints ("What You Don't Do"): Explicit sections listing what the agent must NOT do significantly reduce scope creep and overlap with adjacent pipeline phases. Every pipeline agent should have one. Example from fromage-cook: "Make design decisions... Add tests... Review code quality" — all belong to other phases.
Decision scaffolds for judgment tasks: Skills that use "always/never" for
judgment tasks should use a structured reasoning scaffold (Classify → Ground → Context → Reassess) or
degrees-of-freedom patterns instead. Match constraint level to risk: high
freedom for low-risk, exact steps for fragile operations.
See references/decision-frameworks.md for the full pattern catalog.
Gotchas section: Every skill should capture known failure modes. These are the highest-value content per token — they directly prevent repeated failures.
Check: Is the prompt well-structured? Does it explain why? Are there examples? Is the role framing concise? Are there walls of text that could be tables? Does the agent have an explicit "What You Don't Do" section? Does it have a Gotchas section? Do judgment tasks use decision scaffolds instead of rigid rules? Does the skill's workflow align with a recognized pattern (sequential workflow, iterative refinement, context-aware tool selection, or domain-specific intelligence)?
Agents that produce reports need standardized, scannable output. The patterns that work:
$TMPDIR/, summary to orchestrator
Check: Is the output format defined? Is it scannable? Does it separate summary from detail? Is there a clear "clean" vs "issues found" signal?
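The summary/detail split can be sketched as follows. `emit_report` is a hypothetical helper and the "Full report:" line format is an assumption; `tempfile.gettempdir()` is used because it honors the TMPDIR environment variable.

```python
import os
import tempfile
from pathlib import Path


def emit_report(summary: str, details: str, name: str = "skill-audit") -> str:
    """Write the verbose report to the temp directory and return summary plus pointer.

    Only the returned string goes back to the orchestrator, so the size of
    `details` never reaches its context window.
    """
    fd, path = tempfile.mkstemp(prefix=f"{name}-", suffix=".md", dir=tempfile.gettempdir())
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        fh.write(details)
    return f"{summary}\nFull report: {path}"
```

The orchestrator sees two lines regardless of how long the detailed findings are, and can read the file only if it needs them.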
Skills have an undertriggering problem — community testing shows a 20% baseline
trigger rate. The description field is not a summary for humans; it's a trigger
specification for the model's routing decision.
Description structure — effective descriptions follow a three-part pattern:
[Core capability]. [Secondary capabilities]. Use when [trigger1], [trigger2], or when user mentions "[keyword1]", "[keyword2]".
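A rough way to tell a trigger spec from a plain summary is to look for the "Use when" clause and the quoted phrases the pattern calls for. This is a heuristic sketch only; `description_signals` and its dictionary keys are hypothetical, and it says nothing about whether the phrases are good ones.

```python
import re


def description_signals(description: str) -> dict:
    """Extract the trigger-spec markers from a skill description."""
    quoted = re.findall(r'"([^"]+)"', description)
    has_use_when = bool(re.search(r"\buse when\b", description, re.IGNORECASE))
    return {
        "has_use_when": has_use_when,
        "quoted_triggers": quoted,
        # Assumption: a trigger spec needs both the clause and at least one quoted phrase.
        "looks_like_trigger_spec": has_use_when and len(quoted) > 0,
    }
```

A description that fails this check is a candidate ACTIVATION finding, pending the manual review this section describes.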
Check for:
Frontmatter fields — check for appropriate use of:
- context: fork runs in an isolated subagent context. Use when the skill reads 30+ files or produces verbose reports. Skills with only guidelines (no task) should NOT fork, because the subagent gets no actionable prompt.
- agent specifies the subagent type when context: fork is set (Explore, Plan, general-purpose, or custom). Should match the skill's workload.
- allowed-tools restricts which tools the skill can use. Grants access without per-use approval. Use to constrain skills to their actual needs.
- disable-model-invocation: true requires explicit /skill-name to trigger. Appropriate for destructive or infrequently-needed skills.
- effort overrides the model effort level (low/medium/high). Use effort: high for research-heavy skills, effort: low for simple formatting tasks. New in March 2026.
- user-invocable: false hides the skill from the / menu. Use for background knowledge Claude should know but users shouldn't invoke directly.
Check: Is the description a trigger spec or just a summary? Does it list trigger phrases? Is it pushy enough? Are frontmatter fields appropriate? Would /skill-creator description optimization improve trigger rate?
Static analysis reveals what the definition says. Usage analytics reveals what actually happens when the skill runs. This dimension uses findings from the Phase 1.5 sub-agents. Skip entirely if analytics data was unavailable.
What to look for:
Zero or low invocations — Skill exists but isn't used. Cross-reference with Dimension 6 (activation). A well-described skill with zero invocations is a stronger signal than a poorly-described one with zero invocations.
Declared-vs-actual tool mismatch — allowed-tools lists Read but the
skill never reads files in practice. Or the skill triggers Bash calls that
aren't in allowed-tools. Mismatches reveal stale declarations or missing
permissions.
Undeclared agent spawns — Skill spawns agent types it doesn't document. Either the agent spawns are intentional (add them to docs) or unintended (scope creep from the model).
High error rate vs baseline — If tools error >2x the baseline rate during skill windows, the skill is fighting the environment. Common causes: wrong tool for the job, missing permissions, stale file paths.
Permission friction — Repeated denials in skill windows mean the skill triggers tools not in the user's allowlist. Either add to allowlist docs or change the skill's approach.
Hook interruptions — Stop hooks blocking continuation during skill execution reveal conflicts between the skill's behavior and the user's guard rails.
Declining usage — Skill was active, now rarely used. Something changed — a better alternative, workflow shift, or the problem it solved was fixed. Worth flagging for the user to decide if the skill should be retired.
Single-project concentration — Skill used in only one project may be too specialized for its current scope, or could be generalized.
Check: Does actual tool usage match declarations? Is the error rate elevated? Are there permission or hook conflicts? Is usage healthy or declining?
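Two of the signals above reduce to simple comparisons once the sub-agents have returned counts. The sketch assumes those findings are available as plain lists and numbers; `tool_mismatch` and `error_rate_elevated` are hypothetical names, and returning False for zero calls is an assumption.

```python
def tool_mismatch(declared, used):
    """Compare declared tools against tools actually observed in skill windows."""
    declared, used = set(declared), set(used)
    return {
        "unused": sorted(declared - used),      # declared but never called: stale declaration
        "undeclared": sorted(used - declared),  # called but not declared: missing permission
    }


def error_rate_elevated(skill_errors, skill_calls, baseline_rate, factor=2.0):
    """True when the error rate inside skill windows exceeds `factor` times the baseline."""
    if skill_calls == 0:
        return False  # Assumption: no calls means no evidence either way.
    return skill_errors / skill_calls > factor * baseline_rate
```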
For each improvement recommendation, apply the same 4-step scoring this skill recommends for others. Walk the walk.
| Type | Description | Base score | Cap |
|------|-------------|------------|-----|
| SCORING | Missing or miscalibrated confidence scoring | 45 | 100 |
| TOOLS | Tool access too broad or too narrow | 40 | 90 |
| CONTEXT | Context pollution, missing fork/delegation, wrong model | 40 | 95 |
| PROMPT | Ambiguous instructions, missing examples, wall of text | 35 | 85 |
| OUTPUT | Missing or unclear output format | 30 | 80 |
| ACTIVATION | Poor description, missing triggers, wrong frontmatter fields | 35 | 90 |
| ENFORCEMENT | Critical rule as instruction-only, missing companion hooks | 40 | 90 |
| ANALYTICS | Usage data contradicts definition (tool mismatch, friction, decay) | 35 | 85 |
| Evidence quality | Modifier |
|------------------|----------|
| Cites specific line in the definition + concrete failure scenario | +20 |
| Backed by session analytics data (query results, counts, rates) | +15 |
| Names a reference implementation that does it right | +15 |
| References a CLAUDE.md rule or established pattern | +10 |
| Generic observation without specific reference | -10 |
| Misreads the definition or overlooks existing handling | hard cap at 0 |
| Signal | Modifier |
|--------|----------|
| Agent is judgment-heavy (review, triage, audit) and lacks scoring | +15 |
| Agent produces unbounded output with no size constraint | +10 |
| Agent is a focused sub-agent (context management less critical) | -10 |
| Issue is stylistic preference rather than functional impact | -15 |
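The three tables above combine into a single score roughly as follows. This is a sketch under stated assumptions: the tables give the numbers, but the document does not say that modifiers stack additively or that the floor is 0, so both are guesses here. The dictionary keys and `score_recommendation` are hypothetical names.

```python
# (base score, cap) per category, copied from the category table.
CATEGORIES = {
    "SCORING": (45, 100), "TOOLS": (40, 90), "CONTEXT": (40, 95), "PROMPT": (35, 85),
    "OUTPUT": (30, 80), "ACTIVATION": (35, 90), "ENFORCEMENT": (40, 90), "ANALYTICS": (35, 85),
}
# Evidence and context modifiers, copied from the two modifier tables.
EVIDENCE = {"cites_line": 20, "analytics": 15, "reference_impl": 15, "claude_md_rule": 10, "generic": -10}
CONTEXT = {"judgment_no_scoring": 15, "unbounded_output": 10, "focused_subagent": -10, "stylistic": -15}


def score_recommendation(category, evidence=(), context=(), misread=False):
    """Steps 1-3: category prior, evidence grounding, context modifiers."""
    if misread:
        return 0  # Misreading the definition is a hard cap at 0.
    base, cap = CATEGORIES[category]
    total = base + sum(EVIDENCE[e] for e in evidence) + sum(CONTEXT[c] for c in context)
    return max(0, min(cap, total))  # Assumption: additive modifiers, clamped to [0, cap].
```

For example, a SCORING finding that cites a specific line, names a reference implementation, and concerns a judgment-heavy agent comes out at 45 + 20 + 15 + 15 = 95 under these assumptions.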
For any recommendation scoring 35-49: re-read the full definition file, then score independently a second time without looking at your first score. If the two scores diverge by >15 points, don't surface — the recommendation is ambiguous. If both scores land >= 50, surface it.
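The re-assessment rule can be written as a decision over two independent scores. `should_surface` is a hypothetical name. The rule does not spell out what happens when the two passes agree but one lands below 50; this sketch assumes such a finding stays below threshold.

```python
def should_surface(first: int, second: int, threshold: int = 50, max_divergence: int = 15) -> bool:
    """Step 4: decide whether a borderline recommendation is reported."""
    if abs(first - second) > max_divergence:
        return False  # The two passes disagree: the recommendation is ambiguous.
    # Assumption: agreement alone is not enough, both passes must clear the threshold.
    return first >= threshold and second >= threshold
```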
## Skill Improvement Report: <name>
### Summary
- Type: agent | skill
- Model: <model>
- Tools: <N allowed, N disallowed>
- Prompt size: <N lines>
- Findings: N total (N scored >= 50, N below threshold)
### Recommendations (score >= 50)
| # | Score | Category | Issue | Recommendation |
|---|-------|----------|-------|----------------|
| 1 | 95 | SCORING | No confidence scoring on judgment agent | Add 4-step calibration (see fromage-age pattern) |
| 2 | 85 | TOOLS | Edit/Write allowed on read-only reviewer | Add to disallowedTools |
| 3 | 80 | CONTEXT | Unbounded output, no summary/detail split | Write details to $TMPDIR, return summary |
### Detailed Recommendations
For each finding >= 50, expand with:
- **What**: the specific issue
- **Why**: why it matters (with reference to a pattern or principle)
- **How**: concrete fix, ideally with the exact frontmatter or section to add/change
- **Reference**: link to an agent/skill that does it right
### Recommended Hooks (if applicable)
For findings where enforcement matters more than guidance, suggest a companion
hook from `references/hooks-catalog.md`:
| Finding # | Hook Type | What It Enforces |
|-----------|-----------|-----------------|
| (only include findings where a hook would help — not every finding needs one) |
If no findings warrant hooks, omit this section.
### Below Threshold
N findings scored < 50 (not shown)
If the audit reveals activation or trigger issues, suggest the user run
/skill-creator to generate eval queries and measure trigger rate before/after
applying changes. Static audit identifies problems; eval-driven iteration
validates fixes.
These are the most common issues across agent/skill definitions, ordered by how often they appear and how much impact they have:
Judgment without scoring — Agent makes pass/fail decisions but has no confidence framework. Every reviewer, auditor, and triage agent needs the 4-step calibration.
Prose-only tool constraints — Agent says "read-only" in its body but
doesn't set disallowedTools: [Edit, Write, NotebookEdit] in frontmatter.
Prose constraints are weaker than hard blocks — models with write tools
available are more likely to take irreversible actions when stuck.
Unbounded tool access — Agent has all tools available when it only needs 3-4. Especially common in agents cloned from a general-purpose template.
Monolithic output — Agent dumps everything into the conversation instead of writing details to a temp file and returning a summary. Pollutes the orchestrator's context window.
Missing model directive — No model: in frontmatter. Defaults to
whatever the parent uses, which may be wrong (opus for a haiku-appropriate
fetch task, or haiku for a judgment-heavy review).
No output format — Agent has no defined output structure. Results vary
wildly between invocations. Structured output (tables) reduces unverifiable
claims because you need a file:line to fill the cell.
No "What You Don't Do" section — Pipeline agents without explicit negative constraints have the most scope-creep and overlap risk with adjacent phases.
No wrap-up signal — Long-running agents without a tool-call limit (e.g., "after ~60 tool calls, wrap up") can run indefinitely, consuming context until performance degrades.
PR skills without health checks — Skills that respond to PR review
comments but don't check CI status or merge conflicts first. Build failures
and merge conflicts should be fixed before processing review comments —
comments may be moot if the build is broken. Reference: /respond checks
get_check_runs and mergeable_state in Phase 0.
The remaining anti-patterns (freeform instructions, missing "why", over-specified role, passive descriptions, negation-heavy rules, critical-rule-as-instruction-only, always/never for judgment, missing Gotchas) are lower-frequency and covered by the dimension audit sections above.
Read these before making changes to the relevant dimension:
- references/calibrated-scoring.md: 4-step calibration method and templates
- references/description-optimization.md: Trigger optimization with before/after examples
- references/decision-frameworks.md: Structured reasoning scaffold, degrees of freedom, example-driven spec
- references/hooks-catalog.md: JS hook examples for activation, validation, enforcement
disallowedTools even when the agent's tool list is naturally constrained by platform defaults