skills/skill-improver/SKILL.md
Audit and improve agent and skill definitions — calibration, tool scoping, context budget, activation quality, and output format. Use when the user says "improve this skill", "audit this agent", "optimize this agent", "review agent definition", "fix trigger rate", "skill not activating", or invokes /skill-improver with a path. Also trigger when an agent is producing poor results and needs prompt tuning, or when a skill isn't triggering reliably. Covers: confidence and severity calibration, tool scoping, sub-agent delegation, fork vs inline, context budgets, and activation optimization. Do NOT use for creating new skills from scratch — use /skill-creator for that.
Audit agent and skill definitions against best practices, then produce calibrated improvement recommendations. This skill eats its own cooking — it uses the same confidence/severity calibration it recommends for others.
LLM agent prompts have predictable failure modes: over-broad tool access causes the model to waste time on irrelevant actions, unbounded output pollutes the orchestrator's context window, and self-reported confidence scores are pattern matching on rubric descriptions rather than calibrated probabilities.
This skill codifies what we've learned into a repeatable audit.
A path to an agent definition (agents/*.md) or skill definition (skills/*/SKILL.md).
If no path is given, ask.
This skill runs inline (no context: fork) at opus tier — set explicitly in frontmatter.
tools:/disallowedTools: frontmatter) or skill (has name:/description: in SKILL.md frontmatter)
Before auditing, gather empirical data from session logs. This step is best-effort: skip it if the database doesn't exist or queries return empty.
Run ingestion to ensure fresh data:
python3 ~/Dev/dotfiles/skills/session-analytics/scripts/ingest.py
Fan out one parallel duckdb-expert spawn per owned domain (read-only) —
this skill owns three domain packs co-located under its own references/:
| Domain pack | Spawn prompt |
|-------------|--------------|
| skill-usage | "Run analytics pack skill-improver/references/skill-usage.md for target {name}. harness=all" |
| agent-orchestration | "Run analytics pack skill-improver/references/agent-orchestration.md for target {name}. harness=all" |
| drift-regression | "Run analytics pack skill-improver/references/drift-regression.md for target {name}. harness=all" |
Each spawn reads its pack's queries plus the canonical schema from
session-analytics/references/canonical-schema.md, and returns one ~2 KB
digest. This is the platform's one-domain-per-spawn contract — do not collapse
to a single all-domains spawn.
Collect the three digests for use in Dimension 7.
If ingestion fails (duckdb not installed, no JSONL logs), skip to Phase 2 and omit Dimension 7 from the report. Never block the audit on analytics.
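The fan-out above can be sketched as a small prompt builder. This is a minimal illustration only: `build_spawn_prompts` and `DOMAIN_PACKS` are hypothetical names, and the actual spawn mechanism is not shown because the source does not specify it. The prompt wording is copied from the table.

```python
# Hypothetical helper: pack names and prompt wording come from the table above;
# the function and constant names are illustrative assumptions.
DOMAIN_PACKS = ("skill-usage", "agent-orchestration", "drift-regression")


def build_spawn_prompts(target: str, harness: str = "all") -> dict[str, str]:
    """Return one spawn prompt per domain pack (one domain per spawn)."""
    return {
        pack: (
            f"Run analytics pack skill-improver/references/{pack}.md "
            f"for target {target}. harness={harness}"
        )
        for pack in DOMAIN_PACKS
    }


prompts = build_spawn_prompts("skill-improver")
for pack, prompt in prompts.items():
    print(f"{pack}: {prompt}")
```

Keeping one entry per pack makes it hard to collapse the three domains into a single spawn by accident.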
Evaluate the definition against each dimension below. For each finding, use the calibration process (Phase 3) before including it in the report.
Agents that make judgments (review, triage, audit) need calibrated findings — but on two qualitative axes, not one number. LLM absolute numeric self-scores are poorly calibrated (models anchor to round numbers and conflate "important" with "certain"); relative/qualitative judgments track human assessment far better, so rank and tag rather than score.
Confidence — how sure you are the finding is real. Tag each one:
<certain> (verified by reading the code, running it, or citing a source),
<speculative> (pattern-match or inference — surface it, but say so),
<don't know> (can't tell — do NOT surface; drop or raise as an open question).
Evidence drives the tag: tool-verified / specific file:line + concrete failure →
<certain>; generic observation → <speculative>; misread or unverifiable →
<don't know>.
Severity — how much it matters if real: blocker (broken as written) /
high / medium / low (style, polish). Orthogonal to confidence: a
<certain> style nit is low; a <speculative> correctness risk can be high.
Re-assess borderline — for any <speculative> finding you're about to
surface, re-derive the reasoning once more without looking at the first pass.
If it doesn't reproduce, drop to <don't know>. Never average two divergent
reads into a vague "maybe" — divergence means you don't actually know.
Key insight: tiers and tags are more reliable than absolute magnitude. "A is more severe than B" and "certain vs speculative" are trustworthy. "A is exactly 82" is invented precision.
Check: Does the agent separate confidence from severity? Does it use the
<certain>/<speculative>/<don't know> tags instead of an invented number?
Does it ground evidence? Does it re-assess borderline items? Is the surfacing
rule clear (don't-know never surfaces)?
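As a rough sketch, the evidence-to-tag mapping and the surfacing rule could be expressed as below. The function names and boolean inputs are assumptions made for illustration; in practice the evidence judgment is made by the auditing agent, not by a lookup.

```python
from enum import Enum


class Confidence(Enum):
    CERTAIN = "<certain>"
    SPECULATIVE = "<speculative>"
    DONT_KNOW = "<don't know>"


def tag_confidence(tool_verified: bool = False,
                   cites_location_and_failure: bool = False,
                   unverifiable: bool = False) -> Confidence:
    """Map evidence quality to a confidence tag (illustrative inputs)."""
    if unverifiable:
        # A misread or unverifiable claim is never promoted by other evidence.
        return Confidence.DONT_KNOW
    if tool_verified or cites_location_and_failure:
        return Confidence.CERTAIN
    # Generic observation or pattern match: surface it, but say so.
    return Confidence.SPECULATIVE


def should_surface(tag: Confidence) -> bool:
    """Don't-know findings never surface."""
    return tag is not Confidence.DONT_KNOW
```

Severity is deliberately absent from this function: it is a separate axis and should be assigned independently of the tag.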
Reference implementations:
/age: severity tiers (Blocker/High/Medium/Low) with per-finding confidence
certain | speculative | don't know
Three-tier model. Every agent should fall into one of these tiers:
| Tier | Tools | Use case | Frontmatter |
|------|-------|----------|-------------|
| Read-only | Grep, Glob, Read, Bash | Reviewers, auditors, explorers | disallowedTools: [Edit, Write, NotebookEdit] |
| Write-scoped | + Edit, Write | Implementers, fixers | Exclude tools not needed (WebSearch, mcp__serena__*, etc.) |
| Focused sub-agent | 2-4 tools max | Pipeline sub-tasks | Disallow 5-8 unused tools explicitly |
Over-broad tool access degrades behavior in two ways: models waste tokens
considering irrelevant tools, and they're more likely to take irreversible
actions when stuck if write tools are available — even when a read-only path
exists. Prose constraints ("this is a read-only agent") are weaker than hard
disallowedTools blocks. If an agent says "read-only" in its body but doesn't
set disallowedTools, that's a finding.
Skill access (skills: [...]):
Check: Are tools appropriately constrained for the tier? Does the agent have
disallowedTools matching its stated role? Is anything listed that isn't used?
Is anything missing that the agent needs? Do skill delegations match the task?
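The "says read-only but sets no disallowedTools" finding lends itself to a mechanical check. The sketch below assumes the frontmatter has already been parsed into a dict and the body is available as a string; `prose_only_read_only` is a hypothetical name, and the regex is a crude heuristic rather than a complete detector.

```python
import re

# Write tools named in the read-only tier of the table above.
WRITE_TOOLS = {"Edit", "Write", "NotebookEdit"}


def prose_only_read_only(frontmatter: dict, body: str) -> bool:
    """True when the body claims read-only but the write tools are not all blocked."""
    claims_read_only = re.search(r"\bread-only\b", body, re.IGNORECASE) is not None
    blocked = set(frontmatter.get("disallowedTools", []))
    return claims_read_only and not WRITE_TOOLS <= blocked


# Example: prose constraint with no hard block is a finding.
print(prose_only_read_only({}, "This is a read-only agent."))
```

A positive result still needs a human or agent read, since a body can mention "read-only" while describing something other than its own role.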
Every token in an agent's context competes with the task at hand. Agents that produce or consume too much context degrade their own performance and their orchestrator's.
Fork vs inline — fork (sub-agent) when:
Run inline when the result is concise, immediately needed, and action-relevant.
Sub-agent delegation — agents that need to choose between strategies (research from multiple sources, review from multiple angles) should spawn parallel sub-agents rather than sequentially trying each approach. The research agent pattern: spawn N focused sub-agents, synthesize results.
Context degradation — performance drops as context grows, and information buried mid-context suffers most (primacy/recency bias). This is well-established directionally (Lost in the Middle, RULER, NoLiMa); the token buckets below are illustrative rules of thumb, not measured thresholds:
| Context size | Rule-of-thumb behavior |
|---|---|
| < 20K tokens | Strong instruction following, full recall |
| 20K–60K tokens | Moderate degradation, especially mid-context instructions |
| 60K–100K tokens | Noticeable instruction drift, increased repetition |
| > 100K tokens | Early system-prompt instructions increasingly ignored |
Output budgets:
$TMPDIR/), return a pointer.
Model selection (document the rationale in the agent, not just the choice):
- opus: judgment-heavy tasks (review, architecture, complex reasoning)
- sonnet: implementation, exploration, most general-purpose work
- haiku: focused fetch tasks, simple transforms, token-constrained sub-agents
Frontmatter controls (skills only):
- `context: fork`: runs the skill in an isolated subagent context. Use when the skill reads 30+ files, produces verbose reports, or needs isolated context. Do NOT use on guideline-only skills (no task = subagent returns nothing useful).
- `allowed-tools`: restricts tools to only what the skill needs. Same principle as agent tool scoping but via frontmatter.
Check: Does the agent manage its output size? Should it fork? Does it use
sub-agents where parallel work would help? Is the model appropriate and
documented? Is the prompt file under 500 lines? Does the agent have a wrap-up
signal to prevent runaway execution (e.g., "after ~60 tool calls, wrap up")?
For skills: is context: fork appropriate? Are allowed-tools constrained?
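Two of the checks above (prompt file under 500 lines, presence of a wrap-up signal) are simple enough to automate. The following is a minimal sketch under stated assumptions: `context_budget_findings` is a hypothetical helper, and matching the phrase "wrap up" is only a heuristic for detecting a wrap-up signal.

```python
import re

# Heuristic only: treats any "wrap up" / "wrap-up" / "wrapup" phrase as a signal.
WRAP_UP = re.compile(r"wrap[\s-]?up", re.IGNORECASE)


def context_budget_findings(prompt_text: str, max_lines: int = 500) -> list[str]:
    """Return human-readable findings for the two mechanical context checks."""
    findings = []
    line_count = len(prompt_text.splitlines())
    if line_count >= max_lines:
        findings.append(f"prompt is {line_count} lines; target is under {max_lines}")
    if not WRAP_UP.search(prompt_text):
        findings.append("no wrap-up signal found")
    return findings


print(context_budget_findings("After ~60 tool calls, wrap up."))
```

The remaining questions (fork vs inline, model rationale, sub-agent use) need judgment and are not covered by this sketch.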
Structure: Structured prompts (sections, tables, explicit rules) outperform freeform prose for instruction following. But structure has diminishing returns — a 50-row table of rules gets skimmed the same way a wall of text does.
Why over what: Explaining why a rule exists makes the model better at
edge cases. "Never use find" is brittle. "Use fd instead of find because
fd respects .gitignore and is faster on large repos" transfers to novel situations.
Positive over negative framing: "Use named exports" beats "Don't use default exports." LLMs struggle with negation, so positive framing materially reduces rule violations (magnitude varies by model and rule; Anthropic's prompting guidance recommends stating the desired behavior). Flag rules that rely heavily on "don't", "never", "avoid" without providing the positive alternative.
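A crude first pass at flagging negation-heavy rules might look like this. The word lists are assumptions chosen for illustration, and "has a positive alternative" is approximated by the presence of "instead" or "prefer", which will miss many phrasings.

```python
import re

NEGATION = re.compile(r"\b(don't|do not|never|avoid)\b", re.IGNORECASE)
# Assumed proxy for "a positive alternative is stated".
POSITIVE_HINT = re.compile(r"\b(instead|prefer)\b", re.IGNORECASE)


def negation_only_rules(rules: list[str]) -> list[str]:
    """Return rules that use negation wording without an obvious alternative."""
    return [
        rule for rule in rules
        if NEGATION.search(rule) and not POSITIVE_HINT.search(rule)
    ]


rules = [
    "Don't use default exports.",
    "Use named exports.",
    "Never use find; use fd instead because it respects .gitignore.",
]
flagged = negation_only_rules(rules)
print(flagged)
```

Flagged rules are candidates for rewriting, not automatic findings; some prohibitions have no sensible positive form.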
Examples: One good example is worth ten rules. Two examples establish a pattern. Three confirm it. More than three for the same concept yields diminishing returns.
Role framing: A single opening sentence that establishes the agent's identity and purpose ("You are the Age phase — long maturation where cheese develops complex character") is more effective than a paragraph of role description.
Negative constraints ("What You Don't Do"): Explicit sections listing what the agent must NOT do significantly reduce scope creep and overlap with adjacent pipeline phases. Every pipeline agent should have one. Example: an implementation agent's "What You Don't Do" might list "Make design decisions... Add tests... Review code quality" — all belong to other phases.
Decision scaffolds for judgment tasks: Skills that use "always/never" for
judgment tasks should use a structured reasoning scaffold (Classify → Ground → Context → Reassess) or
degrees-of-freedom patterns instead. Match constraint level to risk: high
freedom for low-risk, exact steps for fragile operations.
See references/decision-frameworks.md for the full pattern catalog.
Gotchas section: Every skill should capture known failure modes. These are the highest-value content per token — they directly prevent repeated failures.
Check: Is the prompt well-structured? Does it explain why? Are there examples? Is the role framing concise? Are there walls of text that could be tables? Does the agent have an explicit "What You Don't Do" section? Does it have a Gotchas section? Do judgment tasks use decision scaffolds instead of rigid rules? Does the skill's workflow align with a recognized pattern (sequential workflow, iterative refinement, context-aware tool selection, or domain-specific intelligence)?
Agents that produce reports need standardized, scannable output. The patterns that work:
$TMPDIR/, summary to orchestrator
Check: Is the output format defined? Is it scannable? Does it separate summary from detail? Is there a clear "clean" vs "issues found" signal?
Skills have an undertriggering problem — they often fail to activate when they
should. Community testing reports baseline auto-activation rates that vary widely
(roughly 20–55% depending on description quality and environment; not an official
Anthropic figure). The description field is not a summary for humans; it's a
trigger specification for the model's routing decision.
Description structure — effective descriptions follow a three-part pattern:
[Core capability]. [Secondary capabilities]. Use when [trigger1], [trigger2], or when user mentions "[keyword1]", "[keyword2]".
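To make the three-part pattern concrete, here is a small sketch that checks a description for the two most mechanical elements: a "Use when" clause and at least one quoted trigger phrase. `description_gaps` is a hypothetical helper, and passing it says nothing about whether the triggers are well chosen.

```python
import re


def description_gaps(description: str) -> list[str]:
    """List missing trigger-spec elements in a skill description."""
    gaps = []
    if "use when" not in description.lower():
        gaps.append("no 'Use when' trigger clause")
    # Looks for at least one phrase wrapped in straight double quotes.
    if not re.search(r'"[^"]+"', description):
        gaps.append("no quoted trigger phrases")
    return gaps


print(description_gaps('Audits skills. Use when the user says "improve this skill".'))
print(description_gaps("Audits skill definitions."))
```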
Check for:
Frontmatter fields — check for appropriate use of:
- `context: fork`: runs in isolated subagent context. Use when the skill reads 30+ files or produces verbose reports. Skills with only guidelines (no task) should NOT fork, since the subagent gets no actionable prompt.
- `agent`: specifies subagent type when `context: fork` is set (Explore, Plan, general-purpose, or custom). Should match the skill's workload.
- `allowed-tools`: restricts which tools the skill can use. Grants access without per-use approval. Use to constrain skills to their actual needs.
- `disable-model-invocation: true`: requires explicit /skill-name to trigger. Appropriate for destructive or infrequently-needed skills.
- `effort`: overrides model effort level (low/medium/high). Use `effort: high` for research-heavy skills, `effort: low` for simple formatting tasks. Supported for both skills and subagents (per code.claude.com/docs).
- `user-invocable: false`: hides from / menu. Use for background knowledge Claude should know but users shouldn't invoke directly.
Check: Is the description a trigger spec or just a summary? Does it list
trigger phrases? Is it pushy enough? Are frontmatter fields appropriate?
Would /skill-creator description optimization improve trigger rate?
Static analysis reveals what the definition says. Usage analytics reveals what
actually happens when the skill runs. This dimension reads the three fanned-out
digests from Phase 1.5 (skill-usage, agent-orchestration, drift-regression).
Skip entirely if analytics data was unavailable.
What to look for:
Zero or low invocations — Skill exists but isn't used. Cross-reference with Dimension 6 (activation). A well-described skill with zero invocations is a stronger signal than a poorly-described one with zero invocations.
Declared-vs-actual tool mismatch — allowed-tools lists Read but the
skill never reads files in practice. Or the skill triggers Bash calls that
aren't in allowed-tools. Mismatches reveal stale declarations or missing
permissions.
Undeclared agent spawns — Skill spawns agent types it doesn't document. Either the agent spawns are intentional (add them to docs) or unintended (scope creep from the model).
High error rate vs baseline — If tools error >2x the baseline rate during skill windows, the skill is fighting the environment. Common causes: wrong tool for the job, missing permissions, stale file paths.
Permission friction — Repeated denials in skill windows mean the skill triggers tools not in the user's allowlist. Either add to allowlist docs or change the skill's approach.
Hook interruptions — Stop hooks blocking continuation during skill execution reveal conflicts between the skill's behavior and the user's guard rails.
Declining usage — Skill was active, now rarely used. Something changed — a better alternative, workflow shift, or the problem it solved was fixed. Worth flagging for the user to decide if the skill should be retired.
Single-project concentration — Skill used in only one project may be too specialized for its current scope, or could be generalized.
Check: Does actual tool usage match declarations? Is the error rate elevated? Are there permission or hook conflicts? Is usage healthy or declining?
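Two of the signals above reduce to simple arithmetic once the digests are in hand: the declared-vs-actual tool mismatch is a pair of set differences, and the error-rate signal is a ratio compared against twice the baseline. The sketch below assumes the digest has already been reduced to plain sets and counts; the function names are hypothetical.

```python
def tool_mismatch(declared: set[str], observed: set[str]) -> dict[str, set[str]]:
    """Split mismatches into stale declarations and undeclared usage."""
    return {
        "unused": declared - observed,      # declared but never used in practice
        "undeclared": observed - declared,  # used in practice but not declared
    }


def error_rate_elevated(skill_errors: int, skill_calls: int,
                        baseline_rate: float, factor: float = 2.0) -> bool:
    """True when the in-skill error rate exceeds factor x the baseline rate."""
    if skill_calls == 0:
        return False  # no data, so no signal
    return skill_errors / skill_calls > factor * baseline_rate


print(tool_mismatch({"Read", "Grep"}, {"Grep", "Bash"}))
print(error_rate_elevated(skill_errors=3, skill_calls=10, baseline_rate=0.1))
```

With few invocations the ratio is noisy, so a result based on a handful of calls is better tagged `<speculative>` than `<certain>`.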
For each recommendation, assign a confidence tag and a severity tier — the same calibration this skill recommends for others. Walk the walk.
Each category has a default severity tier; adjust for the specific case.
| Type | Description | Default severity |
|------|-------------|------------------|
| SCORING | Missing or miscalibrated confidence/severity calibration | high |
| TOOLS | Tool access too broad or too narrow | high |
| CONTEXT | Context pollution, missing fork/delegation, wrong model | high |
| PROMPT | Ambiguous instructions, missing examples, wall of text | medium |
| OUTPUT | Missing or unclear output format | medium |
| ACTIVATION | Poor description, missing triggers, wrong frontmatter fields | high |
| ENFORCEMENT | Critical rule as instruction-only, missing companion hooks | high |
| ANALYTICS | Usage data contradicts definition (tool mismatch, friction, decay) | medium |
| Evidence quality | Confidence |
|------------------|------------|
| Cites a specific line + concrete failure, or backed by analytics data, or names a reference impl that does it right | <certain> |
| References a CLAUDE.md rule / pattern, or a checkable but unverified observation | <speculative> (raise once verified) |
| Generic observation with no specific reference | <speculative> at most |
| Misreads the definition or overlooks existing handling | <don't know> — drop it |
| Signal | Effect |
|--------|--------|
| Agent is judgment-heavy (review, triage, audit) and lacks calibration | raise |
| Agent produces unbounded output with no size constraint | raise |
| Agent is a focused sub-agent (context management less critical) | lower |
| Issue is stylistic preference rather than functional impact | lower (often to low) |
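The raise/lower adjustments can be pictured as moving along the tier ladder. This sketch assumes each signal shifts the default by exactly one tier and clamps at the ends; the source does not say how far to move, so treat the step size as an assumption. `adjust_severity` is a hypothetical name.

```python
# Tiers ordered from least to most severe.
SEVERITY_TIERS = ["low", "medium", "high", "blocker"]


def adjust_severity(default: str, effect: str) -> str:
    """Shift a default severity one tier up or down, clamped to the ladder."""
    index = SEVERITY_TIERS.index(default)
    if effect == "raise":
        index = min(index + 1, len(SEVERITY_TIERS) - 1)
    elif effect == "lower":
        index = max(index - 1, 0)
    return SEVERITY_TIERS[index]


print(adjust_severity("high", "raise"))
print(adjust_severity("medium", "lower"))
```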
Surface a finding only if it is <certain> or <speculative>. For each
<speculative> finding, re-read the full definition file and re-derive the
reasoning once without looking at your first pass; if it doesn't reproduce, drop
it to <don't know>. <don't know> findings never surface — count them in the
below-bar tally. Order the report by severity (blocker → low); within a tier,
<certain> before <speculative>.
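The surfacing rule and report ordering can be sketched as a filter followed by a two-key sort. The `Finding` dataclass and `build_report` function are illustrative assumptions, not part of the skill; the re-derivation step for speculative findings is assumed to have happened before this point.

```python
from dataclasses import dataclass

SEVERITY_ORDER = ["blocker", "high", "medium", "low"]
SURFACED_TAGS = ["<certain>", "<speculative>"]  # don't-know never surfaces


@dataclass
class Finding:
    issue: str
    severity: str
    confidence: str


def build_report(findings: list[Finding]) -> tuple[list[Finding], int]:
    """Return surfaced findings in report order plus the below-bar count."""
    surfaced = [f for f in findings if f.confidence in SURFACED_TAGS]
    surfaced.sort(key=lambda f: (SEVERITY_ORDER.index(f.severity),
                                 SURFACED_TAGS.index(f.confidence)))
    return surfaced, len(findings) - len(surfaced)


findings = [
    Finding("unbounded output", "high", "<speculative>"),
    Finding("no calibration", "blocker", "<certain>"),
    Finding("write tools on reviewer", "high", "<certain>"),
    Finding("possible stale path", "low", "<don't know>"),
]
surfaced, below_bar = build_report(findings)
for f in surfaced:
    print(f.severity, f.confidence, f.issue)
print("below the bar:", below_bar)
```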
## Skill Improvement Report: <name>
### Summary
- Type: agent | skill
- Model: <model>
- Tools: <N allowed, N disallowed>
- Prompt size: <N lines>
- Findings: N total (N surfaced, N below the bar / don't-know)
### Recommendations (surfaced)
| # | Severity | Confidence | Category | Issue | Recommendation |
|---|----------|------------|----------|-------|----------------|
| 1 | blocker | `<certain>` | SCORING | No confidence calibration on judgment agent | Add confidence/severity tags (see /age pattern) |
| 2 | high | `<certain>` | TOOLS | Edit/Write allowed on read-only reviewer | Add to disallowedTools |
| 3 | high | `<speculative>` | CONTEXT | Unbounded output, no summary/detail split | Write details to $TMPDIR, return summary |
### Detailed Recommendations
For each surfaced finding, expand with:
- **What**: the specific issue
- **Why**: why it matters (with reference to a pattern or principle)
- **How**: concrete fix, ideally with the exact frontmatter or section to add/change
- **Reference**: link to an agent/skill that does it right
### Recommended Hooks (if applicable)
For findings where enforcement matters more than guidance, suggest a companion
hook from `references/hooks-catalog.md`:
| Finding # | Hook Type | What It Enforces |
|-----------|-----------|-----------------|
| (only include findings where a hook would help — not every finding needs one) |
If no findings warrant hooks, omit this section.
### Below the Bar
N findings were `<don't know>` or speculative-trivial (not shown)
If the audit reveals activation or trigger issues, suggest the user run
/skill-creator to generate eval queries and measure trigger rate before/after
applying changes. Static audit identifies problems; eval-driven iteration
validates fixes.
These are the most common issues across agent/skill definitions, ordered by how often they appear and how much impact they have:
Judgment without calibration — Agent makes pass/fail decisions but has no
confidence/severity framework. Every reviewer, auditor, and triage agent needs
confidence tags (<certain>/<speculative>/<don't know>) plus severity tiers.
Prose-only tool constraints — Agent says "read-only" in its body but
doesn't set disallowedTools: [Edit, Write, NotebookEdit] in frontmatter.
Prose constraints are weaker than hard blocks — models with write tools
available are more likely to take irreversible actions when stuck.
Unbounded tool access — Agent has all tools available when it only needs 3-4. Especially common in agents cloned from a general-purpose template.
Monolithic output — Agent dumps everything into the conversation instead of writing details to a temp file and returning a summary. Pollutes the orchestrator's context window.
Missing model directive — No model: in frontmatter. Defaults to
whatever the parent uses, which may be wrong (opus for a haiku-appropriate
fetch task, or haiku for a judgment-heavy review).
No output format — Agent has no defined output structure. Results vary
wildly between invocations. Structured output (tables) reduces unverifiable
claims because you need a file:line to fill the cell.
No "What You Don't Do" section — Pipeline agents without explicit negative constraints have the most scope-creep and overlap risk with adjacent phases.
No wrap-up signal — Long-running agents without a tool-call limit (e.g., "after ~60 tool calls, wrap up") can run indefinitely, consuming context until performance degrades.
PR skills without health checks — Skills that respond to PR review
comments but don't check CI status or merge conflicts first. Build failures
and merge conflicts should be fixed before processing review comments —
comments may be moot if the build is broken. Reference: /respond checks
get_check_runs and mergeable_state in Phase 0.
The remaining anti-patterns (freeform instructions, missing "why", over-specified role, passive descriptions, negation-heavy rules, critical-rule-as-instruction-only, always/never for judgment, missing Gotchas) are lower-frequency and covered by the dimension audit sections above.
Read these before making changes to the relevant dimension:
- ../session-analytics/references/calibration.md: the shared confidence × severity model (imported, not redefined here)
- references/skill-usage.md, references/agent-orchestration.md, references/drift-regression.md: this skill's owned analytics packs (Phase 1.5)
- references/description-optimization.md: trigger optimization with before/after examples
- references/decision-frameworks.md: structured reasoning scaffold, degrees of freedom, example-driven spec
- references/hooks-catalog.md: JS hook examples for activation, validation, enforcement
<don't know> recommendations stay in the below-bar count
disallowedTools even when the agent's tool list is naturally constrained by platform defaults