optimizing-skills/SKILL.md
Disciplined, validation-gated revision of an EXISTING skill so each edit is a measured improvement rather than a guess. Use when editing, revising, or tuning a skill that already exists and there is evidence it underperforms (observed failures, drift, complaints) — invoke by name, or have versioning-skills / creating-skill defer to it before applying edits. Not for authoring a brand-new skill from scratch (use creating-skill) or one-off prose.
Treat the skill document as the parameter under optimization: change it only when the change demonstrably beats the version you already ship. This is the discipline distilled from SkillOpt (microsoft/SkillOpt, arXiv:2605.23904) — its training apparatus dropped, its reproducibility discipline kept. The point is to stop editing skills on intuition and start editing them on evidence.
A skill edit is only worth shipping if it strictly improves measured behavior on a held-out check. Most edits that feel like improvements don't move the needle, and some quietly regress. The gate below is what separates a real improvement from a confident guess.
best = what you currently ship (never let it silently degrade). candidate = best + your proposed edits.
subagent_type=general-purpose) with the skill version in
context, or evaluate by hand for small sets. Score per criterion, not one
collapsed pass/fail. When a task carries several criteria, the criterion that
decides accept/reject is the failure that prompted this revision; the
others are regression guards that must not get worse. Collapsing criteria
masks the win: in the down-skilling-v1.2.0 retro, the edit drove architectural
hallucination 60%→0% while an unrelated length criterion stayed 0/5 in both
arms — a single combined pass/fail scored that as a 0–0 tie and would have
rejected a large, real improvement.
candidate strictly beats best on the triggering-failure
criterion, with no regression guard worse. Ties → reject, keep best. An edit
that doesn't move the needle does not ship.
When the skill's own output is compiled by an Agent (down-skilling and creating-skill produce a prompt an author writes from the SKILL), score ≥2 author samples per version, or fix one author across both arms. A single author sample per arm lets author capability dominate the edit effect: the same down-skilling edit measured 95%→0% with one author pair and 60%→0% with another. The effect is real either way, but n=1 cannot tell a real edit from a lucky author.
This two-tier best/candidate split is the heart of it: a working revision
can explore, but the shipped skill only ever ratchets upward.
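The accept rule above can be sketched in a few lines. This is a minimal illustration, not SkillOpt's code: the function name `accept`, the criterion names, and the choice to store scores as pass rates (higher is better) are all assumptions. The numbers mirror the down-skilling retro: a 60% hallucination rate becomes a 0.4 pass rate, and the length criterion sits at 0/5 in both arms.

```python
def accept(best: dict[str, float], candidate: dict[str, float], trigger: str) -> bool:
    """Gate a candidate skill version against the shipped one.

    Scores are per-criterion pass rates in [0, 1], higher is better
    (an assumption of this sketch). `trigger` names the criterion for
    the failure that prompted the revision.
    """
    # The deciding criterion must strictly improve; a tie rejects.
    if candidate[trigger] <= best[trigger]:
        return False
    # Every other criterion is a regression guard: it must not get worse.
    return all(
        candidate[name] >= best[name]
        for name in best
        if name != trigger
    )


# Hypothetical criterion names; pass rates derived from the retro's figures.
best = {"no_architectural_hallucination": 0.4, "length_ok": 0.0}
candidate = {"no_architectural_hallucination": 1.0, "length_ok": 0.0}

shipped = candidate if accept(best, candidate, "no_architectural_hallucination") else best
```

Scoring per criterion is what lets this candidate through: the guard stays flat at 0.0 and the trigger strictly improves. A single combined pass/fail would see both arms fail on length and call it a tie.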
Cap edits per revision — default ~4 distinct add/replace/delete operations, fewer as the skill matures. Large speculative rewrites drift and destroy your ability to attribute a regression to a cause. If a revision wants more edits than the budget, rank and keep the top ones (below) and let the rest wait.
Separate the evidence before proposing edits:
For both: edits must generalize (never hardcode task-specific values), and must not duplicate content already in the skill — patch genuine gaps only.
When candidate edits exceed the budget, keep them in this priority order:
Drop the rest. They can return next revision if still warranted.
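Applying the budget is a sort and a cut. The sketch below assumes each proposed edit already carries a rank assigned under the skill's priority order (lower rank = keep first). The helper `apply_budget` and the example edits are hypothetical.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def apply_budget(
    edits: list[T], rank: Callable[[T], int], budget: int = 4
) -> tuple[list[T], list[T]]:
    """Split proposed edits into (kept, deferred) under an edit budget.

    sorted() is stable, so edits with equal rank keep their proposal order.
    """
    ordered = sorted(edits, key=rank)
    return ordered[:budget], ordered[budget:]


# Hypothetical proposals as (description, rank) pairs; ranks are placeholders.
proposed = [
    ("tighten trigger wording", 2),
    ("add missing failure case", 1),
    ("delete stale example", 3),
    ("rephrase intro", 5),
    ("add second failure case", 1),
    ("reorder sections", 4),
]

kept, deferred = apply_budget(proposed, rank=lambda edit: edit[1], budget=4)
```

The deferred list is worth saving: those edits return next revision only if the evidence still supports them.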
If a skill has a battle-tested core that routine edits keep eroding, fence it off and treat it as off-limits to fast edits. Revisit it only on a deliberate longitudinal review: compare the same check tasks across several versions to catch slow drift and regressions that single-edit review misses. (SkillOpt fences this region with HTML-comment markers and only rewrites it at epoch boundaries — the same idea, manual cadence.)
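A longitudinal review can be as simple as comparing each criterion's latest score against its peak across versions. The sketch below is one way to do that by hand. The function `find_drift`, the version labels, and the scores are made up for illustration, and scores are again assumed to be pass rates where higher is better.

```python
def find_drift(
    history: list[tuple[str, dict[str, float]]], tolerance: float = 0.0
) -> dict[str, tuple[str, float, str, float]]:
    """Report criteria whose latest score sits below their historical peak.

    `history` is ordered oldest to newest: (version, {criterion: pass_rate}),
    all scored on the same check tasks. Returns, per drifting criterion,
    (peak_version, peak_score, latest_version, latest_score).
    """
    latest_version, latest = history[-1]
    drift = {}
    for criterion, current in latest.items():
        # Earliest version that reached the best score for this criterion.
        peak_version, peak = max(
            ((version, scores[criterion]) for version, scores in history if criterion in scores),
            key=lambda pair: pair[1],
        )
        if peak - current > tolerance:
            drift[criterion] = (peak_version, peak, latest_version, current)
    return drift


# Illustrative data only: small per-revision slips that add up.
history = [
    ("v1.0", {"accuracy": 0.90, "brevity": 0.80}),
    ("v1.1", {"accuracy": 0.88, "brevity": 0.80}),
    ("v1.2", {"accuracy": 0.86, "brevity": 0.85}),
]

drift = find_drift(history)
```

Here `accuracy` is flagged against v1.0 even though no single step looked alarming, which is the kind of slow erosion that single-edit review misses.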
After a revision, record what you learned about editing this skill — which
kinds of edits helped, which were brittle, redundant, or harmful — via
remember() tagged with the skill name. Before the next revision, recall()
it. This is the compounding part: each revision starts smarter than the last,
the way SkillOpt's optimizer-side meta-skill conditions its future edits.
Edits are literal string operations (the Edit tool): the target text must
match exactly or the edit is a silent no-op. Keep targets unique and
verbatim. Prefer append / insert-after-heading / replace-exact / delete-exact,
and verify each edit landed before scoring.
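To see why exact, unique targets matter, here is a minimal sketch of a replace-exact operation that fails loudly instead of silently doing nothing. It illustrates the idea and is not the Edit tool's implementation; `replace_exact` and the sample skill text are hypothetical.

```python
def replace_exact(text: str, target: str, replacement: str) -> str:
    """Replace `target` with `replacement`, requiring exactly one match.

    Raises ValueError when the target is empty, missing (the would-be
    silent no-op), or ambiguous (more than one match).
    """
    if not target:
        raise ValueError("target must be non-empty")
    count = text.count(target)
    if count != 1:
        raise ValueError(f"target must match exactly once, matched {count} times")
    return text.replace(target, replacement, 1)


# Hypothetical skill text for illustration.
skill = "## Steps\nRun the check.\nShip if it passes.\n"
updated = replace_exact(skill, "Run the check.", "Run the check on held-out tasks.")

# Verify the edit landed before scoring.
landed = "Run the check on held-out tasks." in updated and updated != skill
```

A target that is off by a single character raises here, where a plain `str.replace` would return the text unchanged.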
best; accept decided by the triggering-failure criterion, others as regression guards; shipped only if strictly better
remember()
For the deeper "dispatch reflection/scoring to the Agent tool" recipe and the adapted reflection/ranking prompt templates, see references/skillopt-provenance.md.