Adoption

Agent Skills are supported by leading AI development tools.

VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory

oaustegard/optimizing-skills

Name: optimizing-skills
Author: oaustegard

optimizing-skills/SKILL.md

npx skillsauth add oaustegard/claude-skills optimizing-skills

Clean

TrivyContainer and dependency vulnerability scanner

Clean

SemgrepStatic code analysis for vulnerabilities

Clean

mcp-scan (Snyk)Model Context Protocol security validation

Skipped

Snyk (dep)Open source security scanning

Skipped

Socket.devSupply chain security analysis

Skipped

VirusTotalMulti-engine malware detection

Skipped

CrowdStrikeAdvanced threat intelligence

Skipped

OSV-ScannerOpen Source Vulnerability database check

Skipped

OWASP Dep-Check

Optimizing Skills

Treat the skill document as the parameter under optimization: change it only when the change demonstrably beats the version you already ship. This is the discipline distilled from SkillOpt (microsoft/SkillOpt, arXiv:2605.23904) — its training apparatus dropped, its reproducibility discipline kept. The point is to stop editing skills on intuition and start editing them on evidence.

Core principle

A skill edit is only worth shipping if it strictly improves measured behavior on a held-out check. Most edits that feel like improvements don't move the needle, and some quietly regress. The gate below is what separates a real improvement from a confident guess.

The gate — run it every revision

Assemble a held-out check set. 3–8 representative tasks/prompts the skill should handle well, and it must include the failure(s) that prompted this revision. Keep the set fixed across the revision so before/after scores are comparable.
Hold two versions. best = what you currently ship (never let it silently degrade). candidate = best + your proposed edits.
Score both on the check set. "Run" here = dispatch each check task to the Agent tool (subagent_type=general-purpose) with the skill version in context, or evaluate by hand for small sets. Score per criterion, not one collapsed pass/fail. When a task carries several criteria, the criterion that decides accept/reject is the failure that prompted this revision; the others are regression guards that must not get worse. Collapsing criteria masks the win: in the down-skilling-v1.2.0 retro, the edit drove architectural hallucination 60%→0% while an unrelated length criterion stayed 0/5 in both arms — a single combined pass/fail scored that as a 0–0 tie and would have rejected a large, real improvement.
Accept only if candidate strictly beats best on the triggering-failure criterion, with no regression guard worse. Ties → reject, keep best. An edit that doesn't move the needle does not ship.

When the skill's own output is compiled by an Agent (down-skilling and creating-skill produce a prompt an author writes from the SKILL), score ≥2 author samples per version, or fix one author across both arms. A single author sample per arm lets author capability dominate the edit effect: the same down-skilling edit measured 95%→0% with one author pair and 60%→0% with another — real either way, but n=1 cannot tell a real edit from a lucky author.

This two-tier best/candidate split is the heart of it: a working revision can explore, but the shipped skill only ever ratchets upward.

Bounded edits (the "textual learning rate")

Cap edits per revision — default ~4 distinct add/replace/delete operations, fewer as the skill matures. Large speculative rewrites drift and destroy your ability to attribute a regression to a cause. If a revision wants more edits than the budget, rank and keep the top ones (below) and let the rest wait.

Reflect: failures first, then successes

Separate the evidence before proposing edits:

Failure reflection. Across the failing cases, find the common, systematic pattern — not a one-off edge case. Propose edits that fix the pattern. Failures take priority in any merge.
Success reflection. Across cases that already work, find generalizable patterns worth encoding so they survive future edits. Reinforce; don't duplicate.

For both: edits must generalize (never hardcode task-specific values), and must not duplicate content already in the skill — patch genuine gaps only.

Rank when over budget

When candidate edits exceed the budget, keep them in this priority order:

Systematic impact — fixes a recurring failure across many cases, not one.
Complementarity — fills a real gap rather than restating existing content.
Generality — phrased as a durable principle, not tied to one task/entity.
Actionability — concrete, followable guidance over vague advice.

Drop the rest. They can return next revision if still warranted.

Protect the hard-won core

If a skill has a battle-tested core that routine edits keep eroding, fence it off and treat it as off-limits to fast edits. Revisit it only on a deliberate longitudinal review: compare the same check tasks across several versions to catch slow drift and regressions that single-edit review misses. (SkillOpt fences this region with HTML-comment markers and only rewrites it at epoch boundaries — the same idea, manual cadence.)

Carry memory across revisions

After a revision, record what you learned about editing this skill — which kinds of edits helped, which were brittle, redundant, or harmful — via remember() tagged with the skill name. Before the next revision, recall() it. This is the compounding part: each revision starts smarter than the last, the way SkillOpt's optimizer-side meta-skill conditions its future edits.

Edit mechanics

Edits are literal string operations (the Edit tool): the target text must match exactly or the edit is a silent no-op. Keep targets unique and verbatim. Prefer append / insert-after-heading / replace-exact / delete-exact, and verify each edit landed before scoring.

When NOT to use this

Authoring a brand-new skill from scratch → creating-skill.
Tracking/rolling back versions during development → versioning-skills.
One-off prose with no reuse → just write it.

Checklist

[ ] Held-out check set assembled (includes the triggering failure), fixed for the revision
[ ] Edits bounded (~4 max), each generalizable and non-duplicative
[ ] Failure patterns addressed before success reinforcement
[ ] Candidate scored per-criterion against best; accept decided by the triggering-failure criterion, others as regression guards; shipped only if strictly better
[ ] For Agent-compiled artifacts (down-skilling, creating-skill): ≥2 author samples per version, or a fixed author across arms
[ ] Hard-won core left untouched unless doing a deliberate longitudinal review
[ ] Lesson about editing this skill recorded via remember()

For the deeper "dispatch reflection/scoring to the Agent tool" recipe and the adapted reflection/ranking prompt templates, see references/skillopt-provenance.md.

oaustegard/optimizing-skills

optimizing-skills/SKILL.md

Disciplined, validation-gated revision of an EXISTING skill so each edit is a measured improvement rather than a guess. Use when editing, revising, or tuning a skill that already exists and there is evidence it underperforms (observed failures, drift, complaints) — invoke by name, or have versioning-skills / creating-skill defer to it before applying edits. Not for authoring a brand-new skill from scratch (use creating-skill) or one-off prose.

124 stars

testing

Updated May 30, 2026

$ install --global

skillsauth

npx skillsauth add oaustegard/claude-skills optimizing-skills

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners Passed

Scanners in report

Clean

TrivyContainer and dependency vulnerability scanner

95%

Clean

SemgrepStatic code analysis for vulnerabilities

95%

Clean

mcp-scan (Snyk)Model Context Protocol security validation

95%

Skipped

Snyk (dep)Open source security scanning

50%

Skipped

Socket.devSupply chain security analysis

50%

Skipped

VirusTotalMulti-engine malware detection

50%

Skipped

CrowdStrikeAdvanced threat intelligence

50%

Skipped

OSV-ScannerOpen Source Vulnerability database check

50%

Skipped

OWASP Dep-Check

50%

Last scanned: May 30, 2026, 6:36 AM8.2s4 files scanned

SKILL.md

name:: optimizing-skills
description:: Disciplined, validation-gated revision of an EXISTING skill so each edit is a measured improvement rather than a guess. Use when editing, revising, or tuning a skill that already exists and there is evidence it underperforms (observed failures, drift, complaints) — invoke by name, or have versioning-skills / creating-skill defer to it before applying edits. Not for authoring a brand-new skill from scratch (use creating-skill) or one-off prose.
version:: 0.2.0

Optimizing Skills

Core principle

The gate — run it every revision

Assemble a held-out check set. 3–8 representative tasks/prompts the skill should handle well, and it must include the failure(s) that prompted this revision. Keep the set fixed across the revision so before/after scores are comparable.
Hold two versions. best = what you currently ship (never let it silently degrade). candidate = best + your proposed edits.
Score both on the check set. "Run" here = dispatch each check task to the Agent tool (subagent_type=general-purpose) with the skill version in context, or evaluate by hand for small sets. Score per criterion, not one collapsed pass/fail. When a task carries several criteria, the criterion that decides accept/reject is the failure that prompted this revision; the others are regression guards that must not get worse. Collapsing criteria masks the win: in the down-skilling-v1.2.0 retro, the edit drove architectural hallucination 60%→0% while an unrelated length criterion stayed 0/5 in both arms — a single combined pass/fail scored that as a 0–0 tie and would have rejected a large, real improvement.
Accept only if candidate strictly beats best on the triggering-failure criterion, with no regression guard worse. Ties → reject, keep best. An edit that doesn't move the needle does not ship.

This two-tier best/candidate split is the heart of it: a working revision can explore, but the shipped skill only ever ratchets upward.

Bounded edits (the "textual learning rate")

Reflect: failures first, then successes

Separate the evidence before proposing edits:

Failure reflection. Across the failing cases, find the common, systematic pattern — not a one-off edge case. Propose edits that fix the pattern. Failures take priority in any merge.
Success reflection. Across cases that already work, find generalizable patterns worth encoding so they survive future edits. Reinforce; don't duplicate.

For both: edits must generalize (never hardcode task-specific values), and must not duplicate content already in the skill — patch genuine gaps only.

Rank when over budget

When candidate edits exceed the budget, keep them in this priority order:

Systematic impact — fixes a recurring failure across many cases, not one.
Complementarity — fills a real gap rather than restating existing content.
Generality — phrased as a durable principle, not tied to one task/entity.
Actionability — concrete, followable guidance over vague advice.

Drop the rest. They can return next revision if still warranted.

Protect the hard-won core

Carry memory across revisions

Edit mechanics

When NOT to use this

Authoring a brand-new skill from scratch → creating-skill.
Tracking/rolling back versions during development → versioning-skills.
One-off prose with no reuse → just write it.

Checklist

[ ] Held-out check set assembled (includes the triggering failure), fixed for the revision
[ ] Edits bounded (~4 max), each generalizable and non-duplicative
[ ] Failure patterns addressed before success reinforcement
[ ] Candidate scored per-criterion against best; accept decided by the triggering-failure criterion, others as regression guards; shipped only if strictly better
[ ] For Agent-compiled artifacts (down-skilling, creating-skill): ≥2 author samples per version, or a fixed author across arms
[ ] Hard-won core left untouched unless doing a deliberate longitudinal review
[ ] Lesson about editing this skill recorded via remember()

For the deeper "dispatch reflection/scoring to the Agent tool" recipe and the adapted reflection/ranking prompt templates, see references/skillopt-provenance.md.

Related Skills

oaustegard/semantic-grep

development

VerifiedTrustedCommunity

In-process semantic search over text files or in-memory strings, using Gemini embeddings via the CF AI Gateway. Use when user wants fuzzy/conceptual search where exact-keyword grep would miss — "sessions discussing regulatory constraints", "code about retry logic", "notes mentioning burnout even if the word isn't there". Complements searching-codebases (regex/AST) and extracting-keywords (YAKE). Do NOT use when an exact string/regex match is what's wanted — grep/rg wins on speed and precision there.

135SKILL.mdUpdated Jul 26, 2026

oaustegard/semantic-grep

oaustegard/assessing-impact

development

VerifiedTrustedCommunity

Pre-change blast-radius report for a symbol or file. Walks tree-sitting references, augments with a plain-text scan over non-parsed files (configs, plain docs), and clusters affected sites by feature (`_FEATURES.md`) or top-level package. Use when about to refactor, rename, or delete something in a repo you don't own — "what breaks if I change `validateUser`", "who calls this", "is this safe to remove", "where is this used", "blast radius", "impact analysis". This is the CONVERGENT pre-change risk skill — for "what is this repo?" use exploring-codebases; for "where is X?" use searching-codebases.

135SKILL.mdUpdated Jul 26, 2026

oaustegard/assessing-impact

oaustegard/verifying-claims

development

VerifiedTrustedCommunity

Check that a document's claims about code are actually true by reading the prose, the code, and the tests and reporting (or fixing) where they disagree. Use whenever the user wants to verify a README, guide, spec, or docstring still matches the code; whenever they mention documentation drift, doc-code sync, "is this still accurate", stale docs, or keeping docs/tests/code consistent; before publishing or merging a docs change; or as a periodic doc-accuracy sweep. The agent reads the prose's meaning directly — there is no claim-comment DSL to maintain. Pairs with TDD — the test suite is the deterministic behavioral gate, this skill is the semantic prose-vs-reality review.

135SKILL.mdUpdated Jul 26, 2026

oaustegard/verifying-claims

oaustegard/parsing-video

tools

VerifiedTrustedCommunity

Interpret video content visually by sampling frames into timestamped contact sheets that can be read as images. Use when: user asks what happens in a video; asks to summarize, describe, review, or QA video content or footage; asks about scenes, actions, people, or objects in a video; needs a storyboard-style overview of a clip; asks to find where something occurs in a video. Triggers on 'watch this video', 'what's in this video', 'summarize the video', 'describe the footage', 'contact sheet', 'storyboard', 'review this clip', 'find the scene where', 'scene detection', 'shot boundaries', 'detect cuts', 'dwell points'. For converting, trimming, or transcoding video, use processing-video instead.

135SKILL.mdUpdated Jul 19, 2026

oaustegard/parsing-video

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Need help? View full Cowork setup guide →

Install manually

Choose your platform

# Clone the repo
git clone https://github.com/oaustegard/claude-skills.git

# Copy into Claude Code skills folder (global)
cp -r claude-skills/optimizing-skills ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

oaustegard/claude-skills

124 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT