framework/claude/skills/sdd-forge-skill/SKILL.md
Create, rebuild, improve, evaluate, compare, and optimize AI agent skills. Use this skill whenever the user wants to make a skill, create a skill, turn something into a skill, improve a skill, edit a skill, fix a skill, reforge a skill, rebuild a skill from scratch, regenerate a skill, test a skill, run evals, benchmark a skill, compare skill versions, check if a new version is better, optimize a skill description, or improve skill triggering. Covers the full skill lifecycle from idea to polished, tested, well-triggering skill.
npx skillsauth add sync-dev-org/sync-sdd sdd-forge-skillInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Adapted from anthropics/skills (Apache 2.0). See LICENSE.txt.
Create, rebuild, improve, evaluate, compare, and optimize AI agent skills through iterative development.
| Mode | Purpose | Entry |
|------|---------|-------|
| create | Interview user, draft skill, eval, iterate | "make a skill", "create a skill", "turn this into a skill" |
| reforge | Rebuild existing skill from scratch via requirements extraction | "reforge skill", "rebuild skill", "regenerate skill" |
| improve | Iterate on existing skill with feedback | "improve skill", "edit skill", "fix skill" |
| eval | Run test cases, grade, benchmark, review | "test skill", "run evals", "benchmark" |
| compare | Blind A/B comparison of two skill versions | "compare versions", "is the new version better?" |
| optimize-description | Automated description optimization loop | "optimize description", "improve triggering" |
The core loop, regardless of mode:
Figure out where the user is in this process and help them progress. Be flexible — if the user says "just vibe with me", skip the formal eval/benchmark loop.
The skill forge is used by people across a wide range of technical familiarity. Pay attention to context cues:
All Python scripts are at ${CLAUDE_SKILL_DIR}/scripts/ and ${CLAUDE_SKILL_DIR}/eval-viewer/. Use env prefix with PYTHONPATH — bare PYTHONPATH=... python triggers Claude Code's environment variable prefix heuristic:
env PYTHONPATH=${CLAUDE_SKILL_DIR} python -m scripts.<module_name> <args>
Example:
env PYTHONPATH=${CLAUDE_SKILL_DIR} python -m scripts.run_loop --eval-set eval.json --skill-path ./my-skill --model claude-sonnet-4-6 --verbose
For non-module scripts (eval-viewer), use the full path:
env PYTHONPATH=${CLAUDE_SKILL_DIR} python ${CLAUDE_SKILL_DIR}/eval-viewer/generate_review.py <args>
Gather from the user (extract from conversation history first if context already exists):
Proactively ask about edge cases, dependencies, and example files. Keep it conversational — don't over-interview.
Spawn a generic Agent() with run_in_background: true. The prompt should include:
${CLAUDE_SKILL_DIR}/references/writer.md and follow its instructions."create.sdd/session/decisions.yaml, .sdd/session/knowledge.yaml (if they exist)${CLAUDE_SKILL_DIR} valueRead the writer's output. Check:
AskUserQuestion in allowed-toolsPresent the draft to the user. If they want changes, re-dispatch in improve mode.
If the user wants to test, switch to Eval mode. Otherwise iterate on feedback.
Rebuild an existing skill from scratch. The key insight: requirements-only prompts produce higher quality than detailed specifications. Strip away implementation details and let the writer rediscover the best approach.
Use mv (not cp) for each file being reforged. Physical removal prevents the writer SubAgent from reading the old version — filesystem constraints are more reliable than prompt constraints.
mv {file} {file}.bak{N}
Increment N from 1. Check existing .bak* files to find the next available number. When multiple files are being reforged, issue all mv commands as parallel Bash calls in a single turn.
Reforge scope can include SKILL.md and any reference .md files the user specifies — not just SKILL.md alone. Scripts, assets, and other non-target files are excluded.
From the backed-up files, extract what the skill does, not how:
Do NOT extract implementation details (step procedures, variable names, section headings). The goal is a clean brief that communicates intent without biasing the writer's design.
Catalog everything the skill touches:
Present the inventory to the user for confirmation.
Send requirements + interfaces only. No old implementation details. Include the explicit instruction: "Generate a completely fresh SKILL.md. Do NOT read the existing skill."
Compare old (.bak{N}) and new versions. Report:
.bak{N}Read the existing skill. Ask the user what needs to change — specific complaints, eval results showing problems, benchmark regressions.
Spawn Agent() with run_in_background: true:
${CLAUDE_SKILL_DIR}/references/writer.md and follow its instructions."improveHow to think about improvements:
scripts/.Re-eval if test cases exist. Present results. Keep going until the user is satisfied or feedback is all empty.
Run test cases against a skill, grade assertions, aggregate into benchmark statistics, and present results in an HTML viewer for qualitative and quantitative review. This section is one continuous sequence — don't stop partway through.
Check for evals/evals.json in the skill directory. If absent, create 2-3 realistic test prompts — the kind of thing a real user would say. Share with the user for review.
Save to evals/evals.json. Don't write assertions yet — you'll draft them while runs are in progress.
{
"skill_name": "example-skill",
"evals": [
{
"id": 1,
"prompt": "User's task prompt",
"expected_output": "Description of expected result",
"files": []
}
]
}
See ${CLAUDE_SKILL_DIR}/references/schemas.md for the full schema.
Put results in <skill-name>-workspace/ as a sibling to the skill directory. Within the workspace, organize by iteration (iteration-1/, iteration-2/, etc.) and by test case (eval-0/, eval-1/, etc.).
Create all workspace directories in one Bash call upfront:
mkdir -p <workspace>/iteration-N/eval-0/{with_skill,without_skill}/outputs <workspace>/iteration-N/eval-1/{with_skill,without_skill}/outputs
For each test case, spawn two subagents in the same turn — with-skill and baseline. Don't do all with-skill first; launch everything at once so it all finishes around the same time.
With-skill run: Same prompt, with the skill loaded, save outputs to <workspace>/iteration-N/eval-ID/with_skill/outputs/.
Baseline run (depends on context):
without_skill/outputs/<workspace>/skill-snapshot/ first), save to old_skill/outputs/Write eval_metadata.json for each test case. Give descriptive names, not just "eval-0".
Use the time productively. Draft quantitative assertions for each test case and explain them to the user. Good assertions are objectively verifiable with descriptive names that read clearly in the benchmark viewer.
Don't force assertions onto subjective skills (writing style, design quality) — those are better evaluated qualitatively.
Update eval_metadata.json and evals/evals.json with assertions once drafted.
When each subagent completes, the task notification includes total_tokens and duration_ms. Save immediately to timing.json in the run directory — this data isn't persisted elsewhere.
{
"total_tokens": 84852,
"duration_ms": 23332,
"total_duration_seconds": 23.3
}
Once all runs complete:
Grade — dispatch all grader subagents in a single turn (one Agent() per run, all with run_in_background: true). Each grader follows ${CLAUDE_SKILL_DIR}/references/grader.md and saves grading.json in its run directory. The grading.json expectations array must use fields text, passed, and evidence — the viewer depends on these exact names. Wait for all grader notifications before proceeding.
Aggregate + Analyze + Launch — after all grading completes, do these in one turn:
First, aggregate:
env PYTHONPATH=${CLAUDE_SKILL_DIR} python -m scripts.aggregate_benchmark <workspace>/iteration-N --skill-name <name>
Produces benchmark.json and benchmark.md. Put each with_skill version before its baseline counterpart.
Then read benchmark.json, do an analyst pass (surface patterns the aggregate stats hide — see ${CLAUDE_SKILL_DIR}/references/analyzer.md), and launch the viewer:
env PYTHONPATH=${CLAUDE_SKILL_DIR} python ${CLAUDE_SKILL_DIR}/eval-viewer/generate_review.py <workspace>/iteration-N --skill-name "my-skill" --benchmark <workspace>/iteration-N/benchmark.json
Run the viewer via Bash(run_in_background=true) — do not use nohup ... & (triggers & and 2> security heuristics).
For iteration 2+, pass --previous-workspace <workspace>/iteration-<N-1>.
Headless environments: use --static <output_path> for a standalone HTML file. Feedback downloads as feedback.json when the user clicks "Submit All Reviews".
The Outputs tab shows one test case at a time: prompt, output files, previous output (iteration 2+), formal grades (if grading ran), and a feedback textbox.
The Benchmark tab shows stats: pass rates, timing, token usage for each configuration, with per-eval breakdowns and analyst observations.
Navigation via prev/next buttons or arrow keys. "Submit All Reviews" saves all feedback to feedback.json.
Read feedback.json. Empty feedback means the user thought it was fine. Focus improvements on test cases with specific complaints.
Kill the viewer when done: kill $VIEWER_PID (omit 2>/dev/null — the 2> redirect triggers security heuristics; tolerate stderr).
If improvements needed, switch to Improve mode, then rerun evals into iteration-<N+1>/.
Blind A/B comparison for rigorous version comparison.
Execute the same eval set against both skill versions. Collect transcripts and outputs.
For each eval, dispatch a comparator subagent (Agent() with run_in_background: true). It receives outputs labeled A and B without knowing which version produced which. See ${CLAUDE_SKILL_DIR}/references/comparator.md.
After blind comparison, dispatch the analyzer with full context (both skills revealed). It explains why the winner won and generates improvement suggestions. See ${CLAUDE_SKILL_DIR}/references/analyzer.md.
Present: winner, rubric scores, quality assessment, improvement suggestions. Let the user decide next steps.
The description field is the primary mechanism that determines whether Claude invokes a skill. This mode automates optimization for better triggering.
Create 20 eval queries — a mix of should-trigger (8-10) and should-not-trigger (8-10). Save as JSON.
Queries must be realistic — concrete, specific, with detail (file paths, context about the user's situation, column names). Use varied lengths, some casual, some formal. Focus on edge cases, not clear-cut examples.
For should-trigger: different phrasings of the same intent, cases where the user doesn't name the skill explicitly but clearly needs it, uncommon use cases, competitive cases where this skill should win.
For should-not-trigger: near-misses that share keywords but need something different. "Write a fibonacci function" as a negative for a PDF skill is too easy. The negative cases should be genuinely tricky.
Present queries via the HTML template:
${CLAUDE_SKILL_DIR}/assets/eval_review.html__EVAL_DATA_PLACEHOLDER__ with the JSON array, __SKILL_NAME_PLACEHOLDER__ and __SKILL_DESCRIPTION_PLACEHOLDER__ with the skill's valuesopen /tmp/eval_review_<skill-name>.html~/Downloads/ for the exported eval_set.jsonenv PYTHONPATH=${CLAUDE_SKILL_DIR} python -m scripts.run_loop --eval-set <path-to-eval-set.json> --skill-path <path-to-skill> --model <model-id-powering-this-session> --max-iterations 5 --verbose
Run via Bash(run_in_background=true) — this is a long-running process.
Use the model ID from the system prompt so triggering tests match the user's actual experience.
The loop splits queries into 60% train / 40% test, runs each query 3x for reliability, generates improved descriptions, and selects by test score (not train) to avoid overfitting. Up to 5 iterations.
Skills appear in Claude's available_skills list with name + description. Claude decides whether to consult a skill based on that. The key: Claude only consults skills for tasks it can't easily handle alone — simple one-step queries may not trigger even if the description matches. Complex, multi-step, or specialized queries reliably trigger when the description matches.
This means eval queries should be substantive enough that Claude would benefit from a skill. Simple queries like "read file X" are poor test cases.
Take best_description from the output and update the skill's frontmatter. Show before/after and report scores.
If you have access to the present_files tool (check first — skip if not available):
env PYTHONPATH=${CLAUDE_SKILL_DIR} python -m scripts.package_skill <path/to/skill-folder>
Direct the user to the resulting .skill file.
${CLAUDE_SKILL_DIR}/references/writer.md — writer SubAgent instructions (create/improve modes)${CLAUDE_SKILL_DIR}/references/comparator.md — blind A/B comparator instructions${CLAUDE_SKILL_DIR}/references/analyzer.md — post-hoc analysis + benchmark analysis instructions${CLAUDE_SKILL_DIR}/references/grader.md — assertion grading instructions${CLAUDE_SKILL_DIR}/references/schemas.md — JSON schema definitions for the eval pipeline${CLAUDE_SKILL_DIR}/references/skill-authoring.md — comprehensive skill authoring guide${CLAUDE_SKILL_DIR}/references/skill-authoring-sources.md — reference guide update procedureAskUserQuestion in allowed-tools (auto-approve bug causes empty responses)Agent() with run_in_background: trueenv PYTHONPATH=... python (not bare PYTHONPATH=...) to avoid security heuristicrun_loop, viewer) use Bash(run_in_background=true) — not nohup ... &2> stderr redirects — tolerate error output insteadtools
--- name: sdd-steering description: Set up project-wide context (create, update, delete, custom) allowed-tools: Bash, Glob, Grep, Read, Write, Edit, Skill argument-hint: [-y] [custom] --- # SDD Steering (Unified) <instructions> ## Core Task Manage project steering documents. Lead handles directly (no SubAgent dispatch needed) since it requires user interaction. **Before any steering operation**, read `{{SDD_DIR}}/settings/rules/agent/steering-principles.md` and apply its principles (content
tools
--- name: sdd-status description: Check progress and analyze downstream impact allowed-tools: Read, Glob, Grep argument-hint: [feature-name] [--impact] --- # SDD Status (Unified) <instructions> ## Core Task Display comprehensive status for specifications and optionally analyze downstream impact of changes. Lead handles directly (read-only, no SubAgent needed). ## Step 1: Parse Arguments ``` $ARGUMENTS = "" → Overall roadmap + all specs progress $ARGUMENTS = "{feature}"
tools
Session start — invoke on "再開", "continue", "resume", or at every session start
content-media
Unified spec lifecycle (design, impl, review, roadmap management)