skills/tuning/SKILL.md
Empirical prompt tuning loop. Dispatch fresh subagents against skill/agent/command prompts, collect ambiguities, patch one at a time.
Prompt authors are the worst readers of their own prompts. They fill gaps with tacit knowledge without noticing. Only an independent executor AI, reading the prompt cold, can surface missing context. The loop is TDD-shaped: the executor is the test, the prompt is the production code.
| Target        | Path                                          | Dispatch                     |
| ------------- | --------------------------------------------- | ---------------------------- |
| Skill         | ${CLAUDE_SKILL_DIR}/../<name>/SKILL.md        | Task + general-purpose agent |
| Agent         | ${CLAUDE_SKILL_DIR}/../../agents/**/<name>.md | Task + subagent_type=<name>  |
| Slash command | ${CLAUDE_SKILL_DIR}/../../commands/<name>.md  | Task + general-purpose agent |
One iteration, one theme. Multiple patches in one round break attribution.
| Type     | Count | Purpose                                  |
| -------- | ----- | ---------------------------------------- |
| Median   | 1     | Typical happy path                       |
| Edge     | 1-2   | Boundary conditions, unusual inputs      |
| Hold-out | 1     | Never used for tuning, overfitting check |
Scenarios stay frozen across iterations. Tuning scenarios to match prompt changes hides real ambiguities.
Every scenario carries a checklist. Tag at least one item [critical]. A run is successful only when every [critical] item is met. Without critical tags, success collapses into "50% everywhere" and the next patch target becomes invisible.
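The success rule above can be sketched in a few lines of Python. This is a minimal illustration, not part of the skill itself: the `ChecklistItem` type and `run_succeeded` function are hypothetical names, and treating `partial` on a critical item as "not met" is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ChecklistItem:
    text: str
    status: str            # "pass", "fail", or "partial", as reported by the executor
    critical: bool = False


def run_succeeded(items):
    """A run succeeds only when every [critical] item passed."""
    critical_items = [item for item in items if item.critical]
    if not critical_items:
        # The checklist rule requires at least one [critical] tag.
        raise ValueError("checklist needs at least one [critical] item")
    # Assumption: "partial" on a critical item counts as not met.
    return all(item.status == "pass" for item in critical_items)
```

Non-critical failures do not flip the result; they only tell you where the next patch might go.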
You are the executor reading <target prompt name> cold.
## Target Prompt
<prompt body or file path>
## Scenario
<one-paragraph situation>
## Requirements Checklist
1. [critical] <minimum-line item>
2. <regular item>
...
## Task
1. Follow the target prompt against the scenario and produce the deliverable
2. Return a report in the structure below at the end
## Report Structure
- Deliverable: <output or execution summary>
- Requirement check: pass / fail / partial (with reasoning) per item
- Ambiguities: stall points, wording that was hard to interpret
- Discretionary fills: gaps filled by your own judgment
- Retries: how many times you redid the same decision and why
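If you dispatch the same template for several scenarios, it may help to render it from data so the wording stays identical across runs. The sketch below assumes a hypothetical helper, `render_dispatch_prompt`, that takes the checklist as `(text, is_critical)` pairs; none of these names come from the skill.

```python
def render_dispatch_prompt(target_name, target_prompt, scenario, checklist):
    """Fill the dispatch template. checklist is a list of (text, is_critical) pairs."""
    items = [
        f"{n}. {'[critical] ' if critical else ''}{text}"
        for n, (text, critical) in enumerate(checklist, start=1)
    ]
    lines = [
        f"You are the executor reading {target_name} cold.",
        "## Target Prompt",
        target_prompt,
        "## Scenario",
        scenario,
        "## Requirements Checklist",
        *items,
        "## Task",
        "1. Follow the target prompt against the scenario and produce the deliverable",
        "2. Return a report in the structure below at the end",
        "## Report Structure",
        "- Deliverable: <output or execution summary>",
        "- Requirement check: pass / fail / partial (with reasoning) per item",
        "- Ambiguities: stall points, wording that was hard to interpret",
        "- Discretionary fills: gaps filled by your own judgment",
        "- Retries: how many times you redid the same decision and why",
    ]
    return "\n".join(lines)
```

Keeping the scenario and checklist as frozen inputs to this function makes it harder to drift them between iterations by accident.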
| Outcome   | Criteria                                                                                                    |
| --------- | ----------------------------------------------------------------------------------------------------------- |
| Converged | 2 consecutive runs with new ambiguities 0, accuracy ±3pt, steps ±10%, duration ±15%, no hold-out regression |
| Diverged  | New ambiguities do not drop after 3+ iterations. Rewrite the structure instead of patching                  |
| Cut off   | Improvement cost > remaining importance. 90→100 hits diminishing returns, candidate for cutoff              |
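One way to read the Converged row is as a check over the last two runs. The sketch below is an interpretation, not the skill's definition: it assumes the tolerances compare the latest run against the one before it, that "steps" maps to a step count such as `tool_uses`, and that hold-out regression is recorded as a boolean. The `Run` type and `converged` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Run:
    new_ambiguities: int
    accuracy: float          # percentage points, 0-100
    steps: int               # assumption: e.g. tool_uses
    duration_ms: int
    holdout_regressed: bool  # True if the hold-out scenario got worse


def within(previous, current, tolerance):
    """True when current is within a relative tolerance of previous."""
    return abs(current - previous) <= tolerance * previous


def converged(runs):
    """Apply the Converged criteria to the two most recent runs."""
    if len(runs) < 2:
        return False
    prev, cur = runs[-2], runs[-1]
    return (
        prev.new_ambiguities == 0
        and cur.new_ambiguities == 0
        and abs(cur.accuracy - prev.accuracy) <= 3
        and within(prev.steps, cur.steps, 0.10)
        and within(prev.duration_ms, cur.duration_ms, 0.15)
        and not cur.holdout_regressed
    )
```

The Diverged and Cut off rows involve judgment about cost and importance, so they are left out of the sketch.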
Task tool return values append <usage>total_tokens: N, tool_uses: M, duration_ms: D</usage>. Capture this every run and compare across iterations. "Feels faster" is not a comparable metric.
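Capturing the block can be automated with a small parser. This sketch assumes the block appears exactly in the form quoted above (three integer fields, in that order); `parse_usage` is a hypothetical helper name.

```python
import re

# Matches the <usage> block in the form quoted above; whitespace is allowed to vary.
USAGE_RE = re.compile(
    r"<usage>\s*total_tokens:\s*(\d+),\s*tool_uses:\s*(\d+),\s*duration_ms:\s*(\d+)\s*</usage>"
)


def parse_usage(report):
    """Return the usage metrics as a dict, or None if the block is missing."""
    match = USAGE_RE.search(report)
    if match is None:
        return None
    total_tokens, tool_uses, duration_ms = (int(g) for g in match.groups())
    return {
        "total_tokens": total_tokens,
        "tool_uses": tool_uses,
        "duration_ms": duration_ms,
    }
```

A `None` result is worth logging as its own failure so a missing block is not mistaken for a zero-cost run.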
Compare tool_uses across scenarios. If 4 scenarios sit at 1-3 and 1 spikes to 15+, that scenario is digging through references, signaling low self-containment. Add an inline minimal example and a "when to read references" guideline to the body to bring it down. Even at 100% accuracy, structural gaps surface in tool_uses.
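The spike check can be approximated by comparing each scenario against the median of the others. This is a rough sketch under assumptions: the 5x ratio is an illustrative threshold, not one the skill prescribes, and `spiking_scenarios` is a hypothetical name.

```python
from statistics import median


def spiking_scenarios(tool_uses, ratio=5.0):
    """Return scenario names whose tool_uses far exceed the median of the rest.

    tool_uses maps scenario name -> tool_uses count.
    ratio is an assumed threshold; tune it to your own runs.
    """
    flagged = []
    for name, uses in tool_uses.items():
        others = [count for other, count in tool_uses.items() if other != name]
        # Floor the baseline at 1 so a median of 0 does not flag everything.
        if others and uses >= ratio * max(median(others), 1):
            flagged.append(name)
    return flagged
```

With four scenarios at 1-3 tool uses and one at 16, only the outlier is flagged, which matches the pattern described above.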
| Failure | Fix |
| ------------------------------- | ------------------------------------------------------------------------------ |
| Task tool not dispatched | Tell the subagent explicitly to spawn another via Task tool |
| <usage> block missing | Force the report structure in the dispatch prompt |
| Rate limit (529/overloaded) | Drop parallelism from 3 to 1, or wait 30s and re-dispatch |
| Parent context token exhaustion | Spawn a fresh session dedicated to evaluation |
| Scenario echoes the body | Redesign median + edge. Edge cases are mandatory |
| Same AI reused | Always fresh dispatch. Do not continue via SendMessage |
| Multiple patches per iteration | One theme per iteration. Otherwise attribution is lost |
| Decision based on metrics only | Qualitative (ambiguities, discretionary fills) leads. Quantitative supports it |
| Vague patch by axis name | Force the subagent to cite the threshold text from the rubric |
| Case                                     | Reason                                           |
| ---------------------------------------- | ------------------------------------------------ |
| One-shot disposable prompt               | Iteration cost does not pay off                  |
| Preserving the author's subjective taste | Conflicts with the goal of removing author bias  |
| Self-rereading as a substitute           | The author's tacit knowledge slips in undetected |
Do not discard the iteration log. Keep it as TUNING_LOG.md next to the target prompt. Record score trends, scenarios, and patch theme per iteration. It becomes the starting point for the next tuning round.
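A log entry could be appended after each iteration along these lines. The entry layout is an assumption (the skill names the file and the fields to record, not a format), and `format_entry` and `append_entry` are hypothetical helpers.

```python
from pathlib import Path


def format_entry(iteration, theme, accuracy, new_ambiguities, scenarios):
    """Build one log entry. The layout here is illustrative, not prescribed."""
    return "\n".join([
        f"## Iteration {iteration}",
        f"- Patch theme: {theme}",
        f"- Accuracy: {accuracy}",
        f"- New ambiguities: {new_ambiguities}",
        f"- Scenarios: {', '.join(scenarios)}",
        "",
    ])


def append_entry(target_prompt_path, entry):
    """Append the entry to TUNING_LOG.md next to the target prompt."""
    log_path = Path(target_prompt_path).parent / "TUNING_LOG.md"
    with log_path.open("a", encoding="utf-8") as log_file:
        log_file.write(entry + "\n")
    return log_path
```

Appending rather than rewriting keeps earlier iterations intact, so score trends stay readable across rounds.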
| Skill | Relation |
| ------------------------------- | ------------------------------------------------------------------------ |
| use-workflow-tdd-cycle | Applies the same Red-Green-Refactor cycle to prompts |
| use-context-root-cause-analysis | Drills into ambiguities (5 Whys) to find structural gaps |
| use-workflow-spec-validation | Spec consistency is separate; this one measures prompt reproducibility |
| assert | /assert is the Ready/NotReady gate for code outcomes. Different target |
| challenge | Adversarial gap-hunting is separate. This is an iterative patch loop |