dot_claude/skills/empirical-prompt-tuning/SKILL.md
Iteratively improve agent-facing text instructions (skills, slash commands, task prompts, CLAUDE.md sections, code-generation prompts) by having a bias-free executor run them and evaluating from both sides (executor self-report + caller-side metrics). Repeat until improvement plateaus. Use immediately after creating or significantly revising a skill/prompt, or when unexpected agent behavior is suspected to stem from ambiguous instructions.
Prompt quality is invisible to its author. The more "clear" a writer thinks something is, the more likely a fresh executor stumbles on it. The core of this skill: have a bias-free executor actually run the prompt, evaluate from both sides, and iterate until improvement plateaus.
Do NOT use for:
This skill applies to both scopes equally. The evaluation workflow is identical; only the file location differs.
| Scope | Typical paths | Examples |
|---|---|---|
| Global (~/.claude/) | ~/.claude/skills/*/SKILL.md, ~/.claude/rules/*.md, ~/.claude/CLAUDE.md | Reusable skills shared across all repos, global behavior rules |
| Project (repo-local) | CLAUDE.md, docs/adr/*.md, inline task prompts, CI prompt snippets | Project-specific instructions, ADRs with embedded prompts |
When tuning a global prompt, consider that it will run across many different project contexts — design scenarios that span 2–3 representative project types, not just the repo where you noticed the problem.
Always edit the chezmoi source at ~/.local/share/chezmoi/dot_claude/ — never edit ~/.claude/ directly. Changes to the live path are overwritten by the next chezmoi apply. After each iteration's fix, confirm with the user before running chezmoi apply to reflect the change.
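The path rule above can be sketched as a small helper that redirects any edit aimed at the live tree to its chezmoi source. This is a minimal sketch, assuming file names under `dot_claude/` carry no extra chezmoi attribute prefixes such as `private_` or `executable_`; `to_chezmoi_source` is a hypothetical name, not part of chezmoi.

```python
from pathlib import Path

# Assumption: ~/.claude/ is managed from the source directory named in the text,
# and nothing below it uses extra chezmoi attribute prefixes.
LIVE_ROOT = Path.home() / ".claude"
SOURCE_ROOT = Path.home() / ".local" / "share" / "chezmoi" / "dot_claude"


def to_chezmoi_source(path: Path) -> Path:
    """Map a path under ~/.claude/ to its chezmoi source; pass other paths through."""
    path = path.expanduser()
    try:
        relative = path.relative_to(LIVE_ROOT)
    except ValueError:
        # Project-local file (for example a repo CLAUDE.md): edit it in place.
        return path
    return SOURCE_ROOT / relative
```

When the source tree does use attribute prefixes, `chezmoi source-path <target>` gives the authoritative answer and should be preferred over this mapping.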
Read the frontmatter description and the body independently.
npx playwright test CLI reference
Fix these two artifacts before dispatching any subagent:
Evaluation scenarios (2–3):
Requirements checklist (3–7 items per scenario):
[critical] per scenario
Dispatch a new subagent via the Agent tool. Never self-review (you cannot objectively read text you just wrote; this is structurally impossible).
When running multiple scenarios in parallel, place all Agent calls in a single message.
See "Environment Constraints" if dispatch is unavailable.
Pass the subagent a prompt following the Subagent Launch Contract below. The subagent implements/generates the output and returns a self-report.
Record the following from the returned result:
Executor self-report (extracted from the subagent's report):
Caller-side measurements (judgment rules are authoritative here — reference only this section):
- [critical] items are ○. Any [critical] item that is × or partial = FAILURE (×). Binary only: no "partial success".
- tool_uses from Agent tool's usage metadata (include Read / Grep, no exclusions)
- duration_ms from Agent tool's usage metadata
- [critical] item failed and why
Requirements checklist must contain at least 1 [critical] item (0 items makes success vacuously true). Do not add or remove [critical] tags after fixing the checklist.
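The binary judgment rule can be sketched as follows. This assumes success means every [critical] item is ○, which is how the rule reads even though its opening words are cut off in the text; `ItemResult` and `judge` are hypothetical names used only for illustration.

```python
from typing import NamedTuple


class ItemResult(NamedTuple):
    text: str
    critical: bool
    mark: str  # "○", "×", or "partial"


def judge(items: list[ItemResult]) -> str:
    """Return "○" (success) or "×" (failure) for one scenario."""
    critical = [item for item in items if item.critical]
    if not critical:
        # With zero [critical] items, success would be vacuously true.
        raise ValueError("checklist needs at least one [critical] item")
    # Binary only: a "partial" on any [critical] item counts as failure.
    return "○" if all(item.mark == "○" for item in critical) else "×"
```

Non-critical items do not affect the result here; they would feed the separate accuracy figure instead.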
Apply the minimal fix that addresses one theme of unclear points. Scope per iteration:
~/.local/share/chezmoi/dot_claude/), never ~/.claude/ directly. After editing, confirm with the user before running chezmoi apply.
Dispatch a new subagent (never reuse: it learned from the previous run). Repeat steps 2–5.
Increase parallelism when improvement is not plateauing.
Stop when both of the following hold for 2 consecutive iterations:
For high-value prompts: require 3 consecutive iterations.
| Axis | Source | Meaning |
|---|---|---|
| Success/Failure | Caller measures | Minimum bar |
| Accuracy | Caller measures | Degree of partial success |
| Step count | tool_uses metadata | Proxy for wasted effort |
| Duration | duration_ms metadata | Proxy for cognitive load |
| Retry count | Executor self-report | Signal of ambiguity |
| Unclear points | Executor self-report | Qualitative improvement material |
| Discretionary fill-ins | Executor self-report | Surfacing implicit spec |
Weighting: qualitative (unclear points, fill-ins) is primary; quantitative (time, steps) is supplementary. Chasing only time reduction causes the prompt to become too sparse.
tool_uses
Use tool_uses as a relative value across scenarios, not an absolute target:
tool_uses but one at 15+ → no recipe for that scenario; executor is traversing references/
Even at 100% accuracy, a tool_uses outlier justifies starting iter 2.
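Reading tool_uses as a relative value can be sketched by comparing each scenario against the median of the others. The `factor` of 3 is an assumed threshold chosen for illustration; the text only gives "15+" against lower counts as an example, and `tool_use_outliers` is a hypothetical helper.

```python
from statistics import median


def tool_use_outliers(tool_uses: dict[str, int], factor: float = 3.0) -> list[str]:
    """Return scenarios whose tool_uses is at least `factor` times the median of the others."""
    outliers = []
    for name, count in tool_uses.items():
        others = [c for other, c in tool_uses.items() if other != name]
        if not others:
            continue  # a single scenario has nothing to be compared with
        baseline = median(others)
        if count > baseline and count >= factor * baseline:
            outliers.append(name)
    return outliers
```

A flagged scenario is a reason to start another iteration even when its accuracy is 100%.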
Fixes are non-linear. Three patterns to expect:
To stabilize estimates: before applying a fix, have the subagent state which judgment criterion text it satisfies. Without criterion-level mapping, estimate accuracy stays low.
The prompt passed to each executor subagent must follow this structure:
You are a fresh executor reading <target prompt name> with no prior context.
## Target Prompt
<paste full text, or specify path for the subagent to Read>
## Scenario
<1-paragraph situation description>
## Requirements Checklist (what the output must satisfy)
1. [critical] <minimum bar item>
2. <standard item>
3. <standard item>
...
(Judgment rules are defined in the "Dual-Side Evaluation" section of the empirical-prompt-tuning skill. [critical] tag required on at least 1 item.)
## Task
1. Execute the scenario following the target prompt. Generate the deliverable.
2. At the end, return a report in the structure below.
## Report Structure
- Deliverable: <generated output or execution summary>
- Requirements met: for each item, ○ / × / partial (with reason)
- Unclear points: wording that caused confusion or required interpretation (bulleted list)
- Discretionary fill-ins: decisions not covered by instructions that you made yourself (bulleted list)
- Retries: how many times you reconsidered the same decision, and why
The caller extracts the self-report section and reads tool_uses / duration_ms from the Agent tool's usage metadata to fill the evaluation table.
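Assembling the launch prompt can be sketched as a plain string builder that follows the contract's section order. `build_executor_prompt` and its parameters are hypothetical, and the pointer line to the judgment rules is omitted for brevity; the actual dispatch through the Agent tool is not shown.

```python
REPORT_STRUCTURE = """\
- Deliverable: <generated output or execution summary>
- Requirements met: for each item, ○ / × / partial (with reason)
- Unclear points: wording that caused confusion or required interpretation (bulleted list)
- Discretionary fill-ins: decisions not covered by instructions that you made yourself (bulleted list)
- Retries: how many times you reconsidered the same decision, and why"""


def build_executor_prompt(target_name: str, target_text: str, scenario: str,
                          checklist: list[tuple[str, bool]]) -> str:
    """Build the subagent prompt; each checklist entry is (text, is_critical)."""
    if not any(critical for _, critical in checklist):
        raise ValueError("checklist needs at least one [critical] item")
    items = "\n".join(
        f"{n}. {'[critical] ' if critical else ''}{text}"
        for n, (text, critical) in enumerate(checklist, start=1)
    )
    return "\n".join([
        f"You are a fresh executor reading {target_name} with no prior context.",
        "## Target Prompt",
        target_text,
        "## Scenario",
        scenario,
        "## Requirements Checklist (what the output must satisfy)",
        items,
        "## Task",
        "1. Execute the scenario following the target prompt. Generate the deliverable.",
        "2. At the end, return a report in the structure below.",
        "## Report Structure",
        REPORT_STRUCTURE,
    ])
```

Building the prompt from data rather than by hand keeps the checklist identical across iterations, so [critical] tags cannot drift between runs.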
If dispatching a new subagent is not possible (already running as a subagent, Task tool disabled, etc.):
empirical evaluation skipped: dispatch unavailable and stop
Structure Audit Mode: if the goal is checking textual consistency / clarity only (not running the prompt), mark the subagent prompt explicitly as "structure audit mode: text consistency check only, not execution." This prevents the environment-constraint skip rule from triggering. Structure audits are a supplement to empirical evaluation; they do not count toward convergence.
Convergence (stop): all of the following hold for 2 consecutive iterations:
Divergence (redesign): 3+ iterations with no reduction in new unclear points → the prompt's structural design is wrong. Stop patching; rewrite the structure.
Resource cutoff: when improvement cost no longer justifies improvement value (shipping at 80 is a valid call).
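The stop rules can be sketched as two checks over the iteration history. Here a "clear" iteration is an assumed boolean meaning every convergence condition held in that iteration, and divergence is read as the last three counts of new unclear points never decreasing; both function names are hypothetical.

```python
def has_converged(clear_flags: list[bool], high_value: bool = False) -> bool:
    """True when the most recent 2 iterations (3 for high-value prompts) were all clear."""
    required = 3 if high_value else 2
    streak = 0
    for clear in reversed(clear_flags):
        if not clear:
            break
        streak += 1
    return streak >= required


def has_diverged(new_unclear_counts: list[int], window: int = 3) -> bool:
    """True when the last `window` iterations show no reduction in new unclear points."""
    recent = new_unclear_counts[-window:]
    if len(recent) < window or recent[-1] == 0:
        return False
    return all(later >= earlier for earlier, later in zip(recent, recent[1:]))
```

The resource cutoff is deliberately left out: it is a judgment about cost versus value, not something to compute from the history.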
Record and present after each iteration:
## Iteration N
### Changes (diff from previous)
- <1-line description of modification>
### Results (per scenario)
| Scenario | Pass/Fail | Accuracy | steps | duration | retries |
|---|---|---|---|---|---|
| A | ○ | 90% | 4 | 20s | 0 |
| B | × | 60% | 9 | 41s | 2 |
### Unclear Points (new this iteration)
- <Scenario B>: [critical] item N is × — <1-line reason> # required on failure
- <Scenario B>: <other finding>
- <Scenario A>: (none)
### Discretionary Fill-ins (new this iteration)
- <Scenario B>: <decision made>
### Next Fix
- <1-line minimal modification>
(Convergence: N consecutive clears / Y more to stop)
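The per-scenario results table in the template above could be rendered from recorded measurements roughly like this. `ScenarioResult` and its field names are hypothetical; only tool_uses and duration_ms correspond to metadata named in the text, and the accuracy figure is whatever the caller measured.

```python
from typing import NamedTuple


class ScenarioResult(NamedTuple):
    scenario: str
    passed: bool
    accuracy_pct: int
    tool_uses: int
    duration_ms: int
    retries: int


def render_results_table(results: list[ScenarioResult]) -> str:
    """Render the per-scenario results table for one iteration report."""
    lines = [
        "| Scenario | Pass/Fail | Accuracy | steps | duration | retries |",
        "|---|---|---|---|---|---|",
    ]
    for r in results:
        mark = "○" if r.passed else "×"
        seconds = round(r.duration_ms / 1000)  # report whole seconds
        lines.append(
            f"| {r.scenario} | {mark} | {r.accuracy_pct}% | {r.tool_uses} | {seconds}s | {r.retries} |"
        )
    return "\n".join(lines)
```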
| Rationalization | Reality |
|---|---|
| "Reading it myself is the same thing" | You cannot objectively read text you just wrote. Always dispatch a new subagent. |
| "One scenario is enough" | Single scenarios overfit. Minimum 2, ideally 3. |
| "Zero unclear points once means we're done" | Could be chance. Require 2 consecutive iterations. |
| "Let me fix multiple unclear points at once" | You won't know what worked. 1 theme per iteration. |
| "Related micro-fixes should each be their own iteration" | Opposite trap. 1 theme = semantic unit. 2–3 related micro-fixes belong in 1 iter. Over-splitting explodes iteration count. |
| "Metrics look good, ignore qualitative feedback" | Time reduction alone is a sign of over-pruning. Qualitative is primary. |
| "Rewriting is faster" | Correct after 3+ iterations of no progress. Before that, it's avoidance. |
| "Reuse the same subagent" | It learned from the previous run. Dispatch fresh every iteration. |
- feature: feature development with harness engineering. Same generator-evaluator separation principle applies
- interrupt: background agent dispatch. Reference for how to structure subagent launch prompts