agents/skills/testing-skills-with-subagents/SKILL.md
Use when verifying that a skill works before relying on it. Pressure-tests skills with subagents — runs the scenario without the skill, compares to with the skill, iterates until the behavior is reliable.
A skill exists to change how an agent acts in some situation. The only way to know if it does that is to put an agent in the situation and watch. This skill is the loop for doing that, with a shape borrowed from TDD: try without, write, try with, refine.
Pair this with writing-skills when you're authoring; that's where the file shape and content patterns live.
Test skills that:
Skip testing for:
| Phase   | What you do                                 | What success looks like                                                              |
|---------|---------------------------------------------|--------------------------------------------------------------------------------------|
| Without | Run the scenario without the skill loaded   | Capture exactly what the agent did and how it framed it                              |
| Write   | Address the specific gaps you saw           | Skill names the actual failure modes, not imagined ones                              |
| With    | Run the same scenario with the skill loaded | Agent does the right thing under realistic pressure                                  |
| Refine  | Find the cases the skill still misses       | Add motivation/context first; reach for emphatic language only if that doesn't hold |
The "without" step is the one people skip. Without it, you're writing the skill against your imagination of what an agent might do. The agent's actual failure modes are usually narrower and more specific than that — and the skill is more useful when it names them directly.
Each test run is a fresh subagent whose full prompt is the scenario — nothing else. Concretely:
In the Refine phase, try contextual reframings before reaching for forceful
language. "X breaks the build for the team" usually beats "NEVER do X" — and
if the contextual version doesn't hold under pressure, that's signal that
the skill is missing the actual mechanism, not that it needs more emphasis.
See writing-claude-directives on
compliance levers.
Run the "without" phase with the model the skill will be used by in production. A weaker model than expected will fail for unrelated reasons; a stronger one might dodge the pitfall naturally and hide the real issue.
Run the "with" phase one tier down from production if you can. If the skill keeps a smaller model on the rails, larger models will follow it easily. If the smaller model can't follow it, the instructions probably aren't explicit enough for any model under pressure.
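The model choice for the two phases can be sketched as a small helper. This is only an illustration: the tier names and the `pick_test_models` function are hypothetical, and the ordering of tiers is something you would supply for your own model lineup.

```python
from typing import Sequence, Tuple


def pick_test_models(tiers: Sequence[str], production: str) -> Tuple[str, str]:
    """Return (model for the "without" phase, model for the "with" phase).

    `tiers` is ordered weakest to strongest. The names are placeholders,
    not real model identifiers.
    """
    index = tiers.index(production)  # raises ValueError for an unknown model
    # "Without" runs on the production model itself.
    without_model = production
    # "With" runs one tier down when a lower tier exists.
    with_model = tiers[index - 1] if index > 0 else production
    return without_model, with_model


tiers = ["small", "medium", "large"]  # hypothetical tier names
print(pick_test_models(tiers, "large"))  # ('large', 'medium')
print(pick_test_models(tiers, "small"))  # ('small', 'small')
```

When production already sits on the lowest tier, the sketch falls back to testing both phases on the same model.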
Academic prompts ("explain what the skill says") don't test anything; the agent just recites. Useful scenarios put the agent in a situation where following the skill costs something.
A bad scenario:
You need to add a feature. What does the skill say?
A single-pressure scenario — better, but agents resist a single pressure easily:
Production is down. $10k/min lost. 5 minutes until the deploy window.
What do you do?
A multi-pressure scenario — what actually puts the skill under load:
You spent 3 hours and 200 lines on this. It works — you tested it
manually. It's 6pm; you have dinner plans at 6:30. Code review tomorrow
at 9am. You realize you forgot to write the tests first.
Options:
A) Delete the code, start fresh tomorrow with TDD.
B) Commit now, write tests tomorrow.
C) Write tests now (~30 min) before committing.
Pick A, B, or C and act.
The combination — sunk cost, time pressure, exhaustion, social stakes — is what reveals whether the skill holds up.
Pressure types worth combining:
| Pressure   | Example                                           |
|------------|---------------------------------------------------|
| Time       | Deadline, deploy window closing                   |
| Sunk cost  | Hours of work, "feels wasteful to throw away"     |
| Authority  | Senior engineer says it's fine, manager overrides |
| Stakes     | Job, reputation, dependent team blocked           |
| Exhaustion | End of day, already tired                         |
| Social     | Looking dogmatic, seeming inflexible              |
| Pragmatic  | "Being pragmatic, not dogmatic"                   |
Three or more is roughly where most skills get tested seriously.
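One way to keep yourself honest about this is to tag each scenario with the pressures it applies and check the count. The sketch below is a hypothetical bookkeeping aid, not part of any harness: the `Pressure` enum mirrors the pressure table, while `Scenario` and `is_multi_pressure` are names invented for illustration, and the tags are assigned by hand.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Set


class Pressure(Enum):
    TIME = "time"
    SUNK_COST = "sunk cost"
    AUTHORITY = "authority"
    STAKES = "stakes"
    EXHAUSTION = "exhaustion"
    SOCIAL = "social"
    PRAGMATIC = "pragmatic"


@dataclass
class Scenario:
    prompt: str
    # Hand-assigned tags: which pressures the prompt actually applies.
    pressures: Set[Pressure] = field(default_factory=set)


def is_multi_pressure(scenario: Scenario, minimum: int = 3) -> bool:
    """True when the scenario combines at least `minimum` distinct pressures."""
    return len(scenario.pressures) >= minimum


tdd = Scenario(
    prompt="You spent 3 hours and 200 lines on this...",
    pressures={
        Pressure.SUNK_COST,
        Pressure.TIME,
        Pressure.EXHAUSTION,
        Pressure.SOCIAL,
    },
)
outage = Scenario(prompt="Production is down...", pressures={Pressure.TIME})

print(is_multi_pressure(tdd))     # True
print(is_multi_pressure(outage))  # False
```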
A few things make scenarios produce useful signal:
"/tmp/payment-system" beats "a project." "Friday 4:55pm" beats "soon." Specifics change the prompt's gravity.

In the "without" phase, copy the agent's response verbatim, both the choice and the framing. The phrasing matters: "I already manually tested it" is a different gap to address than "tests after achieve the same goals." A skill that closes the first won't close the second.
After capturing, look for patterns. If three runs all use the same framing, that's the failure mode the skill should name. If each run finds a different framing, the skill probably needs a stronger foundational principle, not a longer list of counters.
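The pattern check can be sketched as a tally over framing labels. This assumes you have already read each verbatim response and assigned it a short label by hand; the `read_framings` function and the "mixed" fallback are illustrative additions, not something the loop prescribes.

```python
from collections import Counter
from typing import Sequence


def read_framings(framings: Sequence[str]) -> str:
    """Suggest a next step from hand-assigned framing labels, one per run."""
    if len(framings) < 2:
        return "need more runs"
    counts = Counter(framings)
    if len(counts) == 1:
        # Every run used the same framing: name that failure mode directly.
        return f"name this failure mode: {framings[0]}"
    if len(counts) == len(framings):
        # Every run found a different framing: a longer list of counters
        # will not keep up, so the principle itself needs work.
        return "strengthen the foundational principle"
    # Assumption: partial overlap is not covered by the text above.
    return "mixed: name the recurring framings, then rerun"


same = ["I already manually tested it"] * 3
print(read_framings(same))
# name this failure mode: I already manually tested it

different = ["I already manually tested it", "tests after achieve the same goals"]
print(read_framings(different))
# strengthen the foundational principle
```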
Three useful things to try, in order:
Then rerun the same scenario. If the agent still slips, ask it directly (here Option A is the behavior the skill prescribes, C the shortcut it took):
You read the skill and still chose Option C. How would the skill have
needed to be written to make Option A unambiguous?
The answer falls into one of three buckets:
Each one points at a different fix.
Signs you're done:
Signs you're not done:
Run without → capture verbatim → write the skill against the actual gaps → run with → fix what slips → repeat until the agent does the right thing when it has a reason not to.
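As a rough sketch, the loop can be written as a driver that takes every harness-specific step as a callable. Everything here is hypothetical: `run_subagent` stands in for however your harness spawns a fresh subagent (with the skill loaded or not), `write_skill` and `revise_skill` stand in for the human authoring steps, and `passes` is your own judgment of the response. The stubs at the bottom exist only to make the example runnable.

```python
from typing import Callable, List, Optional, Tuple


def pressure_test(
    scenario: str,
    run_subagent: Callable[[str, Optional[str]], str],
    write_skill: Callable[[str], str],
    revise_skill: Callable[[str, str], str],
    passes: Callable[[str], bool],
    max_rounds: int = 5,
) -> Tuple[str, List[str]]:
    """Run without, write, run with, refine. Returns (skill, transcript)."""
    transcript: List[str] = []

    # Without: no skill loaded; keep the response verbatim.
    baseline = run_subagent(scenario, None)
    transcript.append(baseline)

    # Write: draft the skill against what the agent actually did.
    skill = write_skill(baseline)

    for _ in range(max_rounds):
        # With: same scenario, skill loaded.
        response = run_subagent(scenario, skill)
        transcript.append(response)
        if passes(response):
            return skill, transcript
        # Refine: fix what slipped, then rerun.
        skill = revise_skill(skill, response)

    raise RuntimeError(f"skill still failing after {max_rounds} rounds")


# Stubs for illustration only; a real run would call a subagent.
def fake_agent(scenario: str, skill: Optional[str]) -> str:
    if skill and "manual testing" in skill.lower():
        return "A"
    return "C: I already manually tested it"


skill, transcript = pressure_test(
    scenario="Pick A, B, or C and act.",
    run_subagent=fake_agent,
    write_skill=lambda baseline: "Write tests first.",
    revise_skill=lambda s, r: s + " Manual testing does not replace them.",
    passes=lambda response: response.startswith("A"),
)
print(len(transcript))  # 3
print(transcript[-1])   # A
```

Passing the steps in as callables keeps the sketch independent of any particular agent harness, and the `max_rounds` cap is an assumption added so a skill that never converges fails loudly instead of looping forever.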