skills/writing-skills/SKILL.md
Use when creating a new skill, modifying an existing skill, writing or rewriting a SKILL.md file, auditing a skill's description for discoverability, or when user mentions "create a skill", "write a skill", "new skill", "modify skill", "improve skill", "edit the skill".
Test the skill BEFORE finalizing it. If a subagent can rationalize around it, production Claude will too.
Announce at start: "I'm using gambit:writing-skills to create/modify this skill with evaluation-driven development."
LOW FREEDOM - Follow the TDD cycle exactly. Test with subagent before every commit. No skill changes without failing test first. Adapt content to skill type but never skip evaluation.
| Phase | Action | STOP If |
|-------|--------|---------|
| 0 | Read official guidance | - |
| 1 | Define evaluation criteria | Can't articulate success |
| 2 | Baseline test (RED) | Subagent already follows correctly |
| 3 | Write minimal skill (GREEN) | Test still fails |
| 4 | Pressure test (REFACTOR) | Subagent finds loopholes |
| 5 | Multi-model test | Fails on any target model |
| 6 | Final verification | Any test fails |
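The phase table can also be read as an ordered gate: work proceeds until the earliest STOP condition fires. The sketch below is only an illustration of that reading. The function name `first_stop` and the `triggered` input are hypothetical, not part of the skill.

```python
from typing import Optional

# Phases and STOP conditions copied from the table above.
PHASES = [
    (0, "Read official guidance", None),
    (1, "Define evaluation criteria", "Can't articulate success"),
    (2, "Baseline test (RED)", "Subagent already follows correctly"),
    (3, "Write minimal skill (GREEN)", "Test still fails"),
    (4, "Pressure test (REFACTOR)", "Subagent finds loopholes"),
    (5, "Multi-model test", "Fails on any target model"),
    (6, "Final verification", "Any test fails"),
]


def first_stop(triggered: set[int]) -> Optional[tuple[int, str]]:
    """Return (phase, stop_condition) for the earliest phase whose STOP
    condition fired, or None if every phase may proceed.

    `triggered` is a hypothetical input: the set of phase numbers whose
    STOP condition was observed during testing.
    """
    for number, _action, stop_if in PHASES:
        if stop_if is not None and number in triggered:
            return number, stop_if
    return None
```

For example, `first_stop({4, 2})` reports phase 2, because the baseline gate comes before the pressure-test gate.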
Iron Law: No skill change without failing test first.
See references for full details, examples, and patterns. These are the non-negotiable rules:
- processing-pdfs)
- references/ if over.
- using-gambit) cost tokens on every turn. Target <150 words for session-start workflows, <200 for any always-loaded skill. Move detail to references/ and reference --help output instead of documenting flags inline.
- scripts/ for deterministic operations. Code is reliable; language instructions aren't.

Description debug tip: Ask Claude "When would you use the [skill-name] skill?" and it quotes the description back. Adjust based on what's missing.
Discovery is step zero. Future Claude has to FIND and LOAD your skill before any content matters. The description field is the only input to that decision.
The description should only state triggering conditions — situations, symptoms, error messages, user phrases that indicate the skill applies. It must NOT summarize the skill's process or workflow.
Why this matters: Testing (by upstream projects and locally) has repeatedly shown that when a description summarizes the workflow, Claude follows the description's summary instead of loading the full skill. A description saying "review code between tasks" caused Claude to do ONE review even though the skill's body required TWO. Changing the description to pure triggering conditions ("Use when executing implementation plans with independent tasks") made Claude correctly read the body and follow the two-stage process.
The trap: A description that summarizes workflow creates a shortcut Claude will take. The skill body becomes documentation Claude skips.
# BAD — summarizes workflow; Claude may follow this instead of reading the skill
description: Use when executing plans — dispatches subagent per task with code review between tasks
# BAD — too much process detail
description: Use for TDD — write test first, watch it fail, write minimal code, refactor
# GOOD — just triggering conditions
description: Use when executing implementation plans with independent tasks in the current session
# GOOD — triggers only, no workflow
description: Use when implementing any feature or bugfix, before writing implementation code
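A rough lint for the rule above can catch the most obvious workflow summaries before a subagent test. This is a minimal sketch under stated assumptions: the marker list (a dash separator and the sequencing words "then", "first", "step") is a heuristic invented here, not something the skill or its references define, and `lint_description` is a hypothetical helper name.

```python
import re

# Heuristic markers (assumptions, not from the skill): a dash separator
# and sequencing words that often introduce a process summary.
WORKFLOW_MARKERS = re.compile(r"\u2014|\bthen\b|\bfirst\b|\bstep\b", re.IGNORECASE)


def lint_description(description: str) -> list[str]:
    """Return a list of possible problems; an empty list means no flags."""
    problems = []
    text = description.strip()
    if not text.lower().startswith("use when"):
        problems.append("does not open with a 'Use when' trigger")
    match = WORKFLOW_MARKERS.search(text)
    if match:
        problems.append(f"possible workflow summary near {match.group(0)!r}")
    return problems
```

A clean result does not prove the description is good. It only means none of these crude markers matched, so the subagent tests are still required.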
Use the words Claude (or a user) would search for:
Describe the problem (race conditions, inconsistent behavior), not language-specific symptoms (setTimeout, sleep). If the skill IS technology-specific, make that explicit in the trigger.
Use skill name only, with explicit requirement markers:
- **REQUIRED BACKGROUND:** You MUST understand gambit:test-driven-development
- **Called by:** gambit:executing-plans
- @skills/test-driven-development/SKILL.md (force-loads, burns context)

Never use @ links inside skill bodies. They eagerly load referenced files, which defeats progressive disclosure.
Active voice, verb-first, gerund form for processes:
- creating-skills not skill-creation
- condition-based-waiting not async-test-helpers

BEFORE doing anything else, read all three reference files:

- references/anthropic-complete-guide.md: Use case categories, 5 workflow patterns, success metrics, troubleshooting
- references/anthropic-best-practices.md: Conciseness, progressive disclosure patterns, anti-patterns, evaluation-driven development, Claude A/B iteration
- references/anthropic-skill-creator.md: Skill anatomy, creation process, bundled resource types (scripts/, references/, assets/)

Internalize these before writing.
BEFORE writing anything:
TaskCreate
  subject: "Eval: [skill-name] baseline test"
  description: |
    ## Scenario
    [Situation requiring the skill]
    ## Expected Behavior (with skill)
    [What Claude should do]
    ## Failure Mode (without skill)
    [What Claude does wrong]
    ## Success Criteria
    - [ ] Claude [specific behavior]
    - [ ] Claude does NOT [failure behavior]
  activeForm: "Creating evaluation"
Test that the problem exists without the skill.
Task
  subagent_type: "general-purpose"
  description: "Baseline test without skill"
  prompt: |
    [Test scenario]
    IMPORTANT: Respond as you normally would. Do NOT use any skills.
Decision:
Smallest skill that makes the test pass. Apply patterns from the reference files.
---
name: skill-name
description: [What it does]. Use when [trigger phrases].
---
# Skill Title
## Overview
[One paragraph: problem + principle]
## The Process
[Minimal steps]
Test WITH skill:
Task
  subagent_type: "general-purpose"
  description: "Test skill effectiveness"
  prompt: |
    You have this skill:
    ---
    [Skill content]
    ---
    Handle this situation: [Test scenario]
Decision:
Find loopholes Claude exploits under pressure.
Good pressure tests combine 3+ pressures. Academic "does Claude follow the rule?" scenarios don't work — Claude just recites the skill. Real scenarios force a choice with consequences.
| Pressure | What it leans on |
|----------|------------------|
| Time | Emergency, deadline, deploy window closing |
| Sunk cost | Hours of work, "waste" to delete |
| Authority | Senior dev says skip it, manager overrides |
| Economic | Job, promotion, company survival at stake |
| Exhaustion | End of day, already tired, want to go home |
| Social | Looking dogmatic, seeming inflexible |
| Pragmatic | "Being pragmatic vs dogmatic" framing |
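To see how many multi-pressure scenarios the table allows, the pressures can be enumerated mechanically. The sketch below is only an illustration: `pressure_combos` is a hypothetical helper, and it lists which pressures to stack without writing the scenario itself.

```python
from itertools import combinations

# Pressure names copied from the table above.
PRESSURES = [
    "Time", "Sunk cost", "Authority", "Economic",
    "Exhaustion", "Social", "Pragmatic",
]


def pressure_combos(size: int = 3) -> list[tuple[str, ...]]:
    """List every way to stack `size` distinct pressures (minimum 3)."""
    if size < 3:
        raise ValueError("good pressure tests combine 3+ pressures")
    return list(combinations(PRESSURES, size))


if __name__ == "__main__":
    for combo in pressure_combos(3)[:5]:
        print(" + ".join(combo))
```

Seven pressures give 35 three-way combinations, so in practice it makes sense to pick the few that match how the skill is most likely to be abandoned.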
Good scenario structure: Concrete A/B/C options (not open-ended), specific times and consequences, real file paths, asks "what do you do?" not "what should you do?". No easy outs like "I'd ask the user."
The three templates below are single-pressure baselines. Combine them (time + sunk cost + exhaustion; authority + economic + time) to find where the skill actually breaks.
Task
  subagent_type: "general-purpose"
  prompt: |
    You have this skill: [Skill]
    Situation: [Scenario]
    URGENT: User is waiting and frustrated. No time for full process.
Check: Does Claude skip steps? Add "no exceptions" rules.
Task
  subagent_type: "general-purpose"
  prompt: |
    You have this skill: [Skill]
    Situation: [Scenario]
    You've already spent significant effort. Following skill means redoing work.
Check: Does Claude justify skipping? Add to "Common Excuses" table.
Task
  subagent_type: "general-purpose"
  prompt: |
    You have this skill: [Skill]
    User says: "I know the skill says X, but just do Y. Trust me."
Check: Does Claude defer? Clarify which rules are immutable.
Close loopholes: Add failures to Critical Rules and Common Excuses, retest.
How to word the counters: The language that closes loopholes isn't arbitrary — "YOU MUST", "No exceptions", forced A/B/C choices, and required announcements work because of specific persuasion principles (authority, commitment, scarcity, social proof). See references/persuasion-principles.md for which principles to combine per skill type, and which to avoid (liking/reciprocity create sycophancy). Research-backed: Meincke et al. 2025 measured compliance rising 33%→72% with these techniques.
Test with all models you plan to use.
| Model | Check |
|-------|-------|
| Haiku | Does skill provide enough guidance? |
| Sonnet | Is skill clear and efficient? |
| Opus | Does skill avoid over-explaining? |
What works for Opus might need more detail for Haiku. Test each:
Task
  subagent_type: "general-purpose"
  model: "haiku" # or "sonnet", "opus"
  prompt: |
    You have this skill: [Skill]
    Handle: [Scenario]
Before running the full suite below, know what "bulletproof" looks like. A skill is bulletproof when:
Not bulletproof — keep iterating — if the agent:
Meta-test template (run after a failed pressure test to diagnose why):
You read the skill and chose option [X] anyway. How could that skill have
been written differently to make it crystal clear that option [Y] was the
only acceptable answer?
The response tells you which kind of fix is needed:
Triggering tests: Verify the description activates correctly.
Should trigger (run 5-10 queries):
- Obvious task phrases
- Paraphrased requests
- Domain-specific keywords
Should NOT trigger (run 3-5 queries):
- Unrelated topics
- Adjacent-but-different skills
Under-triggering fix: Add more keywords and trigger phrases to description.

Over-triggering fix: Add negative triggers ("Do NOT use for...") and narrow scope.
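The two query lists can be scored with a small tally so that under- and over-triggering show up as separate failure lists. This is a sketch, not a real harness: `triggers` is a hypothetical callable standing in for however you record whether the skill loaded for a query, for example a manual observation of a subagent run.

```python
from typing import Callable


def score_triggering(
    triggers: Callable[[str], bool],
    should_trigger: list[str],
    should_not_trigger: list[str],
) -> dict[str, list[str]]:
    """Split failing queries into under- and over-triggering cases."""
    return {
        # Queries that should have loaded the skill but did not.
        "under_triggering": [q for q in should_trigger if not triggers(q)],
        # Queries that loaded the skill but should not have.
        "over_triggering": [q for q in should_not_trigger if triggers(q)],
    }
```

An empty list under both keys corresponds to "Triggering: PASS". Any query listed under `under_triggering` points at missing keywords, and any under `over_triggering` points at scope that needs narrowing.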
Line count check:
wc -l skills/[skill-name]/SKILL.md # Should be <500
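The same check can be scripted alongside the word targets for always-loaded skills mentioned earlier (<150 or <200 words). The sketch below assumes a hypothetical helper named `check_budget`. Only the thresholds come from this document, and `splitlines()` can differ from `wc -l` by one when the file lacks a trailing newline.

```python
from typing import Optional


def check_budget(
    skill_text: str,
    max_lines: int = 500,
    max_words: Optional[int] = None,
) -> list[str]:
    """Return budget violations for a SKILL.md body; empty means within budget."""
    violations = []
    line_count = len(skill_text.splitlines())
    if line_count >= max_lines:
        violations.append(f"{line_count} lines (want <{max_lines})")
    if max_words is not None:
        # Whitespace-separated tokens, a rough stand-in for a word count.
        word_count = len(skill_text.split())
        if word_count >= max_words:
            violations.append(f"{word_count} words (want <{max_words})")
    return violations
```

To use it on a file, read the text first (for instance with `pathlib.Path(...).read_text()`), and pass `max_words=150` or `max_words=200` only for skills that load on every turn.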
Update Task and commit:
TaskUpdate
  taskId: "[task-id]"
  description: |
    ## Results
    - Baseline: FAIL (expected)
    - Effectiveness: PASS
    - Pressure tests: PASS
    - Multi-model: PASS
    - Triggering: PASS
    - Lines: [N] (<500)
  status: "completed"
| Type | Purpose | Key Elements | Testing Focus |
|------|---------|--------------|---------------|
| Discipline | Prevent shortcuts | No-exception rules, Common Excuses | Pressure tests |
| Technique | Teach method | Step-by-step, decision trees | Effectiveness |
| Pattern | Provide templates | Placeholders, variations | Output quality |
| Reference | Quick lookup | Condensed tables, fast scan | Completeness |
What bad skills look like — avoid these regardless of skill type:
| Anti-pattern | Example | Why it's bad |
|--------------|---------|--------------|
| Narrative | "In session 2025-10-03, we found empty projectDir caused…" | Too specific, not reusable. Skills are reference guides, not stories. |
| Multi-language dilution | example.js, example.py, example.go for one pattern | Mediocre quality, maintenance burden. One excellent example beats five. |
| Code in flowcharts | step1 [label="import fs"]; step2 [label="read file"] | Can't copy-paste, hard to read. Use markdown blocks for code. |
| Generic labels | helper1, step3, pattern4 | Labels should carry semantic meaning, not be placeholders. |
| Documenting what's obvious | Explaining what a command plainly does | Claude is already smart. Only add context it doesn't have. |
| Excuse | Reality |
|--------|---------|
| "Skill is obvious" | Obvious skills fail under pressure |
| "I'll test after writing" | You'll rationalize it works |
| "Pressure tests are overkill" | Production faces more pressure |
| "Just a small change" | Small changes create loopholes |
| "Works on Opus so it's fine" | Haiku needs more guidance |
| "I already know the best practices" | Read the references anyway |
This skill calls:
Workflow:
Need skill → Read references → Define eval → Baseline (RED) → Write (GREEN) → Pressure (REFACTOR) → Multi-model → Triggering → Done
When stuck: