Writing Khuym Skills

If .khuym/onboarding.json is missing or stale for the current repo, stop and invoke khuym:using-khuym before continuing.

Overview

Skills are code. They have bugs. Test them before deploying.

This is the TDD-for-skills methodology adapted from Superpowers (N=28,000 scale testing confirms persuasion-optimized skills produce 3-4× better agent compliance than plain instructions).

THE IRON LAW: NO SKILL WITHOUT A FAILING TEST FIRST. Write skill before testing? Delete it. Start over. No exceptions — not for "simple additions," not for "just a section," not for "reference only."

When to Use

Use this skill when you are about to:

Create any new skill for the khuym ecosystem
Edit an existing skill (even a small section)
Deploy a skill and want confidence it works under pressure

When NOT to use: AGENTS.md files, project-specific conventions, one-off prompt instructions.

The Core Cycle: RED → GREEN → REFACTOR

| TDD Concept | Skill Equivalent | |---|---| | Test case | Pressure scenario with subagent | | Production code | SKILL.md | | Test fails (RED) | Agent violates rule without skill | | Test passes (GREEN) | Agent complies with skill present | | Refactor | Close loopholes, maintain compliance |

PHASE 1 — RED: Write the Failing Test

HARD-GATE: Do not write any skill content until you complete this phase.

Teams that skip baseline testing consistently deploy skills with predictable, preventable failures.

Steps:

Define the skill's purpose: what behavior must it enforce? What are failure modes without it?
Create 3–5 pressure scenarios that stress-test critical constraints (see references/pressure-test-template.md)
Run scenarios WITHOUT the skill — give agents the realistic task under pressure
Document exact rationalizations verbatim: "Agent was wrong" is useless. "Agent said 'I already manually tested it, so the spirit of TDD is satisfied'" is target material
Identify patterns: which excuses repeat across scenarios?

What to record:

Scenario: [name]
Combined pressures: [list]
Exact violation: [what agent chose]
Exact rationalization (verbatim): "[quote]"

PHASE 2 — GREEN: Write the Minimal Skill

Write SKILL.md addressing the specific rationalizations documented in RED only. Do not add content for hypothetical cases you didn't observe — hypothetical content bloats the skill and gets skipped.

SKILL.md checklist:

[ ] YAML frontmatter starts on line 1 (---)
[ ] name: bare hyphen-case, matches the directory name exactly
[ ] description: starts with "Use when..." — triggering conditions ONLY, no workflow summary
[ ] Description is third-person, ≤1024 chars
[ ] Body stays lean; prefer < 400 lines and move overflow into references/ when practical
[ ] Uses persuasion principles (see table below)
[ ] HARD-GATE markers on critical stops
[ ] references/ files never nested more than one level deep

For Khuym plugin skills specifically, keep frontmatter name bare. The plugin wrapper adds the khuym: prefix when the skill is surfaced to agents.

Description trap (most common mistake): Workflow summary in description → Claude follows description instead of reading skill body. Every time.

# ❌ BAD — workflow summary
description: Use when creating skills — run baseline test, write minimal skill, run tests

# ✅ GOOD — triggering conditions only
description: Use when creating a new khuym skill or editing an existing one

Dependency metadata style: When a Khuym skill has runtime dependencies, write metadata.dependencies as a mapping keyed by dependency id. Do not use YAML arrays of objects such as - id: beads-cli; generic skill evaluators may reject that frontmatter shape.

metadata:
  dependencies:
    beads-cli:
      kind: command
      command: br
      missing_effect: unavailable
      reason: The skill reads and updates beads.

Apply persuasion principles:

| Principle | Implementation | Use For | |---|---|---| | Authority | "YOU MUST", "Never", "No exceptions" | Discipline-enforcing rules | | Commitment | Ordered checklists, announce skill usage | Multi-step processes | | Scarcity | "Before proceeding", "IMMEDIATELY after X" | Verification requirements | | Social Proof | "Teams report...", "X without Y = failure. Every time." | Common failure patterns | | Unity | "our skills", collaborative framing | Techniques, guidance |

After writing: re-run the same pressure scenarios WITH the skill. Agent must now comply. If agent still fails → skill is unclear or incomplete. Revise and re-test. Do not proceed.

PHASE 3 — REFACTOR: Close Loopholes

When an agent violates a rule despite having the skill, that is a test regression — the skill has a bug. Fix it:

Capture the new rationalization verbatim
Add explicit negation in the rule
Add entry to rationalization table in the skill
Add entry to red flags list
Re-run all scenarios — verify all still pass

Continue until no new rationalizations emerge from pressure testing.

Meta-testing technique: After an agent chooses wrong, ask:

"You read the skill and chose Option C anyway. How could the skill have been written differently to make Option A the only acceptable answer?"

Three diagnoses:

"The skill WAS clear, I chose to ignore it" → add "Violating the letter IS violating the spirit"
"The skill should have said X" → add their exact suggestion verbatim
"I didn't see section Y" → make key point more prominent, move it earlier

PHASE 4 — VALIDATE & DOCUMENT

Run validation:

python3 "${CODEX_HOME:-$HOME/.codex}/skills/.system/skill-creator/scripts/quick_validate.py" plugins/khuym/skills/<skill-name>
bash scripts/check-markdown-links.sh plugins/khuym/skills/<skill-name>/SKILL.md
bash scripts/sync-skills.sh --dry-run

If the edited skill owns a repo-local test script, run that too.

Create CREATION-LOG.md documenting the full TDD process (see references/creation-log-template.md):

Source material and extraction decisions
Pressure scenarios run and results
Rationalizations found and fixes applied
Iterations required before bulletproof

Signs the skill IS bulletproof:

Agent chooses correct option under maximum pressure
Agent cites specific skill sections as justification
Agent acknowledges temptation but follows rule
Meta-test reveals: "skill was clear, I should follow it"

Signs the skill is NOT bulletproof:

Agent finds rationalizations not addressed in the skill
Agent argues the skill itself is wrong
Agent creates "hybrid approaches" that satisfy letter but not spirit

Rationalization Table (Common Violations)

| Excuse | Reality | |---|---| | "I know this technique, testing is unnecessary" | You're testing the SKILL, not your knowledge. Agents differ from you. | | "It's so simple it can't have bugs" | Every untested skill has issues. Test takes 30 minutes. | | "Academic questions passed — that's sufficient" | Reading a skill ≠ using a skill under pressure. Test application scenarios. | | "My description summarizes the workflow so agents know what to do" | Workflow-summary descriptions cause agents to skip the skill body. Remove it. | | "This edit is minor — testing isn't needed" | The Iron Law applies to edits. No exceptions. | | "I'll test it after a few real uses" | Problems = agents misuse in production. Test BEFORE deploying. | | "The baseline is obvious, I know what failures to expect" | You know YOUR failures. Agent failures differ. Run the baseline. |

Red Flags — STOP and Run Baseline Tests

Writing skill content before creating any pressure scenarios
"I already know what agents will do"
"It's just a small addition"
"Academic questions passed, that's sufficient testing"
Description contains workflow steps or process summary
Skill addresses hypothetical scenarios not observed in baseline
Deploying without running scenarios WITH skill (no green verification)
"The skill was good last month, edits don't need testing"

All of these mean: Stop. Run baseline tests first.

References

Load when needed:

references/creation-log-template.md — CREATION-LOG.md template for documenting the TDD process
references/pressure-test-template.md — Pressure scenario templates and the 7 pressure types

Background: The TDD-for-skills methodology originates from the Superpowers framework (obra/superpowers). Persuasion research: Meincke et al. (2025), N=28,000 LLM conversations, University of Pennsylvania. Compliance methodology validated by ComplexBench, PromptAgent, and RNR studies (see research/15-tdd-skills-methodology.md for full citations).

Writing Khuym Skills

If .khuym/onboarding.json is missing or stale for the current repo, stop and invoke khuym:using-khuym before continuing.

Overview

Skills are code. They have bugs. Test them before deploying.

This is the TDD-for-skills methodology adapted from Superpowers (N=28,000 scale testing confirms persuasion-optimized skills produce 3-4× better agent compliance than plain instructions).

When to Use

Use this skill when you are about to:

Create any new skill for the khuym ecosystem
Edit an existing skill (even a small section)
Deploy a skill and want confidence it works under pressure

When NOT to use: AGENTS.md files, project-specific conventions, one-off prompt instructions.

The Core Cycle: RED → GREEN → REFACTOR

PHASE 1 — RED: Write the Failing Test

HARD-GATE: Do not write any skill content until you complete this phase.

Teams that skip baseline testing consistently deploy skills with predictable, preventable failures.

Steps:

Define the skill's purpose: what behavior must it enforce? What are failure modes without it?
Create 3–5 pressure scenarios that stress-test critical constraints (see references/pressure-test-template.md)
Run scenarios WITHOUT the skill — give agents the realistic task under pressure
Document exact rationalizations verbatim: "Agent was wrong" is useless. "Agent said 'I already manually tested it, so the spirit of TDD is satisfied'" is target material
Identify patterns: which excuses repeat across scenarios?

What to record:

Scenario: [name]
Combined pressures: [list]
Exact violation: [what agent chose]
Exact rationalization (verbatim): "[quote]"

PHASE 2 — GREEN: Write the Minimal Skill

SKILL.md checklist:

[ ] YAML frontmatter starts on line 1 (---)
[ ] name: bare hyphen-case, matches the directory name exactly
[ ] description: starts with "Use when..." — triggering conditions ONLY, no workflow summary
[ ] Description is third-person, ≤1024 chars
[ ] Body stays lean; prefer < 400 lines and move overflow into references/ when practical
[ ] Uses persuasion principles (see table below)
[ ] HARD-GATE markers on critical stops
[ ] references/ files never nested more than one level deep

For Khuym plugin skills specifically, keep frontmatter name bare. The plugin wrapper adds the khuym: prefix when the skill is surfaced to agents.

Description trap (most common mistake): Workflow summary in description → Claude follows description instead of reading skill body. Every time.

# ❌ BAD — workflow summary
description: Use when creating skills — run baseline test, write minimal skill, run tests

# ✅ GOOD — triggering conditions only
description: Use when creating a new khuym skill or editing an existing one

metadata:
  dependencies:
    beads-cli:
      kind: command
      command: br
      missing_effect: unavailable
      reason: The skill reads and updates beads.

Apply persuasion principles:

After writing: re-run the same pressure scenarios WITH the skill. Agent must now comply. If agent still fails → skill is unclear or incomplete. Revise and re-test. Do not proceed.

PHASE 3 — REFACTOR: Close Loopholes

When an agent violates a rule despite having the skill, that is a test regression — the skill has a bug. Fix it:

Capture the new rationalization verbatim
Add explicit negation in the rule
Add entry to rationalization table in the skill
Add entry to red flags list
Re-run all scenarios — verify all still pass

Continue until no new rationalizations emerge from pressure testing.

Meta-testing technique: After an agent chooses wrong, ask:

"You read the skill and chose Option C anyway. How could the skill have been written differently to make Option A the only acceptable answer?"

Three diagnoses:

"The skill WAS clear, I chose to ignore it" → add "Violating the letter IS violating the spirit"
"The skill should have said X" → add their exact suggestion verbatim
"I didn't see section Y" → make key point more prominent, move it earlier

PHASE 4 — VALIDATE & DOCUMENT

Run validation:

python3 "${CODEX_HOME:-$HOME/.codex}/skills/.system/skill-creator/scripts/quick_validate.py" plugins/khuym/skills/<skill-name>
bash scripts/check-markdown-links.sh plugins/khuym/skills/<skill-name>/SKILL.md
bash scripts/sync-skills.sh --dry-run

If the edited skill owns a repo-local test script, run that too.

Create CREATION-LOG.md documenting the full TDD process (see references/creation-log-template.md):

Source material and extraction decisions
Pressure scenarios run and results
Rationalizations found and fixes applied
Iterations required before bulletproof

Signs the skill IS bulletproof:

Agent chooses correct option under maximum pressure
Agent cites specific skill sections as justification
Agent acknowledges temptation but follows rule
Meta-test reveals: "skill was clear, I should follow it"

Signs the skill is NOT bulletproof:

Agent finds rationalizations not addressed in the skill
Agent argues the skill itself is wrong
Agent creates "hybrid approaches" that satisfy letter but not spirit

Rationalization Table (Common Violations)

Red Flags — STOP and Run Baseline Tests

Writing skill content before creating any pressure scenarios
"I already know what agents will do"
"It's just a small addition"
"Academic questions passed, that's sufficient testing"
Description contains workflow steps or process summary
Skill addresses hypothetical scenarios not observed in baseline
Deploying without running scenarios WITH skill (no green verification)
"The skill was good last month, edits don't need testing"

All of these mean: Stop. Run baseline tests first.

References

Load when needed:

references/creation-log-template.md — CREATION-LOG.md template for documenting the TDD process
references/pressure-test-template.md — Pressure scenario templates and the 7 pressure types

Adoption

hoangnb24/writing-khuym-skills

$ install --global

Security Scan Results

SKILL.md

Writing Khuym Skills

Overview

When to Use

The Core Cycle: RED → GREEN → REFACTOR

PHASE 1 — RED: Write the Failing Test

PHASE 2 — GREEN: Write the Minimal Skill

PHASE 3 — REFACTOR: Close Loopholes

PHASE 4 — VALIDATE & DOCUMENT

Rationalization Table (Common Violations)

Red Flags — STOP and Run Baseline Tests

References

Related Skills

hoangnb24/smart-commits

hoangnb24/using-khuym

hoangnb24/goal-griller

hoangnb24/concept-game-builder

hoangnb24/writing-khuym-skills

$ install --global

Security Scan Results

SKILL.md

Writing Khuym Skills

Overview

When to Use

The Core Cycle: RED → GREEN → REFACTOR

PHASE 1 — RED: Write the Failing Test

PHASE 2 — GREEN: Write the Minimal Skill

PHASE 3 — REFACTOR: Close Loopholes

PHASE 4 — VALIDATE & DOCUMENT

Rationalization Table (Common Violations)

Red Flags — STOP and Run Baseline Tests

References

Related Skills

hoangnb24/smart-commits

hoangnb24/using-khuym

hoangnb24/goal-griller

hoangnb24/concept-game-builder