whip/skills/whip-simulate/SKILL.md
Run multi-agent simulations to measure consistency of non-deterministic behavior. Use when the user wants to A/B test, validate behavioral equivalence, or stress-test outputs at scale.
npx skillsauth add bang9/ai-tools whip-simulate
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Run multi-agent simulations from a user-provided scenario. Concretize the scenario into test cases, spawn agents, and analyze output patterns for consistency.
Extract from $ARGUMENTS:
- --runs N: number of simulation runs (default: 5)
- --agent: use inline mode (see Execution Mode below)

$ARGUMENTS determines which dispatch mode this skill uses. The two modes are mutually exclusive:
| Mode | Activates when | Dispatch mechanism |
|------|---------------|-------------------|
| Tracked (default) | --agent is absent from $ARGUMENTS | /whip-start Team Flow — IRC, workspace, polling |
| Inline | --agent is present in $ARGUMENTS | Agent tool directly — no whip, no IRC, no lifecycle |
Strict rules:
- No --agent in arguments → tracked mode. No exceptions, no inference.
- --agent in arguments → inline mode. /whip-start, IRC, and lifecycle steps are all skipped.
- A --backend specification (e.g., user says "use codex") → implies tracked mode. Backend selection is a whip concept and is incompatible with --agent.
- Never infer --agent from task simplicity, speed preference, or any other heuristic. The flag must be explicitly present in the user's input.

If running inside an active whip workspace, use whip workspace view <workspace-name> to get the worktree path for reading code artifacts referenced in the scenario. In tracked mode, simulation tasks go in the global workspace (ephemeral; do not pollute the active workspace).
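The flag rules above can be sketched as a small parser. This is an illustrative sketch only: the function name `parse_simulate_args` and the returned dictionary shape are hypothetical, and it handles only the literal --runs and --agent flags, not a backend named in free text.

```python
def parse_simulate_args(arguments: str) -> dict:
    """Derive run count and dispatch mode from a raw $ARGUMENTS string.

    Hypothetical helper: the name and return shape are assumptions made
    for illustration, not part of the skill itself.
    """
    tokens = arguments.split()
    runs = 5            # documented default for --runs
    mode = "tracked"    # default whenever --agent is absent
    scenario = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token == "--runs" and i + 1 < len(tokens):
            runs = int(tokens[i + 1])
            i += 2
        elif token == "--agent":
            # Inline mode only when the flag is literally present.
            mode = "inline"
            i += 1
        else:
            scenario.append(token)
            i += 1
    return {"runs": runs, "mode": mode, "scenario": " ".join(scenario)}
```

The mode is never inferred from the scenario text, which mirrors the rule that the flag must be explicitly present in the user's input.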
Read any files, git refs, or codebase artifacts referenced in the scenario, then transform it into concrete test cases:
| Field | Description |
|-------|-------------|
| Name | Short identifier (e.g., deprecated-move-1) |
| Setup | Context the agent receives (file contents, code, instructions) |
| Action | What the agent executes |
| Output contract | Structured format the agent must produce |
The output contract is critical — all agents must produce the same structure so results are mechanically comparable:
### Result
- pattern: [short label for the approach taken]
- output:
[code block, JSON, or other structured output]
- decisions: [key judgment calls made]
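To show why a fixed contract makes results mechanically comparable, here is a minimal sketch of a parser for the Result block above. The function name `parse_result` is hypothetical, and the sketch assumes each field starts on its own line with the exact `- pattern:`, `- output:`, or `- decisions:` prefix.

```python
def parse_result(text: str) -> dict:
    """Split an agent's Result block into its three contract fields.

    Hypothetical helper for illustration; it assumes the field prefixes
    appear exactly as in the output contract.
    """
    fields = {"pattern": "", "output": "", "decisions": ""}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        matched = False
        for key in fields:
            prefix = f"- {key}:"
            if stripped.startswith(prefix):
                current = key
                fields[key] = stripped[len(prefix):].strip()
                matched = True
                break
        if not matched and current is not None:
            # Continuation line (for example the body of the output field).
            fields[current] = (fields[current] + "\n" + line).strip("\n")
    return {key: value.strip() for key, value in fields.items()}
```

Once every run is reduced to the same three fields, the pattern labels can be compared directly across runs.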
For A/B comparisons, choose a strategy:
| Strategy | When to use | Agent count |
|----------|-------------|-------------|
| Sequential | Outputs are structured (code, configs); one agent runs A then B | N |
| Isolated | Outputs involve judgment or prose; separate agents per version | 2N |
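The agent counts in the strategy table can be computed up front when sizing a test plan. A minimal sketch, assuming a hypothetical `agent_count` helper and lowercase strategy names:

```python
def agent_count(strategy: str, runs: int) -> int:
    """Return how many agents an A/B comparison needs for `runs` runs.

    Hypothetical helper: sequential reuses one agent for both versions,
    isolated spawns a separate agent per version.
    """
    if strategy == "sequential":
        return runs
    if strategy == "isolated":
        return 2 * runs
    raise ValueError(f"unknown strategy: {strategy!r}")
```

For example, with the default of 5 runs, the isolated strategy needs 10 agents.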
Present the test plan including:
Wait for user approval before executing.
Hand off dispatch to /whip-start. Prepare one task spec per simulation run and let /whip-start handle IRC, creation, assignment, and monitoring.
Each simulation run becomes one task:
- sim-{test-case}-{run}
- global
- easy

After all tasks complete, collect outputs and proceed to analysis.
Inline mode (--agent): spawn one Agent tool call per run, named sim-{test-case}-{run}.
Each prompt must be self-contained — embed all context inline, not file paths:
Batching:
- run_in_background: true

Classify outputs into patterns:
## Simulation Report
### Consistency: X/N (Y%)
### Output Patterns
| Pattern | Count | Runs | Description |
|---------|-------|------|-------------|
| A | 8 | #1-6,#8,#10 | [dominant behavior] |
| B | 2 | #7,#9 | [variant behavior] |
### Divergence Analysis
For each non-dominant pattern:
- Runs: [list]
- Root cause: [why]
- Severity: cosmetic | functional | breaking
- Diff from dominant: [key differences]
### Summary
- Total: N runs across M test cases
- Dominant pattern: A (X%)
- Key findings: ...
- Recommendation: [if applicable]
Save the full report with raw agent outputs to /tmp/simulate-{slug}-{timestamp}.md and tell the user the path.
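The Consistency line and the Output Patterns table can be derived from the per-run pattern labels. The sketch below assumes that consistency means the share of runs that fall in the dominant pattern; that reading, and the helper name `summarize_patterns`, are assumptions rather than something the skill states.

```python
from collections import Counter


def summarize_patterns(labels: list[str]) -> dict:
    """Summarize pattern labels, one label per run, in run order.

    Hypothetical helper: treats consistency as dominant count over total.
    """
    if not labels:
        raise ValueError("at least one run is required")
    counts = Counter(labels)
    dominant, dominant_count = counts.most_common(1)[0]
    total = len(labels)
    runs_by_pattern: dict[str, list[int]] = {}
    for run_number, label in enumerate(labels, start=1):
        runs_by_pattern.setdefault(label, []).append(run_number)
    percent = round(100 * dominant_count / total)
    return {
        "dominant": dominant,
        "consistency": f"{dominant_count}/{total} ({percent}%)",
        "runs": runs_by_pattern,
    }
```

Fed the ten labels from the example table (A for runs 1-6, 8, and 10; B for runs 7 and 9), this reports A as dominant with a consistency of 8/10 (80%).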
- global workspace and delegate dispatch to /whip-start
- whip task clean