whip/skills-codex/whip-simulate/SKILL.md
Run multi-agent simulations to measure output consistency. Use when you want to A/B test, validate behavioral equivalence, or stress-test non-deterministic behavior at scale.
npx skillsauth add bang9/ai-tools whip-simulate

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Use $whip-simulate <scenario> to run multi-agent simulations from a user-provided scenario. Concretize the scenario into test cases, execute each run in whip mode or agent mode, and analyze output patterns for consistency.
You are a simulation lead — you turn vague "run it a few times" ideas into controlled experiments with disciplined inputs and comparable outputs. You care about explicit output contracts, honest analysis, and clean evidence. If the setup is fuzzy, tighten it before spending runs.
Extract from $ARGUMENTS:
- --runs N: number of simulation runs (default: 5)
- --agent: use inline mode (see Execution Mode below)

$ARGUMENTS determines which dispatch mode this skill uses. The two modes are mutually exclusive:
| Mode | Activates when | Dispatch mechanism |
|------|---------------|-------------------|
| Tracked (default) | --agent is absent from $ARGUMENTS | $whip-start Team Flow — IRC, workspace, polling |
| Inline | --agent is present in $ARGUMENTS | Agent tool directly — no whip, no IRC, no lifecycle |
Strict rules:
- No --agent in arguments → tracked mode. No exceptions, no inference.
- --agent in arguments → inline mode. $whip-start, IRC, and lifecycle steps are all skipped.
- A --backend specification (e.g., the user says "use codex") → implies tracked mode. Backend selection is a whip concept and is incompatible with --agent.
- Never infer --agent from task simplicity, speed preference, or any other heuristic. The flag must be explicitly present in the user's input.

If running inside an active whip workspace, use whip workspace view <workspace-name> to get the worktree path for reading code artifacts referenced in the scenario. In tracked mode, simulation tasks go in the global workspace (ephemeral; do not pollute the active workspace).
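The mode rules above can be sketched as a small argument parser. This is only an illustration: `SimulateArgs` and `parse_arguments` are hypothetical names, the `--backend <name>` token form is an assumption, and rejecting `--agent` together with `--backend` is one possible reading of "incompatible".

```python
from dataclasses import dataclass


@dataclass
class SimulateArgs:
    scenario: str
    runs: int = 5          # documented default
    mode: str = "tracked"  # tracked unless --agent is explicitly present


def parse_arguments(raw: str) -> SimulateArgs:
    """Split $ARGUMENTS into scenario text, run count, and dispatch mode."""
    tokens = raw.split()
    runs, agent, backend = 5, False, None
    scenario_parts = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "--agent":
            agent = True
        elif tok in ("--runs", "--backend"):
            i += 1
            if i >= len(tokens):
                raise ValueError(f"{tok} requires a value")
            if tok == "--runs":
                runs = int(tokens[i])
            else:
                backend = tokens[i]  # assumed flag form
        else:
            scenario_parts.append(tok)
        i += 1
    if runs < 1:
        raise ValueError("--runs must be at least 1")
    if agent and backend is not None:
        # Assumption: treat the documented incompatibility as a hard error.
        raise ValueError("--backend is a whip concept and cannot be combined with --agent")
    # The mode depends only on the explicit flag, never on heuristics.
    mode = "inline" if agent else "tracked"
    return SimulateArgs(scenario=" ".join(scenario_parts), runs=runs, mode=mode)
```

The mode is derived from the flag alone, so a simple or fast-looking scenario still runs in tracked mode unless the user typed --agent.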
Read any files, git refs, or codebase artifacts referenced in the scenario, then transform the request into concrete test cases:
| Field | Description |
|-------|-------------|
| Name | Short identifier (for example deprecated-move-1) |
| Setup | Context the simulation run receives |
| Action | What the simulation run executes |
| Output contract | Structured format the simulation run must produce |
The output contract is critical — every run must produce the same section layout and the same payload type so results are mechanically comparable.
Use an explicit contract such as:
### Result
- pattern: [short label for the approach taken]
- output_format: [json | markdown | text | code]
- output:
[the primary artifact in the declared format]
- decisions: [key judgment calls made]
For A/B comparisons, choose a strategy:
| Strategy | When to use | Run count |
|----------|-------------|-----------|
| Sequential | Outputs are structured (code, configs); one run executes A then B | N |
| Isolated | Outputs involve judgment or prose; separate runs per version | 2N |
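The run counts for the two strategies can be illustrated with a short sketch. `plan_runs` is a hypothetical helper; the 1-based run numbers and the order in which isolated runs are listed are assumptions made for the example.

```python
def plan_runs(strategy: str, n: int) -> list:
    """Return (run_number, versions) pairs for an A/B comparison with N = n."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if strategy == "sequential":
        # One run executes A then B, so N runs in total.
        return [(i, ("A", "B")) for i in range(1, n + 1)]
    if strategy == "isolated":
        # Separate runs per version, so 2N runs in total.
        plan = []
        for version in ("A", "B"):
            for _ in range(n):
                plan.append((len(plan) + 1, (version,)))
        return plan
    raise ValueError(f"unknown strategy: {strategy!r}")
```

With the default of 5 runs, the sequential plan has 5 entries and the isolated plan has 10, which is worth stating in the test plan before asking for approval.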
Present the test plan including:
DO NOT execute anything before the user approves the test plan.
Hand off dispatch to $whip-start. Prepare one task spec per simulation run and let $whip-start handle IRC, creation, assignment, and monitoring.
Each simulation run becomes one task:
- sim-{test-case}-{run}
- global
- easy

After all tasks complete, collect outputs and proceed to analysis.
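As a rough sketch of the one-task-per-run rule, the task list can be generated from the test cases and the run count. `build_task_specs` and the dictionary keys are hypothetical; only the sim-{test-case}-{run} name pattern and the global workspace come from the text, and run numbers are assumed to start at 1.

```python
def build_task_specs(test_cases: list, runs: int) -> list:
    """Create one task spec per (test case, run) pair."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    specs = []
    for case in test_cases:
        for run in range(1, runs + 1):
            specs.append({
                "name": f"sim-{case}-{run}",  # name pattern from the text
                "workspace": "global",        # simulation tasks stay out of the active workspace
            })
    return specs
```

For two test cases and three runs this yields six specs, from sim-<first-case>-1 through sim-<second-case>-3.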
Inline mode (--agent)

Issue one spawn_agent call per run. Keep a local ledger mapping sim-{test-case}-{run} to the returned agent id.
Each prompt must be self-contained — embed all context inline, not file paths:
Dispatch:
- agent_type: default, fork_context: false
- wait on each batch before launching the next
- unclassifiable
- send_input

Classify outputs into patterns:

- A, B, C, ...).
- unclassifiable.

Produce the final report in this shape:
## Simulation Report
### Consistency: X/N (Y%)
### Output Patterns
| Pattern | Count | Runs | Description |
|---------|-------|------|-------------|
| A | 8 | #1-6,#8,#10 | [dominant behavior] |
| B | 2 | #7,#9 | [variant behavior] |
### Divergence Analysis
For each non-dominant pattern:
- Runs: [list]
- Root cause: [why]
- Severity: cosmetic | functional | breaking
- Diff from dominant: [key differences]
### Summary
- Total: N runs across M test cases
- Dominant pattern: A (X%)
- Key findings: ...
- Recommendation: [if applicable]
Save the full report with raw run outputs to /tmp/simulate-{slug}-{timestamp}.md and tell the user the path.
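The consistency line and the pattern table can be derived mechanically once each run has a pattern label. The sketch below makes two assumptions that the text does not spell out: X is the number of runs in the dominant pattern, and unclassifiable runs count toward N but can never be dominant. `format_runs` and `summarize` are hypothetical helper names.

```python
def format_runs(runs: list) -> str:
    """Compress run numbers into the report style, e.g. '#1-6,#8,#10'."""
    if not runs:
        return ""
    runs = sorted(runs)
    spans = []
    start = prev = runs[0]
    for r in runs[1:]:
        if r == prev + 1:
            prev = r
            continue
        spans.append((start, prev))
        start = prev = r
    spans.append((start, prev))
    return ",".join(f"#{a}" if a == b else f"#{a}-{b}" for a, b in spans)


def summarize(labels: dict) -> dict:
    """Build the consistency line and table rows from {run_number: pattern}."""
    total = len(labels)
    if total == 0:
        raise ValueError("no runs to summarize")
    by_pattern = {}
    for run, label in labels.items():
        by_pattern.setdefault(label, []).append(run)

    # Assumption: 'unclassifiable' is never the dominant pattern.
    classified = {k: v for k, v in by_pattern.items() if k != "unclassifiable"}
    if classified:
        dominant = max(classified, key=lambda k: (len(classified[k]), -min(classified[k])))
        x = len(classified[dominant])
    else:
        dominant, x = None, 0

    rows = [
        (label, len(runs), format_runs(runs))
        for label, runs in sorted(by_pattern.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    ]
    return {
        "consistency": f"{x}/{total} ({round(100 * x / total)}%)",
        "dominant": dominant,
        "rows": rows,
    }
```

Feeding in the ten runs from the example table (eight labeled A, two labeled B) gives a consistency of 8/10 (80%) and the same two rows.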
- global workspace and delegate dispatch to $whip-start.
- whip task clean