skills/gm-stress-test/SKILL.md
Automated GM consistency, calibration, and line enforcement testing. Dispatches chaos player agents and GM subagents against a scenario bank, then audits rulings. Never invoke during play sessions.
You are the orchestrator for a Daggerheart GM stress test. You dispatch chaos player agents to generate character-appropriate absurd responses, then dispatch paired GM subagents to rule on them, audit the results across three passes, and write a report. Follow every step below in exact order.
Before doing anything else, confirm these files exist:
- D:\Daggerheart\testing\scenarios.md: count all situation blocks (headings matching ### S-{NN}).
- D:\Daggerheart\testing\characters\*.md: this must return 6 character sheet files.

If either is missing or the character count is wrong, stop and tell the user what is missing. Do not proceed. The situation count is dynamic; store it as N.
Read D:\Daggerheart\testing\scenarios.md. Parse every situation block. Each block looks like:
### S-01: The Rusted Gate
{2-4 sentences of situation description}
For each situation, extract and store:
- id: e.g. S-01
- title: e.g. The Rusted Gate
- situation: the full description text following the heading

Confirm you parsed N situations. List their IDs and titles to the user.
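The parsing step above can be sketched in Python as follows. This is a minimal illustration rather than the skill's actual implementation: the function name `parse_situations` and the sample text are hypothetical, and the regular expression assumes every heading has exactly the shape shown in the example (`### S-01: The Rusted Gate`).

```python
import re

# Heading shape taken from the example block: "### S-01: The Rusted Gate".
HEADING = re.compile(r"^### (S-\d+): (.+)$", re.MULTILINE)


def parse_situations(markdown_text: str) -> list[dict]:
    """Split scenarios.md content into {id, title, situation} records."""
    matches = list(HEADING.finditer(markdown_text))
    situations = []
    for i, match in enumerate(matches):
        # The description runs until the next situation heading or end of file.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown_text)
        situations.append({
            "id": match.group(1),
            "title": match.group(2).strip(),
            "situation": markdown_text[match.end():end].strip(),
        })
    return situations


# Made-up sample content, for illustration only.
sample = """### S-01: The Rusted Gate
A gate blocks the road.

### S-02: The Flooded Crypt
Water rises fast.
"""

parsed = parse_situations(sample)
for s in parsed:
    print(s["id"], s["title"])
```

The length of the returned list is the value to store as N.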
Read every file matched by D:\Daggerheart\testing\characters\*.md. For each file, store:
- character_name: derived from the filename (strip the .md extension)
- character_sheet: the full file contents

Confirm you loaded exactly 6 characters. List their names to the user.
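As a rough sketch of the loading step, assuming the sheets are UTF-8 text files: the helper name `load_characters` is hypothetical, and the directory is passed in as a parameter so the same code can be pointed at D:\Daggerheart\testing\characters or at a test folder.

```python
from pathlib import Path

EXPECTED_CHARACTERS = 6  # count required by the pre-flight check


def load_characters(directory: Path) -> dict[str, str]:
    """Map character_name (filename without .md) to the full sheet text."""
    sheets = {
        path.stem: path.read_text(encoding="utf-8")
        for path in sorted(directory.glob("*.md"))
    }
    if len(sheets) != EXPECTED_CHARACTERS:
        # Mirrors the instruction to stop when the character count is wrong.
        raise ValueError(
            f"expected {EXPECTED_CHARACTERS} character sheets, found {len(sheets)}"
        )
    return sheets
```

Raising on a wrong count keeps the "stop and tell the user" rule in one place instead of relying on a later step to notice.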
Read C:\Users\McDave\.claude\skills\daggerheart-gm\SKILL.md and store its full content. This is the GM behavioral spine — it must be prepended to every GM subagent prompt. Do NOT include it in chaos agent prompts.
For each situation, randomly assign a Fear context that the GM agents will use when ruling. Use Python RNG (not LLM generation) for all random values.
For each situation, generate:
- [Minor, Standard, Major]
- [1, 2]
- [2, 3]

Store as (situation_id, fear_level, scene_importance, fear_budget).
Print the Fear assignments to the user:
Fear state assigned:
S-01: Fear 7, Standard scene, budget 2
S-02: Fear 3, Major scene, budget 3
...
Both Run A and Run B for the same situation receive the same Fear state. The A/B test isolates reasoning variance, not input variance.
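One way to do the assignment with Python's RNG is sketched below. The function name `assign_fear`, the seed, the Fear range of 1 to 12 (inferred only from the report's low/mid/high brackets), and the per-scene budget options are all assumptions made for illustration; substitute the skill's real ranges.

```python
import random

SCENE_IMPORTANCE = ["Minor", "Standard", "Major"]


def assign_fear(situation_ids, fear_range, budget_options, seed=None):
    """Return {situation_id: (fear_level, scene_importance, fear_budget)}.

    fear_range and budget_options are supplied by the caller because this
    sketch does not know the skill's real values.
    """
    rng = random.Random(seed)
    assignments = {}
    for sid in situation_ids:
        fear_level = rng.randint(*fear_range)  # inclusive on both ends
        scene = rng.choice(SCENE_IMPORTANCE)
        budget = rng.choice(budget_options[scene])
        assignments[sid] = (fear_level, scene, budget)
    return assignments


# Hypothetical ranges, for illustration only.
fear_state = assign_fear(
    ["S-01", "S-02"],
    fear_range=(1, 12),
    budget_options={"Minor": [1, 2], "Standard": [2, 3], "Major": [2, 3]},
    seed=42,
)
print("Fear state assigned:")
for sid, (level, scene, budget) in fear_state.items():
    print(f"{sid}: Fear {level}, {scene} scene, budget {budget}")
```

Because the result is keyed by situation only, Run A and Run B read the same tuple, which is what keeps the A/B comparison free of input variance.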
The test runs N situations × 6 characters × 2 GM runs = 12N total rulings, dispatched in 2 parallel waves. Chaos agents handle one situation, all 6 characters. GM agents handle one situation, 3 characters (two batches per run to prevent DC compression). Total agent count: 5N (one chaos + two GM Run A + two GM Run B per situation).
Wave 1 — Chaos (N agents in parallel)
For each situation: 1 agent produces all 6 chaos responses
Parse all chaos responses after wave completes
Wave 2 — GM Rulings (4N agents in parallel)
For each situation: 2 agents produce 3 Run A rulings each (characters 1-3, 4-6)
For each situation: 2 agents produce 3 Run B rulings each (characters 1-3, 4-6)
Run A and Run B agents are separate instances
Batch 1 and Batch 2 agents for the same run are separate instances
Parse all GM rulings after wave completes
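The wave arithmetic above can be checked with a small planning sketch. The function `plan_waves` and the job dictionaries are hypothetical bookkeeping, not part of the Agent tool; the sketch only enumerates which agents would be dispatched.

```python
from itertools import product


def plan_waves(situation_ids, character_names):
    """Enumerate the Wave 1 and Wave 2 agent jobs for the given inputs."""
    if len(character_names) != 6:
        raise ValueError("the test expects exactly 6 characters")
    # Fixed groups: characters 1-3 form batch 1, characters 4-6 form batch 2.
    batches = {1: character_names[:3], 2: character_names[3:]}
    wave1 = [
        {"agent": "chaos", "situation": sid, "characters": list(character_names)}
        for sid in situation_ids
    ]
    wave2 = [
        {"agent": "gm", "situation": sid, "run": run, "batch": batch,
         "characters": batches[batch]}
        for sid, run, batch in product(situation_ids, ("A", "B"), (1, 2))
    ]
    return wave1, wave2


wave1, wave2 = plan_waves(["S-01", "S-02", "S-03"], ["c1", "c2", "c3", "c4", "c5", "c6"])
print(len(wave1), "chaos agents,", len(wave2), "GM agents")
print(sum(len(job["characters"]) for job in wave2), "rulings expected")
```

With N = 3 this gives 3 chaos agents, 12 GM agents (15 = 5N in total), and 36 = 12N rulings.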
Dispatch N agents in parallel, one per situation. Each agent receives all 6 character summaries and the situation. Use the Agent tool with this prompt:
You are running 6 chaos player agents in sequence. For each character,
generate one absurd, game-breaking response using their ACTUAL
abilities, traits, experiences, or personality. You are stress-testing
a GM's judgment.
Types of chaos you can generate:
- ability_abuse: use a real ability in a way it wasn't designed for
- physics_push: attempt something at the edge of physical possibility
- backstory_claim: invent a past event that justifies a current ability
- retcon: declare something about the character that was never established
- world_claim: declare something about the world the GM hasn't established
- social_absurdity: attempt a social action that's wildly inappropriate
- grotesque: do something physically repulsive or disturbing
- scale_violation: attempt something wildly out of proportion to the situation
## Situation
{situation}
## Characters
**1. {char1_name}** — {one-line summary: ancestry, class, level, traits, experiences, key abilities, weapons, inventory highlights, personality}
**2. {char2_name}** — {same format}
... (all 6 characters)
## Output Format (repeat for each character)
CHARACTER: {name}
RESPONSE: "{what they say or do}"
TAG: {one tag from the list above}
Character summaries must be concise (3-4 lines each) but include all mechanically relevant details: traits with modifiers, experience names with modifiers, ability names with brief descriptions, weapon names, and inventory items that could be exploited. Prepare these summaries once in Step 2 and reuse them across all chaos agents.
After each chaos agent returns, extract 6 entries. For each:
- chaos_response: the text inside the quotes after RESPONSE:
- chaos_tag: the tag after TAG:

If a character's output cannot be parsed from a consolidated agent, log the failure for that character and skip its GM runs. Other characters from the same agent are unaffected.
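A minimal sketch of this extraction, assuming each RESPONSE fits on one line and the agent follows the output format exactly. The function name `parse_chaos_output`, the character names, and the sample output are invented for illustration.

```python
import re

VALID_TAGS = {
    "ability_abuse", "physics_push", "backstory_claim", "retcon",
    "world_claim", "social_absurdity", "grotesque", "scale_violation",
}


def parse_chaos_output(raw, expected_names):
    """Return (entries, failures) for one consolidated chaos agent output."""
    # Split the output into one chunk per "CHARACTER:" line.
    chunks = re.split(r"^CHARACTER:", raw, flags=re.MULTILINE)[1:]
    by_name = {}
    for chunk in chunks:
        name, _, rest = chunk.partition("\n")
        by_name[name.strip()] = rest

    entries, failures = {}, []
    for name in expected_names:
        rest = by_name.get(name, "")
        response = re.search(r'^RESPONSE:\s*"(.*)"\s*$', rest, flags=re.MULTILINE)
        tag = re.search(r"^TAG:\s*(\w+)\s*$", rest, flags=re.MULTILINE)
        if response and tag and tag.group(1) in VALID_TAGS:
            entries[name] = {
                "chaos_response": response.group(1),
                "chaos_tag": tag.group(1),
            }
        else:
            # One bad character does not affect the others from the same agent.
            failures.append({"error": True, "agent": "chaos",
                             "character": name, "raw_output": rest.strip()})
    return entries, failures


# Invented sample output: the second entry is missing its quotes on purpose.
raw_output = """CHARACTER: Brakka
RESPONSE: "I eat the gate."
TAG: grotesque

CHARACTER: Sel
RESPONSE: I forgot the quotes
TAG: retcon
"""

entries, failures = parse_chaos_output(raw_output, ["Brakka", "Sel"])
print(entries)
print([f["character"] for f in failures])
```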
Dispatch 4N agents in parallel: two Run A agents and two Run B agents per situation. Each agent receives 3 character responses (characters 1-3 or 4-6) for that situation and produces 3 rulings. For a given situation and batch, the Run A and Run B agents receive identical prompts, and the same template is used for every situation. The chaos tags are NEVER included in the GM prompt.
Batching: Split the 6 characters into two fixed groups: characters 1-3 (batch 1) and characters 4-6 (batch 2). Each situation produces 4 GM agents: Run A batch 1, Run A batch 2, Run B batch 1, Run B batch 2.
# DH GM Rulings — Batch
You are a Daggerheart Game Master making rulings on 3 player actions.
You are NOT running a session. Each ruling is independent.
IMPORTANT — FEATURE EXISTENCE: The character sheet is the source of truth
for whether a feature exists. If a feature is listed on the sheet, it exists.
Do NOT deny a feature because RAG did not return it. Query RAG for rules on
HOW features work and for judgment frameworks, not WHETHER features exist.
IMPORTANT: Before ruling, use the RAG tool (mcp__daggerheart-rag__query)
with scope="system" to look up:
1. Player action judgment framework (query: "player action vs GM world authority reframing invalid mechanism")
2. Trait selection decision tree (query: "trait selection agility finesse instinct knowledge strength")
3. Difficulty benchmarks for the relevant trait
Do not skip these queries. They are required for every batch.
## Roll Gate
Before narrating ANY uncertain outcome, STOP.
- Acting under pressure, risk, or opposition? → Roll.
- Could this go more than one way? → Roll.
- Would failure change the fiction? → Roll.
## Situation
{situation}
## Fear State
Current Fear: {fear_level} | Scene: {scene_importance} | Budget: {fear_budget}
Use this when judging outcomes. Higher Fear means more aggressive complications
on Fear results. The budget caps how many Fear-funded GM moves this scene supports.
Factor Fear into your OUTCOMES — especially SUCCESS_FEAR and FAILURE_FEAR.
For EACH character below, produce a ruling in this format:
CHARACTER: {name}
ROLL: [yes / no]
REASON: [no-roll: physics_violation / no_meaningful_failure / trivial / preparation / retcon / world_claim] [roll: why uncertain]
ALTERNATIVES: [if no-roll, list 2-3 valid actions the character could take instead, per the Reframing Protocol]
TRAIT: [if roll] | DC: [if roll] | ADVANTAGE_DISADVANTAGE: [if roll]
AD_REASONING: [if roll]
EXPERIENCE_APPLICABLE: [if roll]
TRIGGERS_APPLICABLE: [if roll, list any character triggers that would fire on roll outcomes, or "none"]
OUTCOMES (if roll): CRITICAL / SUCCESS_HOPE / SUCCESS_FEAR / FAILURE_HOPE / FAILURE_FEAR [one sentence each, must reference applicable triggers]
---
**1. {char1_name}** ({class} L{level}, {key traits and abilities})
Triggers: {compact trigger list from Trigger Quick-Reference, e.g. "Fail w/ Fear→Hope; Severe dmg→mark Stress reduce severity; dmg dice→reroll 1s/2s"}
RESPONSE: "{chaos_response_1}"
**2. {char2_name}** ({class} L{level}, {key traits and abilities})
Triggers: {same format}
RESPONSE: "{chaos_response_2}"
... (3 characters per batch with their chaos responses)
After each GM agent returns, extract 3 ruling blocks. For each, parse:
- roll: "yes" or "no"
- reason: the REASON value
- trait: the TRAIT value (null if no-roll)
- dc: the DC value as a number (null if no-roll)
- advantage_disadvantage: "none", "advantage", or "disadvantage" (null if no-roll)
- ad_reasoning: the AD_REASONING value (null if no-roll)
- experience_applicable: the EXPERIENCE_APPLICABLE value (null if no-roll)
- triggers_applicable: the TRIGGERS_APPLICABLE value (null if no-roll)
- outcomes: object with critical, success_hope, success_fear, failure_hope, failure_fear (null if no-roll)

Store all results keyed by (situation_id, character_name, run, chaos_response, chaos_tag). Also store the Fear state (fear_level, scene_importance, fear_budget) alongside each situation's results for audit use.
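The field extraction for a single ruling block could look roughly like the sketch below. It assumes the GM agent reproduces the labels from the ruling format verbatim and puts TRAIT, DC, and ADVANTAGE_DISADVANTAGE on one pipe-separated line. The function name `parse_ruling` and the sample block are hypothetical, and OUTCOMES parsing is left out because its exact layout is not pinned down by the format.

```python
import re


def parse_ruling(block: str) -> dict:
    """Parse one ruling block into the stored fields (outcomes omitted)."""

    def field(label):
        # A label starts a line or follows a "|" separator on the TRAIT line.
        match = re.search(rf"(?:^|\|)\s*{label}:\s*([^|\n]*)", block, flags=re.MULTILINE)
        return match.group(1).strip() if match else None

    roll = (field("ROLL") or "").lower()
    if roll not in ("yes", "no"):
        raise ValueError("unparseable ROLL field")

    ruling = {"character": field("CHARACTER"), "roll": roll, "reason": field("REASON")}
    if roll == "yes":
        dc_digits = re.search(r"\d+", field("DC") or "")
        ruling.update(
            trait=field("TRAIT"),
            dc=int(dc_digits.group()) if dc_digits else None,
            advantage_disadvantage=(field("ADVANTAGE_DISADVANTAGE") or "none").lower(),
            ad_reasoning=field("AD_REASONING"),
            experience_applicable=field("EXPERIENCE_APPLICABLE"),
            triggers_applicable=field("TRIGGERS_APPLICABLE"),
        )
    else:
        # No-roll rulings store null for every roll-only field.
        ruling.update(trait=None, dc=None, advantage_disadvantage=None,
                      ad_reasoning=None, experience_applicable=None,
                      triggers_applicable=None)
    return ruling


# Invented sample ruling, for illustration only.
sample_block = """CHARACTER: Brakka
ROLL: yes
REASON: the gate is rusted but heavy, so the outcome is uncertain
TRAIT: Strength | DC: 15 | ADVANTAGE_DISADVANTAGE: none
AD_REASONING: no relevant help or hindrance
EXPERIENCE_APPLICABLE: none
TRIGGERS_APPLICABLE: none
"""

ruling = parse_ruling(sample_block)
print(ruling["trait"], ruling["dc"])
```

A ValueError here is the point at which the orchestrator would record a GM parse failure for that character.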
If a chaos agent returns output where one or more characters cannot be parsed:

- { error: true, agent: "chaos", character: name, raw_output: <relevant section> }.

If a GM agent returns output where one or more characters cannot be parsed:

- { error: true, agent: "gm", run: "A" or "B", character: name }.

After Wave 1 completes, print:
Wave 1 complete — {N} chaos agents returned, {6N} responses parsed ({failures} failures)
After Wave 2 completes, print:
Wave 2 complete — {4N} GM agents returned, {12N} rulings parsed ({failures} failures)
Once all rulings are collected, run three audit passes. All audit logic runs here in the orchestrator — do NOT dispatch subagents for auditing.
For each (situation, character) pair, compare Run A and Run B. Flag if ANY of:
- Run A says ROLL: yes and Run B says ROLL: no, or vice versa.

Record each flag as:
{
type: "self_consistency",
situation_id,
character_name,
field: "roll" | "trait" | "dc" | "reason",
run_a_value,
run_b_value
}
Skip pairs where either Run A or Run B is an error ruling, or where the chaos agent failed.
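A sketch of the Pass 1 comparison follows. Only the roll/no-roll mismatch rule is stated explicitly above; the trait, DC, and reason comparisons, the `dc_tolerance` parameter, and the function name `self_consistency_flags` are assumptions made to cover the remaining `field` values in the flag record. The sketch also assumes `dc` is an integer whenever `roll` is "yes".

```python
def self_consistency_flags(situation_id, character_name, run_a, run_b, dc_tolerance=0):
    """Compare one Run A / Run B pair and return a list of flag records."""
    # Error rulings are skipped, as the audit rules require.
    if run_a.get("error") or run_b.get("error"):
        return []

    def make(field):
        return {
            "type": "self_consistency",
            "situation_id": situation_id,
            "character_name": character_name,
            "field": field,
            "run_a_value": run_a[field],
            "run_b_value": run_b[field],
        }

    if run_a["roll"] != run_b["roll"]:
        # The other fields are not comparable when only one run rolled.
        return [make("roll")]

    flags = []
    if run_a["roll"] == "yes":
        if run_a["trait"] != run_b["trait"]:
            flags.append(make("trait"))
        # dc_tolerance is a hypothetical knob: 0 flags any DC difference.
        if abs(run_a["dc"] - run_b["dc"]) > dc_tolerance:
            flags.append(make("dc"))
    elif run_a["reason"] != run_b["reason"]:
        flags.append(make("reason"))
    return flags


a = {"roll": "yes", "reason": "uncertain", "trait": "Strength", "dc": 15}
b = {"roll": "yes", "reason": "uncertain", "trait": "Agility", "dc": 18}
for flag in self_consistency_flags("S-01", "Brakka", a, b):
    print(flag["field"], flag["run_a_value"], flag["run_b_value"])
```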
Group all Run A rulings by chaos tag. For each tag, apply these rules:
Also flag: Experience recognition failures — where the chaos response leverages a character Experience but the GM ruling says EXPERIENCE_APPLICABLE: none.
Also flag: Trigger recognition failures — where a character's Trigger Quick-Reference lists a trigger that would fire on a given outcome (e.g., Courage fires on Failure with Fear) but the GM ruling's TRIGGERS_APPLICABLE says "none" or omits it, or the OUTCOMES text doesn't reference the trigger effect.
Record each flag as:
{
type: "tag_analysis",
chaos_tag,
situation_id,
character_name,
flag: "unexpected_roll" | "unexpected_refusal" | "low_dc" | "blocked" | "inconsistent_treatment" | "experience_miss" | "trigger_miss",
details: <string describing the specific issue>
}
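The mechanical part of Pass 2 can be sketched from the expectations stated in the report template: retcon and world_claim rulings should all be no-roll, and backstory_claim rulings should not be refused or given a DC below 15. The function name `tag_flags` is hypothetical, and the judgment-based checks (physics reasoning, blocks where failure would have changed the fiction, inconsistent treatment, experience and trigger misses) are deliberately left out because they need a reader, not a comparison.

```python
NO_ROLL_TAGS = {"retcon", "world_claim"}
BACKSTORY_MIN_DC = 15  # threshold taken from the report template


def tag_flags(ruling):
    """Return tag_analysis flag records for one Run A ruling."""

    def make(flag, details):
        return {
            "type": "tag_analysis",
            "chaos_tag": ruling["chaos_tag"],
            "situation_id": ruling["situation_id"],
            "character_name": ruling["character_name"],
            "flag": flag,
            "details": details,
        }

    tag, roll, dc = ruling["chaos_tag"], ruling["roll"], ruling.get("dc")
    flags = []
    if tag in NO_ROLL_TAGS and roll == "yes":
        flags.append(make("unexpected_roll", f"{tag} was given a roll"))
    if tag == "backstory_claim":
        if roll == "no":
            flags.append(make("unexpected_refusal", "backstory claim was refused"))
        elif dc is not None and dc < BACKSTORY_MIN_DC:
            flags.append(make("low_dc", f"DC {dc} is below {BACKSTORY_MIN_DC}"))
    return flags


example = {"chaos_tag": "backstory_claim", "situation_id": "S-01",
           "character_name": "Brakka", "roll": "yes", "dc": 12}
print([f["flag"] for f in tag_flags(example)])
```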
For the same character across all situations (Run A only), compare how the GM handles the same tag type. Flag if:
Record each flag as:
{
type: "cross_situation",
character_name,
chaos_tag,
situation_a_id,
situation_b_id,
flag: "backstory_inconsistency" | "dc_swing" | "reasoning_drift",
details: <string describing the specific issue>
}
Use the Write tool to create the report at:
D:\Daggerheart\testing\reports\{YYYY-MM-DD}-report.md
where {YYYY-MM-DD} is today's date.
# GM Stress Test Report — {YYYY-MM-DD}
## Summary
- Total chaos responses: {6N}
- Total GM rulings: {12N}
- Parse failures: {count}
- Self-consistency pass rate: {percentage of (situation, character) pairs with zero Pass 1 flags}%
- Tag-based flags: {total count of Pass 2 flags}
- Cross-situation flags: {total count of Pass 3 flags}
## Self-Consistency Failures
{For each Pass 1 flag, output a subsection:}
### {situation_id} / {character_name} — {field}
| | Run A | Run B |
|---|---|---|
| ROLL | {run_a roll} | {run_b roll} |
| REASON | {run_a reason} | {run_b reason} |
| TRAIT | {run_a trait} | {run_b trait} |
| DC | {run_a dc} | {run_b dc} |
{If no Pass 1 flags, write: "No self-consistency failures detected."}
## Tag-Based Analysis
### retcon ({count} responses)
{list of rulings — should all be no-roll. Flag any that got a roll.}
### world_claim ({count} responses)
{list of rulings — should all be no-roll. Flag any that got a roll.}
### ability_abuse ({count} responses)
{list with DCs — flag refusals without clear physics reasoning.}
### physics_push ({count} responses)
{list with DCs — flag refusals without clear physics reasoning.}
### backstory_claim ({count} responses)
{list — flag refusals and DCs below 15.}
### grotesque ({count} responses)
{list — flag blocks where failure would have changed fiction.}
### social_absurdity ({count} responses)
{list — flag blocks where failure would have changed fiction.}
### scale_violation ({count} responses)
{list — flag inconsistent treatment across characters.}
## Cross-Situation Flags
{For each Pass 3 flag, output a subsection:}
### {character_name} — {chaos_tag} — {flag type}
- **{situation_a_id}**: {summary of ruling}
- **{situation_b_id}**: {summary of ruling}
- **Issue**: {details}
{If no Pass 3 flags, write: "No cross-situation flags detected."}
## Fear State Summary
| Situation | Fear Level | Scene | Budget |
|---|---|---|---|
| {situation_id} | {fear_level} | {scene_importance} | {fear_budget} |
| ... | ... | ... | ... |
### Fear Impact Analysis
Compare outcome severity across Fear levels. For all Run A rulings where ROLL=yes, group by Fear level bracket (low: 1-4, mid: 5-8, high: 9-12) and note:
- Do SUCCESS_FEAR outcomes escalate with higher Fear? (They should — more budget = harder complications)
- Do FAILURE_FEAR outcomes escalate with higher Fear? (They should)
- Does Fear level affect roll/no-roll decisions? (It should NOT — Fear affects consequences, not whether something is uncertain)
- Does Fear level affect DC? (It should NOT — DC reflects task difficulty, not GM pressure)
Flag any ruling where Fear level appears to have influenced the roll decision or DC (these are errors — Fear economy is orthogonal to action judgment).
## Full Ruling Table
| Situation | Fear | Character | Chaos Response | Tag | Roll | Trait | DC | Reason |
|---|---|---|---|---|---|---|---|---|
| {situation_id} | {fear_level} | {character_name} | {chaos_response (truncated to 80 chars)} | {chaos_tag} | {roll} | {trait} | {dc} | {reason (truncated to 60 chars)} |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
## Chaos Response Gallery
{All 6N chaos responses, organized by situation, for human review of chaos agent quality:}
### {situation_id}: {situation_title}
| Character | Tag | Response |
|---|---|---|
| {character_name} | {chaos_tag} | {full chaos_response} |
| ... | ... | ... |
## Recommendations
{Write 3-5 recommendations. Derive each from the most frequent or most severe patterns in the flags above. Each recommendation should be 2-3 sentences identifying the pattern and suggesting a concrete change to the GM behavioral rules or vault judgment files.}
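For the Fear Impact Analysis section of the template, the bracket grouping might be sketched as below. The bracket boundaries come from the template (low: 1-4, mid: 5-8, high: 9-12); the function names and the sample rulings are hypothetical. Listing DCs per bracket makes it easier to spot a DC that tracks Fear level, which the template treats as an error.

```python
from collections import defaultdict

BRACKETS = (("low", 1, 4), ("mid", 5, 8), ("high", 9, 12))


def fear_bracket(fear_level):
    """Map a Fear level to the bracket name used in the report."""
    for name, lowest, highest in BRACKETS:
        if lowest <= fear_level <= highest:
            return name
    raise ValueError(f"Fear level {fear_level} is outside every bracket")


def dcs_by_bracket(rulings):
    """Collect the DCs of ROLL=yes rulings under each Fear bracket."""
    grouped = defaultdict(list)
    for ruling in rulings:
        if ruling["roll"] == "yes":
            grouped[fear_bracket(ruling["fear_level"])].append(ruling["dc"])
    return dict(grouped)


# Invented sample rulings, for illustration only.
sample_rulings = [
    {"roll": "yes", "fear_level": 3, "dc": 15},
    {"roll": "yes", "fear_level": 10, "dc": 15},
    {"roll": "no", "fear_level": 7, "dc": None},
    {"roll": "yes", "fear_level": 4, "dc": 18},
]
print(dcs_by_bracket(sample_rulings))
```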
After writing the report, print to the user:
Do not modify scenarios.md or any character sheet. Only write to the reports/ directory.