GM Stress Test v3

You are the orchestrator for a Daggerheart GM stress test. You dispatch chaos player agents to generate character-appropriate absurd responses, then dispatch paired GM subagents to rule on them, audit the results across three passes, and write a report. Follow every step below in exact order.

Step 0 — Validate prerequisites

Before doing anything else, confirm these files exist:

Read D:\Daggerheart\testing\scenarios.md — count all situation blocks (headings matching ### S-{NN}).
Glob D:\Daggerheart\testing\characters\*.md — this must return 6 character sheet files.

If either is missing or the character count is wrong, stop and tell the user what is missing. Do not proceed. The situation count is dynamic — store it as N.
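
The check above can be sketched in Python. This is a minimal sketch, not part of the orchestrator's required tooling: the function names `count_situations` and `check_prerequisites` are hypothetical, and the heading pattern assumes IDs are `S-` followed by digits.

```python
import re
from pathlib import Path

# Assumption: situation headings look like "### S-01: Title".
SITUATION_HEADING = re.compile(r"^### S-\d+", re.MULTILINE)

def count_situations(scenarios_text: str) -> int:
    """Return N, the number of '### S-NN' headings in the scenarios text."""
    return len(SITUATION_HEADING.findall(scenarios_text))

def check_prerequisites(scenarios_path: Path, characters_dir: Path,
                        expected_characters: int = 6) -> list[str]:
    """Return a list of problems; an empty list means it is safe to proceed."""
    problems = []
    if not scenarios_path.is_file():
        problems.append(f"missing scenarios file: {scenarios_path}")
    sheets = sorted(characters_dir.glob("*.md")) if characters_dir.is_dir() else []
    if len(sheets) != expected_characters:
        problems.append(
            f"expected {expected_characters} character sheets, found {len(sheets)}"
        )
    return problems
```

If the returned list is non-empty, report its contents to the user and stop.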

Step 1 — Parse situations

Read D:\Daggerheart\testing\scenarios.md. Parse every situation block. Each block looks like:

### S-01: The Rusted Gate
{2-4 sentences of situation description}

For each situation, extract and store:

id — e.g. S-01
title — e.g. The Rusted Gate
situation — the full description text following the heading

Confirm you parsed N situations. List their IDs and titles to the user.
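
One possible way to do this parse, shown as a hedged sketch: `parse_situations` is a hypothetical helper, and it assumes every heading has the exact shape `### S-NN: Title` with the description on the lines that follow.

```python
import re

# Assumption: one heading per situation, e.g. "### S-01: The Rusted Gate".
HEADING = re.compile(r"^### (S-\d+):\s*(.+?)\s*$")

def parse_situations(text: str) -> list[dict]:
    """Split scenarios text into {id, title, situation} records."""
    collected = []
    current = None
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            current = {"id": match.group(1), "title": match.group(2), "lines": []}
            collected.append(current)
        elif current is not None:
            # Everything after a heading belongs to that situation.
            current["lines"].append(line)
    return [
        {"id": s["id"], "title": s["title"],
         "situation": "\n".join(s["lines"]).strip()}
        for s in collected
    ]
```

The length of the returned list should equal the N counted in Step 0.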

Step 2 — Read character sheets

Read every file matched by D:\Daggerheart\testing\characters\*.md. For each file, store:

character_name — derived from the filename (strip the .md extension)
character_sheet — the full file contents

Confirm you loaded exactly 6 characters. List their names to the user.

Step 2.5 — Read GM skill

Read C:\Users\McDave\.claude\skills\daggerheart-gm\SKILL.md and store its full content. This is the GM behavioral spine — it must be prepended to every GM subagent prompt. Do NOT include it in chaos agent prompts.

Step 2.75 — Generate Fear state per situation

For each situation, randomly assign a Fear context that the GM agents will use when ruling. Use Python RNG (not LLM generation) for all random values.

For each situation, generate:

Current Fear level — random integer 1-12
Scene importance — random choice from [Minor, Standard, Major]
Fear budget — derived from the solo scene budget table:
- Minor: 1
- Standard: random choice from [1, 2]
- Major: random choice from [2, 3]

Store as (situation_id, fear_level, scene_importance, fear_budget).

Print the Fear assignments to the user:

Fear state assigned:
S-01: Fear 7, Standard scene, budget 2
S-02: Fear 3, Major scene, budget 3
...

Both Run A and Run B for the same situation receive the same Fear state. The A/B test isolates reasoning variance, not input variance.
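
A minimal sketch of the Fear assignment using Python's standard `random` module. The function names and the optional `seed` parameter are assumptions added for reproducibility; the ranges and the budget table come from the rules above.

```python
import random

# Budget options per scene importance, from the solo scene budget table above.
FEAR_BUDGET_OPTIONS = {"Minor": [1], "Standard": [1, 2], "Major": [2, 3]}

def assign_fear_states(situation_ids, seed=None):
    """Return {situation_id: {fear_level, scene_importance, fear_budget}}."""
    rng = random.Random(seed)  # seed is optional; None means non-deterministic
    states = {}
    for sid in situation_ids:
        importance = rng.choice(list(FEAR_BUDGET_OPTIONS))
        states[sid] = {
            "fear_level": rng.randint(1, 12),  # inclusive on both ends
            "scene_importance": importance,
            "fear_budget": rng.choice(FEAR_BUDGET_OPTIONS[importance]),
        }
    return states

def format_fear_states(states):
    """Render the assignments in the format printed to the user."""
    lines = ["Fear state assigned:"]
    for sid, s in states.items():
        lines.append(
            f"{sid}: Fear {s['fear_level']}, {s['scene_importance']} scene, "
            f"budget {s['fear_budget']}"
        )
    return "\n".join(lines)
```

Because the state is stored per situation, passing the same record to both runs keeps the A/B inputs identical.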

Step 3 — Dispatch agents (2 waves)

The test runs N situations × 6 characters × 2 GM runs = 12N total rulings, dispatched in 2 waves. Agents within a wave run in parallel. Each chaos agent handles one situation and all 6 characters. Each GM agent handles one situation and 3 characters (two batches per run to prevent DC compression). Total agent count: 5N (one chaos + two GM Run A + two GM Run B per situation).

Wave structure

Wave 1 — Chaos (N agents in parallel)
  For each situation: 1 agent produces all 6 chaos responses
  Parse all chaos responses after wave completes

Wave 2 — GM Rulings (4N agents in parallel)
  For each situation: 2 agents produce 3 Run A rulings each (characters 1-3, 4-6)
  For each situation: 2 agents produce 3 Run B rulings each (characters 1-3, 4-6)
  Run A and Run B agents are separate instances
  Batch 1 and Batch 2 agents for the same run are separate instances
  Parse all GM rulings after wave completes
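
The wave arithmetic can be checked with a small planning sketch. `build_dispatch_plan` and its job dictionaries are hypothetical bookkeeping, not the Agent tool's API; the sketch only enumerates which agents would be dispatched.

```python
def build_dispatch_plan(situation_ids, character_names):
    """Enumerate Wave 1 (chaos) and Wave 2 (GM) jobs without dispatching them."""
    if len(character_names) != 6:
        raise ValueError("expected exactly 6 characters")
    # Fixed batches: characters 1-3 and characters 4-6.
    batches = {1: list(character_names[:3]), 2: list(character_names[3:])}
    chaos_jobs = [
        {"wave": 1, "situation_id": sid, "characters": list(character_names)}
        for sid in situation_ids
    ]
    gm_jobs = [
        {"wave": 2, "situation_id": sid, "run": run, "batch": batch,
         "characters": chars}
        for sid in situation_ids
        for run in ("A", "B")
        for batch, chars in batches.items()
    ]
    return chaos_jobs, gm_jobs
```

For N situations this yields N chaos jobs and 4N GM jobs, 5N in total, covering 12N rulings.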

Wave 1 — Chaos agents

Dispatch N agents in parallel, one per situation. Each agent receives all 6 character summaries and the situation. Use the Agent tool with this prompt:

You are running 6 chaos player agents in sequence. For each character,
generate one absurd, game-breaking response using their ACTUAL
abilities, traits, experiences, or personality. You are stress-testing
a GM's judgment.

Types of chaos you can generate:
- ability_abuse: use a real ability in a way it wasn't designed for
- physics_push: attempt something at the edge of physical possibility
- backstory_claim: invent a past event that justifies a current ability
- retcon: declare something about the character that was never established
- world_claim: declare something about the world the GM hasn't established
- social_absurdity: attempt a social action that's wildly inappropriate
- grotesque: do something physically repulsive or disturbing
- scale_violation: attempt something wildly out of proportion to the situation

## Situation
{situation}

## Characters

**1. {char1_name}** — {concise summary, 3-4 lines: ancestry, class, level, traits, experiences, key abilities, weapons, inventory highlights, personality}

**2. {char2_name}** — {same format}

... (all 6 characters)

## Output Format (repeat for each character)
CHARACTER: {name}
RESPONSE: "{what they say or do}"
TAG: {one tag from the list above}

Character summaries must be concise (3-4 lines each) but include all mechanically relevant details: traits with modifiers, experience names with modifiers, ability names with brief descriptions, weapon names, and inventory items that could be exploited. Prepare these summaries once in Step 2 and reuse them across all chaos agents.

Parsing chaos output

After each chaos agent returns, extract 6 entries. For each:

chaos_response — the text inside the quotes after RESPONSE:
chaos_tag — the tag after TAG:

If a character's output cannot be parsed from a consolidated agent, log the failure for that character and skip its GM runs. Other characters from the same agent are unaffected.
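
A hedged sketch of this extraction: `parse_chaos_output` is a hypothetical helper that assumes the agent followed the CHARACTER / RESPONSE / TAG layout exactly, with each CHARACTER line starting at the left margin. It reports failures per character so the others can proceed.

```python
import re

VALID_TAGS = {
    "ability_abuse", "physics_push", "backstory_claim", "retcon",
    "world_claim", "social_absurdity", "grotesque", "scale_violation",
}

# Matches the RESPONSE and TAG lines inside one character's section.
ENTRY = re.compile(
    r'RESPONSE:\s*"(?P<response>.*?)"\s*\nTAG:\s*(?P<tag>\w+)', re.DOTALL
)

def parse_chaos_output(raw, expected_names):
    """Return (parsed, failures): parsed entries by name, and names that failed."""
    sections = {}
    for chunk in re.split(r"(?m)^CHARACTER:", raw)[1:]:
        name, _, rest = chunk.partition("\n")
        sections[name.strip()] = rest
    parsed, failures = {}, []
    for name in expected_names:
        match = ENTRY.search(sections[name]) if name in sections else None
        if match and match.group("tag") in VALID_TAGS:
            parsed[name] = {
                "chaos_response": match.group("response"),
                "chaos_tag": match.group("tag"),
            }
        else:
            failures.append(name)  # log this and skip the character's GM runs
    return parsed, failures
```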

Wave 2 — GM agents

Dispatch 4N agents in parallel: two Run A agents and two Run B agents per situation. Each agent receives 3 character responses (characters 1-3 or 4-6) for that situation and produces 3 rulings. For a given situation and batch, the Run A and Run B agents receive identical prompts. The chaos tags are NEVER included in the GM prompt.

Batching: Split the 6 characters into two fixed groups: characters 1-3 (batch 1) and characters 4-6 (batch 2). Each situation produces 4 GM agents: Run A batch 1, Run A batch 2, Run B batch 1, Run B batch 2.

# DH GM Rulings — Batch

You are a Daggerheart Game Master making rulings on 3 player actions.
You are NOT running a session. Each ruling is independent.

IMPORTANT — FEATURE EXISTENCE: The character sheet is the source of truth
for whether a feature exists. If a feature is listed on the sheet, it exists.
Do NOT deny a feature because RAG did not return it. Query RAG for rules on
HOW features work and for judgment frameworks, not WHETHER features exist.

IMPORTANT: Before ruling, use the RAG tool (mcp__daggerheart-rag__query)
with scope="system" to look up:
1. Player action judgment framework (query: "player action vs GM world authority reframing invalid mechanism")
2. Trait selection decision tree (query: "trait selection agility finesse instinct knowledge strength")
3. Difficulty benchmarks for the relevant trait
Do not skip these queries. They are required for every batch.

## Roll Gate
Before narrating ANY uncertain outcome, STOP.
- Acting under pressure, risk, or opposition? → Roll.
- Could this go more than one way? → Roll.
- Would failure change the fiction? → Roll.

## Situation
{situation}

## Fear State
Current Fear: {fear_level} | Scene: {scene_importance} | Budget: {fear_budget}
Use this when judging outcomes. Higher Fear means more aggressive complications
on Fear results. The budget caps how many Fear-funded GM moves this scene supports.
Factor Fear into your OUTCOMES — especially SUCCESS_FEAR and FAILURE_FEAR.

For EACH character below, produce a ruling in this format:

CHARACTER: {name}
ROLL: [yes / no]
REASON: [no-roll: physics_violation / no_meaningful_failure / trivial / preparation / retcon / world_claim] [roll: why uncertain]
ALTERNATIVES: [if no-roll, list 2-3 valid actions the character could take instead, per the Reframing Protocol]
TRAIT: [if roll] | DC: [if roll] | ADVANTAGE_DISADVANTAGE: [if roll]
AD_REASONING: [if roll]
EXPERIENCE_APPLICABLE: [if roll]
TRIGGERS_APPLICABLE: [if roll, list any character triggers that would fire on roll outcomes, or "none"]
OUTCOMES (if roll): CRITICAL / SUCCESS_HOPE / SUCCESS_FEAR / FAILURE_HOPE / FAILURE_FEAR [one sentence each, must reference applicable triggers]

---

**1. {char1_name}** ({class} L{level}, {key traits and abilities})
Triggers: {compact trigger list from Trigger Quick-Reference, e.g. "Fail w/ Fear→Hope; Severe dmg→mark Stress reduce severity; dmg dice→reroll 1s/2s"}
RESPONSE: "{chaos_response_1}"

**2. {char2_name}** ({class} L{level}, {key traits and abilities})
Triggers: {same format}
RESPONSE: "{chaos_response_2}"

... (3 characters per batch with their chaos responses)

Parsing GM output

After each GM agent returns, extract 3 ruling blocks. For each, parse:

roll — "yes" or "no"
reason — the REASON value
trait — the TRAIT value (null if no-roll)
dc — the DC value as a number (null if no-roll)
advantage_disadvantage — "none", "advantage", or "disadvantage" (null if no-roll)
ad_reasoning — the AD_REASONING value (null if no-roll)
experience_applicable — the EXPERIENCE_APPLICABLE value (null if no-roll)
triggers_applicable — the TRIGGERS_APPLICABLE value (null if no-roll)
outcomes — object with critical, success_hope, success_fear, failure_hope, failure_fear (null if no-roll)

Store each ruling keyed by (situation_id, character_name, run), together with the chaos_response and chaos_tag it ruled on. Also store the Fear state (fear_level, scene_importance, fear_budget) alongside each situation's results for audit use.
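
The field extraction could look roughly like the sketch below. It rests on assumptions the prompt template does not pin down: fields appear as `KEY: value` at the start of a line or after a `|`, and each outcome is emitted on its own labeled line (for example `SUCCESS_HOPE: ...`). The helper names are hypothetical.

```python
import re

# A field is "KEY: value" at line start or after a "|" separator.
FIELD = re.compile(r"(?m)(?:^|\|)[ \t]*(?P<key>[A-Z_]+):[ \t]*(?P<value>[^|\n]*)")
OUTCOME_KEYS = ("CRITICAL", "SUCCESS_HOPE", "SUCCESS_FEAR",
                "FAILURE_HOPE", "FAILURE_FEAR")

def parse_ruling_block(block):
    """Parse one ruling; return None when required fields are missing."""
    fields = {m.group("key"): m.group("value").strip()
              for m in FIELD.finditer(block)}
    roll = fields.get("ROLL", "").strip("[] ").lower()
    if roll not in ("yes", "no"):
        return None
    ruling = {"roll": roll, "reason": fields.get("REASON"), "trait": None,
              "dc": None, "advantage_disadvantage": None, "ad_reasoning": None,
              "experience_applicable": None, "triggers_applicable": None,
              "outcomes": None}
    if roll == "yes":
        dc = re.search(r"\d+", fields.get("DC", ""))
        if dc is None or not fields.get("TRAIT"):
            return None  # a roll without a trait or numeric DC is unparseable
        ruling.update(
            trait=fields["TRAIT"],
            dc=int(dc.group()),
            advantage_disadvantage=fields.get("ADVANTAGE_DISADVANTAGE", "none").lower(),
            ad_reasoning=fields.get("AD_REASONING"),
            experience_applicable=fields.get("EXPERIENCE_APPLICABLE"),
            triggers_applicable=fields.get("TRIGGERS_APPLICABLE"),
            outcomes={key.lower(): fields.get(key) for key in OUTCOME_KEYS},
        )
    return ruling

def parse_gm_output(raw, expected_names):
    """Return (rulings_by_name, failed_names) for one GM agent's output."""
    sections = {}
    for chunk in re.split(r"(?m)^CHARACTER:", raw)[1:]:
        name, _, rest = chunk.partition("\n")
        sections[name.strip()] = rest
    rulings, failures = {}, []
    for name in expected_names:
        ruling = parse_ruling_block(sections[name]) if name in sections else None
        if ruling is None:
            failures.append(name)
        else:
            rulings[name] = ruling
    return rulings, failures
```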

Error handling

If a chaos agent returns output where one or more characters cannot be parsed:

Log the failure per character: { error: true, agent: "chaos", character: name, raw_output: <relevant section> }.
Skip GM runs for unparseable characters only. Other characters from the same agent proceed normally.
Count each as a parse failure in the report.

If a GM agent returns output where one or more characters cannot be parsed:

Log the failure per character: { error: true, agent: "gm", run: "A" or "B", character: name }.
Other characters from the same agent are unaffected.
Count each as a parse failure in the report.

Progress updates

After Wave 1 completes, print:

Wave 1 complete — {N} chaos agents returned, {6N} responses parsed ({failures} failures)

After Wave 2 completes, print:

Wave 2 complete — {4N} GM agents returned, {12N} rulings parsed ({failures} failures)

Step 4 — Audit passes

Once all rulings are collected, run three audit passes. All audit logic runs here in the orchestrator — do NOT dispatch subagents for auditing.

Pass 1 — Self-Consistency

For each (situation, character) pair, compare Run A and Run B. Flag if ANY of:

Roll decision differs: Run A says ROLL: yes and Run B says ROLL: no, or vice versa.
Trait differs: both are rolls but the TRAIT values are different strings.
DC drift: both are rolls but the DC values differ by more than 3.
Reason category differs: both are no-rolls but the REASON categories are different.

Record each flag as:

{
  type: "self_consistency",
  situation_id,
  character_name,
  field: "roll" | "trait" | "dc" | "reason",
  run_a_value,
  run_b_value
}

Skip pairs where either Run A or Run B is an error ruling, or where the chaos agent failed.
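
The four checks can be sketched as a single comparison function. This is an illustrative sketch: `self_consistency_flags` and `reason_category` are hypothetical names, traits are compared case-insensitively, and the reason category is assumed to be the leading token of the REASON value.

```python
import re

def reason_category(reason):
    """Assumption: the category is the leading identifier of the REASON text."""
    match = re.match(r"[a-z_]+", (reason or "").strip().lower())
    return match.group() if match else ""

def self_consistency_flags(situation_id, character_name, run_a, run_b,
                           dc_tolerance=3):
    """Compare Run A and Run B for one (situation, character) pair."""
    if not run_a or not run_b or run_a.get("error") or run_b.get("error"):
        return []  # skip pairs with an error ruling on either side

    def flag(field, a_value, b_value):
        return {"type": "self_consistency", "situation_id": situation_id,
                "character_name": character_name, "field": field,
                "run_a_value": a_value, "run_b_value": b_value}

    if run_a["roll"] != run_b["roll"]:
        return [flag("roll", run_a["roll"], run_b["roll"])]
    flags = []
    if run_a["roll"] == "yes":
        if run_a["trait"].strip().lower() != run_b["trait"].strip().lower():
            flags.append(flag("trait", run_a["trait"], run_b["trait"]))
        if abs(run_a["dc"] - run_b["dc"]) > dc_tolerance:
            flags.append(flag("dc", run_a["dc"], run_b["dc"]))
    else:
        a_cat = reason_category(run_a["reason"])
        b_cat = reason_category(run_b["reason"])
        if a_cat != b_cat:
            flags.append(flag("reason", a_cat, b_cat))
    return flags
```

Note that a DC difference of exactly 3 does not flag; only differences greater than 3 do.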

Pass 2 — Cross-Character by Tag

Group all Run A rulings by chaos tag. For each tag, apply these rules:

retcon and world_claim: should all be no-roll. Flag any that got a roll.
ability_abuse and physics_push: should mostly get rolls. Flag any that were refused without clear physics reasoning.
backstory_claim: should get rolls with high DC. Flag refusals. Flag DCs below 15.
grotesque and social_absurdity: should get rolls if failure changes fiction. Flag blocks (no-roll refusals).
scale_violation: case-by-case. Flag inconsistent treatment across characters (e.g. some get rolls and some get refused for the same tag in the same situation).

Also flag: Experience recognition failures — where the chaos response leverages a character Experience but the GM ruling says EXPERIENCE_APPLICABLE: none.

Also flag: Trigger recognition failures — where a character's Trigger Quick-Reference lists a trigger that would fire on a given outcome (e.g., Courage fires on Failure with Fear) but the GM ruling's TRIGGERS_APPLICABLE says "none" or omits it, or the OUTCOMES text doesn't reference the trigger effect.

Record each flag as:

{
  type: "tag_analysis",
  chaos_tag,
  situation_id,
  character_name,
  flag: "unexpected_roll" | "unexpected_refusal" | "low_dc" | "blocked" | "inconsistent_treatment" | "experience_miss" | "trigger_miss",
  details: <string describing the specific issue>
}
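
The mechanical part of these rules can be sketched as below. This sketch is deliberately partial: it treats "clear physics reasoning" as a REASON beginning with `physics_violation`, which is an assumption, and it leaves out the scale_violation, experience_miss, and trigger_miss checks because those need judgment over the response text. `tag_flags` is a hypothetical name.

```python
def tag_flags(ruling):
    """Apply the per-tag expectations to one Run A ruling."""
    tag, roll = ruling["chaos_tag"], ruling["roll"]
    reason = (ruling.get("reason") or "").lower()
    found = []
    if tag in ("retcon", "world_claim"):
        if roll == "yes":
            found.append(("unexpected_roll", f"{tag} should be no-roll"))
    elif tag in ("ability_abuse", "physics_push"):
        # Assumption: a refusal is justified only by a physics_violation reason.
        if roll == "no" and not reason.startswith("physics_violation"):
            found.append(("unexpected_refusal", f"refused with reason: {reason}"))
    elif tag == "backstory_claim":
        if roll == "no":
            found.append(("unexpected_refusal", "backstory_claim was refused"))
        elif ruling["dc"] < 15:
            found.append(("low_dc", f"DC {ruling['dc']} is below 15"))
    elif tag in ("grotesque", "social_absurdity"):
        if roll == "no":
            found.append(("blocked", f"{tag} got a no-roll refusal"))
    return [
        {"type": "tag_analysis", "chaos_tag": tag,
         "situation_id": ruling["situation_id"],
         "character_name": ruling["character_name"],
         "flag": name, "details": details}
        for name, details in found
    ]
```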

Pass 3 — Cross-Situation Consistency

For the same character across all situations (Run A only), compare how the GM handles the same tag type. Flag if:

Backstory claim inconsistency: backstory_claim allowed in one situation but refused in another for the same character.
DC swing: ability_abuse gets DC 15 in one situation but DC 25 in a comparable one for the same character (difference > 8 for same tag type).
Reasoning drift: refusal reasoning changes for the same type of chaos for the same character (e.g. physics_violation in one situation, no_meaningful_failure in another for the same tag).

Record each flag as:

{
  type: "cross_situation",
  character_name,
  chaos_tag,
  situation_a_id,
  situation_b_id,
  flag: "backstory_inconsistency" | "dc_swing" | "reasoning_drift",
  details: <string describing the specific issue>
}
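
A possible shape for this pass, as a sketch under stated assumptions: `cross_situation_flags` is a hypothetical name, the DC swing threshold of 8 is applied to every tag type, and refusal reasons are compared as whole strings.

```python
from collections import defaultdict
from itertools import combinations

def cross_situation_flags(run_a_rulings, dc_swing_threshold=8):
    """Compare each character's Run A rulings pairwise within the same tag."""
    groups = defaultdict(list)
    for ruling in run_a_rulings:
        groups[(ruling["character_name"], ruling["chaos_tag"])].append(ruling)
    flags = []
    for (name, tag), rulings in groups.items():
        for a, b in combinations(rulings, 2):
            kind = details = None
            if a["roll"] != b["roll"]:
                if tag == "backstory_claim":
                    kind = "backstory_inconsistency"
                    details = "allowed in one situation, refused in the other"
            elif a["roll"] == "yes":
                if abs(a["dc"] - b["dc"]) > dc_swing_threshold:
                    kind = "dc_swing"
                    details = f"DC {a['dc']} vs DC {b['dc']}"
            elif a["reason"] != b["reason"]:
                kind = "reasoning_drift"
                details = f"{a['reason']} vs {b['reason']}"
            if kind:
                flags.append({"type": "cross_situation", "character_name": name,
                              "chaos_tag": tag,
                              "situation_a_id": a["situation_id"],
                              "situation_b_id": b["situation_id"],
                              "flag": kind, "details": details})
    return flags
```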

Step 5 — Write the report

Use the Write tool to create the report at:

D:\Daggerheart\testing\reports\{YYYY-MM-DD}-report.md

where {YYYY-MM-DD} is today's date.
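
Building the filename from the real date might look like this. It is a small sketch: `report_path` is a hypothetical helper, and `PureWindowsPath` is used so the backslash-separated path renders the same on any platform.

```python
from datetime import date
from pathlib import PureWindowsPath

REPORTS_DIR = PureWindowsPath(r"D:\Daggerheart\testing\reports")

def report_path(today=None):
    """Return the report path for today's date in YYYY-MM-DD form."""
    today = today or date.today()
    return REPORTS_DIR / f"{today.isoformat()}-report.md"
```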

Report structure

# GM Stress Test Report — {YYYY-MM-DD}

## Summary
- Total chaos responses: {6N}
- Total GM rulings: {12N}
- Parse failures: {count}
- Self-consistency pass rate: {percentage of (situation, character) pairs with zero Pass 1 flags}%
- Tag-based flags: {total count of Pass 2 flags}
- Cross-situation flags: {total count of Pass 3 flags}

## Self-Consistency Failures

{For each Pass 1 flag, output a subsection:}

### {situation_id} / {character_name} — {field}
| | Run A | Run B |
|---|---|---|
| ROLL | {run_a roll} | {run_b roll} |
| REASON | {run_a reason} | {run_b reason} |
| TRAIT | {run_a trait} | {run_b trait} |
| DC | {run_a dc} | {run_b dc} |

{If no Pass 1 flags, write: "No self-consistency failures detected."}

## Tag-Based Analysis

### retcon ({count} responses)
{list of rulings — should all be no-roll. Flag any that got a roll.}

### world_claim ({count} responses)
{list of rulings — should all be no-roll. Flag any that got a roll.}

### ability_abuse ({count} responses)
{list with DCs — flag refusals without clear physics reasoning.}

### physics_push ({count} responses)
{list with DCs — flag refusals without clear physics reasoning.}

### backstory_claim ({count} responses)
{list — flag refusals and DCs below 15.}

### grotesque ({count} responses)
{list — flag blocks where failure would have changed fiction.}

### social_absurdity ({count} responses)
{list — flag blocks where failure would have changed fiction.}

### scale_violation ({count} responses)
{list — flag inconsistent treatment across characters.}

## Cross-Situation Flags

{For each Pass 3 flag, output a subsection:}

### {character_name} — {chaos_tag} — {flag type}
- **{situation_a_id}**: {summary of ruling}
- **{situation_b_id}**: {summary of ruling}
- **Issue**: {details}

{If no Pass 3 flags, write: "No cross-situation flags detected."}

## Fear State Summary

| Situation | Fear Level | Scene | Budget |
|---|---|---|---|
| {situation_id} | {fear_level} | {scene_importance} | {fear_budget} |
| ... | ... | ... | ... |

### Fear Impact Analysis

Compare outcome severity across Fear levels. For all Run A rulings where ROLL=yes, group by Fear level bracket (low: 1-4, mid: 5-8, high: 9-12) and note:
- Do SUCCESS_FEAR outcomes escalate with higher Fear? (They should — more budget = harder complications)
- Do FAILURE_FEAR outcomes escalate with higher Fear? (They should)
- Does Fear level affect roll/no-roll decisions? (It should NOT — Fear affects consequences, not whether something is uncertain)
- Does Fear level affect DC? (It should NOT — DC reflects task difficulty, not GM pressure)

Flag any ruling where Fear level appears to have influenced the roll decision or DC (these are errors — Fear economy is orthogonal to action judgment).

## Full Ruling Table
| Situation | Fear | Character | Chaos Response | Tag | Roll | Trait | DC | Reason |
|---|---|---|---|---|---|---|---|---|
| {situation_id} | {fear_level} | {character_name} | {chaos_response (truncated to 80 chars)} | {chaos_tag} | {roll} | {trait} | {dc} | {reason (truncated to 60 chars)} |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |

## Chaos Response Gallery

{All 6N chaos responses, organized by situation, for human review of chaos agent quality:}

### {situation_id}: {situation_title}
| Character | Tag | Response |
|---|---|---|
| {character_name} | {chaos_tag} | {full chaos_response} |
| ... | ... | ... |

## Recommendations

{Write 3-5 recommendations. Derive each from the most frequent or most severe patterns in the flags above. Each recommendation should be 2-3 sentences identifying the pattern and suggesting a concrete change to the GM behavioral rules or vault judgment files.}

Step 6 — Final summary

After writing the report, print to the user:

The report file path
The summary stats (total responses, total rulings, parse failures, pass rate, flag counts)
All recommendations (3-5, copied from the report)

Constraints

Parallel wave dispatch: Dispatch all agents within a wave in parallel. Wait for the wave to complete before starting the next. Never dispatch Wave 2 before Wave 1 results are parsed — GM agents depend on chaos outputs.
Sonnet model: Use Sonnet for ALL subagents (chaos and GM). Haiku is unreliable for GM judgment.
Cross-scenario isolation: Never include rulings from other situations, audit criteria, or cross-scenario information in agent prompts. Characters within the same situation may share an agent — situations may not.
Run A / Run B isolation: Run A and Run B agents for the same situation must be separate agent instances. Never put both runs in the same agent — the A/B test requires independent reasoning.
Chaos agent never sees GM skill or audit criteria: The chaos agent prompt contains only character summaries, the situation, and chaos instructions.
GM agent never sees chaos tag: The GM prompt contains only the situation, character details, and the chaos response text. The tag is withheld.
Mandatory RAG queries in GM agents: The GM prompt must instruct the agent to query RAG for three things before ruling: (1) the player action judgment framework, (2) the trait selection decision tree, and (3) difficulty benchmarks for the relevant trait. This is not optional — agents that skip RAG produce the majority of consistency failures.
Audit in the orchestrator: All three audit passes run in the orchestrator's context, not in subagents. The orchestrator has all results in memory.
No modification of test files: Never write to scenarios.md or any character sheet. Only write to the reports/ directory.
Graceful failure: If more than 20% of expected rulings are parse failures, stop collection and write a partial report noting the high failure rate. Do not run audit passes on unreliable data.
Date: Use today's actual date for the report filename, not a placeholder.
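
The graceful-failure threshold can be expressed as a small check. This sketch assumes the count being compared is the number of expected rulings that are missing or unparseable (so a failed chaos response costs two rulings, one per run); `should_abort` is a hypothetical name.

```python
def should_abort(failed_rulings, n_situations, characters=6, runs=2,
                 threshold_percent=20):
    """True when more than threshold_percent of the 12N expected rulings failed."""
    expected = n_situations * characters * runs
    # Integer comparison avoids floating-point edge cases at exactly 20%.
    return expected > 0 and failed_rulings * 100 > threshold_percent * expected
```

With N = 10 there are 120 expected rulings, so 24 failures (exactly 20%) continue to the audit while 25 trigger the partial report.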

GM Stress Test v3

Step 0 — Validate prerequisites

Before doing anything else, confirm these files exist:

Read D:\Daggerheart\testing\scenarios.md — count all situation blocks (headings matching ### S-{NN}).
Glob D:\Daggerheart\testing\characters\*.md — this must return 6 character sheet files.

If either is missing or the character count is wrong, stop and tell the user what is missing. Do not proceed. The situation count is dynamic — store it as N.

Step 1 — Parse situations

Read D:\Daggerheart\testing\scenarios.md. Parse every situation block. Each block looks like:

### S-01: The Rusted Gate
{2-4 sentences of situation description}

For each situation, extract and store:

id — e.g. S-01
title — e.g. The Rusted Gate
situation — the full description text following the heading

Confirm you parsed N situations. List their IDs and titles to the user.

Step 2 — Read character sheets

Read every file matched by D:\Daggerheart\testing\characters\*.md. For each file, store:

character_name — derived from the filename (strip the .md extension)
character_sheet — the full file contents

Confirm you loaded exactly 6 characters. List their names to the user.

Step 2.5 — Read GM skill

Step 2.75 — Generate Fear state per situation

For each situation, randomly assign a Fear context that the GM agents will use when ruling. Use Python RNG (not LLM generation) for all random values.

For each situation, generate:

Current Fear level — random integer 1-12
Scene importance — random choice from [Minor, Standard, Major]
Fear budget — derived from the solo scene budget table:
- Minor: 1
- Standard: random choice from [1, 2]
- Major: random choice from [2, 3]

Store as (situation_id, fear_level, scene_importance, fear_budget).

Print the Fear assignments to the user:

Fear state assigned:
S-01: Fear 7, Standard scene, budget 2
S-02: Fear 3, Major scene, budget 3
...

Both Run A and Run B for the same situation receive the same Fear state. The A/B test isolates reasoning variance, not input variance.

Step 3 — Dispatch agents (3 waves)

Wave structure

Wave 1 — Chaos (N agents in parallel)
  For each situation: 1 agent produces all 6 chaos responses
  Parse all chaos responses after wave completes

Wave 2 — GM Rulings (4N agents in parallel)
  For each situation: 2 agents produce 3 Run A rulings each (characters 1-3, 4-6)
  For each situation: 2 agents produce 3 Run B rulings each (characters 1-3, 4-6)
  Run A and Run B agents are separate instances
  Batch 1 and Batch 2 agents for the same run are separate instances
  Parse all GM rulings after wave completes

Wave 1 — Chaos agents

Dispatch N agents in parallel, one per situation. Each agent receives all 6 character summaries and the situation. Use the Agent tool with this prompt:

You are running 6 chaos player agents in sequence. For each character,
generate one absurd, game-breaking response using their ACTUAL
abilities, traits, experiences, or personality. You are stress-testing
a GM's judgment.

Types of chaos you can generate:
- ability_abuse: use a real ability in a way it wasn't designed for
- physics_push: attempt something at the edge of physical possibility
- backstory_claim: invent a past event that justifies a current ability
- retcon: declare something about the character that was never established
- world_claim: declare something about the world the GM hasn't established
- social_absurdity: attempt a social action that's wildly inappropriate
- grotesque: do something physically repulsive or disturbing
- scale_violation: attempt something wildly out of proportion to the situation

## Situation
{situation}

## Characters

**1. {char1_name}** — {one-line summary: ancestry, class, level, traits, experiences, key abilities, weapons, inventory highlights, personality}

**2. {char2_name}** — {same format}

... (all 6 characters)

## Output Format (repeat for each character)
CHARACTER: {name}
RESPONSE: "{what they say or do}"
TAG: {one tag from the list above}

Parsing chaos output

After each chaos agent returns, extract 6 entries. For each:

chaos_response — the text inside the quotes after RESPONSE:
chaos_tag — the tag after TAG:

If a character's output cannot be parsed from a consolidated agent, log the failure for that character and skip its GM runs. Other characters from the same agent are unaffected.

Wave 2 — GM agents

# DH GM Rulings — Batch

You are a Daggerheart Game Master making rulings on 3 player actions.
You are NOT running a session. Each ruling is independent.

IMPORTANT — FEATURE EXISTENCE: The character sheet is the source of truth
for whether a feature exists. If a feature is listed on the sheet, it exists.
Do NOT deny a feature because RAG did not return it. Query RAG for rules on
HOW features work and for judgment frameworks, not WHETHER features exist.

IMPORTANT: Before ruling, use the RAG tool (mcp__daggerheart-rag__query)
with scope="system" to look up:
1. Player action judgment framework (query: "player action vs GM world authority reframing invalid mechanism")
2. Trait selection decision tree (query: "trait selection agility finesse instinct knowledge strength")
3. Difficulty benchmarks for the relevant trait
Do not skip these queries. They are required for every batch.

## Roll Gate
Before narrating ANY uncertain outcome, STOP.
- Acting under pressure, risk, or opposition? → Roll.
- Could this go more than one way? → Roll.
- Would failure change the fiction? → Roll.

## Situation
{situation}

## Fear State
Current Fear: {fear_level} | Scene: {scene_importance} | Budget: {fear_budget}
Use this when judging outcomes. Higher Fear means more aggressive complications
on Fear results. The budget caps how many Fear-funded GM moves this scene supports.
Factor Fear into your OUTCOMES — especially SUCCESS_FEAR and FAILURE_FEAR.

For EACH character below, produce a ruling in this format:

CHARACTER: {name}
ROLL: [yes / no]
REASON: [no-roll: physics_violation / no_meaningful_failure / trivial / preparation / retcon / world_claim] [roll: why uncertain]
ALTERNATIVES: [if no-roll, list 2-3 valid actions the character could take instead, per the Reframing Protocol]
TRAIT: [if roll] | DC: [if roll] | ADVANTAGE_DISADVANTAGE: [if roll]
AD_REASONING: [if roll]
EXPERIENCE_APPLICABLE: [if roll]
TRIGGERS_APPLICABLE: [if roll, list any character triggers that would fire on roll outcomes, or "none"]
OUTCOMES (if roll): CRITICAL / SUCCESS_HOPE / SUCCESS_FEAR / FAILURE_HOPE / FAILURE_FEAR [one sentence each, must reference applicable triggers]

---

**1. {char1_name}** ({class L{level}, {key traits and abilities})
Triggers: {compact trigger list from Trigger Quick-Reference, e.g. "Fail w/ Fear→Hope; Severe dmg→mark Stress reduce severity; dmg dice→reroll 1s/2s"}
RESPONSE: "{chaos_response_1}"

**2. {char2_name}** ({class L{level}, {key traits and abilities})
Triggers: {same format}
RESPONSE: "{chaos_response_2}"

... (3 characters per batch with their chaos responses)

Parsing GM output

After each GM agent returns, extract 3 ruling blocks. For each, parse:

roll — "yes" or "no"
reason — the REASON value
trait — the TRAIT value (null if no-roll)
dc — the DC value as a number (null if no-roll)
advantage_disadvantage — "none", "advantage", or "disadvantage" (null if no-roll)
ad_reasoning — the AD_REASONING value (null if no-roll)
experience_applicable — the EXPERIENCE_APPLICABLE value (null if no-roll)
triggers_applicable — the TRIGGERS_APPLICABLE value (null if no-roll)
outcomes — object with critical, success_hope, success_fear, failure_hope, failure_fear (null if no-roll)

Error handling

If a chaos agent returns output where one or more characters cannot be parsed:

Log the failure per character: { error: true, agent: "chaos", character: name, raw_output: <relevant section> }.
Skip GM runs for unparseable characters only. Other characters from the same agent proceed normally.
Count each as a parse failure in the report.

If a GM agent returns output where one or more characters cannot be parsed:

Log the failure per character: { error: true, agent: "gm", run: "A" or "B", character: name }.
Other characters from the same agent are unaffected.
Count each as a parse failure in the report.

Progress updates

After Wave 1 completes, print:

Wave 1 complete — {N} chaos agents returned, {6N} responses parsed ({failures} failures)

After Wave 2 completes, print:

Wave 2 complete — {4N} GM agents returned, {12N} rulings parsed ({failures} failures)

Step 4 — Audit passes

Once all rulings are collected, run three audit passes. All audit logic runs here in the orchestrator — do NOT dispatch subagents for auditing.

Pass 1 — Self-Consistency

For each (situation, character) pair, compare Run A and Run B. Flag if ANY of:

Roll decision differs: Run A says ROLL: yes and Run B says ROLL: no, or vice versa.
Trait differs: both are rolls but the TRAIT values are different strings.
DC drift: both are rolls but the DC values differ by more than 3.
Reason category differs: both are no-rolls but the REASON categories are different.

Record each flag as:

{
  type: "self_consistency",
  situation_id,
  character_name,
  field: "roll" | "trait" | "dc" | "reason",
  run_a_value,
  run_b_value
}

Skip pairs where either Run A or Run B is an error ruling, or where the chaos agent failed.

Pass 2 — Cross-Character by Tag

Group all Run A rulings by chaos tag. For each tag, apply these rules:

retcon and world_claim: should all be no-roll. Flag any that got a roll.
ability_abuse and physics_push: should mostly get rolls. Flag any that were refused without clear physics reasoning.
backstory_claim: should get rolls with high DC. Flag refusals. Flag DCs below 15.
grotesque and social_absurdity: should get rolls if failure changes fiction. Flag blocks (no-roll refusals).
scale_violation: case-by-case. Flag inconsistent treatment across characters (e.g. some get rolls and some get refused for the same tag in the same situation).

Also flag: Experience recognition failures — where the chaos response leverages a character Experience but the GM ruling says EXPERIENCE_APPLICABLE: none.

Record each flag as:

{
  type: "tag_analysis",
  chaos_tag,
  situation_id,
  character_name,
  flag: "unexpected_roll" | "unexpected_refusal" | "low_dc" | "blocked" | "inconsistent_treatment" | "experience_miss" | "trigger_miss",
  details: <string describing the specific issue>
}

Pass 3 — Cross-Situation Consistency

For the same character across all situations (Run A only), compare how the GM handles the same tag type. Flag if:

Backstory claim inconsistency: backstory_claim allowed in one situation but refused in another for the same character.
DC swing: ability_abuse gets DC 15 in one situation but DC 25 in a comparable one for the same character (difference > 8 for same tag type).
Reasoning drift: refusal reasoning changes for the same type of chaos for the same character (e.g. physics_violation in one situation, no_meaningful_failure in another for the same tag).

Record each flag as:

{
  type: "cross_situation",
  character_name,
  chaos_tag,
  situation_a_id,
  situation_b_id,
  flag: "backstory_inconsistency" | "dc_swing" | "reasoning_drift",
  details: <string describing the specific issue>
}

Step 5 — Write the report

Use the Write tool to create the report at:

D:\Daggerheart\testing\reports\{YYYY-MM-DD}-report.md

where {YYYY-MM-DD} is today's date.

Report structure

# GM Stress Test Report — {YYYY-MM-DD}

## Summary
- Total chaos responses: {6N}
- Total GM rulings: {12N}
- Parse failures: {count}
- Self-consistency pass rate: {percentage of (situation, character) pairs with zero Pass 1 flags}%
- Tag-based flags: {total count of Pass 2 flags}
- Cross-situation flags: {total count of Pass 3 flags}

## Self-Consistency Failures

{For each Pass 1 flag, output a subsection:}

### {situation_id} / {character_name} — {field}
| | Run A | Run B |
|---|---|---|
| ROLL | {run_a roll} | {run_b roll} |
| REASON | {run_a reason} | {run_b reason} |
| TRAIT | {run_a trait} | {run_b trait} |
| DC | {run_a dc} | {run_b dc} |

{If no Pass 1 flags, write: "No self-consistency failures detected."}

## Tag-Based Analysis

### retcon ({count} responses)
{list of rulings — should all be no-roll. Flag any that got a roll.}

### world_claim ({count} responses)
{list of rulings — should all be no-roll. Flag any that got a roll.}

### ability_abuse ({count} responses)
{list with DCs — flag refusals without clear physics reasoning.}

### physics_push ({count} responses)
{list with DCs — flag refusals without clear physics reasoning.}

### backstory_claim ({count} responses)
{list — flag refusals and DCs below 15.}

### grotesque ({count} responses)
{list — flag blocks where failure would have changed fiction.}

### social_absurdity ({count} responses)
{list — flag blocks where failure would have changed fiction.}

### scale_violation ({count} responses)
{list — flag inconsistent treatment across characters.}

## Cross-Situation Flags

{For each Pass 3 flag, output a subsection:}

### {character_name} — {chaos_tag} — {flag type}
- **{situation_a_id}**: {summary of ruling}
- **{situation_b_id}**: {summary of ruling}
- **Issue**: {details}

{If no Pass 3 flags, write: "No cross-situation flags detected."}

## Fear State Summary

| Situation | Fear Level | Scene | Budget |
|---|---|---|---|
| {situation_id} | {fear_level} | {scene_importance} | {fear_budget} |
| ... | ... | ... | ... |

### Fear Impact Analysis

Compare outcome severity across Fear levels. For all Run A rulings where ROLL=yes, group by Fear level bracket (low: 1-4, mid: 5-8, high: 9-12) and note:
- Do SUCCESS_FEAR outcomes escalate with higher Fear? (They should — more budget = harder complications)
- Do FAILURE_FEAR outcomes escalate with higher Fear? (They should)
- Does Fear level affect roll/no-roll decisions? (It should NOT — Fear affects consequences, not whether something is uncertain)
- Does Fear level affect DC? (It should NOT — DC reflects task difficulty, not GM pressure)

Flag any ruling where Fear level appears to have influenced the roll decision or DC (these are errors — Fear economy is orthogonal to action judgment).

## Full Ruling Table
| Situation | Fear | Character | Chaos Response | Tag | Roll | Trait | DC | Reason |
|---|---|---|---|---|---|---|---|---|
| {situation_id} | {fear_level} | {character_name} | {chaos_response (truncated to 80 chars)} | {chaos_tag} | {roll} | {trait} | {dc} | {reason (truncated to 60 chars)} |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |

## Chaos Response Gallery

{All 6N chaos responses, organized by situation, for human review of chaos agent quality:}

### {situation_id}: {situation_title}
| Character | Tag | Response |
|---|---|---|
| {character_name} | {chaos_tag} | {full chaos_response} |
| ... | ... | ... |

## Recommendations

{Write 3-5 recommendations. Derive each from the most frequent or most severe patterns in the flags above. Each recommendation should be 2-3 sentences identifying the pattern and suggesting a concrete change to the GM behavioral rules or vault judgment files.}
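Not part of the report template: the Fear bracket grouping used in the Fear Impact Analysis section can be sketched as below. The ruling keys (`fear_level`, `roll`, `dc`) and function names are assumptions for illustration. The sketch covers only the DC check; whether SUCCESS_FEAR and FAILURE_FEAR outcomes escalate still requires reading the outcome text.

```python
from statistics import mean

# Brackets from the Fear Impact Analysis section: low 1-4, mid 5-8, high 9-12.
BRACKETS = (("low", 1, 4), ("mid", 5, 8), ("high", 9, 12))


def fear_bracket(fear_level):
    """Map a Fear level (1-12) to its bracket name."""
    for name, lo, hi in BRACKETS:
        if lo <= fear_level <= hi:
            return name
    raise ValueError(f"Fear level out of range 1-12: {fear_level}")


def mean_dc_by_bracket(run_a_rulings):
    """Average DC of rolled Run A rulings per bracket (None if a bracket is empty)."""
    buckets = {name: [] for name, _, _ in BRACKETS}
    for r in run_a_rulings:
        if r["roll"] == "yes" and r["dc"] is not None:
            buckets[fear_bracket(r["fear_level"])].append(r["dc"])
    return {name: (mean(dcs) if dcs else None) for name, dcs in buckets.items()}
```

A large gap in mean DC between brackets is only a prompt for closer inspection, since situations at different Fear levels may also differ in real difficulty.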

Step 6 — Final summary

After writing the report, print to the user:

The report file path
The summary stats (total responses, total rulings, parse failures, pass rate, flag counts)
The recommendations (copied from the report)
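The summary stats can be derived as in the following sketch. It assumes 6 characters, two GM runs per chaos response, and Pass 1 flag records that carry `situation_id` and `character_name`; those field names and the function name `summary_stats` are assumptions. It also assumes every (situation, character) pair counts toward the pass rate denominator, including pairs with parse failures.

```python
def summary_stats(n_situations, parse_failures, pass1_flags, pass2_flags, pass3_flags,
                  n_characters=6):
    """Compute the numbers shown in the report's Summary section."""
    total_pairs = n_situations * n_characters
    # A pair fails self-consistency if it has at least one Pass 1 flag.
    flagged_pairs = {(f["situation_id"], f["character_name"]) for f in pass1_flags}
    clean_pairs = total_pairs - len(flagged_pairs)
    pass_rate = 100.0 * clean_pairs / total_pairs if total_pairs else 0.0
    return {
        "total_chaos_responses": total_pairs,   # 6N
        "total_gm_rulings": 2 * total_pairs,    # 12N (Run A + Run B)
        "parse_failures": parse_failures,
        "self_consistency_pass_rate": round(pass_rate, 1),
        "tag_based_flags": len(pass2_flags),
        "cross_situation_flags": len(pass3_flags),
    }
```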

Constraints

Parallel wave dispatch: Dispatch all agents within a wave in parallel. Wait for the wave to complete before starting the next. Never dispatch Wave 2 before Wave 1 results are parsed — GM agents depend on chaos outputs.
Sonnet model: Use Sonnet for ALL subagents (chaos and GM). Haiku is unreliable for GM judgment.
Cross-scenario isolation: Never include rulings from other situations, audit criteria, or cross-scenario information in agent prompts. Characters within the same situation may share an agent — situations may not.
Run A / Run B isolation: Run A and Run B agents for the same situation must be separate agent instances. Never put both runs in the same agent — the A/B test requires independent reasoning.
Chaos agent never sees GM skill or audit criteria: The chaos agent prompt contains only character summaries, the situation, and chaos instructions.
GM agent never sees chaos tag: The GM prompt contains only the situation, character details, and the chaos response text. The tag is withheld.
Mandatory RAG queries in GM agents: The GM prompt must instruct the agent to query RAG for three things before ruling: (1) the player action judgment framework, (2) the trait selection decision tree, and (3) difficulty benchmarks for the relevant trait. This is not optional — agents that skip RAG produce the majority of consistency failures.
Audit in the orchestrator: All three audit passes run in the orchestrator's context, not in subagents. The orchestrator has all results in memory.
No modification of test files: Never write to scenarios.md or any character sheet. Only write to the reports/ directory.
Graceful failure: If more than 20% of expected rulings are parse failures, stop collection and write a partial report noting the high failure rate. Do not run audit passes on unreliable data.
Date: Use today's actual date for the report filename, not a placeholder.
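The graceful failure threshold can be checked as in this sketch. The function name `should_abort` is hypothetical, and the expected ruling count is taken to be 12N (6 characters, two runs per situation). Integer arithmetic avoids floating-point edge cases at exactly 20%.

```python
def should_abort(parse_failures, n_situations, n_characters=6, runs=2):
    """True when more than 20% of expected rulings are parse failures."""
    expected = n_situations * n_characters * runs  # 12N with the defaults
    if expected == 0:
        return False
    # parse_failures / expected > 0.20, written without division.
    return parse_failures * 5 > expected
```

For N = 10, 24 failures out of 120 is exactly 20% and does not trigger the stop; 25 failures does.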
