skills/causalt5k-diagnosing-informing-refusal/SKILL.md
Diagnose and correct causal reasoning failures in LLM outputs using the CausalT5K framework. Detects rung collapse (answering causal questions with mere correlations), sycophantic drift (abandoning correct answers under pressure), and generates Wise Refusals that specify missing evidence. Trigger phrases: 'diagnose causal reasoning', 'check for rung collapse', 'detect sycophancy in reasoning', 'wise refusal analysis', 'causal trap detection', 'audit causal claims in this output'
npx skillsauth add ndpvt-web/arxiv-claude-skills causalt5k-diagnosing-informing-refusal
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to apply the CausalT5K diagnostic framework to real-world causal reasoning tasks. It operationalizes Pearl's Ladder of Causation (association, intervention, counterfactual) as a practical audit tool: classifying causal claims by rung, detecting when reasoning collapses to a lower rung than the question demands, identifying 10 specific causal trap families (selection bias, Simpson's paradox, reverse causation, etc.), and producing structured Wise Refusals that name the trap, specify the missing evidence, and explicitly decline unwarranted endorsement.
Pearl's Ladder as Diagnostic Infrastructure. The core insight of CausalT5K is that causal reasoning failures are not random -- they cluster into diagnosable pathologies mapped to three rungs of Pearl's causal hierarchy. Rung 1 (Association) asks "What do I observe?" Rung 2 (Intervention) asks "What happens if I act?" Rung 3 (Counterfactual) asks "What would have happened if things had been different?" Rung collapse occurs when a model answers a Rung 2 or 3 question using only Rung 1 evidence -- for example, citing a correlation to justify an interventional recommendation. The framework provides a taxonomy of 10 "Wolf" traps (invalid designs: selection bias, survivorship bias, confounding, Simpson's paradox, reverse causation, post hoc fallacy, ecological fallacy, base rate neglect, healthy user bias, regression to the mean) and 8 "Sheep" designs (valid: RCTs, natural experiments, instrumental variables, difference-in-differences, regression discontinuity, ablation studies, mechanism + dose gradients, lottery assignment).
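The rung comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: the names `Rung`, `rung_gap`, `WOLF_TRAPS`, and `SHEEP_DESIGNS` are hypothetical, and the two tuples simply restate the taxonomy listed in the text.

```python
from enum import IntEnum


class Rung(IntEnum):
    """Pearl's three rungs, ordered so that comparisons are meaningful."""
    ASSOCIATION = 1      # "What do I observe?"
    INTERVENTION = 2     # "What happens if I act?"
    COUNTERFACTUAL = 3   # "What would have happened if things had been different?"


# Taxonomy as listed in the text (identifiers are illustrative, not official).
WOLF_TRAPS = (
    "selection bias", "survivorship bias", "confounding", "Simpson's paradox",
    "reverse causation", "post hoc fallacy", "ecological fallacy",
    "base rate neglect", "healthy user bias", "regression to the mean",
)
SHEEP_DESIGNS = (
    "RCT", "natural experiment", "instrumental variables",
    "difference-in-differences", "regression discontinuity",
    "ablation study", "mechanism + dose gradient", "lottery assignment",
)


def rung_gap(question: Rung, evidence: Rung) -> int:
    """Number of rungs the evidence falls short of the question (0 = no collapse)."""
    return max(0, question - evidence)


print(rung_gap(Rung.INTERVENTION, Rung.ASSOCIATION))    # 1
print(rung_gap(Rung.COUNTERFACTUAL, Rung.ASSOCIATION))  # 2
print(rung_gap(Rung.INTERVENTION, Rung.INTERVENTION))   # 0
```

Using an ordered enum keeps the collapse check to a single subtraction; evidence at a higher rung than the question is treated here as no collapse.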
Two-Axis Decomposition Reveals Hidden Failures. Aggregate accuracy masks dangerous asymmetries. CausalT5K decomposes performance into Utility (sensitivity: correctly endorsing valid causal claims) and Safety (specificity: correctly rejecting invalid traps). A model with 96% Safety but 40% Utility rejects 60% of legitimate causal claims -- a skepticism trap invisible to single-number metrics. Similarly, the Detection-Correction Gap (48-55% across model families) shows that models frequently detect a causal trap but fail to follow through with refusal, producing a diagnosis without a conclusion.
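As a rough sketch of the two-axis decomposition, the snippet below computes Utility and Safety from labeled verdicts. The function name, the record layout, and the evaluation set are all hypothetical; the counts were chosen only to reproduce the 96% Safety / 40% Utility case mentioned above.

```python
def utility_and_safety(records):
    """records: iterable of (claim_is_valid, model_endorsed) boolean pairs."""
    valid = [endorsed for is_valid, endorsed in records if is_valid]
    traps = [endorsed for is_valid, endorsed in records if not is_valid]
    if not valid or not traps:
        raise ValueError("need at least one valid claim and one trap")
    utility = sum(valid) / len(valid)                 # sensitivity: valid claims endorsed
    safety = sum(not e for e in traps) / len(traps)   # specificity: traps rejected
    return utility, safety


# Hypothetical evaluation set: 50 valid claims (20 endorsed), 50 traps (48 rejected).
records = (
    [(True, True)] * 20 + [(True, False)] * 30
    + [(False, False)] * 48 + [(False, True)] * 2
)
utility, safety = utility_and_safety(records)
accuracy = sum(is_valid == endorsed for is_valid, endorsed in records) / len(records)
print(f"Utility={utility:.2f} Safety={safety:.2f} Accuracy={accuracy:.2f}")
# Utility=0.40 Safety=0.96 Accuracy=0.68
```

The single accuracy figure of 0.68 looks unremarkable, while the split shows that most legitimate claims are being rejected.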
The Wise Refusal Protocol. Rather than binary accept/reject, the framework mandates a three-step response for underdetermined claims: (1) Classify the trap family, (2) State the pivotal question -- the specific missing information that would resolve the ambiguity, (3) Explicitly refuse to endorse the claim. This transforms refusal from evasion into actionable guidance.
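One way to keep the three steps together is to make them required fields of a single record. The sketch below is an assumption about how this could be structured; `WiseRefusal` and its field names are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WiseRefusal:
    trap_family: str        # step 1: the classified trap family
    pivotal_question: str   # step 2: the missing information that would resolve it
    claim: str              # the claim that is explicitly not endorsed (step 3)

    def __post_init__(self):
        # A refusal with a missing part is a diagnosis without a conclusion.
        for name in ("trap_family", "pivotal_question", "claim"):
            if not getattr(self, name).strip():
                raise ValueError(f"{name} must be non-empty")

    def render(self) -> str:
        return (
            f"Trap: {self.trap_family}\n"
            f"Pivotal question: {self.pivotal_question}\n"
            f"Verdict: I do not endorse the claim that {self.claim}."
        )


refusal = WiseRefusal(
    trap_family="healthy user bias",
    pivotal_question="Does retention differ when notifications are randomly assigned?",
    claim="forcing notifications on will raise retention",
)
print(refusal.render())
```

Rejecting empty fields at construction time is a design choice aimed at the detection-without-refusal failure: the record cannot exist unless all three parts are present.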
1. Identify the causal rung demanded by the question. Determine whether the user's query is associational ("Is X correlated with Y?"), interventional ("Will doing X cause Y?"), or counterfactual ("Would Y have been different if X hadn't happened?"). Tag the question with its rung.
2. Identify the causal rung of the available evidence. Examine the data, study design, or reasoning provided. Classify it: observational correlation (Rung 1), controlled experiment or quasi-experiment (Rung 2), or structural causal model with specified invariants (Rung 3).
3. Check for rung collapse. Compare the rung of the question to the rung of the evidence. If the question demands Rung 2 but the evidence is Rung 1, flag this as rung collapse. Name the specific gap (e.g., "This is an interventional question but the evidence is purely associational -- no controlled manipulation was performed").
4. Scan for Wolf traps in the evidence. Systematically check the 10 trap families: selection bias, survivorship bias, healthy user bias, regression to the mean, ecological fallacy, base rate neglect, confounding, Simpson's paradox, reverse causation, post hoc fallacy. For each plausible trap, note which variables or design features trigger it.
5. Check for Sheep validity if evidence looks sound. If no traps are detected, verify against the 8 valid design patterns (RCT, natural experiment, IV, DiD, RDD, ablation, mechanism + dose gradient, lottery assignment). Confirm the design satisfies its own assumptions.
6. Compute Utility vs. Safety posture. Ask: Am I at risk of over-rejection (high Safety, low Utility -- dismissing valid claims)? Or under-rejection (high Utility, low Safety -- endorsing flawed claims)? Calibrate accordingly.
7. If the claim is underdetermined, execute the Wise Refusal protocol. (a) Name the trap family or evidential gap. (b) State the pivotal question -- the specific data, design feature, or analysis that would resolve the ambiguity. (c) Explicitly decline to endorse the causal claim.
8. If the claim is justified, endorse with explicit reasoning. State which valid design pattern applies, confirm rung alignment, and note any residual caveats.
9. Resist sycophantic drift under pushback. If the user challenges a refusal with social pressure ("I really think X causes Y") or epistemic pressure ("Rethink, I suspect you're wrong"), re-run steps 1-6 rather than capitulating. Only change the answer if the user provides new evidence or a valid structural argument, not mere disagreement.
10. Structure the output. Deliver results in the diagnostic format: Rung Classification, Trap/Design Identification, Evidence Assessment, Verdict (Endorse / Wise Refusal / Reject), and Pivotal Question if applicable.
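The decision logic in the workflow above could be sketched as follows. This is a simplified illustration under stated assumptions: `Diagnosis` and `should_revise` are hypothetical names, rungs are plain integers, and the "Reject" verdict is left out because the text does not say when it applies instead of a Wise Refusal.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Diagnosis:
    question_rung: int                               # rung the question demands (1-3)
    evidence_rung: int                               # rung the evidence reaches (1-3)
    traps: List[str] = field(default_factory=list)   # Wolf traps found in the evidence
    valid_design: Optional[str] = None               # Sheep design whose assumptions hold

    @property
    def rung_collapse(self) -> bool:
        return self.evidence_rung < self.question_rung

    def verdict(self) -> str:
        if self.rung_collapse or self.traps:
            return "Wise Refusal"
        if self.valid_design:
            return "Endorse"
        # Assumption: with no confirmed design, treat the claim as underdetermined.
        return "Wise Refusal"


def should_revise(new_evidence: bool, structural_argument: bool) -> bool:
    """Pushback changes the verdict only if it adds evidence or a valid argument."""
    return new_evidence or structural_argument


observational = Diagnosis(2, 1, traps=["healthy user bias", "reverse causation"])
ab_test = Diagnosis(2, 2, valid_design="RCT")
print(observational.verdict())   # Wise Refusal
print(ab_test.verdict())         # Endorse
print(should_revise(new_evidence=False, structural_argument=False))  # False
```

Note that `should_revise` takes no argument for how strongly the user disagrees; under this sketch, disagreement alone has no path to changing the verdict.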
Example 1: Detecting rung collapse in a business analytics claim
User: "Our data shows that users who enable push notifications have 40% higher retention. Should we force-enable notifications for all users?"
Approach:
Output:
## Causal Diagnosis
**Rung Classification:** Question is Rung 2 (interventional); evidence is Rung 1 (associational).
**Rung Collapse:** Yes. The 40% retention difference is observational -- no intervention was performed.
**Trap Identification:**
- Healthy user bias: Users who enable notifications are likely more engaged to begin with.
- Reverse causation: Higher retention may drive notification adoption, not the other way around.
- Confounding: Engaged users differ from disengaged users on many unmeasured dimensions.
**Verdict: Wise Refusal**
The available evidence does not support the interventional claim that forcing notifications will increase retention.
**Pivotal Question:** Run a randomized experiment where a random subset of new users has notifications force-enabled, then compare 30-day retention between the treatment and control groups. This eliminates self-selection and isolates the causal effect of the notification intervention itself.
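To see why the randomized comparison matters, here is a small simulation sketch. Everything in it is hypothetical: the latent `engagement` trait, the opt-in rule, and the retention formula are invented for illustration, and notifications are given no causal effect at all on retention.

```python
import random


def simulate(n: int = 20_000, seed: int = 0):
    rng = random.Random(seed)
    obs = {True: [0, 0], False: [0, 0]}   # opted_in  -> [retained, total]
    rct = {True: [0, 0], False: [0, 0]}   # forced_on -> [retained, total]
    for _ in range(n):
        engagement = rng.random()                          # latent trait in [0, 1)
        opted_in = rng.random() < engagement               # engaged users opt in more often
        forced_on = rng.random() < 0.5                     # coin-flip assignment
        retained = rng.random() < 0.2 + 0.6 * engagement   # notifications have NO effect
        for table, key in ((obs, opted_in), (rct, forced_on)):
            table[key][0] += retained
            table[key][1] += 1

    def gap(table):
        # Retention rate with notifications minus retention rate without.
        return table[True][0] / table[True][1] - table[False][0] / table[False][1]

    return gap(obs), gap(rct)


obs_gap, rct_gap = simulate()
print(f"observational gap: {obs_gap:+.3f}")
print(f"randomized gap:    {rct_gap:+.3f}")
```

Under these assumptions the self-selected comparison should show a gap of roughly 0.20 while the randomized comparison stays near zero, even though the simulated notifications do nothing.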
Example 2: Validating a well-designed A/B test
User: "We ran an A/B test with random assignment on 50,000 users. The treatment group (new checkout flow) had a 3.2% higher conversion rate (p=0.003, 95% CI [1.1%, 5.3%]). Is this causal?"
Approach:
Output:
## Causal Diagnosis
**Rung Classification:** Question and evidence are both Rung 2 (interventional). No rung collapse.
**Design Validation:** Randomized controlled trial (Sheep design).
- Random assignment addresses confounding and selection bias.
- N=50,000 provides adequate power.
- The effect is statistically significant (p=0.003), and the 95% CI [1.1%, 5.3%] excludes zero.
**Verdict: Endorse**
The causal claim that the new checkout flow increases conversion by approximately 3.2% is supported by the experimental design.
**Caveats:** Verify there was no differential attrition between groups (would reintroduce selection bias). Confirm the randomization unit matches the analysis unit (no clustering effects). External validity depends on whether the test population is representative of the target deployment population.
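A quick sanity check on reported statistics is to confirm that the point estimate, confidence interval, and p-value agree with one another. The sketch below assumes a symmetric normal-approximation (Wald) interval and a two-sided test, which the user's message does not state; variable names are illustrative.

```python
import math

effect = 3.2                 # reported lift, in percentage points
ci_low, ci_high = 1.1, 5.3   # reported 95% CI

midpoint = (ci_low + ci_high) / 2           # should match the point estimate
se = (ci_high - ci_low) / (2 * 1.96)        # standard error implied by the CI width
z = effect / se
p_two_sided = math.erfc(z / math.sqrt(2))   # equals 2 * (1 - Phi(z))

print(f"midpoint={midpoint:.2f}  SE={se:.2f}  z={z:.2f}  p={p_two_sided:.4f}")
```

Under those assumptions the implied p-value comes out close to the reported 0.003, so the three numbers are mutually consistent; a large mismatch would be a reason to ask how the statistics were computed before endorsing.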
Example 3: Wise Refusal with counterfactual reasoning (Rung 3)
User: "If we hadn't migrated to microservices last year, our outage rate would be the same as before."
Approach:
Output:
## Causal Diagnosis
**Rung Classification:** Question is Rung 3 (counterfactual); evidence is Rung 1 (temporal association).
**Rung Collapse:** Yes -- two rungs of collapse.
**Trap Identification:**
- Post hoc fallacy: Outage rate changes may coincide with migration without being caused by it.
- Confounding: Traffic growth, team changes, monitoring improvements, and other infrastructure changes all occurred in the same period.
**Verdict: Wise Refusal**
The counterfactual claim cannot be evaluated from pre/post comparison alone.
**Pivotal Question:** To evaluate this counterfactual, specify (1) the structural causal model linking microservices architecture to outage rate, (2) which variables are held invariant in the counterfactual world (traffic, team size, monitoring), and (3) whether any comparable systems remained on the monolith during the same period (natural experiment). Without these, the counterfactual is underdetermined.
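The role of the structural model can be illustrated with a toy linear example. The equation, the coefficients, and the function name below are entirely hypothetical and serve only to show the abduction, action, prediction steps and how the answer depends on the assumed model.

```python
def counterfactual_outages(observed_outages, traffic, microservices,
                           base=2.0, b_traffic=0.5, b_arch=-1.0):
    """Toy structural equation: outages = base + b_traffic*traffic + b_arch*microservices + u."""
    # Abduction: recover the unit-specific noise term u from what was observed.
    u = observed_outages - (base + b_traffic * traffic + b_arch * microservices)
    # Action: set the architecture to monolith (0), holding traffic and u invariant.
    # Prediction: re-evaluate the structural equation in the modified world.
    return base + b_traffic * traffic + b_arch * 0 + u


# Hypothetical observation: 6 outages per quarter at traffic level 10 after migrating.
print(counterfactual_outages(6.0, 10.0, 1))               # 7.0 if microservices reduce outages
print(counterfactual_outages(6.0, 10.0, 1, b_arch=1.0))   # 5.0 if they increase outages
```

The same observation yields opposite counterfactual conclusions depending on the assumed sign of `b_arch`, which is why the model and its invariants have to be specified before the claim can be evaluated.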
Paper: Geng, Ouyang, Wu, Barretto, Hayes. "CausalT5K: Diagnosing and Informing Refusal for Trustworthy Causal Reasoning of Skepticism, Sycophancy, Detection-Correction, and Rung Collapse." arXiv:2602.08939v1 (2026). Look for: the Sheep/Wolf taxonomy (Table 2), Four-Quadrant Control Landscape (Table 8), Detection-Correction Gap measurements (Table 10), and Wise Refusal scoring rubric (Section 4.3).