skills/can-post-training-transform-causal/SKILL.md
Perform rigorous causal inference tasks using structured reasoning pipelines inspired by CauGym. Estimate treatment effects (ATE, CDE, ETT, NDE, NIE), compute probabilities of necessity/sufficiency, apply the backdoor criterion for deconfounding, and build causal DAGs from domain knowledge. Trigger phrases: 'estimate causal effect', 'what is the treatment effect of', 'causal reasoning over this data', 'apply backdoor adjustment', 'counterfactual analysis', 'build a causal DAG for this problem'.
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
npx skillsauth add ndpvt-web/arxiv-claude-skills can-post-training-transform-causal
This skill enables Claude to perform structured causal inference on user-provided data or domain problems. Based on the CauGym framework (Chen et al., 2026), it applies Pearl's three-level causal hierarchy — association, intervention, and counterfactual — through explicit step-by-step reasoning over structural causal models (SCMs). Rather than relying on intuition or correlation, this skill enforces a disciplined pipeline: construct a causal DAG, identify adjustment sets via the backdoor criterion, compute causal quantities symbolically, and present results with clear assumptions stated.
The CauGym approach demonstrates that structured causal reasoning follows a learnable pipeline. The core insight is that causal inference tasks decompose into: (1) constructing a directed acyclic graph (DAG) encoding causal assumptions, (2) applying identification rules (backdoor criterion, front-door criterion, do-calculus) to determine whether a causal quantity is estimable from observational data, and (3) computing the quantity using the appropriate adjustment formula. This structured decomposition is what allows a 14B model to outperform GPT-o3 on causal benchmarks — the reasoning is procedural, not intuitive.
The seven core tasks span Pearl's causal hierarchy. At the interventional level: ATE (Average Treatment Effect — population-wide effect of treatment), CDE (Controlled Direct Effect — effect while holding mediators fixed), ETT (Effect of Treatment on the Treated — effect on the subpopulation that was actually treated). At the counterfactual level: NDE (Natural Direct Effect — direct pathway only), NIE (Natural Indirect Effect — mediated pathway only), PN (Probability of Necessity — was the treatment necessary for the outcome?), PS (Probability of Sufficiency — would treatment be sufficient to produce the outcome?). Each has a specific formula and identification condition.
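For reference, the three mediation quantities named above can be written out as follows. This is a sketch using the standard definitions from Pearl's mediation framework, not formulas quoted from CauGym; the binary treatment X, the mediator M, and the nested counterfactual notation Y_{x, M_{x'}} are assumptions introduced here.

```latex
\begin{align*}
\mathrm{CDE}(m) &= E[Y \mid do(X=1, M=m)] - E[Y \mid do(X=0, M=m)] \\
\mathrm{NDE}    &= E[Y_{X=1,\, M_{X=0}}] - E[Y_{X=0,\, M_{X=0}}] \\
\mathrm{NIE}    &= E[Y_{X=0,\, M_{X=1}}] - E[Y_{X=0,\, M_{X=0}}]
\end{align*}
```

CDE fixes the mediator at a chosen value m for everyone, whereas NDE and NIE let the mediator take the value it would naturally have had under the reference treatment, which is why they sit at the counterfactual level of the hierarchy.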
Critical robustness principle: A rigorous causal reasoner must detect when data is insufficient. CauGym trains models to output LACK_CONDITION when required conditional probabilities are missing, and to ignore irrelevant variables (redundant information). Always verify that the data provided is sufficient before computing a causal estimate.
Identify the causal question type. Classify the user's request into one of: ATE, CDE, ETT, NDE, NIE, PN, PS, causal discovery, or deconfounding. If ambiguous, ask the user to clarify whether they want a population-level effect (ATE), a subgroup effect (ETT), or a counterfactual (PN/PS).
Construct or validate the causal DAG. From the user's domain description or provided graph, build a directed acyclic graph with named nodes (variables) and directed edges (causal relationships). Explicitly state every assumption: "We assume X causes Y directly, and Z is a common cause of both X and Y." If the user provides data but no DAG, propose one based on domain knowledge and ask for confirmation.
Identify confounders using the backdoor criterion. For the treatment-outcome pair (X, Y), find a set Z such that: (a) no variable in Z is a descendant of X, and (b) Z blocks every backdoor path from X to Y (paths with an arrow into X). State the identified adjustment set explicitly.
Check data sufficiency. Verify that the observational data contains all conditional probabilities required by the adjustment formula. If any P(Y|X, Z) term cannot be computed from the data, report LACK_CONDITION with a specific explanation of what is missing, rather than guessing or approximating.
Write the identification formula. Express the causal quantity in terms of observational distributions using the appropriate formula:
ATE: P(Y=1|do(X=1)) - P(Y=1|do(X=0)) = Sum_z [P(Y=1|X=1,Z=z) - P(Y=1|X=0,Z=z)] * P(Z=z)
ETT: P(Y_{X=1}=1|X=1) - P(Y_{X=0}=1|X=1), using backdoor adjustment conditioned on the treated
PN: P(Y_{X=0}=0|X=1,Y=1); requires both observational and interventional data, or monotonicity
PS: P(Y_{X=1}=1|X=0,Y=0); the dual of necessity
Compute the numerical result. Plug in the provided probabilities or data summaries. Show each arithmetic step. For binary variables, enumerate all values of the adjustment set. For continuous variables, describe the integral or suggest an estimation method (e.g., inverse probability weighting, regression adjustment).
Interpret the result in domain terms. Translate the numerical answer back into the user's domain language: "The average causal effect of the drug on recovery is 0.15, meaning treatment increases recovery probability by 15 percentage points after adjusting for age and severity."
State assumptions and limitations. Explicitly list: (a) causal sufficiency (no unmeasured confounders), (b) positivity (every stratum of the adjustment set has nonzero probability of receiving each treatment level), (c) consistency (well-defined interventions), and any domain-specific assumptions encoded in the DAG.
Validate robustness. If the user provides enough data, suggest sensitivity analyses: What if there is an unmeasured confounder? How large would its effect need to be to nullify the result? Reference E-values or bounds where appropriate.
Generate code if requested. Produce Python code using dowhy, pgmpy, or networkx for DAG construction, identification, and estimation. Include the full pipeline from DAG specification to effect estimation.
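The DAG construction and backdoor check described above can be sketched in plain Python. This is a minimal, dependency-free illustration, not the CauGym implementation: the DAG is assumed to be a dict mapping each parent to its set of children, and the helper names (`descendants`, `simple_paths`, `path_blocked`, `satisfies_backdoor`) are hypothetical. It enumerates every path, so it only suits small graphs.

```python
def descendants(dag, node):
    """All nodes reachable from `node` along directed edges (excluding `node`)."""
    seen, stack = set(), [node]
    while stack:
        for child in dag.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def has_edge(dag, a, b):
    """True if the DAG contains the directed edge a -> b."""
    return b in dag.get(a, ())


def simple_paths(dag, start, goal):
    """Yield every simple path from start to goal, ignoring edge direction."""
    neighbours = {}
    for parent, children in dag.items():
        for child in children:
            neighbours.setdefault(parent, set()).add(child)
            neighbours.setdefault(child, set()).add(parent)

    def walk(path):
        if path[-1] == goal:
            yield path
            return
        for nxt in neighbours.get(path[-1], ()):
            if nxt not in path:
                yield from walk(path + [nxt])

    yield from walk([start])


def path_blocked(dag, path, z):
    """d-separation test for one path given the conditioning set z."""
    for prev, mid, nxt in zip(path, path[1:], path[2:]):
        collider = has_edge(dag, prev, mid) and has_edge(dag, nxt, mid)
        if collider:
            # A collider blocks unless it or one of its descendants is in z.
            if mid not in z and not (descendants(dag, mid) & z):
                return True
        elif mid in z:
            # A chain or fork node blocks when it is conditioned on.
            return True
    return False


def satisfies_backdoor(dag, x, y, z):
    """Check conditions (a) and (b) of the backdoor criterion for (x, y)."""
    z = set(z)
    if z & descendants(dag, x):  # (a) no descendant of x in z
        return False
    backdoor = (p for p in simple_paths(dag, x, y) if has_edge(dag, p[1], x))
    return all(path_blocked(dag, p, z) for p in backdoor)  # (b)


# DAG from the drug example: Age -> Drug, Age -> Recovery, Drug -> Recovery
dag = {"Age": {"Drug", "Recovery"}, "Drug": {"Recovery"}}
print(satisfies_backdoor(dag, "Drug", "Recovery", {"Age"}))
print(satisfies_backdoor(dag, "Drug", "Recovery", set()))
```

For larger graphs, a library such as dowhy, pgmpy, or networkx would be the more practical choice; the sketch is only meant to make the two conditions of the criterion concrete.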
Example 1: Estimating ATE with confounding
User: "I have data on whether patients took a drug (D), their age group (A: young/old), and recovery (R). I think age affects both drug-taking and recovery. What is the causal effect of the drug on recovery? Here are the probabilities: P(R=1|D=1,A=young)=0.8, P(R=1|D=0,A=young)=0.6, P(R=1|D=1,A=old)=0.5, P(R=1|D=0,A=old)=0.3, P(A=young)=0.4, P(A=old)=0.6."
Approach:
Output:
Causal DAG: Age -> Drug, Age -> Recovery, Drug -> Recovery
Adjustment set: {Age}
ATE = [P(R=1|D=1,A=young) - P(R=1|D=0,A=young)] * P(A=young)
+ [P(R=1|D=1,A=old) - P(R=1|D=0,A=old)] * P(A=old)
= (0.8 - 0.6)(0.4) + (0.5 - 0.3)(0.6)
= 0.08 + 0.12
= 0.20
Interpretation: The drug increases recovery probability by 20 percentage
points on average, after adjusting for age confounding.
Assumptions: No unmeasured confounders beyond age, positivity holds
(both age groups can receive or not receive the drug).
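The arithmetic in this example can be reproduced with a few lines of Python. This is a minimal sketch of the backdoor adjustment formula for a binary treatment and a discrete adjustment set; the function name `backdoor_ate` and the dictionary layout are assumptions made for illustration, while the probabilities are the ones given in the user's question.

```python
def backdoor_ate(p_y_given_xz, p_z):
    """Sum over z of [P(Y=1|X=1,Z=z) - P(Y=1|X=0,Z=z)] * P(Z=z)."""
    return sum(
        (p_y_given_xz[(1, z)] - p_y_given_xz[(0, z)]) * weight
        for z, weight in p_z.items()
    )


# Keys are (drug, age_group); values are P(R=1 | D=drug, A=age_group).
p_recovery = {
    (1, "young"): 0.8,
    (0, "young"): 0.6,
    (1, "old"): 0.5,
    (0, "old"): 0.3,
}
p_age = {"young": 0.4, "old": 0.6}

ate = backdoor_ate(p_recovery, p_age)
print(f"ATE = {ate:.2f}")
```

Each stratum contributes its own risk difference weighted by how common that stratum is in the population, which is exactly the two-term sum shown in the output above.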
Example 2: Detecting insufficient data
User: "Does smoking (S) cause cancer (C)? I know P(C=1|S=1)=0.4 and P(C=1|S=0)=0.1. There might be a genetic factor (G) that affects both."
Approach:
Output:
Causal DAG: Genetics -> Smoking, Genetics -> Cancer, Smoking -> Cancer
Required adjustment set: {Genetics}
LACK_CONDITION: Cannot compute the causal effect.
Missing: P(C=1|S=1,G=g), P(C=1|S=0,G=g), and P(G=g) for each
value of the genetic factor.
The naive difference P(C=1|S=1) - P(C=1|S=0) = 0.3 is a biased
associational estimate that conflates the causal effect of smoking
with the confounding influence of genetics.
To proceed, you need either:
(a) Data on P(C|S,G) and P(G), or
(b) A randomized experiment removing the S<-G confounding path, or
(c) An instrumental variable that affects S, affects C only through S, and is independent of G.
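The sufficiency check in this example can be sketched as a guard placed in front of the adjustment formula. This is an illustrative sketch only: the function names `missing_terms` and `adjusted_ate_or_lack`, the dictionary layout, and the stratum labels "g0" and "g1" for a binary genetic factor are all assumptions, since the user's question does not say how G is coded.

```python
from itertools import product

LACK_CONDITION = "LACK_CONDITION"


def missing_terms(p_y_given_xz, p_z, z_values, x_values=(0, 1)):
    """List every probability the backdoor formula needs but the data lacks."""
    missing = [
        f"P(Y=1|X={x},Z={z})"
        for x, z in product(x_values, z_values)
        if (x, z) not in p_y_given_xz
    ]
    missing += [f"P(Z={z})" for z in z_values if z not in p_z]
    return missing


def adjusted_ate_or_lack(p_y_given_xz, p_z, z_values):
    """Return (ATE, []) when identifiable from the data, else (LACK_CONDITION, missing)."""
    missing = missing_terms(p_y_given_xz, p_z, z_values)
    if missing:
        return LACK_CONDITION, missing
    ate = sum(
        (p_y_given_xz[(1, z)] - p_y_given_xz[(0, z)]) * p_z[z] for z in z_values
    )
    return ate, []


# The smoking question supplies only marginal P(C|S), nothing stratified by G,
# so both tables are empty.
status, missing = adjusted_ate_or_lack({}, {}, z_values=("g0", "g1"))
print(status)
for term in missing:
    print("missing:", term)
```

Returning the list of missing terms alongside the flag lets the response say exactly what data would be needed, rather than only refusing.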
Example 3: Probability of Necessity (counterfactual)
User: "A patient took the drug (D=1) and recovered (R=1). Would they have NOT recovered if they hadn't taken the drug? Assume monotonicity (the drug never hurts). P(R=1|do(D=1))=0.7, P(R=1|do(D=0))=0.3."
Approach:
Output:
Task: Probability of Necessity
Question: Given the patient took the drug and recovered,
what is the probability they would NOT have recovered without it?
Under monotonicity assumption (drug never prevents recovery):
PN = [P(R=1|do(D=1)) - P(R=1|do(D=0))] / P(R=1|do(D=1))
= (0.7 - 0.3) / 0.7
= 0.571
Interpretation: There is a 57.1% probability that this patient's
recovery was causally due to the drug — i.e., they would not have
recovered without it.
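Assumptions: Monotonicity (no individual for whom the drug prevents
recovery) and exogeneity (no confounding between D and R), which the
ratio formula above also relies on. If monotonicity is violated, PN is
only bounded rather than point-identified.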
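The PN calculation above can be checked with a short function. This is a sketch of the ratio formula used in this example, which holds when the treatment is both monotonic and unconfounded (exogenous); the exogeneity condition and the function name `pn_monotonic_exogenous` are assumptions added here, not part of the user's question.

```python
def pn_monotonic_exogenous(p_y_do1, p_y_do0):
    """PN = [P(Y=1|do(X=1)) - P(Y=1|do(X=0))] / P(Y=1|do(X=1)).

    Assumes monotonicity and no confounding between treatment and outcome.
    """
    if not (0.0 < p_y_do1 <= 1.0 and 0.0 <= p_y_do0 <= 1.0):
        raise ValueError("probabilities must lie in [0, 1] and P(Y=1|do(X=1)) > 0")
    # Under monotonicity the numerator cannot be negative; clamp so that
    # sampling noise in the inputs does not yield a negative probability.
    return max(0.0, (p_y_do1 - p_y_do0) / p_y_do1)


pn = pn_monotonic_exogenous(0.7, 0.3)
print(f"PN = {pn:.3f}")
```

When the treatment is confounded, the same interventional probabilities are not enough on their own and the observational joint distribution of D and R is needed as well.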
Do:
LACK_CONDITION honestly when data is insufficient rather than making unstated assumptions to fill gaps.
Avoid:
dowhy or statsmodels code.
Chen, J., Chen, S., & Lu, C. (2026). Can Post-Training Transform LLMs into Causal Reasoners? arXiv:2602.06337. https://arxiv.org/abs/2602.06337
Key takeaway: Structured decomposition of causal tasks into DAG construction, identification via backdoor/front-door criteria, and formula-based computation enables reliable causal reasoning. The CauGym framework and its seven tasks (ATE, CDE, ETT, NDE, NIE, PN, PS) provide the taxonomy. Code and GRPO-trained model at https://github.com/OpenCausaLab/CauGym.