clinical-biostatistics/categorical-tests/SKILL.md
Tests associations between categorical variables in clinical data using chi-square, Fisher's exact, Boschloo, Cochran-Mantel-Haenszel, and modern McNemar variants with calibrated confidence intervals (Wilson, Newcombe, Miettinen-Nurminen). Use when analyzing categorical outcomes, paired binary endpoints, or testing treatment-outcome independence in confirmatory or exploratory clinical trials.
npx skillsauth add GPTomics/bioSkills bio-clinical-biostatistics-categorical-testsInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: scipy 1.12+ (Boschloo and Barnard added in 1.7), statsmodels 0.14+, pingouin 0.5+, exact2x2 (R) 1.6+, pandas 2.1+, numpy 1.26+.
Before using code patterns, verify installed versions match. If versions differ:
pip show <package> then help(module.function) to check signaturespackageVersion() then ?function_name to verify parametersIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
"Test association between categorical variables" -> Determine whether treatment and a categorical clinical outcome are statistically independent (or that marginal proportions agree, for paired data) using a test calibrated to the design, the sample size, and the regulatory question.
| Test | Design | Asymptotic / exact | Conditioning | Strength | Fails when |
|------|--------|--------------------|--------------|----------|------------|
| Pearson chi-square (no continuity correction) | Independent groups, any RxC | Asymptotic | None | Standard for n>=40 with all expected counts >=5; matches Miettinen-Nurminen score CI | Any expected cell <1; >20% of cells with expected <5 (Cochran 1954) |
| Fisher's exact (conditional) | Independent 2x2 | Exact | Conditions on BOTH margins | Exact small-sample guarantee on level | Conservative (true alpha << nominal); discards information by double conditioning (Mehta-Senchaudhuri 2003) |
| Boschloo's exact | Independent 2x2 | Exact unconditional | Conditions on ONE margin only | Uniformly more powerful than Fisher (Boschloo 1970; Mehta-Senchaudhuri 2003); preserves nominal alpha exactly | Computationally heavier; RxC extensions limited |
| Barnard's exact | Independent 2x2 | Exact unconditional | Conditions on ONE margin only | Maximises nuisance parameter; well-calibrated | Slightly less powerful than Boschloo on average; compute scales O(n^2) |
| CMH (Mantel-Haenszel) | Stratified independent groups | Asymptotic | Conditions within strata | Tests common-OR null across strata; pooled OR estimator | Assumes no qualitative interaction; misleading when ORs reverse direction across strata |
| Breslow-Day | Stratified independent groups | Asymptotic | Within strata | Tests homogeneity of stratum ORs | Underpowered with few strata or sparse strata; non-significance does NOT prove homogeneity |
| McNemar (asymptotic, no continuity correction) | Paired binary | Asymptotic | Conditions on discordant pairs | Fagerland 2013 default; outperforms exact conditional | Discordant pair count b+c < 25 (chi-square approximation breaks) |
| Mid-p McNemar | Paired binary | Quasi-exact | Discordant pairs | Fagerland-Lydersen-Laake 2013 recommended default; less conservative than exact conditional | Slight under-coverage tolerable at small b+c |
| Exact conditional McNemar (Liddell 1983) | Paired binary | Exact | Discordant pairs only | Guaranteed coverage | Over-conservative; loses power vs mid-p or unconditional |
| Suissa-Shuster exact unconditional | Paired binary | Exact unconditional | All N pairs | Uniformly more powerful than exact conditional McNemar; 20-40% smaller n for same power | Implementation only in R exact2x2::mcnemarExactDP and SAS macros |
Postdoc reading: Lydersen, Fagerland & Laake 2009 Stat Med 28:1159 ("Recommended tests for association in 2x2 tables") argues Fisher's exact should be retired from routine use in favour of Boschloo or asymptotic Pearson; Fagerland-Lydersen-Laake 2013 BMC Med Res Methodol 13:91 makes the parallel case for mid-p or asymptotic McNemar over exact conditional. Regulatory practice (FDA reviewers) is moving in this direction but Fisher and exact conditional McNemar remain entrenched in many SAPs by inertia.
| Scenario | Recommended test | Why |
|----------|------------------|-----|
| Independent 2x2, all expected >=5, n>=40 | Pearson chi-square, correction=False | Standard asymptotic; Yates' continuity correction is overly conservative and now discouraged |
| Independent 2x2, expected <5 in any cell OR n<40 | Boschloo's exact (scipy.stats.boschloo_exact) | Uniformly more powerful than Fisher; preserves exact Type-I control |
| Independent RxC, expected <5 in >20% of cells | Permutation chi-square or Fisher-Freeman-Halton (R coin::chisq_test(distribution = approximate())) | Exact RxC asymptotic invalid; permutation preserves level |
| Stratified design (multi-site, multi-stratum randomisation) | CMH for the pooled test + Breslow-Day for homogeneity + per-stratum ORs | Stratification factor in randomisation MUST appear in analysis (Kahan-Morris 2012 Stat Med 31:328: ignoring inflates Type-I error by up to 30%) |
| Stratified with sign-reversing effect (Simpson's paradox suspected) | Always report stratum-specific ORs + visual diagnostic; consider logistic regression with interaction | CMH pooled OR can mask sign reversal; homogeneity test underpowered |
| Paired binary (pre/post on same subjects; matched case-control) | Asymptotic McNemar (mcnemar(table, exact=False, correction=False)) when b+c >= 25; mid-p when b+c < 25 | Fagerland 2013 simulations show mid-p and asymptotic outperform exact conditional |
| Matched-pair non-inferiority (especially diagnostics) | Suissa-Shuster exact unconditional via R exact2x2 | 20-40% smaller n than exact conditional for same power |
| Composite endpoint (any of several events) | Logistic regression with covariate adjustment, not chi-square | Composite changes the estimand under ICH E9(R1); see clinical-biostatistics/effect-measures |
Goal: Test whether treatment group and outcome category are independent under asymptotic Type-I control.
Approach: Construct a contingency table, verify expected cell counts, compute the Pearson chi-square statistic without Yates' continuity correction.
from scipy.stats import chi2_contingency
import pandas as pd
table = pd.crosstab(df['treatment'], df['outcome'])
chi2, p, dof, expected = chi2_contingency(table, correction=False)
if (expected < 5).any():
print('WARNING: switch to Boschloo (2x2) or permutation chi-square (RxC)')
Cochran 1954 rule (the precise version, not the textbook caricature): "no expected cell should be <1 AND no more than 20% of cells should have expected <5." The textbook "all >=5" rule is the conservative simplification. With well-balanced 2x2 trials this rarely matters; with sparse RxC tables it materially expands the asymptotic range. The R chisq.test issues a warning under the strict Cochran rule; Python users must check manually.
Why Yates' correction is now discouraged: continuity correction was introduced to approximate the exact distribution under H0 but inflates Type-II error by ~10% (D'Agostino, Casagrande, Pike 1988 Stat Med 7:347). Modern computing makes Boschloo's exact test cheap; the correct fix for sparse 2x2 is Boschloo, not chi-square + continuity.
Goal: Test 2x2 association with exact Type-I control.
Approach: Use Fisher's exact only when historical SAP requires it; otherwise prefer Boschloo's test.
from scipy.stats import fisher_exact, boschloo_exact
odds_ratio, p_fisher = fisher_exact(table.values, alternative='two-sided')
# Boschloo (uniformly more powerful than Fisher):
# n= controls Sobol sampling resolution for the null distribution (scipy 1.12+; default 32);
# higher = more precise p-value at higher CPU cost. NOT the sample size per arm.
result = boschloo_exact(table.values, alternative='two-sided', n=64)
p_boschloo = result.pvalue
The conditioning critique (Mehta-Senchaudhuri 2003): Fisher's exact conditions on both margins of the 2x2 table, discarding information about the marginal totals. Boschloo conditions on one margin only and treats the second as a nuisance to be maximised over -- recovering the discarded information. Power gain at n=10/arm is 16-20 percentage points for moderate effects. Boschloo uses Fisher's p-value as its test statistic, then computes the exact unconditional null distribution of that p-value -- so it is automatically at least as powerful as Fisher.
Since scipy 1.10, fisher_exact returns the sample (unconditional) odds ratio, not the conditional MLE. For the conditional MLE matching R's fisher.test, use scipy.stats.contingency.odds_ratio(table, kind='conditional').
Goal: Test treatment-outcome association while controlling for a stratification variable; quantify the common odds ratio across strata.
Approach: Construct per-stratum 2x2 tables, compute MH pooled OR and CMH test of H0: common-OR = 1; test homogeneity via Breslow-Day.
from statsmodels.stats.contingency_tables import StratifiedTable
import pandas as pd
tables = []
for stratum in df['site'].unique():
stratum_data = df[df['site'] == stratum]
t = pd.crosstab(stratum_data['treatment'], stratum_data['outcome']).values
if t.shape == (2, 2) and t.min() > 0:
tables.append(t)
st = StratifiedTable(tables)
print(st.test_null_odds()) # CMH H0: common OR = 1
print(st.oddsratio_pooled) # MH pooled OR
print(st.oddsratio_pooled_confint(method='normal'))
print(st.test_equal_odds()) # Breslow-Day H0: equal stratum ORs
CMH -- Simpson's paradox masking
Breslow-Day -- low-power false reassurance
CMH -- ignoring randomisation stratification factors
Goal: Test the null of marginal homogeneity (P(positive at time 1) = P(positive at time 2)) for paired binary observations.
Approach: Default to asymptotic McNemar without continuity correction when discordant pairs >=25; switch to mid-p when discordant pairs <25; reserve exact conditional only when regulator-mandated.
from statsmodels.stats.contingency_tables import mcnemar
import numpy as np
# table[i,j] = count with outcome i at time 1 and j at time 2
table = np.array([[45, 15], [5, 35]]) # b=15, c=5 discordant
# Asymptotic, no continuity correction -- the Fagerland 2013 recommended default
result = mcnemar(table, exact=False, correction=False)
print(result.statistic, result.pvalue)
# Exact conditional (Liddell 1983) -- only when b+c is very small or required by SAP
result_exact = mcnemar(table, exact=True)
Fagerland-Lydersen-Laake 2013 BMC Med Res Methodol 13:91 simulation findings: mid-p McNemar and asymptotic McNemar (no continuity correction) outperform exact conditional McNemar across small-to-moderate samples. The exact conditional is too conservative because it conditions on a discrete margin (the discordant pair count). Their title is the methodological provocation -- "The McNemar test: asymptotic and mid-p are better than exact conditional."
Suissa-Shuster 1991 Biometrics 47:361 exact unconditional uses all N pairs (not just discordant) -- uniformly more powerful than exact conditional McNemar; sample sizes 20-40% smaller for the same power. Available in R exact2x2::mcnemarExactDP. Practically essential for matched-pair non-inferiority in diagnostic device trials.
Goal: Quantify association strength beyond p-values.
Approach: Phi for 2x2, Cramer's V for RxC; bias-corrected variants in pingouin.
import numpy as np
import pingouin as pg
n = table.values.sum()
phi = np.sqrt(chi2 / n)
k = min(table.shape) - 1
cramers_v = np.sqrt(chi2 / (n * k))
# Pingouin with multiple test variants and bias correction:
expected, observed, stats = pg.chi2_independence(df, x='treatment', y='outcome')
# stats columns: test, lambda, chi2, dof, pval, cramer, power
Cohen 1988 effect-size benchmarks:
| df | Small | Medium | Large | |----|-------|--------|-------| | 1 | 0.10 | 0.30 | 0.50 | | 2 | 0.07 | 0.21 | 0.35 | | 3 | 0.06 | 0.17 | 0.29 |
Phi equals Cramer's V for 2x2 (k=1). For RxC, only Cramer's V is valid because phi can exceed 1.
| Pattern | Likely cause | Action | |---------|--------------|--------| | Fisher exact p > 0.05 but Boschloo p < 0.05 | Fisher over-conservative via double-margin conditioning (Mehta-Senchaudhuri 2003); Boschloo recovers power | Cite Boschloo as primary; provide Fisher in appendix for transparency | | Pearson chi-square p < 0.05 but Fisher exact p > 0.05 | Asymptotic approximation breaking down at small expected counts (Cochran rule violation) | Use Boschloo (exact unconditional, more powerful than Fisher); document expected-count diagnostic in SAP | | CMH pooled OR not significant, stratum-specific ORs strongly differ | Simpson's paradox -- opposite-sign cancellation OR effect modification | Report stratum-specific ORs as primary; switch to logistic regression with treatment-by-stratum interaction; forest plot stratum ORs | | Breslow-Day non-significant but stratum-OR forest plot visually heterogeneous | Low power of Breslow-Day with few/sparse strata | Cite low-power caveat; report LR interaction test from logistic as secondary; do NOT claim homogeneity | | McNemar exact conditional p > 0.05 but mid-p McNemar p < 0.05 | Exact conditional over-conservative due to discrete-margin conditioning | Cite Fagerland-Lydersen-Laake 2013; mid-p or asymptotic recommended; exact conditional only when SAP-mandated | | Suissa-Shuster unconditional p < exact conditional McNemar p | Unconditional uses all N pairs; conditional discards concordant pairs | Suissa-Shuster preferred for matched-pair NI (esp. diagnostic devices) due to 20-40% smaller n | | Wald CI excludes null but Wilson/Newcombe CI overlaps null | Wald has poor coverage near 0 and 1 (Brown-Cai-DasGupta 2001) | Wilson/Newcombe/MN preferred; cite as regulatory standard | | Multiple categorical secondary endpoints, correlation structure unclear | Bonferroni overly conservative; Hochberg requires PRDS (Sarkar 1998) | See clinical-biostatistics/multiplicity-graphical for Bretz-Maurer graphical procedures and PRDS check |
For a single proportion, Wald is bad for small samples and extreme p (Brown-Cai-DasGupta 2001 Stat Sci 16:101 documents "chaotic" coverage with coverage dropping to 0.0 in extreme cells). Use Wilson score or Jeffreys.
For a 2x2 risk difference or risk ratio, the regulatory standard for CI is Miettinen-Nurminen score-based (1985 Stat Med 4:213) -- consistent with the Pearson chi-square test and accepted by FDA/EMA for noninferiority margins.
from statsmodels.stats.proportion import proportion_confint, proportions_ztest
# Single proportion: Wilson is the modern default
ci = proportion_confint(45, 60, alpha=0.05, method='wilson')
# Also available: 'jeffreys', 'agresti_coull', 'beta' (Clopper-Pearson exact)
# Difference of proportions: Newcombe-Wilson hybrid / MOVER
from statsmodels.stats.proportion import confint_proportions_2indep
ci_diff = confint_proportions_2indep(45, 60, 30, 60, method='newcomb', alpha=0.05)
# 'wald' is discouraged; 'newcomb' (Newcombe-Wilson hybrid) and 'agresti-caffo' are calibrated
For Miettinen-Nurminen CIs (the regulatory standard for stratified RD or RR), use R ratesci::scoreci(contrast='RD'|'RR', distrib='bin', stratified=TRUE) -- there is no production-grade Python implementation as of 2026.
from statsmodels.stats.multitest import multipletests
from itertools import combinations
categories = df['outcome'].unique()
pvalues, comparisons = [], []
for cat1, cat2 in combinations(categories, 2):
subset = df[df['outcome'].isin([cat1, cat2])]
sub_table = pd.crosstab(subset['treatment'], subset['outcome'])
_, p_val, _, _ = chi2_contingency(sub_table, correction=False)
pvalues.append(p_val)
comparisons.append(f'{cat1} vs {cat2}')
reject, adjusted_p, _, _ = multipletests(pvalues, method='holm')
method='holm' (FWER) for confirmatory; method='fdr_bh' for exploratory. Critical bug: multipletests default is method='hs' (Holm-Sidak), NOT Holm or Bonferroni -- always specify explicitly. The FDA Multiple Endpoints Final Guidance (October 2022) requires FWER control for key secondary endpoints in regulatory submissions; FDR is acceptable for exploratory subgroup screens only.
| Threshold | Source | Rationale | |-----------|--------|-----------| | n >= 40 for 2x2 chi-square | Cochran 1954 Biometrics 10:417 | Below this, asymptotic chi-square distribution approximation degrades regardless of expected counts | | All expected >=5 OR <=20% with expected <5 AND none <1 | Cochran 1954 (strict) | Textbook "all >=5" is overconservative; the strict rule expands chi-square's valid range | | Yates' correction discouraged | D'Agostino, Casagrande, Pike 1988 Stat Med 7:347 | Overly conservative; correct fix for sparse 2x2 is Boschloo's exact, not continuity-corrected chi-square | | Discordant pairs >=25 for asymptotic McNemar | Fagerland-Lydersen-Laake 2013 BMC MRM 13:91 | Below this, chi-square approximation breaks; switch to mid-p, not exact conditional | | Newcombe-Wilson / Miettinen-Nurminen for RD CI | Newcombe 1998a Stat Med 17:873 | Wald CI for RD has poor coverage and can produce limits outside [-1, 1] | | Boschloo > Fisher for small 2x2 | Mehta-Senchaudhuri 2003; Lydersen-Fagerland-Laake 2009 Stat Med 28:1159 | Boschloo uniformly more powerful at same Type-I |
| Error / symptom | Cause | Solution |
|-----------------|-------|----------|
| multipletests(p) returns Holm-Sidak adjusted p | Default method is 'hs' not Holm/Bonferroni | Always specify method='holm' or 'bonferroni' explicitly |
| Table2x2(crosstab.values) gives reciprocal OR | pd.crosstab orders columns alphabetically; statsmodels expects event-positive column first | Reorder: cross[[1, 0]] or cross[['Yes', 'No']] |
| Fisher's exact in published paper, Boschloo missing | SAP inertia; reviewers unfamiliar with Boschloo | Cite Mehta-Senchaudhuri 2003 in the SAP; use Boschloo as primary with Fisher in appendix |
| fisher_exact returns "wrong" OR vs R | Since scipy 1.10, scipy returns sample (unconditional) OR; R returns conditional MLE | Use scipy.stats.contingency.odds_ratio(table, kind='conditional') to match R |
| CMH significant but stratum ORs reverse direction | Simpson's paradox; Breslow-Day underpowered | Forest plot stratum ORs; report stratum-specific as primary; switch to logistic with interaction |
| Yates' correction enabled by default in chi2_contingency | scipy default is correction=True for 2x2 | Always pass correction=False for Pearson chi-square |
| McNemar p-value much larger than expected | Default may be exact conditional in some packages; over-conservative | Use asymptotic without continuity correction (Fagerland 2013) |
| Stratified randomisation ignored in primary analysis | Common SAP error | Include strata in CMH, logistic, or stratified log-rank; ignoring inflates Type-I (Kahan-Morris 2012) |
| Pushback | Response | |----------|----------| | "Why not Fisher's exact?" | Cite Lydersen-Fagerland-Laake 2009; Boschloo is uniformly more powerful at same alpha. Provide Fisher p in appendix for direct comparison. | | "Why no continuity correction?" | D'Agostino-Casagrande-Pike 1988 -- Yates' inflates Type-II by ~10%. The correct fix for sparse 2x2 is Boschloo's exact, not Yates'. | | "Are these ORs collapsible?" | OR is non-collapsible (see clinical-biostatistics/effect-measures); marginal and conditional ORs differ even without confounding. Cite Permutt 2020. | | "Why mid-p McNemar over exact conditional?" | Fagerland-Lydersen-Laake 2013 simulations show exact conditional is over-conservative; mid-p and asymptotic maintain nominal Type-I with better power. | | "Adjustment for stratification factors?" | Per ICH E9 and FDA 2023 covariate adjustment guidance, strata from randomisation must appear in analysis. CMH or logistic with strata as covariates. | | "What is the estimand?" | Per ICH E9(R1), categorical-test analyses target a specific estimand (treatment policy is implicit if all randomised analysed). Articulate explicitly. |
tools
--- name: bio-phasing-imputation-foundations description: Frames the phasing/imputation pipeline before any tool runs: phasing and imputation are one Li-Stephens copying HMM (recombination is the transition, mutation the emission, the genetic map and Ne set the rates), imputation's honest output is a dosage with a self-estimated quality (INFO/R2/DR2) not a hard genotype, and the stages are ordered and each fails silently (QC, align build and strand to the panel, phase, impute per chromosome, fil
tools
Chooses the enrichment generation before any tool runs, mapping the input shape to a method class - a pre-selected gene list plus a background to over-representation analysis (ORA, hypergeometric), a ranked statistic for all genes to gene set enrichment (GSEA), a signed signaling topology to pathway-topology (SPIA) - then making the null explicit (competitive vs self-contained, gene vs subject sampling) and running a trustworthiness checklist (testable-gene universe, FDR, redundancy collapse, leading-edge check, version reporting). Covers why every clusterProfiler GSEA is the inter-gene-correlation-uncorrected competitive null, why the background not the gene list decides ORA significance, and why no method is universally best. Use when deciding ORA vs GSEA vs topology, which gene-set DB, whether a result is trustworthy, or which null a tool computes. For ORA see go-enrichment, GSEA see gsea, databases kegg-pathways/reactome-pathways/wikipathways; the ranking comes from differential-expression/de-results.
testing
End-to-end GWAS workflow from VCF to association results. Covers PLINK QC, population structure correction, and association testing for case-control or quantitative traits. Use when running genome-wide association studies.
development
Orchestrates the full path from differential expression results to redundancy-collapsed functional enrichment: choose ORA vs GSEA, convert gene IDs per method, run enrichGO/enrichKEGG/enrichPathway/enrichWP or gseGO/gseKEGG (clusterProfiler, ReactomePA, rWikiPathways), and visualize. Routes the ORA-vs-GSEA generation fork and the null/universe/reproducibility theory to pathway-analysis/enrichment-foundations. Use when a DESeq2/edgeR/limma result must become enriched GO terms, KEGG/Reactome/WikiPathways pathways, or a GSEA leading edge; when deciding whether a ranking exists for all genes (GSEA, named decreasing vector) or only a pre-selected list (ORA plus a defensible background universe); or when assembling DE-to-pathway end to end. The DE list and ranking statistic come from differential-expression/de-results; per-method nuance lives in the pathway-analysis skills.