Version Compatibility

Reference examples tested with: statsmodels 0.14+, scipy 1.12+, numpy 1.26+, pandas 2.1+, firthmodels 0.3+, marginaleffects 0.0.13+ (Python) / 0.20+ (R). R packages cited: RobinCar, marginaleffects, brant, MASS, VGAM.

Before using code patterns, verify installed versions match. If versions differ:

Python: pip show <package> then help(module.function) to check signatures
R: packageVersion('<pkg>') then ?function_name

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.

Logistic Regression for Clinical Outcomes

"Model clinical outcomes with logistic regression" -> Estimate the marginal or conditional treatment effect on a binary or ordinal endpoint using a model that respects randomisation stratification, declares its estimand, and survives covariate misspecification.

Conditional vs marginal (the non-collapsibility subtlety): the OR is non-collapsible. The conditional OR from logistic regression is a different parameter than the marginal OR, even when there is NO confounding and randomisation is perfect. This is mathematical, not statistical bias. FDA 2023 favours marginal RD (via g-computation) for primary reporting to avoid parameter ambiguity. See clinical-biostatistics/effect-measures and Permutt 2020.

Algorithmic Taxonomy

| Approach | Estimand | Inference | Strength | Fails when | |----------|----------|-----------|----------|------------| | Unadjusted logistic / chi-square | Marginal OR | Wald or LR | Simple; transparent | Loses efficiency vs adjusted (Senn 2013); inflates SE under stratified randomisation (Kahan-Morris 2012) | | Logistic with covariates (ML, Wald CI) | Conditional log-OR | Wald | Standard; widely available | Conditional OR != marginal OR due to non-collapsibility (Permutt 2020); not the FDA 2023 primary estimand | | Logistic + g-computation / standardisation | Marginal RD/RR/OR | Influence function or bootstrap SE | FDA 2023 recommended primary estimand for binary | Requires correct outcome model AND post-fit standardisation; needs robust SE machinery | | Targeted Maximum Likelihood (TMLE) | Marginal RD/RR/OR | Influence function | Provably efficient; doubly robust in observational | Implementation heavier; mostly R (tmle, tmle3); rare in confirmatory submissions | | Modified Poisson with sandwich SE | Marginal RR | HC1/HC3 sandwich | Direct RR estimation when prevalence >10% | Slightly less efficient than log-binomial when log-binomial converges | | Log-binomial regression | Marginal RR | Wald | Direct RR estimation | Frequent convergence failure when predicted risk near 1 | | Firth penalised logistic | Conditional OR (penalised) | Penalised LR test preferred | Handles separation, rare events (<5% prevalence) | Wald CI/p liberal; must use PLR test (Heinze-Schemper 2002) | | Ordinal logistic (proportional odds) | Common conditional OR across cut-points | Wald or LR | Preserves ordering information | Proportional odds assumption violation (Brant test) | | Partial proportional odds | PO holds for some covariates, not others | Hybrid | Salvages ordinal model when PO fails for one predictor | Increased complexity; harder interpretation | | Multinomial logistic | Per-category log-OR | Wald | No PO assumption needed | Loses efficiency; harder communication |

Postdoc reading list: Permutt 2020 Stat Biopharm Res 12:45 (conditional vs marginal estimand); FDA May 2023 Final Guidance "Adjusting for Covariates in RCTs" (the regulatory rulebook); Tsiatis et al 2008 Stat Med 27:4658 (robust ANCOVA framework); Senn 2013 Stat Med 32:1439 (precision from adjustment even under perfect balance); Kahan-Morris 2012 Stat Med 31:328 (must adjust for stratification factors).

Decision Tree by Scenario

| Scenario | Recommended approach | Why | |----------|---------------------|-----| | RCT, binary primary endpoint, no covariate adjustment in SAP | Unadjusted logistic with marginal OR; chi-square supportive | Simple regulatory case; consider whether covariate adjustment would gain power (Senn 2013) | | RCT, binary primary, covariates pre-specified, FDA 2023 compliant | Logistic adjusted + g-computation for marginal RD; conditional OR supportive | FDA 2023 final: marginal estimand for primary; mention HC3 sandwich SE | | RCT, stratified randomisation (sex/region/severity) | Include strata as covariates in logistic; the analytic decision is non-optional | Kahan-Morris 2012: ignoring strata is over-conservative -- SE biased upward, CIs too wide, power loss | | Common outcome (prevalence >10%) where RR is the policy quantity | Modified Poisson + HC1/HC3 OR log-binomial regression | Avoid OR misinterpretation; Zou 2004; cite EMA 2015 on covariate adjustment | | Rare event (<5% prevalence) or separation observed | Firth penalty + penalised LR test for p-value | Heinze-Schemper 2002; Wald p from Firth is liberal | | Ordinal outcome, PO assumption supportable | Proportional odds; verify with Brant test (R brant::brant) | Most efficient when PO holds; common in toxicity grading and PROs | | Ordinal outcome, PO fails on one or two predictors | Partial PO model (R VGAM::vglm(..., cumulative(parallel=FALSE~X))) | Salvages most efficiency; only the offending coefficient gets per-cut-point estimate | | Ordinal outcome, PO fails widely | Multinomial logistic | Worst case; loses ordering info but valid | | Single arm observational with strong confounding | Logistic adjusted + g-computation + propensity weighting | Doubly robust; cite Hernan-Robins; consider TMLE | | Binary endpoint, longitudinal repeated measures | GEE or generalised linear mixed model | Sandwich SE for GEE; mixed-model OR is subject-specific not marginal |

Standard Workflow

Goal: Fit a covariate-adjusted logistic regression for a binary clinical endpoint with explicit reference category and pre-specified covariates.

Approach: Use the formula API for automatic categorical handling and explicit reference; report both the conditional log-OR and the marginal RD via g-computation.

import statsmodels.formula.api as smf
import pandas as pd
import numpy as np

# CRITICAL: set reference category explicitly. statsmodels default is alphabetical,
# so 'Active' sorts before 'Placebo' and the OR direction silently flips.
model = smf.logit(
    'outcome ~ C(ARM, Treatment(reference="Placebo")) + age + C(sex) + baseline_score',
    data=df
).fit()

# Conditional ORs (Wald)
or_table = pd.DataFrame({
    'OR': np.exp(model.params),
    'Lower_CI': np.exp(model.conf_int()[0]),
    'Upper_CI': np.exp(model.conf_int()[1]),
    'p_value': model.pvalues
})

# Marginal RD via g-computation (the FDA 2023-recommended primary estimand):
df_active = df.assign(ARM='Active')
df_placebo = df.assign(ARM='Placebo')
risk_active = model.predict(df_active).mean()
risk_placebo = model.predict(df_placebo).mean()
marginal_rd = risk_active - risk_placebo
# For SE, use influence-function or bootstrap; see RobinCar / marginaleffects in R

The reference-category trap is the single most common silent bug. smf.logit('y ~ C(ARM)') uses alphabetical ordering, so ARM='Active' becomes 1 and ARM='Placebo' becomes 0 -- but with ARM=['Active','Placebo','Control'], 'Active' is the reference and 'Placebo' is the comparison. Always pass Treatment(reference="Placebo") or equivalent. R glm(y ~ relevel(ARM, ref='Placebo')).

Marginal vs Conditional Estimand -- The FDA 2023 Pivot

The single most important methodological shift in clinical biostatistics 2020-2025. Permutt 2020 established and FDA 2023 codified: the maximum-likelihood coefficient on Z in glm(Y ~ Z + X, family=binomial) is the conditional log-OR -- the OR comparing Z=1 to Z=0 holding X fixed. Due to OR non-collapsibility, this is a different parameter than the marginal log-OR (the OR comparing the entire treated population to the entire control population), even under perfect randomisation.

The FDA 2023 final guidance ("Adjusting for Covariates in Randomized Clinical Trials," May 2023) endorses covariate-adjusted nonlinear-model analysis provided the analyst targets a marginal estimand. Reporting the conditional OR from a multivariable logistic regression as "the treatment effect" silently changes the estimand from what was pre-specified.

G-computation / Standardisation

# Marginal effects on three scales via g-computation:
df_z1 = df.assign(ARM='Active')
df_z0 = df.assign(ARM='Placebo')
p_z1 = model.predict(df_z1)
p_z0 = model.predict(df_z0)

marg_p1 = p_z1.mean()
marg_p0 = p_z0.mean()
marg_rd = marg_p1 - marg_p0
marg_rr = marg_p1 / marg_p0
marg_or = (marg_p1 / (1 - marg_p1)) / (marg_p0 / (1 - marg_p0))

For valid SE: the delta-method/influence-function variance and HC3 sandwich are required. Python marginaleffects v0.0.13+ implements this via marginaleffects.avg_comparisons(model, variables='ARM', vcov='HC3'). R is more mature: marginaleffects::avg_comparisons (Arel-Bundock-Greifer-Heiss 2024 JSS 111:9), RobinCar (purpose-built for FDA 2023), riskCommunicator.

Tsiatis et al 2008 robustness guarantee: under randomisation Z ⊥ X, the g-computation marginal estimator is consistent for the marginal ATE even if the outcome model is misspecified. Efficiency depends on model quality; consistency does not. This is why FDA 2023 accepts the marginal RD via g-computation without requiring proof of correct logistic mean structure.

Reporting template (post-FDA-2023):

"Primary estimand: marginal risk difference of -8.5 percentage points (95% CI -12.7 to -4.2, HC3 SE), computed by g-computation/standardisation from a logistic regression adjusted for age, sex, and baseline severity. Supportive: conditional OR 0.48 (95% CI 0.34-0.67, Wald) from the same model."

Modified Poisson for Common Outcomes -- Direct RR

When prevalence > 10% and the policy quantity is RR (not OR), modified Poisson with sandwich SE is the de facto modern standard (Zou 2004 AJE 159:702):

import statsmodels.api as sm
import numpy as np

X = sm.add_constant(df[['treatment', 'age', 'baseline_score']])
poisson_model = sm.GLM(df['outcome'], X, family=sm.families.Poisson()).fit(cov_type='HC1')
rr = np.exp(poisson_model.params)
rr_ci = np.exp(poisson_model.conf_int())

cov_type='HC1' corrects the over-dispersion that Poisson inherently assumes. Without it the variance is wrong because binary data are NOT Poisson. For n < 250, HC3 is preferred (Long-Ervin 2000). What postdocs argue about: modified Poisson vs log-binomial -- log-binomial directly models RR but frequently fails to converge when predicted risk approaches 1; modified Poisson always converges but is marginally less efficient when log-binomial works.

Proportional Odds and Brant Test

Proportional odds (PO) assumption: the effect of each predictor is constant across all cut-points of the ordinal outcome. Must be tested or the model parameters are invalid.

from statsmodels.miscmodels.ordinal_model import OrderedModel
from statsmodels.api import MNLogit
import pandas as pd

df['severity'] = pd.Categorical(df['severity'], categories=['mild', 'moderate', 'severe'], ordered=True)

po_model = OrderedModel.from_formula('severity ~ treatment + age', data=df, distr='logit').fit(method='bfgs', disp=0)
mn_model = MNLogit.from_formula('severity ~ treatment + age', data=df).fit(disp=0)

# LR test for PO vs MN (omnibus)
lr_stat = 2 * (mn_model.llf - po_model.llf)
lr_df = mn_model.df_model - po_model.df_model
from scipy.stats import chi2
lr_p = 1 - chi2.cdf(lr_stat, lr_df)

# Per-coefficient Brant test in R: brant::brant(po_model_object)
# The Brant test localises which predictor(s) violate PO -- essential before deciding on partial PO

The omnibus LR test is necessary but not sufficient. Brant test (Brant 1990 Biometrics 46:1171; R brant::brant) provides per-coefficient PO tests, identifying which predictor violates PO. If only one or two predictors violate PO, fit a partial proportional odds model (R VGAM::vglm(..., cumulative(parallel=FALSE~X1+X2))) rather than abandoning the model entirely.

OrderedModel intercept gotcha: do NOT add an intercept term. Threshold parameters (cut-points between ordinal levels) replace the intercept. An explicit constant causes non-identifiability and optimiser failure.

Separation and Firth Penalty

Detection:

from firthmodels import FirthLogisticRegression, detect_separation
import numpy as np

sep_result = detect_separation(X, y)
if sep_result.separation:
    print(sep_result.summary())
# Manual signs: coefficient > 10, SE > 100, convergence warnings

Firth penalty (Firth 1993 Biometrika 80:27):

firth = FirthLogisticRegression()
firth.fit(X, y)
or_firth = np.exp(firth.coef_)

# IMPORTANT: Wald p-values from Firth are LIBERAL (anti-conservative).
# Prefer the penalised likelihood-ratio test (PLRT).
# Some Python `firthlogist` releases expose a PLRT attribute (e.g. `pvalues_lrt_`)
# but its presence and name vary by release -- check `dir(firth)` against the
# installed version, otherwise compute the PLRT manually from the penalised
# log-likelihoods of nested models.

Heinze-Schemper 2002 Stat Med 21:2409 showed Wald inference from Firth is liberal (anti-conservative); the penalised likelihood-ratio test (PLRT) is the recommended inference. PLRT attributes (e.g. pvalues_lrt_) appear in some Python Firth packages but the exact attribute name varies by release -- introspect the installed package; compute the PLRT manually if no attribute is exposed:

def penalised_lrt(firth_full, firth_reduced):
    # Compute 2 * (penalised log-lik full - penalised log-lik reduced) ~ chi-square_df
    pass  # implementation depends on package version; see Heinze-Schemper 2002

Firth's method was originally designed for finite-sample bias reduction, not separation per se -- it adds the Jeffreys prior penalty to the likelihood, keeping coefficients finite under separation as a side effect. Also recommended for rare events (<5% prevalence) where ML bias is non-negligible.

Hauck-Donner Effect Detection

The Hauck-Donner effect (1977 JASA 72:851; revived by Yee 2022 JASA 117:1763): the Wald test statistic is non-monotonic in the parameter estimate near the boundary. A large log-OR can produce a small Wald chi-square -- so the Wald test fails to reject when LR/profile-likelihood would. Common in small samples with strong predictors.

# In R, detect with VGAM::hdeff() and replace Wald with profile likelihood:
# MASS::confint.glm(model) returns profile-likelihood CIs as default
# Python equivalent: bootstrap or manual profile likelihood

When to suspect Hauck-Donner: large coefficient magnitude with non-significant Wald p; large SE with finite estimate; switching from Wald to LR test changes significance. The fix is always: switch to profile likelihood or LR inference.

Reconciliation: When Methods Disagree

| Pattern | Likely cause | Action | |---------|--------------|--------| | Conditional OR from logistic vs marginal RD from g-computation give different conclusions | Non-collapsibility (OR is non-collapsible; RD is collapsible) | Report marginal RD as primary per FDA 2023; conditional OR as supportive with explicit parameter label; cite Permutt 2020 | | ML logistic diverges (large coefficient, huge SE); Firth penalised converges | Complete or quasi-complete separation in covariate-outcome relationship | Firth penalty with penalised LR test (NOT Wald p-values); cite Heinze-Schemper 2002 | | Modified Poisson and log-binomial give different RR estimates | Log-binomial convergence fragility when predicted risk approaches 1 | Modified Poisson with sandwich SE preferred for robust convergence; cite Zou 2004 | | Proportional-odds (cumulative logit) and multinomial give different inferences | PO assumption violated for one or more predictors (Brant test rejects) | Localise via Brant test; if 1-2 predictors violate, partial-PO model in VGAM; if widely, multinomial | | Adjusted vs unadjusted treatment effect differ substantially | Confounding (in observational) OR non-collapsibility (in RCT) | In RCT, non-collapsibility expected (Permutt 2020); in observational, investigate confounding via DAG | | Large OR with non-significant Wald p-value | Hauck-Donner effect: Wald non-monotone near boundary (Yee 2022) | Use profile-likelihood CI (R MASS::confint.glm); detect via VGAM::hdeff() | | Stratified analysis significant; unstratified not | Achieved SE smaller in stratified analysis | Include stratification factors in primary model (Kahan-Morris 2012); ignoring biases the SE upward and loses power (over-conservative) | | Significant treatment effect on conditional model vanishes when mediator included | Adjusting for post-treatment variable on causal pathway | Never adjust for mediators in primary; use causal DAG to distinguish confounder vs mediator |

Per-Method Failure Modes

Reference-category silent reversal

Trigger: Treatment variable coded as character with Active and Placebo; reference not explicitly set.
Mechanism: statsmodels defaults to alphabetical ordering; 'Active' becomes the reference.
Symptom: OR for "ARM[T.Placebo]" appears in output; direction is the opposite of intended.
Fix: Always pass C(ARM, Treatment(reference="Placebo")) or numeric encoding with explicit comment.

Adjusting for a mediator

Trigger: Including a post-randomisation variable that lies on the causal pathway.
Mechanism: Adjustment blocks the causal effect, attenuating treatment effect estimate toward null.
Symptom: Significant unadjusted effect becomes non-significant after adjustment for "mechanism" variable (e.g., inflammation marker).
Fix: Use causal DAG to distinguish confounder from mediator; never adjust for post-treatment variables in the primary analysis. See ICH E9(R1).

Hauck-Donner non-monotonicity

Trigger: Small samples, strong predictor, cell counts approaching zero.
Mechanism: Wald test statistic is non-monotone in the parameter near the boundary.
Symptom: Large OR magnitude with non-significant Wald p; switching to LR test changes significance.
Fix: Use profile likelihood (R MASS::confint.glm) or LR inference; detect with VGAM::hdeff().

Complete or quasi-complete separation

Trigger: A predictor perfectly predicts outcome in a subset of the data.
Mechanism: Likelihood is monotone in that coefficient; ML estimate diverges to infinity.
Symptom: Coefficient >10, SE >100, convergence warning; convergence message says "fitted probabilities numerically 0 or 1."
Fix: Firth penalty (firthmodels); inference via penalised LR test, not Wald.

Proportional odds violation undetected

Trigger: Ordinal outcome fit with cumulative-logit (PO) model without testing assumption.
Mechanism: PO assumes a constant effect across cut-points; violation invalidates model parameters.
Symptom: Brant test rejects PO for one or more predictors; per-cut-point effects in a saturated model diverge.
Fix: Brant test first; if PO fails on one predictor, partial PO model; if widely fails, multinomial logistic.

Stratified randomisation not in analysis

Trigger: Stratification by site/region/severity at randomisation; analysis ignores strata.
Mechanism: Achieved SE smaller than calculated SE because randomisation removed between-stratum variability.
Symptom: Over-conservative inference -- SE biased upward, CIs too wide, Type-I error below nominal, and power loss (Kahan-Morris 2012).
Fix: Include stratification factors as covariates in the logistic; this is non-optional per ICH E9, FDA 2023, EMA 2015.

Model Diagnostics

| Diagnostic | Method | Threshold | Caveat | |-----------|--------|-----------|--------| | Discrimination | ROC-AUC (sklearn.metrics.roc_auc_score) | >0.7 acceptable, >0.8 good | Trial-level, not patient-individual; depends on outcome prevalence | | Calibration -- primary | Calibration plot (observed vs predicted in deciles) | Curve along the diagonal | Visual; preferred over H-L | | Calibration -- secondary | Hosmer-Lemeshow chi-square | p > 0.05 | Low power n<200; oversensitive n>2000 | | Pseudo R-squared | model.prsquared (McFadden) | >0.2 excellent | NOT comparable to OLS R-squared | | Events per variable | n_events / n_covariates | >=10 EPV (Peduzzi 1996) | Below this: bias, overfitting; consider Firth |

Hosmer-Lemeshow

from scipy.stats import chi2
import pandas as pd

def hosmer_lemeshow(y_true, y_pred, n_groups=10):
    df_hl = pd.DataFrame({'y': y_true, 'prob': y_pred})
    df_hl['group'] = pd.qcut(df_hl['prob'], n_groups, duplicates='drop')
    grouped = df_hl.groupby('group').agg(obs=('y', 'sum'), n=('y', 'count'), pred=('prob', 'mean'))
    grouped['expected'] = grouped['n'] * grouped['pred']
    hl_stat = (((grouped['obs'] - grouped['expected']) ** 2) /
               (grouped['n'] * grouped['pred'] * (1 - grouped['pred']))).sum()
    actual_groups = len(grouped)
    return hl_stat, 1 - chi2.cdf(hl_stat, actual_groups - 2)

H-L is supplementary, not primary: Hosmer-Lemeshow has low power n<200 (rarely rejects even for poor calibration) and oversensitive n>2000 (rejects for trivial miscalibration). The decile choice is also arbitrary -- pd.qcut(..., duplicates='drop') can reduce the actual number of groups when probabilities tie, changing df. Use calibration plots (smoothed observed vs predicted) as primary; Hosmer-Lemeshow as supplementary p-value.

Quantitative Thresholds

| Threshold | Source | Rationale | |-----------|--------|-----------| | 10 events per variable (EPV) | Peduzzi et al 1996 J Clin Epidemiol 49:1373 | Below this, coefficient bias and CI mis-coverage | | Prevalence > 10% -> prefer marginal RR via modified Poisson | Zou 2004 AJE 159:702 | OR overstates RR; modified Poisson directly estimates RR | | Marginal estimand for primary regulatory analysis | FDA 2023 Final Guidance | Conditional OR is a different parameter than marginal (Permutt 2020) | | Brant test for PO assumption | Brant 1990 Biometrics 46:1171 | Omnibus LR test misses per-coefficient violations | | Firth penalty with penalised LR test, not Wald | Heinze-Schemper 2002 Stat Med 21:2409 | Wald is liberal under Firth penalty | | HC3 sandwich SE for n <=250 | Long-Ervin 2000 Am Stat 54:217 | HC1 (Stata default) anti-conservative in small samples | | Stratification factors in analysis when used in randomisation | Kahan-Morris 2012 Stat Med 31:328 | Ignoring biases SE upward -- CIs too wide, power loss (over-conservative) |

Common Errors

| Error / symptom | Cause | Solution | |-----------------|-------|----------| | OR direction opposite of expected | Alphabetical reference; 'Active' sorts before 'Placebo' | C(ARM, Treatment(reference="Placebo")) always | | Conditional and marginal OR differ substantially | Non-collapsibility, not confounding | Cite Permutt 2020; report both with explicit labels | | Treatment effect vanishes after adjustment | Adjusted for a mediator (post-treatment variable) | Causal DAG check; never adjust for post-treatment in primary | | Huge coefficient, huge SE, non-significant Wald | Separation OR Hauck-Donner | Detect separation: firthmodels.detect_separation; switch to Firth + PLRT | | H-L p > 0.5 with obvious miscalibration | Low power at n<200 | Use calibration plot as primary; H-L supplementary | | H-L p < 0.001 with great-looking plot | Oversensitive at n>2000 | Use calibration plot; cite Steyerberg 2019 for cautions | | OrderedModel fails to converge with intercept | Threshold parameters replace intercept | Drop the explicit constant | | firth.pvalues_ looks too low | Wald p from Firth is liberal | Inspect the installed Firth package for a PLRT attribute (e.g. pvalues_lrt_); compute PLRT manually if none is exposed | | GLM(family=Binomial) and Logit give same point but different output | sm.Logit provides pred_table(), prsquared; sm.GLM does not | Use Logit unless changing link functions | | Robust SE not reported for modified Poisson | Missing cov_type='HC1' or 'HC3' | Always specify sandwich SE for modified Poisson |

Anticipated Reviewer Pushback

| Pushback | Response | |----------|----------| | "Why is the marginal RD different from the conditional OR's implied RD?" | Non-collapsibility of OR; cite Permutt 2020 and FDA 2023. Marginal RD via g-computation is the primary per FDA 2023. | | "Adjustment for stratification factors?" | Pre-specified strata included in the model. Cite Kahan-Morris 2012 and ICH E9. | | "How were covariates chosen?" | Pre-specified in the SAP based on prior literature/clinical knowledge. No data-driven selection (which would inflate Type-I). | | "Why not use stepwise selection?" | Stepwise leaks signal from the outcome and inflates Type-I. Pre-specification is the regulatory norm; cite FDA 2023. | | "PH check for the longitudinal binary?" | Binary doesn't have PH; if longitudinal, used GEE with sandwich SE or GLMM; see ICH E9 estimand for treatment policy. | | "Why Firth not standard logistic?" | Separation detected (or rare events <5%); standard ML diverges; Firth penalty + PLRT recommended (Heinze-Schemper 2002). | | "Brant test result?" | PO holds for predictors X1, X2, X3; PO fails for X4 -> partial PO model fit per VGAM. | | "Calibration?" | Calibration plot in supplement; H-L p reported as supplementary, not primary. |

References

Arel-Bundock V, Greifer N, Heiss A. 2024. How to interpret statistical models using marginaleffects in R and Python. J Stat Softw 111:9.
Brant R. 1990. Assessing proportionality in the proportional odds model for ordinal logistic regression. Biometrics 46:1171-1178.
FDA. 2023. Adjusting for Covariates in Randomized Clinical Trials for Drugs and Biological Products. Final Guidance, May 2023.
Firth D. 1993. Bias reduction of maximum likelihood estimates. Biometrika 80:27-38.
Hauck WW, Donner A. 1977. Wald's test as applied to hypotheses in logit analysis. JASA 72:851-853.
Heinze G, Schemper M. 2002. A solution to the problem of separation in logistic regression. Stat Med 21:2409-2419.
Kahan BC, Morris TP. 2012. Improper analysis of trials randomised using stratified blocks or minimisation. Stat Med 31:328-340.
Lin DY, Wei LJ. 1989. The robust inference for the Cox proportional hazards model. JASA 84:1074-1078.
Long JS, Ervin LH. 2000. Using heteroscedasticity consistent standard errors in the linear regression model. Am Stat 54:217-224.
Moore KL, van der Laan MJ. 2009. Covariate adjustment in randomized trials with binary outcomes: TMLE. Stat Med 28:39-64.
Peduzzi P, Concato J, Kemper E, Holford TR, Feinstein AR. 1996. A simulation study of the number of events per variable in logistic regression analysis. J Clin Epidemiol 49:1373-1379.
Permutt T. 2020. Do covariates change the estimand? Stat Biopharm Res 12:45-53.
Senn S. 2013. Seven myths of randomisation in clinical trials. Stat Med 32:1439-1450.
Steyerberg EW. 2019. Clinical Prediction Models (2nd ed). Springer.
Tsiatis AA, Davidian M, Zhang M, Lu X. 2008. Covariate adjustment for two-sample treatment comparisons in randomized clinical trials. Stat Med 27:4658-4677.
Wang B, Susukida R, Mojtabai R, Amin-Esmaeili M, Rosenblum M. 2023. Model-robust inference for clinical trials that improve precision by stratified randomization and covariate adjustment. JASA 118:1152-1163.
Yee TW. 2022. On the Hauck-Donner effect in Wald tests. JASA 117:1763-1774.
Zou G. 2004. A modified Poisson regression approach to prospective studies with binary data. AJE 159:702-706.

Related Skills

clinical-biostatistics/cdisc-data-handling - Prepare analysis datasets from CDISC SDTM/ADaM domains
clinical-biostatistics/effect-measures - Modern CI methods; marginal vs conditional in depth
clinical-biostatistics/categorical-tests - Chi-square and Fisher alternatives for unadjusted tests
clinical-biostatistics/subgroup-analysis - Interaction terms and HTE detection
clinical-biostatistics/survival-analysis - Cox regression for time-to-event analogues
clinical-biostatistics/multiplicity-graphical - Bretz-Maurer graphs for multiple endpoints from one logistic model
clinical-biostatistics/trial-reporting - CONSORT 2025 and ICH E9(R1) reporting of logistic analyses

Version Compatibility

Before using code patterns, verify installed versions match. If versions differ:

Python: pip show <package> then help(module.function) to check signatures
R: packageVersion('<pkg>') then ?function_name

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.

Logistic Regression for Clinical Outcomes

Algorithmic Taxonomy

Decision Tree by Scenario

Standard Workflow

Goal: Fit a covariate-adjusted logistic regression for a binary clinical endpoint with explicit reference category and pre-specified covariates.

Approach: Use the formula API for automatic categorical handling and explicit reference; report both the conditional log-OR and the marginal RD via g-computation.

import statsmodels.formula.api as smf
import pandas as pd
import numpy as np

# CRITICAL: set reference category explicitly. statsmodels default is alphabetical,
# so 'Active' sorts before 'Placebo' and the OR direction silently flips.
model = smf.logit(
    'outcome ~ C(ARM, Treatment(reference="Placebo")) + age + C(sex) + baseline_score',
    data=df
).fit()

# Conditional ORs (Wald)
or_table = pd.DataFrame({
    'OR': np.exp(model.params),
    'Lower_CI': np.exp(model.conf_int()[0]),
    'Upper_CI': np.exp(model.conf_int()[1]),
    'p_value': model.pvalues
})

# Marginal RD via g-computation (the FDA 2023-recommended primary estimand):
df_active = df.assign(ARM='Active')
df_placebo = df.assign(ARM='Placebo')
risk_active = model.predict(df_active).mean()
risk_placebo = model.predict(df_placebo).mean()
marginal_rd = risk_active - risk_placebo
# For SE, use influence-function or bootstrap; see RobinCar / marginaleffects in R

Marginal vs Conditional Estimand -- The FDA 2023 Pivot

G-computation / Standardisation

# Marginal effects on three scales via g-computation:
df_z1 = df.assign(ARM='Active')
df_z0 = df.assign(ARM='Placebo')
p_z1 = model.predict(df_z1)
p_z0 = model.predict(df_z0)

marg_p1 = p_z1.mean()
marg_p0 = p_z0.mean()
marg_rd = marg_p1 - marg_p0
marg_rr = marg_p1 / marg_p0
marg_or = (marg_p1 / (1 - marg_p1)) / (marg_p0 / (1 - marg_p0))

Reporting template (post-FDA-2023):

"Primary estimand: marginal risk difference of -8.5 percentage points (95% CI -12.7 to -4.2, HC3 SE), computed by g-computation/standardisation from a logistic regression adjusted for age, sex, and baseline severity. Supportive: conditional OR 0.48 (95% CI 0.34-0.67, Wald) from the same model."

Modified Poisson for Common Outcomes -- Direct RR

When prevalence > 10% and the policy quantity is RR (not OR), modified Poisson with sandwich SE is the de facto modern standard (Zou 2004 AJE 159:702):

import statsmodels.api as sm
import numpy as np

X = sm.add_constant(df[['treatment', 'age', 'baseline_score']])
poisson_model = sm.GLM(df['outcome'], X, family=sm.families.Poisson()).fit(cov_type='HC1')
rr = np.exp(poisson_model.params)
rr_ci = np.exp(poisson_model.conf_int())

Proportional Odds and Brant Test

Proportional odds (PO) assumption: the effect of each predictor is constant across all cut-points of the ordinal outcome. Must be tested or the model parameters are invalid.

from statsmodels.miscmodels.ordinal_model import OrderedModel
from statsmodels.api import MNLogit
import pandas as pd

df['severity'] = pd.Categorical(df['severity'], categories=['mild', 'moderate', 'severe'], ordered=True)

po_model = OrderedModel.from_formula('severity ~ treatment + age', data=df, distr='logit').fit(method='bfgs', disp=0)
mn_model = MNLogit.from_formula('severity ~ treatment + age', data=df).fit(disp=0)

# LR test for PO vs MN (omnibus)
lr_stat = 2 * (mn_model.llf - po_model.llf)
lr_df = mn_model.df_model - po_model.df_model
from scipy.stats import chi2
lr_p = 1 - chi2.cdf(lr_stat, lr_df)

# Per-coefficient Brant test in R: brant::brant(po_model_object)
# The Brant test localises which predictor(s) violate PO -- essential before deciding on partial PO

Separation and Firth Penalty

Detection:

from firthmodels import FirthLogisticRegression, detect_separation
import numpy as np

sep_result = detect_separation(X, y)
if sep_result.separation:
    print(sep_result.summary())
# Manual signs: coefficient > 10, SE > 100, convergence warnings

Firth penalty (Firth 1993 Biometrika 80:27):

firth = FirthLogisticRegression()
firth.fit(X, y)
or_firth = np.exp(firth.coef_)

# IMPORTANT: Wald p-values from Firth are LIBERAL (anti-conservative).
# Prefer the penalised likelihood-ratio test (PLRT).
# Some Python `firthlogist` releases expose a PLRT attribute (e.g. `pvalues_lrt_`)
# but its presence and name vary by release -- check `dir(firth)` against the
# installed version, otherwise compute the PLRT manually from the penalised
# log-likelihoods of nested models.

def penalised_lrt(firth_full, firth_reduced):
    # Compute 2 * (penalised log-lik full - penalised log-lik reduced) ~ chi-square_df
    pass  # implementation depends on package version; see Heinze-Schemper 2002

Hauck-Donner Effect Detection

# In R, detect with VGAM::hdeff() and replace Wald with profile likelihood:
# MASS::confint.glm(model) returns profile-likelihood CIs as default
# Python equivalent: bootstrap or manual profile likelihood

Reconciliation: When Methods Disagree

Per-Method Failure Modes

Reference-category silent reversal

Trigger: Treatment variable coded as character with Active and Placebo; reference not explicitly set.
Mechanism: statsmodels defaults to alphabetical ordering; 'Active' becomes the reference.
Symptom: OR for "ARM[T.Placebo]" appears in output; direction is the opposite of intended.
Fix: Always pass C(ARM, Treatment(reference="Placebo")) or numeric encoding with explicit comment.

Adjusting for a mediator

Trigger: Including a post-randomisation variable that lies on the causal pathway.
Mechanism: Adjustment blocks the causal effect, attenuating treatment effect estimate toward null.
Symptom: Significant unadjusted effect becomes non-significant after adjustment for "mechanism" variable (e.g., inflammation marker).
Fix: Use causal DAG to distinguish confounder from mediator; never adjust for post-treatment variables in the primary analysis. See ICH E9(R1).

Hauck-Donner non-monotonicity

Trigger: Small samples, strong predictor, cell counts approaching zero.
Mechanism: Wald test statistic is non-monotone in the parameter near the boundary.
Symptom: Large OR magnitude with non-significant Wald p; switching to LR test changes significance.
Fix: Use profile likelihood (R MASS::confint.glm) or LR inference; detect with VGAM::hdeff().

Complete or quasi-complete separation

Trigger: A predictor perfectly predicts outcome in a subset of the data.
Mechanism: Likelihood is monotone in that coefficient; ML estimate diverges to infinity.
Symptom: Coefficient >10, SE >100, convergence warning; convergence message says "fitted probabilities numerically 0 or 1."
Fix: Firth penalty (firthmodels); inference via penalised LR test, not Wald.

Proportional odds violation undetected

Trigger: Ordinal outcome fit with cumulative-logit (PO) model without testing assumption.
Mechanism: PO assumes a constant effect across cut-points; violation invalidates model parameters.
Symptom: Brant test rejects PO for one or more predictors; per-cut-point effects in a saturated model diverge.
Fix: Brant test first; if PO fails on one predictor, partial PO model; if widely fails, multinomial logistic.

Stratified randomisation not in analysis

Trigger: Stratification by site/region/severity at randomisation; analysis ignores strata.
Mechanism: Achieved SE smaller than calculated SE because randomisation removed between-stratum variability.
Symptom: Over-conservative inference -- SE biased upward, CIs too wide, Type-I error below nominal, and power loss (Kahan-Morris 2012).
Fix: Include stratification factors as covariates in the logistic; this is non-optional per ICH E9, FDA 2023, EMA 2015.

Model Diagnostics

Hosmer-Lemeshow

from scipy.stats import chi2
import pandas as pd

def hosmer_lemeshow(y_true, y_pred, n_groups=10):
    df_hl = pd.DataFrame({'y': y_true, 'prob': y_pred})
    df_hl['group'] = pd.qcut(df_hl['prob'], n_groups, duplicates='drop')
    grouped = df_hl.groupby('group').agg(obs=('y', 'sum'), n=('y', 'count'), pred=('prob', 'mean'))
    grouped['expected'] = grouped['n'] * grouped['pred']
    hl_stat = (((grouped['obs'] - grouped['expected']) ** 2) /
               (grouped['n'] * grouped['pred'] * (1 - grouped['pred']))).sum()
    actual_groups = len(grouped)
    return hl_stat, 1 - chi2.cdf(hl_stat, actual_groups - 2)

Quantitative Thresholds

Common Errors

Anticipated Reviewer Pushback

References

Arel-Bundock V, Greifer N, Heiss A. 2024. How to interpret statistical models using marginaleffects in R and Python. J Stat Softw 111:9.
Brant R. 1990. Assessing proportionality in the proportional odds model for ordinal logistic regression. Biometrics 46:1171-1178.
FDA. 2023. Adjusting for Covariates in Randomized Clinical Trials for Drugs and Biological Products. Final Guidance, May 2023.
Firth D. 1993. Bias reduction of maximum likelihood estimates. Biometrika 80:27-38.
Hauck WW, Donner A. 1977. Wald's test as applied to hypotheses in logit analysis. JASA 72:851-853.
Heinze G, Schemper M. 2002. A solution to the problem of separation in logistic regression. Stat Med 21:2409-2419.
Kahan BC, Morris TP. 2012. Improper analysis of trials randomised using stratified blocks or minimisation. Stat Med 31:328-340.
Lin DY, Wei LJ. 1989. The robust inference for the Cox proportional hazards model. JASA 84:1074-1078.
Long JS, Ervin LH. 2000. Using heteroscedasticity consistent standard errors in the linear regression model. Am Stat 54:217-224.
Moore KL, van der Laan MJ. 2009. Covariate adjustment in randomized trials with binary outcomes: TMLE. Stat Med 28:39-64.
Peduzzi P, Concato J, Kemper E, Holford TR, Feinstein AR. 1996. A simulation study of the number of events per variable in logistic regression analysis. J Clin Epidemiol 49:1373-1379.
Permutt T. 2020. Do covariates change the estimand? Stat Biopharm Res 12:45-53.
Senn S. 2013. Seven myths of randomisation in clinical trials. Stat Med 32:1439-1450.
Steyerberg EW. 2019. Clinical Prediction Models (2nd ed). Springer.
Tsiatis AA, Davidian M, Zhang M, Lu X. 2008. Covariate adjustment for two-sample treatment comparisons in randomized clinical trials. Stat Med 27:4658-4677.
Wang B, Susukida R, Mojtabai R, Amin-Esmaeili M, Rosenblum M. 2023. Model-robust inference for clinical trials that improve precision by stratified randomization and covariate adjustment. JASA 118:1152-1163.
Yee TW. 2022. On the Hauck-Donner effect in Wald tests. JASA 117:1763-1774.
Zou G. 2004. A modified Poisson regression approach to prospective studies with binary data. AJE 159:702-706.

Related Skills

clinical-biostatistics/cdisc-data-handling - Prepare analysis datasets from CDISC SDTM/ADaM domains
clinical-biostatistics/effect-measures - Modern CI methods; marginal vs conditional in depth
clinical-biostatistics/categorical-tests - Chi-square and Fisher alternatives for unadjusted tests
clinical-biostatistics/subgroup-analysis - Interaction terms and HTE detection
clinical-biostatistics/survival-analysis - Cox regression for time-to-event analogues
clinical-biostatistics/multiplicity-graphical - Bretz-Maurer graphs for multiple endpoints from one logistic model
clinical-biostatistics/trial-reporting - CONSORT 2025 and ICH E9(R1) reporting of logistic analyses

Adoption

GPTomics/bio-clinical-biostatistics-logistic-regression

$ install --global

Security Scan Results

SKILL.md

Version Compatibility

Logistic Regression for Clinical Outcomes

Algorithmic Taxonomy

Decision Tree by Scenario

Standard Workflow

Marginal vs Conditional Estimand -- The FDA 2023 Pivot

G-computation / Standardisation

Modified Poisson for Common Outcomes -- Direct RR

Proportional Odds and Brant Test

Separation and Firth Penalty

Hauck-Donner Effect Detection

Reconciliation: When Methods Disagree

Per-Method Failure Modes

Reference-category silent reversal

Adjusting for a mediator

Hauck-Donner non-monotonicity

Complete or quasi-complete separation

Proportional odds violation undetected

Stratified randomisation not in analysis

Model Diagnostics

Hosmer-Lemeshow

Quantitative Thresholds

Common Errors

Anticipated Reviewer Pushback

References

Related Skills

Related Skills

GPTomics/bio-workflows-clip-pipeline

GPTomics/bio-comparative-genomics-whole-genome-duplication

GPTomics/bio-comparative-genomics-whole-genome-alignment

GPTomics/bio-comparative-genomics-synteny-analysis

GPTomics/bio-clinical-biostatistics-logistic-regression

$ install --global

Security Scan Results

SKILL.md

Version Compatibility

Logistic Regression for Clinical Outcomes

Algorithmic Taxonomy

Decision Tree by Scenario

Standard Workflow

Marginal vs Conditional Estimand -- The FDA 2023 Pivot

G-computation / Standardisation

Modified Poisson for Common Outcomes -- Direct RR

Proportional Odds and Brant Test

Separation and Firth Penalty

Hauck-Donner Effect Detection

Reconciliation: When Methods Disagree

Per-Method Failure Modes

Reference-category silent reversal

Adjusting for a mediator

Hauck-Donner non-monotonicity

Complete or quasi-complete separation

Proportional odds violation undetected

Stratified randomisation not in analysis

Model Diagnostics

Hosmer-Lemeshow

Quantitative Thresholds

Common Errors

Anticipated Reviewer Pushback

References

Related Skills

Related Skills

GPTomics/bio-workflows-clip-pipeline

GPTomics/bio-comparative-genomics-whole-genome-duplication

GPTomics/bio-comparative-genomics-whole-genome-alignment

GPTomics/bio-comparative-genomics-synteny-analysis