skills/biostatistics/statsmodels-statistical-modeling/SKILL.md
Python statistical modeling: regression (OLS, WLS, GLM), discrete (Logit, Poisson, NegBin), time series (ARIMA, SARIMAX, VAR), with rigorous inference, diagnostics, and hypothesis tests. Use scikit-learn for ML; statistical-analysis for test choice.
npx skillsauth add jaechang-hits/sciagent-skills statsmodels-statistical-modelingInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Statsmodels provides classical statistical modeling with rigorous inference for Python. It covers linear models, generalized linear models, discrete choice, time series, and comprehensive diagnostics. Unlike scikit-learn (prediction-focused), statsmodels emphasizes coefficient interpretation, p-values, confidence intervals, and model diagnostics.
y ~ x1 + x2 + C(group)) for intuitive model specificationscikit-learn insteadpymc insteadstatsmodels, numpy, pandas, scipymatplotlib (for diagnostic plots), patsy (for formula API, included with statsmodels)pip install statsmodels numpy pandas matplotlib
import statsmodels.api as sm
import statsmodels.formula.api as smf
import pandas as pd
import numpy as np
# Generate sample data
np.random.seed(42)
n = 100
df = pd.DataFrame({
"x1": np.random.randn(n),
"x2": np.random.randn(n),
"group": np.random.choice(["A", "B"], n)
})
df["y"] = 2 + 3 * df["x1"] - 1.5 * df["x2"] + np.random.randn(n)
# OLS with formula API (R-style)
results = smf.ols("y ~ x1 + x2 + C(group)", data=df).fit()
print(results.summary())
print(f"R²: {results.rsquared:.3f}, AIC: {results.aic:.1f}")
Standard linear models with comprehensive diagnostics.
import statsmodels.api as sm
import numpy as np
# Generate data
np.random.seed(42)
X = np.random.randn(200, 3)
y = 1 + 2*X[:, 0] - 0.5*X[:, 1] + np.random.randn(200)
# ALWAYS add constant for intercept
X_const = sm.add_constant(X)
results = sm.OLS(y, X_const).fit()
print(results.summary())
print(f"\nCoefficients: {results.params}")
print(f"P-values: {results.pvalues}")
print(f"R²: {results.rsquared:.4f}")
# Predictions with confidence intervals
pred = results.get_prediction(X_const[:5])
print(pred.summary_frame())
# Robust standard errors (heteroskedasticity-consistent)
results_robust = sm.OLS(y, X_const).fit(cov_type="HC3")
print("Robust SEs:", results_robust.bse)
# Weighted Least Squares
weights = 1 / np.abs(results.resid + 0.1) # Example weights
results_wls = sm.WLS(y, X_const, weights=weights).fit()
print(f"WLS R²: {results_wls.rsquared:.4f}")
Extend regression to non-normal outcomes (binary, count, continuous-positive).
import statsmodels.api as sm
import numpy as np
# Poisson regression for count data
np.random.seed(42)
X = np.random.randn(200, 2)
X_const = sm.add_constant(X)
y_counts = np.random.poisson(np.exp(0.5 + 0.3*X[:, 0]))
model = sm.GLM(y_counts, X_const, family=sm.families.Poisson())
results = model.fit()
print(results.summary())
# Rate ratios
rate_ratios = np.exp(results.params)
print(f"Rate ratios: {rate_ratios}")
# Check overdispersion
overdispersion = results.pearson_chi2 / results.df_resid
print(f"Overdispersion ratio: {overdispersion:.2f}")
if overdispersion > 1.5:
print("→ Consider Negative Binomial model")
Binary, multinomial, and count outcome models.
import statsmodels.api as sm
import numpy as np
# Logistic regression
np.random.seed(42)
X = np.random.randn(300, 2)
X_const = sm.add_constant(X)
prob = 1 / (1 + np.exp(-(0.5 + X[:, 0] - 0.5*X[:, 1])))
y_binary = np.random.binomial(1, prob)
logit_results = sm.Logit(y_binary, X_const).fit()
print(logit_results.summary())
# Odds ratios
odds_ratios = np.exp(logit_results.params)
print(f"Odds ratios: {odds_ratios}")
# Marginal effects (at means)
margeff = logit_results.get_margeff()
print(margeff.summary())
# Predicted probabilities
probs = logit_results.predict(X_const[:5])
print(f"Predicted P(Y=1): {probs}")
Univariate and multivariate time series modeling and forecasting.
import statsmodels.api as sm
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller
import numpy as np
import pandas as pd
# Generate time series
np.random.seed(42)
dates = pd.date_range("2020-01-01", periods=200, freq="D")
y = np.cumsum(np.random.randn(200)) + 50
ts = pd.Series(y, index=dates)
# Stationarity test
adf_result = adfuller(ts)
print(f"ADF statistic: {adf_result[0]:.4f}, p-value: {adf_result[1]:.4f}")
print("Stationary" if adf_result[1] < 0.05 else "Non-stationary → difference")
# Fit ARIMA
model = ARIMA(ts, order=(1, 1, 1))
results = model.fit()
print(results.summary())
# Forecast with confidence intervals
forecast = results.get_forecast(steps=30)
forecast_df = forecast.summary_frame()
print(f"30-day forecast:\n{forecast_df.head()}")
# Seasonal ARIMA (SARIMAX)
from statsmodels.tsa.statespace.sarimax import SARIMAX
# Monthly data with yearly seasonality
model_sarima = SARIMAX(ts, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
results_sarima = model_sarima.fit(disp=False)
print(f"AIC: {results_sarima.aic:.1f}")
# Diagnostic plots
results_sarima.plot_diagnostics(figsize=(12, 8))
Assumption tests, hypothesis tests, and model validation.
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan, acorr_ljungbox
from statsmodels.stats.stattools import jarque_bera
import numpy as np
# Fit a model first
np.random.seed(42)
X = sm.add_constant(np.random.randn(200, 2))
y = 1 + 2*X[:, 1] + np.random.randn(200) * X[:, 1] # Heteroskedastic
results = sm.OLS(y, X).fit()
# Heteroskedasticity test (Breusch-Pagan)
bp_stat, bp_p, _, _ = het_breuschpagan(results.resid, X)
print(f"Breusch-Pagan p-value: {bp_p:.4f} {'→ heteroskedastic' if bp_p < 0.05 else '→ OK'}")
# Normality test (Jarque-Bera)
jb_stat, jb_p, _, _ = jarque_bera(results.resid)
print(f"Jarque-Bera p-value: {jb_p:.4f} {'→ non-normal' if jb_p < 0.05 else '→ OK'}")
# Autocorrelation test (Ljung-Box)
lb_result = acorr_ljungbox(results.resid, lags=[10], return_df=True)
print(f"Ljung-Box p-value (lag 10): {lb_result['lb_pvalue'].values[0]:.4f}")
# Variance Inflation Factor (multicollinearity)
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif_data = pd.DataFrame({
"Variable": [f"x{i}" for i in range(X.shape[1])],
"VIF": [variance_inflation_factor(X, i) for i in range(X.shape[1])]
})
print(vif_data) # VIF > 10 suggests multicollinearity
Intuitive model specification using formulas with automatic dummy coding.
import statsmodels.formula.api as smf
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame({
"y": np.random.randn(100),
"x1": np.random.randn(100),
"x2": np.random.randn(100),
"group": np.random.choice(["A", "B", "C"], 100),
})
# Formula with categoricals (auto dummy-coded)
res = smf.ols("y ~ x1 + x2 + C(group)", data=df).fit()
print(res.summary())
# Interactions
res2 = smf.ols("y ~ x1 * x2", data=df).fit() # x1 + x2 + x1:x2
print(f"Interaction term p-value: {res2.pvalues['x1:x2']:.4f}")
# Logit via formula
df["binary"] = (df["y"] > 0).astype(int)
logit_res = smf.logit("binary ~ x1 + x2 + C(group)", data=df).fit()
print(f"Logit AIC: {logit_res.aic:.1f}")
Goal: Fit OLS, validate assumptions, use robust SEs if needed.
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor
import numpy as np
import pandas as pd
# 1. Fit initial model
np.random.seed(42)
df = pd.DataFrame({"y": np.random.randn(200), "x1": np.random.randn(200), "x2": np.random.randn(200)})
df["y"] = 2 + 3*df["x1"] - df["x2"] + np.random.randn(200)
results = smf.ols("y ~ x1 + x2", data=df).fit()
# 2. Check heteroskedasticity
bp_stat, bp_p, _, _ = het_breuschpagan(results.resid, results.model.exog)
print(f"Breusch-Pagan p: {bp_p:.4f}")
# 3. If heteroskedastic, use robust SEs
if bp_p < 0.05:
results = smf.ols("y ~ x1 + x2", data=df).fit(cov_type="HC3")
print("Using HC3 robust standard errors")
# 4. Check multicollinearity
X = results.model.exog
for i in range(1, X.shape[1]): # skip constant
print(f"VIF x{i}: {variance_inflation_factor(X, i):.2f}")
# 5. Final results
print(results.summary())
print(f"\nAIC: {results.aic:.1f}, BIC: {results.bic:.1f}")
Goal: Compare nested and non-nested models using appropriate criteria.
import statsmodels.formula.api as smf
from scipy import stats
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame({"y": np.random.randn(200), "x1": np.random.randn(200),
"x2": np.random.randn(200), "x3": np.random.randn(200)})
df["y"] = 1 + 2*df["x1"] - df["x2"] + 0.1*df["x3"] + np.random.randn(200)
# Fit nested models
m1 = smf.ols("y ~ x1", data=df).fit()
m2 = smf.ols("y ~ x1 + x2", data=df).fit()
m3 = smf.ols("y ~ x1 + x2 + x3", data=df).fit()
# Compare via AIC/BIC (lower = better)
comparison = pd.DataFrame({
"R²": [m.rsquared for m in [m1, m2, m3]],
"AIC": [m.aic for m in [m1, m2, m3]],
"BIC": [m.bic for m in [m1, m2, m3]],
}, index=["y~x1", "y~x1+x2", "y~x1+x2+x3"])
print(comparison)
# Likelihood ratio test (nested: m2 vs m3)
lr_stat = 2 * (m3.llf - m2.llf)
p_val = 1 - stats.chi2.cdf(lr_stat, df=m3.df_model - m2.df_model)
print(f"\nLR test (m3 vs m2): stat={lr_stat:.2f}, p={p_val:.4f}")
Goal: Test stationarity, identify model order, forecast.
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Generate data
np.random.seed(42)
ts = pd.Series(np.cumsum(np.random.randn(200)) + 100,
index=pd.date_range("2020-01-01", periods=200, freq="D"))
# 1. Test stationarity
adf_p = adfuller(ts)[1]
print(f"ADF p-value: {adf_p:.4f} → {'stationary' if adf_p < 0.05 else 'non-stationary'}")
# 2. Identify order from ACF/PACF (on differenced series)
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 6))
plot_acf(ts.diff().dropna(), lags=20, ax=ax1)
plot_pacf(ts.diff().dropna(), lags=20, ax=ax2)
plt.savefig("acf_pacf.png", dpi=150, bbox_inches="tight")
# 3. Fit and forecast
model = ARIMA(ts[:180], order=(1, 1, 1))
results = model.fit()
forecast = results.get_forecast(steps=20)
fc_df = forecast.summary_frame()
print(f"ARIMA AIC: {results.aic:.1f}")
print(f"Forecast (first 5 days):\n{fc_df.head()}")
| Parameter | Module | Default | Range / Options | Effect |
|-----------|--------|---------|-----------------|--------|
| cov_type | OLS/WLS/GLM | "nonrobust" | "HC0"-"HC3", "HAC", "cluster" | Robust covariance estimator |
| family | GLM | required | Poisson(), Binomial(), Gamma(), etc. | Distribution family |
| order | ARIMA | required | (p, d, q) tuple | AR order, differencing, MA order |
| seasonal_order | SARIMAX | (0,0,0,0) | (P, D, Q, s) tuple | Seasonal ARIMA parameters |
| alpha | summary(), conf_int() | 0.05 | 0.01-0.10 | Significance level for CIs |
| maxiter | All .fit() | 35-100 | 50-1000 | Max optimization iterations |
| method | .fit() | model-dependent | "newton", "bfgs", "lbfgs", "powell" | Optimization algorithm |
| lags | ACF/PACF | None | 10-50 | Number of lags to display |
Always add a constant for OLS/GLM: sm.add_constant(X) or use formula API (adds intercept automatically)
Match model to outcome type: Binary → Logit/Probit, Counts → Poisson/NegBin, Continuous → OLS/WLS, Time series → ARIMA
Check diagnostics before interpreting: Run Breusch-Pagan (heteroskedasticity), Jarque-Bera (normality), Ljung-Box (autocorrelation) on residuals
Use robust SEs when assumptions fail: results = model.fit(cov_type="HC3") for heteroskedasticity-robust inference
Report effect sizes, not just p-values: Include coefficients, confidence intervals, and R² alongside significance tests
Prefer formula API for exploratory work: smf.ols("y ~ x1 * x2 + C(group)", data=df) is more readable and handles categoricals automatically
Test stationarity before time series modeling: Use ADF test; difference if non-stationary
When to use: Comparing means across 3+ groups.
import statsmodels.formula.api as smf
from statsmodels.stats.multicomp import pairwise_tukeyhsd
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame({"value": np.concatenate([np.random.normal(m, 1, 30) for m in [5, 6, 7]]),
"group": np.repeat(["A", "B", "C"], 30)})
# One-way ANOVA
anova = smf.ols("value ~ C(group)", data=df).fit()
print(sm.stats.anova_lm(anova))
# Post-hoc Tukey HSD
tukey = pairwise_tukeyhsd(df["value"], df["group"], alpha=0.05)
print(tukey)
When to use: Determining required sample size before a study.
from statsmodels.stats.power import TTestIndPower
analysis = TTestIndPower()
# What sample size for medium effect (d=0.5), 80% power, alpha=0.05?
n = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(f"Required n per group: {n:.0f}")
# Power for given sample size
power = analysis.solve_power(effect_size=0.5, alpha=0.05, nobs1=50)
print(f"Power with n=50: {power:.3f}")
When to use: Hierarchical/clustered data (patients within hospitals, students within schools).
import statsmodels.formula.api as smf
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame({
"y": np.random.randn(100), "x": np.random.randn(100),
"group": np.repeat(range(10), 10)
})
# Random intercept model
model = smf.mixedlm("y ~ x", data=df, groups=df["group"])
results = model.fit()
print(results.summary())
| Problem | Cause | Solution |
|---------|-------|----------|
| MissingDataError | NaN values in data | Drop NAs: df.dropna() or impute before fitting |
| No intercept in results | Forgot sm.add_constant() | Always add constant, or use smf.ols() formula API |
| ConvergenceWarning | Optimization failed | Increase maxiter, try different method, or scale variables |
| Overdispersion in Poisson | Variance > mean | Switch to NegativeBinomial or use GLM(family=NegativeBinomial()) |
| Non-stationary time series | Trend or unit root | Difference the series (ts.diff()) or increase d in ARIMA |
| Singular matrix error | Perfect multicollinearity | Remove redundant variables; check VIF > 10 |
| Different results from R | Default settings differ | Check: constant term, link function, optimizer, SE type |
| PerfectSeparationError in Logit | Predictor perfectly separates classes | Use regularized logistic (penalized MLE) or Firth's method |
tools
Fast short-read DNA aligner for WGS/WES/ChIP-seq. 2× faster BWA-MEM successor; outputs SAM/BAM with read group headers for GATK. Primary plus supplementary records for chimeric reads. Use STAR for RNA-seq splice-aware alignment; Bowtie2 is a comparable alternative.
tools
smina molecular docking CLI. AutoDock Vina fork with customizable scoring functions, native SDF/MOL2/PDB ligand input, autoboxing, local energy minimization, and per-atom score breakdowns. Pipeline: receptor PDBQT prep -> ligand prep (RDKit/OpenBabel) -> dock via autobox or explicit grid -> rescore/minimize with custom scoring -> rank poses by affinity. Choose smina over Vina when you need custom scoring terms (--custom_scoring), local optimization of an existing pose (--local_only), per-atom contributions (--atom_term_data), or SDF/MOL2 ligands without manual PDBQT conversion. For unknown binding sites use diffdock-blind-docking; for the Python-bindings/Vinardo workflow use autodock-vina-docking.
development
mdtraj molecular dynamics trajectory analysis (Python). Reads DCD/XTC/TRR/NetCDF/H5/PDB topologies and trajectories; computes RMSD vs time, radius of gyration, per-residue RMSF, residue-residue contact frequency maps, phi/psi torsions for Ramachandran plots (general + Gly/Pro), and 8-state DSSP secondary structure. Modules: trajectory I/O, geometry (distances/angles/dihedrals), structural analysis (RMSD/Rg/RMSF/SASA), contacts, hydrogen bonds, secondary structure (DSSP), NMR observables. For broader atom-selection grammar use mdanalysis-trajectory; for running MD simulations use OpenMM/GROMACS.
development
Programmatic PubMed access via NCBI E-utilities REST API. Covers Boolean/MeSH queries, field-tagged search, endpoints (ESearch, EFetch, ESummary, EPost, ELink), history server for batches, citation matching, systematic review strategies. Use for biomedical literature search or automated pipelines.