skills/17-DAAF-Contribution-Community-daaf/dot-claude/skills/statsmodels/SKILL.md
Statistical modeling: OLS/WLS/GLS, GLM (logit, probit, Poisson), time series (ARIMA, VAR), mixed effects, diagnostics. Formula API. Use for regressions without fixed effects, GLMs, or time series. For FE/DiD use pyfixest; panel/IV use linearmodels.
npx skillsauth add brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research statsmodels

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
statsmodels: general-purpose statistical modeling library for Python. Covers OLS/WLS/GLS, GLM (logit, probit, Poisson, negative binomial), discrete choice models, time series (ARIMA, SARIMAX, VAR), mixed effects (MixedLM), robust regression, hypothesis tests, and comprehensive diagnostics. Supports an R-style formula API. Use when fitting regressions without fixed effects, running GLMs or logit/probit, analyzing time series, or using formula syntax. For fixed effects or DiD, use pyfixest; for panel/IV/system models, use linearmodels.
Comprehensive skill for statistical modeling with statsmodels. Use decision trees below to find the right guidance, then load detailed references.
statsmodels is the general-purpose statistical modeling library for Python:
- Formula API (smf.ols("y ~ x1 + x2", data=df)) for R-style modeling, and array API (sm.OLS(y, X)) for programmatic control

| File | Purpose | When to Read |
|------|---------|--------------|
| quickstart.md | Installation, formula vs array API, first model | Starting with statsmodels |
| linear-models.md | OLS, WLS, GLS, robust regression, quantile regression | Fitting linear models |
| glm-discrete.md | GLM families, logit/probit, count models, zero-inflated | Non-linear models, binary/count outcomes |
| time-series.md | ARIMA, SARIMAX, VAR, exponential smoothing, unit root tests | Analyzing temporal data |
| diagnostics.md | Heteroskedasticity, normality, VIF, influence, residuals | Checking model assumptions |
| hypothesis-testing.md | t-tests, F-tests, Wald tests, multiple comparisons | Testing coefficients and comparing models |
| gotchas.md | Constant term, convergence, predict pitfalls, pyfixest boundary | Debugging issues |
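- Linear models: quickstart.md then linear-models.md
- GLMs, binary or count outcomes: quickstart.md then glm-discrete.md
- Time series: quickstart.md then time-series.md
- Checking model assumptions: diagnostics.md
- Coming from R: quickstart.md (formula API mirrors R syntax)
- Complex survey data: use the svy skill instead
- Convert with df.to_pandas() if using Polars

What kind of regression?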
├─ Linear (continuous outcome)
│ ├─ Basic OLS → ./references/linear-models.md
│ ├─ Weighted least squares → ./references/linear-models.md
│ │ (⚠ WLS ≠ survey-weighted regression — for complex surveys, use `svy` skill)
│ ├─ Correlated errors (GLS) → ./references/linear-models.md
│ ├─ Robust to outliers (M-estimator) → ./references/linear-models.md
│ └─ Quantile regression → ./references/linear-models.md
├─ Binary outcome (0/1)
│ ├─ Logit → ./references/glm-discrete.md
│ └─ Probit → ./references/glm-discrete.md
├─ Count outcome (0, 1, 2, ...)
│ ├─ Poisson → ./references/glm-discrete.md
│ ├─ Negative binomial → ./references/glm-discrete.md
│ └─ Zero-inflated → ./references/glm-discrete.md
├─ Multinomial (3+ categories)
│ └─ Multinomial logit → ./references/glm-discrete.md
├─ GLM (custom family/link)
│ └─ GLM framework → ./references/glm-discrete.md
└─ Need fixed effects?
    └─ Use pyfixest instead (faster FE absorption)
What time series task?
├─ Forecast a single series
│ ├─ ARIMA / SARIMAX → ./references/time-series.md
│ └─ Exponential smoothing → ./references/time-series.md
├─ Multiple interrelated series
│ └─ VAR / VECM → ./references/time-series.md
├─ Test for stationarity
│ ├─ ADF test → ./references/time-series.md
│ └─ KPSS test → ./references/time-series.md
├─ Examine autocorrelation
│ └─ ACF / PACF → ./references/time-series.md
└─ Structural time series
    └─ Unobserved components → ./references/time-series.md
What assumption to check?
├─ Heteroskedasticity → ./references/diagnostics.md
│ ├─ Breusch-Pagan test
│ └─ White test
├─ Normality of residuals → ./references/diagnostics.md
│ ├─ Jarque-Bera test
│ └─ Shapiro-Wilk test
├─ Specification / functional form → ./references/diagnostics.md
│ └─ RESET test
├─ Multicollinearity → ./references/diagnostics.md
│ ├─ VIF
│ └─ Condition number
├─ Influential observations → ./references/diagnostics.md
│ ├─ Cook's distance
│ └─ Leverage / DFFITS
├─ Serial correlation → ./references/diagnostics.md
│ └─ Durbin-Watson / Breusch-Godfrey
└─ All of the above → ./references/diagnostics.md
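
The checks in the tree above can be sketched on one fitted OLS model as follows. The data are synthetic, with error variance deliberately tied to `x1` so that the Breusch-Pagan test has something to detect; all variable names are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson, jarque_bera

# Synthetic data with heteroskedastic errors (error scale grows with x1)
rng = np.random.default_rng(4)
n = 500
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
noise = rng.normal(size=n) * np.exp(0.5 * df["x1"].to_numpy())
df["y"] = 1.0 + 2.0 * df["x1"] + df["x2"] + noise

res = smf.ols("y ~ x1 + x2", data=df).fit()
exog = res.model.exog  # design matrix, including the intercept column

# Heteroskedasticity: Breusch-Pagan (LM statistic, LM p-value, F statistic, F p-value)
bp_lm, bp_lm_p, bp_f, bp_f_p = het_breuschpagan(res.resid, exog)

# Multicollinearity: VIF per regressor, skipping the intercept
vif = {
    name: variance_inflation_factor(exog, i)
    for i, name in enumerate(res.model.exog_names)
    if name != "Intercept"
}

# Serial correlation: Durbin-Watson (values near 2 suggest no first-order autocorrelation)
dw = durbin_watson(res.resid)

# Normality of residuals: Jarque-Bera
jb_stat, jb_p, skew, kurt = jarque_bera(res.resid)

# Influential observations: Cook's distance per observation
cooks_d, _ = res.get_influence().cooks_distance

print(f"BP p={bp_lm_p:.4g}, DW={dw:.2f}, JB p={jb_p:.4g}, max Cook's D={cooks_d.max():.3f}")
print(vif)
```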
What kind of test?
├─ Single coefficient significance → ./references/hypothesis-testing.md
├─ Joint significance (F-test) → ./references/hypothesis-testing.md
├─ Linear restrictions (Wald) → ./references/hypothesis-testing.md
├─ Compare nested models (LR test) → ./references/hypothesis-testing.md
├─ Multiple comparisons correction → ./references/hypothesis-testing.md
└─ Chi-squared test → ./references/hypothesis-testing.md
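
A minimal sketch of several tests from the tree above on an OLS fit: a single-coefficient t-test, a joint F-test, a linear restriction via Wald, nested-model comparisons, and a Holm correction. The data and the variable names (`x1`, `x2`, `x3`) are synthetic and hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.multitest import multipletests

# Synthetic data: x3 has no true effect on y
rng = np.random.default_rng(5)
n = 300
df = pd.DataFrame(rng.normal(size=(n, 3)), columns=["x1", "x2", "x3"])
df["y"] = 1.0 + 2.0 * df["x1"] + 0.5 * df["x2"] + rng.normal(size=n)

res_full = smf.ols("y ~ x1 + x2 + x3", data=df).fit()
res_restricted = smf.ols("y ~ x1", data=df).fit()

# Single coefficient
t_single = res_full.t_test("x1 = 0")

# Joint significance of x2 and x3
f_joint = res_full.f_test("x2 = 0, x3 = 0")

# Linear restriction: are the x2 and x3 coefficients equal?
wald = res_full.wald_test("x2 = x3")

# Nested-model comparisons against the restricted model
f_cmp, f_cmp_p, df_diff = res_full.compare_f_test(res_restricted)
lr_stat, lr_p, lr_df = res_full.compare_lr_test(res_restricted)

# Multiple-comparison correction (Holm) on the slope p-values
reject, p_adj, _, _ = multipletests(res_full.pvalues.drop("Intercept"), method="holm")

print(t_single)
print(f_joint)
print(wald)
print(f_cmp, f_cmp_p, lr_stat, lr_p)
print(reject, p_adj)
```

For OLS with the default (non-robust) covariance, the joint F-test and `compare_f_test` against the restricted model give the same F statistic.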
Common issues?
├─ Missing constant / intercept → ./references/gotchas.md
├─ Convergence warnings → ./references/gotchas.md
├─ predict() errors → ./references/gotchas.md
├─ Formula parsing issues → ./references/gotchas.md
├─ summary() formatting → ./references/gotchas.md
├─ statsmodels vs pyfixest → ./references/gotchas.md
└─ General troubleshooting → ./references/gotchas.md
Important: In data research pipelines (see CLAUDE.md), statsmodels analyses are executed through script files, not interactively. This ensures auditability and reproducibility.
The pattern:
scripts/stage8_analysis/{step}_{model-name}.py

Closely read agent_reference/SCRIPT_EXECUTION_REFERENCE.md for the mandatory file-first execution protocol covering complete code file writing, output capture, and file versioning rules.
See:
agent_reference/SCRIPT_EXECUTION_REFERENCE.md: script execution protocol and format with validation

The examples below show statsmodels syntax. In research workflows, wrap them in scripts following the file-first pattern.
import statsmodels.api as sm # Array API
import statsmodels.formula.api as smf # Formula API (R-style)
| Operation | Code |
|-----------|------|
| OLS (formula) | smf.ols("y ~ x1 + x2", data=df).fit() |
| OLS (array) | sm.OLS(y, sm.add_constant(X)).fit() |
| Logit | smf.logit("y ~ x1 + x2", data=df).fit() |
| Probit | smf.probit("y ~ x1 + x2", data=df).fit() |
| Poisson | smf.poisson("y ~ x1 + x2", data=df).fit() |
| GLM (custom) | smf.glm("y ~ x1", data=df, family=sm.families.Binomial()).fit() |
| WLS | smf.wls("y ~ x1", data=df, weights=w).fit() |
| Robust (HC1) | smf.ols(...).fit(cov_type='HC1') |
| ARIMA | sm.tsa.ARIMA(y, order=(p,d,q)).fit() |
| Summary | results.summary() |
| Predict | results.predict(new_data) |
| Confidence intervals | results.conf_int(alpha=0.05) |
| Marginal effects | results.get_margeff(at='overall') |
| VIF | from statsmodels.stats.outliers_influence import variance_inflation_factor |
| Breusch-Pagan | sm.stats.diagnostic.het_breuschpagan(resid, exog) |
# Additive terms
"y ~ x1 + x2 + x3"
# Interaction (with main effects)
"y ~ x1 * x2" # equivalent to x1 + x2 + x1:x2
# Interaction only (no main effects)
"y ~ x1 : x2"
# Categorical variable
"y ~ C(region)" # treatment coding (default)
"y ~ C(region, Treatment(reference='West'))" # explicit reference
# Suppress intercept
"y ~ x1 + x2 - 1"
# Polynomial
"y ~ x1 + I(x1**2)" # I() protects Python operators
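
One way to check what a formula expands to is to build the model without fitting it and inspect `exog_names`. The sketch below does this for the patterns listed above; the data and the `region` levels are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data with a three-level categorical (hypothetical names)
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "x1": rng.normal(size=120),
    "x2": rng.normal(size=120),
    "region": np.tile(["East", "South", "West"], 40),
})
df["y"] = df["x1"] + rng.normal(size=120)

# "x1 * x2" expands to main effects plus the interaction
star = smf.ols("y ~ x1 * x2", data=df)
explicit = smf.ols("y ~ x1 + x2 + x1:x2", data=df)
print(star.exog_names)
print(explicit.exog_names)

# Categorical with an explicit reference level: one dummy per non-reference level
cat = smf.ols("y ~ C(region, Treatment(reference='West'))", data=df)
print(cat.exog_names)

# "- 1" drops the intercept column
no_intercept = smf.ols("y ~ x1 + x2 - 1", data=df)
print(no_intercept.exog_names)

# I() makes ** mean arithmetic power instead of a formula operator
poly = smf.ols("y ~ x1 + I(x1**2)", data=df)
print(poly.exog_names)
```

Inspecting `exog_names` before calling `.fit()` is a cheap way to confirm that a formula produced the design matrix you intended.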
| Topic | Reference File |
|-------|---------------|
| Installation | ./references/quickstart.md |
| Formula vs array API | ./references/quickstart.md |
| Reading summary output | ./references/quickstart.md |
| Comparison to pyfixest | ./references/quickstart.md |
| OLS regression | ./references/linear-models.md |
| Weighted least squares | ./references/linear-models.md |
| GLS | ./references/linear-models.md |
| Robust regression (RLM) | ./references/linear-models.md |
| Quantile regression | ./references/linear-models.md |
| Interactions and polynomials | ./references/linear-models.md |
| GLM framework | ./references/glm-discrete.md |
| Logit / probit | ./references/glm-discrete.md |
| Multinomial logit | ./references/glm-discrete.md |
| Poisson / negative binomial | ./references/glm-discrete.md |
| Zero-inflated models | ./references/glm-discrete.md |
| Marginal effects | ./references/glm-discrete.md |
| Exposure / offset | ./references/glm-discrete.md |
| ARIMA / SARIMAX | ./references/time-series.md |
| VAR / VECM | ./references/time-series.md |
| Exponential smoothing | ./references/time-series.md |
| Unit root tests | ./references/time-series.md |
| ACF / PACF | ./references/time-series.md |
| Forecasting | ./references/time-series.md |
| State space models | ./references/time-series.md |
| Heteroskedasticity tests | ./references/diagnostics.md |
| Normality tests | ./references/diagnostics.md |
| Specification tests (RESET) | ./references/diagnostics.md |
| VIF / multicollinearity | ./references/diagnostics.md |
| Influence measures | ./references/diagnostics.md |
| Residual analysis | ./references/diagnostics.md |
| Durbin-Watson | ./references/diagnostics.md |
| t-tests and F-tests | ./references/hypothesis-testing.md |
| Wald tests | ./references/hypothesis-testing.md |
| Likelihood ratio tests | ./references/hypothesis-testing.md |
| Multiple comparison corrections | ./references/hypothesis-testing.md |
| Comparing nested models | ./references/hypothesis-testing.md |
| Serial correlation tests | ./references/diagnostics.md |
| Diagnostic checklist | ./references/diagnostics.md |
| Chi-squared tests | ./references/hypothesis-testing.md |
| Joint significance tests | ./references/hypothesis-testing.md |
| Ordered logit / probit | ./references/glm-discrete.md |
| Mixed effects (MixedLM) | ./references/linear-models.md |
| Constant term pitfall | ./references/gotchas.md |
| Convergence warnings | ./references/gotchas.md |
| predict() issues | ./references/gotchas.md |
| Formula parsing (patsy) | ./references/gotchas.md |
| summary() vs summary2() | ./references/gotchas.md |
| NaN / missing data | ./references/gotchas.md |
| DataFrame index issues | ./references/gotchas.md |
| statsmodels vs pyfixest | ./references/gotchas.md |
When this library is used as a primary analytical tool, include in the report's Software & Tools references:
Seabold, S. & Perktold, J. (2010). "Statsmodels: Econometric and Statistical Modeling with Python." Proceedings of the 9th Python in Science Conference.
Cite when: statsmodels is used for GLM estimation, time series modeling, or statistical hypothesis testing central to the analysis.

Do not cite when: statsmodels is only used for post-estimation diagnostics supporting another library's primary estimation.
For method-specific citations (e.g., individual estimators or techniques),
consult the reference files in this skill and agent_reference/CITATION_REFERENCE.md.