skills/python_data_stack/SKILL.md
Statistical modeling toolkit. OLS, GLM, logistic, ARIMA, time series, hypothesis tests, diagnostics, AIC/BIC, for rigorous statistical inference and econometric analysis.
npx skillsauth add vuralserhat86/antigravity-agentic-skills python_data_stackInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Statsmodels is Python's premier library for statistical modeling, providing tools for estimation, inference, and diagnostics across a wide range of statistical methods. Apply this skill for rigorous statistical analysis, from simple linear regression to complex time series models and econometric analyses.
This skill should be used when:
import statsmodels.api as sm
import numpy as np
import pandas as pd
# Prepare data - ALWAYS add constant for intercept
X = sm.add_constant(X_data)
# Fit OLS model
model = sm.OLS(y, X)
results = model.fit()
# View comprehensive results
print(results.summary())
# Key results
print(f"R-squared: {results.rsquared:.4f}")
print(f"Coefficients:\\n{results.params}")
print(f"P-values:\\n{results.pvalues}")
# Predictions with confidence intervals
predictions = results.get_prediction(X_new)
pred_summary = predictions.summary_frame()
print(pred_summary) # includes mean, CI, prediction intervals
# Diagnostics
from statsmodels.stats.diagnostic import het_breuschpagan
bp_test = het_breuschpagan(results.resid, X)
print(f"Breusch-Pagan p-value: {bp_test[1]:.4f}")
# Visualize residuals
import matplotlib.pyplot as plt
plt.scatter(results.fittedvalues, results.resid)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.show()
from statsmodels.discrete.discrete_model import Logit
# Add constant
X = sm.add_constant(X_data)
# Fit logit model
model = Logit(y_binary, X)
results = model.fit()
print(results.summary())
# Odds ratios
odds_ratios = np.exp(results.params)
print("Odds ratios:\\n", odds_ratios)
# Predicted probabilities
probs = results.predict(X)
# Binary predictions (0.5 threshold)
predictions = (probs > 0.5).astype(int)
# Model evaluation
from sklearn.metrics import classification_report, roc_auc_score
print(classification_report(y_binary, predictions))
print(f"AUC: {roc_auc_score(y_binary, probs):.4f}")
# Marginal effects
marginal = results.get_margeff()
print(marginal.summary())
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
# Check stationarity
from statsmodels.tsa.stattools import adfuller
adf_result = adfuller(y_series)
print(f"ADF p-value: {adf_result[1]:.4f}")
if adf_result[1] > 0.05:
# Series is non-stationary, difference it
y_diff = y_series.diff().dropna()
# Plot ACF/PACF to identify p, q
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 8))
plot_acf(y_diff, lags=40, ax=ax1)
plot_pacf(y_diff, lags=40, ax=ax2)
plt.show()
# Fit ARIMA(p,d,q)
model = ARIMA(y_series, order=(1, 1, 1))
results = model.fit()
print(results.summary())
# Forecast
forecast = results.forecast(steps=10)
forecast_obj = results.get_forecast(steps=10)
forecast_df = forecast_obj.summary_frame()
print(forecast_df) # includes mean and confidence intervals
# Residual diagnostics
results.plot_diagnostics(figsize=(12, 8))
plt.show()
import statsmodels.api as sm
# Poisson regression for count data
X = sm.add_constant(X_data)
model = sm.GLM(y_counts, X, family=sm.families.Poisson())
results = model.fit()
print(results.summary())
# Rate ratios (for Poisson with log link)
rate_ratios = np.exp(results.params)
print("Rate ratios:\\n", rate_ratios)
# Check overdispersion
overdispersion = results.pearson_chi2 / results.df_resid
print(f"Overdispersion: {overdispersion:.2f}")
if overdispersion > 1.5:
# Use Negative Binomial instead
from statsmodels.discrete.count_model import NegativeBinomial
nb_model = NegativeBinomial(y_counts, X)
nb_results = nb_model.fit()
print(nb_results.summary())
Comprehensive suite of linear models for continuous outcomes with various error structures.
Available models:
Key features:
When to use: Continuous outcome variable, want inference on coefficients, need diagnostics
Reference: See references/linear_models.md for detailed guidance on model selection, diagnostics, and best practices.
Flexible framework extending linear models to non-normal distributions.
Distribution families:
Link functions:
Key features:
When to use: Non-normal outcomes, need flexible variance and link specifications
Reference: See references/glm.md for family selection, link functions, interpretation, and diagnostics.
Models for categorical and count outcomes.
Binary models:
Multinomial models:
Count models:
Key features:
When to use: Binary, categorical, or count outcomes
Reference: See references/discrete_choice.md for model selection, interpretation, and evaluation.
Comprehensive time series modeling and forecasting capabilities.
Univariate models:
Multivariate models:
Advanced models:
Key features:
When to use: Time-ordered data, forecasting, understanding temporal dynamics
Reference: See references/time_series.md for model selection, diagnostics, and forecasting methods.
Extensive testing and diagnostic capabilities for model validation.
Residual diagnostics:
Influence and outliers:
Hypothesis testing:
Multiple comparisons:
Effect sizes and power:
Robust inference:
When to use: Validating assumptions, detecting problems, ensuring robust inference
Reference: See references/stats_diagnostics.md for comprehensive testing and diagnostic procedures.
Statsmodels supports R-style formulas for intuitive model specification:
import statsmodels.formula.api as smf
# OLS with formula
results = smf.ols('y ~ x1 + x2 + x1:x2', data=df).fit()
# Categorical variables (automatic dummy coding)
results = smf.ols('y ~ x1 + C(category)', data=df).fit()
# Interactions
results = smf.ols('y ~ x1 * x2', data=df).fit() # x1 + x2 + x1:x2
# Polynomial terms
results = smf.ols('y ~ x + I(x**2)', data=df).fit()
# Logit
results = smf.logit('y ~ x1 + x2 + C(group)', data=df).fit()
# Poisson
results = smf.poisson('count ~ x1 + x2', data=df).fit()
# ARIMA (not available via formula, use regular API)
# Compare models using AIC/BIC
models = {
'Model 1': model1_results,
'Model 2': model2_results,
'Model 3': model3_results
}
comparison = pd.DataFrame({
'AIC': {name: res.aic for name, res in models.items()},
'BIC': {name: res.bic for name, res in models.items()},
'Log-Likelihood': {name: res.llf for name, res in models.items()}
})
print(comparison.sort_values('AIC'))
# Lower AIC/BIC indicates better model
# For nested models (one is subset of the other)
from scipy import stats
lr_stat = 2 * (full_model.llf - reduced_model.llf)
df = full_model.df_model - reduced_model.df_model
p_value = 1 - stats.chi2.cdf(lr_stat, df)
print(f"LR statistic: {lr_stat:.4f}")
print(f"p-value: {p_value:.4f}")
if p_value < 0.05:
print("Full model significantly better")
else:
print("Reduced model preferred (parsimony)")
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = []
for train_idx, val_idx in kf.split(X):
X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]
# Fit model
model = sm.OLS(y_train, X_train).fit()
# Predict
y_pred = model.predict(X_val)
# Score
rmse = np.sqrt(mean_squared_error(y_val, y_pred))
cv_scores.append(rmse)
print(f"CV RMSE: {np.mean(cv_scores):.4f} ± {np.std(cv_scores):.4f}")
sm.add_constant() unless excluding intercept.summary() for detailed outputThis skill includes comprehensive reference files for detailed guidance:
Detailed coverage of linear regression models including:
Complete guide to generalized linear models:
Comprehensive guide to discrete outcome models:
In-depth time series analysis guidance:
Comprehensive statistical testing and diagnostics:
When to reference:
Search patterns:
# Find information about specific models
grep -r "Quantile Regression" references/
# Find diagnostic tests
grep -r "Breusch-Pagan" references/stats_diagnostics.md
# Find time series guidance
grep -r "SARIMAX" references/time_series.md
sm.add_constant() unless no intercept desiredFor detailed documentation and examples:
Kaynak: Modern Data Stack & Polars Documentation
Polars kullan (Lazy Evaluation ve Parallelism için).DuckDB kullan (In-process OLAP).Parquet veya Arrow formatında sakla (Hız ve sıkıştırma)..py dosyalarında olsun.Altair veya Plotnine (Grammar of Graphics) kullan.ydata-profiling raporları oluştur.Prefect veya Dagster kullan (Airflow yerine daha modern ve lightweight).Pydantic veya Great Expectations kullan.uv veya conda environment kullan.| Aşama | Doğrulama | |-------|-----------| | 1 | Veri işleme hafızaya (RAM) sığıyor mu? (Sığmıyorsa Polars Streaming veya DuckDB kullan). | | 2 | Tekrarlanan analizler modüler fonksiyonlara dönüştürüldü mü? | | 3 | Veri gizliliği (PII) kontrol edildi mi? |
tools
Production-tested setup for Zustand state management in React. Includes patterns for persistence, devtools, and TypeScript patterns. Prevents hydration mismatches and render loops.
development
Comprehensive spreadsheet creation, editing, and analysis with support for formulas, formatting, data analysis, and visualization. When Claude needs to work with spreadsheets (.xlsx, .xlsm, .csv, .tsv, etc) for: (1) Creating new spreadsheets with formulas and formatting, (2) Reading or analyzing data, (3) Modify existing spreadsheets while preserving formulas, (4) Data analysis and visualization in spreadsheets, or (5) Recalculating formulas
development
--- name: websocket_engineer router_kit: FullStackKit description: WebSocket specialist for real-time communication systems. Invoke for Socket.IO, WebSocket servers, bidirectional messaging, presence systems. Keywords: WebSocket, Socket.IO, real-time, pub/sub, Redis. triggers: - WebSocket - Socket.IO - real-time communication - bidirectional messaging - pub/sub - server push - live updates - chat systems - presence tracking role: specialist scope: implementation output-format:
tools
Toolkit for interacting with and testing local web applications using Playwright. Supports verifying frontend functionality, debugging UI behavior, capturing browser screenshots, and viewing browser logs.