skills/python-regression-statistics/SKILL.md
Expert guidance for regression analysis, statistical modeling, and outlier detection in Python using statsmodels, scikit-learn, scipy, and PyOD - includes model diagnostics, assumption checking, robust methods, and comprehensive outlier detection strategies
npx skillsauth add Hongyu-yu/matsci-ai-skills python-regression-statisticsInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Expert assistant for regression analysis, statistical modeling, and outlier detection in Python. This skill covers statsmodels for statistical regression, scikit-learn for machine learning approaches, scipy for statistical methods, and PyOD for comprehensive outlier detection.
Use when:
Advantages:
Limitations:
Use when:
Advantages:
Limitations:
# 1. Exploratory Data Analysis
# - Visualize relationships (scatter plots, pair plots)
# - Check distributions (histograms, Q-Q plots)
# - Identify potential outliers (box plots, scatter plots)
# - Calculate summary statistics
# 2. Initial Outlier Screening (Optional)
# - Use multiple detection methods (IQR, Z-score, isolation forest)
# - Visualize flagged points
# - Investigate why they're outliers (data errors vs real extremes)
# - Document decision to keep or remove
# 3. Check Assumptions (for statistical regression)
# - Linearity: residual vs fitted plots, partial regression plots
# - Independence: Durbin-Watson test, autocorrelation plots
# - Normality: Q-Q plots, Shapiro-Wilk test, histogram of residuals
# - Equal variance: scale-location plot, Breusch-Pagan test
# 4. Model Fitting
# - Start simple (OLS) then add complexity as needed
# - Try transformations if assumptions violated (log, Box-Cox)
# - Consider robust methods if outliers present
# - Use regularization if many predictors (Ridge/Lasso)
# 5. Model Diagnostics
# - Residual analysis
# - Check for influential points (Cook's D, DFFITS, leverage)
# - Multicollinearity (VIF)
# - Outliers in predictor space vs response space
# 6. Validation
# - Train/test split or cross-validation
# - Compare multiple models (AIC, BIC, R², RMSE)
# - Check prediction intervals (statistical) or residual distribution (ML)
# 7. Interpretation & Reporting
# - Coefficient interpretation with confidence intervals
# - Feature importance (if ML model)
# - Visualize predictions vs actuals
# - Report diagnostics and assumption checks
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
import pandas as pd
# Arrays API - more control
X = sm.add_constant(X_raw) # Add intercept
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
# Formula API - easier specification
df = pd.DataFrame({'y': y, 'x1': x1, 'x2': x2, 'category': cat})
model = smf.ols('y ~ x1 + x2 + C(category)', data=df)
results = model.fit()
# Access key results
results.params # Coefficients
results.pvalues # P-values
results.conf_int() # Confidence intervals
results.rsquared # R²
results.rsquared_adj # Adjusted R²
results.aic, results.bic # Information criteria
results.resid # Residuals
results.fittedvalues # Predicted values
# When heteroskedasticity is present
# Weights = 1/variance of each observation
# If you know the variance structure
weights = 1 / variance_array
model = sm.WLS(y, X, weights=weights)
results = model.fit()
# Iteratively Reweighted Least Squares
# Start with OLS, estimate variance function, refit
ols_results = sm.OLS(y, X).fit()
resid_squared = ols_results.resid**2
# Fit variance as function of predictors
variance_model = sm.OLS(np.log(resid_squared), X).fit()
weights = 1 / np.exp(variance_model.fittedvalues)
wls_results = sm.WLS(y, X, weights=weights).fit()
# Resistant to outliers - downweights influential points
import statsmodels.robust.robust_linear_model as rlm
# Huber's T norm (default)
model = rlm.RLM(y, X, M=sm.robust.norms.HuberT())
results = model.fit()
# Other M-estimators
# TukeyBiweight - more aggressive outlier downweighting
model = rlm.RLM(y, X, M=sm.robust.norms.TukeyBiweight())
results = model.fit()
# Check weights assigned to each observation
weights = results.weights # Outliers get lower weights
# For non-normal response variables
# Poisson regression (count data)
model = sm.GLM(y, X, family=sm.families.Poisson())
results = model.fit()
# Negative Binomial (overdispersed counts)
model = sm.GLM(y, X, family=sm.families.NegativeBinomial())
results = model.fit()
# Gamma regression (positive continuous data)
model = sm.GLM(y, X, family=sm.families.Gamma())
results = model.fit()
# Binomial/Logistic regression
model = sm.GLM(y, X, family=sm.families.Binomial())
results = model.fit()
# Link functions
# family=sm.families.Gaussian(link=sm.families.links.log())
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.vector_ar.var_model import VAR
# ARIMA (AutoRegressive Integrated Moving Average)
model = ARIMA(y, order=(p, d, q)) # p=AR order, d=differencing, q=MA order
results = model.fit()
# Seasonal ARIMA
model = SARIMAX(y, order=(p, d, q), seasonal_order=(P, D, Q, s))
results = model.fit()
# Forecasting
forecast = results.forecast(steps=10)
forecast_with_ci = results.get_forecast(steps=10)
forecast_df = forecast_with_ci.summary_frame()
# Vector Autoregression (multivariate time series)
model = VAR(df[['y1', 'y2', 'y3']])
results = model.fit(maxlags=5)
results.forecast(df.values[-results.k_ar:], steps=10)
# Comprehensive diagnostic plots
from statsmodels.graphics.gofplots import qqplot
import matplotlib.pyplot as plt
# 1. Residual vs Fitted
plt.scatter(results.fittedvalues, results.resid)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
# Should show random scatter, no pattern
# 2. Q-Q plot for normality
qqplot(results.resid, line='s')
# Points should follow diagonal line
# 3. Scale-Location (heteroskedasticity)
plt.scatter(results.fittedvalues, np.sqrt(np.abs(results.resid_pearson)))
plt.xlabel('Fitted values')
plt.ylabel('√|Standardized Residuals|')
# Should show constant spread
# 4. Residuals vs Leverage
from statsmodels.stats.outliers_influence import OLSInfluence
influence = OLSInfluence(results)
fig, ax = plt.subplots()
ax.scatter(influence.hat_matrix_diag, results.resid_pearson)
# Identifies influential points (high leverage + large residual)
# Statistical tests
from statsmodels.stats.diagnostic import het_breuschpagan, acorr_ljungbox
from statsmodels.stats.stattools import durbin_watson
# Heteroskedasticity test
lm_stat, lm_pval, f_stat, f_pval = het_breuschpagan(results.resid, results.model.exog)
print(f"Breusch-Pagan p-value: {lm_pval}") # p < 0.05 indicates heteroskedasticity
# Autocorrelation
dw = durbin_watson(results.resid)
print(f"Durbin-Watson: {dw}") # Should be ~2 for no autocorrelation
# Autocorrelation test
lb_test = acorr_ljungbox(results.resid, lags=10)
# Normality test
from scipy.stats import shapiro
stat, pval = shapiro(results.resid)
print(f"Shapiro-Wilk p-value: {pval}") # p < 0.05 suggests non-normality
from statsmodels.stats.outliers_influence import variance_inflation_factor, OLSInfluence
# Variance Inflation Factor (multicollinearity)
vif_data = pd.DataFrame()
vif_data["Variable"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif_data)
# VIF > 10 indicates serious multicollinearity
# VIF > 5 is concerning
# Influence measures
influence = OLSInfluence(results)
# Cook's Distance (overall influence)
cooks_d = influence.cooks_distance[0]
# Rule of thumb: D > 4/n is influential
# DFFITS (influence on fitted value)
dffits = influence.dffits[0]
# Rule of thumb: |DFFITS| > 2√(p/n) is influential
# Leverage (hat values)
leverage = influence.hat_matrix_diag
# Rule of thumb: h > 2p/n or h > 3p/n is high leverage
# Studentized residuals
student_resid = influence.resid_studentized_internal
# |r| > 2 or 3 suggests outlier in response space
# Summary table
influence_summary = influence.summary_frame()
print(influence_summary)
from sklearn.linear_model import (
LinearRegression, Ridge, Lasso, ElasticNet,
Lars, LassoLars, BayesianRidge, HuberRegressor,
RANSACRegressor, TheilSenRegressor
)
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, GridSearchCV
# Standard Linear Regression
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
model.coef_, model.intercept_
# Ridge Regression (L2 regularization)
# Good when many correlated predictors
ridge = Ridge(alpha=1.0) # Higher alpha = more regularization
ridge.fit(X_train, y_train)
# Lasso Regression (L1 regularization)
# Performs feature selection by setting coefficients to zero
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
# Check which features selected
selected_features = X.columns[lasso.coef_ != 0]
# ElasticNet (L1 + L2)
# l1_ratio=1.0 is Lasso, l1_ratio=0.0 is Ridge
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic.fit(X_train, y_train)
# Huber Regression (robust to outliers)
huber = HuberRegressor(epsilon=1.35)
huber.fit(X_train, y_train)
# RANSAC (extremely robust, finds inlier subset)
ransac = RANSACRegressor(min_samples=0.5, residual_threshold=2.0)
ransac.fit(X_train, y_train)
inlier_mask = ransac.inlier_mask_
# Theil-Sen (robust, good for small datasets)
theil = TheilSenRegressor()
theil.fit(X_train, y_train)
from sklearn.model_selection import (
cross_val_score, cross_validate,
GridSearchCV, RandomizedSearchCV,
KFold, RepeatedKFold
)
from sklearn.metrics import (
mean_squared_error, mean_absolute_error,
r2_score, mean_absolute_percentage_error
)
# Simple cross-validation
scores = cross_val_score(model, X, y, cv=5,
scoring='neg_mean_squared_error')
rmse_scores = np.sqrt(-scores)
print(f"RMSE: {rmse_scores.mean():.3f} (+/- {rmse_scores.std():.3f})")
# Multiple metrics
cv_results = cross_validate(
model, X, y, cv=5,
scoring=['neg_mean_squared_error', 'r2', 'neg_mean_absolute_error'],
return_train_score=True
)
# Grid search for optimal hyperparameters
param_grid = {
'alpha': [0.001, 0.01, 0.1, 1.0, 10.0],
'l1_ratio': [0.1, 0.3, 0.5, 0.7, 0.9]
}
grid_search = GridSearchCV(
ElasticNet(),
param_grid,
cv=5,
scoring='neg_mean_squared_error',
n_jobs=-1
)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {np.sqrt(-grid_search.best_score_):.3f}")
# Use best model
best_model = grid_search.best_estimator_
from sklearn.ensemble import (
RandomForestRegressor, GradientBoostingRegressor,
AdaBoostRegressor, BaggingRegressor, ExtraTreesRegressor
)
from sklearn.tree import DecisionTreeRegressor
# Random Forest
rf = RandomForestRegressor(
n_estimators=100,
max_depth=10,
min_samples_split=5,
min_samples_leaf=2,
random_state=42,
n_jobs=-1
)
rf.fit(X_train, y_train)
# Feature importance
importance_df = pd.DataFrame({
'feature': X.columns,
'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)
# Gradient Boosting
gb = GradientBoostingRegressor(
n_estimators=100,
learning_rate=0.1,
max_depth=3,
subsample=0.8,
random_state=42
)
gb.fit(X_train, y_train)
# XGBoost (if installed)
try:
import xgboost as xgb
xgb_model = xgb.XGBRegressor(
n_estimators=100,
learning_rate=0.1,
max_depth=5,
random_state=42
)
xgb_model.fit(X_train, y_train)
except ImportError:
print("XGBoost not installed. Use: pip install xgboost")
from sklearn.svm import SVR
from sklearn.kernel_ridge import KernelRidge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
# Support Vector Regression
svr = SVR(kernel='rbf', C=1.0, epsilon=0.1)
svr.fit(X_train, y_train)
# Kernel Ridge Regression
kr = KernelRidge(alpha=1.0, kernel='rbf')
kr.fit(X_train, y_train)
# K-Nearest Neighbors
knn = KNeighborsRegressor(n_neighbors=5, weights='distance')
knn.fit(X_train, y_train)
# Gaussian Process (with uncertainty quantification)
kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=1.0)
gp = GaussianProcessRegressor(kernel=kernel, random_state=42)
gp.fit(X_train, y_train)
# Predictions with uncertainty
y_pred, sigma = gp.predict(X_test, return_std=True)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.compose import TransformedTargetRegressor
# Standard pipeline with scaling
pipeline = Pipeline([
('scaler', StandardScaler()),
('model', Ridge(alpha=1.0))
])
pipeline.fit(X_train, y_train)
# Polynomial features
poly_pipeline = Pipeline([
('poly', PolynomialFeatures(degree=2, include_bias=False)),
('scaler', StandardScaler()),
('model', Ridge(alpha=1.0))
])
# Target transformation (e.g., log transform y)
log_model = TransformedTargetRegressor(
regressor=Ridge(alpha=1.0),
func=np.log1p,
inverse_func=np.expm1
)
log_model.fit(X_train, y_train)
import numpy as np
from scipy import stats
# Z-score method
def detect_outliers_zscore(data, threshold=3):
"""
Outliers are points with |z-score| > threshold
Works well for normally distributed data
"""
z_scores = np.abs(stats.zscore(data))
return z_scores > threshold
# Modified Z-score (MAD - Median Absolute Deviation)
def detect_outliers_mad(data, threshold=3.5):
"""
More robust than Z-score, uses median instead of mean
Recommended threshold: 3.5
"""
median = np.median(data)
mad = np.median(np.abs(data - median))
modified_z_scores = 0.6745 * (data - median) / mad
return np.abs(modified_z_scores) > threshold
# IQR method
def detect_outliers_iqr(data, factor=1.5):
"""
Outliers are below Q1 - factor*IQR or above Q3 + factor*IQR
factor=1.5: outliers, factor=3.0: extreme outliers
"""
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
iqr = q3 - q1
lower_bound = q1 - factor * iqr
upper_bound = q3 + factor * iqr
return (data < lower_bound) | (data > upper_bound)
# Grubbs test (for single outlier in normal data)
def grubbs_test(data, alpha=0.05):
"""
Tests if maximum or minimum is an outlier
Assumes data is normally distributed
"""
from scipy.stats import t
n = len(data)
mean = np.mean(data)
std = np.std(data, ddof=1)
# Test statistic
G_max = np.max(np.abs(data - mean)) / std
# Critical value
t_crit = t.ppf(1 - alpha / (2 * n), n - 2)
G_crit = ((n - 1) / np.sqrt(n)) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))
return G_max > G_crit
# Example usage
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100]) # 100 is outlier
outliers_z = detect_outliers_zscore(data)
outliers_mad = detect_outliers_mad(data)
outliers_iqr = detect_outliers_iqr(data)
PyOD (Python Outlier Detection) provides 40+ algorithms. Install with: pip install pyod
from pyod.models.knn import KNN
from pyod.models.lof import LOF
from pyod.models.iforest import IForest
from pyod.models.ocsvm import OCSVM
from pyod.models.pca import PCA as PCA_OD
from pyod.models.mcd import MCD
from pyod.models.abod import ABOD
from pyod.models.copod import COPOD
from pyod.models.ecod import ECOD
from pyod.models.combination import average, maximization
# General pattern for PyOD detectors
detector = KNN(contamination=0.1) # Expect 10% outliers
detector.fit(X)
# Get outlier labels (0=inlier, 1=outlier)
outlier_labels = detector.labels_
# Get outlier scores (higher = more outlier-like)
outlier_scores = detector.decision_scores_
# Predict on new data
new_labels = detector.predict(X_new)
new_scores = detector.decision_function(X_new)
# K-Nearest Neighbors (KNN)
# Simple, fast, good baseline
knn = KNN(contamination=0.1, n_neighbors=5, method='mean')
knn.fit(X)
# Local Outlier Factor (LOF)
# Captures local density deviations
# Good for datasets with varying densities
lof = LOF(contamination=0.1, n_neighbors=20)
lof.fit(X)
# Connectivity-Based Outlier Factor (COF)
from pyod.models.cof import COF
cof = COF(contamination=0.1, n_neighbors=20)
cof.fit(X)
# COPOD (Copula-Based Outlier Detection)
# Fast, parameter-free, works well on high-dimensional data
copod = COPOD(contamination=0.1)
copod.fit(X)
# ECOD (Empirical Cumulative Distribution)
# Very fast, no parameters, good for tabular data
ecod = ECOD(contamination=0.1)
ecod.fit(X)
# ABOD (Angle-Based Outlier Detection)
# Effective in high-dimensional spaces
# Slow on large datasets
abod = ABOD(contamination=0.1)
abod.fit(X)
# PCA-based detection
# Outliers have large reconstruction error
pca = PCA_OD(contamination=0.1, n_components=None)
pca.fit(X)
# MCD (Minimum Covariance Determinant)
# Robust covariance estimation
# Assumes data is roughly Gaussian
mcd = MCD(contamination=0.1)
mcd.fit(X)
# One-Class SVM
# Learns boundary around normal points
ocsvm = OCSVM(contamination=0.1, kernel='rbf', nu=0.1)
ocsvm.fit(X)
# Isolation Forest
# Fast, scalable, effective for high-dimensional data
# Isolates outliers by random partitioning
iforest = IForest(
contamination=0.1,
n_estimators=100,
max_features=1.0,
random_state=42
)
iforest.fit(X)
# Combine multiple detectors
from pyod.models.lscp import LSCP
from pyod.models.feature_bagging import FeatureBagging
# Feature Bagging
# Trains multiple detectors on random feature subsets
fb = FeatureBagging(
base_estimator=LOF(contamination=0.1),
n_estimators=10,
contamination=0.1,
random_state=42
)
fb.fit(X)
# Average multiple detector scores
detectors = [KNN(), LOF(), IForest(), COPOD()]
scores_list = []
for detector in detectors:
detector.fit(X)
scores_list.append(detector.decision_scores_)
# Average combination
avg_scores = average(scores_list)
# Maximum combination (more conservative)
max_scores = maximization(scores_list)
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.covariance import EllipticEnvelope
from sklearn.svm import OneClassSVM
# Isolation Forest
iso_forest = IsolationForest(
contamination=0.1,
random_state=42,
n_estimators=100
)
outlier_labels = iso_forest.fit_predict(X) # -1=outlier, 1=inlier
outlier_scores = iso_forest.score_samples(X) # More negative = more outlier
# Local Outlier Factor (novelty detection mode)
lof = LocalOutlierFactor(
n_neighbors=20,
contamination=0.1,
novelty=True # Enables predict on new data
)
lof.fit(X_train)
outlier_labels = lof.predict(X_test)
outlier_scores = lof.score_samples(X_test)
# Elliptic Envelope (assumes Gaussian distribution)
elliptic = EllipticEnvelope(contamination=0.1, random_state=42)
outlier_labels = elliptic.fit_predict(X)
# One-Class SVM
oc_svm = OneClassSVM(kernel='rbf', gamma='auto', nu=0.1)
outlier_labels = oc_svm.fit_predict(X)
Use IQR or Z-score when:
Use Modified Z-score (MAD) when:
Use Isolation Forest when:
Use LOF when:
Use COPOD/ECOD when:
Use PCA-based when:
Use MCD when:
Use ensemble methods when:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.graphics.gofplots import qqplot
from statsmodels.stats.outliers_influence import variance_inflation_factor, OLSInfluence
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson
import matplotlib.pyplot as plt
# Load data
df = pd.read_csv('data.csv')
# 1. Exploratory analysis
print(df.describe())
df.hist(bins=30, figsize=(15, 10))
pd.plotting.scatter_matrix(df, figsize=(15, 15))
# 2. Fit initial model
model = smf.ols('price ~ area + bedrooms + age + location', data=df)
results = model.fit()
print(results.summary())
# 3. Check assumptions
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# Residuals vs Fitted
axes[0, 0].scatter(results.fittedvalues, results.resid)
axes[0, 0].axhline(y=0, color='r', linestyle='--')
axes[0, 0].set_xlabel('Fitted values')
axes[0, 0].set_ylabel('Residuals')
axes[0, 0].set_title('Residuals vs Fitted')
# Q-Q plot
qqplot(results.resid, line='s', ax=axes[0, 1])
axes[0, 1].set_title('Normal Q-Q')
# Scale-Location
axes[1, 0].scatter(results.fittedvalues, np.sqrt(np.abs(results.resid_pearson)))
axes[1, 0].set_xlabel('Fitted values')
axes[1, 0].set_ylabel('√|Standardized Residuals|')
axes[1, 0].set_title('Scale-Location')
# Residuals vs Leverage
influence = OLSInfluence(results)
axes[1, 1].scatter(influence.hat_matrix_diag, results.resid_pearson)
axes[1, 1].set_xlabel('Leverage')
axes[1, 1].set_ylabel('Standardized Residuals')
axes[1, 1].set_title('Residuals vs Leverage')
plt.tight_layout()
# 4. Statistical tests
# Heteroskedasticity
lm_stat, lm_pval, f_stat, f_pval = het_breuschpagan(results.resid, results.model.exog)
print(f"\nBreusch-Pagan test p-value: {lm_pval:.4f}")
# Autocorrelation
dw = durbin_watson(results.resid)
print(f"Durbin-Watson statistic: {dw:.4f}")
# 5. Check multicollinearity
X = df[['area', 'bedrooms', 'age']] # Exclude categorical
X = sm.add_constant(X)
vif_data = pd.DataFrame()
vif_data["Variable"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print("\n", vif_data)
# 6. Identify influential points
influence_df = influence.summary_frame()
print("\nInfluential points (Cook's D > 4/n):")
n = len(df)
influential = influence_df[influence_df['cooks_d'] > 4/n]
print(influential[['cooks_d', 'student_resid', 'hat_diag']])
# 7. If assumptions violated, try transformations or robust regression
if lm_pval < 0.05: # Heteroskedasticity detected
print("\nFitting robust regression (RLM)...")
from statsmodels.robust.robust_linear_model import RLM
robust_model = RLM(df['price'], X, M=sm.robust.norms.HuberT())
robust_results = robust_model.fit()
print(robust_results.summary())
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import pandas as pd
import numpy as np
# Load and split data
df = pd.read_csv('data.csv')
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Define models to compare
models = {
'Ridge': Ridge(),
'Lasso': Lasso(),
'ElasticNet': ElasticNet(),
'RandomForest': RandomForestRegressor(random_state=42),
'GradientBoosting': GradientBoostingRegressor(random_state=42)
}
# Cross-validation comparison
results = {}
for name, model in models.items():
pipeline = Pipeline([
('scaler', StandardScaler()),
('model', model)
])
scores = cross_val_score(
pipeline, X_train, y_train,
cv=5, scoring='neg_mean_squared_error'
)
rmse_scores = np.sqrt(-scores)
results[name] = {
'mean_rmse': rmse_scores.mean(),
'std_rmse': rmse_scores.std()
}
print(f"{name}: RMSE = {rmse_scores.mean():.3f} (+/- {rmse_scores.std():.3f})")
# Hyperparameter tuning for best model
param_grid = {
'model__n_estimators': [50, 100, 200],
'model__max_depth': [5, 10, 15, None],
'model__min_samples_split': [2, 5, 10]
}
pipeline = Pipeline([
('scaler', StandardScaler()),
('model', RandomForestRegressor(random_state=42))
])
grid_search = GridSearchCV(
pipeline, param_grid, cv=5,
scoring='neg_mean_squared_error',
n_jobs=-1, verbose=1
)
grid_search.fit(X_train, y_train)
print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best CV RMSE: {np.sqrt(-grid_search.best_score_):.3f}")
# Evaluate on test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
print(f"\nTest set performance:")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.3f}")
print(f"MAE: {mean_absolute_error(y_test, y_pred):.3f}")
print(f"R²: {r2_score(y_test, y_pred):.3f}")
# Feature importance
if hasattr(best_model.named_steps['model'], 'feature_importances_'):
importance_df = pd.DataFrame({
'feature': X.columns,
'importance': best_model.named_steps['model'].feature_importances_
}).sort_values('importance', ascending=False)
print("\nTop 10 features:")
print(importance_df.head(10))
import numpy as np
import pandas as pd
from pyod.models.iforest import IForest
from pyod.models.lof import LOF
from pyod.models.copod import COPOD
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
# Load data
df = pd.read_csv('data.csv')
X = df.drop('target', axis=1).values
# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Multiple detection methods
detectors = {
'Isolation Forest': IForest(contamination=0.1, random_state=42),
'LOF': LOF(contamination=0.1, n_neighbors=20),
'COPOD': COPOD(contamination=0.1)
}
# Fit detectors and collect scores
scores_df = pd.DataFrame(index=df.index)
labels_df = pd.DataFrame(index=df.index)
for name, detector in detectors.items():
detector.fit(X_scaled)
scores_df[name] = detector.decision_scores_
labels_df[name] = detector.labels_
# Consensus: point is outlier if flagged by at least 2 methods
consensus_outliers = (labels_df.sum(axis=1) >= 2)
print(f"Outliers detected by consensus: {consensus_outliers.sum()}")
# Investigate outliers
outlier_indices = df[consensus_outliers].index
print("\nOutlier summary statistics:")
print(df.loc[outlier_indices].describe())
print("\nNormal data summary statistics:")
print(df.loc[~consensus_outliers].describe())
# Visualize outlier scores
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for i, (name, scores) in enumerate(scores_df.items()):
axes[i].hist(scores, bins=50, alpha=0.7)
threshold = np.percentile(scores, 90) # 90th percentile
axes[i].axvline(threshold, color='r', linestyle='--', label='Threshold')
axes[i].set_title(f'{name} Scores')
axes[i].set_xlabel('Outlier Score')
axes[i].set_ylabel('Frequency')
axes[i].legend()
plt.tight_layout()
# Decision: remove, investigate, or keep?
# Option 1: Remove outliers
df_clean = df[~consensus_outliers].copy()
# Option 2: Use robust regression that handles outliers
# (see RLM, Huber, RANSAC examples above)
# Option 3: Keep but analyze separately
df['is_outlier'] = consensus_outliers
import numpy as np
import pandas as pd
import statsmodels.api as sm
from pyod.models.iforest import IForest
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
# Load data
df = pd.read_csv('data.csv')
X = df[['x1', 'x2', 'x3']].values
y = df['y'].values
# Step 1: Detect outliers in predictor space
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
detector = IForest(contamination=0.05, random_state=42)
detector.fit(X_scaled)
outliers_X = detector.labels_ == 1
print(f"Outliers in predictor space: {outliers_X.sum()}")
# Step 2: Fit initial model
X_with_const = sm.add_constant(X)
model = sm.OLS(y, X_with_const)
results = model.fit()
# Step 3: Identify outliers in response space
from statsmodels.stats.outliers_influence import OLSInfluence
influence = OLSInfluence(results)
studentized_resid = influence.resid_studentized_internal
outliers_y = np.abs(studentized_resid) > 3
print(f"Outliers in response space: {outliers_y.sum()}")
# Step 4: Identify influential points
cooks_d = influence.cooks_distance[0]
n, p = X_with_const.shape
influential = cooks_d > 4/n
print(f"Influential points: {influential.sum()}")
# Step 5: Categorize points
df['outlier_X'] = outliers_X
df['outlier_y'] = outliers_y
df['influential'] = influential
df['any_flag'] = outliers_X | outliers_y | influential
# Step 6: Visualize
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Plot 1: Residuals vs Fitted
colors = ['red' if flag else 'blue' for flag in df['any_flag']]
axes[0].scatter(results.fittedvalues, results.resid, c=colors, alpha=0.6)
axes[0].axhline(y=0, color='black', linestyle='--')
axes[0].set_xlabel('Fitted values')
axes[0].set_ylabel('Residuals')
axes[0].set_title('Residuals vs Fitted (red = flagged)')
# Plot 2: Cook's distance
axes[1].stem(range(len(cooks_d)), cooks_d, markerfmt=',')
axes[1].axhline(y=4/n, color='r', linestyle='--', label='Threshold (4/n)')
axes[1].set_xlabel('Observation index')
axes[1].set_ylabel("Cook's distance")
axes[1].set_title("Cook's Distance")
axes[1].legend()
plt.tight_layout()
# Step 7: Compare models
print("\n=== Original Model ===")
print(results.summary())
# Remove flagged points and refit
df_clean = df[~df['any_flag']].copy()
X_clean = df_clean[['x1', 'x2', 'x3']].values
y_clean = df_clean['y'].values
X_clean_const = sm.add_constant(X_clean)
model_clean = sm.OLS(y_clean, X_clean_const)
results_clean = model_clean.fit()
print("\n=== Model After Removing Flagged Points ===")
print(results_clean.summary())
# Alternative: Robust regression (keeps all data)
from statsmodels.robust.robust_linear_model import RLM
robust_model = RLM(y, X_with_const, M=sm.robust.norms.HuberT())
robust_results = robust_model.fit()
print("\n=== Robust Regression (All Data) ===")
print(robust_results.summary())
Problem: Testing many models/features and reporting only significant ones.
Solution:
from statsmodels.stats.multitest import multipletests
# Multiple hypothesis testing correction
p_values = [0.04, 0.03, 0.001, 0.15, 0.08]
reject, pvals_corrected, _, _ = multipletests(p_values, method='fdr_bh')
# Benjamini-Hochberg controls false discovery rate
Problem: Model fits training data perfectly but generalizes poorly.
Solution:
# Rule of thumb: need at least 10-20 observations per predictor
n_samples, n_features = X.shape
if n_samples < 10 * n_features:
print(f"Warning: Only {n_samples/n_features:.1f} samples per feature")
print("Consider: (1) regularization, (2) feature selection, (3) collect more data")
Problem: Using OLS when assumptions are violated leads to invalid inference.
Solution:
# Box-Cox transformation for normality
from scipy.stats import boxcox
y_transformed, lambda_param = boxcox(y) # Only works for positive y
print(f"Optimal lambda: {lambda_param}")
# lambda=0: log transform
# lambda=0.5: square root
# lambda=1: no transform needed
Problem: Arbitrarily removing points to improve fit.
Solution:
# Systematic approach to outlier investigation
def investigate_outliers(df, outlier_mask):
"""
Print detailed information about detected outliers
"""
outliers = df[outlier_mask]
normal = df[~outlier_mask]
print(f"Number of outliers: {len(outliers)} ({100*len(outliers)/len(df):.1f}%)")
print("\nOutlier characteristics:")
print(outliers.describe())
print("\nNormal data characteristics:")
print(normal.describe())
# Check for data entry errors
print("\nPotential data errors:")
for col in df.columns:
if df[col].dtype in [np.float64, np.int64]:
min_val, max_val = df[col].min(), df[col].max()
suspicious = (outliers[col] < min_val * 0.1) | (outliers[col] > max_val * 0.9)
if suspicious.any():
print(f" {col}: {suspicious.sum()} suspicious values")
Problem: Not all outliers are influential, and not all influential points are outliers.
Concepts:
Solution:
# Categorize points
def categorize_points(influence, results, n, p):
"""
Categorize observations based on influence measures
"""
leverage = influence.hat_matrix_diag
studentized_resid = influence.resid_studentized_internal
cooks_d = influence.cooks_distance[0]
high_leverage = leverage > 2 * p / n
outlier_y = np.abs(studentized_resid) > 3
influential = cooks_d > 4 / n
categories = []
for i in range(n):
if influential[i]:
categories.append('Influential')
elif high_leverage[i] and outlier_y[i]:
categories.append('High Leverage Outlier')
elif high_leverage[i]:
categories.append('High Leverage')
elif outlier_y[i]:
categories.append('Outlier')
else:
categories.append('Normal')
return pd.Series(categories, name='Category')
Problem: Reporting point predictions without uncertainty.
Solution:
# Prediction intervals with statsmodels
predictions = results.get_prediction(X_new)
pred_df = predictions.summary_frame(alpha=0.05) # 95% intervals
# pred_df has columns: mean, mean_se, mean_ci_lower, mean_ci_upper,
# obs_ci_lower, obs_ci_upper
# Confidence interval: uncertainty in mean prediction
# Prediction interval: where we expect new observation to fall (wider)
import matplotlib.pyplot as plt
plt.scatter(X_test, y_test, label='Actual')
plt.plot(X_test, pred_df['mean'], 'r-', label='Prediction')
plt.fill_between(X_test, pred_df['obs_ci_lower'], pred_df['obs_ci_upper'],
alpha=0.2, label='95% Prediction Interval')
plt.legend()
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.graphics.gofplots import qqplot
def plot_diagnostics(results, figsize=(12, 10)):
"""
Create comprehensive diagnostic plots for regression
"""
fig, axes = plt.subplots(2, 2, figsize=figsize)
# 1. Residuals vs Fitted
axes[0, 0].scatter(results.fittedvalues, results.resid, alpha=0.6)
axes[0, 0].axhline(y=0, color='r', linestyle='--')
axes[0, 0].set_xlabel('Fitted values')
axes[0, 0].set_ylabel('Residuals')
axes[0, 0].set_title('Residuals vs Fitted')
# Add lowess smooth
from statsmodels.nonparametric.smoothers_lowess import lowess
smoothed = lowess(results.resid, results.fittedvalues, frac=0.3)
axes[0, 0].plot(smoothed[:, 0], smoothed[:, 1], 'g-', lw=2)
# 2. Q-Q plot
qqplot(results.resid, line='s', ax=axes[0, 1])
axes[0, 1].set_title('Normal Q-Q')
# 3. Scale-Location
resid_std = np.sqrt(np.abs(results.resid_pearson))
axes[1, 0].scatter(results.fittedvalues, resid_std, alpha=0.6)
axes[1, 0].set_xlabel('Fitted values')
axes[1, 0].set_ylabel('√|Standardized Residuals|')
axes[1, 0].set_title('Scale-Location')
smoothed = lowess(resid_std, results.fittedvalues, frac=0.3)
axes[1, 0].plot(smoothed[:, 0], smoothed[:, 1], 'r-', lw=2)
# 4. Residuals vs Leverage
from statsmodels.stats.outliers_influence import OLSInfluence
influence = OLSInfluence(results)
leverage = influence.hat_matrix_diag
cooks_d = influence.cooks_distance[0]
axes[1, 1].scatter(leverage, results.resid_pearson, alpha=0.6)
axes[1, 1].set_xlabel('Leverage')
axes[1, 1].set_ylabel('Standardized Residuals')
axes[1, 1].set_title('Residuals vs Leverage')
# Highlight high Cook's D points
n = len(results.resid)
high_cooks = cooks_d > 4/n
axes[1, 1].scatter(leverage[high_cooks], results.resid_pearson[high_cooks],
color='red', s=100, alpha=0.8, label="High Cook's D")
axes[1, 1].legend()
plt.tight_layout()
return fig
# Usage
# fig = plot_diagnostics(results)
# plt.show()
def plot_outlier_scores(detectors_dict, X, figsize=(15, 5)):
"""
Plot outlier score distributions for multiple detectors
"""
n_detectors = len(detectors_dict)
fig, axes = plt.subplots(1, n_detectors, figsize=figsize)
if n_detectors == 1:
axes = [axes]
for ax, (name, detector) in zip(axes, detectors_dict.items()):
detector.fit(X)
scores = detector.decision_scores_
labels = detector.labels_
# Histogram
ax.hist(scores[labels == 0], bins=50, alpha=0.6, label='Inliers')
ax.hist(scores[labels == 1], bins=50, alpha=0.6, label='Outliers')
ax.set_xlabel('Outlier Score')
ax.set_ylabel('Frequency')
ax.set_title(name)
ax.legend()
plt.tight_layout()
return fig
| Goal | Method | Library |
|------|--------|---------|
| Simple linear regression with inference | OLS | statsmodels |
| Regression with correlated predictors | Ridge | sklearn |
| Feature selection in regression | Lasso | sklearn |
| Robust to outliers | RLM, HuberRegressor, RANSACRegressor | statsmodels, sklearn |
| Count data | GLM(Poisson), GLM(NegativeBinomial) | statsmodels |
| Time series forecasting | ARIMA, SARIMAX | statsmodels |
| High-dimensional data | Lasso, ElasticNet, Ridge | sklearn |
| Complex nonlinear patterns | RandomForest, GradientBoosting | sklearn |
| Fast outlier detection | COPOD, ECOD, IsolationForest | PyOD, sklearn |
| Robust outlier detection | LOF, Ensemble methods | PyOD, sklearn |
| Univariate outliers | IQR, MAD, Z-score | scipy |
# Statistical modeling
pip install statsmodels scipy
# Machine learning
pip install scikit-learn
# Outlier detection
pip install pyod
# Optional: advanced methods
pip install xgboost lightgbm
tools
Spreadsheet toolkit (.xlsx/.csv). Create/edit with formulas/formatting, analyze data, visualization, recalculate formulas, for spreadsheet processing and analysis.
tools
Expert assistant for VASP (Vienna Ab initio Simulation Package) calculations - input file generation, parameter selection, workflow setup, and best practices for accurate DFT calculations
data-ai
This skill should be used when working with pre-trained transformer models for natural language processing, computer vision, audio, or multimodal tasks. Use for text generation, classification, question answering, translation, summarization, image classification, object detection, speech recognition, and fine-tuning models on custom datasets.
tools
Graph Neural Networks (PyG). Node/graph classification, link prediction, GCN, GAT, GraphSAGE, heterogeneous graphs, molecular property prediction, for geometric deep learning.