Python Regression & Statistical Analysis

Expert assistant for regression analysis, statistical modeling, and outlier detection in Python. This skill covers statsmodels for statistical regression, scikit-learn for machine learning approaches, scipy for statistical methods, and PyOD for comprehensive outlier detection.

When to Use This Skill

Fitting linear and nonlinear regression models with proper statistical inference
Checking and validating regression assumptions (LINE: Linearity, Independence, Normality, Equal variance)
Performing model diagnostics: residual analysis, influential points, multicollinearity
Detecting and handling outliers using statistical, proximity-based, and ensemble methods
Comparing statistical vs machine learning regression approaches
Building robust regression pipelines with proper validation
Time series modeling and forecasting
Generalized linear models (GLM) for non-normal response variables

Philosophy: Statistical vs Machine Learning Approaches

Statistical Regression (statsmodels, scipy.stats)

Use when:

You need p-values, confidence intervals, and hypothesis testing
Understanding which variables are significant is important
You want to interpret coefficients causally
Sample size is moderate and you need inference guarantees
You need diagnostic tests for assumptions

Advantages:

Rich statistical inference (t-tests, F-tests, likelihood ratios)
Comprehensive diagnostics and assumption tests
Formula interface for easy model specification
Standard errors, confidence intervals, prediction intervals
Well-understood theoretical properties

Limitations:

Less flexible for complex nonlinear relationships
Requires careful assumption checking
May not handle high-dimensional data well
Less automated feature selection

Machine Learning Regression (scikit-learn)

Use when:

Prediction accuracy is the primary goal
You have many features and complex interactions
Relationships are highly nonlinear
You don't need statistical inference (p-values)
You have sufficient data for train/test splits

Advantages:

Handles complex nonlinear patterns
Built-in regularization (Ridge, Lasso, ElasticNet)
Excellent for high-dimensional data
Automated cross-validation and hyperparameter tuning
Ensemble methods for improved accuracy

Limitations:

No automatic p-values or statistical inference
Less interpretable coefficients
Requires more data for complex models
Black-box nature for some algorithms

Recommended Workflow

# 1. Exploratory Data Analysis
# - Visualize relationships (scatter plots, pair plots)
# - Check distributions (histograms, Q-Q plots)
# - Identify potential outliers (box plots, scatter plots)
# - Calculate summary statistics

# 2. Initial Outlier Screening (Optional)
# - Use multiple detection methods (IQR, Z-score, isolation forest)
# - Visualize flagged points
# - Investigate why they're outliers (data errors vs real extremes)
# - Document decision to keep or remove

# 3. Check Assumptions (for statistical regression)
# - Linearity: residual vs fitted plots, partial regression plots
# - Independence: Durbin-Watson test, autocorrelation plots
# - Normality: Q-Q plots, Shapiro-Wilk test, histogram of residuals
# - Equal variance: scale-location plot, Breusch-Pagan test

# 4. Model Fitting
# - Start simple (OLS) then add complexity as needed
# - Try transformations if assumptions violated (log, Box-Cox)
# - Consider robust methods if outliers present
# - Use regularization if many predictors (Ridge/Lasso)

# 5. Model Diagnostics
# - Residual analysis
# - Check for influential points (Cook's D, DFFITS, leverage)
# - Multicollinearity (VIF)
# - Outliers in predictor space vs response space

# 6. Validation
# - Train/test split or cross-validation
# - Compare multiple models (AIC, BIC, R², RMSE)
# - Check prediction intervals (statistical) or residual distribution (ML)

# 7. Interpretation & Reporting
# - Coefficient interpretation with confidence intervals
# - Feature importance (if ML model)
# - Visualize predictions vs actuals
# - Report diagnostics and assumption checks

Statistical Regression with statsmodels

Ordinary Least Squares (OLS)

import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
import pandas as pd

# Arrays API - more control
X = sm.add_constant(X_raw)  # Add intercept
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())

# Formula API - easier specification
df = pd.DataFrame({'y': y, 'x1': x1, 'x2': x2, 'category': cat})
model = smf.ols('y ~ x1 + x2 + C(category)', data=df)
results = model.fit()

# Access key results
results.params           # Coefficients
results.pvalues         # P-values
results.conf_int()      # Confidence intervals
results.rsquared        # R²
results.rsquared_adj    # Adjusted R²
results.aic, results.bic  # Information criteria
results.resid           # Residuals
results.fittedvalues    # Predicted values

Weighted Least Squares (WLS)

# When heteroskedasticity is present
# Weights = 1/variance of each observation

# If you know the variance structure
weights = 1 / variance_array
model = sm.WLS(y, X, weights=weights)
results = model.fit()

# Iteratively Reweighted Least Squares
# Start with OLS, estimate variance function, refit
ols_results = sm.OLS(y, X).fit()
resid_squared = ols_results.resid**2

# Fit variance as function of predictors
variance_model = sm.OLS(np.log(resid_squared), X).fit()
weights = 1 / np.exp(variance_model.fittedvalues)

wls_results = sm.WLS(y, X, weights=weights).fit()

Robust Regression (RLM)

# Resistant to outliers - downweights influential points
import statsmodels.robust.robust_linear_model as rlm

# Huber's T norm (default)
model = rlm.RLM(y, X, M=sm.robust.norms.HuberT())
results = model.fit()

# Other M-estimators
# TukeyBiweight - more aggressive outlier downweighting
model = rlm.RLM(y, X, M=sm.robust.norms.TukeyBiweight())
results = model.fit()

# Check weights assigned to each observation
weights = results.weights  # Outliers get lower weights

Generalized Linear Models (GLM)

# For non-normal response variables

# Poisson regression (count data)
model = sm.GLM(y, X, family=sm.families.Poisson())
results = model.fit()

# Negative Binomial (overdispersed counts)
model = sm.GLM(y, X, family=sm.families.NegativeBinomial())
results = model.fit()

# Gamma regression (positive continuous data)
model = sm.GLM(y, X, family=sm.families.Gamma())
results = model.fit()

# Binomial/Logistic regression
model = sm.GLM(y, X, family=sm.families.Binomial())
results = model.fit()

# Link functions
# family=sm.families.Gaussian(link=sm.families.links.log())

Time Series Models

from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.vector_ar.var_model import VAR

# ARIMA (AutoRegressive Integrated Moving Average)
model = ARIMA(y, order=(p, d, q))  # p=AR order, d=differencing, q=MA order
results = model.fit()

# Seasonal ARIMA
model = SARIMAX(y, order=(p, d, q), seasonal_order=(P, D, Q, s))
results = model.fit()

# Forecasting
forecast = results.forecast(steps=10)
forecast_with_ci = results.get_forecast(steps=10)
forecast_df = forecast_with_ci.summary_frame()

# Vector Autoregression (multivariate time series)
model = VAR(df[['y1', 'y2', 'y3']])
results = model.fit(maxlags=5)
results.forecast(df.values[-results.k_ar:], steps=10)

Model Diagnostics - statsmodels

# Comprehensive diagnostic plots
from statsmodels.graphics.gofplots import qqplot
import matplotlib.pyplot as plt

# 1. Residual vs Fitted
plt.scatter(results.fittedvalues, results.resid)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
# Should show random scatter, no pattern

# 2. Q-Q plot for normality
qqplot(results.resid, line='s')
# Points should follow diagonal line

# 3. Scale-Location (heteroskedasticity)
plt.scatter(results.fittedvalues, np.sqrt(np.abs(results.resid_pearson)))
plt.xlabel('Fitted values')
plt.ylabel('√|Standardized Residuals|')
# Should show constant spread

# 4. Residuals vs Leverage
from statsmodels.stats.outliers_influence import OLSInfluence
influence = OLSInfluence(results)
fig, ax = plt.subplots()
ax.scatter(influence.hat_matrix_diag, results.resid_pearson)
# Identifies influential points (high leverage + large residual)

# Statistical tests
from statsmodels.stats.diagnostic import het_breuschpagan, acorr_ljungbox
from statsmodels.stats.stattools import durbin_watson

# Heteroskedasticity test
lm_stat, lm_pval, f_stat, f_pval = het_breuschpagan(results.resid, results.model.exog)
print(f"Breusch-Pagan p-value: {lm_pval}")  # p < 0.05 indicates heteroskedasticity

# Autocorrelation
dw = durbin_watson(results.resid)
print(f"Durbin-Watson: {dw}")  # Should be ~2 for no autocorrelation

# Autocorrelation test
lb_test = acorr_ljungbox(results.resid, lags=10)

# Normality test
from scipy.stats import shapiro
stat, pval = shapiro(results.resid)
print(f"Shapiro-Wilk p-value: {pval}")  # p < 0.05 suggests non-normality

Influential Points and Multicollinearity

from statsmodels.stats.outliers_influence import variance_inflation_factor, OLSInfluence

# Variance Inflation Factor (multicollinearity)
vif_data = pd.DataFrame()
vif_data["Variable"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif_data)
# VIF > 10 indicates serious multicollinearity
# VIF > 5 is concerning

# Influence measures
influence = OLSInfluence(results)

# Cook's Distance (overall influence)
cooks_d = influence.cooks_distance[0]
# Rule of thumb: D > 4/n is influential

# DFFITS (influence on fitted value)
dffits = influence.dffits[0]
# Rule of thumb: |DFFITS| > 2√(p/n) is influential

# Leverage (hat values)
leverage = influence.hat_matrix_diag
# Rule of thumb: h > 2p/n or h > 3p/n is high leverage

# Studentized residuals
student_resid = influence.resid_studentized_internal
# |r| > 2 or 3 suggests outlier in response space

# Summary table
influence_summary = influence.summary_frame()
print(influence_summary)

Machine Learning Regression with scikit-learn

Linear Models with Regularization

from sklearn.linear_model import (
    LinearRegression, Ridge, Lasso, ElasticNet,
    Lars, LassoLars, BayesianRidge, HuberRegressor,
    RANSACRegressor, TheilSenRegressor
)
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, GridSearchCV

# Standard Linear Regression
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
model.coef_, model.intercept_

# Ridge Regression (L2 regularization)
# Good when many correlated predictors
ridge = Ridge(alpha=1.0)  # Higher alpha = more regularization
ridge.fit(X_train, y_train)

# Lasso Regression (L1 regularization)
# Performs feature selection by setting coefficients to zero
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
# Check which features selected
selected_features = X.columns[lasso.coef_ != 0]

# ElasticNet (L1 + L2)
# l1_ratio=1.0 is Lasso, l1_ratio=0.0 is Ridge
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic.fit(X_train, y_train)

# Huber Regression (robust to outliers)
huber = HuberRegressor(epsilon=1.35)
huber.fit(X_train, y_train)

# RANSAC (extremely robust, finds inlier subset)
ransac = RANSACRegressor(min_samples=0.5, residual_threshold=2.0)
ransac.fit(X_train, y_train)
inlier_mask = ransac.inlier_mask_

# Theil-Sen (robust, good for small datasets)
theil = TheilSenRegressor()
theil.fit(X_train, y_train)

Cross-Validation and Hyperparameter Tuning

from sklearn.model_selection import (
    cross_val_score, cross_validate,
    GridSearchCV, RandomizedSearchCV,
    KFold, RepeatedKFold
)
from sklearn.metrics import (
    mean_squared_error, mean_absolute_error,
    r2_score, mean_absolute_percentage_error
)

# Simple cross-validation
scores = cross_val_score(model, X, y, cv=5,
                         scoring='neg_mean_squared_error')
rmse_scores = np.sqrt(-scores)
print(f"RMSE: {rmse_scores.mean():.3f} (+/- {rmse_scores.std():.3f})")

# Multiple metrics
cv_results = cross_validate(
    model, X, y, cv=5,
    scoring=['neg_mean_squared_error', 'r2', 'neg_mean_absolute_error'],
    return_train_score=True
)

# Grid search for optimal hyperparameters
param_grid = {
    'alpha': [0.001, 0.01, 0.1, 1.0, 10.0],
    'l1_ratio': [0.1, 0.3, 0.5, 0.7, 0.9]
}
grid_search = GridSearchCV(
    ElasticNet(),
    param_grid,
    cv=5,
    scoring='neg_mean_squared_error',
    n_jobs=-1
)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {np.sqrt(-grid_search.best_score_):.3f}")

# Use best model
best_model = grid_search.best_estimator_

Tree-Based and Ensemble Methods

from sklearn.ensemble import (
    RandomForestRegressor, GradientBoostingRegressor,
    AdaBoostRegressor, BaggingRegressor, ExtraTreesRegressor
)
from sklearn.tree import DecisionTreeRegressor

# Random Forest
rf = RandomForestRegressor(
    n_estimators=100,
    max_depth=10,
    min_samples_split=5,
    min_samples_leaf=2,
    random_state=42,
    n_jobs=-1
)
rf.fit(X_train, y_train)

# Feature importance
importance_df = pd.DataFrame({
    'feature': X.columns,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)

# Gradient Boosting
gb = GradientBoostingRegressor(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    subsample=0.8,
    random_state=42
)
gb.fit(X_train, y_train)

# XGBoost (if installed)
try:
    import xgboost as xgb
    xgb_model = xgb.XGBRegressor(
        n_estimators=100,
        learning_rate=0.1,
        max_depth=5,
        random_state=42
    )
    xgb_model.fit(X_train, y_train)
except ImportError:
    print("XGBoost not installed. Use: pip install xgboost")

Other ML Regression Methods

from sklearn.svm import SVR
from sklearn.kernel_ridge import KernelRidge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Support Vector Regression
svr = SVR(kernel='rbf', C=1.0, epsilon=0.1)
svr.fit(X_train, y_train)

# Kernel Ridge Regression
kr = KernelRidge(alpha=1.0, kernel='rbf')
kr.fit(X_train, y_train)

# K-Nearest Neighbors
knn = KNeighborsRegressor(n_neighbors=5, weights='distance')
knn.fit(X_train, y_train)

# Gaussian Process (with uncertainty quantification)
kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=1.0)
gp = GaussianProcessRegressor(kernel=kernel, random_state=42)
gp.fit(X_train, y_train)

# Predictions with uncertainty
y_pred, sigma = gp.predict(X_test, return_std=True)

Building Complete Pipelines

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.compose import TransformedTargetRegressor

# Standard pipeline with scaling
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', Ridge(alpha=1.0))
])
pipeline.fit(X_train, y_train)

# Polynomial features
poly_pipeline = Pipeline([
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('scaler', StandardScaler()),
    ('model', Ridge(alpha=1.0))
])

# Target transformation (e.g., log transform y)
log_model = TransformedTargetRegressor(
    regressor=Ridge(alpha=1.0),
    func=np.log1p,
    inverse_func=np.expm1
)
log_model.fit(X_train, y_train)

Outlier Detection

Statistical Methods (scipy, statsmodels)

import numpy as np
from scipy import stats

# Z-score method
def detect_outliers_zscore(data, threshold=3):
    """
    Outliers are points with |z-score| > threshold
    Works well for normally distributed data
    """
    z_scores = np.abs(stats.zscore(data))
    return z_scores > threshold

# Modified Z-score (MAD - Median Absolute Deviation)
def detect_outliers_mad(data, threshold=3.5):
    """
    More robust than Z-score, uses median instead of mean
    Recommended threshold: 3.5
    """
    median = np.median(data)
    mad = np.median(np.abs(data - median))
    modified_z_scores = 0.6745 * (data - median) / mad
    return np.abs(modified_z_scores) > threshold

# IQR method
def detect_outliers_iqr(data, factor=1.5):
    """
    Outliers are below Q1 - factor*IQR or above Q3 + factor*IQR
    factor=1.5: outliers, factor=3.0: extreme outliers
    """
    q1 = np.percentile(data, 25)
    q3 = np.percentile(data, 75)
    iqr = q3 - q1
    lower_bound = q1 - factor * iqr
    upper_bound = q3 + factor * iqr
    return (data < lower_bound) | (data > upper_bound)

# Grubbs test (for single outlier in normal data)
def grubbs_test(data, alpha=0.05):
    """
    Tests if maximum or minimum is an outlier
    Assumes data is normally distributed
    """
    from scipy.stats import t
    n = len(data)
    mean = np.mean(data)
    std = np.std(data, ddof=1)

    # Test statistic
    G_max = np.max(np.abs(data - mean)) / std

    # Critical value
    t_crit = t.ppf(1 - alpha / (2 * n), n - 2)
    G_crit = ((n - 1) / np.sqrt(n)) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))

    return G_max > G_crit

# Example usage
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])  # 100 is outlier
outliers_z = detect_outliers_zscore(data)
outliers_mad = detect_outliers_mad(data)
outliers_iqr = detect_outliers_iqr(data)

PyOD - Comprehensive Outlier Detection

PyOD (Python Outlier Detection) provides 40+ algorithms. Install with: pip install pyod

from pyod.models.knn import KNN
from pyod.models.lof import LOF
from pyod.models.iforest import IForest
from pyod.models.ocsvm import OCSVM
from pyod.models.pca import PCA as PCA_OD
from pyod.models.mcd import MCD
from pyod.models.abod import ABOD
from pyod.models.copod import COPOD
from pyod.models.ecod import ECOD
from pyod.models.combination import average, maximization

# General pattern for PyOD detectors
detector = KNN(contamination=0.1)  # Expect 10% outliers
detector.fit(X)

# Get outlier labels (0=inlier, 1=outlier)
outlier_labels = detector.labels_

# Get outlier scores (higher = more outlier-like)
outlier_scores = detector.decision_scores_

# Predict on new data
new_labels = detector.predict(X_new)
new_scores = detector.decision_function(X_new)

Proximity-Based Detectors

# K-Nearest Neighbors (KNN)
# Simple, fast, good baseline
knn = KNN(contamination=0.1, n_neighbors=5, method='mean')
knn.fit(X)

# Local Outlier Factor (LOF)
# Captures local density deviations
# Good for datasets with varying densities
lof = LOF(contamination=0.1, n_neighbors=20)
lof.fit(X)

# Connectivity-Based Outlier Factor (COF)
from pyod.models.cof import COF
cof = COF(contamination=0.1, n_neighbors=20)
cof.fit(X)

Probabilistic Detectors

# COPOD (Copula-Based Outlier Detection)
# Fast, parameter-free, works well on high-dimensional data
copod = COPOD(contamination=0.1)
copod.fit(X)

# ECOD (Empirical Cumulative Distribution)
# Very fast, no parameters, good for tabular data
ecod = ECOD(contamination=0.1)
ecod.fit(X)

# ABOD (Angle-Based Outlier Detection)
# Effective in high-dimensional spaces
# Slow on large datasets
abod = ABOD(contamination=0.1)
abod.fit(X)

Linear Model Detectors

# PCA-based detection
# Outliers have large reconstruction error
pca = PCA_OD(contamination=0.1, n_components=None)
pca.fit(X)

# MCD (Minimum Covariance Determinant)
# Robust covariance estimation
# Assumes data is roughly Gaussian
mcd = MCD(contamination=0.1)
mcd.fit(X)

# One-Class SVM
# Learns boundary around normal points
ocsvm = OCSVM(contamination=0.1, kernel='rbf', nu=0.1)
ocsvm.fit(X)

Isolation-Based Detectors

# Isolation Forest
# Fast, scalable, effective for high-dimensional data
# Isolates outliers by random partitioning
iforest = IForest(
    contamination=0.1,
    n_estimators=100,
    max_features=1.0,
    random_state=42
)
iforest.fit(X)

Ensemble Methods

# Combine multiple detectors
from pyod.models.lscp import LSCP
from pyod.models.feature_bagging import FeatureBagging

# Feature Bagging
# Trains multiple detectors on random feature subsets
fb = FeatureBagging(
    base_estimator=LOF(contamination=0.1),
    n_estimators=10,
    contamination=0.1,
    random_state=42
)
fb.fit(X)

# Average multiple detector scores
detectors = [KNN(), LOF(), IForest(), COPOD()]
scores_list = []
for detector in detectors:
    detector.fit(X)
    scores_list.append(detector.decision_scores_)

# Average combination
avg_scores = average(scores_list)

# Maximum combination (more conservative)
max_scores = maximization(scores_list)

scikit-learn Outlier Detection

from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.covariance import EllipticEnvelope
from sklearn.svm import OneClassSVM

# Isolation Forest
iso_forest = IsolationForest(
    contamination=0.1,
    random_state=42,
    n_estimators=100
)
outlier_labels = iso_forest.fit_predict(X)  # -1=outlier, 1=inlier
outlier_scores = iso_forest.score_samples(X)  # More negative = more outlier

# Local Outlier Factor (novelty detection mode)
lof = LocalOutlierFactor(
    n_neighbors=20,
    contamination=0.1,
    novelty=True  # Enables predict on new data
)
lof.fit(X_train)
outlier_labels = lof.predict(X_test)
outlier_scores = lof.score_samples(X_test)

# Elliptic Envelope (assumes Gaussian distribution)
elliptic = EllipticEnvelope(contamination=0.1, random_state=42)
outlier_labels = elliptic.fit_predict(X)

# One-Class SVM
oc_svm = OneClassSVM(kernel='rbf', gamma='auto', nu=0.1)
outlier_labels = oc_svm.fit_predict(X)

Choosing the Right Outlier Detection Method

Use IQR or Z-score when:

Univariate data (single variable)
Quick screening needed
Interpretability is important

Use Modified Z-score (MAD) when:

Data may already contain outliers (MAD is robust)
Distribution is roughly symmetric

Use Isolation Forest when:

High-dimensional data
No assumptions about data distribution
Fast computation needed
Good all-purpose choice

Use LOF when:

Varying density clusters in data
Local structure is important
Willing to tune n_neighbors parameter

Use COPOD/ECOD when:

Very fast detection needed
High-dimensional tabular data
Don't want to tune parameters

Use PCA-based when:

Data lies in lower-dimensional manifold
Want to understand which features contribute to outlierness

Use MCD when:

Multivariate Gaussian assumption reasonable
Robust covariance estimation needed

Use ensemble methods when:

Maximum robustness needed
Can afford computational cost
Different detector types disagree

Complete Workflow Examples

Example 1: Statistical Regression with Full Diagnostics

import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.graphics.gofplots import qqplot
from statsmodels.stats.outliers_influence import variance_inflation_factor, OLSInfluence
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson
import matplotlib.pyplot as plt

# Load data
df = pd.read_csv('data.csv')

# 1. Exploratory analysis
print(df.describe())
df.hist(bins=30, figsize=(15, 10))
pd.plotting.scatter_matrix(df, figsize=(15, 15))

# 2. Fit initial model
model = smf.ols('price ~ area + bedrooms + age + location', data=df)
results = model.fit()
print(results.summary())

# 3. Check assumptions
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Residuals vs Fitted
axes[0, 0].scatter(results.fittedvalues, results.resid)
axes[0, 0].axhline(y=0, color='r', linestyle='--')
axes[0, 0].set_xlabel('Fitted values')
axes[0, 0].set_ylabel('Residuals')
axes[0, 0].set_title('Residuals vs Fitted')

# Q-Q plot
qqplot(results.resid, line='s', ax=axes[0, 1])
axes[0, 1].set_title('Normal Q-Q')

# Scale-Location
axes[1, 0].scatter(results.fittedvalues, np.sqrt(np.abs(results.resid_pearson)))
axes[1, 0].set_xlabel('Fitted values')
axes[1, 0].set_ylabel('√|Standardized Residuals|')
axes[1, 0].set_title('Scale-Location')

# Residuals vs Leverage
influence = OLSInfluence(results)
axes[1, 1].scatter(influence.hat_matrix_diag, results.resid_pearson)
axes[1, 1].set_xlabel('Leverage')
axes[1, 1].set_ylabel('Standardized Residuals')
axes[1, 1].set_title('Residuals vs Leverage')

plt.tight_layout()

# 4. Statistical tests
# Heteroskedasticity
lm_stat, lm_pval, f_stat, f_pval = het_breuschpagan(results.resid, results.model.exog)
print(f"\nBreusch-Pagan test p-value: {lm_pval:.4f}")

# Autocorrelation
dw = durbin_watson(results.resid)
print(f"Durbin-Watson statistic: {dw:.4f}")

# 5. Check multicollinearity
X = df[['area', 'bedrooms', 'age']]  # Exclude categorical
X = sm.add_constant(X)
vif_data = pd.DataFrame()
vif_data["Variable"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print("\n", vif_data)

# 6. Identify influential points
influence_df = influence.summary_frame()
print("\nInfluential points (Cook's D > 4/n):")
n = len(df)
influential = influence_df[influence_df['cooks_d'] > 4/n]
print(influential[['cooks_d', 'student_resid', 'hat_diag']])

# 7. If assumptions violated, try transformations or robust regression
if lm_pval < 0.05:  # Heteroskedasticity detected
    print("\nFitting robust regression (RLM)...")
    from statsmodels.robust.robust_linear_model import RLM
    robust_model = RLM(df['price'], X, M=sm.robust.norms.HuberT())
    robust_results = robust_model.fit()
    print(robust_results.summary())

Example 2: ML Regression with Cross-Validation

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import pandas as pd
import numpy as np

# Load and split data
df = pd.read_csv('data.csv')
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Define models to compare
models = {
    'Ridge': Ridge(),
    'Lasso': Lasso(),
    'ElasticNet': ElasticNet(),
    'RandomForest': RandomForestRegressor(random_state=42),
    'GradientBoosting': GradientBoostingRegressor(random_state=42)
}

# Cross-validation comparison
results = {}
for name, model in models.items():
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('model', model)
    ])

    scores = cross_val_score(
        pipeline, X_train, y_train,
        cv=5, scoring='neg_mean_squared_error'
    )
    rmse_scores = np.sqrt(-scores)
    results[name] = {
        'mean_rmse': rmse_scores.mean(),
        'std_rmse': rmse_scores.std()
    }
    print(f"{name}: RMSE = {rmse_scores.mean():.3f} (+/- {rmse_scores.std():.3f})")

# Hyperparameter tuning for best model
param_grid = {
    'model__n_estimators': [50, 100, 200],
    'model__max_depth': [5, 10, 15, None],
    'model__min_samples_split': [2, 5, 10]
}

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestRegressor(random_state=42))
])

grid_search = GridSearchCV(
    pipeline, param_grid, cv=5,
    scoring='neg_mean_squared_error',
    n_jobs=-1, verbose=1
)
grid_search.fit(X_train, y_train)

print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best CV RMSE: {np.sqrt(-grid_search.best_score_):.3f}")

# Evaluate on test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

print(f"\nTest set performance:")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.3f}")
print(f"MAE: {mean_absolute_error(y_test, y_pred):.3f}")
print(f"R²: {r2_score(y_test, y_pred):.3f}")

# Feature importance
if hasattr(best_model.named_steps['model'], 'feature_importances_'):
    importance_df = pd.DataFrame({
        'feature': X.columns,
        'importance': best_model.named_steps['model'].feature_importances_
    }).sort_values('importance', ascending=False)
    print("\nTop 10 features:")
    print(importance_df.head(10))

Example 3: Outlier Detection Pipeline

import numpy as np
import pandas as pd
from pyod.models.iforest import IForest
from pyod.models.lof import LOF
from pyod.models.copod import COPOD
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Load data
df = pd.read_csv('data.csv')
X = df.drop('target', axis=1).values

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Multiple detection methods
detectors = {
    'Isolation Forest': IForest(contamination=0.1, random_state=42),
    'LOF': LOF(contamination=0.1, n_neighbors=20),
    'COPOD': COPOD(contamination=0.1)
}

# Fit detectors and collect scores
scores_df = pd.DataFrame(index=df.index)
labels_df = pd.DataFrame(index=df.index)

for name, detector in detectors.items():
    detector.fit(X_scaled)
    scores_df[name] = detector.decision_scores_
    labels_df[name] = detector.labels_

# Consensus: point is outlier if flagged by at least 2 methods
consensus_outliers = (labels_df.sum(axis=1) >= 2)
print(f"Outliers detected by consensus: {consensus_outliers.sum()}")

# Investigate outliers
outlier_indices = df[consensus_outliers].index
print("\nOutlier summary statistics:")
print(df.loc[outlier_indices].describe())
print("\nNormal data summary statistics:")
print(df.loc[~consensus_outliers].describe())

# Visualize outlier scores
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for i, (name, scores) in enumerate(scores_df.items()):
    axes[i].hist(scores, bins=50, alpha=0.7)
    threshold = np.percentile(scores, 90)  # 90th percentile
    axes[i].axvline(threshold, color='r', linestyle='--', label='Threshold')
    axes[i].set_title(f'{name} Scores')
    axes[i].set_xlabel('Outlier Score')
    axes[i].set_ylabel('Frequency')
    axes[i].legend()
plt.tight_layout()

# Decision: remove, investigate, or keep?
# Option 1: Remove outliers
df_clean = df[~consensus_outliers].copy()

# Option 2: Use robust regression that handles outliers
# (see RLM, Huber, RANSAC examples above)

# Option 3: Keep but analyze separately
df['is_outlier'] = consensus_outliers

Example 4: Combined Regression + Outlier Detection

import numpy as np
import pandas as pd
import statsmodels.api as sm
from pyod.models.iforest import IForest
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Load data
df = pd.read_csv('data.csv')
X = df[['x1', 'x2', 'x3']].values
y = df['y'].values

# Step 1: Detect outliers in predictor space
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

detector = IForest(contamination=0.05, random_state=42)
detector.fit(X_scaled)
outliers_X = detector.labels_ == 1

print(f"Outliers in predictor space: {outliers_X.sum()}")

# Step 2: Fit initial model
X_with_const = sm.add_constant(X)
model = sm.OLS(y, X_with_const)
results = model.fit()

# Step 3: Identify outliers in response space
from statsmodels.stats.outliers_influence import OLSInfluence
influence = OLSInfluence(results)
studentized_resid = influence.resid_studentized_internal
outliers_y = np.abs(studentized_resid) > 3

print(f"Outliers in response space: {outliers_y.sum()}")

# Step 4: Identify influential points
cooks_d = influence.cooks_distance[0]
n, p = X_with_const.shape
influential = cooks_d > 4/n

print(f"Influential points: {influential.sum()}")

# Step 5: Categorize points
df['outlier_X'] = outliers_X
df['outlier_y'] = outliers_y
df['influential'] = influential
df['any_flag'] = outliers_X | outliers_y | influential

# Step 6: Visualize
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Plot 1: Residuals vs Fitted
colors = ['red' if flag else 'blue' for flag in df['any_flag']]
axes[0].scatter(results.fittedvalues, results.resid, c=colors, alpha=0.6)
axes[0].axhline(y=0, color='black', linestyle='--')
axes[0].set_xlabel('Fitted values')
axes[0].set_ylabel('Residuals')
axes[0].set_title('Residuals vs Fitted (red = flagged)')

# Plot 2: Cook's distance
axes[1].stem(range(len(cooks_d)), cooks_d, markerfmt=',')
axes[1].axhline(y=4/n, color='r', linestyle='--', label='Threshold (4/n)')
axes[1].set_xlabel('Observation index')
axes[1].set_ylabel("Cook's distance")
axes[1].set_title("Cook's Distance")
axes[1].legend()

plt.tight_layout()

# Step 7: Compare models
print("\n=== Original Model ===")
print(results.summary())

# Remove flagged points and refit
df_clean = df[~df['any_flag']].copy()
X_clean = df_clean[['x1', 'x2', 'x3']].values
y_clean = df_clean['y'].values
X_clean_const = sm.add_constant(X_clean)

model_clean = sm.OLS(y_clean, X_clean_const)
results_clean = model_clean.fit()

print("\n=== Model After Removing Flagged Points ===")
print(results_clean.summary())

# Alternative: Robust regression (keeps all data)
from statsmodels.robust.robust_linear_model import RLM
robust_model = RLM(y, X_with_const, M=sm.robust.norms.HuberT())
robust_results = robust_model.fit()

print("\n=== Robust Regression (All Data) ===")
print(robust_results.summary())

Common Pitfalls and Solutions

1. P-hacking and Multiple Testing

Problem: Testing many models/features and reporting only significant ones.

Solution:

Pre-specify your model based on theory/prior knowledge
Use adjusted p-values (Bonferroni, Benjamini-Hochberg) for multiple comparisons
Focus on effect sizes and confidence intervals, not just p-values
Use train/test split or cross-validation for model selection

from statsmodels.stats.multitest import multipletests

# Multiple hypothesis testing correction
p_values = [0.04, 0.03, 0.001, 0.15, 0.08]
reject, pvals_corrected, _, _ = multipletests(p_values, method='fdr_bh')
# Benjamini-Hochberg controls false discovery rate

2. Overfitting with Many Features

Problem: Model fits training data perfectly but generalizes poorly.

Solution:

Use regularization (Ridge, Lasso, ElasticNet)
Cross-validation to estimate generalization performance
Feature selection based on domain knowledge
Simpler models when sample size is small

# Rule of thumb: need at least 10-20 observations per predictor
n_samples, n_features = X.shape
if n_samples < 10 * n_features:
    print(f"Warning: Only {n_samples/n_features:.1f} samples per feature")
    print("Consider: (1) regularization, (2) feature selection, (3) collect more data")

3. Ignoring Assumptions

Problem: Using OLS when assumptions are violated leads to invalid inference.

Solution:

Always check diagnostics (residual plots, Q-Q plots, tests)
Transform variables if needed (log, Box-Cox, square root)
Use robust methods when outliers present
Use WLS when heteroskedasticity detected
Use GLM when response distribution is non-normal

# Box-Cox transformation for normality
from scipy.stats import boxcox

y_transformed, lambda_param = boxcox(y)  # Only works for positive y
print(f"Optimal lambda: {lambda_param}")
# lambda=0: log transform
# lambda=0.5: square root
# lambda=1: no transform needed

4. Removing Outliers Without Justification

Problem: Arbitrarily removing points to improve fit.

Solution:

Investigate WHY points are outliers (data errors vs real extremes)
Document all data cleaning decisions
Try robust methods before removing
Compare results with/without outliers
Never remove points just because they don't fit the model

# Systematic approach to outlier investigation
def investigate_outliers(df, outlier_mask):
    """
    Print detailed information about detected outliers
    """
    outliers = df[outlier_mask]
    normal = df[~outlier_mask]

    print(f"Number of outliers: {len(outliers)} ({100*len(outliers)/len(df):.1f}%)")
    print("\nOutlier characteristics:")
    print(outliers.describe())
    print("\nNormal data characteristics:")
    print(normal.describe())

    # Check for data entry errors
    print("\nPotential data errors:")
    for col in df.columns:
        if df[col].dtype in [np.float64, np.int64]:
            min_val, max_val = df[col].min(), df[col].max()
            suspicious = (outliers[col] < min_val * 0.1) | (outliers[col] > max_val * 0.9)
            if suspicious.any():
                print(f"  {col}: {suspicious.sum()} suspicious values")

5. Confusing Outliers vs Influential Points

Problem: Not all outliers are influential, and not all influential points are outliers.

Concepts:

Outlier in X-space: Unusual predictor values (high leverage)
Outlier in Y-space: Unusual response value (large residual)
Influential point: Removing it significantly changes model (high Cook's D)

Solution:

Check multiple diagnostics: leverage, residuals, Cook's D
High leverage + small residual = not very influential
Low leverage + large residual = not very influential
High leverage + large residual = very influential

# Categorize points
def categorize_points(influence, results, n, p):
    """
    Categorize observations based on influence measures
    """
    leverage = influence.hat_matrix_diag
    studentized_resid = influence.resid_studentized_internal
    cooks_d = influence.cooks_distance[0]

    high_leverage = leverage > 2 * p / n
    outlier_y = np.abs(studentized_resid) > 3
    influential = cooks_d > 4 / n

    categories = []
    for i in range(n):
        if influential[i]:
            categories.append('Influential')
        elif high_leverage[i] and outlier_y[i]:
            categories.append('High Leverage Outlier')
        elif high_leverage[i]:
            categories.append('High Leverage')
        elif outlier_y[i]:
            categories.append('Outlier')
        else:
            categories.append('Normal')

    return pd.Series(categories, name='Category')

6. Not Checking Prediction Intervals

Problem: Reporting point predictions without uncertainty.

Solution:

Use prediction intervals (statistical models) or quantile regression
Report confidence intervals for coefficients
Visualize uncertainty in predictions

# Prediction intervals with statsmodels
predictions = results.get_prediction(X_new)
pred_df = predictions.summary_frame(alpha=0.05)  # 95% intervals

# pred_df has columns: mean, mean_se, mean_ci_lower, mean_ci_upper,
#                       obs_ci_lower, obs_ci_upper

# Confidence interval: uncertainty in mean prediction
# Prediction interval: where we expect new observation to fall (wider)

import matplotlib.pyplot as plt
plt.scatter(X_test, y_test, label='Actual')
plt.plot(X_test, pred_df['mean'], 'r-', label='Prediction')
plt.fill_between(X_test, pred_df['obs_ci_lower'], pred_df['obs_ci_upper'],
                 alpha=0.2, label='95% Prediction Interval')
plt.legend()

Visualization Best Practices

Diagnostic Plots

import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.graphics.gofplots import qqplot

def plot_diagnostics(results, figsize=(12, 10)):
    """
    Create comprehensive diagnostic plots for regression
    """
    fig, axes = plt.subplots(2, 2, figsize=figsize)

    # 1. Residuals vs Fitted
    axes[0, 0].scatter(results.fittedvalues, results.resid, alpha=0.6)
    axes[0, 0].axhline(y=0, color='r', linestyle='--')
    axes[0, 0].set_xlabel('Fitted values')
    axes[0, 0].set_ylabel('Residuals')
    axes[0, 0].set_title('Residuals vs Fitted')

    # Add lowess smooth
    from statsmodels.nonparametric.smoothers_lowess import lowess
    smoothed = lowess(results.resid, results.fittedvalues, frac=0.3)
    axes[0, 0].plot(smoothed[:, 0], smoothed[:, 1], 'g-', lw=2)

    # 2. Q-Q plot
    qqplot(results.resid, line='s', ax=axes[0, 1])
    axes[0, 1].set_title('Normal Q-Q')

    # 3. Scale-Location
    resid_std = np.sqrt(np.abs(results.resid_pearson))
    axes[1, 0].scatter(results.fittedvalues, resid_std, alpha=0.6)
    axes[1, 0].set_xlabel('Fitted values')
    axes[1, 0].set_ylabel('√|Standardized Residuals|')
    axes[1, 0].set_title('Scale-Location')
    smoothed = lowess(resid_std, results.fittedvalues, frac=0.3)
    axes[1, 0].plot(smoothed[:, 0], smoothed[:, 1], 'r-', lw=2)

    # 4. Residuals vs Leverage
    from statsmodels.stats.outliers_influence import OLSInfluence
    influence = OLSInfluence(results)
    leverage = influence.hat_matrix_diag
    cooks_d = influence.cooks_distance[0]

    axes[1, 1].scatter(leverage, results.resid_pearson, alpha=0.6)
    axes[1, 1].set_xlabel('Leverage')
    axes[1, 1].set_ylabel('Standardized Residuals')
    axes[1, 1].set_title('Residuals vs Leverage')

    # Highlight high Cook's D points
    n = len(results.resid)
    high_cooks = cooks_d > 4/n
    axes[1, 1].scatter(leverage[high_cooks], results.resid_pearson[high_cooks],
                       color='red', s=100, alpha=0.8, label="High Cook's D")
    axes[1, 1].legend()

    plt.tight_layout()
    return fig

# Usage
# fig = plot_diagnostics(results)
# plt.show()

Outlier Score Distributions

def plot_outlier_scores(detectors_dict, X, figsize=(15, 5)):
    """
    Plot outlier score distributions for multiple detectors
    """
    n_detectors = len(detectors_dict)
    fig, axes = plt.subplots(1, n_detectors, figsize=figsize)
    if n_detectors == 1:
        axes = [axes]

    for ax, (name, detector) in zip(axes, detectors_dict.items()):
        detector.fit(X)
        scores = detector.decision_scores_
        labels = detector.labels_

        # Histogram
        ax.hist(scores[labels == 0], bins=50, alpha=0.6, label='Inliers')
        ax.hist(scores[labels == 1], bins=50, alpha=0.6, label='Outliers')
        ax.set_xlabel('Outlier Score')
        ax.set_ylabel('Frequency')
        ax.set_title(name)
        ax.legend()

    plt.tight_layout()
    return fig

Quick Reference

When to Use Which Method

| Goal | Method | Library | |------|--------|---------| | Simple linear regression with inference | OLS | statsmodels | | Regression with correlated predictors | Ridge | sklearn | | Feature selection in regression | Lasso | sklearn | | Robust to outliers | RLM, HuberRegressor, RANSACRegressor | statsmodels, sklearn | | Count data | GLM(Poisson), GLM(NegativeBinomial) | statsmodels | | Time series forecasting | ARIMA, SARIMAX | statsmodels | | High-dimensional data | Lasso, ElasticNet, Ridge | sklearn | | Complex nonlinear patterns | RandomForest, GradientBoosting | sklearn | | Fast outlier detection | COPOD, ECOD, IsolationForest | PyOD, sklearn | | Robust outlier detection | LOF, Ensemble methods | PyOD, sklearn | | Univariate outliers | IQR, MAD, Z-score | scipy |

Key Diagnostics Checklist

[ ] Linearity: Residual vs fitted plot shows random scatter
[ ] Independence: Durbin-Watson ≈ 2, no autocorrelation
[ ] Normality: Q-Q plot follows diagonal, Shapiro-Wilk p > 0.05
[ ] Equal variance: Scale-location plot shows constant spread, Breusch-Pagan p > 0.05
[ ] Multicollinearity: All VIF < 5 (ideally < 10)
[ ] Outliers: Identify using |studentized residual| > 3
[ ] Influential points: Cook's D < 4/n, check DFFITS
[ ] Model comparison: Use AIC, BIC, cross-validation RMSE

Installation Commands

# Statistical modeling
pip install statsmodels scipy

# Machine learning
pip install scikit-learn

# Outlier detection
pip install pyod

# Optional: advanced methods
pip install xgboost lightgbm

Additional Resources

Documentation

statsmodels: https://www.statsmodels.org/stable/index.html
scikit-learn: https://scikit-learn.org/stable/
PyOD: https://pyod.readthedocs.io/
scipy.stats: https://docs.scipy.org/doc/scipy/reference/stats.html

Key Statsmodels Pages

OLS: https://www.statsmodels.org/stable/regression.html
GLM: https://www.statsmodels.org/stable/glm.html
RLM: https://www.statsmodels.org/stable/rlm.html
Diagnostics: https://www.statsmodels.org/stable/diagnostic.html

Key Sklearn Pages

Linear Models: https://scikit-learn.org/stable/modules/linear_model.html
Ensemble Methods: https://scikit-learn.org/stable/modules/ensemble.html
Outlier Detection: https://scikit-learn.org/stable/modules/outlier_detection.html
Model Selection: https://scikit-learn.org/stable/model_selection.html

Python Regression & Statistical Analysis

When to Use This Skill

Fitting linear and nonlinear regression models with proper statistical inference
Checking and validating regression assumptions (LINE: Linearity, Independence, Normality, Equal variance)
Performing model diagnostics: residual analysis, influential points, multicollinearity
Detecting and handling outliers using statistical, proximity-based, and ensemble methods
Comparing statistical vs machine learning regression approaches
Building robust regression pipelines with proper validation
Time series modeling and forecasting
Generalized linear models (GLM) for non-normal response variables

Philosophy: Statistical vs Machine Learning Approaches

Statistical Regression (statsmodels, scipy.stats)

Use when:

You need p-values, confidence intervals, and hypothesis testing
Understanding which variables are significant is important
You want to interpret coefficients causally
Sample size is moderate and you need inference guarantees
You need diagnostic tests for assumptions

Advantages:

Rich statistical inference (t-tests, F-tests, likelihood ratios)
Comprehensive diagnostics and assumption tests
Formula interface for easy model specification
Standard errors, confidence intervals, prediction intervals
Well-understood theoretical properties

Limitations:

Less flexible for complex nonlinear relationships
Requires careful assumption checking
May not handle high-dimensional data well
Less automated feature selection

Machine Learning Regression (scikit-learn)

Use when:

Prediction accuracy is the primary goal
You have many features and complex interactions
Relationships are highly nonlinear
You don't need statistical inference (p-values)
You have sufficient data for train/test splits

Advantages:

Handles complex nonlinear patterns
Built-in regularization (Ridge, Lasso, ElasticNet)
Excellent for high-dimensional data
Automated cross-validation and hyperparameter tuning
Ensemble methods for improved accuracy

Limitations:

No automatic p-values or statistical inference
Less interpretable coefficients
Requires more data for complex models
Black-box nature for some algorithms

Recommended Workflow

# 1. Exploratory Data Analysis
# - Visualize relationships (scatter plots, pair plots)
# - Check distributions (histograms, Q-Q plots)
# - Identify potential outliers (box plots, scatter plots)
# - Calculate summary statistics

# 2. Initial Outlier Screening (Optional)
# - Use multiple detection methods (IQR, Z-score, isolation forest)
# - Visualize flagged points
# - Investigate why they're outliers (data errors vs real extremes)
# - Document decision to keep or remove

# 3. Check Assumptions (for statistical regression)
# - Linearity: residual vs fitted plots, partial regression plots
# - Independence: Durbin-Watson test, autocorrelation plots
# - Normality: Q-Q plots, Shapiro-Wilk test, histogram of residuals
# - Equal variance: scale-location plot, Breusch-Pagan test

# 4. Model Fitting
# - Start simple (OLS) then add complexity as needed
# - Try transformations if assumptions violated (log, Box-Cox)
# - Consider robust methods if outliers present
# - Use regularization if many predictors (Ridge/Lasso)

# 5. Model Diagnostics
# - Residual analysis
# - Check for influential points (Cook's D, DFFITS, leverage)
# - Multicollinearity (VIF)
# - Outliers in predictor space vs response space

# 6. Validation
# - Train/test split or cross-validation
# - Compare multiple models (AIC, BIC, R², RMSE)
# - Check prediction intervals (statistical) or residual distribution (ML)

# 7. Interpretation & Reporting
# - Coefficient interpretation with confidence intervals
# - Feature importance (if ML model)
# - Visualize predictions vs actuals
# - Report diagnostics and assumption checks

Statistical Regression with statsmodels

Ordinary Least Squares (OLS)

import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
import pandas as pd

# Arrays API - more control
X = sm.add_constant(X_raw)  # Add intercept
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())

# Formula API - easier specification
df = pd.DataFrame({'y': y, 'x1': x1, 'x2': x2, 'category': cat})
model = smf.ols('y ~ x1 + x2 + C(category)', data=df)
results = model.fit()

# Access key results
results.params           # Coefficients
results.pvalues         # P-values
results.conf_int()      # Confidence intervals
results.rsquared        # R²
results.rsquared_adj    # Adjusted R²
results.aic, results.bic  # Information criteria
results.resid           # Residuals
results.fittedvalues    # Predicted values

Weighted Least Squares (WLS)

# When heteroskedasticity is present
# Weights = 1/variance of each observation

# If you know the variance structure
weights = 1 / variance_array
model = sm.WLS(y, X, weights=weights)
results = model.fit()

# Iteratively Reweighted Least Squares
# Start with OLS, estimate variance function, refit
ols_results = sm.OLS(y, X).fit()
resid_squared = ols_results.resid**2

# Fit variance as function of predictors
variance_model = sm.OLS(np.log(resid_squared), X).fit()
weights = 1 / np.exp(variance_model.fittedvalues)

wls_results = sm.WLS(y, X, weights=weights).fit()

Robust Regression (RLM)

# Resistant to outliers - downweights influential points
import statsmodels.robust.robust_linear_model as rlm

# Huber's T norm (default)
model = rlm.RLM(y, X, M=sm.robust.norms.HuberT())
results = model.fit()

# Other M-estimators
# TukeyBiweight - more aggressive outlier downweighting
model = rlm.RLM(y, X, M=sm.robust.norms.TukeyBiweight())
results = model.fit()

# Check weights assigned to each observation
weights = results.weights  # Outliers get lower weights

Generalized Linear Models (GLM)

# For non-normal response variables

# Poisson regression (count data)
model = sm.GLM(y, X, family=sm.families.Poisson())
results = model.fit()

# Negative Binomial (overdispersed counts)
model = sm.GLM(y, X, family=sm.families.NegativeBinomial())
results = model.fit()

# Gamma regression (positive continuous data)
model = sm.GLM(y, X, family=sm.families.Gamma())
results = model.fit()

# Binomial/Logistic regression
model = sm.GLM(y, X, family=sm.families.Binomial())
results = model.fit()

# Link functions
# family=sm.families.Gaussian(link=sm.families.links.log())

Time Series Models

from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.vector_ar.var_model import VAR

# ARIMA (AutoRegressive Integrated Moving Average)
model = ARIMA(y, order=(p, d, q))  # p=AR order, d=differencing, q=MA order
results = model.fit()

# Seasonal ARIMA
model = SARIMAX(y, order=(p, d, q), seasonal_order=(P, D, Q, s))
results = model.fit()

# Forecasting
forecast = results.forecast(steps=10)
forecast_with_ci = results.get_forecast(steps=10)
forecast_df = forecast_with_ci.summary_frame()

# Vector Autoregression (multivariate time series)
model = VAR(df[['y1', 'y2', 'y3']])
results = model.fit(maxlags=5)
results.forecast(df.values[-results.k_ar:], steps=10)

Model Diagnostics - statsmodels

# Comprehensive diagnostic plots
from statsmodels.graphics.gofplots import qqplot
import matplotlib.pyplot as plt

# 1. Residual vs Fitted
plt.scatter(results.fittedvalues, results.resid)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
# Should show random scatter, no pattern

# 2. Q-Q plot for normality
qqplot(results.resid, line='s')
# Points should follow diagonal line

# 3. Scale-Location (heteroskedasticity)
plt.scatter(results.fittedvalues, np.sqrt(np.abs(results.resid_pearson)))
plt.xlabel('Fitted values')
plt.ylabel('√|Standardized Residuals|')
# Should show constant spread

# 4. Residuals vs Leverage
from statsmodels.stats.outliers_influence import OLSInfluence
influence = OLSInfluence(results)
fig, ax = plt.subplots()
ax.scatter(influence.hat_matrix_diag, results.resid_pearson)
# Identifies influential points (high leverage + large residual)

# Statistical tests
from statsmodels.stats.diagnostic import het_breuschpagan, acorr_ljungbox
from statsmodels.stats.stattools import durbin_watson

# Heteroskedasticity test
lm_stat, lm_pval, f_stat, f_pval = het_breuschpagan(results.resid, results.model.exog)
print(f"Breusch-Pagan p-value: {lm_pval}")  # p < 0.05 indicates heteroskedasticity

# Autocorrelation
dw = durbin_watson(results.resid)
print(f"Durbin-Watson: {dw}")  # Should be ~2 for no autocorrelation

# Autocorrelation test
lb_test = acorr_ljungbox(results.resid, lags=10)

# Normality test
from scipy.stats import shapiro
stat, pval = shapiro(results.resid)
print(f"Shapiro-Wilk p-value: {pval}")  # p < 0.05 suggests non-normality

Influential Points and Multicollinearity

from statsmodels.stats.outliers_influence import variance_inflation_factor, OLSInfluence

# Variance Inflation Factor (multicollinearity)
vif_data = pd.DataFrame()
vif_data["Variable"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif_data)
# VIF > 10 indicates serious multicollinearity
# VIF > 5 is concerning

# Influence measures
influence = OLSInfluence(results)

# Cook's Distance (overall influence)
cooks_d = influence.cooks_distance[0]
# Rule of thumb: D > 4/n is influential

# DFFITS (influence on fitted value)
dffits = influence.dffits[0]
# Rule of thumb: |DFFITS| > 2√(p/n) is influential

# Leverage (hat values)
leverage = influence.hat_matrix_diag
# Rule of thumb: h > 2p/n or h > 3p/n is high leverage

# Studentized residuals
student_resid = influence.resid_studentized_internal
# |r| > 2 or 3 suggests outlier in response space

# Summary table
influence_summary = influence.summary_frame()
print(influence_summary)

Machine Learning Regression with scikit-learn

Linear Models with Regularization

from sklearn.linear_model import (
    LinearRegression, Ridge, Lasso, ElasticNet,
    Lars, LassoLars, BayesianRidge, HuberRegressor,
    RANSACRegressor, TheilSenRegressor
)
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, GridSearchCV

# Standard Linear Regression
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
model.coef_, model.intercept_

# Ridge Regression (L2 regularization)
# Good when many correlated predictors
ridge = Ridge(alpha=1.0)  # Higher alpha = more regularization
ridge.fit(X_train, y_train)

# Lasso Regression (L1 regularization)
# Performs feature selection by setting coefficients to zero
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
# Check which features selected
selected_features = X.columns[lasso.coef_ != 0]

# ElasticNet (L1 + L2)
# l1_ratio=1.0 is Lasso, l1_ratio=0.0 is Ridge
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic.fit(X_train, y_train)

# Huber Regression (robust to outliers)
huber = HuberRegressor(epsilon=1.35)
huber.fit(X_train, y_train)

# RANSAC (extremely robust, finds inlier subset)
ransac = RANSACRegressor(min_samples=0.5, residual_threshold=2.0)
ransac.fit(X_train, y_train)
inlier_mask = ransac.inlier_mask_

# Theil-Sen (robust, good for small datasets)
theil = TheilSenRegressor()
theil.fit(X_train, y_train)

Cross-Validation and Hyperparameter Tuning

from sklearn.model_selection import (
    cross_val_score, cross_validate,
    GridSearchCV, RandomizedSearchCV,
    KFold, RepeatedKFold
)
from sklearn.metrics import (
    mean_squared_error, mean_absolute_error,
    r2_score, mean_absolute_percentage_error
)

# Simple cross-validation
scores = cross_val_score(model, X, y, cv=5,
                         scoring='neg_mean_squared_error')
rmse_scores = np.sqrt(-scores)
print(f"RMSE: {rmse_scores.mean():.3f} (+/- {rmse_scores.std():.3f})")

# Multiple metrics
cv_results = cross_validate(
    model, X, y, cv=5,
    scoring=['neg_mean_squared_error', 'r2', 'neg_mean_absolute_error'],
    return_train_score=True
)

# Grid search for optimal hyperparameters
param_grid = {
    'alpha': [0.001, 0.01, 0.1, 1.0, 10.0],
    'l1_ratio': [0.1, 0.3, 0.5, 0.7, 0.9]
}
grid_search = GridSearchCV(
    ElasticNet(),
    param_grid,
    cv=5,
    scoring='neg_mean_squared_error',
    n_jobs=-1
)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {np.sqrt(-grid_search.best_score_):.3f}")

# Use best model
best_model = grid_search.best_estimator_

Tree-Based and Ensemble Methods

from sklearn.ensemble import (
    RandomForestRegressor, GradientBoostingRegressor,
    AdaBoostRegressor, BaggingRegressor, ExtraTreesRegressor
)
from sklearn.tree import DecisionTreeRegressor

# Random Forest
rf = RandomForestRegressor(
    n_estimators=100,
    max_depth=10,
    min_samples_split=5,
    min_samples_leaf=2,
    random_state=42,
    n_jobs=-1
)
rf.fit(X_train, y_train)

# Feature importance
importance_df = pd.DataFrame({
    'feature': X.columns,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)

# Gradient Boosting
gb = GradientBoostingRegressor(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    subsample=0.8,
    random_state=42
)
gb.fit(X_train, y_train)

# XGBoost (if installed)
try:
    import xgboost as xgb
    xgb_model = xgb.XGBRegressor(
        n_estimators=100,
        learning_rate=0.1,
        max_depth=5,
        random_state=42
    )
    xgb_model.fit(X_train, y_train)
except ImportError:
    print("XGBoost not installed. Use: pip install xgboost")

Other ML Regression Methods

from sklearn.svm import SVR
from sklearn.kernel_ridge import KernelRidge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Support Vector Regression
svr = SVR(kernel='rbf', C=1.0, epsilon=0.1)
svr.fit(X_train, y_train)

# Kernel Ridge Regression
kr = KernelRidge(alpha=1.0, kernel='rbf')
kr.fit(X_train, y_train)

# K-Nearest Neighbors
knn = KNeighborsRegressor(n_neighbors=5, weights='distance')
knn.fit(X_train, y_train)

# Gaussian Process (with uncertainty quantification)
kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=1.0)
gp = GaussianProcessRegressor(kernel=kernel, random_state=42)
gp.fit(X_train, y_train)

# Predictions with uncertainty
y_pred, sigma = gp.predict(X_test, return_std=True)

Building Complete Pipelines

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.compose import TransformedTargetRegressor

# Standard pipeline with scaling
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', Ridge(alpha=1.0))
])
pipeline.fit(X_train, y_train)

# Polynomial features
poly_pipeline = Pipeline([
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('scaler', StandardScaler()),
    ('model', Ridge(alpha=1.0))
])

# Target transformation (e.g., log transform y)
log_model = TransformedTargetRegressor(
    regressor=Ridge(alpha=1.0),
    func=np.log1p,
    inverse_func=np.expm1
)
log_model.fit(X_train, y_train)

Outlier Detection

Statistical Methods (scipy, statsmodels)

import numpy as np
from scipy import stats

# Z-score method
def detect_outliers_zscore(data, threshold=3):
    """
    Outliers are points with |z-score| > threshold
    Works well for normally distributed data
    """
    z_scores = np.abs(stats.zscore(data))
    return z_scores > threshold

# Modified Z-score (MAD - Median Absolute Deviation)
def detect_outliers_mad(data, threshold=3.5):
    """
    More robust than Z-score, uses median instead of mean
    Recommended threshold: 3.5
    """
    median = np.median(data)
    mad = np.median(np.abs(data - median))
    modified_z_scores = 0.6745 * (data - median) / mad
    return np.abs(modified_z_scores) > threshold

# IQR method
def detect_outliers_iqr(data, factor=1.5):
    """
    Outliers are below Q1 - factor*IQR or above Q3 + factor*IQR
    factor=1.5: outliers, factor=3.0: extreme outliers
    """
    q1 = np.percentile(data, 25)
    q3 = np.percentile(data, 75)
    iqr = q3 - q1
    lower_bound = q1 - factor * iqr
    upper_bound = q3 + factor * iqr
    return (data < lower_bound) | (data > upper_bound)

# Grubbs test (for single outlier in normal data)
def grubbs_test(data, alpha=0.05):
    """
    Tests if maximum or minimum is an outlier
    Assumes data is normally distributed
    """
    from scipy.stats import t
    n = len(data)
    mean = np.mean(data)
    std = np.std(data, ddof=1)

    # Test statistic
    G_max = np.max(np.abs(data - mean)) / std

    # Critical value
    t_crit = t.ppf(1 - alpha / (2 * n), n - 2)
    G_crit = ((n - 1) / np.sqrt(n)) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))

    return G_max > G_crit

# Example usage
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])  # 100 is outlier
outliers_z = detect_outliers_zscore(data)
outliers_mad = detect_outliers_mad(data)
outliers_iqr = detect_outliers_iqr(data)

PyOD - Comprehensive Outlier Detection

PyOD (Python Outlier Detection) provides 40+ algorithms. Install with: pip install pyod

from pyod.models.knn import KNN
from pyod.models.lof import LOF
from pyod.models.iforest import IForest
from pyod.models.ocsvm import OCSVM
from pyod.models.pca import PCA as PCA_OD
from pyod.models.mcd import MCD
from pyod.models.abod import ABOD
from pyod.models.copod import COPOD
from pyod.models.ecod import ECOD
from pyod.models.combination import average, maximization

# General pattern for PyOD detectors
detector = KNN(contamination=0.1)  # Expect 10% outliers
detector.fit(X)

# Get outlier labels (0=inlier, 1=outlier)
outlier_labels = detector.labels_

# Get outlier scores (higher = more outlier-like)
outlier_scores = detector.decision_scores_

# Predict on new data
new_labels = detector.predict(X_new)
new_scores = detector.decision_function(X_new)

Proximity-Based Detectors

# K-Nearest Neighbors (KNN)
# Simple, fast, good baseline
knn = KNN(contamination=0.1, n_neighbors=5, method='mean')
knn.fit(X)

# Local Outlier Factor (LOF)
# Captures local density deviations
# Good for datasets with varying densities
lof = LOF(contamination=0.1, n_neighbors=20)
lof.fit(X)

# Connectivity-Based Outlier Factor (COF)
from pyod.models.cof import COF
cof = COF(contamination=0.1, n_neighbors=20)
cof.fit(X)

Probabilistic Detectors

# COPOD (Copula-Based Outlier Detection)
# Fast, parameter-free, works well on high-dimensional data
copod = COPOD(contamination=0.1)
copod.fit(X)

# ECOD (Empirical Cumulative Distribution)
# Very fast, no parameters, good for tabular data
ecod = ECOD(contamination=0.1)
ecod.fit(X)

# ABOD (Angle-Based Outlier Detection)
# Effective in high-dimensional spaces
# Slow on large datasets
abod = ABOD(contamination=0.1)
abod.fit(X)

Linear Model Detectors

# PCA-based detection
# Outliers have large reconstruction error
pca = PCA_OD(contamination=0.1, n_components=None)
pca.fit(X)

# MCD (Minimum Covariance Determinant)
# Robust covariance estimation
# Assumes data is roughly Gaussian
mcd = MCD(contamination=0.1)
mcd.fit(X)

# One-Class SVM
# Learns boundary around normal points
ocsvm = OCSVM(contamination=0.1, kernel='rbf', nu=0.1)
ocsvm.fit(X)

Isolation-Based Detectors

# Isolation Forest
# Fast, scalable, effective for high-dimensional data
# Isolates outliers by random partitioning
iforest = IForest(
    contamination=0.1,
    n_estimators=100,
    max_features=1.0,
    random_state=42
)
iforest.fit(X)

Ensemble Methods

# Combine multiple detectors
from pyod.models.lscp import LSCP
from pyod.models.feature_bagging import FeatureBagging

# Feature Bagging
# Trains multiple detectors on random feature subsets
fb = FeatureBagging(
    base_estimator=LOF(contamination=0.1),
    n_estimators=10,
    contamination=0.1,
    random_state=42
)
fb.fit(X)

# Average multiple detector scores
detectors = [KNN(), LOF(), IForest(), COPOD()]
scores_list = []
for detector in detectors:
    detector.fit(X)
    scores_list.append(detector.decision_scores_)

# Average combination
avg_scores = average(scores_list)

# Maximum combination (more conservative)
max_scores = maximization(scores_list)

scikit-learn Outlier Detection

from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.covariance import EllipticEnvelope
from sklearn.svm import OneClassSVM

# Isolation Forest
iso_forest = IsolationForest(
    contamination=0.1,
    random_state=42,
    n_estimators=100
)
outlier_labels = iso_forest.fit_predict(X)  # -1=outlier, 1=inlier
outlier_scores = iso_forest.score_samples(X)  # More negative = more outlier

# Local Outlier Factor (novelty detection mode)
lof = LocalOutlierFactor(
    n_neighbors=20,
    contamination=0.1,
    novelty=True  # Enables predict on new data
)
lof.fit(X_train)
outlier_labels = lof.predict(X_test)
outlier_scores = lof.score_samples(X_test)

# Elliptic Envelope (assumes Gaussian distribution)
elliptic = EllipticEnvelope(contamination=0.1, random_state=42)
outlier_labels = elliptic.fit_predict(X)

# One-Class SVM
oc_svm = OneClassSVM(kernel='rbf', gamma='auto', nu=0.1)
outlier_labels = oc_svm.fit_predict(X)

Choosing the Right Outlier Detection Method

Use IQR or Z-score when:

Univariate data (single variable)
Quick screening needed
Interpretability is important

Use Modified Z-score (MAD) when:

Data may already contain outliers (MAD is robust)
Distribution is roughly symmetric

Use Isolation Forest when:

High-dimensional data
No assumptions about data distribution
Fast computation needed
Good all-purpose choice

Use LOF when:

Varying density clusters in data
Local structure is important
Willing to tune n_neighbors parameter

Use COPOD/ECOD when:

Very fast detection needed
High-dimensional tabular data
Don't want to tune parameters

Use PCA-based when:

Data lies in lower-dimensional manifold
Want to understand which features contribute to outlierness

Use MCD when:

Multivariate Gaussian assumption reasonable
Robust covariance estimation needed

Use ensemble methods when:

Maximum robustness needed
Can afford computational cost
Different detector types disagree

Complete Workflow Examples

Example 1: Statistical Regression with Full Diagnostics

import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.graphics.gofplots import qqplot
from statsmodels.stats.outliers_influence import variance_inflation_factor, OLSInfluence
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson
import matplotlib.pyplot as plt

# Load data
df = pd.read_csv('data.csv')

# 1. Exploratory analysis
print(df.describe())
df.hist(bins=30, figsize=(15, 10))
pd.plotting.scatter_matrix(df, figsize=(15, 15))

# 2. Fit initial model
model = smf.ols('price ~ area + bedrooms + age + location', data=df)
results = model.fit()
print(results.summary())

# 3. Check assumptions
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Residuals vs Fitted
axes[0, 0].scatter(results.fittedvalues, results.resid)
axes[0, 0].axhline(y=0, color='r', linestyle='--')
axes[0, 0].set_xlabel('Fitted values')
axes[0, 0].set_ylabel('Residuals')
axes[0, 0].set_title('Residuals vs Fitted')

# Q-Q plot
qqplot(results.resid, line='s', ax=axes[0, 1])
axes[0, 1].set_title('Normal Q-Q')

# Scale-Location
axes[1, 0].scatter(results.fittedvalues, np.sqrt(np.abs(results.resid_pearson)))
axes[1, 0].set_xlabel('Fitted values')
axes[1, 0].set_ylabel('√|Standardized Residuals|')
axes[1, 0].set_title('Scale-Location')

# Residuals vs Leverage
influence = OLSInfluence(results)
axes[1, 1].scatter(influence.hat_matrix_diag, results.resid_pearson)
axes[1, 1].set_xlabel('Leverage')
axes[1, 1].set_ylabel('Standardized Residuals')
axes[1, 1].set_title('Residuals vs Leverage')

plt.tight_layout()

# 4. Statistical tests
# Heteroskedasticity
lm_stat, lm_pval, f_stat, f_pval = het_breuschpagan(results.resid, results.model.exog)
print(f"\nBreusch-Pagan test p-value: {lm_pval:.4f}")

# Autocorrelation
dw = durbin_watson(results.resid)
print(f"Durbin-Watson statistic: {dw:.4f}")

# 5. Check multicollinearity
X = df[['area', 'bedrooms', 'age']]  # Exclude categorical
X = sm.add_constant(X)
vif_data = pd.DataFrame()
vif_data["Variable"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print("\n", vif_data)

# 6. Identify influential points
influence_df = influence.summary_frame()
print("\nInfluential points (Cook's D > 4/n):")
n = len(df)
influential = influence_df[influence_df['cooks_d'] > 4/n]
print(influential[['cooks_d', 'student_resid', 'hat_diag']])

# 7. If assumptions violated, try transformations or robust regression
if lm_pval < 0.05:  # Heteroskedasticity detected
    print("\nFitting robust regression (RLM)...")
    from statsmodels.robust.robust_linear_model import RLM
    robust_model = RLM(df['price'], X, M=sm.robust.norms.HuberT())
    robust_results = robust_model.fit()
    print(robust_results.summary())

Example 2: ML Regression with Cross-Validation

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import pandas as pd
import numpy as np

# Load and split data
df = pd.read_csv('data.csv')
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Define models to compare
models = {
    'Ridge': Ridge(),
    'Lasso': Lasso(),
    'ElasticNet': ElasticNet(),
    'RandomForest': RandomForestRegressor(random_state=42),
    'GradientBoosting': GradientBoostingRegressor(random_state=42)
}

# Cross-validation comparison
results = {}
for name, model in models.items():
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('model', model)
    ])

    scores = cross_val_score(
        pipeline, X_train, y_train,
        cv=5, scoring='neg_mean_squared_error'
    )
    rmse_scores = np.sqrt(-scores)
    results[name] = {
        'mean_rmse': rmse_scores.mean(),
        'std_rmse': rmse_scores.std()
    }
    print(f"{name}: RMSE = {rmse_scores.mean():.3f} (+/- {rmse_scores.std():.3f})")

# Hyperparameter tuning for best model
param_grid = {
    'model__n_estimators': [50, 100, 200],
    'model__max_depth': [5, 10, 15, None],
    'model__min_samples_split': [2, 5, 10]
}

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestRegressor(random_state=42))
])

grid_search = GridSearchCV(
    pipeline, param_grid, cv=5,
    scoring='neg_mean_squared_error',
    n_jobs=-1, verbose=1
)
grid_search.fit(X_train, y_train)

print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best CV RMSE: {np.sqrt(-grid_search.best_score_):.3f}")

# Evaluate on test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

print(f"\nTest set performance:")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.3f}")
print(f"MAE: {mean_absolute_error(y_test, y_pred):.3f}")
print(f"R²: {r2_score(y_test, y_pred):.3f}")

# Feature importance
if hasattr(best_model.named_steps['model'], 'feature_importances_'):
    importance_df = pd.DataFrame({
        'feature': X.columns,
        'importance': best_model.named_steps['model'].feature_importances_
    }).sort_values('importance', ascending=False)
    print("\nTop 10 features:")
    print(importance_df.head(10))

Example 3: Outlier Detection Pipeline

import numpy as np
import pandas as pd
from pyod.models.iforest import IForest
from pyod.models.lof import LOF
from pyod.models.copod import COPOD
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Load data
df = pd.read_csv('data.csv')
X = df.drop('target', axis=1).values

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Multiple detection methods
detectors = {
    'Isolation Forest': IForest(contamination=0.1, random_state=42),
    'LOF': LOF(contamination=0.1, n_neighbors=20),
    'COPOD': COPOD(contamination=0.1)
}

# Fit detectors and collect scores
scores_df = pd.DataFrame(index=df.index)
labels_df = pd.DataFrame(index=df.index)

for name, detector in detectors.items():
    detector.fit(X_scaled)
    scores_df[name] = detector.decision_scores_
    labels_df[name] = detector.labels_

# Consensus: point is outlier if flagged by at least 2 methods
consensus_outliers = (labels_df.sum(axis=1) >= 2)
print(f"Outliers detected by consensus: {consensus_outliers.sum()}")

# Investigate outliers
outlier_indices = df[consensus_outliers].index
print("\nOutlier summary statistics:")
print(df.loc[outlier_indices].describe())
print("\nNormal data summary statistics:")
print(df.loc[~consensus_outliers].describe())

# Visualize outlier scores
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for i, (name, scores) in enumerate(scores_df.items()):
    axes[i].hist(scores, bins=50, alpha=0.7)
    threshold = np.percentile(scores, 90)  # 90th percentile
    axes[i].axvline(threshold, color='r', linestyle='--', label='Threshold')
    axes[i].set_title(f'{name} Scores')
    axes[i].set_xlabel('Outlier Score')
    axes[i].set_ylabel('Frequency')
    axes[i].legend()
plt.tight_layout()

# Decision: remove, investigate, or keep?
# Option 1: Remove outliers
df_clean = df[~consensus_outliers].copy()

# Option 2: Use robust regression that handles outliers
# (see RLM, Huber, RANSAC examples above)

# Option 3: Keep but analyze separately
df['is_outlier'] = consensus_outliers

Example 4: Combined Regression + Outlier Detection

import numpy as np
import pandas as pd
import statsmodels.api as sm
from pyod.models.iforest import IForest
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Load data
df = pd.read_csv('data.csv')
X = df[['x1', 'x2', 'x3']].values
y = df['y'].values

# Step 1: Detect outliers in predictor space
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

detector = IForest(contamination=0.05, random_state=42)
detector.fit(X_scaled)
outliers_X = detector.labels_ == 1

print(f"Outliers in predictor space: {outliers_X.sum()}")

# Step 2: Fit initial model
X_with_const = sm.add_constant(X)
model = sm.OLS(y, X_with_const)
results = model.fit()

# Step 3: Identify outliers in response space
from statsmodels.stats.outliers_influence import OLSInfluence
influence = OLSInfluence(results)
studentized_resid = influence.resid_studentized_internal
outliers_y = np.abs(studentized_resid) > 3

print(f"Outliers in response space: {outliers_y.sum()}")

# Step 4: Identify influential points
cooks_d = influence.cooks_distance[0]
n, p = X_with_const.shape
influential = cooks_d > 4/n

print(f"Influential points: {influential.sum()}")

# Step 5: Categorize points
df['outlier_X'] = outliers_X
df['outlier_y'] = outliers_y
df['influential'] = influential
df['any_flag'] = outliers_X | outliers_y | influential

# Step 6: Visualize
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Plot 1: Residuals vs Fitted
colors = ['red' if flag else 'blue' for flag in df['any_flag']]
axes[0].scatter(results.fittedvalues, results.resid, c=colors, alpha=0.6)
axes[0].axhline(y=0, color='black', linestyle='--')
axes[0].set_xlabel('Fitted values')
axes[0].set_ylabel('Residuals')
axes[0].set_title('Residuals vs Fitted (red = flagged)')

# Plot 2: Cook's distance
axes[1].stem(range(len(cooks_d)), cooks_d, markerfmt=',')
axes[1].axhline(y=4/n, color='r', linestyle='--', label='Threshold (4/n)')
axes[1].set_xlabel('Observation index')
axes[1].set_ylabel("Cook's distance")
axes[1].set_title("Cook's Distance")
axes[1].legend()

plt.tight_layout()

# Step 7: Compare models
print("\n=== Original Model ===")
print(results.summary())

# Remove flagged points and refit
df_clean = df[~df['any_flag']].copy()
X_clean = df_clean[['x1', 'x2', 'x3']].values
y_clean = df_clean['y'].values
X_clean_const = sm.add_constant(X_clean)

model_clean = sm.OLS(y_clean, X_clean_const)
results_clean = model_clean.fit()

print("\n=== Model After Removing Flagged Points ===")
print(results_clean.summary())

# Alternative: Robust regression (keeps all data)
from statsmodels.robust.robust_linear_model import RLM
robust_model = RLM(y, X_with_const, M=sm.robust.norms.HuberT())
robust_results = robust_model.fit()

print("\n=== Robust Regression (All Data) ===")
print(robust_results.summary())

Common Pitfalls and Solutions

1. P-hacking and Multiple Testing

Problem: Testing many models/features and reporting only significant ones.

Solution:

Pre-specify your model based on theory/prior knowledge
Use adjusted p-values (Bonferroni, Benjamini-Hochberg) for multiple comparisons
Focus on effect sizes and confidence intervals, not just p-values
Use train/test split or cross-validation for model selection

from statsmodels.stats.multitest import multipletests

# Multiple hypothesis testing correction
p_values = [0.04, 0.03, 0.001, 0.15, 0.08]
reject, pvals_corrected, _, _ = multipletests(p_values, method='fdr_bh')
# Benjamini-Hochberg controls false discovery rate

2. Overfitting with Many Features

Problem: Model fits training data perfectly but generalizes poorly.

Solution:

Use regularization (Ridge, Lasso, ElasticNet)
Cross-validation to estimate generalization performance
Feature selection based on domain knowledge
Simpler models when sample size is small

# Rule of thumb: need at least 10-20 observations per predictor
n_samples, n_features = X.shape
if n_samples < 10 * n_features:
    print(f"Warning: Only {n_samples/n_features:.1f} samples per feature")
    print("Consider: (1) regularization, (2) feature selection, (3) collect more data")

3. Ignoring Assumptions

Problem: Using OLS when assumptions are violated leads to invalid inference.

Solution:

Always check diagnostics (residual plots, Q-Q plots, tests)
Transform variables if needed (log, Box-Cox, square root)
Use robust methods when outliers present
Use WLS when heteroskedasticity detected
Use GLM when response distribution is non-normal

# Box-Cox transformation for normality
from scipy.stats import boxcox

y_transformed, lambda_param = boxcox(y)  # Only works for positive y
print(f"Optimal lambda: {lambda_param}")
# lambda=0: log transform
# lambda=0.5: square root
# lambda=1: no transform needed

4. Removing Outliers Without Justification

Problem: Arbitrarily removing points to improve fit.

Solution:

Investigate WHY points are outliers (data errors vs real extremes)
Document all data cleaning decisions
Try robust methods before removing
Compare results with/without outliers
Never remove points just because they don't fit the model

# Systematic approach to outlier investigation
def investigate_outliers(df, outlier_mask):
    """
    Print detailed information about detected outliers
    """
    outliers = df[outlier_mask]
    normal = df[~outlier_mask]

    print(f"Number of outliers: {len(outliers)} ({100*len(outliers)/len(df):.1f}%)")
    print("\nOutlier characteristics:")
    print(outliers.describe())
    print("\nNormal data characteristics:")
    print(normal.describe())

    # Check for data entry errors
    print("\nPotential data errors:")
    for col in df.columns:
        if df[col].dtype in [np.float64, np.int64]:
            min_val, max_val = df[col].min(), df[col].max()
            suspicious = (outliers[col] < min_val * 0.1) | (outliers[col] > max_val * 0.9)
            if suspicious.any():
                print(f"  {col}: {suspicious.sum()} suspicious values")

5. Confusing Outliers vs Influential Points

Problem: Not all outliers are influential, and not all influential points are outliers.

Concepts:

Outlier in X-space: Unusual predictor values (high leverage)
Outlier in Y-space: Unusual response value (large residual)
Influential point: Removing it significantly changes model (high Cook's D)

Solution:

Check multiple diagnostics: leverage, residuals, Cook's D
High leverage + small residual = not very influential
Low leverage + large residual = not very influential
High leverage + large residual = very influential

# Categorize points
def categorize_points(influence, results, n, p):
    """
    Categorize observations based on influence measures
    """
    leverage = influence.hat_matrix_diag
    studentized_resid = influence.resid_studentized_internal
    cooks_d = influence.cooks_distance[0]

    high_leverage = leverage > 2 * p / n
    outlier_y = np.abs(studentized_resid) > 3
    influential = cooks_d > 4 / n

    categories = []
    for i in range(n):
        if influential[i]:
            categories.append('Influential')
        elif high_leverage[i] and outlier_y[i]:
            categories.append('High Leverage Outlier')
        elif high_leverage[i]:
            categories.append('High Leverage')
        elif outlier_y[i]:
            categories.append('Outlier')
        else:
            categories.append('Normal')

    return pd.Series(categories, name='Category')

6. Not Checking Prediction Intervals

Problem: Reporting point predictions without uncertainty.

Solution:

Use prediction intervals (statistical models) or quantile regression
Report confidence intervals for coefficients
Visualize uncertainty in predictions

# Prediction intervals with statsmodels
predictions = results.get_prediction(X_new)
pred_df = predictions.summary_frame(alpha=0.05)  # 95% intervals

# pred_df has columns: mean, mean_se, mean_ci_lower, mean_ci_upper,
#                       obs_ci_lower, obs_ci_upper

# Confidence interval: uncertainty in mean prediction
# Prediction interval: where we expect new observation to fall (wider)

import matplotlib.pyplot as plt
plt.scatter(X_test, y_test, label='Actual')
plt.plot(X_test, pred_df['mean'], 'r-', label='Prediction')
plt.fill_between(X_test, pred_df['obs_ci_lower'], pred_df['obs_ci_upper'],
                 alpha=0.2, label='95% Prediction Interval')
plt.legend()

Visualization Best Practices

Diagnostic Plots

import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.graphics.gofplots import qqplot

def plot_diagnostics(results, figsize=(12, 10)):
    """
    Create comprehensive diagnostic plots for regression
    """
    fig, axes = plt.subplots(2, 2, figsize=figsize)

    # 1. Residuals vs Fitted
    axes[0, 0].scatter(results.fittedvalues, results.resid, alpha=0.6)
    axes[0, 0].axhline(y=0, color='r', linestyle='--')
    axes[0, 0].set_xlabel('Fitted values')
    axes[0, 0].set_ylabel('Residuals')
    axes[0, 0].set_title('Residuals vs Fitted')

    # Add lowess smooth
    from statsmodels.nonparametric.smoothers_lowess import lowess
    smoothed = lowess(results.resid, results.fittedvalues, frac=0.3)
    axes[0, 0].plot(smoothed[:, 0], smoothed[:, 1], 'g-', lw=2)

    # 2. Q-Q plot
    qqplot(results.resid, line='s', ax=axes[0, 1])
    axes[0, 1].set_title('Normal Q-Q')

    # 3. Scale-Location
    resid_std = np.sqrt(np.abs(results.resid_pearson))
    axes[1, 0].scatter(results.fittedvalues, resid_std, alpha=0.6)
    axes[1, 0].set_xlabel('Fitted values')
    axes[1, 0].set_ylabel('√|Standardized Residuals|')
    axes[1, 0].set_title('Scale-Location')
    smoothed = lowess(resid_std, results.fittedvalues, frac=0.3)
    axes[1, 0].plot(smoothed[:, 0], smoothed[:, 1], 'r-', lw=2)

    # 4. Residuals vs Leverage
    from statsmodels.stats.outliers_influence import OLSInfluence
    influence = OLSInfluence(results)
    leverage = influence.hat_matrix_diag
    cooks_d = influence.cooks_distance[0]

    axes[1, 1].scatter(leverage, results.resid_pearson, alpha=0.6)
    axes[1, 1].set_xlabel('Leverage')
    axes[1, 1].set_ylabel('Standardized Residuals')
    axes[1, 1].set_title('Residuals vs Leverage')

    # Highlight high Cook's D points
    n = len(results.resid)
    high_cooks = cooks_d > 4/n
    axes[1, 1].scatter(leverage[high_cooks], results.resid_pearson[high_cooks],
                       color='red', s=100, alpha=0.8, label="High Cook's D")
    axes[1, 1].legend()

    plt.tight_layout()
    return fig

# Usage
# fig = plot_diagnostics(results)
# plt.show()

Outlier Score Distributions

def plot_outlier_scores(detectors_dict, X, figsize=(15, 5)):
    """
    Plot outlier score distributions for multiple detectors
    """
    n_detectors = len(detectors_dict)
    fig, axes = plt.subplots(1, n_detectors, figsize=figsize)
    if n_detectors == 1:
        axes = [axes]

    for ax, (name, detector) in zip(axes, detectors_dict.items()):
        detector.fit(X)
        scores = detector.decision_scores_
        labels = detector.labels_

        # Histogram
        ax.hist(scores[labels == 0], bins=50, alpha=0.6, label='Inliers')
        ax.hist(scores[labels == 1], bins=50, alpha=0.6, label='Outliers')
        ax.set_xlabel('Outlier Score')
        ax.set_ylabel('Frequency')
        ax.set_title(name)
        ax.legend()

    plt.tight_layout()
    return fig

Quick Reference

When to Use Which Method

Key Diagnostics Checklist

[ ] Linearity: Residual vs fitted plot shows random scatter
[ ] Independence: Durbin-Watson ≈ 2, no autocorrelation
[ ] Normality: Q-Q plot follows diagonal, Shapiro-Wilk p > 0.05
[ ] Equal variance: Scale-location plot shows constant spread, Breusch-Pagan p > 0.05
[ ] Multicollinearity: All VIF < 5 (ideally < 10)
[ ] Outliers: Identify using |studentized residual| > 3
[ ] Influential points: Cook's D < 4/n, check DFFITS
[ ] Model comparison: Use AIC, BIC, cross-validation RMSE

Installation Commands

# Statistical modeling
pip install statsmodels scipy

# Machine learning
pip install scikit-learn

# Outlier detection
pip install pyod

# Optional: advanced methods
pip install xgboost lightgbm

Additional Resources

Documentation

statsmodels: https://www.statsmodels.org/stable/index.html
scikit-learn: https://scikit-learn.org/stable/
PyOD: https://pyod.readthedocs.io/
scipy.stats: https://docs.scipy.org/doc/scipy/reference/stats.html

Key Statsmodels Pages

OLS: https://www.statsmodels.org/stable/regression.html
GLM: https://www.statsmodels.org/stable/glm.html
RLM: https://www.statsmodels.org/stable/rlm.html
Diagnostics: https://www.statsmodels.org/stable/diagnostic.html

Key Sklearn Pages

Linear Models: https://scikit-learn.org/stable/modules/linear_model.html
Ensemble Methods: https://scikit-learn.org/stable/modules/ensemble.html
Outlier Detection: https://scikit-learn.org/stable/modules/outlier_detection.html
Model Selection: https://scikit-learn.org/stable/model_selection.html

Adoption

Hongyu-yu/python-regression-statistics

$ install --global

Security Scan Results

SKILL.md

Python Regression & Statistical Analysis

When to Use This Skill

Philosophy: Statistical vs Machine Learning Approaches

Statistical Regression (statsmodels, scipy.stats)

Machine Learning Regression (scikit-learn)

Recommended Workflow

Statistical Regression with statsmodels

Ordinary Least Squares (OLS)

Weighted Least Squares (WLS)

Robust Regression (RLM)

Generalized Linear Models (GLM)

Time Series Models

Model Diagnostics - statsmodels

Influential Points and Multicollinearity

Machine Learning Regression with scikit-learn

Linear Models with Regularization

Cross-Validation and Hyperparameter Tuning

Tree-Based and Ensemble Methods

Other ML Regression Methods

Building Complete Pipelines

Outlier Detection

Statistical Methods (scipy, statsmodels)

PyOD - Comprehensive Outlier Detection

Proximity-Based Detectors

Probabilistic Detectors

Linear Model Detectors

Isolation-Based Detectors

Ensemble Methods

scikit-learn Outlier Detection

Choosing the Right Outlier Detection Method

Complete Workflow Examples

Example 1: Statistical Regression with Full Diagnostics

Example 2: ML Regression with Cross-Validation

Example 3: Outlier Detection Pipeline

Example 4: Combined Regression + Outlier Detection

Common Pitfalls and Solutions

1. P-hacking and Multiple Testing

2. Overfitting with Many Features

3. Ignoring Assumptions

4. Removing Outliers Without Justification

5. Confusing Outliers vs Influential Points

6. Not Checking Prediction Intervals

Visualization Best Practices

Diagnostic Plots

Outlier Score Distributions

Quick Reference

When to Use Which Method

Key Diagnostics Checklist

Installation Commands

Additional Resources

Documentation

Key Statsmodels Pages

Key Sklearn Pages

Related Skills

Hongyu-yu/xlsx

Hongyu-yu/vasp

Hongyu-yu/transformers

Hongyu-yu/torch-geometric

Hongyu-yu/python-regression-statistics

$ install --global

Security Scan Results

SKILL.md

Python Regression & Statistical Analysis

When to Use This Skill

Philosophy: Statistical vs Machine Learning Approaches

Statistical Regression (statsmodels, scipy.stats)

Machine Learning Regression (scikit-learn)

Recommended Workflow

Statistical Regression with statsmodels

Ordinary Least Squares (OLS)

Weighted Least Squares (WLS)

Robust Regression (RLM)

Generalized Linear Models (GLM)

Time Series Models

Model Diagnostics - statsmodels