skills/analysis/statistics/modeling-strategy-guide/SKILL.md
Strategic statistical modeling, experimentation, and causal inference
npx skillsauth add wentorai/research-plugins modeling-strategy-guide
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
A skill for strategic statistical modeling applied to academic research. Covers advanced modeling decisions, experimental design, causal inference, feature engineering, and the critical thinking required to move from data to defensible conclusions.
Senior data scientists distinguish themselves not by knowing more algorithms but by asking better questions, designing cleaner experiments, and being honest about what the data can and cannot tell them. This skill translates that professional discipline into a research context, helping academics apply modern data science practices to their empirical work. It covers the strategic decisions that matter most: when to use simple models versus complex ones, how to establish causality rather than mere correlation, and how to communicate uncertainty honestly.
The skill is particularly useful for researchers working with observational data who need causal inference techniques, those designing randomized experiments who need proper power calculations and analysis plans, and anyone building predictive models who needs to avoid common overfitting and leakage pitfalls.
Decision Framework:
1. Start with the simplest model that could answer your question
2. Add complexity only when diagnostics reveal inadequacy
3. Prefer interpretable models unless prediction accuracy is the sole goal
4. Always have a baseline (mean, majority class, last observation)
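The three baselines named in step 4 can be sketched in a few lines of NumPy. This is a minimal illustration, not part of the skill itself: the function names and the toy arrays are hypothetical, and any real comparison should score the baseline with the same metric and the same held-out split as the candidate model.

```python
import numpy as np


def mean_baseline(y_train, n_test):
    """Regression baseline: always predict the training mean."""
    return np.full(n_test, np.mean(y_train), dtype=float)


def majority_class_baseline(y_train, n_test):
    """Classification baseline: always predict the most frequent training class."""
    classes, counts = np.unique(y_train, return_counts=True)
    return np.full(n_test, classes[np.argmax(counts)])


def last_observation_baseline(y_train, n_test):
    """Time-series baseline: carry the last observed value forward."""
    return np.full(n_test, y_train[-1])


# Hypothetical toy labels, for illustration only
y_train = np.array([0, 1, 1, 1, 0, 1])
y_test = np.array([1, 0, 1, 1])

pred = majority_class_baseline(y_train, len(y_test))
baseline_accuracy = float(np.mean(pred == y_test))
print(f"Majority-class baseline accuracy: {baseline_accuracy:.2f}")
```

A model that cannot clearly beat these numbers on held-out data has not yet justified a step up the complexity ladder.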
Model Complexity Ladder:
Level 1: Descriptive statistics, cross-tabulations
Level 2: Linear/logistic regression
Level 3: Regularized regression (Lasso, Ridge, Elastic Net)
Level 4: Tree ensembles (Random Forest, Gradient Boosting)
Level 5: Deep learning (only with sufficient data and clear justification)
```python
import numpy as np
import pandas as pd


def engineer_features(df: pd.DataFrame, config: dict) -> pd.DataFrame:
    """
    Apply systematic feature engineering based on domain knowledge.

    config example:
    {
        'log_transform': ['income', 'citations'],
        'interactions': [('experience', 'education')],
        'polynomial': {'age': 2},
        'time_features': 'date_column',
        'lag_features': {'metric': [1, 7, 30]}
    }
    """
    df = df.copy()  # never mutate the caller's DataFrame

    # Log transforms for right-skewed variables (log1p needs values > -1)
    for col in config.get('log_transform', []):
        df[f'{col}_log'] = np.log1p(df[col])

    # Interaction terms
    for col_a, col_b in config.get('interactions', []):
        df[f'{col_a}_x_{col_b}'] = df[col_a] * df[col_b]

    # Polynomial features (powers 2..degree; the linear term already exists)
    for col, degree in config.get('polynomial', {}).items():
        for d in range(2, degree + 1):
            df[f'{col}_pow{d}'] = df[col] ** d

    # Time-based features
    if 'time_features' in config:
        time_col = config['time_features']
        df[time_col] = pd.to_datetime(df[time_col])
        df[f'{time_col}_month'] = df[time_col].dt.month
        df[f'{time_col}_dayofweek'] = df[time_col].dt.dayofweek
        df[f'{time_col}_quarter'] = df[time_col].dt.quarter

    # Lag features (assumes rows are already sorted in time order)
    for col, lags in config.get('lag_features', {}).items():
        for lag in lags:
            df[f'{col}_lag{lag}'] = df[col].shift(lag)

    return df
```
| Method | When to Use | Key Assumption |
|--------|-------------|----------------|
| Randomized experiment | You can randomly assign treatment | Proper randomization, no attrition |
| Difference-in-differences | Policy change affects one group | Parallel trends pre-treatment |
| Regression discontinuity | Treatment assigned by cutoff | No manipulation near cutoff |
| Instrumental variables | Endogeneity present | Valid instrument (relevance + exclusion) |
| Propensity score matching | Observational data, many confounders | No unobserved confounders |
| Synthetic control | Single treated unit, many controls | Good pre-treatment fit |
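As one concrete instance from the table, the simplest two-group, two-period difference-in-differences estimate can be sketched from group means. The function name, column names, and toy numbers below are hypothetical, and the sketch returns a point estimate only: standard errors would usually come from a regression with a treated-by-post interaction, and the result is only meaningful if the parallel-trends assumption is plausible.

```python
import pandas as pd


def did_estimate(df, outcome, treated_col, post_col):
    """2x2 difference-in-differences computed from the four cell means.

    Expects treated_col and post_col to be coded 0/1.
    """
    means = df.groupby([treated_col, post_col])[outcome].mean()
    treated_change = means.loc[(1, 1)] - means.loc[(1, 0)]
    control_change = means.loc[(0, 1)] - means.loc[(0, 0)]
    # The control group's change stands in for the treated group's counterfactual trend
    return treated_change - control_change


# Hypothetical toy panel, for illustration only
toy = pd.DataFrame({
    'treated': [0, 0, 0, 0, 1, 1, 1, 1],
    'post':    [0, 0, 1, 1, 0, 0, 1, 1],
    'y':       [10.0, 12.0, 13.0, 15.0, 20.0, 22.0, 28.0, 30.0],
})

effect = did_estimate(toy, 'y', 'treated', 'post')
print(f"DiD estimate: {effect:.1f}")
```

In the toy data the control group rises by 3 and the treated group by 8, so the estimate attributes the remaining 5 to the treatment.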
```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors


def propensity_score_match(df, treatment_col, covariates, caliper=0.05):
    """
    Match treated and control units based on propensity scores.

    Uses 1:1 nearest-neighbor matching with replacement, so one control
    unit may be matched to several treated units.
    Side effect: adds a 'pscore' column to df in place.
    """
    # Estimate propensity scores
    X = df[covariates].values
    y = df[treatment_col].values
    lr = LogisticRegression(max_iter=1000, random_state=42)
    lr.fit(X, y)
    df['pscore'] = lr.predict_proba(X)[:, 1]

    # Match using nearest neighbor on the propensity score
    treated = df[df[treatment_col] == 1]
    control = df[df[treatment_col] == 0]
    nn = NearestNeighbors(n_neighbors=1, metric='euclidean')
    nn.fit(control[['pscore']].values)
    distances, indices = nn.kneighbors(treated[['pscore']].values)

    # Apply caliper: drop pairs whose score distance is too large
    valid = distances.flatten() < caliper
    matched_treated = treated[valid].index.tolist()
    matched_control = control.iloc[indices.flatten()[valid]].index.tolist()

    return {
        'matched_treated': matched_treated,
        'matched_control': matched_control,
        'n_matched': int(valid.sum()),
        'n_unmatched': int((~valid).sum()),
        'balance_check': 'Run standardized mean differences on covariates'
    }
```
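The balance check that the function asks for can be sketched as follows. This is a minimal illustration under stated assumptions: the helper name `standardized_mean_differences` and the toy data are hypothetical, the treatment column is assumed to be coded 0/1, and the pooled standard deviation used here is one common convention among several.

```python
import numpy as np
import pandas as pd


def standardized_mean_differences(df, treatment_col, covariates):
    """Standardized mean difference (treated minus control) per covariate."""
    treated = df[df[treatment_col] == 1]
    control = df[df[treatment_col] == 0]
    smd = {}
    for col in covariates:
        # Pool the two group variances so the scale does not depend on group size
        pooled_sd = np.sqrt(
            (treated[col].var(ddof=1) + control[col].var(ddof=1)) / 2
        )
        diff = treated[col].mean() - control[col].mean()
        smd[col] = diff / pooled_sd if pooled_sd > 0 else np.nan
    return pd.Series(smd, name='smd')


# Hypothetical toy sample, for illustration only
toy = pd.DataFrame({
    'treat':  [1, 1, 1, 0, 0, 0],
    'age':    [30, 40, 50, 30, 40, 50],
    'income': [2.0, 4.0, 6.0, 0.0, 2.0, 4.0],
})

smd = standardized_mean_differences(toy, 'treat', ['age', 'income'])
print(smd)
```

Running the check on the full sample and again on the matched rows (for example, `df.loc[result['matched_treated'] + result['matched_control']]`) shows whether matching improved balance. An absolute value below roughly 0.1 is a widely used rule of thumb for acceptable balance, not a formal test.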
```python
import numpy as np
from statsmodels.stats.power import NormalIndPower


def design_experiment(baseline_rate, mde, alpha=0.05, power=0.80):
    """
    Calculate required sample size for a two-proportion z-test.

    Args:
        baseline_rate: Current conversion/success rate
        mde: Minimum detectable effect (absolute change)
        alpha: Significance level (two-sided)
        power: Statistical power
    """
    # Standardize the effect by the baseline standard deviation.
    # This approximation uses the baseline variance for both groups.
    effect_size = mde / np.sqrt(baseline_rate * (1 - baseline_rate))

    analysis = NormalIndPower()
    n = analysis.solve_power(
        effect_size=effect_size, alpha=alpha, power=power, ratio=1.0
    )

    return {
        'sample_size_per_group': int(np.ceil(n)),
        'total_sample_size': int(np.ceil(n)) * 2,
        'baseline_rate': baseline_rate,
        'minimum_detectable_effect': mde,
        'alpha': alpha,
        'power': power
    }
```
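As a sanity check on the solver, the same calculation can be written in closed form: with equal group sizes and the baseline variance used for both groups, the per-group sample size is approximately n = 2 (z_{1-α/2} + z_{power})² p(1-p) / mde². The sketch below assumes that approximation; the function name `sample_size_closed_form` and the example inputs are hypothetical, and the result should closely agree with, but need not exactly equal, the solver's output.

```python
import numpy as np
from scipy.stats import norm


def sample_size_closed_form(baseline_rate, mde, alpha=0.05, power=0.80):
    """Per-group sample size from the normal-approximation formula."""
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided critical value
    z_power = norm.ppf(power)          # quantile for the target power
    variance = baseline_rate * (1 - baseline_rate)
    n = 2 * (z_alpha + z_power) ** 2 * variance / mde ** 2
    return int(np.ceil(n))


# Hypothetical inputs: 10% baseline rate, 2 and 1 percentage point effects
n_per_group = sample_size_closed_form(0.10, 0.02)
n_smaller_effect = sample_size_closed_form(0.10, 0.01)
print(n_per_group, n_smaller_effect)
```

The formula makes the cost of small effects visible: halving the minimum detectable effect roughly quadruples the required sample size.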
Before running any experiment, document:
| Data Type | Recommended CV | Rationale |
|-----------|----------------|-----------|
| i.i.d. data | Stratified K-fold (K=5 or 10) | Preserves class balance |
| Time series | Time-series split (expanding window) | Prevents look-ahead bias |
| Grouped data | Group K-fold | Prevents data leakage across groups |
| Small dataset (n<200) | Leave-one-out or repeated K-fold | Maximizes training data |
| Spatial data | Spatial blocking | Prevents spatial autocorrelation leakage |
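The first three rows of the table map directly onto scikit-learn splitters. The sketch below is one possible way to wire that up: the helper `make_splitter`, its `data_type` labels, and the toy arrays are hypothetical, and the small-dataset and spatial rows are left out because they need choices (repeat counts, block definitions) that depend on the study.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, StratifiedKFold, TimeSeriesSplit


def make_splitter(data_type, n_splits=5):
    """Return a cross-validation splitter for a given data structure."""
    if data_type == 'iid':
        return StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    if data_type == 'time_series':
        # Expanding window: each fold trains on everything before its test block
        return TimeSeriesSplit(n_splits=n_splits)
    if data_type == 'grouped':
        return GroupKFold(n_splits=n_splits)
    raise ValueError(f"Unknown data_type: {data_type!r}")


# Hypothetical toy data: 20 rows, 5 groups of 4 rows each
X = np.arange(40).reshape(20, 2)
y = np.tile([0, 1], 10)
groups = np.repeat(np.arange(5), 4)

# Grouped CV: no group should appear on both sides of a split
leaks = 0
for train_idx, test_idx in make_splitter('grouped').split(X, y, groups=groups):
    leaks += len(set(groups[train_idx]) & set(groups[test_idx]))

# Time-series CV: every training index should precede every test index
ordered = all(
    train_idx.max() < test_idx.min()
    for train_idx, test_idx in make_splitter('time_series', n_splits=4).split(X)
)
print(f"group leaks: {leaks}, time order preserved: {ordered}")
```

Checking these two properties directly on the fold indices is a cheap guard against the leakage the table warns about.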