name: tooluniverse-statistical-modeling description: Perform statistical modeling and regression analysis on biomedical datasets. Supports linear regression, logistic regression (binary/ordinal/multinomial), mixed-effects models, Cox proportional hazards survival analysis, Kaplan-Meier estimation, and comprehensive model diagnostics. Extracts odds ratios, hazard ratios, confidence intervals, p-values, and effect sizes. Designed to solve BixBench statistical reasoning questions involving clinical/experimental data. Use when asked to fit regression models, compute odds ratios, perform survival analysis, run statistical tests, or interpret model coefficients from provided data.

Statistical Modeling for Biomedical Data Analysis

Comprehensive statistical modeling skill for fitting regression models, survival models, and mixed-effects models to biomedical data. Produces publication-quality statistical summaries with odds ratios, hazard ratios, confidence intervals, and p-values.

Features

✅ Linear Regression - OLS for continuous outcomes with diagnostic tests ✅ Logistic Regression - Binary, ordinal, and multinomial models with odds ratios ✅ Survival Analysis - Cox proportional hazards and Kaplan-Meier curves ✅ Mixed-Effects Models - LMM/GLMM for hierarchical/repeated measures data ✅ ANOVA - One-way/two-way ANOVA, per-feature ANOVA for omics data ✅ Model Diagnostics - Assumption checking, fit statistics, residual analysis ✅ Statistical Tests - t-tests, chi-square, Mann-Whitney, Kruskal-Wallis, etc.

Quick Start

Binary Logistic Regression

import statsmodels.formula.api as smf
import numpy as np

# Fit logistic regression
model = smf.logit('disease ~ exposure + age + sex', data=df).fit(disp=0)

# Extract odds ratios
odds_ratios = np.exp(model.params)
conf_int = np.exp(model.conf_int())

print(f"Odds Ratio for exposure: {odds_ratios['exposure']:.4f}")
print(f"95% CI: ({conf_int.loc['exposure', 0]:.4f}, {conf_int.loc['exposure', 1]:.4f})")
print(f"P-value: {model.pvalues['exposure']:.6f}")

Cox Proportional Hazards

from lifelines import CoxPHFitter

# Fit Cox model
cph = CoxPHFitter()
cph.fit(df[['time', 'event', 'treatment', 'age', 'stage']],
        duration_col='time', event_col='event')

# Get hazard ratio
hr = cph.hazard_ratios_['treatment']
print(f"Hazard Ratio: {hr:.4f}")
print(f"Concordance: {cph.concordance_index_:.4f}")

See QUICK_START.md for 8 complete examples.

Model Selection Decision Tree

START: What type of outcome variable?
│
├─ CONTINUOUS (height, blood pressure, score)
│  ├─ Independent observations → Linear Regression (OLS)
│  ├─ Repeated measures → Mixed-Effects Model (LMM)
│  └─ Count data → Poisson/Negative Binomial
│
├─ BINARY (yes/no, disease/healthy)
│  ├─ Independent observations → Logistic Regression
│  ├─ Repeated measures → Logistic Mixed-Effects (GLMM/GEE)
│  └─ Rare events → Firth logistic regression
│
├─ ORDINAL (mild/moderate/severe, stages I/II/III/IV)
│  └─ Ordinal Logistic Regression (Proportional Odds)
│
├─ MULTINOMIAL (>2 unordered categories)
│  └─ Multinomial Logistic Regression
│
└─ TIME-TO-EVENT (survival time + censoring)
   ├─ Regression → Cox Proportional Hazards
   └─ Survival curves → Kaplan-Meier

When to Use

Apply this skill when user asks:

"What is the odds ratio of X associated with Y?"
"What is the hazard ratio for treatment?"
"Fit a linear regression of Y on X1, X2, X3"
"Perform ordinal logistic regression for severity outcome"
"What is the Kaplan-Meier survival estimate at time T?"
"What is the percentage reduction in odds ratio after adjusting for confounders?"
"Run a mixed-effects model with random intercepts"
"Compute the interaction term between A and B"
"What is the F-statistic from ANOVA comparing groups?"
"Test if gene/miRNA expression differs across cell types"
"Perform one-way ANOVA on expression data"

Workflow

Phase 0: Data Validation

Goal: Load data, identify variable types, check for missing values.

⚠️ CRITICAL: Identify the Outcome Variable First

Before any analysis, verify what you're actually predicting:

Read the full question - Look for "predict [outcome]", "model [outcome]", or "dependent variable"
Examine available columns - List all columns in the dataset
Match question to data - Find the column that matches the described outcome
Verify outcome exists - Don't create outcome variables from predictors

Common mistake (bix-51-q3 example):

❌ Question mentions "obesity" → Assumed outcome = BMI ≥ 30 (circular logic with BMI predictor)
✅ Read full question → Actual outcome = treatment response (PR vs non-PR)
Always check data columns first: print(df.columns.tolist())

import pandas as pd
import numpy as np

# Load data
df = pd.read_csv('data.csv')

# Check structure
print(f"Observations: {len(df)}")
print(f"Variables: {len(df.columns)}")
print(f"Missing: {df.isnull().sum().sum()}")

# Detect variable types
for col in df.columns:
    n_unique = df[col].nunique()
    if n_unique == 2:
        print(f"{col}: binary")
    elif n_unique <= 10 and df[col].dtype == 'object':
        print(f"{col}: categorical ({n_unique} levels)")
    elif df[col].dtype in ['float64', 'int64']:
        print(f"{col}: continuous (mean={df[col].mean():.2f})")

Phase 1: Model Fitting

Goal: Fit appropriate model based on outcome type.

Linear Regression

import statsmodels.formula.api as smf

# R-style formula (recommended)
model = smf.ols('outcome ~ predictor1 + predictor2 + age', data=df).fit()

# Results
print(f"R-squared: {model.rsquared:.4f}")
print(f"AIC: {model.aic:.2f}")
print(model.summary())

Logistic Regression

# Fit model
model = smf.logit('disease ~ exposure + age + sex', data=df).fit(disp=0)

# Odds ratios
ors = np.exp(model.params)
ci = np.exp(model.conf_int())

for var in ['exposure', 'age', 'sex_M']:
    print(f"{var}: OR={ors[var]:.4f}, CI=({ci.loc[var, 0]:.4f}, {ci.loc[var, 1]:.4f})")

Ordinal Logistic Regression

from statsmodels.miscmodels.ordinal_model import OrderedModel

# Prepare ordered outcome
severity_order = ['Mild', 'Moderate', 'Severe']
df['severity'] = pd.Categorical(df['severity'], categories=severity_order, ordered=True)
y = df['severity'].cat.codes

# Fit model
X = pd.get_dummies(df[['exposure', 'age', 'sex']], drop_first=True, dtype=float)
model = OrderedModel(y, X, distr='logit').fit(method='bfgs', disp=0)

# Odds ratios
ors = np.exp(model.params[:len(X.columns)])
print(f"Exposure OR: {ors[0]:.4f}")

Cox Proportional Hazards

from lifelines import CoxPHFitter

# Fit model
cph = CoxPHFitter()
cph.fit(df[['time', 'event', 'treatment', 'age']],
        duration_col='time', event_col='event')

# Hazard ratios
print(f"HR (treatment): {cph.hazard_ratios_['treatment']:.4f}")
print(f"Concordance: {cph.concordance_index_:.4f}")

See references/ for detailed examples of each model type.

Statistical Tests (t-test, ANOVA, Chi-square)

One-way ANOVA: Compare means across ≥3 groups

from scipy import stats

# Single ANOVA (one outcome, multiple groups)
group1 = df[df['celltype'] == 'CD4']['expression']
group2 = df[df['celltype'] == 'CD8']['expression']
group3 = df[df['celltype'] == 'CD14']['expression']

f_stat, p_value = stats.f_oneway(group1, group2, group3)
print(f"F-statistic: {f_stat:.4f}, p-value: {p_value:.6f}")

⚠️ CRITICAL: Multi-feature ANOVA Decision Tree

When data has multiple features (genes, miRNAs, metabolites, etc.), there are TWO approaches:

Question: "What is the F-statistic comparing [feature] expression across groups?"

DECISION TREE:
│
├─ Does question specify "the F-statistic" (singular)?
│  │
│  ├─ YES, singular → Likely asking for SPECIFIC FEATURE(S) F-statistic
│  │  │
│  │  ├─ Are there thousands of features (genes, miRNAs)?
│  │  │  YES → Per-feature approach (Method B below)
│  │  │
│  │  └─ Is there one feature of interest?
│  │     YES → Single feature ANOVA (Method A below)
│  │
│  └─ NO, asks about "all features" or "genes" (plural)?
│     YES → Aggregate approach or per-feature summary
│
└─ When unsure: Calculate PER-FEATURE and report summary statistics

Method A: Aggregate ANOVA (all features combined)

Use when: Testing overall expression differences across all features
Result: Single F-statistic representing global effect

# Flatten all features across all samples per group
groups_agg = []
for celltype in ['CD4', 'CD8', 'CD14']:
    samples = df[df['celltype'] == celltype]
    # Flatten: all features × all samples in this group
    all_values = expression_matrix.loc[:, samples.index].values.flatten()
    groups_agg.append(all_values)

f_stat_agg, p_value = stats.f_oneway(*groups_agg)
print(f"Aggregate F-statistic: {f_stat_agg:.4f}")
# Result: Very large F-statistic (e.g., 153.8)

Method B: Per-feature ANOVA (each feature separately) ⭐ RECOMMENDED for gene expression

Use when: Testing EACH feature individually (most common in genomics)
Result: Distribution of F-statistics (one per feature)

# Calculate F-statistic FOR EACH FEATURE separately
per_feature_f_stats = []

for feature in expression_matrix.index:  # For each gene/miRNA/metabolite
    groups = []
    for celltype in ['CD4', 'CD8', 'CD14']:
        samples = df[df['celltype'] == celltype]
        # Get expression of THIS feature in THIS cell type
        values = expression_matrix.loc[feature, samples.index].values
        groups.append(values)

    f_stat, _ = stats.f_oneway(*groups)
    if not np.isnan(f_stat):
        per_feature_f_stats.append((feature, f_stat))

# Summary statistics
f_values = [f for _, f in per_feature_f_stats]
print(f"Per-feature F-statistics:")
print(f"  Median: {np.median(f_values):.4f}")
print(f"  Mean: {np.mean(f_values):.4f}")
print(f"  Range: [{np.min(f_values):.4f}, {np.max(f_values):.4f}]")

# Find features in specific range (e.g., 0.76-0.78)
target_features = [(name, f) for name, f in per_feature_f_stats
                   if 0.76 <= f <= 0.78]
if target_features:
    print(f"Features with F ∈ [0.76, 0.78]: {len(target_features)}")
    for name, f_val in target_features:
        print(f"  {name}: F = {f_val:.6f}")

Key Differences:

| Aspect | Method A (Aggregate) | Method B (Per-feature) | |--------|---------------------|------------------------| | Interpretation | Overall expression difference | Feature-specific differences | | Result | 1 F-statistic | N F-statistics (N = # features) | | Typical value | Very large (e.g., 153.8) | Small to large (e.g., 0.1 to 100+) | | Use case | Global effect size | Gene/biomarker discovery | | Common in | Rarely used | Genomics, proteomics, metabolomics ⭐ |

Real-world example (BixBench bix-36-q1):

Question: "What is the F-statistic comparing miRNA expression across immune cell types?"
Expected: 0.76-0.78
Method A (aggregate): 153.836 ❌ WRONG
Method B (per-miRNA): Found 2 miRNAs with F ∈ [0.76, 0.78] ✅ CORRECT

Default assumption for gene expression data: Use Method B (per-feature)

Phase 2: Model Diagnostics

Goal: Check model assumptions and fit quality.

OLS Diagnostics

from scipy import stats as scipy_stats
from statsmodels.stats.diagnostic import het_breuschpagan

# Residual normality
residuals = model.resid
sw_stat, sw_p = scipy_stats.shapiro(residuals)
print(f"Shapiro-Wilk: p={sw_p:.4f} (normal if p>0.05)")

# Heteroscedasticity
bp_stat, bp_p, _, _ = het_breuschpagan(residuals, model.model.exog)
print(f"Breusch-Pagan: p={bp_p:.4f} (homoscedastic if p>0.05)")

# VIF (multicollinearity)
from statsmodels.stats.outliers_influence import variance_inflation_factor
X = model.model.exog
for i in range(1, X.shape[1]):  # Skip intercept
    vif = variance_inflation_factor(X, i)
    print(f"{model.model.exog_names[i]}: VIF={vif:.2f}")

Proportional Hazards Test

# Test PH assumption for Cox model
results = cph.check_assumptions(df, p_value_threshold=0.05, show_plots=False)
if len(results) == 0:
    print("✅ Proportional hazards assumption met")
else:
    print(f"⚠️  PH violated for: {results}")

See references/troubleshooting.md for common diagnostic issues.

Phase 3: Interpretation

Goal: Generate publication-quality summary.

Odds Ratio Interpretation

def interpret_odds_ratio(or_val, ci_lower, ci_upper, p_value):
    """Interpret odds ratio with clinical meaning."""
    if or_val > 1:
        pct_increase = (or_val - 1) * 100
        direction = f"{pct_increase:.1f}% increase in odds"
    else:
        pct_decrease = (1 - or_val) * 100
        direction = f"{pct_decrease:.1f}% decrease in odds"

    sig = "significant" if p_value < 0.05 else "not significant"
    ci_contains_null = ci_lower <= 1 <= ci_upper

    return f"{direction} (OR={or_val:.4f}, 95% CI [{ci_lower:.4f}, {ci_upper:.4f}], p={p_value:.6f}, {sig})"

Common BixBench Patterns

Pattern 1: Odds Ratio from Ordinal Regression

Question: "What is the odds ratio of disease severity associated with exposure?"

Solution:

Identify ordinal outcome (mild/moderate/severe)
Fit ordinal logistic regression (proportional odds model)
Extract OR = exp(coefficient for exposure)
Report with CI and p-value

Pattern 2: Percentage Reduction in Odds

Question: "What is the percentage reduction in OR after adjusting for confounders?"

Solution:

# Unadjusted model
model_crude = smf.logit('outcome ~ exposure', data=df).fit(disp=0)
or_crude = np.exp(model_crude.params['exposure'])

# Adjusted model
model_adj = smf.logit('outcome ~ exposure + age + sex', data=df).fit(disp=0)
or_adj = np.exp(model_adj.params['exposure'])

# Percentage reduction
pct_reduction = (or_crude - or_adj) / or_crude * 100
print(f"Percentage reduction: {pct_reduction:.1f}%")

Pattern 3: Interaction Effects

Question: "What is the odds ratio for the interaction between A and B?"

Solution:

# Fit model with interaction
model = smf.logit('outcome ~ A * B + age', data=df).fit(disp=0)

# Interaction OR
interaction_coef = model.params['A:B']
interaction_or = np.exp(interaction_coef)
print(f"Interaction OR: {interaction_or:.4f}")

Pattern 4: Survival Analysis

Question: "What is the hazard ratio for treatment?"

Solution:

Load survival data (time, event, covariates)
Fit Cox proportional hazards model
Extract HR = exp(coefficient)
Report with CI and concordance index

Pattern 5: Multi-feature ANOVA (Gene Expression)

Question: "What is the F-statistic comparing miRNA expression across cell types?"

Solution:

Identify that data has multiple features (genes/miRNAs)
Use per-feature ANOVA (NOT aggregate)
Calculate F-statistic for EACH feature separately
If question asks for "the F-statistic" (singular):
- Check if specific features match expected range
- Report those feature(s) F-statistics
If question asks for summary: report median/mean/distribution

Critical: For gene expression data, default to per-feature ANOVA. Aggregate ANOVA gives F-statistics ~200× larger and is rarely correct.

See references/bixbench_patterns.md for 15+ question patterns.

Statsmodels vs Scikit-learn

| Use Case | Library | Reason | |----------|---------|--------| | Inference (p-values, CIs, ORs) | statsmodels | Full statistical output | | Prediction (accuracy, AUC) | scikit-learn | Better prediction tools | | Mixed-effects models | statsmodels | Only option | | Regularization (LASSO, Ridge) | scikit-learn | Better optimization | | Survival analysis | lifelines | Specialized library |

General rule: Use statsmodels for BixBench questions (they ask for p-values, ORs, HRs).

Python Package Requirements

statsmodels>=0.14.0
scikit-learn>=1.3.0
lifelines>=0.27.0
pandas>=2.0.0
numpy>=1.24.0
scipy>=1.10.0

File Structure

tooluniverse-statistical-modeling/
├── SKILL.md                          # This file
├── QUICK_START.md                    # 8 quick examples
├── EXAMPLES.md                       # Legacy examples (kept for reference)
├── TOOLS_REFERENCE.md                # ToolUniverse tool catalog
├── test_skill.py                     # Comprehensive test suite
├── references/
│   ├── logistic_regression.md        # Detailed logistic examples
│   ├── ordinal_logistic.md           # Ordinal logit guide
│   ├── cox_regression.md             # Survival analysis guide
│   ├── linear_models.md              # OLS and mixed-effects
│   ├── bixbench_patterns.md          # 15+ question patterns
│   └── troubleshooting.md            # Diagnostic issues
└── scripts/
    ├── format_statistical_output.py  # Format results for reporting
    └── model_diagnostics.py          # Automated diagnostics

Key Principles

Data-first approach - Always inspect and validate data before modeling
Model selection by outcome type - Use decision tree above
Assumption checking - Verify model assumptions (linearity, proportional hazards, etc.)
Complete reporting - Always report effect sizes, CIs, p-values, and model fit statistics
Confounder awareness - Adjust for confounders when specified or clinically relevant
Reproducible analysis - All code must be deterministic and reproducible
Robust error handling - Graceful handling of convergence failures, separation, collinearity
Round correctly - Match the precision requested (typically 2-4 decimal places)

Completeness Checklist

Before finalizing any statistical analysis:

[ ] Outcome variable identified: Verified which column is the actual outcome (not assumed)
[ ] Data validated: N, missing values, variable types confirmed
[ ] Multi-feature data identified: If data has multiple features (genes, miRNAs), use per-feature approach
[ ] Model appropriate: Outcome type matches model family
[ ] Assumptions checked: Relevant diagnostics performed
[ ] Effect sizes reported: OR/HR/Cohen's d with CIs
[ ] P-values reported: With appropriate correction if needed
[ ] Model fit assessed: R-squared, AIC/BIC, concordance
[ ] Results interpreted: Plain-language interpretation
[ ] Precision correct: Numbers rounded appropriately
[ ] Confounders addressed: Adjusted analyses if applicable

References

statsmodels: https://www.statsmodels.org/
lifelines: https://lifelines.readthedocs.io/
scikit-learn: https://scikit-learn.org/
Ordinal models: statsmodels.miscmodels.ordinal_model.OrderedModel

ToolUniverse Integration

While this skill is primarily computational, ToolUniverse tools can provide data:

| Use Case | Tools | |----------|-------| | Clinical trial data | clinical_trials_search | | Drug safety outcomes | FAERS_calculate_disproportionality | | Gene-disease associations | OpenTargets_target_disease_evidence | | Biomarker data | fda_pharmacogenomic_biomarkers |

See TOOLS_REFERENCE.md for complete tool catalog.

Support

For detailed examples and troubleshooting:

Logistic regression: references/logistic_regression.md
Ordinal models: references/ordinal_logistic.md
Survival analysis: references/cox_regression.md
Linear/mixed models: references/linear_models.md
BixBench patterns: references/bixbench_patterns.md
Diagnostics: references/troubleshooting.md

name: tooluniverse-statistical-modeling description: Perform statistical modeling and regression analysis on biomedical datasets. Supports linear regression, logistic regression (binary/ordinal/multinomial), mixed-effects models, Cox proportional hazards survival analysis, Kaplan-Meier estimation, and comprehensive model diagnostics. Extracts odds ratios, hazard ratios, confidence intervals, p-values, and effect sizes. Designed to solve BixBench statistical reasoning questions involving clinical/experimental data. Use when asked to fit regression models, compute odds ratios, perform survival analysis, run statistical tests, or interpret model coefficients from provided data.

Statistical Modeling for Biomedical Data Analysis

Features

Quick Start

Binary Logistic Regression

import statsmodels.formula.api as smf
import numpy as np

# Fit logistic regression
model = smf.logit('disease ~ exposure + age + sex', data=df).fit(disp=0)

# Extract odds ratios
odds_ratios = np.exp(model.params)
conf_int = np.exp(model.conf_int())

print(f"Odds Ratio for exposure: {odds_ratios['exposure']:.4f}")
print(f"95% CI: ({conf_int.loc['exposure', 0]:.4f}, {conf_int.loc['exposure', 1]:.4f})")
print(f"P-value: {model.pvalues['exposure']:.6f}")

Cox Proportional Hazards

from lifelines import CoxPHFitter

# Fit Cox model
cph = CoxPHFitter()
cph.fit(df[['time', 'event', 'treatment', 'age', 'stage']],
        duration_col='time', event_col='event')

# Get hazard ratio
hr = cph.hazard_ratios_['treatment']
print(f"Hazard Ratio: {hr:.4f}")
print(f"Concordance: {cph.concordance_index_:.4f}")

See QUICK_START.md for 8 complete examples.

Model Selection Decision Tree

START: What type of outcome variable?
│
├─ CONTINUOUS (height, blood pressure, score)
│  ├─ Independent observations → Linear Regression (OLS)
│  ├─ Repeated measures → Mixed-Effects Model (LMM)
│  └─ Count data → Poisson/Negative Binomial
│
├─ BINARY (yes/no, disease/healthy)
│  ├─ Independent observations → Logistic Regression
│  ├─ Repeated measures → Logistic Mixed-Effects (GLMM/GEE)
│  └─ Rare events → Firth logistic regression
│
├─ ORDINAL (mild/moderate/severe, stages I/II/III/IV)
│  └─ Ordinal Logistic Regression (Proportional Odds)
│
├─ MULTINOMIAL (>2 unordered categories)
│  └─ Multinomial Logistic Regression
│
└─ TIME-TO-EVENT (survival time + censoring)
   ├─ Regression → Cox Proportional Hazards
   └─ Survival curves → Kaplan-Meier

When to Use

Apply this skill when user asks:

"What is the odds ratio of X associated with Y?"
"What is the hazard ratio for treatment?"
"Fit a linear regression of Y on X1, X2, X3"
"Perform ordinal logistic regression for severity outcome"
"What is the Kaplan-Meier survival estimate at time T?"
"What is the percentage reduction in odds ratio after adjusting for confounders?"
"Run a mixed-effects model with random intercepts"
"Compute the interaction term between A and B"
"What is the F-statistic from ANOVA comparing groups?"
"Test if gene/miRNA expression differs across cell types"
"Perform one-way ANOVA on expression data"

Workflow

Phase 0: Data Validation

Goal: Load data, identify variable types, check for missing values.

⚠️ CRITICAL: Identify the Outcome Variable First

Before any analysis, verify what you're actually predicting:

Read the full question - Look for "predict [outcome]", "model [outcome]", or "dependent variable"
Examine available columns - List all columns in the dataset
Match question to data - Find the column that matches the described outcome
Verify outcome exists - Don't create outcome variables from predictors

Common mistake (bix-51-q3 example):

❌ Question mentions "obesity" → Assumed outcome = BMI ≥ 30 (circular logic with BMI predictor)
✅ Read full question → Actual outcome = treatment response (PR vs non-PR)
Always check data columns first: print(df.columns.tolist())

import pandas as pd
import numpy as np

# Load data
df = pd.read_csv('data.csv')

# Check structure
print(f"Observations: {len(df)}")
print(f"Variables: {len(df.columns)}")
print(f"Missing: {df.isnull().sum().sum()}")

# Detect variable types
for col in df.columns:
    n_unique = df[col].nunique()
    if n_unique == 2:
        print(f"{col}: binary")
    elif n_unique <= 10 and df[col].dtype == 'object':
        print(f"{col}: categorical ({n_unique} levels)")
    elif df[col].dtype in ['float64', 'int64']:
        print(f"{col}: continuous (mean={df[col].mean():.2f})")

Phase 1: Model Fitting

Goal: Fit appropriate model based on outcome type.

Linear Regression

import statsmodels.formula.api as smf

# R-style formula (recommended)
model = smf.ols('outcome ~ predictor1 + predictor2 + age', data=df).fit()

# Results
print(f"R-squared: {model.rsquared:.4f}")
print(f"AIC: {model.aic:.2f}")
print(model.summary())

Logistic Regression

# Fit model
model = smf.logit('disease ~ exposure + age + sex', data=df).fit(disp=0)

# Odds ratios
ors = np.exp(model.params)
ci = np.exp(model.conf_int())

for var in ['exposure', 'age', 'sex_M']:
    print(f"{var}: OR={ors[var]:.4f}, CI=({ci.loc[var, 0]:.4f}, {ci.loc[var, 1]:.4f})")

Ordinal Logistic Regression

from statsmodels.miscmodels.ordinal_model import OrderedModel

# Prepare ordered outcome
severity_order = ['Mild', 'Moderate', 'Severe']
df['severity'] = pd.Categorical(df['severity'], categories=severity_order, ordered=True)
y = df['severity'].cat.codes

# Fit model
X = pd.get_dummies(df[['exposure', 'age', 'sex']], drop_first=True, dtype=float)
model = OrderedModel(y, X, distr='logit').fit(method='bfgs', disp=0)

# Odds ratios
ors = np.exp(model.params[:len(X.columns)])
print(f"Exposure OR: {ors[0]:.4f}")

Cox Proportional Hazards

from lifelines import CoxPHFitter

# Fit model
cph = CoxPHFitter()
cph.fit(df[['time', 'event', 'treatment', 'age']],
        duration_col='time', event_col='event')

# Hazard ratios
print(f"HR (treatment): {cph.hazard_ratios_['treatment']:.4f}")
print(f"Concordance: {cph.concordance_index_:.4f}")

See references/ for detailed examples of each model type.

Statistical Tests (t-test, ANOVA, Chi-square)

One-way ANOVA: Compare means across ≥3 groups

from scipy import stats

# Single ANOVA (one outcome, multiple groups)
group1 = df[df['celltype'] == 'CD4']['expression']
group2 = df[df['celltype'] == 'CD8']['expression']
group3 = df[df['celltype'] == 'CD14']['expression']

f_stat, p_value = stats.f_oneway(group1, group2, group3)
print(f"F-statistic: {f_stat:.4f}, p-value: {p_value:.6f}")

⚠️ CRITICAL: Multi-feature ANOVA Decision Tree

When data has multiple features (genes, miRNAs, metabolites, etc.), there are TWO approaches:

Question: "What is the F-statistic comparing [feature] expression across groups?"

DECISION TREE:
│
├─ Does question specify "the F-statistic" (singular)?
│  │
│  ├─ YES, singular → Likely asking for SPECIFIC FEATURE(S) F-statistic
│  │  │
│  │  ├─ Are there thousands of features (genes, miRNAs)?
│  │  │  YES → Per-feature approach (Method B below)
│  │  │
│  │  └─ Is there one feature of interest?
│  │     YES → Single feature ANOVA (Method A below)
│  │
│  └─ NO, asks about "all features" or "genes" (plural)?
│     YES → Aggregate approach or per-feature summary
│
└─ When unsure: Calculate PER-FEATURE and report summary statistics

Method A: Aggregate ANOVA (all features combined)

Use when: Testing overall expression differences across all features
Result: Single F-statistic representing global effect

# Flatten all features across all samples per group
groups_agg = []
for celltype in ['CD4', 'CD8', 'CD14']:
    samples = df[df['celltype'] == celltype]
    # Flatten: all features × all samples in this group
    all_values = expression_matrix.loc[:, samples.index].values.flatten()
    groups_agg.append(all_values)

f_stat_agg, p_value = stats.f_oneway(*groups_agg)
print(f"Aggregate F-statistic: {f_stat_agg:.4f}")
# Result: Very large F-statistic (e.g., 153.8)

Method B: Per-feature ANOVA (each feature separately) ⭐ RECOMMENDED for gene expression

Use when: Testing EACH feature individually (most common in genomics)
Result: Distribution of F-statistics (one per feature)

# Calculate F-statistic FOR EACH FEATURE separately
per_feature_f_stats = []

for feature in expression_matrix.index:  # For each gene/miRNA/metabolite
    groups = []
    for celltype in ['CD4', 'CD8', 'CD14']:
        samples = df[df['celltype'] == celltype]
        # Get expression of THIS feature in THIS cell type
        values = expression_matrix.loc[feature, samples.index].values
        groups.append(values)

    f_stat, _ = stats.f_oneway(*groups)
    if not np.isnan(f_stat):
        per_feature_f_stats.append((feature, f_stat))

# Summary statistics
f_values = [f for _, f in per_feature_f_stats]
print(f"Per-feature F-statistics:")
print(f"  Median: {np.median(f_values):.4f}")
print(f"  Mean: {np.mean(f_values):.4f}")
print(f"  Range: [{np.min(f_values):.4f}, {np.max(f_values):.4f}]")

# Find features in specific range (e.g., 0.76-0.78)
target_features = [(name, f) for name, f in per_feature_f_stats
                   if 0.76 <= f <= 0.78]
if target_features:
    print(f"Features with F ∈ [0.76, 0.78]: {len(target_features)}")
    for name, f_val in target_features:
        print(f"  {name}: F = {f_val:.6f}")

Key Differences:

Real-world example (BixBench bix-36-q1):

Question: "What is the F-statistic comparing miRNA expression across immune cell types?"
Expected: 0.76-0.78
Method A (aggregate): 153.836 ❌ WRONG
Method B (per-miRNA): Found 2 miRNAs with F ∈ [0.76, 0.78] ✅ CORRECT

Default assumption for gene expression data: Use Method B (per-feature)

Phase 2: Model Diagnostics

Goal: Check model assumptions and fit quality.

OLS Diagnostics

from scipy import stats as scipy_stats
from statsmodels.stats.diagnostic import het_breuschpagan

# Residual normality
residuals = model.resid
sw_stat, sw_p = scipy_stats.shapiro(residuals)
print(f"Shapiro-Wilk: p={sw_p:.4f} (normal if p>0.05)")

# Heteroscedasticity
bp_stat, bp_p, _, _ = het_breuschpagan(residuals, model.model.exog)
print(f"Breusch-Pagan: p={bp_p:.4f} (homoscedastic if p>0.05)")

# VIF (multicollinearity)
from statsmodels.stats.outliers_influence import variance_inflation_factor
X = model.model.exog
for i in range(1, X.shape[1]):  # Skip intercept
    vif = variance_inflation_factor(X, i)
    print(f"{model.model.exog_names[i]}: VIF={vif:.2f}")

Proportional Hazards Test

# Test PH assumption for Cox model
results = cph.check_assumptions(df, p_value_threshold=0.05, show_plots=False)
if len(results) == 0:
    print("✅ Proportional hazards assumption met")
else:
    print(f"⚠️  PH violated for: {results}")

See references/troubleshooting.md for common diagnostic issues.

Phase 3: Interpretation

Goal: Generate publication-quality summary.

Odds Ratio Interpretation

def interpret_odds_ratio(or_val, ci_lower, ci_upper, p_value):
    """Interpret odds ratio with clinical meaning."""
    if or_val > 1:
        pct_increase = (or_val - 1) * 100
        direction = f"{pct_increase:.1f}% increase in odds"
    else:
        pct_decrease = (1 - or_val) * 100
        direction = f"{pct_decrease:.1f}% decrease in odds"

    sig = "significant" if p_value < 0.05 else "not significant"
    ci_contains_null = ci_lower <= 1 <= ci_upper

    return f"{direction} (OR={or_val:.4f}, 95% CI [{ci_lower:.4f}, {ci_upper:.4f}], p={p_value:.6f}, {sig})"

Common BixBench Patterns

Pattern 1: Odds Ratio from Ordinal Regression

Question: "What is the odds ratio of disease severity associated with exposure?"

Solution:

Identify ordinal outcome (mild/moderate/severe)
Fit ordinal logistic regression (proportional odds model)
Extract OR = exp(coefficient for exposure)
Report with CI and p-value

Pattern 2: Percentage Reduction in Odds

Question: "What is the percentage reduction in OR after adjusting for confounders?"

Solution:

# Unadjusted model
model_crude = smf.logit('outcome ~ exposure', data=df).fit(disp=0)
or_crude = np.exp(model_crude.params['exposure'])

# Adjusted model
model_adj = smf.logit('outcome ~ exposure + age + sex', data=df).fit(disp=0)
or_adj = np.exp(model_adj.params['exposure'])

# Percentage reduction
pct_reduction = (or_crude - or_adj) / or_crude * 100
print(f"Percentage reduction: {pct_reduction:.1f}%")

Pattern 3: Interaction Effects

Question: "What is the odds ratio for the interaction between A and B?"

Solution:

# Fit model with interaction
model = smf.logit('outcome ~ A * B + age', data=df).fit(disp=0)

# Interaction OR
interaction_coef = model.params['A:B']
interaction_or = np.exp(interaction_coef)
print(f"Interaction OR: {interaction_or:.4f}")

Pattern 4: Survival Analysis

Question: "What is the hazard ratio for treatment?"

Solution:

Load survival data (time, event, covariates)
Fit Cox proportional hazards model
Extract HR = exp(coefficient)
Report with CI and concordance index

Pattern 5: Multi-feature ANOVA (Gene Expression)

Question: "What is the F-statistic comparing miRNA expression across cell types?"

Solution:

Identify that data has multiple features (genes/miRNAs)
Use per-feature ANOVA (NOT aggregate)
Calculate F-statistic for EACH feature separately
If question asks for "the F-statistic" (singular):
- Check if specific features match expected range
- Report those feature(s) F-statistics
If question asks for summary: report median/mean/distribution

Critical: For gene expression data, default to per-feature ANOVA. Aggregate ANOVA gives F-statistics ~200× larger and is rarely correct.

See references/bixbench_patterns.md for 15+ question patterns.

Statsmodels vs Scikit-learn

General rule: Use statsmodels for BixBench questions (they ask for p-values, ORs, HRs).

Python Package Requirements

statsmodels>=0.14.0
scikit-learn>=1.3.0
lifelines>=0.27.0
pandas>=2.0.0
numpy>=1.24.0
scipy>=1.10.0

File Structure

tooluniverse-statistical-modeling/
├── SKILL.md                          # This file
├── QUICK_START.md                    # 8 quick examples
├── EXAMPLES.md                       # Legacy examples (kept for reference)
├── TOOLS_REFERENCE.md                # ToolUniverse tool catalog
├── test_skill.py                     # Comprehensive test suite
├── references/
│   ├── logistic_regression.md        # Detailed logistic examples
│   ├── ordinal_logistic.md           # Ordinal logit guide
│   ├── cox_regression.md             # Survival analysis guide
│   ├── linear_models.md              # OLS and mixed-effects
│   ├── bixbench_patterns.md          # 15+ question patterns
│   └── troubleshooting.md            # Diagnostic issues
└── scripts/
    ├── format_statistical_output.py  # Format results for reporting
    └── model_diagnostics.py          # Automated diagnostics

Key Principles

Data-first approach - Always inspect and validate data before modeling
Model selection by outcome type - Use decision tree above
Assumption checking - Verify model assumptions (linearity, proportional hazards, etc.)
Complete reporting - Always report effect sizes, CIs, p-values, and model fit statistics
Confounder awareness - Adjust for confounders when specified or clinically relevant
Reproducible analysis - All code must be deterministic and reproducible
Robust error handling - Graceful handling of convergence failures, separation, collinearity
Round correctly - Match the precision requested (typically 2-4 decimal places)

Completeness Checklist

Before finalizing any statistical analysis:

[ ] Outcome variable identified: Verified which column is the actual outcome (not assumed)
[ ] Data validated: N, missing values, variable types confirmed
[ ] Multi-feature data identified: If data has multiple features (genes, miRNAs), use per-feature approach
[ ] Model appropriate: Outcome type matches model family
[ ] Assumptions checked: Relevant diagnostics performed
[ ] Effect sizes reported: OR/HR/Cohen's d with CIs
[ ] P-values reported: With appropriate correction if needed
[ ] Model fit assessed: R-squared, AIC/BIC, concordance
[ ] Results interpreted: Plain-language interpretation
[ ] Precision correct: Numbers rounded appropriately
[ ] Confounders addressed: Adjusted analyses if applicable

References

statsmodels: https://www.statsmodels.org/
lifelines: https://lifelines.readthedocs.io/
scikit-learn: https://scikit-learn.org/
Ordinal models: statsmodels.miscmodels.ordinal_model.OrderedModel

ToolUniverse Integration

While this skill is primarily computational, ToolUniverse tools can provide data:

See TOOLS_REFERENCE.md for complete tool catalog.

Support

For detailed examples and troubleshooting:

Logistic regression: references/logistic_regression.md
Ordinal models: references/ordinal_logistic.md
Survival analysis: references/cox_regression.md
Linear/mixed models: references/linear_models.md
BixBench patterns: references/bixbench_patterns.md
Diagnostics: references/troubleshooting.md

Adoption

lamm-mit/statistical-modeling

$ install --global

Security Scan Results

SKILL.md

Statistical Modeling for Biomedical Data Analysis

Features

Quick Start

Binary Logistic Regression

Cox Proportional Hazards

Model Selection Decision Tree

When to Use

Workflow

Phase 0: Data Validation

Phase 1: Model Fitting

Linear Regression

Logistic Regression

Ordinal Logistic Regression

Cox Proportional Hazards

Statistical Tests (t-test, ANOVA, Chi-square)

Phase 2: Model Diagnostics

OLS Diagnostics

Proportional Hazards Test

Phase 3: Interpretation

Odds Ratio Interpretation

Common BixBench Patterns

Pattern 1: Odds Ratio from Ordinal Regression

Pattern 2: Percentage Reduction in Odds

Pattern 3: Interaction Effects

Pattern 4: Survival Analysis

Pattern 5: Multi-feature ANOVA (Gene Expression)

Statsmodels vs Scikit-learn

Python Package Requirements

File Structure

Key Principles

Completeness Checklist

References

ToolUniverse Integration

Support

Related Skills

lamm-mit/paperclip

lamm-mit/perplexity-search

lamm-mit/scientific-report-pdf

lamm-mit/python-exec

lamm-mit/statistical-modeling

$ install --global

Security Scan Results

SKILL.md

Statistical Modeling for Biomedical Data Analysis

Features

Quick Start

Binary Logistic Regression

Cox Proportional Hazards

Model Selection Decision Tree

When to Use

Workflow

Phase 0: Data Validation

Phase 1: Model Fitting

Linear Regression

Logistic Regression

Ordinal Logistic Regression

Cox Proportional Hazards

Statistical Tests (t-test, ANOVA, Chi-square)

Phase 2: Model Diagnostics

OLS Diagnostics

Proportional Hazards Test

Phase 3: Interpretation

Odds Ratio Interpretation

Common BixBench Patterns

Pattern 1: Odds Ratio from Ordinal Regression

Pattern 2: Percentage Reduction in Odds

Pattern 3: Interaction Effects

Pattern 4: Survival Analysis

Pattern 5: Multi-feature ANOVA (Gene Expression)

Statsmodels vs Scikit-learn

Python Package Requirements

File Structure

Key Principles

Completeness Checklist

References