machine-learning/model-validation/SKILL.md
Implements nested cross-validation and stratified splits for unbiased model evaluation on biomedical datasets. Prevents data leakage and overfitting in biomarker discovery. Use when validating classifiers or optimizing hyperparameters on omics data.
Reference examples tested with: numpy 1.26+, scikit-learn 1.4+
Before using code patterns, verify installed versions match. If versions differ:
Run pip show <package>, then help(module.function) to check signatures.

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
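The version and signature check described here can also be done from inside Python. The following is a minimal sketch, assuming numpy and scikit-learn are the packages of interest; the helper name `installed_version` is hypothetical and not part of either library.

```python
import inspect
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report what is actually installed before trusting the reference examples.
for package in ("numpy", "scikit-learn"):
    print(package, installed_version(package))

# Inspect a signature before relying on a specific keyword argument.
from sklearn.model_selection import cross_val_score

params = inspect.signature(cross_val_score).parameters
print(list(params))
```

If a keyword used in an example is absent from the printed parameter list, adapt the call rather than retrying it unchanged.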
"Properly validate my omics classifier" -> Use nested cross-validation with stratified splits to get unbiased performance estimates while tuning hyperparameters on small biomedical datasets.
sklearn.model_selection.cross_val_score() with StratifiedKFold inner/outer loops.

A single train/test split gives a noisy and often optimistic performance estimate on small omics datasets, especially when the same split also guides hyperparameter choices. Nested CV reduces this bias by separating hyperparameter tuning from performance evaluation.
Goal: Obtain unbiased performance estimates by separating hyperparameter tuning from evaluation.
Approach: Use an outer CV loop for scoring and an inner CV loop for grid search, preventing information leakage between tuning and evaluation.
from sklearn.model_selection import cross_val_score, StratifiedKFold, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np

# Assumes X is a pandas DataFrame (samples x features) and y holds binary labels aligned to X's rows
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(random_state=42))
])
param_grid = {
    'clf__n_estimators': [50, 100, 200],
    'clf__max_depth': [5, 10, None]
}

# Outer CV: performance estimation (5 folds)
# Inner CV: hyperparameter tuning (3 folds)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

# Positional indexing below works for both NumPy arrays and pandas Series
y_arr = np.asarray(y)
nested_scores = []
for train_idx, test_idx in outer_cv.split(X, y_arr):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y_arr[train_idx], y_arr[test_idx]
    # Tune on the outer-training portion only
    grid = GridSearchCV(pipe, param_grid, cv=inner_cv, scoring='roc_auc', n_jobs=-1)
    grid.fit(X_train, y_train)
    # Score the refit best model on the held-out outer fold (uses the roc_auc scorer)
    score = grid.score(X_test, y_test)
    nested_scores.append(score)

print(f'Nested CV AUC: {np.mean(nested_scores):.3f} +/- {np.std(nested_scores):.3f}')
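The same nested scheme can be written more compactly by passing the grid search itself to `cross_val_score`, which is the pattern named at the top of this skill. This is a sketch on synthetic data, not the reference implementation: the dataset from `make_classification`, the reduced grid, and the `_demo` variable names are all assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic stand-in for an omics matrix (assumption: 80 samples, 50 features).
X_demo, y_demo = make_classification(
    n_samples=80, n_features=50, n_informative=5, random_state=0
)

inner_cv_demo = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# The inner loop lives inside the estimator: each outer training fold
# triggers its own grid search, so the outer test fold never guides tuning.
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid={"max_depth": [3, None]},
    cv=inner_cv_demo,
    scoring="roc_auc",
)

nested_scores_demo = cross_val_score(
    search, X_demo, y_demo, cv=outer_cv_demo, scoring="roc_auc"
)
print(f"Nested CV AUC: {nested_scores_demo.mean():.3f} +/- {nested_scores_demo.std():.3f}")
```

The explicit loop is still useful when the selected hyperparameters of each outer fold need to be inspected; the compact form returns only the outer scores.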
Goal: Evaluate model performance while preserving class proportions in each fold.
Approach: Split data into stratified folds and compute cross-validated scores to account for class imbalance.
from sklearn.model_selection import StratifiedKFold, cross_val_score
# Always stratify for class imbalance
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring='roc_auc')
print(f'CV AUC: {scores.mean():.3f} +/- {scores.std():.3f}')
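To see what stratification changes, the sketch below counts positive samples per test fold with and without it. The 90/10 label vector is a hypothetical example, and `positives_per_fold` is an illustrative helper rather than a scikit-learn function.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels: 90 controls (0) and 10 cases (1).
y_imb = np.array([0] * 90 + [1] * 10)
X_imb = np.zeros((len(y_imb), 1))  # features do not affect how folds are drawn


def positives_per_fold(splitter):
    """Count positive samples in each test fold produced by the splitter."""
    return [int(y_imb[test_idx].sum()) for _, test_idx in splitter.split(X_imb, y_imb)]


plain = positives_per_fold(KFold(n_splits=5, shuffle=True, random_state=0))
stratified = positives_per_fold(StratifiedKFold(n_splits=5, shuffle=True, random_state=0))

print("KFold positives per fold:          ", plain)
print("StratifiedKFold positives per fold:", stratified)  # -> [2, 2, 2, 2, 2]
```

With plain `KFold` the positives can land unevenly, and a fold with very few or no positives makes AUC unstable or undefined.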
Goal: Produce more stable performance estimates by averaging across multiple CV repetitions.
Approach: Repeat stratified K-fold splitting with different random seeds and aggregate scores across all iterations.
from sklearn.model_selection import RepeatedStratifiedKFold
# More robust estimate with multiple repeats
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring='roc_auc')
print(f'Repeated CV AUC: {scores.mean():.3f} +/- {scores.std():.3f}')
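Repeated CV returns one score per fold per repeat, so the raw standard deviation mixes fold-to-fold and repeat-to-repeat variation. A minimal sketch of summarizing by repeat follows, assuming synthetic data and a logistic regression stand-in for the pipeline; the `_rep` names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic data (assumption) so the example runs on its own.
X_rep, y_rep = make_classification(n_samples=60, n_features=20, random_state=0)

n_splits, n_repeats = 5, 4
cv_rep = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=0)
scores_rep = cross_val_score(
    LogisticRegression(max_iter=1000), X_rep, y_rep, cv=cv_rep, scoring="roc_auc"
)

# Splits are generated repeat by repeat, so each row holds the folds of one repeat.
per_repeat = scores_rep.reshape(n_repeats, n_splits).mean(axis=1)

print(f"{len(scores_rep)} fold scores, overall mean {scores_rep.mean():.3f}")
print("Mean AUC per repeat:", per_repeat.round(3))
```

The spread of the per-repeat means indicates how much the estimate depends on the particular random partition.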
Goal: Maximize training data when sample size is very small (n < 30).
Approach: Hold out one sample at a time for testing and train on all remaining samples, then aggregate predictions.
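from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.metrics import roc_auc_score
# Use for very small datasets (n < 30)
loo = LeaveOneOut()
# Each test fold holds a single sample, so per-fold AUC is undefined;
# pool the out-of-fold probabilities and compute one AUC over all samples
y_pred = cross_val_predict(pipe, X, y, cv=loo, method='predict_proba')[:, 1]
auc = roc_auc_score(y, y_pred)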
print(f'LOO AUC: {auc:.3f}')
Goal: Prevent data leakage when samples from the same patient or batch are correlated.
Approach: Use group-aware splitting to ensure all samples from a single group stay in the same fold.
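from sklearn.model_selection import GroupKFold, LeaveOneGroupOut, cross_val_score
# When samples from same patient/batch must stay together
# Assumes meta is a sample-metadata DataFrame whose rows align with X
groups = meta['patient_id'].values
# GroupKFold does not stratify by class; check that every fold contains both classes.
# LeaveOneGroupOut() is the group-wise analogue of leave-one-out when there are few groups.
group_cv = GroupKFold(n_splits=5)
scores = cross_val_score(pipe, X, y, cv=group_cv, groups=groups, scoring='roc_auc')
print(f'Grouped CV AUC: {scores.mean():.3f} +/- {scores.std():.3f}')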
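A quick way to confirm that a splitter keeps patients together is to count the groups that appear on both sides of a split. The sketch below uses a hypothetical design of 20 patients with 3 samples each; `shared_groups` is an illustrative helper, not a scikit-learn function.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, StratifiedKFold

# Hypothetical design: 20 patients, 3 samples each, label fixed per patient.
groups_demo = np.repeat(np.arange(20), 3)
y_grp = groups_demo % 2
X_grp = np.zeros((len(groups_demo), 1))  # features do not affect the split


def shared_groups(splitter, **split_kwargs):
    """Count patients that appear in both train and test of at least one split."""
    leaked = set()
    for train_idx, test_idx in splitter.split(X_grp, y_grp, **split_kwargs):
        leaked |= set(groups_demo[train_idx]) & set(groups_demo[test_idx])
    return len(leaked)


n_leaked_plain = shared_groups(StratifiedKFold(n_splits=5, shuffle=True, random_state=0))
n_leaked_grouped = shared_groups(GroupKFold(n_splits=5), groups=groups_demo)

print("Patients split across train/test, StratifiedKFold:", n_leaked_plain)
print("Patients split across train/test, GroupKFold:     ", n_leaked_grouped)
```

Any non-zero count for the splitter actually used in evaluation means correlated samples are leaking between training and test folds.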
| Dataset Size | Strategy | Notes |
|--------------|----------|-------|
| n > 100 | StratifiedKFold(5) | Standard choice |
| n = 50-100 | StratifiedKFold(10) | More train data per fold |
| n < 30 | LeaveOneOut | Maximum train data |
| Repeated measures | GroupKFold | Keep patients together |
| High variance | RepeatedStratifiedKFold | More stable estimates |
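The table can be turned into a small selection helper. This is a sketch under stated assumptions, not part of the skill: `choose_cv` is a hypothetical function, the order in which the rules are checked is a choice made here, and sample sizes of 30 to 49 (which the table does not cover) are treated like the 50-100 row.

```python
from sklearn.model_selection import (
    GroupKFold,
    LeaveOneOut,
    RepeatedStratifiedKFold,
    StratifiedKFold,
)


def choose_cv(n_samples, has_groups=False, high_variance=False, random_state=42):
    """Map the strategy table to a scikit-learn splitter (illustrative only)."""
    if has_groups:
        # Repeated measures: keep each patient or batch inside one fold.
        return GroupKFold(n_splits=5)
    if n_samples < 30:
        return LeaveOneOut()
    if high_variance:
        return RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=random_state)
    if n_samples <= 100:
        # Assumption: n = 30-49 is not in the table and is handled like n = 50-100.
        return StratifiedKFold(n_splits=10, shuffle=True, random_state=random_state)
    return StratifiedKFold(n_splits=5, shuffle=True, random_state=random_state)


for n in (20, 80, 500):
    print(n, choose_cv(n))
```

When groups are present, pass the group labels to the scoring call as well, for example `cross_val_score(..., cv=choose_cv(n, has_groups=True), groups=groups)`.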
Goal: Ensure feature selection does not use test-fold information, which inflates performance estimates.
Approach: Embed feature selection inside a pipeline so it executes independently within each CV fold.
# WRONG: Feature selection before CV
# selected = SelectKBest(k=100).fit_transform(X, y) # Leaks info!
# scores = cross_val_score(clf, selected, y, cv=cv)
# CORRECT: Feature selection inside CV
from sklearn.feature_selection import SelectKBest
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('select', SelectKBest(k=100)),  # Done per fold
    ('clf', RandomForestClassifier())
])
scores = cross_val_score(pipe, X, y, cv=cv, scoring='roc_auc')
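The size of the leakage effect is easiest to see on data with no signal at all. The sketch below is an illustration under assumptions: pure Gaussian noise with 50 samples and 5000 features, labels unrelated to the features, and logistic regression in place of the random forest to keep it fast. With the selection done before CV, the estimated AUC is typically well above chance; inside the pipeline it typically stays near 0.5.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X_noise = rng.normal(size=(50, 5000))  # pure noise: no feature carries signal
y_noise = np.array([0, 1] * 25)        # balanced labels, independent of X_noise

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf_demo = LogisticRegression(max_iter=1000)

# Leaky: features are chosen using all labels, including those of future test folds.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky_auc = cross_val_score(clf_demo, X_leaky, y_noise, cv=cv_demo, scoring="roc_auc").mean()

# Honest: selection is refit on the training portion of each fold only.
honest_pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),
    ("clf", clf_demo),
])
honest_auc = cross_val_score(honest_pipe, X_noise, y_noise, cv=cv_demo, scoring="roc_auc").mean()

print(f"Selection before CV (leaky): AUC = {leaky_auc:.3f}")
print(f"Selection inside CV (honest): AUC = {honest_auc:.3f}")
```

The same reasoning applies to any step that looks at labels or at the full data distribution, such as scaling, imputation, or batch correction.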