skills/cv/patient-level-stratified-kfold/SKILL.md
Stratifies CV folds at the patient level rather than image level, preventing data leakage when multiple images exist per patient.
npx skillsauth add wenmin-wu/ds-skills cv-patient-level-stratified-kfold
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
In medical imaging, each patient often has multiple images (e.g., left/right breast, multiple slices, follow-up scans). Standard image-level KFold leaks information: images from the same patient can appear in both train and validation, which can inflate validation metrics by roughly 0.01–0.05. Patient-level stratification keeps all images from one patient in the same fold while still balancing the target distribution across folds. This is essential for medical imaging competitions such as RSNA, SIIM, and VinBigData.
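The leakage described above can be made concrete with a small check. This is a minimal sketch on synthetic data, not taken from any competition: `demo_df` and the helper `count_shared_patients` are hypothetical names introduced for illustration, and the table layout (one row per image with `patient_id` and `target` columns) is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Synthetic stand-in for an image-level table: 100 patients, 4 images each.
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "patient_id": np.repeat(np.arange(100), 4),
    "target": np.repeat(rng.integers(0, 2, size=100), 4),
})


def count_shared_patients(df, n_splits=5, seed=42):
    """Return, per fold, how many patients appear in both train and validation
    when the split is done naively at the image (row) level."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    shared = []
    for train_idx, val_idx in kf.split(df):
        train_patients = set(df["patient_id"].iloc[train_idx])
        val_patients = set(df["patient_id"].iloc[val_idx])
        shared.append(len(train_patients & val_patients))
    return shared


shared_per_fold = count_shared_patients(demo_df)
print(shared_per_fold)  # a nonzero count in a fold means that fold leaks
```

Any nonzero entry means the model is validated on patients it has already seen during training, which is the situation the patient-level split is meant to rule out.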
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# train_df is assumed to be an image-level DataFrame with 'patient_id' and 'target' columns
# Aggregate to patient-level label (any positive image = positive patient)
patient_labels = train_df.groupby('patient_id')['target'].max().reset_index()

# Split at patient level, stratified by patient-level label
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
patient_labels['fold'] = -1
for fold, (_, val_idx) in enumerate(skf.split(
    patient_labels['patient_id'], patient_labels['target']
)):
    patient_labels.loc[val_idx, 'fold'] = fold

# Map fold back to image-level DataFrame
train_df = train_df.merge(
    patient_labels[['patient_id', 'fold']], on='patient_id'
)

# Use in training
for fold in range(5):
    train_idx = train_df[train_df['fold'] != fold].index
    val_idx = train_df[train_df['fold'] == fold].index
    # No patient overlap between train_idx and val_idx
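Before training, it may be worth verifying that the fold assignment behaves as intended. The sketch below wraps the same patient-level split in a hypothetical helper, `assign_patient_folds`, and adds a hypothetical `check_folds` function; both names and the synthetic `demo_df` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold


def assign_patient_folds(df, n_splits=5, seed=42):
    """Assign a fold to every image so that each patient lands in exactly one fold."""
    patient_labels = df.groupby("patient_id")["target"].max().reset_index()
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    patient_labels["fold"] = -1
    for fold, (_, val_idx) in enumerate(
        skf.split(patient_labels["patient_id"], patient_labels["target"])
    ):
        patient_labels.loc[val_idx, "fold"] = fold
    return df.merge(patient_labels[["patient_id", "fold"]], on="patient_id")


def check_folds(df):
    """Fail if any patient spans several folds; return a per-fold summary."""
    folds_per_patient = df.groupby("patient_id")["fold"].nunique()
    assert (folds_per_patient == 1).all(), "a patient appears in more than one fold"
    return df.groupby("fold").agg(
        n_images=("target", "size"),
        n_patients=("patient_id", "nunique"),
        positive_rate=("target", "mean"),
    )


# Synthetic data: 200 patients with 2 images each; the first 20 patients
# are positive, and only on their first image.
patient_ids = np.repeat(np.arange(200), 2)
image_no = np.tile([0, 1], 200)
demo_df = pd.DataFrame({
    "patient_id": patient_ids,
    "target": ((patient_ids < 20) & (image_no == 0)).astype(int),
})

folded = assign_patient_folds(demo_df)
summary = check_folds(folded)
print(summary)
```

The summary makes it easy to spot folds with very few positives, which tends to matter most when the positive class is rare.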
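Aggregate labels to the patient level before splitting, e.g. groupby('patient_id').target.max().
Use max() for binary targets (any positive = positive patient); use mean() for regression, binning the averaged value first because StratifiedKFold requires discrete classes.
GroupKFold prevents leakage but doesn't stratify; this approach does both.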
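For the regression case, one possible approach is to average the target per patient and then stratify on quantile bins of that average, since StratifiedKFold only accepts discrete class labels. The sketch below is an assumption about how that could look, not a method stated in the source: the synthetic `reg_df`, the `bin` column, and the choice of 5 quantile bins are all illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Synthetic regression data: 100 patients, 3 images each, continuous target.
rng = np.random.default_rng(0)
reg_df = pd.DataFrame({
    "patient_id": np.repeat(np.arange(100), 3),
    "target": rng.normal(50.0, 10.0, size=300),
})

# Patient-level target: mean over that patient's images.
patient_targets = reg_df.groupby("patient_id")["target"].mean().reset_index()

# Discretize into quantile bins so the value can serve as a stratification label.
# The number of bins (5) is an arbitrary choice for this sketch.
patient_targets["bin"] = pd.qcut(
    patient_targets["target"], q=5, labels=False, duplicates="drop"
)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
patient_targets["fold"] = -1
for fold, (_, val_idx) in enumerate(
    skf.split(patient_targets["patient_id"], patient_targets["bin"])
):
    patient_targets.loc[val_idx, "fold"] = fold

# Map the patient-level fold back to every image of that patient.
reg_df = reg_df.merge(patient_targets[["patient_id", "fold"]], on="patient_id")
```

Fewer bins give coarser balancing but keep enough patients per bin; each bin needs at least as many patients as there are folds for the stratification to work cleanly.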