skills/cv/triple-stratified-shard-folds/SKILL.md
Pre-shard TFRecords into N files each balanced along 3 axes (patient, target, image-count), then KFold over file indices for leak-free triple-stratified folds
npx skillsauth add wenmin-wu/ds-skills cv-triple-stratified-shard-folds
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Kaggle medical-imaging competitions often need splits stratified along several axes at once: (a) no patient appears in both train and val, (b) class balance held constant, (c) per-patient image-count distribution preserved. Doing this on-the-fly per run is slow and hard to verify. The trick: pre-shard the dataset into 15 TFRecord files where each file already contains a balanced mix along all three axes. At train time a simple KFold over file indices gives you triple-stratified, leak-free folds with zero runtime stratification logic.
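The source does not show how the shards themselves are built, so the following is only a minimal sketch of one way to do it, not the author's method. It assumes a hypothetical `patients` dict mapping each patient ID to `(has_positive, n_images)`, and the image-count bucket thresholds in `count_bucket` are illustrative placeholders. Patients are grouped into strata by target and image-count bucket, shuffled within each stratum, and dealt round-robin into shards, so every patient lands in exactly one shard.

```python
import random
from collections import defaultdict


def count_bucket(n_images):
    # Hypothetical image-count buckets; the thresholds are illustrative only.
    return 0 if n_images <= 5 else 1 if n_images <= 20 else 2


def assign_patients_to_shards(patients, n_shards=15, seed=42):
    """Map each patient ID to a shard index.

    patients: dict of patient_id -> (has_positive: bool, n_images: int).
    Assignment is per patient, so all images of a patient share one shard.
    """
    rng = random.Random(seed)

    # Group patients into strata by (target, image-count bucket).
    strata = defaultdict(list)
    for pid, (has_pos, n_images) in patients.items():
        strata[(has_pos, count_bucket(n_images))].append(pid)

    shard_of = {}
    position = 0  # carried across strata so small strata do not all start at shard 0
    for key in sorted(strata):
        pids = sorted(strata[key])  # sort first so the shuffle is reproducible
        rng.shuffle(pids)
        for pid in pids:
            # Round-robin deal: within a stratum, shard counts differ by at most 1.
            shard_of[pid] = position % n_shards
            position += 1
    return shard_of
```

Writing each shard's images out as `train00.tfrec`, `train01.tfrec`, and so on is left out here; the point is only that balancing happens once, at shard-assignment time.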
import numpy as np
from sklearn.model_selection import KFold
import tensorflow as tf
FOLDS, N_SHARDS, SEED = 5, 15, 42
GCS = 'gs://my-bucket/melanoma-256'
# Each train%.2i.tfrec contains a balanced mix along patient/target/image-count
skf = KFold(n_splits=FOLDS, shuffle=True, random_state=SEED)
for fold, (idxT, idxV) in enumerate(skf.split(np.arange(N_SHARDS))):
    files_train = tf.io.gfile.glob(
        [f'{GCS}/train{x:02d}*.tfrec' for x in idxT])
    files_valid = tf.io.gfile.glob(
        [f'{GCS}/train{x:02d}*.tfrec' for x in idxV])
    train_ds = tf.data.TFRecordDataset(files_train).map(parse_fn)
    # Patients never cross folds because each shard is one patient-group
    # Target + image-count are balanced by construction when sharding
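The claim that patients never cross folds depends entirely on how the shards were written, so it may be worth checking once per dataset. Below is a small sketch of such a check. It assumes a hypothetical `shard_patients` dict (shard index to the set of patient IDs stored in that shard), which would have to be built separately, for example by reading the patient ID field out of each TFRecord file.

```python
def patients_crossing_folds(shard_patients, train_idx, valid_idx):
    """Return patient IDs that appear on both sides of a shard-index split.

    shard_patients: dict of shard index -> set of patient IDs (hypothetical
    structure, built by the caller). An empty result means the split is
    leak-free at the patient level.
    """
    train = set().union(*(shard_patients[i] for i in train_idx))
    valid = set().union(*(shard_patients[i] for i in valid_idx))
    return train & valid
```

Inside the fold loop this could be called as `assert not patients_crossing_folds(shard_patients, idxT, idxV)` before any training starts.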
train00.tfrec, train01.tfrec, ...
KFold(n_splits=F) over np.arange(N) to get shard-index splits
tf.data pipelines