skills/tabular/group-shuffle-split/SKILL.md
Splits train/validation using GroupShuffleSplit so that related samples (forks, families, sessions) never span both sets.
npx skillsauth add wenmin-wu/ds-skills tabular-group-shuffle-split
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
When samples have group relationships (notebook forks sharing an ancestor, patients in the same hospital, users in the same household), random splits leak information across train/validation. GroupShuffleSplit ensures that all samples from the same group land in the same split, which prevents this leakage while still allowing a simple holdout split (unlike GroupKFold, which requires K folds).
from sklearn.model_selection import GroupShuffleSplit
# test_size is the fraction of groups (not rows) held out for validation
splitter = GroupShuffleSplit(n_splits=1, test_size=0.1, random_state=42)
groups = df["group_id"] # e.g., ancestor_id, patient_id, session_id
train_idx, val_idx = next(splitter.split(df, groups=groups))
train_df = df.iloc[train_idx].reset_index(drop=True)
val_df = df.iloc[val_idx].reset_index(drop=True)
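The split above can be sanity-checked by confirming that no group appears on both sides. The following is a minimal sketch on synthetic data; the toy DataFrame, the `feature` column, and the group sizes are assumptions made for illustration and are not part of the skill. It also computes the share of rows that ends up in validation, which can differ from `test_size` when groups are uneven, because the splitter selects whole groups.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical toy data: 40 groups with between 1 and 9 rows each.
rng = np.random.default_rng(0)
group_sizes = rng.integers(1, 10, size=40)
df = pd.DataFrame({"group_id": np.repeat(np.arange(40), group_sizes)})
df["feature"] = rng.normal(size=len(df))

splitter = GroupShuffleSplit(n_splits=1, test_size=0.1, random_state=42)
train_idx, val_idx = next(splitter.split(df, groups=df["group_id"]))

# Collect the group ids that landed on each side of the split.
train_groups = set(df["group_id"].iloc[train_idx])
val_groups = set(df["group_id"].iloc[val_idx])

# Leakage check: a group must never appear in both sets.
assert train_groups.isdisjoint(val_groups), "group leakage detected"

# Share of rows (not groups) that ended up in validation.
val_row_fraction = len(val_idx) / len(df)
print(f"validation groups: {len(val_groups)}, row fraction: {val_row_fraction:.3f}")
```

If the row fraction matters more than the group fraction, one option is to try a few values of `random_state` and keep the split whose row fraction is closest to the target.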
GroupShuffleSplit with desired test size