skills/tabular/group-kfold-leak-prevention/SKILL.md
Uses GroupKFold to prevent data leakage when multiple rows share a common entity (e.g., same user, question, or document).
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add wenmin-wu/ds-skills tabular-group-kfold-leak-prevention
When rows in a dataset share a group identity (same question, user, session, etc.), standard KFold can place related rows in both train and validation, causing leakage. GroupKFold ensures all rows from a group land in the same fold, giving honest validation scores.
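To make the leakage concrete, here is a minimal sketch that counts how many group IDs end up on both sides of each split. The toy data and the helper name `count_shared_groups` are made up for illustration and are not part of the skill itself.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold

# Toy data (illustrative assumption): 20 groups with 5 rows each.
rng = np.random.default_rng(0)
groups = np.repeat(np.arange(20), 5)
X = rng.normal(size=(len(groups), 3))
y = rng.integers(0, 2, size=len(groups))


def count_shared_groups(splits, groups):
    """Return, per fold, how many group IDs appear in both train and validation."""
    shared = []
    for train_idx, val_idx in splits:
        shared.append(len(set(groups[train_idx]) & set(groups[val_idx])))
    return shared


# Plain KFold ignores group membership, so rows of one group get scattered.
kfold_shared = count_shared_groups(
    KFold(n_splits=5, shuffle=True, random_state=0).split(X), groups
)
# GroupKFold keeps every group entirely on one side of each split.
group_shared = count_shared_groups(
    GroupKFold(n_splits=5).split(X, y, groups=groups), groups
)

print("KFold shared groups per fold:     ", kfold_shared)
print("GroupKFold shared groups per fold:", group_shared)
```

A nonzero count for KFold means the model is validated on entities it has already seen during training, which is the situation this skill is meant to avoid.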
from sklearn.model_selection import GroupKFold


def group_kfold_split(X, y, groups, n_splits=5):
    """Generate leak-proof train/val splits by group.

    Args:
        X: features array
        y: target array
        groups: array of group IDs (e.g., question_id, user_id)
        n_splits: number of folds

    Yields:
        (train_idx, val_idx) tuples
    """
    gkf = GroupKFold(n_splits=n_splits)
    # Every row sharing a group ID is assigned to the same fold.
    for train_idx, val_idx in gkf.split(X, y, groups=groups):
        yield train_idx, val_idx
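One possible way to plug the same splitter into model evaluation is to hand it to `cross_val_score` together with the group IDs. This is a sketch under stated assumptions: the toy data and the choice of `LogisticRegression` are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

# Toy data (illustrative assumption): 30 groups with 4 rows each.
rng = np.random.default_rng(42)
groups = np.repeat(np.arange(30), 4)
X = rng.normal(size=(len(groups), 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=len(groups)) > 0).astype(int)

# cross_val_score forwards `groups` to the splitter's split() method.
cv = GroupKFold(n_splits=5)
scores = cross_val_score(LogisticRegression(), X, y, groups=groups, cv=cv)

print("Per-fold accuracy:", scores)
print("Mean accuracy:", scores.mean())
```

Passing `cv=GroupKFold(...)` without `groups=` raises an error, since the splitter has no way to know which rows belong together.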
- Replace KFold / StratifiedKFold with GroupKFold.
- Pass the group IDs through the groups parameter.
- GroupKFold doesn't stratify; for imbalanced targets, use StratifiedGroupKFold (sklearn 1.0+).
- The data must contain at least n_splits unique groups.