skills/nlp/worker-stratified-kfold/SKILL.md
Stratifies CV folds by annotator/worker ID so that every train and validation split sees a representative mix of annotator styles in crowd-sourced datasets.
npx skillsauth add wenmin-wu/ds-skills nlp-worker-stratified-kfold
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
In crowd-sourced annotation datasets, multiple workers label different subsets of examples. If workers are unevenly distributed across folds, a validation fold can be dominated by a handful of annotators, and the score then reflects their particular biases rather than the task signal. Stratifying folds by worker ID gives every fold roughly the same mix of annotators: each worker's labels are spread proportionally across all folds, which makes the per-fold estimates more comparable and more representative of the full annotator pool. This is distinct from group-based splitting: here we stratify (balance the worker distribution across folds) rather than group (isolate each worker entirely within one fold, as `GroupKFold` does).
```python
from sklearn.model_selection import StratifiedKFold

# df has columns: 'text', 'label', 'worker' (annotator ID)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
df['fold'] = -1

# Stratify on the worker column so each fold gets a similar share of every annotator
for fold, (_, val_idx) in enumerate(skf.split(X=df, y=df['worker'])):
    # val_idx is positional, so map it through df.index in case the index is not 0..n-1
    df.loc[df.index[val_idx], 'fold'] = fold

# Train/val split for a specific fold
train_df = df[df['fold'] != 0]
val_df = df[df['fold'] == 0]
```
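To make the stratify-versus-group distinction concrete, here is a minimal sketch on synthetic data. The `toy` frame, its four workers, and the `assign_folds` helper are hypothetical names invented for illustration; only `StratifiedKFold` and `GroupKFold` come from scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold, StratifiedKFold

# Synthetic data (hypothetical): 4 workers with 50 annotations each.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "text": [f"example {i}" for i in range(200)],
    "label": rng.integers(0, 2, size=200),
    "worker": np.repeat(["w0", "w1", "w2", "w3"], 50),
})

def assign_folds(splitter, frame, **split_kwargs):
    """Return an array holding the validation fold of every row."""
    folds = np.full(len(frame), -1)
    for fold, (_, val_idx) in enumerate(splitter.split(frame, **split_kwargs)):
        folds[val_idx] = fold
    return folds

# Stratified by worker: every worker contributes to every fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
toy["strat_fold"] = assign_folds(skf, toy, y=toy["worker"])

# Grouped by worker: every worker lands in exactly one fold.
gkf = GroupKFold(n_splits=4)
toy["group_fold"] = assign_folds(gkf, toy, y=toy["label"], groups=toy["worker"])

# Rows are workers, columns are folds, cells are annotation counts.
strat_counts = pd.crosstab(toy["worker"], toy["strat_fold"])
group_counts = pd.crosstab(toy["worker"], toy["group_fold"])
print(strat_counts)
print(group_counts)
```

In the stratified table each worker's annotations are spread evenly over the folds, while in the grouped table each worker row has a single non-zero column. Which behaviour is appropriate depends on whether the model is expected to face the same annotators at inference time or entirely new ones.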
Pass `y=worker_id` to balance workers across folds. To balance workers and labels jointly, stratify on a compound key such as `f"{worker}_{label}"`.
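The compound key `f"{worker}_{label}"` can be used as the stratification target when both the worker mix and the label mix should be balanced. The sketch below assumes a hypothetical `demo` frame with three workers and a 75/25 label split per worker; the column name `strat_key` is likewise an illustrative choice.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Synthetic data (hypothetical): 3 workers x 40 rows, each with 30 zeros and 10 ones.
demo = pd.DataFrame({
    "worker": np.repeat(["w0", "w1", "w2"], 40),
    "label": np.tile([0, 0, 0, 1], 30),
})

# Compound stratification key: one class per (worker, label) pair.
demo["strat_key"] = demo["worker"] + "_" + demo["label"].astype(str)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
demo["fold"] = -1
fold_col = demo.columns.get_loc("fold")
for fold, (_, val_idx) in enumerate(skf.split(demo, demo["strat_key"])):
    # val_idx is positional, so assign with iloc
    demo.iloc[val_idx, fold_col] = fold

# Rows are (worker, label) pairs, columns are folds.
key_counts = pd.crosstab(demo["strat_key"], demo["fold"])
print(key_counts)
```

One caveat with this approach: the compound key multiplies the number of classes, and scikit-learn warns when the least populated class has fewer members than `n_splits`. Rare worker/label combinations may need to be merged or the number of folds reduced.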