skills/cv/stratified-fold-by-sequence-length/SKILL.md
Stratifies cross-validation folds by output sequence length to ensure balanced length distributions across train/val splits.
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
npx skillsauth add wenmin-wu/ds-skills cv-stratified-fold-by-sequence-length
In image-to-sequence tasks, output lengths vary widely (e.g., simple vs complex molecules). Random splits can create folds where one fold gets mostly short sequences and another mostly long ones, causing misleading validation scores. Stratify by binned sequence length so each fold has a representative length distribution.
from sklearn.model_selection import StratifiedKFold
import pandas as pd
# Compute target sequence length
train["seq_length"] = train["target_text"].str.len()
# Bin into discrete categories for stratification
train["length_bin"] = pd.qcut(train["seq_length"], q=10, labels=False, duplicates="drop")
# Create stratified folds
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
train["fold"] = -1
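for fold, (_, val_idx) in enumerate(skf.split(train, train["length_bin"])):
    # val_idx holds positional indices, so assign by position rather than by index label
    train.iloc[val_idx, train.columns.get_loc("fold")] = fold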
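To check that the folds really do share a similar length distribution, the snippet above can be run end to end and summarized per fold. This is a minimal sketch on synthetic data: the random lengths and the repeated-character `target_text` strings are stand-ins invented for illustration, not the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the real training frame (hypothetical data).
rng = np.random.default_rng(0)
lengths = rng.integers(5, 200, size=1000)
train = pd.DataFrame({"target_text": ["C" * int(n) for n in lengths]})

# Same steps as above: length, quantile bins, stratified folds.
train["seq_length"] = train["target_text"].str.len()
train["length_bin"] = pd.qcut(train["seq_length"], q=10, labels=False, duplicates="drop")

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
train["fold"] = -1
fold_col = train.columns.get_loc("fold")
for fold, (_, val_idx) in enumerate(skf.split(train, train["length_bin"])):
    train.iloc[val_idx, fold_col] = fold

# Per-fold summary of sequence length; the rows should look alike.
fold_summary = train.groupby("fold")["seq_length"].describe()[["count", "mean", "50%"]]
print(fold_summary)

# Share of each length bin inside each fold; every row sums to 1.
bin_share = pd.crosstab(train["fold"], train["length_bin"], normalize="index")
print(bin_share.round(3))
```

If one fold's mean or median length stands clearly apart from the others, the binning is probably too coarse for the data, and a larger `q` may be worth trying.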
Use StratifiedKFold with the length bins as the stratification target.
StratifiedKFold needs discrete labels, so bin continuous lengths.
iterative-stratification
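One practical wrinkle with binning: on small datasets, `q=10` can leave bins with fewer members than there are folds, and `duplicates="drop"` can silently return fewer bins than requested. The sketch below shows one possible guard. `make_length_bins` is a hypothetical helper written for this illustration, and reducing `q` until every bin holds at least `n_splits` rows is an assumed policy, not something the skill itself prescribes.

```python
import pandas as pd

def make_length_bins(lengths: pd.Series, q: int = 10, n_splits: int = 5) -> pd.Series:
    """Hypothetical helper: quantile-bin lengths, lowering q until
    every non-empty bin has at least n_splits members."""
    while q > 1:
        bins = pd.qcut(lengths, q=q, labels=False, duplicates="drop")
        if bins.value_counts().min() >= n_splits:
            return bins
        q -= 1
    # Fall back to a single bin, which means no effective stratification.
    return pd.Series(0, index=lengths.index)

# Tiny example: 12 rows cannot support 10 bins with 5 folds.
lengths = pd.Series(range(12))
bins = make_length_bins(lengths, q=10, n_splits=5)
print(bins.value_counts().sort_index())
```

The returned series can be passed to `skf.split` in place of `train["length_bin"]`.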