skills/nlp/source-balanced-stratified-fold/SKILL.md
Stratifies CV folds by both target label AND data source to prevent source-specific bias in each fold.
npx skillsauth add wenmin-wu/ds-skills nlp-source-balanced-stratified-fold
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
When training data comes from multiple sources (different datasets, different LLMs, different collection methods), standard stratified K-fold balances only the labels, so individual folds may over-represent one source. Building a composite stratification key from label + source gives every fold a representative mix of both dimensions.
import pandas as pd
from sklearn.model_selection import StratifiedKFold


def source_balanced_split(df, label_col, source_col, n_splits=5, seed=42):
    """Stratified K-fold balanced by both label and source."""
    df = df.copy()  # leave the caller's DataFrame untouched
    # Composite key: one stratum per (label, source) combination
    stratify_key = df[label_col].astype(str) + '_' + df[source_col].astype(str)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    df['fold'] = -1
    fold_pos = df.columns.get_loc('fold')
    for fold, (_, val_idx) in enumerate(skf.split(df, stratify_key)):
        # val_idx holds row positions, not index labels, so assign with iloc
        df.iloc[val_idx, fold_pos] = fold
    return df

# Usage
df = source_balanced_split(df, label_col='label', source_col='source')

# Verify balance: source proportions should be similar in every fold
for fold in range(5):
    fold_df = df[df['fold'] == fold]
    print(f"Fold {fold}: {fold_df['source'].value_counts(normalize=True).to_dict()}")
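The printout above checks source proportions only. A stricter check, sketched here, cross-tabulates each label + source stratum against the fold number; with a stratified split, the count for any one stratum should differ by at most one row between folds. The toy DataFrame and the `max_gap` name are made up for illustration, and the fold assignment is repeated inline so the block runs on its own.

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Toy data (hypothetical): 2 labels x 3 sources, 20 rows per combination.
toy = pd.DataFrame({
    "label": [0, 1] * 60,
    "source": ["a"] * 40 + ["b"] * 40 + ["c"] * 40,
})

# Same composite key as in the skill: label and source joined by "_".
key = toy["label"].astype(str) + "_" + toy["source"].astype(str)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
toy["fold"] = -1
fold_pos = toy.columns.get_loc("fold")
for fold, (_, val_idx) in enumerate(skf.split(toy, key)):
    toy.iloc[val_idx, fold_pos] = fold  # positional assignment

# Rows: (label, source) strata; columns: folds; cells: row counts.
counts = pd.crosstab(key, toy["fold"])
print(counts)

# Largest difference between the fullest and emptiest fold for any stratum.
max_gap = int((counts.max(axis=1) - counts.min(axis=1)).max())
print("max per-stratum gap across folds:", max_gap)
```

A gap larger than one would suggest the folds were not stratified on the composite key.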
Composite key: `f"{label}_{source}"`. Extend it to `label_source_prompt` for 3-way balancing if needed.
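For the 3-way case, the key can be built from any list of columns. The sketch below assumes a third column named `prompt`; the `composite_key` helper and the toy data are hypothetical, not part of the skill. Each added dimension shrinks the strata, and `StratifiedKFold` warns when the least-populated class has fewer members than `n_splits`, so the example also lists strata that are too small for the chosen number of folds.

```python
import pandas as pd


def composite_key(df, cols, sep="_"):
    """Join the given columns into one string key per row (hypothetical helper)."""
    key = df[cols[0]].astype(str)
    for col in cols[1:]:
        key = key + sep + df[col].astype(str)
    return key


# Toy data (hypothetical) with a third dimension, "prompt".
toy = pd.DataFrame({
    "label": [0, 1, 0, 1, 0, 1],
    "source": ["a", "a", "b", "b", "a", "a"],
    "prompt": ["p1", "p1", "p1", "p2", "p1", "p1"],
})

key = composite_key(toy, ["label", "source", "prompt"])
print(key.tolist())

# Strata with fewer rows than n_splits cannot appear in every fold.
n_splits = 2
sizes = key.value_counts()
too_small = sizes[sizes < n_splits]
print(too_small.to_dict())
```

If many strata turn out too small, one option is to drop the least important dimension from the key or to merge rare combinations into a shared bucket before splitting.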