skills/nlp/class-balanced-dataset-merge/SKILL.md
Merges multiple training datasets while keeping all positive examples and downsampling negatives to control class imbalance.
npx skillsauth add wenmin-wu/ds-skills nlp-class-balanced-dataset-merge
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
When combining datasets from different sources or competitions, class ratios can become severely skewed. A common pattern: keep all positive (minority) examples from every source but cap the number of negatives. This gives the model maximum signal on the rare class while preventing the majority class from dominating training.
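import pandas as pd

# train1 and train2 are assumed to be DataFrames with "text" and "label" columns.
# Dataset 1: primary training data (use all rows)
# Dataset 2: auxiliary data (keep all positives, downsample negatives)
negatives = train2[["text", "label"]].query("label == 0")
train = pd.concat([
    train1[["text", "label"]],
    train2[["text", "label"]].query("label == 1"),
    # Cap at 100,000 negatives; min() avoids a ValueError when fewer exist
    negatives.sample(n=min(100_000, len(negatives)), random_state=42),
])
# Shuffle so rows from the two sources are interleaved, then reset the index
train = train.sample(frac=1, random_state=42).reset_index(drop=True)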
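
The same pattern can be wrapped in a small helper so it is easy to reuse across several auxiliary sources. What follows is a minimal sketch rather than part of the original skill: the function name `merge_class_balanced`, its `max_negatives` and `seed` parameters, and the toy DataFrames are all hypothetical, and it assumes binary labels where 1 marks the positive class.

```python
import pandas as pd


def merge_class_balanced(primary, auxiliary, max_negatives, seed=42):
    """Keep all primary rows and all auxiliary positives; cap auxiliary negatives.

    Hypothetical helper: assumes both frames have "text" and "label" columns
    and that label 1 marks the positive (minority) class.
    """
    cols = ["text", "label"]
    aux = auxiliary[cols]
    positives = aux[aux["label"] == 1]
    negatives = aux[aux["label"] == 0]
    # Never request more rows than exist; sample() raises ValueError otherwise.
    n_neg = min(max_negatives, len(negatives))
    merged = pd.concat([
        primary[cols],
        positives,
        negatives.sample(n=n_neg, random_state=seed),
    ])
    # Shuffle and rebuild a clean 0..n-1 index.
    return merged.sample(frac=1, random_state=seed).reset_index(drop=True)


# Toy data for illustration only.
train1 = pd.DataFrame({"text": ["a", "b", "c", "d"], "label": [1, 0, 0, 0]})
train2 = pd.DataFrame({
    "text": [f"t{i}" for i in range(10)],
    "label": [1, 1] + [0] * 8,
})

train = merge_class_balanced(train1, train2, max_negatives=3)
# Inspect the resulting class counts to confirm the cap took effect.
print(train["label"].value_counts().to_dict())
```

Checking the class counts after the merge is a cheap way to confirm the cap took effect. The right value for `max_negatives` depends on the task, so it is worth treating as a tunable setting rather than a fixed constant.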