skills/cv/multilabel-rare-class-image-oversampling/SKILL.md
Oversample multi-label images by giving each image a duplication multiplier equal to the maximum per-class multiplier among its labels, so every rare class gets repetition without exploding common-class counts. This is the standard fix for long-tail multi-label distributions, where SMOTE and per-row oversampling do not apply.
npx skillsauth add wenmin-wu/ds-skills cv-multilabel-rare-class-image-oversampling
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Multi-label image data usually has a long-tail label distribution, but each image carries several labels at once, so per-row oversampling is ambiguous: which class should drive the multiplier? The HPA-winning (Human Protein Atlas) trick is to define a per-class multiplier vector (e.g. 1 for common classes, 2-4 for rare ones) and, for each image, take the maximum multiplier across its labels. Images with only common classes stay at 1 copy; an image carrying any rare class is duplicated by that class's multiplier; an image with several rare classes still gets a single bump, because the multipliers are not multiplied together. This raises the effective frequency of rare classes without duplicating common-only images or distorting label correlations the way per-class image generation would.
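import pandas as pd


class Oversampling:
    """Look up an image's duplication count from its labels."""

    def __init__(self, csv_path, multi):
        # The CSV needs an 'Id' column and a 'Target' column of
        # space-separated class indices.
        df = pd.read_csv(csv_path).set_index('Id')
        df['Target'] = [[int(i) for i in s.split()] for s in df['Target']]
        self.labels = df
        self.multi = multi  # per-class duplication factor

    def get(self, image_id):
        labels = self.labels.loc[image_id, 'Target']
        # Max over the image's labels; an image with no labels gets 1 copy.
        return max((self.multi[l] for l in labels), default=1)


# 28-class HPA example: rare classes get 2x or 4x, common classes stay at 1x
multi = [1,1,1,1,1,1,1,1, 4,4,4,1,1,1,1,4,
         1,1,1,1,2,1,1,1, 1,1,1,4]

# LABELS_CSV (path to the label file) and train_ids (list of image Ids)
# must be defined by the caller.
sampler = Oversampling(LABELS_CSV, multi)
train_ids = [iid for iid in train_ids for _ in range(sampler.get(iid))]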
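To make the effect of the max-multiplier rule concrete, here is a minimal in-memory sketch that counts labels before and after duplication. The five-image dataset, the three-class `multi` vector, and the helper names (`image_multiplier`, `oversample`, `class_counts`) are all made up for illustration and are not part of the original skill.

```python
from collections import Counter


def image_multiplier(labels, multi):
    """Max per-class multiplier among an image's labels (1 if it has none)."""
    return max((multi[c] for c in labels), default=1)


def oversample(ids, labels_by_id, multi):
    """Repeat each id as many times as its image multiplier says."""
    return [i for i in ids
            for _ in range(image_multiplier(labels_by_id[i], multi))]


def class_counts(ids, labels_by_id):
    """Count how often each class appears across the given ids."""
    return Counter(c for i in ids for c in labels_by_id[i])


# Hypothetical toy data: class 0 is common, classes 1 and 2 are rare.
labels_by_id = {"a": [0], "b": [0, 1], "c": [0, 2], "d": [1, 2], "e": [0]}
multi = [1, 2, 4]  # illustrative per-class multipliers

ids = list(labels_by_id)
oversampled = oversample(ids, labels_by_id, multi)

before = class_counts(ids, labels_by_id)         # {0: 4, 1: 2, 2: 2}
after = class_counts(oversampled, labels_by_id)  # {0: 8, 1: 6, 2: 8}
```

Image "d" carries both rare classes and is copied 4 times (the max of 2 and 4), not 8. Class 0 also grows, because it co-occurs with rare classes in "b" and "c"; only the common-only images "a" and "e" stay at one copy. In this toy set the share of class 0 among all label occurrences still drops from 4/8 to 8/22.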
- Set multi[c] inversely proportional to the class count, but cap it at ~4-8 (higher values inflate training time without lift).
- ... WeightedRandomSampler with the same ratios)
- WeightedRandomSampler: PyTorch users can pass the per-image multipliers as weights instead of materializing duplicates.
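The two tips above, capped inverse-frequency multipliers and sampler weights instead of materialized duplicates, might be sketched as follows. This is one possible reading, not the original author's code: the median class count as the reference, the rounding rule, and the helper names (`multipliers_from_counts`, `image_weights`, `make_weighted_sampler`) are assumptions. `torch.utils.data.WeightedRandomSampler` is the real PyTorch class; it is imported lazily so the rest of the sketch runs without PyTorch installed.

```python
from collections import Counter
from statistics import median


def multipliers_from_counts(label_lists, num_classes, cap=4):
    """Per-class multipliers ~ reference / class_count, clipped to [1, cap].

    Assumption: the reference is the median count over classes that occur.
    """
    counts = Counter(c for labels in label_lists for c in labels)
    reference = median(counts.values())
    multi = []
    for c in range(num_classes):
        n = counts.get(c, 0)
        # A class that never occurs cannot be oversampled; leave it at 1.
        multi.append(1 if n == 0 else int(min(cap, max(1, round(reference / n)))))
    return multi


def image_weights(label_lists, multi):
    """One weight per image: the max per-class multiplier among its labels."""
    return [float(max((multi[c] for c in labels), default=1))
            for labels in label_lists]


def make_weighted_sampler(weights):
    """Build a PyTorch sampler from per-image weights (requires torch)."""
    from torch.utils.data import WeightedRandomSampler
    # Draw as many samples per epoch as materialized duplicates would yield.
    return WeightedRandomSampler(weights, num_samples=int(sum(weights)),
                                 replacement=True)


# Hypothetical toy labels: class 0 is common, class 2 appears once.
label_lists = [[0]] * 8 + [[0, 1]] * 4 + [[2]]
multi = multipliers_from_counts(label_lists, num_classes=3, cap=4)  # [1, 1, 4]
weights = image_weights(label_lists, multi)
```

With PyTorch available, `make_weighted_sampler(weights)` can be passed as the `sampler` argument of a `DataLoader` (leave `shuffle` unset, since the two options are mutually exclusive), provided the weights are in the same order as the dataset indices. Unlike materialized duplicates, sampling with replacement matches the ratios only in expectation, so a given epoch may skip some images.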