skills/nlp/negative-sample-downsampling/SKILL.md
Downsamples documents with no entity labels while keeping all positive samples, balancing class distribution in NER training without discarding entity-bearing examples.
npx skillsauth add wenmin-wu/ds-skills nlp-negative-sample-downsampling
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
In NER datasets, most documents contain no entities; in PII detection, 70-90% of essays are clean. Training on all of them wastes compute on easy negatives and biases the model toward predicting O (non-entity). Downsampling negative documents to 20-33% of their original count while keeping all positive documents rebalances the training set. This is simpler than token-level class weighting and more effective: entity-bearing documents make up a larger share of every epoch, improving recall by 2-5%.
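To make the rebalancing concrete, the sketch below computes the share of entity-bearing documents before and after downsampling. The helper name `positive_share` and the corpus size (1,000 documents, 80% of them clean) are illustrative assumptions, chosen to fall inside the 70-90% range quoted above.

```python
def positive_share(n_pos, n_neg, keep_ratio):
    """Fraction of the training set that contains entities after
    keeping only `keep_ratio` of the negative documents."""
    kept_neg = int(n_neg * keep_ratio)  # negatives that survive downsampling
    total = n_pos + kept_neg
    return n_pos / total if total else 0.0


# Hypothetical corpus: 200 positive documents, 800 clean ones.
before = positive_share(200, 800, keep_ratio=1.0)  # no downsampling
after = positive_share(200, 800, keep_ratio=0.2)   # keep 160 of 800 negatives
print(f"positive share: {before:.2%} -> {after:.2%}")
```

Under these assumed numbers the positive share rises from 20% to roughly 56%, while the epoch shrinks from 1,000 to 360 documents.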
import random

def downsample_negatives(data, keep_ratio=0.2, outside_label='O'):
    """Keep all positive docs, downsample negative docs."""
    positives = []
    negatives = []
    for doc in data:
        # Accept plain lists as well as array-like labels that expose .tolist().
        labels = doc['labels'] if isinstance(doc['labels'], list) else doc['labels'].tolist()
        # A document is positive if any token has a label other than the
        # outside label; documents with no labels at all count as negative.
        if any(label != outside_label for label in labels):
            positives.append(doc)
        else:
            negatives.append(doc)
    n_keep = int(len(negatives) * keep_ratio)
    random.shuffle(negatives)
    sampled_negatives = negatives[:n_keep]
    print(f"Positives: {len(positives)}, Negatives: {len(negatives)} -> {n_keep}")
    return positives + sampled_negatives

# Alternative: filter function for HuggingFace datasets
def filter_no_entity(example, keep_ratio=0.2):
    has_entity = set(example['labels']) != {'O'}
    # Keep every entity-bearing example; keep each negative with probability keep_ratio.
    return has_entity or (random.random() < keep_ratio)

# `dataset` is an existing dataset whose 'labels' column holds string tags.
dataset = dataset.filter(filter_no_entity)
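Both variants above draw from the global random state, so the sampled subset changes from run to run. A minimal sketch of a reproducible variant follows, assuming each document is a dict with a `labels` list of string tags. The function name `downsample_negatives_seeded`, the `seed` parameter, the `B-NAME` tag, and the toy documents are hypothetical and not part of the original skill.

```python
import random


def downsample_negatives_seeded(docs, keep_ratio=0.2, outside_label="O", seed=0):
    """Keep every positive doc and a seeded random subset of negative docs."""
    rng = random.Random(seed)  # local generator; global random state is untouched
    positives = [d for d in docs if any(l != outside_label for l in d["labels"])]
    negatives = [d for d in docs if all(l == outside_label for l in d["labels"])]
    n_keep = int(len(negatives) * keep_ratio)
    sampled = rng.sample(negatives, n_keep)  # sample without replacement
    combined = positives + sampled
    rng.shuffle(combined)  # avoid a positives-first ordering
    return combined


# Toy data (assumed): 3 entity-bearing documents and 10 clean ones.
toy_docs = (
    [{"id": i, "labels": ["O", "B-NAME", "O"]} for i in range(3)]
    + [{"id": 100 + i, "labels": ["O", "O"]} for i in range(10)]
)
balanced = downsample_negatives_seeded(toy_docs, keep_ratio=0.2, seed=42)
print(len(balanced), "documents kept")
```

Fixing the seed makes experiments comparable across runs. Drawing a fresh negative subset each epoch is another option, at the cost of reproducibility.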
Either way, roughly keep_ratio of negative documents remain, while every positive document is kept.