wenmin-wu/nlp-negative-sample-downsampling

Name: nlp-negative-sample-downsampling
Author: wenmin-wu

skills/nlp/negative-sample-downsampling/SKILL.md

npx skillsauth add wenmin-wu/ds-skills nlp-negative-sample-downsampling

Negative Sample Downsampling

Overview

In NER datasets, most documents contain no entities. In PII detection, for example, 70-90% of essays are clean. Training on all of them spends compute on easy negatives and biases the model toward predicting O (the non-entity label). Downsampling negative documents to 20-33% of their original count, while keeping every positive document, rebalances the training set. This is simpler than token-level class weighting and more effective: entity-bearing documents make up a larger share of each epoch, improving recall by 2-5%.
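To make the rebalancing concrete, the short calculation below shows how the share of entity-bearing documents changes after downsampling. The corpus size and the 80% negative rate are hypothetical numbers chosen for illustration, and `positive_fraction` is a helper name introduced here, not part of the skill.

```python
def positive_fraction(n_pos, n_neg, keep_ratio=1.0):
    """Share of documents that contain at least one entity after downsampling."""
    kept_neg = int(n_neg * keep_ratio)
    return n_pos / (n_pos + kept_neg)

# Hypothetical corpus: 10,000 documents, 80% of them entity-free.
n_pos, n_neg = 2000, 8000
before = positive_fraction(n_pos, n_neg)        # 2000 / 10000
after = positive_fraction(n_pos, n_neg, 0.25)   # 2000 / (2000 + 2000)
print(f"positive share: {before:.2f} -> {after:.2f}")
```

Under these assumptions each epoch also shrinks from 10,000 to 4,000 documents, which is where the compute saving comes from.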

Quick Start

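import random

def downsample_negatives(data, keep_ratio=0.2, outside_label='O', seed=None):
    """Keep all positive docs, downsample negative docs."""
    rng = random.Random(seed)  # seed=None draws a different sample on each call
    positives = []
    negatives = []
    for doc in data:
        # Accept plain lists as well as array-likes such as numpy arrays.
        labels = doc['labels'] if isinstance(doc['labels'], list) else doc['labels'].tolist()
        if set(labels) != {outside_label}:
            positives.append(doc)  # at least one token is not the outside label
        else:
            negatives.append(doc)

    n_keep = int(len(negatives) * keep_ratio)
    sampled_negatives = rng.sample(negatives, n_keep)

    print(f"Positives: {len(positives)}, Negatives: {len(negatives)} -> {n_keep}")
    combined = positives + sampled_negatives
    rng.shuffle(combined)  # avoid returning all positives ahead of all negatives
    return combined

# Alternative: filter function for HuggingFace datasets.
# Each negative is kept independently with probability keep_ratio,
# so the number kept is approximate rather than exact.
def filter_no_entity(example, keep_ratio=0.2):
    has_entity = set(example['labels']) != {'O'}
    return has_entity or (random.random() < keep_ratio)

# `dataset` is assumed to be an existing datasets.Dataset with a 'labels' column.
dataset = dataset.filter(filter_no_entity)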
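The filter-style variant relies on the global `random` state, so two runs can keep different negatives. One possible way to make it repeatable is to build the predicate from a privately seeded generator. This is a sketch only: `make_negative_filter` and its parameters are hypothetical names, and the toy list stands in for a real dataset.

```python
import random

def make_negative_filter(keep_ratio=0.2, outside_label="O", seed=42):
    """Build a predicate that keeps every entity-bearing example and a
    seeded random share of entity-free ones (hypothetical helper)."""
    rng = random.Random(seed)  # private generator; global random state is untouched

    def keep(example):
        has_entity = any(label != outside_label for label in example["labels"])
        return has_entity or rng.random() < keep_ratio

    return keep

# Toy data standing in for a real dataset: 5 positives, 1000 negatives.
toy = [{"labels": ["O", "B-NAME", "O"]}] * 5 + [{"labels": ["O", "O"]}] * 1000
kept = list(filter(make_negative_filter(keep_ratio=0.2, seed=0), toy))
n_pos = sum(1 for ex in kept if "B-NAME" in ex["labels"])
print(f"kept {n_pos} positives and {len(kept) - n_pos} negatives")
```

Because each negative is kept by an independent coin flip, the number kept hovers around `keep_ratio` times the negative count rather than matching it exactly. The result is repeatable only when examples are visited in the same order in a single process.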

Workflow

1. Split documents into positive (contains entities) and negative (all O)
2. Keep all positive documents
3. Randomly sample keep_ratio of negative documents
4. Combine and shuffle for training

Key Decisions

keep_ratio: 0.2-0.33 is typical. Setting it too low leaves too few clean examples and causes false positives on clean documents.
Per-epoch resampling: Draw a new sample of negatives each epoch so the model sees more of the negative pool over the course of training.
vs class weights: Downsampling works at the document level and is simpler to apply; class weights work at the token level.
Validation: Do NOT downsample the validation set. Evaluate on the true distribution.
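Per-epoch resampling can be sketched as a generator that yields a fresh training list for every epoch. This is a minimal illustration under assumptions: `epoch_training_sets` and `is_negative` are hypothetical helpers, the documents are toy data, and the seeding scheme (`seed + epoch`) is just one simple way to get draws that differ between epochs but repeat across runs.

```python
import random

def is_negative(doc, outside_label="O"):
    """True when every token in the document carries the outside label."""
    return all(label == outside_label for label in doc["labels"])

def epoch_training_sets(train_docs, n_epochs, keep_ratio=0.25, seed=0):
    """Yield one training list per epoch, resampling the negatives each time."""
    positives = [d for d in train_docs if not is_negative(d)]
    negatives = [d for d in train_docs if is_negative(d)]
    n_keep = int(len(negatives) * keep_ratio)
    for epoch in range(n_epochs):
        rng = random.Random(seed + epoch)  # different but reproducible draw per epoch
        epoch_docs = positives + rng.sample(negatives, n_keep)
        rng.shuffle(epoch_docs)
        yield epoch_docs

# Toy training split: 10 positive and 100 negative documents.
train_docs = [{"id": i, "labels": ["O", "B-NAME"]} for i in range(10)]
train_docs += [{"id": i, "labels": ["O", "O"]} for i in range(10, 110)]

for epoch, docs in enumerate(epoch_training_sets(train_docs, n_epochs=3)):
    neg_ids = sorted(d["id"] for d in docs if is_negative(d))
    print(epoch, len(docs), neg_ids[:5])
```

Only the training split passes through the generator. The validation split is left untouched so that metrics reflect the true class distribution.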

Downsamples documents with no entity labels while keeping all positive samples, balancing class distribution in NER training without discarding entity-bearing examples.

Stars: 24
Category: testing
Updated: Apr 18, 2026

Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):

npx skillsauth add wenmin-wu/ds-skills nlp-negative-sample-downsampling

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy (container and dependency vulnerability scanner): Clean, 95%
Semgrep (static code analysis for vulnerabilities): Clean, 95%
mcp-scan (Snyk) (Model Context Protocol security validation): Clean, 95%
Snyk (dep) (open source security scanning): Skipped, 50%
Socket.dev (supply chain security analysis): Skipped, 50%
VirusTotal (multi-engine malware detection): Skipped, 50%
CrowdStrike (advanced threat intelligence): Skipped, 50%
OSV-Scanner (Open Source Vulnerability database check): Skipped, 50%
OWASP Dep-Check: Skipped, 50%

Last scanned: Apr 18, 2026, 3:12 AM. Scan time: 42.7s. 1 file scanned.

Related Skills

wenmin-wu/timeseries-scaled-pinball-loss (data-ai, updated Apr 23, 2026)
Scaled Pinball Loss (SPL) metric for evaluating quantile forecasts, normalized by mean absolute successive differences of training data

wenmin-wu/timeseries-retroactive-outlier-rescaling (data-ai, updated Apr 23, 2026)
Walk backward through a time series and multiplicatively rescale segments when jumps exceed a fraction of the running mean to correct data collection anomalies

wenmin-wu/timeseries-ratio-target-for-smape (testing, updated Apr 23, 2026)
Transform forecasting target to next/current ratio minus one so that optimizing MAE or squared error implicitly minimizes SMAPE

wenmin-wu/timeseries-quantile-ratio-scaling (tools, updated Apr 23, 2026)
Convert point forecasts to prediction intervals by scaling with logit-transformed quantile ratios passed through a Normal CDF

Install manually

# Clone the repo
git clone https://github.com/wenmin-wu/ds-skills.git

# Copy into Claude Code skills folder (global)
cp -r ds-skills/skills/nlp/negative-sample-downsampling ~/.claude/skills/

For the location of the skills folder, see the official Claude Code Skills documentation.

Repository: wenmin-wu/ds-skills (24 stars)

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT