skills/nlp/spacy-custom-ner-span-extraction/SKILL.md
Train per-class spaCy NER models to extract task-specific spans as custom named entities with compounding batch sizes
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
npx skillsauth add wenmin-wu/ds-skills nlp-spacy-custom-ner-span-extraction
Frame span extraction as NER by defining a custom entity type (e.g. "selected_text") and training a separate spaCy NER model per class. Each model specializes in extracting the spans relevant to one class (e.g. positive or negative). Use compounding batch sizes, which start small and grow toward a cap, for stable training. This is a simpler alternative to transformer QA models when data or compute is limited.
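To make the compounding schedule concrete, here is a minimal pure-Python sketch of a series that starts at 4, grows by a factor of 1.001 per batch, and is capped at 500. `compounding_sizes` and `batches_until` are hypothetical helper names that mimic the behaviour of `spacy.util.compounding` for illustration only; this is not spaCy's own implementation.

```python
import itertools
import math


def compounding_sizes(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop.

    Hypothetical stand-in that mimics spacy.util.compounding for a growing series.
    """
    size = start
    while True:
        yield min(size, stop)
        size *= compound


def batches_until(start, target, compound):
    """Number of batches before the uncapped size first reaches target."""
    return math.ceil(math.log(target / start) / math.log(compound))


# First few batch sizes of the schedule
first_three = list(itertools.islice(compounding_sizes(4.0, 500.0, 1.001), 3))
print(first_three)

# How many batches it takes for the size to grow from 4 to the cap of 500
print(batches_until(4.0, 500.0, 1.001))
```

With these values the cap is reached only after several thousand batches. If a fresh schedule is created at the start of every epoch, the batch size restarts at 4 each time, so on a small dataset it stays close to its starting value; a larger growth factor would be needed for the batches to grow noticeably within one epoch.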
import random

import spacy
from spacy.training import Example
from spacy.util import minibatch, compounding


def prepare_ner_data(df, label='selected_text'):
    """Convert a DataFrame to spaCy NER training format."""
    train_data = []
    for _, row in df.iterrows():
        text = row['text']
        # Character offset of the first occurrence of the target span
        start = text.find(row['selected_text'])
        if start == -1:
            continue  # span does not occur verbatim in the text; skip the row
        end = start + len(row['selected_text'])
        train_data.append((text, {"entities": [(start, end, label)]}))
    return train_data


def train_ner(train_data, label='selected_text', n_iter=20, drop=0.5):
    """Train a blank spaCy NER model (spaCy v3 API)."""
    nlp = spacy.blank("en")
    ner = nlp.add_pipe("ner", last=True)
    ner.add_label(label)
    # spaCy v3 trains on Example objects, not (text, annotations) tuples
    examples = [Example.from_dict(nlp.make_doc(text), annotations)
                for text, annotations in train_data]
    # initialize() replaces the deprecated begin_training() and returns the optimizer
    optimizer = nlp.initialize(lambda: examples)
    for _ in range(n_iter):
        random.shuffle(examples)
        # Batch size starts at 4 and grows by a factor of 1.001 per batch, capped at 500
        batches = minibatch(examples, size=compounding(4.0, 500.0, 1.001))
        for batch in batches:
            nlp.update(batch, drop=drop, sgd=optimizer)
    return nlp


# Train one model per sentiment class
model_pos = train_ner(prepare_ner_data(df[df.sentiment == 'positive']))
model_neg = train_ner(prepare_ner_data(df[df.sentiment == 'negative']))
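Once the per-class models are trained, prediction amounts to routing each text to the model for its class and reading off the predicted entity. The sketch below is one possible way to do that, not part of the original skill: `extract_span` is a hypothetical helper, and both fallbacks (returning the full text when a class has no model, or when the model predicts no entity) are assumptions.

```python
def extract_span(models, text, sentiment):
    """Return the predicted span for text, or the full text as a fallback.

    models: dict mapping a sentiment label to a callable (such as a trained
    spaCy pipeline) that returns an object with an `ents` sequence whose
    items expose a `.text` attribute.
    """
    model = models.get(sentiment)
    if model is None:
        # Assumption: classes without a dedicated model keep the full text
        return text
    ents = model(text).ents
    # Assumption: use the first predicted entity; fall back to the full text if none
    return ents[0].text if ents else text
```

With the models trained above, a call might look like `extract_span({'positive': model_pos, 'negative': model_neg}, text, sentiment)`.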