skills/nlp/subtoken-labeling-strategy/SKILL.md
Controls whether all subtokens or only the first subtoken of each word receive NER labels during training and inference.
Install: npx skillsauth add wenmin-wu/ds-skills nlp-subtoken-labeling-strategy
Transformer tokenizers split words into subword tokens ("playing" → "play", "##ing"). For token classification (NER), you must decide: label all subtokens of a word, or only the first? Labeling all subtokens gives more supervision signal and can improve recall, while first-only is cleaner and avoids label noise from meaningless subword pieces. At inference, always use the first subtoken's prediction to represent the word.
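To make the word-to-subtoken mapping concrete, here is a minimal sketch of a greedy longest-match splitter in the spirit of WordPiece. The vocabulary `VOCAB` and the helpers `toy_wordpiece` and `encode_words` are hypothetical and exist only for illustration; a real tokenizer has a learned vocabulary and more rules. The point is the shape of `word_ids`: one word index per subtoken, and `None` for special tokens.

```python
def toy_wordpiece(word, vocab):
    """Greedy longest-match-first split of one word (simplified WordPiece)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            # Non-initial pieces carry the "##" continuation prefix.
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                break
            end -= 1
        if end == start:  # nothing matched: treat the whole word as unknown
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces


def encode_words(words, vocab):
    """Return (tokens, word_ids); special tokens map to None."""
    tokens, word_ids = ["[CLS]"], [None]
    for word_idx, word in enumerate(words):
        for piece in toy_wordpiece(word.lower(), vocab):
            tokens.append(piece)
            word_ids.append(word_idx)
    tokens.append("[SEP]")
    word_ids.append(None)
    return tokens, word_ids


# Hypothetical toy vocabulary, chosen so two words get split.
VOCAB = {"she", "is", "play", "##ing", "in", "ber", "##lin"}
tokens, word_ids = encode_words(["She", "is", "playing", "in", "Berlin"], VOCAB)
print(tokens)    # ['[CLS]', 'she', 'is', 'play', '##ing', 'in', 'ber', '##lin', '[SEP]']
print(word_ids)  # [None, 0, 1, 2, 2, 3, 4, 4, None]
```

Words 2 and 4 each occupy two token positions, which is exactly where the all-subtokens and first-only strategies differ.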
def align_labels(word_labels, word_ids, label_all_subtokens=True):
    """Align word-level labels to subtoken positions.

    Args:
        word_labels: list of labels, one per word
        word_ids: tokenizer output word_ids(), None for special tokens
        label_all_subtokens: if True, label all subtokens; if False, only first
    """
    label_ids = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:
            label_ids.append(-100)  # special token: ignore in loss
        elif word_idx != previous_word_idx:
            label_ids.append(word_labels[word_idx])  # first subtoken
        elif label_all_subtokens:
            label_ids.append(word_labels[word_idx])  # subsequent subtokens
        else:
            label_ids.append(-100)  # ignore subsequent subtokens
        previous_word_idx = word_idx
    return label_ids
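With a BIO tag scheme, copying a word's label to every subtoken repeats `B-` tags on continuation pieces. One common refinement, which is not part of the function above, is to give continuation pieces the matching `I-` tag instead. The sketch below assumes string tags of the form `B-X` / `I-X` / `O`; `align_bio_labels` and `IGNORE` are hypothetical names, and in practice the tags would be mapped to integer ids before training.

```python
IGNORE = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def align_bio_labels(word_tags, word_ids, label_all_subtokens=True):
    """Hypothetical BIO-aware variant: continuation pieces of a B- word get I-."""
    aligned, previous = [], None
    for word_idx in word_ids:
        if word_idx is None:
            aligned.append(IGNORE)  # special token
        elif word_idx != previous:
            aligned.append(word_tags[word_idx])  # first subtoken keeps the tag
        elif label_all_subtokens:
            tag = word_tags[word_idx]
            # B-LOC on a continuation piece becomes I-LOC; I- and O are unchanged.
            aligned.append("I-" + tag[2:] if tag.startswith("B-") else tag)
        else:
            aligned.append(IGNORE)  # first-only: mask continuation pieces
        previous = word_idx
    return aligned


word_tags = ["O", "O", "O", "O", "B-LOC"]          # one tag per word
word_ids = [None, 0, 1, 2, 2, 3, 4, 4, None]       # words 2 and 4 are split in two
all_subtokens = align_bio_labels(word_tags, word_ids, label_all_subtokens=True)
first_only = align_bio_labels(word_tags, word_ids, label_all_subtokens=False)
print(all_subtokens)  # [-100, 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC', -100]
print(first_only)     # [-100, 'O', 'O', 'O', -100, 'O', 'B-LOC', -100, -100]
```

Whether this variant helps depends on the dataset and evaluation script, so it is worth comparing against plain label copying rather than assuming an improvement.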
def predict_words(token_preds, word_ids):
    """Map token predictions back to words using first-subtoken strategy."""
    word_preds = []
    previous_word_idx = -1
    for idx, word_idx in enumerate(word_ids):
        if word_idx is None:
            continue  # special token: no word to represent
        if word_idx != previous_word_idx:
            word_preds.append(token_preds[idx])  # first subtoken of a new word
        previous_word_idx = word_idx
    return word_preds
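A small worked example of the first-subtoken mapping, under the assumption that `token_preds` holds one already-decoded tag per token position. `first_subtoken_positions` is a hypothetical helper that returns the positions used for word-level predictions. The second half illustrates a caveat: if the input was truncated, trailing words have no subtokens, so the result can be shorter than the word list.

```python
def first_subtoken_positions(word_ids):
    """Token positions holding the first subtoken of each word."""
    positions, previous = [], None
    for pos, word_idx in enumerate(word_ids):
        if word_idx is None:
            continue  # skip special tokens
        if word_idx != previous:
            positions.append(pos)
        previous = word_idx
    return positions


word_ids = [None, 0, 1, 2, 2, 3, 4, 4, None]
# Position 4 is a continuation piece with a spurious prediction; it is discarded.
token_preds = ["O", "O", "O", "O", "B-LOC", "O", "B-LOC", "O", "O"]

positions = first_subtoken_positions(word_ids)
word_preds = [token_preds[p] for p in positions]
print(positions)   # [1, 2, 3, 5, 6]
print(word_preds)  # ['O', 'O', 'O', 'O', 'B-LOC']

# Truncated input: only words 0..2 survive, so fewer predictions come back.
truncated_word_ids = [None, 0, 1, 2, 2, None]
covered_words = len(first_subtoken_positions(truncated_word_ids))
print(covered_words)  # 3
```

Comparing the number of returned predictions against the number of input words is a cheap way to detect truncation before scoring.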
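Note: word_ids() is available on encodings produced by a fast (Rust-backed) Hugging Face tokenizer. Passing return_offsets_mapping=True additionally returns character offsets; it is not required for word_ids().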