skills/nlp/custom-bpe-tokenizer/SKILL.md
Trains a Byte-Pair Encoding tokenizer on the task corpus to capture domain-specific vocabulary, typos, and subword patterns.
npx skillsauth add wenmin-wu/ds-skills nlp-custom-bpe-tokenizerInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Instead of relying on a pretrained tokenizer's vocabulary, train a BPE tokenizer on your task's text corpus. This captures domain-specific terms, common misspellings, and subword patterns unique to your data. Particularly effective when the test distribution differs from standard pretraining corpora (e.g., student essays, generated text, medical notes).
from tokenizers import Tokenizer, models, trainers, normalizers, pre_tokenizers
def train_bpe_tokenizer(texts, vocab_size=30000, lowercase=False):
"""Train BPE tokenizer on task corpus."""
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
norm_list = [normalizers.NFC()]
if lowercase:
norm_list.append(normalizers.Lowercase())
tokenizer.normalizer = normalizers.Sequence(norm_list)
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.BpeTrainer(
vocab_size=vocab_size,
special_tokens=special_tokens,
min_frequency=2,
)
tokenizer.train_from_iterator(texts, trainer=trainer)
return tokenizer
# Train on both train + test text (unsupervised, no leakage)
all_texts = list(train_df['text']) + list(test_df['text'])
tokenizer = train_bpe_tokenizer(all_texts, vocab_size=30000)
# Use with TF-IDF or as input to models
tokenized = [tokenizer.encode(t).tokens for t in texts]
data-ai
Scaled Pinball Loss (SPL) metric for evaluating quantile forecasts, normalized by mean absolute successive differences of training data
data-ai
Walk backward through a time series and multiplicatively rescale segments when jumps exceed a fraction of the running mean to correct data collection anomalies
testing
Transform forecasting target to next/current ratio minus one so that optimizing MAE or squared error implicitly minimizes SMAPE
tools
Convert point forecasts to prediction intervals by scaling with logit-transformed quantile ratios passed through a Normal CDF