skills/nlp/domain-wordpiece-tfidf/SKILL.md
Trains a WordPiece tokenizer on in-domain text, then feeds its subword token IDs into TF-IDF vectorization for domain-adapted sparse features.
npx skillsauth add wenmin-wu/ds-skills nlp-domain-wordpiece-tfidf
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Standard TF-IDF uses whitespace or regex tokenization, which misses subword patterns that matter in domain-specific text (misspellings, slang, coded language in toxic text). A WordPiece tokenizer trained on the domain corpus learns subword units that recur in that corpus; feeding the resulting token IDs into TF-IDF produces sparse features that capture domain vocabulary at the subword level. This bridges neural tokenizers and classical ML: you get learned subword tokenization (WordPiece, the BPE-like scheme used by BERT) at the speed of TF-IDF plus Ridge regression.
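To make the subword idea concrete, here is a minimal pure-Python sketch of the greedy longest-match-first segmentation that WordPiece applies at encoding time. The function name `wordpiece_encode` and the vocabulary `toy_vocab` are hypothetical and hand-written for illustration; a trained tokenizer learns its vocabulary from the corpus, and this is not the `tokenizers` library's implementation.

```python
def wordpiece_encode(word, vocab, unk="[UNK]", prefix="##"):
    """Greedy longest-match-first segmentation of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # mark word-internal pieces
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary (an assumption, not a trained one).
toy_vocab = {"id", "##iot", "##1", "##0", "##t", "stup", "##id"}

print(wordpiece_encode("idiot", toy_vocab))   # ['id', '##iot']
print(wordpiece_encode("id10t", toy_vocab))   # ['id', '##1', '##0', '##t']
print(wordpiece_encode("stupid", toy_vocab))  # ['stup', '##id']
```

With this toy vocabulary, the obfuscated spelling "id10t" still shares the piece "id" with "idiot", whereas whitespace tokenization would treat the two as unrelated terms.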
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

# Assumes df is a pandas DataFrame with a 'text' column and a numeric 'score' column.

# Train WordPiece tokenizer on domain text
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.BertNormalizer(lowercase=True)
tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()
# Register [UNK] as a special token so it exists in the trained vocabulary
trainer = trainers.WordPieceTrainer(vocab_size=25000, special_tokens=["[UNK]"])

def corpus_iter():
    for text in df['text']:
        yield text

tokenizer.train_from_iterator(corpus_iter(), trainer=trainer)

# Tokenize all texts into lists of integer token IDs
tokenized = [tokenizer.encode(t).ids for t in df['text']]

# Feed into TF-IDF with identity tokenizer (already tokenized)
def identity(x):
    return x

vectorizer = TfidfVectorizer(
    analyzer='word', tokenizer=identity,
    preprocessor=identity, token_pattern=None)
X = vectorizer.fit_transform(tokenized)

# Fit a Ridge regressor on the sparse subword features
model = Ridge(alpha=0.8)
model.fit(X, df['score'])
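For intuition about what TfidfVectorizer computes once it receives pre-tokenized ID lists, the following is a small pure-Python sketch of TF-IDF weighting. It assumes scikit-learn's default settings (raw term counts, smoothed IDF of ln((1 + n) / (1 + df)) + 1, and L2 row normalization). The helper `tfidf_rows` and the token IDs in `toy_docs` are hypothetical stand-ins, not output from a trained tokenizer.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """TF-IDF over pre-tokenized documents (lists of hashable tokens)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each token.
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed IDF, as documented for scikit-learn's smooth_idf=True.
    idf = {tok: math.log((1 + n_docs) / (1 + d)) + 1.0 for tok, d in df.items()}
    rows = []
    for doc in docs:
        # Raw term count times IDF for every token present in the document.
        weights = {tok: count * idf[tok] for tok, count in Counter(doc).items()}
        # L2-normalize the row; guard against empty documents.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({tok: w / norm for tok, w in weights.items()})
    return rows


# Hypothetical token IDs standing in for tokenizer.encode(text).ids.
toy_docs = [[7, 12, 12], [7, 30], [7, 12, 45]]
rows = tfidf_rows(toy_docs)

for row in rows:
    print({tok: round(w, 3) for tok, w in row.items()})
```

Token 7 occurs in every toy document, so it gets the minimum IDF and is down-weighted relative to rarer IDs such as 30 or 45. The same effect is what lets frequent subword pieces fade while distinctive domain pieces dominate the features.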