skills/nlp/test-vocabulary-alignment/SKILL.md
Fits a TF-IDF vectorizer on the test set first to extract its vocabulary, then refits on the train set using that vocabulary for feature consistency.
npx skillsauth add wenmin-wu/ds-skills nlp-test-vocabulary-alignment
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
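When train and test have different text distributions (e.g., different essay prompts or different LLM generators), a TF-IDF vectorizer fit only on train can miss n-grams that appear only in test. Fit the vectorizer on test first to discover its vocabulary, then refit on train using only that vocabulary. Every feature the model sees is then guaranteed to occur in test, and the train and test matrices share the same columns. N-grams that never occur in train become all-zero train columns, and the IDF weights are still estimated from train.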
from sklearn.feature_extraction.text import TfidfVectorizer


def aligned_tfidf(train_texts, test_texts, **tfidf_kwargs):
    """Fit vocabulary on test, then vectorize train with that vocab."""
    # Step 1: Discover test vocabulary
    vec_test = TfidfVectorizer(**tfidf_kwargs)
    vec_test.fit(test_texts)
    vocab = vec_test.vocabulary_

    # Step 2: Refit on train using test vocabulary
    vec_train = TfidfVectorizer(vocabulary=vocab, **tfidf_kwargs)
    tf_train = vec_train.fit_transform(train_texts)
    tf_test = vec_train.transform(test_texts)
    return tf_train, tf_test, vec_train


tf_train, tf_test, vectorizer = aligned_tfidf(
    train_texts, test_texts,
    ngram_range=(3, 5), sublinear_tf=True, strip_accents='unicode'
)