skills/tabular/featureunion-field-dispatch/SKILL.md
Use sklearn FeatureUnion with closure-based preprocessors to apply different vectorizers to different DataFrame columns in a single fit_transform call
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
npx skillsauth add wenmin-wu/ds-skills tabular-featureunion-field-dispatch
When a dataset has several text and categorical columns that each need a different vectorization (TF-IDF for descriptions, CountVectorizer for names, whole-cell token patterns for categoricals), use a FeatureUnion whose vectorizers each take a custom preprocessor closure. Each closure extracts its vectorizer's target column from the row it is handed, then applies the default preprocessing. This keeps the entire feature pipeline in one fit_transform call and produces a single sparse matrix.
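from sklearn.pipeline import FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Reuse sklearn's default preprocessing (lowercasing) inside every closure.
default_preprocessor = CountVectorizer().build_preprocessor()

def build_preprocessor(field):
    # `df` is the input DataFrame, assumed to be defined already.
    # Look up the column position once, when the closure is created.
    idx = list(df.columns).index(field)
    # Each "document" is a whole row; extract this field, then preprocess it.
    return lambda x: default_preprocessor(x[idx])

vectorizer = FeatureUnion([
    ('name', CountVectorizer(
        ngram_range=(1, 2), max_features=50000,
        preprocessor=build_preprocessor('name'))),
    # token_pattern='.+' keeps the whole cell as a single token.
    ('category', CountVectorizer(
        token_pattern=r'.+',
        preprocessor=build_preprocessor('category'))),
    ('brand', CountVectorizer(
        token_pattern=r'.+',
        preprocessor=build_preprocessor('brand'))),
    ('description', TfidfVectorizer(
        ngram_range=(1, 3), max_features=100000,
        preprocessor=build_preprocessor('description'))),
])

# Pass the rows as arrays so the closures can index them by position.
X = vectorizer.fit_transform(df.values)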
- Get the default preprocessor from CountVectorizer().build_preprocessor().
- Give each vectorizer in the FeatureUnion its own closure.
- Use token_pattern='.+' for single-value categoricals (treat the entire cell as one token).
- Call fit_transform(df.values): each row is a numpy array, and the closures extract the right column.
- .values gives array rows to the closures.
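The points above can be checked on a small example. The following is a minimal sketch, not the original author's code: the toy data, the names toy_df and toy_vectorizer, and the two-argument form of build_preprocessor are all assumptions made so the block runs on its own. It shows that each categorical cell ends up as a single vocabulary entry and that the union yields one sparse matrix with one row per DataFrame row.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import FeatureUnion

# Hypothetical toy data; only the column names come from the snippet above.
toy_df = pd.DataFrame({
    "name": ["Red Phone Case", "Blue Phone", "Wool Scarf"],
    "category": ["Electronics/Phones", "Electronics/Phones", "Clothing/Accessories"],
    "brand": ["Acme", "Acme", "Wooly Co"],
    "description": ["slim red case", "blue phone with case", "warm wool scarf"],
})

default_preprocessor = CountVectorizer().build_preprocessor()

def build_preprocessor(columns, field):
    """Return a preprocessor that picks `field` out of a row array."""
    idx = list(columns).index(field)
    return lambda row: default_preprocessor(row[idx])

cols = toy_df.columns
toy_vectorizer = FeatureUnion([
    ("name", CountVectorizer(
        ngram_range=(1, 2),
        preprocessor=build_preprocessor(cols, "name"))),
    ("category", CountVectorizer(
        token_pattern=r".+",
        preprocessor=build_preprocessor(cols, "category"))),
    ("brand", CountVectorizer(
        token_pattern=r".+",
        preprocessor=build_preprocessor(cols, "brand"))),
    ("description", TfidfVectorizer(
        ngram_range=(1, 3),
        preprocessor=build_preprocessor(cols, "description"))),
])

# Each element of toy_df.values is one row as a numpy object array.
X = toy_vectorizer.fit_transform(toy_df.values)

# Inspect the fitted sub-vectorizers by name.
fitted = dict(toy_vectorizer.transformer_list)
category_vocab = set(fitted["category"].vocabulary_)
brand_vocab = set(fitted["brand"].vocabulary_)

print(X.shape[0])              # one output row per DataFrame row
print(sorted(category_vocab))  # whole cells, lowercased by the default preprocessor
```

One caveat with this pattern: a missing value (NaN) in any of these columns reaches default_preprocessor as a float and fails when it is lowercased. Filling missing values with a placeholder string first, for example with toy_df.fillna(""), is one way to avoid that.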