skills/tabular/tfidf-svd-dense-text-features/SKILL.md
Compress TF-IDF sparse text vectors into a handful of dense TruncatedSVD components so GBDTs can consume free-text fields as plain tabular columns
npx skillsauth add wenmin-wu/ds-skills tabular-tfidf-svd-dense-text-features

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Gradient boosters (LightGBM, XGBoost, CatBoost) do not consume sparse text matrices well: feature importance is diluted across thousands of rarely hit columns, and training slows down. The clean fix is to fit a TF-IDF vectorizer on the union of train and test text, fit a small TruncatedSVD (3-20 components) on the resulting TF-IDF matrix, and project each row into that dense subspace. You get a handful of columns like svd_desc_1..svd_desc_5 that capture the dominant topic axes and drop straight into your tabular feature frame. This approach was used in top kernels for Avito Demand Prediction to fold title/description text into a LightGBM model alongside price, category, and region.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
import pandas as pd
tfidf = TfidfVectorizer(ngram_range=(1, 1), max_features=100_000)
# Fit the vocabulary and IDF weights on train + test text (transductive).
tfidf.fit(pd.concat([train_df['description'], test_df['description']]).fillna(''))
train_vec = tfidf.transform(train_df['description'].fillna(''))
test_vec = tfidf.transform(test_df['description'].fillna(''))

n_comp = 5
svd = TruncatedSVD(n_components=n_comp, algorithm='arpack', random_state=0)
svd.fit(train_vec)  # the SVD basis is learned from the train matrix only

# Project both splits into the same dense subspace.
cols = [f'svd_desc_{i+1}' for i in range(n_comp)]
train_df[cols] = svd.transform(train_vec)
test_df[cols] = svd.transform(test_vec)
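To sanity-check the snippet above without a real dataset, here is a minimal sketch that runs the same steps on toy frames. The example texts and the use of `explained_variance_ratio_` to judge `n_comp` are assumptions added for illustration, not part of the original skill.

```python
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for train_df / test_df; the texts are made up for illustration.
train_df = pd.DataFrame({"description": [
    "red leather sofa in good condition",
    "blue fabric sofa almost new",
    "mountain bike with new tires",
    "road bike carbon frame",
    None,  # missing text becomes '' via fillna and yields an all-zero TF-IDF row
    "leather jacket size medium",
]})
test_df = pd.DataFrame({"description": [
    "green sofa with cushions",
    "kids bike with training wheels",
]})

# Shared vocabulary fitted on train + test text.
tfidf = TfidfVectorizer(ngram_range=(1, 1), max_features=100_000)
tfidf.fit(pd.concat([train_df["description"], test_df["description"]]).fillna(""))
train_vec = tfidf.transform(train_df["description"].fillna(""))
test_vec = tfidf.transform(test_df["description"].fillna(""))

# arpack needs n_components strictly below min(n_rows, n_terms), so keep it small here.
n_comp = 3
svd = TruncatedSVD(n_components=n_comp, algorithm="arpack", random_state=0)
svd.fit(train_vec)

cols = [f"svd_desc_{i+1}" for i in range(n_comp)]
train_df[cols] = svd.transform(train_vec)
test_df[cols] = svd.transform(test_vec)

# Share of TF-IDF variance captured per component, and in total.
explained = svd.explained_variance_ratio_
print(train_df[cols].shape, test_df[cols].shape)
print(explained.round(3), explained.sum().round(3))
```

On real text a handful of components usually captures only a small fraction of the total variance; that is expected, since the goal is a few dense topic axes rather than a faithful reconstruction.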
- TfidfVectorizer on train + test text so the vocabulary is shared (transductive)
- TruncatedSVD with a small n_components on the train matrix
- algorithm='arpack': stable top-k decomposition on very sparse matrices; use randomized if n_components > 50.
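When several free-text fields need the same treatment (for example title and description), the three points above can be wrapped in one function. The sketch below is one possible way to do that: `add_svd_text_features`, the `svd_<column>_<i>` naming, and the toy data are hypothetical and not taken from the original skill.

```python
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer


def add_svd_text_features(train_df, test_df, text_col, n_comp=5, max_features=100_000):
    """Hypothetical helper: append svd_<text_col>_<i> columns to both frames in place."""
    train_text = train_df[text_col].fillna("")
    test_text = test_df[text_col].fillna("")

    # Shared vocabulary: fit TF-IDF on train + test text (transductive).
    tfidf = TfidfVectorizer(ngram_range=(1, 1), max_features=max_features)
    tfidf.fit(pd.concat([train_text, test_text]))
    train_vec = tfidf.transform(train_text)
    test_vec = tfidf.transform(test_text)

    # arpack for a small top-k; switch to randomized once n_components exceeds 50.
    algorithm = "arpack" if n_comp <= 50 else "randomized"
    svd = TruncatedSVD(n_components=n_comp, algorithm=algorithm, random_state=0)
    svd.fit(train_vec)  # the SVD basis comes from the train matrix only

    cols = [f"svd_{text_col}_{i + 1}" for i in range(n_comp)]
    train_df[cols] = svd.transform(train_vec)
    test_df[cols] = svd.transform(test_vec)
    return cols


# Toy frames with two text fields and one numeric column (made-up values).
train_df = pd.DataFrame({
    "title": ["leather sofa", "fabric sofa", "mountain bike", "road bike", "leather jacket"],
    "description": ["red sofa good condition", "blue sofa almost new",
                    "bike with new tires", None, "jacket size medium"],
    "price": [250, 180, 320, 900, 60],
})
test_df = pd.DataFrame({
    "title": ["green sofa", "kids bike"],
    "description": ["sofa with cushions", "bike with training wheels"],
    "price": [210, 75],
})

new_cols = []
for col in ["title", "description"]:
    new_cols += add_svd_text_features(train_df, test_df, col, n_comp=2)

# The dense columns now sit next to the ordinary tabular features.
print(train_df[["price"] + new_cols].head())
```

Each text field gets its own vectorizer and SVD basis, so the component columns of different fields are not comparable with each other; the GBDT simply treats them as independent numeric features.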