skills/tabular/tfidf-weighted-category-counts/SKILL.md
Convert per-group categorical event counts into TF-IDF-style features using log(1+tf/total) * log(N/df)
npx skillsauth add wenmin-wu/ds-skills tabular-tfidf-weighted-category-counts
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Categorical count features (how often each activity, key, or event type occurred per session) suffer from two problems: long sessions dominate raw counts, and common categories drown out rare but informative ones. TF-IDF, the same transform used on text term counts, addresses both: normalize each count by the session total (TF), then weight it by how rare the category is across sessions (IDF). The result is a dense numeric feature matrix of the same shape that preserves the rarity signal. Fit the IDF on train only and reuse it on test to avoid leakage.
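To make the two effects concrete, here is a minimal worked sketch on a toy matrix. The session ids, category names, and counts are made up for illustration, and it uses the unsmoothed `log(1 + tf) * log(N/df)` form from the description rather than the `df + 1` variant.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 3 sessions x 2 categories (all names are illustrative).
counts = pd.DataFrame(
    {"Input": [90, 9, 5], "Paste": [10, 1, 0]},
    index=["s1", "s2", "s3"],
)

totals = counts.sum(axis=1)       # events per session: 100, 10, 5
tf = counts.div(totals, axis=0)   # within-session share of each category
df = (counts > 0).sum(axis=0)     # sessions containing each category: 3, 2
idf = np.log(len(counts) / df)    # unsmoothed log(N / df)

# DataFrame * Series aligns the IDF weights on the column labels.
features = np.log1p(tf) * idf
```

Sessions `s1` and `s2` end up with identical rows even though `s1` is ten times longer, because only the within-session shares matter. `Input` occurs in every session, so its IDF is `log(3/3) = 0` and the column is zeroed out, while the rarer `Paste` keeps a positive weight.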
```python
import numpy as np


class CategoryTfidf:
    """TF-IDF-style weighting for a sessions x categories count matrix."""

    def __init__(self):
        self.idf = {}

    def fit_transform(self, counts_df):
        # counts_df: rows = session id, columns = categories, values = counts
        cnts = counts_df.sum(axis=1).replace(0, 1)  # guard against empty sessions
        out = counts_df.copy().astype(float)
        N = len(counts_df)
        for col in counts_df.columns:
            df_col = (counts_df[col] > 0).sum()  # sessions containing this category
            idf = np.log(N / (df_col + 1))  # +1 keeps the ratio finite when df_col == 0
            self.idf[col] = idf
            tf = counts_df[col] / cnts
            out[col] = np.log1p(tf) * idf  # log(1 + tf) * idf
        return out

    def transform(self, counts_df):
        cnts = counts_df.sum(axis=1).replace(0, 1)
        out = counts_df.copy().astype(float)
        for col in counts_df.columns:
            tf = counts_df[col] / cnts
            # Categories not seen during fit get an IDF of 0
            out[col] = np.log1p(tf) * self.idf.get(col, 0)
        return out
```
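The train/test split is where column mismatches tend to show up. The following is a vectorized sketch of the same idea, not the class above: the helper names `fit_idf` and `apply_tfidf` and the toy data are hypothetical. It assumes the smoothed `log(N / (df + 1))` IDF and, as a design choice of this sketch, aligns test columns to the train columns so downstream models always see the same feature set.

```python
import numpy as np
import pandas as pd


def fit_idf(train_counts: pd.DataFrame) -> pd.Series:
    """Smoothed IDF per category, log(N / (df + 1)), computed on train only."""
    n_sessions = len(train_counts)
    doc_freq = (train_counts > 0).sum(axis=0)
    return np.log(n_sessions / (doc_freq + 1))


def apply_tfidf(counts: pd.DataFrame, idf: pd.Series) -> pd.DataFrame:
    """log(1 + count/total) * idf, restricted to the categories seen at fit time."""
    # Session totals use every observed column; empty sessions divide by 1.
    totals = counts.sum(axis=1).replace(0, 1)
    tf = counts.div(totals, axis=0)
    # Align to train columns: missing categories become 0, unseen ones are dropped.
    tf = tf.reindex(columns=idf.index, fill_value=0.0)
    return np.log1p(tf) * idf


# Hypothetical toy data; the third train session is empty on purpose.
train = pd.DataFrame({"Input": [8, 3, 0, 0], "Paste": [2, 0, 0, 1], "Cut": [0, 1, 0, 0]})
test = pd.DataFrame({"Input": [4], "Move": [1]})  # "Move" never appears in train

idf = fit_idf(train)                   # fitted on train only
train_feats = apply_tfidf(train, idf)
test_feats = apply_tfidf(test, idf)    # reuses the train IDF, so no leakage
```

Dropping unseen test categories differs from assigning them weight 0 in place. Either way they contribute no signal, but reindexing keeps the train and test feature matrices column-compatible.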
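- Build the session-by-category count matrix with `crosstab` or `groupby().size().unstack()`.
- `log(N/(df+1))` avoids division by zero when a category column is all zeros in train. A category that appears only in test has no fitted IDF and gets weight 0 in `transform`.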
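As a sketch of the first point, both routes below produce the same count matrix from a long-format event log. The `events` frame and its column names `session_id` and `activity` are hypothetical.

```python
import pandas as pd

# Hypothetical long-format event log: one row per event.
events = pd.DataFrame({
    "session_id": ["a", "a", "a", "b", "b", "c"],
    "activity": ["Input", "Input", "Paste", "Input", "Cut", "Input"],
})

# Option 1: crosstab tabulates how often each (session, activity) pair occurs.
counts_ct = pd.crosstab(events["session_id"], events["activity"])

# Option 2: groupby + size + unstack; fill_value=0 fills pairs that never occur.
counts_gb = (
    events.groupby(["session_id", "activity"])
    .size()
    .unstack(fill_value=0)
)
```

Sessions with no events at all do not appear in either result, so reindex against the full list of session ids if those rows are needed.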