skills/tabular/inner-kfold-target-encoding/SKILL.md
Computes leak-free target encoding statistics (mean, std, min, max) using nested inner KFold within each outer CV fold, preventing target leakage that occurs with naive groupby-based encoding.
npx skillsauth add wenmin-wu/ds-skills tabular-inner-kfold-target-encoding
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Naive target encoding leaks information because the target statistics for a row's category include that row's own target value. This skill uses a nested (inner) KFold loop: within each outer training fold, an inner 5-fold CV computes category-level aggregates only on held-in data, then applies them to the held-out inner fold. The result is a leak-free set of TE features (mean via sklearn, std/min/max via groupby).
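To make the leakage concrete, here is a minimal sketch on synthetic data. The column names, sizes, and random data are assumptions for illustration and do not come from the skill. The target is pure noise, so an honest encoding should carry almost no signal. The naive groupby mean still correlates with the target because each row's own value is part of its category mean, while the out-of-fold version does not.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n_rows, n_cats = 2000, 500  # high cardinality: roughly 4 rows per category

# Synthetic data: the target is independent of the category by construction.
df = pd.DataFrame({
    "cat": rng.integers(0, n_cats, size=n_rows),
    "y": rng.integers(0, 2, size=n_rows).astype(float),
})

# Naive encoding: each row's own target contributes to its category mean.
naive_te = df.groupby("cat")["y"].transform("mean")

# Out-of-fold encoding: means are fitted on held-in rows and applied to held-out rows.
oof_te = pd.Series(np.nan, index=df.index)
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fit_idx, apply_idx in kf.split(df):
    fit = df.iloc[fit_idx]
    means = fit.groupby("cat")["y"].mean()
    # Assumption: categories unseen in the held-in rows fall back to the held-in global mean.
    mapped = df["cat"].iloc[apply_idx].map(means).fillna(fit["y"].mean())
    oof_te.iloc[apply_idx] = mapped.values

naive_corr = naive_te.corr(df["y"])
oof_corr = oof_te.corr(df["y"])
print(f"naive TE vs target correlation: {naive_corr:.3f}")
print(f"OOF TE vs target correlation:   {oof_corr:.3f}")
```

The gap between the two correlations is the leaked signal; it grows as categories get rarer, which is why high-cardinality columns are the most affected.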
import pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import TargetEncoder

# Assumes X (DataFrame), y (Series aligned with X, unique index) and
# cat_cols (list of categorical column names) are already defined.
stat_names = ['std', 'min', 'max']
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for train_idx, val_idx in outer_cv.split(X, y):
    # Copy so that new feature columns are not written into slices of X
    X_tr, X_val = X.iloc[train_idx].copy(), X.iloc[val_idx].copy()
    y_tr = y.iloc[train_idx]

    # Mean TE via sklearn (handles inner CV internally)
    te = TargetEncoder(cv=5, smooth="auto")
    X_tr_te = te.fit_transform(X_tr[cat_cols], y_tr)
    X_val_te = te.transform(X_val[cat_cols])

    # Std/Min/Max TE via inner KFold
    for col in cat_cols:
        new_cols = [f'{col}_{s}' for s in stat_names]
        oof_stats = pd.DataFrame(index=X_tr.index, columns=stat_names, dtype=float)
        for in_tr, in_val in inner_cv.split(X_tr, y_tr):
            # Aggregate the target per category on the held-in rows only
            stats = y_tr.iloc[in_tr].groupby(X_tr[col].iloc[in_tr]).agg(stat_names)
            mapped = X_tr.iloc[in_val][[col]].join(stats, on=col)
            oof_stats.loc[mapped.index] = mapped[stat_names].values
        X_tr[new_cols] = oof_stats.values

        # Val/test: use full outer-fold training stats
        full_stats = y_tr.groupby(X_tr[col]).agg(stat_names)
        X_val[new_cols] = X_val[[col]].join(full_stats, on=col)[stat_names].values
- StratifiedKFold(n_splits=5) for model evaluation
- StratifiedKFold(n_splits=5) on the outer training set
- groupby(cat_col)[target].agg(['std','min','max']) on held-in, apply to held-out
- sklearn.TargetEncoder(cv=5) for mean TE (it does its own internal CV)
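The steps above can be sketched as a self-contained helper run on synthetic data. The function name `inner_kfold_stats`, the `city` column, and the random data are hypothetical. The fallback to held-in global statistics for missing values is also an assumption of this sketch, since the skill does not say how it handles categories that are unseen in the held-in rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

STAT_NAMES = ["std", "min", "max"]

def inner_kfold_stats(X_tr, y_tr, col, n_splits=5, seed=42):
    """Out-of-fold std/min/max of y_tr per category of X_tr[col] (hypothetical helper)."""
    inner_cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    oof = pd.DataFrame(index=X_tr.index, columns=STAT_NAMES, dtype=float)
    for in_tr, in_val in inner_cv.split(X_tr, y_tr):
        held_in_y = y_tr.iloc[in_tr]
        # Category-level aggregates computed on held-in rows only.
        stats = held_in_y.groupby(X_tr[col].iloc[in_tr]).agg(STAT_NAMES)
        mapped = X_tr[[col]].iloc[in_val].join(stats, on=col)[STAT_NAMES]
        # Assumption: unseen categories (and undefined std for single-row
        # categories) fall back to the held-in global statistics.
        fallback = held_in_y.agg(STAT_NAMES)
        oof.iloc[in_val] = mapped.fillna(fallback).values
    return oof.add_prefix(f"{col}_")

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = pd.DataFrame({"city": rng.choice(list("ABCDEFGH"), size=400)})
y = pd.Series(rng.integers(0, 2, size=400), name="target")

# One outer fold is enough to show the shape of the result.
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
train_idx, val_idx = next(outer_cv.split(X, y))
X_tr, X_val = X.iloc[train_idx], X.iloc[val_idx]
y_tr = y.iloc[train_idx]

# Training rows get out-of-fold statistics from the inner CV.
tr_feats = inner_kfold_stats(X_tr, y_tr, "city")

# Validation rows get statistics fitted on the whole outer training fold.
full_stats = y_tr.groupby(X_tr["city"]).agg(STAT_NAMES)
val_feats = X_val[["city"]].join(full_stats, on="city")[STAT_NAMES].add_prefix("city_")

print(tr_feats.head())
print(val_feats.head())
```

Returning a separate feature frame rather than mutating `X_tr` keeps the helper free of side effects; the caller can join `tr_feats` and `val_feats` back onto the fold data with `pd.concat(..., axis=1)`.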