skills/tabular/outlier-rate-label-encoding/SKILL.md
Encode a categorical column by replacing each category with the per-category outlier rate (mean of a binary outlier flag), out-of-fold to avoid leakage — a target-aware encoding tuned to long-tail / sentinel-target problems where a binary classifier signal is more useful than the raw regression mean
npx skillsauth add wenmin-wu/ds-skills tabular-outlier-rate-label-encodingInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Standard target encoding replaces a category with mean(y | category). When the target has a sentinel-tail problem (Elo's -33.22, churned/non-churned), the regression mean is dominated by the sentinel and washes out subtle category effects. Replace it with mean(is_outlier | category) — a per-category outlier rate. This is a much more discriminating encoding for the classifier half of an outlier-aware blending pipeline, and remains useful as a feature in the regressor too. As with any target encoding, do it out-of-fold to avoid leakage, and add Bayesian smoothing for low-frequency categories.
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
OUT_VAL = -33.21928
train['is_out'] = (train.target == OUT_VAL).astype(int)
def oof_rate_encode(df_tr, df_te, col, target='is_out', k=20):
prior = df_tr[target].mean()
enc = np.zeros(len(df_tr))
kf = KFold(5, shuffle=True, random_state=42)
for tr_idx, va_idx in kf.split(df_tr):
agg = df_tr.iloc[tr_idx].groupby(col)[target].agg(['mean', 'size'])
smooth = (agg['mean'] * agg['size'] + prior * k) / (agg['size'] + k)
enc[va_idx] = df_tr.iloc[va_idx][col].map(smooth).fillna(prior).values
full_agg = df_tr.groupby(col)[target].agg(['mean', 'size'])
smooth_full = (full_agg['mean'] * full_agg['size'] + prior * k) / (full_agg['size'] + k)
te_enc = df_te[col].map(smooth_full).fillna(prior).values
return enc, te_enc
train['cat3_outrate'], test['cat3_outrate'] = oof_rate_encode(train, test, 'category_3')
y == sentinel, y > threshold, or any indicator)(mean·n + prior·k) / (n + k) with k between 10 and 100k is target-dependent: start at 20, raise it if the encoding overfits low-frequency categories.outrate and freq capture orthogonal axes (signal vs. support).data-ai
Scaled Pinball Loss (SPL) metric for evaluating quantile forecasts, normalized by mean absolute successive differences of training data
data-ai
Walk backward through a time series and multiplicatively rescale segments when jumps exceed a fraction of the running mean to correct data collection anomalies
testing
Transform forecasting target to next/current ratio minus one so that optimizing MAE or squared error implicitly minimizes SMAPE
tools
Convert point forecasts to prediction intervals by scaling with logit-transformed quantile ratios passed through a Normal CDF