skills/tabular/leak-free-loop-features/SKILL.md
Iterates through rows chronologically to accumulate user statistics, fetching current state before updating to prevent future data leakage.
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
npx skillsauth add wenmin-wu/ds-skills tabular-leak-free-loop-features
For sequential prediction tasks (knowledge tracing, click prediction), compute running user statistics by looping through rows in time order. The key is to fetch the user's current stats BEFORE updating them with the current row's outcome. Each row's features then depend only on that user's earlier rows, so nothing from the current or future rows leaks in.
import numpy as np
from collections import defaultdict

def build_loop_features(df, user_col='user_id', target_col='answered_correctly'):
    """Accumulate user stats with strict temporal ordering."""
    sum_dict = defaultdict(float)
    count_dict = defaultdict(int)
    n = len(df)
    user_mean = np.zeros(n, dtype=np.float32)
    user_count = np.zeros(n, dtype=np.int32)
    for i, (uid, target) in enumerate(zip(df[user_col].values, df[target_col].values)):
        # Fetch BEFORE update: this prevents leakage
        if count_dict[uid] > 0:
            user_mean[i] = sum_dict[uid] / count_dict[uid]
        else:
            user_mean[i] = np.nan  # cold start
        user_count[i] = count_dict[uid]
        # Update AFTER fetch
        sum_dict[uid] += target
        count_dict[uid] += 1
    df['user_mean'] = user_mean
    df['user_count'] = user_count
    return df
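To see why the fetch-before-update order matters, the sketch below runs the same accumulation on a tiny hand-made sequence in both orders. It is a pure-Python illustration, not part of the skill itself: the function name `running_user_mean`, the `fetch_first` flag, and the toy data are all made up for this example.

```python
from collections import defaultdict

def running_user_mean(user_ids, targets, fetch_first=True):
    """Hypothetical helper: per-row running mean of a user's outcomes.

    fetch_first=True  -> read stats, then update (leak-free).
    fetch_first=False -> update, then read (leaky: the row sees its own label).
    Rows are assumed to already be in time order.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    out = []
    for uid, y in zip(user_ids, targets):
        if not fetch_first:
            # Leaky order: the current outcome is folded in before the read.
            sums[uid] += y
            counts[uid] += 1
        # None marks a cold start (no history for this user yet).
        out.append(sums[uid] / counts[uid] if counts[uid] else None)
        if fetch_first:
            # Leak-free order: update only after the feature was read.
            sums[uid] += y
            counts[uid] += 1
    return out

# Toy data (made up): user "a" answers three times, user "b" once.
users = ["a", "a", "a", "b"]
answers = [1, 0, 1, 1]

leak_free = running_user_mean(users, answers, fetch_first=True)
leaky = running_user_mean(users, answers, fetch_first=False)

print(leak_free)  # [None, 1.0, 0.5, None]
print(leaky[0])   # 1.0, identical to the first row's own label
```

In the leaky variant the first row of every user has a "mean" equal to its own label, which a model can exploit during training but will never see at prediction time.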
For more than 10M rows, speed up the loop with `@jit` or write it in C.
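One way the compiled variant might look is sketched below. It assumes the `@jit` mentioned above refers to Numba, which the source does not state, and it replaces the dictionaries with arrays indexed by integer-encoded user ids, since plain array loops are what a JIT compiler handles best. The names `_loop_kernel` and `loop_features_compiled` are hypothetical, and the code falls back to plain Python when Numba is not installed.

```python
import numpy as np

try:
    from numba import njit  # assumption: Numba is the intended JIT
except ImportError:
    def njit(func):
        # Fallback: run the kernel as ordinary Python if Numba is missing.
        return func

@njit
def _loop_kernel(codes, targets, n_users):
    # Per-user accumulators, indexed by integer user code.
    sums = np.zeros(n_users, dtype=np.float64)
    counts = np.zeros(n_users, dtype=np.int64)
    n = codes.shape[0]
    user_mean = np.empty(n, dtype=np.float32)
    user_count = np.empty(n, dtype=np.int32)
    for i in range(n):
        u = codes[i]
        # Fetch BEFORE update, same order as the dictionary version.
        if counts[u] > 0:
            user_mean[i] = sums[u] / counts[u]
        else:
            user_mean[i] = np.nan  # cold start
        user_count[i] = counts[u]
        # Update AFTER fetch.
        sums[u] += targets[i]
        counts[u] += 1
    return user_mean, user_count

def loop_features_compiled(user_ids, targets):
    """Hypothetical wrapper; inputs must already be in time order."""
    # Map arbitrary user ids to dense integer codes 0..n_users-1.
    uniques, codes = np.unique(np.asarray(user_ids), return_inverse=True)
    return _loop_kernel(
        codes.astype(np.int64),
        np.asarray(targets, dtype=np.float64),
        uniques.shape[0],
    )

# Toy usage (made-up data).
mean, count = loop_features_compiled(["a", "b", "a", "a", "b"], [1, 0, 0, 1, 1])
print(count.tolist())  # [0, 0, 1, 2, 1]
```

The returned arrays can be assigned to `user_mean` and `user_count` columns just as in the dictionary version. Whether the speedup is worth the extra dependency depends on the dataset; no timing was measured for this sketch.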