skills/llm/test-time-train-pseudo-label-expansion/SKILL.md
Convert a test.csv with paired positive/negative example columns into a labeled training set at inference time, using the OTHER example as the in-prompt demonstration so the model never sees its own target as a few-shot exemplar
Some Kaggle test sets ship with positive_example_1, positive_example_2, negative_example_1, and negative_example_2 columns alongside the row to score. These are gold labels, so they allow test-time training (TTT) without external data. The trap is leakage: if you flatten naively and use each example as both the target and its own in-prompt demo, the model just copies the demo. The fix is to always use the other example of the same polarity as the demo, pairing index i with 3 - i (1 pairs with 2, and 2 pairs with 1). The result is a clean labeled set of 4 * len(test) rows, each with a few-shot prompt containing one positive and one negative demo, neither of which is the target.
import pandas as pd

test = pd.read_csv('test.csv')  # must contain the four example columns

rows = []
for _, r in test.iterrows():
    for i in [1, 2]:
        # Positive target: demo with the OTHER positive and the paired negative
        rows.append({
            'body': r[f'positive_example_{i}'],
            'rule': r['rule'],
            'subreddit': r['subreddit'],
            'pos_demo': r[f'positive_example_{3 - i}'],  # the OTHER positive
            'neg_demo': r[f'negative_example_{3 - i}'],  # paired negative
            'label': 1,
        })
        # Negative target: demo with the OTHER negative and the paired positive
        rows.append({
            'body': r[f'negative_example_{i}'],
            'rule': r['rule'],
            'subreddit': r['subreddit'],
            'pos_demo': r[f'positive_example_{3 - i}'],  # paired positive
            'neg_demo': r[f'negative_example_{3 - i}'],  # the OTHER negative
            'label': 0,
        })

ttt_df = pd.DataFrame(rows)  # 4 rows per test row, fully labeled
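Because the whole point of the 3 - i pairing is to avoid leakage, it can be worth asserting that property directly. The following is a minimal sketch, not part of the original skill: the helper names `expand_test_rows` and `check_no_leak` and the toy data are hypothetical, and the expansion is repeated only so the example runs on its own.

```python
import pandas as pd

def expand_test_rows(test: pd.DataFrame) -> pd.DataFrame:
    """Emit 4 labeled rows per test row, using the 3 - i pairing for demos."""
    rows = []
    for _, r in test.iterrows():
        for polarity, label in (("positive", 1), ("negative", 0)):
            for i in (1, 2):
                rows.append({
                    "body": r[f"{polarity}_example_{i}"],
                    "rule": r["rule"],
                    "subreddit": r["subreddit"],
                    "pos_demo": r[f"positive_example_{3 - i}"],
                    "neg_demo": r[f"negative_example_{3 - i}"],
                    "label": label,
                })
    return pd.DataFrame(rows)

def check_no_leak(ttt: pd.DataFrame, n_test: int) -> None:
    """Fail loudly if the size is off or a target equals its same-polarity demo."""
    assert len(ttt) == 4 * n_test, "expected 4 rows per test row"
    # Pick pos_demo for positive targets and neg_demo for negative targets.
    same_polarity_demo = ttt["pos_demo"].where(ttt["label"] == 1, ttt["neg_demo"])
    leaked = ttt["body"] == same_polarity_demo
    assert not leaked.any(), f"{int(leaked.sum())} rows use the target as its own demo"

# Toy data, invented purely for illustration.
toy_test = pd.DataFrame([{
    "rule": "No advertising",
    "subreddit": "r/example",
    "positive_example_1": "Buy my course now",
    "positive_example_2": "Cheap followers here",
    "negative_example_1": "What time is the meetup?",
    "negative_example_2": "Thanks for the help",
}])

toy_ttt = expand_test_rows(toy_test)
check_no_leak(toy_ttt, n_test=len(toy_test))
```

Note that the check also fires when a test row ships two identical examples of the same polarity, since in that case even the "other" example is the target text.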
1. For each test row, take positive_example_{1,2} and negative_example_{1,2} and emit 4 labeled rows.
2. Pair each target with the other example index (3 - i) so the target is never in its own prompt.
3. When scoring the actual test row, use positive_example_1 + negative_example_1 as the demos (either index works since the target row is unlabeled).

- 3 - i pairing, not random: random demo selection mixes polarities and adds variance; deterministic pairing is reproducible and leak-free.
- Do not deduplicate: (body, rule) may legitimately repeat with different surrounding context; dedup hurts.
- Keep subreddit and rule in the prompt: the labels are conditional on the rule; stripping it collapses distinct decision boundaries.