skills/tabular/synthetic-sample-detection/SKILL.md
Detects synthetic (fake) test samples by checking whether each row has at least one value that is unique within its feature column. Real samples almost always do; synthetic ones do not.
npx skillsauth add wenmin-wu/ds-skills tabular-synthetic-sample-detection
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Some competitions inject synthetic rows into the test set to prevent leaderboard probing or to add noise to the leaderboard. A reliable signal: real rows almost always have at least one feature whose value appears only once in that column of the test set, while synthetic rows (generated by sampling each feature from values that already exist) have no unique values at all. Flag rows with zero unique values as synthetic and exclude them from frequency-based feature engineering.
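import numpy as np

# test_df is assumed to be the already-loaded test DataFrame.
df_test = test_df.drop("ID_code", axis=1).values
unique_count = np.zeros_like(df_test)
for col in range(df_test.shape[1]):
    # idx holds the first-occurrence row of each distinct value in this column.
    _, idx, counts = np.unique(df_test[:, col], return_index=True, return_counts=True)
    # Mark rows whose value occurs exactly once in the column.
    unique_count[idx[counts == 1], col] += 1
# A row is treated as real if it has a unique value in at least one column.
has_unique = np.sum(unique_count, axis=1) > 0
real_idx = np.argwhere(has_unique)[:, 0]
fake_idx = np.argwhere(~has_unique)[:, 0]
print(f"Real: {len(real_idx)}, Synthetic: {len(fake_idx)}")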
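The last step in the description, keeping synthetic rows out of frequency-based features, could look roughly like the sketch below. It is an illustration under assumptions rather than part of the original skill: the helper name `frequency_encode`, the `_count` column suffix, and the toy column `var_0` are all hypothetical, and `real_mask` is assumed to be a boolean array aligned with the test rows (such as `has_unique` from the snippet above).

```python
import numpy as np
import pandas as pd


def frequency_encode(train_df, test_df, real_mask, feature_cols):
    """Add <col>_count features computed from train rows plus real test rows.

    Hypothetical helper: synthetic test rows still receive a count feature,
    but they do not contribute to the counts themselves.
    """
    # Reference pool: every train row, and only the test rows flagged as real.
    pool = pd.concat(
        [train_df[feature_cols], test_df.loc[real_mask, feature_cols]],
        ignore_index=True,
    )
    train_out, test_out = train_df.copy(), test_df.copy()
    for col in feature_cols:
        counts = pool[col].value_counts()
        # Values absent from the pool get a count of 0.
        train_out[f"{col}_count"] = train_out[col].map(counts).fillna(0).astype(int)
        test_out[f"{col}_count"] = test_out[col].map(counts).fillna(0).astype(int)
    return train_out, test_out


# Toy data: the third test row is treated as synthetic and excluded from the pool.
toy_train = pd.DataFrame({"var_0": [1.0, 2.0, 2.0]})
toy_test = pd.DataFrame({"var_0": [2.0, 3.0, 2.0]})
toy_real_mask = np.array([True, True, False])

toy_train_fe, toy_test_fe = frequency_encode(toy_train, toy_test, toy_real_mask, ["var_0"])
print(toy_test_fe)
```

In the toy data the value 2.0 is counted three times (twice in train, once in the real test row); the synthetic copy in the last test row is ignored when counting but still gets the feature.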