skills/cv/deterministic-hash-partitioning/SKILL.md
Partition a large dataset into N balanced shards using integer key modulo arithmetic for reproducible, class-interleaved splits across CSV files
npx skillsauth add wenmin-wu/ds-skills cv-deterministic-hash-partitioning
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
When a dataset is too large to fit in memory (e.g., 50M doodles across 340 classes), deterministic hash partitioning splits it into N shards by taking an integer key modulo N, for example key_id % N. Because the shard is a pure function of the key, the split is reproducible without storing an assignment table, each shard receives a roughly balanced mix of all classes, and shards can be processed independently. Combined with streaming append (one class file at a time), this builds the sharded files without ever loading the full dataset.
import pandas as pd
from tqdm import tqdm

N_SHARDS = 100
categories = [...]  # list of 340 class names

# Stream one class file at a time so the full dataset is never in memory.
for class_idx, category in enumerate(tqdm(categories)):
    df = pd.read_csv(f"train_{category}.csv", nrows=30000)
    df["label"] = class_idx
    # Deterministic shard index derived from the integer key.
    df["shard"] = (df["key_id"] // 10**7) % N_SHARDS

    for k in range(N_SHARDS):
        chunk = df[df["shard"] == k].drop(["key_id", "shard"], axis=1)
        # The first class creates each shard file with a header; later classes append.
        mode = "w" if class_idx == 0 else "a"
        header = class_idx == 0
        chunk.to_csv(f"train_shard_{k}.csv.gz",
                     mode=mode, header=header, index=False,
                     compression="gzip")
Shard key: key_id // 10**7 % N_SHARDS. The integer division by 10**7 drops the low-order digits of key_id before the modulo, to avoid sequential correlation.
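The shard assignment can be checked in isolation before running it over the real files. The following is a minimal sketch, not part of the original skill: the helper name `shard_of`, its `scale` parameter, and the randomly generated keys are all assumptions made for illustration. It shows that the assignment depends only on the key value (not on row order) and prints the smallest and largest shard sizes as a quick balance check.

```python
import numpy as np

def shard_of(key_id, n_shards, scale=10**7):
    """Map integer keys to shard indices in [0, n_shards).

    Hypothetical helper mirroring (key_id // 10**7) % N_SHARDS.
    """
    return (np.asarray(key_id, dtype=np.int64) // scale) % n_shards

# Synthetic keys: an assumption standing in for real key_id values.
rng = np.random.default_rng(0)
keys = rng.integers(0, 10**16, size=100_000, dtype=np.int64)

shards_a = shard_of(keys, 100)
# Same keys presented in reverse order, then flipped back for comparison.
shards_b = shard_of(keys[::-1], 100)[::-1]

# The assignment is a pure function of the key, so row order does not matter.
assert np.array_equal(shards_a, shards_b)

# Rows per shard; a narrow min/max range suggests the shards are balanced.
counts = np.bincount(shards_a, minlength=100)
print(counts.min(), counts.max())
```

How balanced the shards are depends on how the real keys are distributed, so it is worth running the same count on a sample of actual key_id values.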