skills/tabular/dtype-preset-csv-load/SKILL.md
Predefines minimal unsigned integer dtypes before CSV loading to cut DataFrame memory usage by 2-4x without any data loss, provided every value fits the chosen type.
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
npx skillsauth add wenmin-wu/ds-skills tabular-dtype-preset-csv-load
Pandas defaults to int64/float64 for all numeric columns, wasting 4 to 7 bytes per value when the actual range fits in uint32, uint16, or uint8. By passing a dtype dict to pd.read_csv(), you enforce minimal types at load time, before the full-size DataFrame ever exists in memory. On a 200M-row click log, this can mean the difference between fitting in 16GB RAM or not. Combined with usecols to skip unneeded columns, this is the first line of defense for large datasets.
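The per-value savings can be checked with a little arithmetic. The sketch below assumes a six-column mix matching the dtype dict used in this skill and ignores the index and the click_time column; `column_bytes` is a hypothetical helper, not part of pandas.

```python
import numpy as np

# Bytes per value for the default integer dtype and the narrower unsigned types.
default_bytes = np.dtype("int64").itemsize
for name in ("uint8", "uint16", "uint32"):
    narrow_bytes = np.dtype(name).itemsize
    print(f"{name}: {narrow_bytes} B/value, saves {default_bytes - narrow_bytes} B vs int64")


def column_bytes(n_rows, dtype_names):
    """Raw array size in bytes for n_rows rows (hypothetical helper, ignores index overhead)."""
    return n_rows * sum(np.dtype(d).itemsize for d in dtype_names)


n_rows = 200_000_000  # row count quoted in the paragraph above
# Assumed column mix: one uint32, four uint16, one uint8.
preset = ["uint32", "uint16", "uint16", "uint16", "uint16", "uint8"]

as_int64 = column_bytes(n_rows, ["int64"] * len(preset))
as_preset = column_bytes(n_rows, preset)
print(f"int64: {as_int64 / 1e9:.1f} GB, preset: {as_preset / 1e9:.1f} GB")
# → int64: 9.6 GB, preset: 2.6 GB
```

Under these assumptions the preset types come out roughly 3.7x smaller, which is in line with the 2-4x figure quoted for this skill.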
import pandas as pd

# Inspect column ranges first (on a small sample)
sample = pd.read_csv('train.csv', nrows=100_000)
for col in sample.select_dtypes('number').columns:
    print(f"{col}: {sample[col].min()} - {sample[col].max()}")

# Define minimal dtypes based on observed ranges
dtypes = {
    'ip': 'uint32',            # max ~300k → fits uint32
    'app': 'uint16',           # max ~700 → fits uint16
    'device': 'uint16',        # max ~4000
    'os': 'uint16',            # max ~900
    'channel': 'uint16',       # max ~500
    'is_attributed': 'uint8',  # binary 0/1
}

# Load with preset dtypes (2-4x less memory)
train = pd.read_csv(
    'train.csv',
    dtype=dtypes,
    usecols=list(dtypes.keys()) + ['click_time'],
    parse_dates=['click_time'],
)
print(f"Memory: {train.memory_usage(deep=True).sum() / 1e9:.2f} GB")
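Picking the types by eye works for a handful of columns. For wider files, the selection step could be automated along these lines. This is a minimal sketch, not part of the skill itself: `smallest_uint`, `suggest_dtypes`, and the `headroom` parameter are hypothetical names, and the toy DataFrame stands in for the 100,000-row sample read above.

```python
import numpy as np
import pandas as pd

UNSIGNED = ("uint8", "uint16", "uint32", "uint64")


def smallest_uint(max_value, headroom=1.0):
    """Return the narrowest unsigned dtype name that holds max_value * headroom."""
    target = int(max_value * headroom)
    for name in UNSIGNED:
        if target <= np.iinfo(name).max:
            return name
    raise OverflowError(f"{target} does not fit in an unsigned 64-bit integer")


def suggest_dtypes(sample, headroom=1.0):
    """Map each non-negative integer column of `sample` to a minimal unsigned dtype."""
    suggested = {}
    for col in sample.columns:
        if not pd.api.types.is_integer_dtype(sample[col]):
            continue  # floats, strings, and dates are left for manual review
        if sample[col].min() < 0:
            continue  # negative values cannot be stored in an unsigned type
        suggested[col] = smallest_uint(sample[col].max(), headroom)
    return suggested


# Toy sample standing in for pd.read_csv('train.csv', nrows=100_000)
sample = pd.DataFrame({
    "ip": [5, 300_000, 42],
    "app": [3, 700, 12],
    "is_attributed": [0, 1, 0],
    "delta": [-1, 2, 3],        # negative values, skipped
    "score": [0.1, 0.2, 0.3],   # float column, skipped
})
dtypes = suggest_dtypes(sample)
print(dtypes)
# → {'ip': 'uint32', 'app': 'uint16', 'is_attributed': 'uint8'}
```

A sample of the first rows may not contain the true maximum of a column, so passing a `headroom` above 1.0 (for example 2.0) trades a little memory for a safety margin.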
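- Pass the dtype dict to pd.read_csv() along with usecols.
- Check that each column fits its chosen type: assert df[col].max() < np.iinfo(dtype).max
- Use dtype='category' for 10-50x savings on repetitive string columns.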
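The range check and the category conversion mentioned above can be sketched as follows. `check_fits` is a hypothetical helper and the data is a toy stand-in. The check runs on data loaded with default dtypes, so out-of-range values are still visible. The actual category savings depend on string length, cardinality, and pandas version, so the ratio is printed rather than asserted.

```python
import numpy as np
import pandas as pd


def check_fits(df, dtypes):
    """Raise ValueError if a column's observed range falls outside its target integer dtype."""
    for col, dtype in dtypes.items():
        info = np.iinfo(dtype)
        lo, hi = df[col].min(), df[col].max()
        if lo < info.min or hi > info.max:
            raise ValueError(f"{col}: range [{lo}, {hi}] does not fit {dtype}")


# Toy data loaded with default dtypes (int64)
df = pd.DataFrame({"app": [3, 700, 12], "os": [13, 900, 19]})
check_fits(df, {"app": "uint16", "os": "uint16"})  # passes silently

try:
    check_fits(df, {"app": "uint8"})  # 700 exceeds the uint8 maximum of 255
except ValueError as err:
    print(err)

# Repetitive strings: compare memory before and after converting to category
labels = pd.Series(["android", "ios", "web"] * 100_000)
as_strings = labels.memory_usage(deep=True)
as_category = labels.astype("category").memory_usage(deep=True)
print(f"strings: {as_strings / 1e6:.1f} MB, category: {as_category / 1e6:.1f} MB")
```

When loading from CSV, the same effect comes from adding the string column to the dtype dict with the value 'category', so the full string column is never materialized.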