skills/tabular/yearly-partitioned-groupby/SKILL.md
Split a multi-year table into per-year partitions, run the same groupby aggregation on each, then concat and gc — a pure-pandas map-reduce that survives 100M+ rows on a 16GB kernel
npx skillsauth add wenmin-wu/ds-skills tabular-yearly-partitioned-groupbyInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
A naive df.groupby(['year','month','day','store'])['sales'].agg(['sum','count']) on a 125M-row Favorita train set hangs or OOMs a 16GB kernel because the intermediate hash table dwarfs available memory. Slicing by year first keeps each partition's hash table small, runs aggregation per partition, and pd.concats the results. This is map-reduce implemented in pure pandas — no Dask, no HDF5 streaming — and it's usually the cheapest fix when a single groupby kills your kernel.
import pandas as pd
import gc
def aggregate_partition(df):
day_store = df.groupby(['Year','Month','Day','store_nbr'],
as_index=False)['unit_sales'].agg(['sum','count'])
day_item = df.groupby(['Year','Month','Day','item_nbr'],
as_index=False)['unit_sales'].agg(['sum','count'])
return day_store, day_item
store_parts, item_parts = [], []
for y in sorted(train['Year'].unique()):
ds, di = aggregate_partition(train.loc[train['Year'] == y])
store_parts.append(ds); item_parts.append(di)
gc.collect()
day_store = pd.concat(store_parts); day_item = pd.concat(item_parts)
del store_parts, item_parts; gc.collect()
Year) is a small int so slicing is fast.loc, aggregate, append resultspd.concat the per-partition results along axis 0del intermediate lists and call gc.collect() explicitly — pandas does not release memory eagerlygc.collect(): without it, the freed DataFrames linger and the next iteration adds to peak memory.data-ai
Scaled Pinball Loss (SPL) metric for evaluating quantile forecasts, normalized by mean absolute successive differences of training data
data-ai
Walk backward through a time series and multiplicatively rescale segments when jumps exceed a fraction of the running mean to correct data collection anomalies
testing
Transform forecasting target to next/current ratio minus one so that optimizing MAE or squared error implicitly minimizes SMAPE
tools
Convert point forecasts to prediction intervals by scaling with logit-transformed quantile ratios passed through a Normal CDF