skills/tabular/declarative-groupby-aggregation/SKILL.md
Config-driven feature factory that generates groupby aggregation features from a declarative spec list, supporting count, mean, var, nunique, cumcount, and custom lambdas.
npx skillsauth add wenmin-wu/ds-skills tabular-declarative-groupby-aggregationInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Feature engineering in tabular competitions often requires dozens of groupby aggregations (count by IP, nunique apps per device, variance of channel per OS, etc.). Instead of writing each one manually, define a spec list of {groupby, select, agg} dictionaries and loop through them. This pattern is self-documenting, easy to extend, and reduces copy-paste bugs. Used extensively in fraud detection, recommendation, and any multi-entity dataset.
import pandas as pd
import numpy as np
import gc
GROUPBY_AGGREGATIONS = [
{'groupby': ['ip'], 'select': 'app', 'agg': 'nunique'},
{'groupby': ['ip', 'app'], 'select': 'channel', 'agg': 'count'},
{'groupby': ['ip', 'device', 'os'], 'select': 'app', 'agg': 'cumcount'},
{'groupby': ['ip'], 'select': 'channel', 'agg': 'var'},
{'groupby': ['ip', 'app', 'os'], 'select': 'hour', 'agg': 'mean'},
{'groupby': ['ip'], 'select': 'app', 'agg': lambda x: x.value_counts().head(1).index[0],
'agg_name': 'mode'},
]
for spec in GROUPBY_AGGREGATIONS:
agg_name = spec.get('agg_name', spec['agg'].__name__
if callable(spec['agg']) else spec['agg'])
new_feature = '{}_{}_{}'.format('_'.join(spec['groupby']),
agg_name, spec['select'])
cols = list(set(spec['groupby'] + [spec['select']]))
gp = df[cols].groupby(spec['groupby'])[spec['select']] \
.agg(spec['agg']).reset_index() \
.rename(columns={spec['select']: new_feature})
if spec['agg'] == 'cumcount':
df[new_feature] = gp[0].values
else:
df = df.merge(gp, on=spec['groupby'], how='left')
del gp; gc.collect()
groupby keys, select column, agg functionagg_name key for lambdas since __name__ defaults to <lambda>data-ai
Scaled Pinball Loss (SPL) metric for evaluating quantile forecasts, normalized by mean absolute successive differences of training data
data-ai
Walk backward through a time series and multiplicatively rescale segments when jumps exceed a fraction of the running mean to correct data collection anomalies
testing
Transform forecasting target to next/current ratio minus one so that optimizing MAE or squared error implicitly minimizes SMAPE
tools
Convert point forecasts to prediction intervals by scaling with logit-transformed quantile ratios passed through a Normal CDF