skills/tabular/two-level-hierarchical-aggregation/SKILL.md
Aggregates deeply nested relational tables through two groupby levels (child → intermediate → parent) to build features from multi-hop relationships.
npx skillsauth add wenmin-wu/ds-skills tabular-two-level-hierarchical-aggregationInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
In multi-table competitions (Home Credit, Amex, Elo), auxiliary data is often nested: a client has many loans, each loan has many monthly records. Simple one-level aggregation (monthly records → client) loses the loan-level structure. Two-level aggregation first summarizes at the intermediate level (monthly → loan), then re-aggregates at the parent level (loan → client). This preserves distributional information across the hierarchy — e.g., the client's average loan balance volatility, not just their overall average balance.
import pandas as pd
import numpy as np
def agg_numeric(df, group_var, prefix):
agg = df.groupby(group_var).agg(['count', 'mean', 'max', 'min', 'sum'])
columns = [f'{prefix}_{col}_{stat}' for col, stat in agg.columns]
agg.columns = columns
return agg
def agg_categorical(df, group_var, prefix):
cat = pd.get_dummies(df.select_dtypes('object'))
cat[group_var] = df[group_var]
agg = cat.groupby(group_var).agg(['sum', 'mean'])
columns = [f'{prefix}_{col}_{stat}' for col, stat in agg.columns]
agg.columns = columns
return agg
# Level 1: monthly records → loan
loan_agg = agg_numeric(monthly, group_var='LOAN_ID', prefix='monthly')
# Join back loan metadata
loan_features = loans[['LOAN_ID', 'CLIENT_ID']].merge(loan_agg, on='LOAN_ID')
# Level 2: loan → client
client_features = agg_numeric(
loan_features.drop(columns=['LOAN_ID']),
group_var='CLIENT_ID', prefix='loan'
)
# Merge into main table
train = train.merge(client_features, on='CLIENT_ID', how='left')
data-ai
Scaled Pinball Loss (SPL) metric for evaluating quantile forecasts, normalized by mean absolute successive differences of training data
data-ai
Walk backward through a time series and multiplicatively rescale segments when jumps exceed a fraction of the running mean to correct data collection anomalies
testing
Transform forecasting target to next/current ratio minus one so that optimizing MAE or squared error implicitly minimizes SMAPE
tools
Convert point forecasts to prediction intervals by scaling with logit-transformed quantile ratios passed through a Normal CDF