skills/tabular/gmm-feature-augmentation/SKILL.md
Fits a Gaussian Mixture Model on the joint feature-target space and samples synthetic data pairs to augment small tabular datasets.
npx skillsauth add wenmin-wu/ds-skills tabular-gmm-feature-augmentation
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Small tabular datasets (<1000 rows) often lead to overfitting: there isn't enough data for tree models or neural nets to generalize. GMM augmentation fits a Gaussian Mixture Model on the joint space of features and targets, then samples synthetic (X, y) pairs from the fitted distribution. Unlike SMOTE, which only interpolates between neighboring samples, a GMM models the joint density as a mixture of Gaussians, so it can generate samples around several modes rather than only along line segments between existing points. This is especially useful for regression tasks, where standard SMOTE, designed for class imbalance in classification, does not directly apply.
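As a sketch of what "joint space" means here, assume a feature vector x with d columns and a scalar target y stacked into one vector z. The symbols below are notation introduced for illustration and do not come from the original skill. The fitted model then has the standard mixture form:

```latex
p(z) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}\!\left(z \mid \mu_k, \Sigma_k\right),
\qquad
z = \begin{pmatrix} x \\ y \end{pmatrix} \in \mathbb{R}^{d+1},
\qquad
\sum_{k=1}^{K} \pi_k = 1
```

Sampling draws a component k with probability π_k and then z from the k-th Gaussian; K corresponds to `n_components`. With scikit-learn's default full covariance matrices, each Σ_k includes the feature-target covariances, so sampled targets stay correlated with sampled features within a component.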
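import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture


def gmm_augment(X, y, n_samples=1000, n_components=5, random_state=42):
    """Generate synthetic samples from GMM fitted on joint (X, y) space."""
    # Build the joint table; reset the index so features and target align by position
    df = pd.DataFrame(X).reset_index(drop=True)
    df.columns = df.columns.astype(str)
    # np.asarray drops any pandas index on y, avoiding index-based misalignment
    df['_target'] = np.asarray(y)
    gmm = GaussianMixture(n_components=n_components, random_state=random_state)
    gmm.fit(df)
    # sample() returns (samples, component_labels); only the samples are needed
    synthetic, _ = gmm.sample(n_samples)
    synthetic_df = pd.DataFrame(synthetic, columns=df.columns)
    # Append the synthetic rows to the original rows
    augmented = pd.concat([df, synthetic_df], ignore_index=True)
    X_aug = augmented.drop(columns='_target').values
    y_aug = augmented['_target'].values
    return X_aug, y_aug


# Augment the training split only, so synthetic rows never reach validation or test data
X_aug, y_aug = gmm_augment(X_train, y_train, n_samples=2000, n_components=10)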
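The `n_components` value above is chosen by hand. One common way to pick it, which the original skill does not describe, is to fit several candidate models on the joint matrix and keep the one with the lowest BIC. The following is a minimal sketch under that assumption; `select_n_components`, the candidate list, and the toy data are hypothetical names and values introduced for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def select_n_components(Z, candidates=(1, 2, 3, 5, 8), random_state=42):
    """Return (best_k, bic_by_k) for GMMs fitted on the joint matrix Z.

    Z is expected to hold the features and the target side by side,
    i.e. the same joint table that the augmentation function fits on.
    """
    Z = np.asarray(Z, dtype=float)
    bic_by_k = {}
    for k in candidates:
        # GaussianMixture needs at least as many rows as components
        if k > len(Z):
            continue
        gmm = GaussianMixture(n_components=k, random_state=random_state)
        gmm.fit(Z)
        # Lower BIC is better: it trades off fit quality against model size
        bic_by_k[k] = gmm.bic(Z)
    best_k = min(bic_by_k, key=bic_by_k.get)
    return best_k, bic_by_k


# Toy regression data (hypothetical): 300 rows, 3 features, noisy linear target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=300)
Z_demo = np.column_stack([X_demo, y_demo])

best_k, bic_by_k = select_n_components(Z_demo)
print("chosen n_components:", best_k)
```

The chosen value can then be passed as `n_components` to `gmm_augment`. BIC is only a heuristic here; checking a downstream validation score with and without augmentation remains the more direct test.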