bundled/skills/scientific-data-preprocessing/SKILL.md
⚠️ CRITICAL USER EXPERIENCE-BASED SKILL - ALWAYS CONSULT BEFORE DATA PREPROCESSING ⚠️ Prevents catastrophic errors (88.9% error rate in V1.0 case study) through multi-level feature analysis, data leakage detection, and semantic validation. MANDATORY for: data preprocessing, feature engineering, standardization, normalization, interpolation, missing value handling, feature selection, or ANY data transformation task. Covers grouped time-series, cross-sectional, panel data. Detects: time travel leakage, causal inversion, ID misuse, semantic-numeric fallacies, distribution blindness. User's hard-won lessons from real project failures.
npx skillsauth add foryourhealth111-pixel/vco-skills-codex scientific-data-preprocessing
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
⚠️ CRITICAL: USER'S HARD-WON EXPERIENCE - MANDATORY CONSULTATION ⚠️
This skill encapsulates painful lessons learned from real preprocessing disasters (88.9% error rate documented). ALWAYS use this skill for planning, reflection, and validation when ANY data preprocessing is involved.
Why this skill is mandatory:
When to invoke (DO NOT SKIP):
Prevent catastrophic preprocessing errors in grouped time-series data by applying multi-level feature analysis and respecting data structure boundaries.
MANDATORY consultation - trigger immediately when:
Trigger keywords that MUST invoke this skill:
This skill does NOT:
Required inputs before proceeding:
Level 1: Data Type
# Check data types
df.dtypes # int64, float64, object, etc.
Level 2: Feature Type Classification
# Binary (0/1)
binary_features = [col for col in df.columns if df[col].nunique() == 2]
# Categorical (finite discrete values)
categorical_features = [col for col in df.select_dtypes(include='object').columns]
# Continuous (infinite possible values)
continuous_features = [
    col for col in df.select_dtypes(include=['float64', 'int64']).columns
    if df[col].nunique() > 10
]
Level 3: Data Structure
# Check for grouping
print(f"Number of groups: {df['group_id'].nunique()}")
print(f"Avg points per group: {df.groupby('group_id').size().mean():.1f}")
# Check for time-series
df_sorted = df.sort_values(['group_id', 'timestamp'])
Level 4: Physical Meaning
# Validate physical ranges
assert df['speed_mph'].max() < 200, "Speed exceeds physical limit"
assert df['distance_meters'].min() >= 0, "Negative distance impossible"
# Decision: Within-group or global processing?
def choose_processing_scope(data, feature, goal):
    """
    goal = 'relative' → within-group (e.g., "this point was intense FOR THIS MATCH")
    goal = 'absolute' → global (e.g., "this was an intense point OVERALL")
    """
    if goal == 'relative':
        return 'within_group'
    elif goal == 'absolute':
        return 'global'
    else:
        raise ValueError("Goal must be 'relative' or 'absolute'")
from scipy.interpolate import CubicSpline
import numpy as np
# ✅ CORRECT: Interpolate within each group
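for group_id in df['match_id'].unique():
    mask = df['match_id'] == group_id
    group_data = df.loc[mask, 'speed_mph'].copy()
    # Positions (within this group) of valid and missing values
    valid_idx = group_data.notna()
    valid_positions = np.where(valid_idx)[0]
    valid_values = group_data[valid_idx].values
    missing_positions = np.where(~valid_idx)[0]
    # Groups with fewer than 4 valid points are left untouched
    if len(valid_positions) >= 4 and len(missing_positions) > 0:
        cs = CubicSpline(valid_positions, valid_values)
        # Assign by index label so only this group's missing rows are written.
        # Note: CubicSpline extrapolates by default for gaps at the group edges.
        missing_labels = group_data.index[~valid_idx.to_numpy()]
        df.loc[missing_labels, 'speed_mph'] = cs(missing_positions)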
# ❌ WRONG: Cross-group interpolation
# This interpolates between match A's last point and match B's first point!
cs = CubicSpline(
    np.where(df['speed_mph'].notna())[0],  # ❌ All indices globally
    df['speed_mph'].dropna().values
)
df.loc[df['speed_mph'].isna(), 'speed_mph'] = cs(
    np.where(df['speed_mph'].isna())[0]
)
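The contrast between the two approaches can be seen on a toy frame. The sketch below is illustrative only: the data is made up, and it swaps the cubic spline for pandas' built-in linear interpolation through `groupby(...).transform(...)`. The `limit_area="inside"` setting is an assumption about the desired behavior: gaps at the edges of a match stay missing instead of being filled.

```python
import numpy as np
import pandas as pd

# Made-up data: match A ends with a gap, match B starts with one
df = pd.DataFrame({
    "match_id": ["A", "A", "A", "A", "B", "B", "B"],
    "speed_mph": [100.0, np.nan, 120.0, np.nan, np.nan, 70.0, 80.0],
})

# Within-group: each match is interpolated on its own, so nothing
# is borrowed from a neighboring match; edge gaps remain NaN
df["speed_within"] = df.groupby("match_id")["speed_mph"].transform(
    lambda s: s.interpolate(method="linear", limit_area="inside")
)

# Global (the mistake): the gap spanning the A/B boundary is filled
# with values blended from two unrelated matches
df["speed_global"] = df["speed_mph"].interpolate(method="linear")

print(df)
```

Rows 3 and 4 sit on the match boundary: the within-group column leaves them missing, while the global column fills them with values between match A's 120 and match B's 70.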
from sklearn.preprocessing import StandardScaler
# ✅ CORRECT: Standardize within each match
for match_id in df['match_id'].unique():
    mask = df['match_id'] == match_id
    scaler = StandardScaler()
    # fit_transform takes a 2-D input and returns a 2-D array; flatten it for assignment
    df.loc[mask, 'distance_run_std_within'] = scaler.fit_transform(
        df.loc[mask, ['distance_run']]
    ).ravel()
# Interpretation: z=+2 means "2 std above average FOR THIS MATCH"
# ✅ CORRECT: Global standardization (when appropriate)
scaler = StandardScaler()
df['distance_run_std_global'] = scaler.fit_transform(df[['distance_run']])
# Interpretation: z=+2 means "2 std above average ACROSS ALL MATCHES"
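For larger datasets, the per-match loop can be replaced by a vectorized equivalent. This is a minimal sketch on made-up data, not a drop-in replacement: it uses the population standard deviation (`ddof=0`) so that the result matches what `StandardScaler` computes, and it assumes no match has constant values (a zero standard deviation would produce NaN or inf).

```python
import numpy as np
import pandas as pd

# Made-up data: match B runs on a scale ten times larger than match A
df = pd.DataFrame({
    "match_id": ["A"] * 3 + ["B"] * 3,
    "distance_run": [10.0, 20.0, 30.0, 100.0, 200.0, 300.0],
})

grouped = df.groupby("match_id")["distance_run"]
group_mean = grouped.transform("mean")
# ddof=0 (population std) mirrors StandardScaler; pandas defaults to ddof=1
group_std = grouped.transform(lambda s: s.std(ddof=0))

df["distance_run_std_within"] = (df["distance_run"] - group_mean) / group_std
print(df)
```

Both matches end up with the same z-scores, because within-group standardization removes the difference in scale between matches.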
# Binary variables (0/1) - KEEP AS-IS
binary_cols = ['is_ace', 'is_winner', 'is_error']
# ❌ NEVER standardize these! They have semantic meaning as 0/1
# Categorical variables - ONE-HOT ENCODE
df_encoded = pd.get_dummies(df, columns=['server', 'serve_number'], dtype=int)
# Continuous variables - STANDARDIZE (within-group or global)
continuous_cols = ['distance_run', 'rally_count', 'speed_mph']
# ✅ Apply pattern 3 or 4 based on goal
# ✅ CORRECT: Sliding window for momentum analysis
window = 10
df['win_rate_last10'] = df.groupby('match_id')['point_won'].transform(
    lambda x: x.rolling(window, min_periods=1).mean()
)
# ❌ WRONG: Cumulative features (loses temporal locality)
df['cumulative_points_won'] = df.groupby('match_id')['point_won'].cumsum()
# This just increases monotonically and correlates with point_number
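One caveat when a window feature is used to predict `point_won` itself: a rolling mean that includes the current row already contains the outcome being predicted. The sketch below, on made-up data with a hypothetical column name `win_rate_prev`, shifts by one point before rolling so that each row only sees earlier points from the same match.

```python
import pandas as pd

# Made-up data: two short matches
df = pd.DataFrame({
    "match_id": ["A"] * 4 + ["B"] * 3,
    "point_won": [1, 0, 1, 1, 0, 0, 1],
})

window = 3
# shift(1) drops the current point from its own window; because the
# shift happens inside the group, match B never sees match A's points
df["win_rate_prev"] = df.groupby("match_id")["point_won"].transform(
    lambda s: s.shift(1).rolling(window, min_periods=1).mean()
)
print(df)
```

The first point of each match has no history, so its value is NaN; whether to fill it with a prior or leave it missing depends on the model.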
def validate_data_quality(df, feature, expected_range):
    """Validate a feature's range, spread, and missingness before processing"""
    # Check range
    assert df[feature].min() >= expected_range[0], f"{feature} below minimum"
    assert df[feature].max() <= expected_range[1], f"{feature} above maximum"
    # Check for anomalies (heuristic; only meaningful for non-negative features)
    mean = df[feature].mean()
    std = df[feature].std()
    if std > mean:
        print(f"⚠️ WARNING: {feature} has std > mean (highly skewed or errors)")
    # Check missing pattern: fraction of missing values within each group
    missing_frac_by_group = df.groupby('match_id')[feature].apply(lambda x: x.isna().mean())
    if missing_frac_by_group.max() > 0.5:
        print(f"⚠️ WARNING: {feature} has >50% missing in some groups")
# Example
validate_data_quality(df, 'speed_mph', expected_range=(50, 165))
def detect_processing_scope(df, group_col, feature_col):
    """
    Recommend within-group vs global based on variance structure
    """
    # Calculate variance components
    within_group_var = df.groupby(group_col)[feature_col].var().mean()
    global_var = df[feature_col].var()
    # Intraclass correlation (rough estimate: share of variance between groups)
    between_group_var = global_var - within_group_var
    icc = between_group_var / global_var
    if icc > 0.5:
        return 'within_group', f"High between-group variance (ICC={icc:.2f})"
    else:
        return 'global', f"Low between-group variance (ICC={icc:.2f})"
scope, reason = detect_processing_scope(df, 'match_id', 'distance_run')
print(f"Recommended: {scope} - {reason}")
def detect_data_leakage(df, target_col, feature_cols, id_cols):
    """
    Critical checks for data leakage and AI common pitfalls
    """
    issues = []
    # 1. ID Leakage: High cardinality variables as features
    for col in feature_cols:
        if col in id_cols:
            issues.append(f"❌ FATAL: {col} is an ID - NEVER use as feature")
            continue
        # Check if looks like ID (>50% unique). Float columns are skipped:
        # continuous measurements are naturally near-unique.
        uniqueness = df[col].nunique() / len(df)
        if df[col].dtype.kind != 'f' and uniqueness > 0.5:
            issues.append(f"⚠️ {col}: {uniqueness*100:.1f}% unique - possible ID leakage")
    # 2. Causal Inversion: Near-perfect correlation with target
    for col in feature_cols:
        if col == target_col:
            continue
        if df[col].dtype in ['int64', 'float64']:
            corr = abs(df[[col, target_col]].corr().iloc[0, 1])
            if corr > 0.95:
                issues.append(f"❌ FATAL: {col} correlation={corr:.3f} - likely consequence of target!")
    # 3. Meaningless Numeric: Codes treated as numbers
    for col in feature_cols:
        if df[col].dtype in ['int64', 'float64']:
            # Pattern: High values, many uniques, looks like code
            if df[col].min() > 1000 and df[col].nunique() > 100:
                issues.append(f"⚠️ {col}: Looks like code (zipcode/ID) - should be categorical")
    # 4. Time Travel: Check if standardization used global statistics
    # (Requires knowing if train/test split was done first)
    # Print report
    if issues:
        print("="*60)
        print("DATA LEAKAGE AUDIT")
        print("="*60)
        for issue in issues:
            print(issue)
        print("="*60)
    else:
        print("✅ No obvious leakage detected")
    return issues
# Example usage
issues = detect_data_leakage(
    df,
    target_col='point_won',
    feature_cols=['speed_mph', 'user_id', 'distance_run'],
    id_cols=['match_id', 'user_id']
)
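The time-travel check (item 4) cannot be automated from the data alone, but the pattern that avoids it for global scaling can be sketched. The example below uses synthetic data and assumes one reasonable setup: split by `match_id` first with scikit-learn's `GroupShuffleSplit`, then fit the scaler on the training matches only and reuse those statistics for the test matches.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 matches with 20 points each
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "match_id": np.repeat(np.arange(10), 20),
    "speed_mph": rng.normal(110, 15, size=200),
})

# Split by match so that no match contributes rows to both sides
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(splitter.split(df, groups=df["match_id"]))
train, test = df.iloc[train_idx].copy(), df.iloc[test_idx].copy()

# Fit on training rows only; the test rows reuse the training mean and std
scaler = StandardScaler().fit(train[["speed_mph"]])
train["speed_scaled"] = scaler.transform(train[["speed_mph"]]).ravel()
test["speed_scaled"] = scaler.transform(test[["speed_mph"]]).ravel()

print(f"Train matches: {sorted(train['match_id'].unique())}")
print(f"Test matches: {sorted(test['match_id'].unique())}")
```

Calling `fit_transform` on the full frame before splitting would let test-set statistics shape the training features, which is the leakage this item warns about.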
import numpy as np
from scipy.stats import skew, kurtosis
from sklearn.preprocessing import StandardScaler, RobustScaler
def smart_scaler_selection(df, col):
    """
    Choose scaler based on distribution characteristics
    """
    data = df[col].dropna()
    # Check distribution
    # Note: scipy's kurtosis() returns excess kurtosis (0 for a normal distribution)
    skewness = skew(data)
    kurt = kurtosis(data)
    print(f"{col}: skewness={skewness:.2f}, kurtosis={kurt:.2f}")
    if abs(skewness) < 0.5 and abs(kurt) < 3:
        # Roughly normal
        print(" → StandardScaler (data is roughly normal)")
        return StandardScaler(), None
    elif skewness > 1:
        # Right-skewed (long tail)
        print(" → Log transform + StandardScaler (right-skewed)")
        return StandardScaler(), 'log'
    else:
        # Heavy outliers or non-normal
        print(" → RobustScaler (heavy outliers)")
        return RobustScaler(), None
# Example usage
for col in continuous_features:
    scaler, transform = smart_scaler_selection(df, col)
    if transform == 'log':
        # log1p is only defined for values greater than -1
        df[f'{col}_log'] = np.log1p(df[col])
        df[f'{col}_scaled'] = scaler.fit_transform(df[[f'{col}_log']]).ravel()
    else:
        df[f'{col}_scaled'] = scaler.fit_transform(df[[col]]).ravel()
Input:
speed_mph, distance_run, rally_count, is_ace, server
Steps:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
# 1. Load and inspect
df = pd.read_csv('tennis_data.csv')
print(f"Matches: {df['match_id'].nunique()}")
print(f"Features: {df.dtypes}")
# 2. Classify features
binary_features = ['is_ace', 'is_winner', 'is_break_point']
categorical_features = ['server', 'serve_number']
continuous_features = ['distance_run', 'speed_mph', 'rally_count']
# 3. Validate data quality
for feat in continuous_features:
    print(f"\n{feat}:")
    print(df[feat].describe())
    # Check for impossible values
    if feat == 'speed_mph':
        assert df[feat].max() < 170, "Speed exceeds world record!"
# 4. Handle missing values (within-group)
for match_id in df['match_id'].unique():
    mask = df['match_id'] == match_id
    for feat in continuous_features:
        if df.loc[mask, feat].isna().any():
            # Simple linear interpolation within match
            df.loc[mask, feat] = df.loc[mask, feat].interpolate(method='linear')
# 5. One-hot encode categorical
df = pd.get_dummies(df, columns=categorical_features, dtype=int)
# 6. Standardize continuous features WITHIN each match
for feat in continuous_features:
    df[f'{feat}_std'] = np.nan
    for match_id in df['match_id'].unique():
        mask = df['match_id'] == match_id
        scaler = StandardScaler()
        # fit_transform returns a 2-D array; flatten it before assigning to one column
        df.loc[mask, f'{feat}_std'] = scaler.fit_transform(
            df.loc[mask, [feat]]
        ).ravel()
# 7. Create sliding window features
window = 10
df['win_rate_last10'] = df.groupby('match_id')['point_won'].transform(
    lambda x: x.rolling(window, min_periods=1).mean()
)
# 8. KEEP binary features as 0/1 (don't transform!)
# binary_features are already correct
print("\n✅ Preprocessing complete!")
print(f"Final shape: {df.shape}")
print(f"Standardized features: {[f for f in df.columns if f.endswith('_std')]}")
Expected output:
One-hot encoded categorical columns (server_1, server_2)
Continuous features gain _std versions
_std features have mean≈0, std≈1 WITHIN each match
Input:
Steps:
# Check if standardization was done correctly
def check_within_group_standardization(df, group_col, feature_std_col):
    """
    Verify that standardized feature has mean≈0, std≈1 within each group
    """
    results = df.groupby(group_col)[feature_std_col].agg(['mean', 'std'])
    # Within-group standardization: each group should have mean≈0, std≈1
    if (results['mean'].abs() < 0.1).all() and (results['std'].between(0.9, 1.1)).all():
        print("✅ CORRECT: Within-group standardization detected")
        return True
    else:
        # Global standardization: groups will have varying means and stds
        print("❌ WRONG: Global standardization detected!")
        print("Group means:", results['mean'].values[:5])
        print("Group stds:", results['std'].values[:5])
        return False
check_within_group_standardization(df, 'match_id', 'distance_run_std')
Expected output:
Input:
Steps:
# ❌ WRONG approach (existing code)
df['cumulative_wins'] = df.groupby('match_id')['point_won'].cumsum()
# Problem: This just counts total wins up to this point
# Doesn't capture recent momentum!
# ✅ CORRECT approach (fix)
# Replace cumulative with sliding window
window = 10
df['recent_win_rate'] = df.groupby('match_id')['point_won'].transform(
    lambda x: x.rolling(window, min_periods=1).mean()
)
# Compare
print("Cumulative (wrong):", df['cumulative_wins'].values[50:60])
print("Sliding window (correct):", df['recent_win_rate'].values[50:60])
# Cumulative: [25, 26, 26, 27, 28, ...] - monotonic
# Sliding window: [0.6, 0.7, 0.5, 0.6, ...] - fluctuates with momentum
Expected output:
references/index.md: Navigation and overview
references/error-case-studies.md: Real-world preprocessing disasters from tennis data
references/decision-trees.md: Full decision trees for all preprocessing choices
references/validation-checklist.md: Pre-processing validation checklist
references/ai-common-pitfalls.md: AI-specific errors (data leakage, semantic fallacies, distribution blindness)
⚠️ CRITICAL NOTICE: USER'S PERSONAL EXPERIENCE-BASED SKILL ⚠️
This skill is NOT theoretical - it's based on real project failures:
Why this matters to you (Claude):
Authority level: HIGHEST
Sources:
Mandatory consultation:
Last updated: 2026-01-18 (V1.1)
Known limits: