Scientific Data Preprocessing Skill

⚠️ CRITICAL: USER'S HARD-WON EXPERIENCE - MANDATORY CONSULTATION ⚠️

This skill encapsulates painful lessons learned from real preprocessing disasters (88.9% error rate documented). ALWAYS use this skill for planning, reflection, and validation when ANY data preprocessing is involved.

Why this skill is mandatory:

Based on actual project failures (V1.0, V2.0 case studies)
Prevents data leakage that causes production disasters
Catches semantic errors AI agents commonly make
Saves weeks of debugging and model retraining

When to invoke (DO NOT SKIP):

✅ Before starting ANY data preprocessing task
✅ During preprocessing for reflection and validation
✅ After preprocessing for comprehensive audit
✅ When reviewing AI-generated preprocessing code

Core Mission

Prevent catastrophic preprocessing errors in grouped time-series data by applying multi-level feature analysis and respecting data structure boundaries.

When to Use This Skill

MANDATORY consultation - trigger immediately when:

Data Preprocessing Tasks (ALWAYS)

Any data cleaning, transformation, or preparation work
Loading and preparing data for modeling
Creating training/test splits
Handling missing values (imputation, deletion)
Feature scaling/normalization/standardization
Encoding categorical variables
Feature engineering or construction
Feature selection or dimensionality reduction

Data Structure Types (ALWAYS)

Preprocessing time-series data with natural groupings (matches, sessions, patients, experiments)
Sports analytics (tennis, basketball, etc.)
Medical/clinical data with patient groupings
Panel data or longitudinal studies
Any grouped/hierarchical data structure

Quality Assurance (ALWAYS)

Auditing existing preprocessing for data leakage or semantic errors
Reviewing AI-generated preprocessing code for common pitfalls
Validating preprocessing before model training
Debugging unexpected model performance

Critical Checkpoints (NEVER SKIP)

✅ BEFORE: Planning preprocessing strategy
✅ DURING: Reflecting on decisions and checking for errors
✅ AFTER: Comprehensive validation and audit

Trigger keywords that MUST invoke this skill:

"preprocess", "preprocessing", "data cleaning", "data preparation"
"standardize", "normalize", "scale", "transform"
"impute", "fill missing", "handle NaN"
"encode", "one-hot", "categorical"
"feature engineering", "feature selection", "feature construction"
"train test split", "cross validation split"
"interpolate", "smooth", "aggregate"

Not For / Boundaries

This skill does NOT:

Handle purely cross-sectional data (ungrouped, single timepoint)
Make domain-specific feature engineering decisions (you decide business logic)
Choose ML models (focuses on preprocessing only)
Handle distributed/big data infrastructure (assumes data fits in memory)

Required inputs before proceeding:

Confirmation that data has groups (e.g., match_id, patient_id, session_id)
Understanding of whether the goal is within-group (relative) or cross-group (absolute) comparison
Domain constraints on data ranges/units
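The required inputs above can be checked mechanically before any transformation runs. The following is a minimal sketch, not part of the original skill: the helper name `check_required_inputs` and the column name `point_no` are hypothetical, and it assumes the data carries one grouping column and one ordering column.

```python
import pandas as pd

def check_required_inputs(df, group_col, order_col):
    """Return a list of problems to resolve before preprocessing starts."""
    problems = []
    for col in (group_col, order_col):
        if col not in df.columns:
            problems.append(f"missing column: {col}")
    if problems:
        return problems

    if df[group_col].isna().any():
        problems.append(f"{group_col} contains missing values")
    if df[group_col].nunique() < 2:
        problems.append("fewer than two groups; data may be ungrouped")

    # Rows of each group should already be in time order
    ordered = df.groupby(group_col)[order_col].apply(
        lambda s: s.is_monotonic_increasing
    )
    for gid in ordered.index[~ordered.to_numpy(dtype=bool)]:
        problems.append(f"group {gid!r} is not sorted by {order_col}")
    return problems

# Hypothetical toy data: match B is out of order
toy = pd.DataFrame({
    "match_id": ["A", "A", "A", "B", "B"],
    "point_no": [1, 2, 3, 2, 1],
})
print(check_required_inputs(toy, "match_id", "point_no"))
```

An empty list means the structural preconditions hold. Domain constraints on ranges and units still need a separate check (see Level 4 below).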

Quick Reference

Multi-Level Feature Analysis Framework

Level 1: Data Type

# Check data types
df.dtypes  # int64, float64, object, etc.

Level 2: Feature Type Classification

# Binary (0/1)
binary_features = [col for col in df.columns if df[col].nunique() == 2]

# Categorical (finite discrete values)
categorical_features = [col for col in df.select_dtypes(include='object').columns]

# Continuous (infinite possible values)
continuous_features = [col for col in df.select_dtypes(include=['float64', 'int64']).columns
                       if df[col].nunique() > 10]

Level 3: Data Structure

# Check for grouping
print(f"Number of groups: {df['group_id'].nunique()}")
print(f"Avg points per group: {df.groupby('group_id').size().mean():.1f}")

# Check for time-series
df_sorted = df.sort_values(['group_id', 'timestamp'])

Level 4: Physical Meaning

# Validate physical ranges
assert df['speed_mph'].max() < 200, "Speed exceeds physical limit"
assert df['distance_meters'].min() >= 0, "Negative distance impossible"

Critical Processing Decision Tree

# Decision: Within-group or global processing?
def choose_processing_scope(data, feature, goal):
    """
    goal = 'relative' → within-group (e.g., "this point was intense FOR THIS MATCH")
    goal = 'absolute' → global (e.g., "this was an intense point OVERALL")
    """
    if goal == 'relative':
        return 'within_group'
    elif goal == 'absolute':
        return 'global'
    else:
        raise ValueError("Goal must be 'relative' or 'absolute'")

Pattern 1: Within-Group Interpolation (CORRECT)

from scipy.interpolate import CubicSpline
import numpy as np

# ✅ CORRECT: Interpolate within each group
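for group_id in df['match_id'].unique():
    mask = df['match_id'] == group_id
    group_data = df.loc[mask, 'speed_mph']

    # Positions (0..n-1 within the group) of valid and missing values
    valid = group_data.notna().to_numpy()
    valid_positions = np.flatnonzero(valid)
    missing_positions = np.flatnonzero(~valid)

    # Require enough points for a stable spline, and something to fill
    if len(valid_positions) >= 4 and len(missing_positions) > 0:
        cs = CubicSpline(valid_positions, group_data.to_numpy()[valid])
        # Assign by row label so the group-level mask never has to align with df
        df.loc[group_data.index[missing_positions], 'speed_mph'] = cs(missing_positions)
        # Note: gaps at the start or end of a group are extrapolated by the spline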

Pattern 2: Global Interpolation (WRONG - Don't Do This)

# ❌ WRONG: Cross-group interpolation
# This interpolates between match A's last point and match B's first point!
cs = CubicSpline(
    np.where(df['speed_mph'].notna())[0],  # ❌ All indices globally
    df['speed_mph'].dropna().values
)
df.loc[df['speed_mph'].isna(), 'speed_mph'] = cs(
    np.where(df['speed_mph'].isna())[0]
)
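The contamination in Pattern 2 is easiest to see on a tiny table. The sketch below swaps the cubic spline for pandas' linear interpolation so the numbers can be checked by hand. The toy values are invented for illustration only.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "match_id": ["A", "A", "A", "B", "B", "B"],
    "speed_mph": [100.0, 110.0, np.nan, np.nan, 60.0, 70.0],
})

# Global: the gap is filled on a straight line from 110 (match A) to 60 (match B)
global_fill = toy["speed_mph"].interpolate(method="linear")

# Within-group: each match only sees its own values;
# limit_direction="both" also fills gaps at the start of a match
within_fill = toy.groupby("match_id")["speed_mph"].transform(
    lambda s: s.interpolate(method="linear", limit_direction="both")
)

print(pd.DataFrame({"global": global_fill, "within": within_fill}))
```

In the global column, the last point of match A and the first point of match B get values near 93.3 and 76.7, which belong to neither match. In the within-group column they take 110 and 60, the nearest valid value from their own match.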

Pattern 3: Within-Group Standardization (for Relative Analysis)

from sklearn.preprocessing import StandardScaler

# ✅ CORRECT: Standardize within each match
for match_id in df['match_id'].unique():
    mask = df['match_id'] == match_id
    scaler = StandardScaler()

    df.loc[mask, 'distance_run_std_within'] = scaler.fit_transform(
        df.loc[mask, ['distance_run']]
    ).ravel()  # flatten the (n, 1) output to match the single target column

# Interpretation: z=+2 means "2 std above average FOR THIS MATCH"

Pattern 4: Global Standardization (for Absolute Comparison)

# ✅ CORRECT: Global standardization (when appropriate)
# If a train/test split is planned, fit the scaler on the training rows only
scaler = StandardScaler()
df['distance_run_std_global'] = scaler.fit_transform(df[['distance_run']]).ravel()

# Interpretation: z=+2 means "2 std above average ACROSS ALL MATCHES"

Pattern 5: Feature Type Processing Rules

import pandas as pd

# Binary variables (0/1) - KEEP AS-IS
binary_cols = ['is_ace', 'is_winner', 'is_error']
# ❌ NEVER standardize these! They have semantic meaning as 0/1

# Categorical variables - ONE-HOT ENCODE
df_encoded = pd.get_dummies(df, columns=['server', 'serve_number'], dtype=int)

# Continuous variables - STANDARDIZE (within-group or global)
continuous_cols = ['distance_run', 'rally_count', 'speed_mph']
# ✅ Apply pattern 3 or 4 based on goal

Pattern 6: Sliding Window Features (for Momentum)

# ✅ CORRECT: Sliding window for momentum analysis
window = 10

df['win_rate_last10'] = df.groupby('match_id')['point_won'].transform(
    lambda x: x.rolling(window, min_periods=1).mean()
)

# ❌ WRONG: Cumulative features (loses temporal locality)
df['cumulative_points_won'] = df.groupby('match_id')['point_won'].cumsum()
# This just increases monotonically and correlates with point_number
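To make the contrast in Pattern 6 concrete, here is a small sketch on invented data with a window of 3. It also adds a `shift(1)` variant, which is an assumption beyond the original pattern: if `point_won` is also the prediction target, a window that includes the current point would leak it.

```python
import pandas as pd

toy = pd.DataFrame({
    "match_id": ["A"] * 6 + ["B"] * 3,
    "point_won": [1, 1, 0, 0, 0, 1, 0, 1, 1],
})
window = 3

# Cumulative count: only ever grows within a match
toy["cumulative"] = toy.groupby("match_id")["point_won"].cumsum()

# Sliding window including the current point
toy["win_rate_last3"] = toy.groupby("match_id")["point_won"].transform(
    lambda x: x.rolling(window, min_periods=1).mean()
)

# Sliding window over previous points only (first point of each match is NaN)
toy["win_rate_prev3"] = toy.groupby("match_id")["point_won"].transform(
    lambda x: x.shift(1).rolling(window, min_periods=1).mean()
)

print(toy)
```

Both window columns restart at the first row of match B because the rolling call runs inside `groupby`. The cumulative column for match A reads 1, 2, 2, 2, 2, 3, so it hides the three lost points in the middle that the window columns expose.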

Pattern 7: Data Quality Validation

def validate_data_quality(df, feature, expected_range):
    """Validate before processing"""
    # Check range
    assert df[feature].min() >= expected_range[0], f"{feature} below minimum"
    assert df[feature].max() <= expected_range[1], f"{feature} above maximum"

    # Check for anomalies
    mean = df[feature].mean()
    std = df[feature].std()

    if std > mean:
        print(f"⚠️ WARNING: {feature} has std > mean (highly skewed or errors)")

    # Check missing pattern: fraction of missing values within each group
    missing_frac_by_group = df[feature].isna().groupby(df['match_id']).mean()
    if missing_frac_by_group.max() > 0.5:
        print(f"⚠️ WARNING: {feature} has >50% missing in some groups")

# Example
validate_data_quality(df, 'speed_mph', expected_range=(50, 165))

Pattern 8: Detect Processing Scope Automatically

def detect_processing_scope(df, group_col, feature_col):
    """
    Recommend within-group vs global based on variance structure
    """
    # Calculate variance components
    within_group_var = df.groupby(group_col)[feature_col].var().mean()
    global_var = df[feature_col].var()

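    # Rough between-group share of variance (an ICC-like ratio, not a formal ICC;
    # it can dip below 0 when groups are nearly identical)
    between_group_var = global_var - within_group_var
    icc = between_group_var / global_var if global_var > 0 else 0.0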

    if icc > 0.5:
        return 'within_group', f"High between-group variance (ICC={icc:.2f})"
    else:
        return 'global', f"Low between-group variance (ICC={icc:.2f})"

scope, reason = detect_processing_scope(df, 'match_id', 'distance_run')
print(f"Recommended: {scope} - {reason}")
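A worked example helps calibrate the variance ratio used in Pattern 8. The sketch below is a simplified stand-in under stated assumptions: the helper name `between_group_share` and the toy columns are hypothetical, and the clamp at 0 is an addition not present in the pattern.

```python
import pandas as pd

def between_group_share(df, group_col, feature_col):
    """Share of total variance not explained by the average within-group variance."""
    within_var = df.groupby(group_col)[feature_col].var().mean()
    total_var = df[feature_col].var()
    if total_var == 0:
        return 0.0
    # Clamp at 0: the raw ratio goes negative when groups look alike
    return max(0.0, (total_var - within_var) / total_var)

toy = pd.DataFrame({
    "match_id": ["A", "A", "A", "B", "B", "B"],
    "shifted": [1.0, 2.0, 3.0, 101.0, 102.0, 103.0],  # groups differ in level
    "similar": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],        # groups look the same
})

print(between_group_share(toy, "match_id", "shifted"))  # close to 1
print(between_group_share(toy, "match_id", "similar"))  # 0.0 after clamping
```

For `similar`, the within-group variance is 1.0 and the total variance is 0.8, so the unclamped ratio would be -0.25. This shows the quantity is a heuristic and not a true intraclass correlation, so treat the 0.5 threshold as a prompt to look at the data, not as a rule.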

Pattern 9: Data Leakage Detection

def detect_data_leakage(df, target_col, feature_cols, id_cols):
    """
    Critical checks for data leakage and AI common pitfalls
    """
    issues = []

    # 1. ID Leakage: High cardinality variables as features
    for col in feature_cols:
        if col in id_cols:
            issues.append(f"❌ FATAL: {col} is an ID - NEVER use as feature")
            continue

        # Check if looks like ID (>50% unique)
        uniqueness = df[col].nunique() / len(df)
        if uniqueness > 0.5:
            issues.append(f"⚠️ {col}: {uniqueness*100:.1f}% unique - possible ID leakage")

    # 2. Causal Inversion: Near-perfect correlation with target
    for col in feature_cols:
        if col == target_col or col in id_cols:
            continue  # IDs were already reported in check 1
        if df[col].dtype in ['int64', 'float64']:
            corr = abs(df[[col, target_col]].corr().iloc[0, 1])
            if corr > 0.95:
                issues.append(f"❌ FATAL: {col} correlation={corr:.3f} - likely consequence of target!")

    # 3. Meaningless Numeric: Codes treated as numbers
    for col in feature_cols:
        if col in id_cols:
            continue  # IDs were already reported in check 1
        if df[col].dtype in ['int64', 'float64']:
            # Pattern: High values, many uniques, looks like code
            if df[col].min() > 1000 and df[col].nunique() > 100:
                issues.append(f"⚠️ {col}: Looks like code (zipcode/ID) - should be categorical")

    # 4. Time Travel: Check if standardization used global statistics
    # (Requires knowing if train/test split was done first)

    # Print report
    if issues:
        print("="*60)
        print("DATA LEAKAGE AUDIT")
        print("="*60)
        for issue in issues:
            print(issue)
        print("="*60)
    else:
        print("✅ No obvious leakage detected")

    return issues

# Example usage
issues = detect_data_leakage(
    df,
    target_col='point_won',
    feature_cols=['speed_mph', 'user_id', 'distance_run'],
    id_cols=['match_id', 'user_id']
)
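Check 4 in `detect_data_leakage` is left as a comment because it depends on when the split happened. One way to avoid the problem altogether is to split by whole groups first and compute scaling statistics from the training rows only. The following is a minimal sketch on synthetic data. The match names, the 3:1 split and the manual z-score are illustrative assumptions, not the original project's pipeline.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "match_id": np.repeat(["m1", "m2", "m3", "m4"], 5),
    "distance_run": rng.normal(20.0, 5.0, size=20),
})

# Split by whole matches so no match contributes rows to both sides
shuffled = rng.permutation(df["match_id"].unique())
test_matches = shuffled[:1]
is_test = df["match_id"].isin(test_matches)
train, test = df[~is_test].copy(), df[is_test].copy()

# Statistics come from the training rows only
mu = train["distance_run"].mean()
sigma = train["distance_run"].std(ddof=0)

# The same training statistics are applied to both sides
train["distance_run_std_global"] = (train["distance_run"] - mu) / sigma
test["distance_run_std_global"] = (test["distance_run"] - mu) / sigma

print(train["distance_run_std_global"].mean(), test["distance_run_std_global"].mean())
```

The training column has mean 0 by construction, while the test column generally does not. That is expected, since the test rows played no part in estimating `mu` and `sigma`.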

Pattern 10: Distribution-Aware Scaling

import numpy as np
from scipy.stats import skew, kurtosis
from sklearn.preprocessing import StandardScaler, RobustScaler

def smart_scaler_selection(df, col):
    """
    Choose scaler based on distribution characteristics
    """
    data = df[col].dropna()

    # Check distribution (scipy's kurtosis() returns excess kurtosis by default: normal ≈ 0)
    skewness = skew(data)
    kurt = kurtosis(data)

    print(f"{col}: skewness={skewness:.2f}, kurtosis={kurt:.2f}")

    if abs(skewness) < 0.5 and abs(kurt) < 3:
        # Roughly normal
        print("  → StandardScaler (data is roughly normal)")
        return StandardScaler(), None

    elif skewness > 1 and data.min() >= 0:
        # Right-skewed (long tail); the log1p step below needs non-negative values
        print("  → Log transform + StandardScaler (right-skewed)")
        return StandardScaler(), 'log'

    else:
        # Heavy outliers or non-normal
        print("  → RobustScaler (heavy outliers)")
        return RobustScaler(), None

# Example usage
for col in continuous_features:
    scaler, transform = smart_scaler_selection(df, col)

    if transform == 'log':
        df[f'{col}_log'] = np.log1p(df[col])
        df[f'{col}_scaled'] = scaler.fit_transform(df[[f'{col}_log']])
    else:
        df[f'{col}_scaled'] = scaler.fit_transform(df[[col]])

Examples

Example 1: Tennis Match Preprocessing (Complete Pipeline)

Input:

CSV with 7,284 rows, 31 matches
Features: speed_mph, distance_run, rally_count, is_ace, server
Goal: Analyze momentum (relative intensity within each match)

Steps:

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# 1. Load and inspect
df = pd.read_csv('tennis_data.csv')
print(f"Matches: {df['match_id'].nunique()}")
print(f"Features: {df.dtypes}")

# 2. Classify features
binary_features = ['is_ace', 'is_winner', 'is_break_point']
categorical_features = ['server', 'serve_number']
continuous_features = ['distance_run', 'speed_mph', 'rally_count']

# 3. Validate data quality
for feat in continuous_features:
    print(f"\n{feat}:")
    print(df[feat].describe())
    # Check for impossible values
    if feat == 'speed_mph':
        assert df[feat].max() < 170, "Speed exceeds world record!"

# 4. Handle missing values (within-group)
for match_id in df['match_id'].unique():
    mask = df['match_id'] == match_id
    for feat in continuous_features:
        if df.loc[mask, feat].isna().any():
            # Simple linear interpolation within match; limit_direction='both'
            # also fills gaps at the start of a match
            df.loc[mask, feat] = df.loc[mask, feat].interpolate(
                method='linear', limit_direction='both'
            )

# 5. One-hot encode categorical
df = pd.get_dummies(df, columns=categorical_features, dtype=int)

# 6. Standardize continuous features WITHIN each match
for feat in continuous_features:
    df[f'{feat}_std'] = np.nan
    for match_id in df['match_id'].unique():
        mask = df['match_id'] == match_id
        scaler = StandardScaler()
        df.loc[mask, f'{feat}_std'] = scaler.fit_transform(
            df.loc[mask, [feat]]
        ).ravel()  # flatten the (n, 1) output to match the single target column

# 7. Create sliding window features
window = 10
df['win_rate_last10'] = df.groupby('match_id')['point_won'].transform(
    lambda x: x.rolling(window, min_periods=1).mean()
)

# 8. KEEP binary features as 0/1 (don't transform!)
# binary_features are already correct

print("\n✅ Preprocessing complete!")
print(f"Final shape: {df.shape}")
print(f"Standardized features: {[f for f in df.columns if f.endswith('_std')]}")

Expected output:

Binary features remain 0/1
Categorical features one-hot encoded (e.g., server_1, server_2)
Continuous features have both original and _std versions
_std features have mean≈0, std≈1 WITHIN each match
Sliding window features capture local momentum
No missing values

Example 2: Detecting Cross-Group Contamination

Input:

Preprocessed data where you suspect cross-group standardization

Steps:

# Check if standardization was done correctly
def check_within_group_standardization(df, group_col, feature_std_col):
    """
    Verify that standardized feature has mean≈0, std≈1 within each group
    """
    results = df.groupby(group_col)[feature_std_col].agg(['mean', 'std'])

    # Within-group standardization: each group should have mean≈0, std≈1
    if (results['mean'].abs() < 0.1).all() and (results['std'].between(0.9, 1.1)).all():
        print("✅ CORRECT: Within-group standardization detected")
        return True

    # Global standardization: groups will have varying means and stds
    else:
        print("❌ WRONG: Global standardization detected!")
        print("Group means:", results['mean'].values[:5])
        print("Group stds:", results['std'].values[:5])
        return False

check_within_group_standardization(df, 'match_id', 'distance_run_std')

Expected output:

CORRECT: All group means ≈ 0, all group stds ≈ 1
WRONG: Group means vary widely, indicating global standardization

Example 3: Fixing Cumulative Feature Error

Input:

Existing pipeline using cumulative sums for momentum

Steps:

# ❌ WRONG approach (existing code)
df['cumulative_wins'] = df.groupby('match_id')['point_won'].cumsum()

# Problem: This just counts total wins up to this point
# Doesn't capture recent momentum!

# ✅ CORRECT approach (fix)
# Replace cumulative with sliding window
window = 10
df['recent_win_rate'] = df.groupby('match_id')['point_won'].transform(
    lambda x: x.rolling(window, min_periods=1).mean()
)

# Compare
print("Cumulative (wrong):", df['cumulative_wins'].values[50:60])
print("Sliding window (correct):", df['recent_win_rate'].values[50:60])

# Cumulative: [25, 26, 26, 27, 28, ...] - monotonic
# Sliding window: [0.6, 0.7, 0.5, 0.6, ...] - fluctuates with momentum

Expected output:

Cumulative features removed
Sliding window features show local variations
Momentum analysis now captures short-term trends

References

references/index.md: Navigation and overview
references/error-case-studies.md: Real-world preprocessing disasters from tennis data
references/decision-trees.md: Full decision trees for all preprocessing choices
references/validation-checklist.md: Pre-processing validation checklist
references/ai-common-pitfalls.md: AI-specific errors (data leakage, semantic fallacies, distribution blindness)

Maintenance

⚠️ CRITICAL NOTICE: USER'S PERSONAL EXPERIENCE-BASED SKILL ⚠️

This skill is NOT theoretical - it's based on real project failures:

V1.0 disaster: 88.9% error rate, weeks of wasted work
V2.0 issues: Cross-group contamination, unreliable results
V3.0 success: All errors fixed, production-ready

Why this matters to you (Claude):

These are the EXACT errors AI agents commonly make
User has already paid the price for these mistakes
Ignoring this skill = repeating documented failures
Following this skill = learning from experience without pain

Authority level: HIGHEST

Based on user's hard-won lessons from actual project
Validated through multiple iterations (V1.0 → V2.0 → V3.0)
Every error documented with impact metrics
Every fix validated with comprehensive testing

Sources:

Primary: User's personal project (2024 MCM Problem C - Tennis Momentum Analysis)
Secondary: Statistical best practices for grouped data
Tertiary: Common AI preprocessing errors observed across domains

Mandatory consultation:

⚠️ ALWAYS consult before, during, and after any data preprocessing
⚠️ NEVER skip validation steps outlined in this skill
⚠️ When in doubt, err on the side of caution (use this skill)

Last updated: 2026-01-18 (V1.1)

Known limits:

Assumes data fits in memory (not for big data infrastructure)
Focused on numeric/categorical features (text/image preprocessing partially covered)
Does not prescribe domain-specific feature engineering (user decides business logic)
Requires basic understanding of statistics (mean, std, correlation)

Scientific Data Preprocessing Skill

⚠️ CRITICAL: USER'S HARD-WON EXPERIENCE - MANDATORY CONSULTATION ⚠️

Why this skill is mandatory:

Based on actual project failures (V1.0, V2.0 case studies)
Prevents data leakage that causes production disasters
Catches semantic errors AI agents commonly make
Saves weeks of debugging and model retraining

When to invoke (DO NOT SKIP):

✅ Before starting ANY data preprocessing task
✅ During preprocessing for reflection and validation
✅ After preprocessing for comprehensive audit
✅ When reviewing AI-generated preprocessing code

Core Mission

Prevent catastrophic preprocessing errors in grouped time-series data by applying multi-level feature analysis and respecting data structure boundaries.

When to Use This Skill

MANDATORY consultation - trigger immediately when:

Data Preprocessing Tasks (ALWAYS)

Any data cleaning, transformation, or preparation work
Loading and preparing data for modeling
Creating training/test splits
Handling missing values (imputation, deletion)
Feature scaling/normalization/standardization
Encoding categorical variables
Feature engineering or construction
Feature selection or dimensionality reduction

Data Structure Types (ALWAYS)

Preprocesssing time-series data with natural groupings (matches, sessions, patients, experiments)
Sports analytics (tennis, basketball, etc.)
Medical/clinical data with patient groupings
Panel data or longitudinal studies
Any grouped/hierarchical data structure

Quality Assurance (ALWAYS)

Auditing existing preprocessing for data leakage or semantic errors
Reviewing AI-generated preprocessing code for common pitfalls
Validating preprocessing before model training
Debugging unexpected model performance

Critical Checkpoints (NEVER SKIP)

✅ BEFORE: Planning preprocessing strategy
✅ DURING: Reflecting on decisions and checking for errors
✅ AFTER: Comprehensive validation and audit

Trigger keywords that MUST invoke this skill:

"preprocess", "preprocessing", "data cleaning", "data preparation"
"standardize", "normalize", "scale", "transform"
"impute", "fill missing", "handle NaN"
"encode", "one-hot", "categorical"
"feature engineering", "feature selection", "feature construction"
"train test split", "cross validation split"
"interpolate", "smooth", "aggregate"

Not For / Boundaries

This skill does NOT:

Handle purely cross-sectional data (ungrouped, single timepoint)
Make domain-specific feature engineering decisions (you decide business logic)
Choose ML models (focuses on preprocessing only)
Handle distributed/big data infrastructure (assumes data fits in memory)

Required inputs before proceeding:

Confirmation that data has groups (e.g., match_id, patient_id, session_id)
Understanding of whether goal is within-group (relative) or cross-group (absolute) comparison
Domain constraints on data ranges/units

Quick Reference

Multi-Level Feature Analysis Framework

Level 1: Data Type

# Check data types
df.dtypes  # int64, float64, object, etc.

Level 2: Feature Type Classification

# Binary (0/1)
binary_features = [col for col in df.columns if df[col].nunique() == 2]

# Categorical (finite discrete values)
categorical_features = [col for col in df.select_dtypes(include='object').columns]

# Continuous (infinite possible values)
continuous_features = [col for col in df.select_dtypes(include=['float64', 'int64']).columns
                       if df[col].nunique() > 10]

Level 3: Data Structure

# Check for grouping
print(f"Number of groups: {df['group_id'].nunique()}")
print(f"Avg points per group: {df.groupby('group_id').size().mean():.1f}")

# Check for time-series
df_sorted = df.sort_values(['group_id', 'timestamp'])

Level 4: Physical Meaning

# Validate physical ranges
assert df['speed_mph'].max() < 200, "Speed exceeds physical limit"
assert df['distance_meters'].min() >= 0, "Negative distance impossible"

Critical Processing Decision Tree

# Decision: Within-group or global processing?
def choose_processing_scope(data, feature, goal):
    """
    goal = 'relative' → within-group (e.g., "this point was intense FOR THIS MATCH")
    goal = 'absolute' → global (e.g., "this was an intense point OVERALL")
    """
    if goal == 'relative':
        return 'within_group'
    elif goal == 'absolute':
        return 'global'
    else:
        raise ValueError("Goal must be 'relative' or 'absolute'")

Pattern 1: Within-Group Interpolation (CORRECT)

from scipy.interpolate import CubicSpline
import numpy as np

# ✅ CORRECT: Interpolate within each group
for group_id in df['match_id'].unique():
    mask = df['match_id'] == group_id
    group_data = df.loc[mask, 'speed_mph'].copy()

    # Get valid (non-NaN) indices
    valid_idx = group_data.notna()
    valid_positions = np.where(valid_idx)[0]
    valid_values = group_data[valid_idx].values

    if len(valid_positions) >= 4:
        cs = CubicSpline(valid_positions, valid_values)
        missing_positions = np.where(~valid_idx)[0]
        df.loc[mask & ~valid_idx, 'speed_mph'] = cs(missing_positions)

Pattern 2: Global Interpolation (WRONG - Don't Do This)

# ❌ WRONG: Cross-group interpolation
# This interpolates between match A's last point and match B's first point!
cs = CubicSpline(
    np.where(df['speed_mph'].notna())[0],  # ❌ All indices globally
    df['speed_mph'].dropna().values
)
df.loc[df['speed_mph'].isna(), 'speed_mph'] = cs(
    np.where(df['speed_mph'].isna())[0]
)

Pattern 3: Within-Group Standardization (for Relative Analysis)

from sklearn.preprocessing import StandardScaler

# ✅ CORRECT: Standardize within each match
for match_id in df['match_id'].unique():
    mask = df['match_id'] == match_id
    scaler = StandardScaler()

    df.loc[mask, 'distance_run_std_within'] = scaler.fit_transform(
        df.loc[mask, [['distance_run']]
    )

# Interpretation: z=+2 means "2 std above average FOR THIS MATCH"

Pattern 4: Global Standardization (for Absolute Comparison)

# ✅ CORRECT: Global standardization (when appropriate)
scaler = StandardScaler()
df['distance_run_std_global'] = scaler.fit_transform(df[['distance_run']])

# Interpretation: z=+2 means "2 std above average ACROSS ALL MATCHES"

Pattern 5: Feature Type Processing Rules

# Binary variables (0/1) - KEEP AS-IS
binary_cols = ['is_ace', 'is_winner', 'is_error']
# ❌ NEVER standardize these! They have semantic meaning as 0/1

# Categorical variables - ONE-HOT ENCODE
df_encoded = pd.get_dummies(df, columns=['server', 'serve_number'], dtype=int)

# Continuous variables - STANDARDIZE (within-group or global)
continuous_cols = ['distance_run', 'rally_count', 'speed_mph']
# ✅ Apply pattern 3 or 4 based on goal

Pattern 6: Sliding Window Features (for Momentum)

# ✅ CORRECT: Sliding window for momentum analysis
window = 10

df['win_rate_last10'] = df.groupby('match_id')['point_won'].transform(
    lambda x: x.rolling(window, min_periods=1).mean()
)

# ❌ WRONG: Cumulative features (loses temporal locality)
df['cumulative_points_won'] = df.groupby('match_id')['point_won'].cumsum()
# This just increases monotonically and correlates with point_number

Pattern 7: Data Quality Validation

def validate_data_quality(df, feature, expected_range):
    """Validate before processing"""
    # Check range
    assert df[feature].min() >= expected_range[0], f"{feature} below minimum"
    assert df[feature].max() <= expected_range[1], f"{feature} above maximum"

    # Check for anomalies
    mean = df[feature].mean()
    std = df[feature].std()

    if std > mean:
        print(f"⚠️ WARNING: {feature} has std > mean (highly skewed or errors)")

    # Check missing pattern
    missing_by_group = df.groupby('match_id')[feature].apply(lambda x: x.isna().sum())
    if missing_by_group.max() > len(df) / df['match_id'].nunique() * 0.5:
        print(f"⚠️ WARNING: {feature} has >50% missing in some groups")

# Example
validate_data_quality(df, 'speed_mph', expected_range=(50, 165))

Pattern 8: Detect Processing Scope Automatically

def detect_processing_scope(df, group_col, feature_col):
    """
    Recommend within-group vs global based on variance structure
    """
    # Calculate variance components
    within_group_var = df.groupby(group_col)[feature_col].var().mean()
    global_var = df[feature_col].var()

    # Intraclass correlation
    between_group_var = global_var - within_group_var
    icc = between_group_var / global_var

    if icc > 0.5:
        return 'within_group', f"High between-group variance (ICC={icc:.2f})"
    else:
        return 'global', f"Low between-group variance (ICC={icc:.2f})"

scope, reason = detect_processing_scope(df, 'match_id', 'distance_run')
print(f"Recommended: {scope} - {reason}")

Pattern 9: Data Leakage Detection

def detect_data_leakage(df, target_col, feature_cols, id_cols):
    """
    Critical checks for data leakage and AI common pitfalls
    """
    issues = []

    # 1. ID Leakage: High cardinality variables as features
    for col in feature_cols:
        if col in id_cols:
            issues.append(f"❌ FATAL: {col} is an ID - NEVER use as feature")
            continue

        # Check if looks like ID (>50% unique)
        uniqueness = df[col].nunique() / len(df)
        if uniqueness > 0.5:
            issues.append(f"⚠️ {col}: {uniqueness*100:.1f}% unique - possible ID leakage")

    # 2. Causal Inversion: Perfect correlation with target
    for col in feature_cols:
        if col == target_col:
            continue
        if df[col].dtype in ['int64', 'float64']:
            corr = abs(df[[col, target_col]].corr().iloc[0, 1])
            if corr > 0.95:
                issues.append(f"❌ FATAL: {col} correlation={corr:.3f} - likely consequence of target!")

    # 3. Meaningless Numeric: Codes treated as numbers
    for col in feature_cols:
        if df[col].dtype in ['int64', 'float64']:
            # Pattern: High values, many uniques, looks like code
            if df[col].min() > 1000 and df[col].nunique() > 100:
                issues.append(f"⚠️ {col}: Looks like code (zipcode/ID) - should be categorical")

    # 4. Time Travel: Check if standardization used global statistics
    # (Requires knowing if train/test split was done first)

    # Print report
    if issues:
        print("="*60)
        print("DATA LEAKAGE AUDIT")
        print("="*60)
        for issue in issues:
            print(issue)
        print("="*60)
    else:
        print("✅ No obvious leakage detected")

    return issues

# Example usage
issues = detect_data_leakage(
    df,
    target_col='point_won',
    feature_cols=['speed_mph', 'user_id', 'distance_run'],
    id_cols=['match_id', 'user_id']
)

Pattern 10: Distribution-Aware Scaling

from scipy.stats import skew, kurtosis
from sklearn.preprocessing import StandardScaler, RobustScaler

def smart_scaler_selection(df, col):
    """
    Choose scaler based on distribution characteristics
    """
    data = df[col].dropna()

    # Check distribution
    skewness = skew(data)
    kurt = kurtosis(data)

    print(f"{col}: skewness={skewness:.2f}, kurtosis={kurt:.2f}")

    if abs(skewness) < 0.5 and abs(kurt) < 3:
        # Roughly normal
        print("  → StandardScaler (data is roughly normal)")
        return StandardScaler(), None

    elif skewness > 1:
        # Right-skewed (long tail)
        print("  → Log transform + StandardScaler (right-skewed)")
        return StandardScaler(), 'log'

    else:
        # Heavy outliers or non-normal
        print("  → RobustScaler (heavy outliers)")
        return RobustScaler(), None

# Example usage
for col in continuous_features:
    scaler, transform = smart_scaler_selection(df, col)

    if transform == 'log':
        df[f'{col}_log'] = np.log1p(df[col])
        df[f'{col}_scaled'] = scaler.fit_transform(df[[f'{col}_log']])
    else:
        df[f'{col}_scaled'] = scaler.fit_transform(df[[col]])

Examples

Example 1: Tennis Match Preprocessing (Complete Pipeline)

Input:

CSV with 7,284 rows, 31 matches
Features: speed_mph, distance_run, rally_count, is_ace, server
Goal: Analyze momentum (relative intensity within each match)

Steps:

import pandas as pd
from sklearn.preprocessing import StandardScaler

# 1. Load and inspect
df = pd.read_csv('tennis_data.csv')
print(f"Matches: {df['match_id'].nunique()}")
print(f"Features: {df.dtypes}")

# 2. Classify features
binary_features = ['is_ace', 'is_winner', 'is_break_point']
categorical_features = ['server', 'serve_number']
continuous_features = ['distance_run', 'speed_mph', 'rally_count']

# 3. Validate data quality
for feat in continuous_features:
    print(f"\n{feat}:")
    print(df[feat].describe())
    # Check for impossible values
    if feat == 'speed_mph':
        assert df[feat].max() < 170, "Speed exceeds world record!"

# 4. Handle missing values (within-group)
for match_id in df['match_id'].unique():
    mask = df['match_id'] == match_id
    for feat in continuous_features:
        if df.loc[mask, feat].isna().any():
            # Simple linear interpolation within match
            df.loc[mask, feat] = df.loc[mask, feat].interpolate(method='linear')

# 5. One-hot encode categorical
df = pd.get_dummies(df, columns=categorical_features, dtype=int)

# 6. Standardize continuous features WITHIN each match
for feat in continuous_features:
    df[f'{feat}_std'] = np.nan
    for match_id in df['match_id'].unique():
        mask = df['match_id'] == match_id
        scaler = StandardScaler()
        df.loc[mask, f'{feat}_std'] = scaler.fit_transform(
            df.loc[mask, [[feat]]
        )

# 7. Create sliding window features
window = 10
df['win_rate_last10'] = df.groupby('match_id')['point_won'].transform(
    lambda x: x.rolling(window, min_periods=1).mean()
)

# 8. KEEP binary features as 0/1 (don't transform!)
# binary_features are already correct

print("\n✅ Preprocessing complete!")
print(f"Final shape: {df.shape}")
print(f"Standardized features: {[f for f in df.columns if f.endswith('_std')]}")

Expected output:

Binary features remain 0/1
Categorical features one-hot encoded (e.g., server_1, server_2)
Continuous features have both original and _std versions
_std features have mean≈0, std≈1 WITHIN each match
Sliding window features capture local momentum
No missing values
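
The expected properties listed above can be checked mechanically. The following is a minimal sketch on synthetic data, not the original tennis dataset: the column names mirror Example 1, but the values, the three-match layout, and the helper name `standardize_within_group` are assumptions made for illustration. It uses `groupby(...).transform` as a compact alternative to the nested loop in step 6, with population standard deviation (`ddof=0`) chosen to match `StandardScaler`.

```python
import numpy as np
import pandas as pd

def standardize_within_group(df, group_col, feature):
    """Z-score `feature` separately inside each group (hypothetical helper)."""
    grouped = df.groupby(group_col)[feature]
    # ddof=0 matches the population std that StandardScaler uses
    group_std = grouped.transform(lambda s: s.std(ddof=0))
    return (df[feature] - grouped.transform('mean')) / group_std

# Synthetic stand-in for the tennis data: 3 matches with different scales
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'match_id': np.repeat(['m1', 'm2', 'm3'], 50),
    'distance_run': np.concatenate([
        rng.normal(10, 2, 50),
        rng.normal(25, 5, 50),
        rng.normal(40, 8, 50),
    ]),
    'is_ace': rng.integers(0, 2, 150),
})

demo['distance_run_std'] = standardize_within_group(demo, 'match_id', 'distance_run')

# Checks corresponding to the expected-output list
per_match = demo.groupby('match_id')['distance_run_std'].agg(
    mean='mean', std=lambda s: s.std(ddof=0)
)
binary_ok = set(demo['is_ace'].unique()) <= {0, 1}
no_missing = not demo.isna().any().any()
standardized_ok = bool(
    np.allclose(per_match['mean'], 0, atol=1e-9)
    and np.allclose(per_match['std'], 1, atol=1e-9)
)

print(per_match.round(6))
print("binary intact:", binary_ok)
print("no missing values:", no_missing)
print("standardized within each match:", standardized_ok)
```

On real data the same three checks can be run right after step 8, replacing `demo` with the preprocessed `df` and looping over every `_std` column.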

Example 2: Detecting Cross-Group Contamination

Input:

Preprocessed data where you suspect cross-group standardization

Steps:

# Check if standardization was done correctly
def check_within_group_standardization(df, group_col, feature_std_col):
    """
    Verify that standardized feature has mean≈0, std≈1 within each group
    """
    results = df.groupby(group_col)[feature_std_col].agg(['mean', 'std'])

    # Within-group standardization: each group should have mean≈0, std≈1.
    # (pandas .std() uses ddof=1 while StandardScaler uses ddof=0, so very
    # small groups can land slightly above 1.)
    if (results['mean'].abs() < 0.1).all() and results['std'].between(0.9, 1.1).all():
        print("✅ CORRECT: Within-group standardization detected")
        return True

    # Otherwise group means and stds vary, which is the signature of
    # global (cross-group) standardization
    print("❌ WRONG: Global standardization detected!")
    print("Group means:", results['mean'].values[:5])
    print("Group stds:", results['std'].values[:5])
    return False

check_within_group_standardization(df, 'match_id', 'distance_run_std')

Expected output:

CORRECT: All group means ≈ 0, all group stds ≈ 1
WRONG: Group means vary widely, indicating global standardization
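
To see both outcomes side by side, the check can be run on data standardized each way. This is a sketch on synthetic data: the three groups, their means and spreads, and the helper name `looks_within_group_standardized` are assumptions for illustration, and the helper is a quiet variant of the function above that returns a boolean without printing.

```python
import numpy as np
import pandas as pd

def looks_within_group_standardized(df, group_col, col, mean_tol=0.1, std_tol=0.1):
    """Return True if every group has mean≈0 and std≈1 for `col` (hypothetical helper)."""
    stats = df.groupby(group_col)[col].agg(['mean', 'std'])
    return bool(
        (stats['mean'].abs() < mean_tol).all()
        and stats['std'].between(1 - std_tol, 1 + std_tol).all()
    )

# Three synthetic matches with clearly different baselines
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    'match_id': np.repeat(['a', 'b', 'c'], 200),
    'distance_run': np.concatenate([
        rng.normal(10, 2, 200),
        rng.normal(30, 5, 200),
        rng.normal(60, 9, 200),
    ]),
})

x = demo['distance_run']

# Global standardization: one mean/std for the whole table
demo['global_std'] = (x - x.mean()) / x.std()

# Within-group standardization: one mean/std per match
g = demo.groupby('match_id')['distance_run']
demo['within_std'] = (x - g.transform('mean')) / g.transform('std')

global_result = looks_within_group_standardized(demo, 'match_id', 'global_std')
within_result = looks_within_group_standardized(demo, 'match_id', 'within_std')

print("global_std passes check:", global_result)
print("within_std passes check:", within_result)
```

The globally standardized column fails because each match keeps its own offset from the overall mean, while the per-match column passes. If groups happen to share nearly identical distributions, a globally standardized column could also pass, so the check is evidence rather than proof.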

Example 3: Fixing Cumulative Feature Error

Input:

Existing pipeline using cumulative sums for momentum

Steps:

# ❌ WRONG approach (existing code)
df['cumulative_wins'] = df.groupby('match_id')['point_won'].cumsum()

# Problem: This just counts total wins up to this point
# Doesn't capture recent momentum!

# ✅ CORRECT approach (fix)
# Replace cumulative with sliding window
window = 10
df['recent_win_rate'] = df.groupby('match_id')['point_won'].transform(
    lambda x: x.rolling(window, min_periods=1).mean()
)

# Compare
print("Cumulative (wrong):", df['cumulative_wins'].values[50:60])
print("Sliding window (correct):", df['recent_win_rate'].values[50:60])

# Cumulative: [25, 26, 26, 27, 28, ...] - monotonic
# Sliding window: [0.6, 0.7, 0.5, 0.6, ...] - fluctuates with momentum

Expected output:

Cumulative features removed
Sliding window features show local variations
Momentum analysis now captures short-term trends
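
The difference is easiest to see on a sequence short enough to verify by hand. The sketch below uses a made-up two-match toy table and a window of 3 (both assumptions for illustration; the example above uses a window of 10).

```python
import pandas as pd

# Toy data: m1 starts with two wins, then loses three points in a row
toy = pd.DataFrame({
    'match_id': ['m1'] * 6 + ['m2'] * 4,
    'point_won': [1, 1, 0, 0, 0, 1, 0, 1, 1, 1],
})

window = 3  # small window so the effect is visible on a short sequence

# Cumulative count: only ever increases within a match
toy['cumulative_wins'] = toy.groupby('match_id')['point_won'].cumsum()

# Sliding-window win rate: computed per match, so the window restarts at m2
toy['recent_win_rate'] = toy.groupby('match_id')['point_won'].transform(
    lambda x: x.rolling(window, min_periods=1).mean()
)

print(toy)
```

In match m1 the cumulative count stays at 2 through the losing streak, while the windowed rate falls from 1.0 to 0.0 by the fifth point and recovers to one third after the next win. Because the rolling mean is computed inside `groupby('match_id')`, the first point of m2 is not affected by the end of m1.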

References

references/index.md: Navigation and overview
references/error-case-studies.md: Real-world preprocessing disasters from tennis data
references/decision-trees.md: Full decision trees for all preprocessing choices
references/validation-checklist.md: Pre-processing validation checklist
references/ai-common-pitfalls.md: AI-specific errors (data leakage, semantic fallacies, distribution blindness)

Maintenance

⚠️ CRITICAL NOTICE: USER'S PERSONAL EXPERIENCE-BASED SKILL ⚠️

This skill is NOT theoretical - it's based on real project failures:

V1.0 disaster: 88.9% error rate, weeks of wasted work
V2.0 issues: Cross-group contamination, unreliable results
V3.0 success: All errors fixed, production-ready

Why this matters to you (Claude):

These are the EXACT errors AI agents commonly make
User has already paid the price for these mistakes
Ignoring this skill = repeating documented failures
Following this skill = learning from experience without pain

Authority level: HIGHEST

Based on user's hard-won lessons from actual project
Validated through multiple iterations (V1.0 → V2.0 → V3.0)
Every error documented with impact metrics
Every fix validated with comprehensive testing

Sources:

Primary: User's personal project (2024 MCM Problem C - Tennis Momentum Analysis)
Secondary: Statistical best practices for grouped data
Tertiary: Common AI preprocessing errors observed across domains

Mandatory consultation:

⚠️ ALWAYS consult before, during, and after any data preprocessing
⚠️ NEVER skip validation steps outlined in this skill
⚠️ When in doubt, err on the side of caution (use this skill)

Last updated: 2026-01-18 (V1.1)

Known limits:

Assumes data fits in memory (not for big data infrastructure)
Focused on numeric/categorical features (text/image preprocessing partially covered)
Does not prescribe domain-specific feature engineering (user decides business logic)
Requires basic understanding of statistics (mean, std, correlation)

Adoption

foryourhealth111-pixel/scientific-data-preprocessing

$ install --global
