NaN-Safe Correlation Computation

Overview

Computing correlations across many features (genes, proteins, variants) when missing values are present is error-prone. The most common mistake is using bulk matrix shortcuts that silently mishandle NaN, producing incorrect correlation values. This guide covers correct per-feature pairwise computation, degenerate input filtering, and performance optimization.

Key Concepts

Pairwise vs Listwise Deletion

Pairwise deletion: For each feature pair, remove only samples where either value is NaN. Each feature uses the maximum available data.
Listwise deletion: Remove any sample with NaN in any feature. Wastes valid data and biases results if missingness is not completely random.
Rule: Always use pairwise deletion for per-feature correlations.

Why Bulk Matrix Shortcuts Fail

Different features have different missing value patterns across samples. Bulk methods handle this inconsistently:

| Method | Problem | |--------|---------| | DataFrame.rank() then corrwith() | rank() assigns NaN ranks; corrwith() may drop globally or per-column inconsistently | | DataFrame.corrwith(method='spearman') | Implementation varies by pandas version; may use listwise deletion | | np.corrcoef on ranked data | Propagates NaN to entire result if any value is missing |

Impact of Incorrect Computation

Correlations can shift by 0.01-0.05 or more
Features near a threshold (e.g., 0.6) can be misclassified
Valid sample count per feature is unknown (may silently use fewer samples than expected)

Degenerate Inputs

Features that produce undefined or unstable correlations:

| Type | Description | Effect | |------|-------------|--------| | Constant features | All values identical (variance = 0) | Correlation undefined (division by zero) | | Near-constant features | Very low variance | Correlation numerically unstable | | Too few valid values | After NaN removal, fewer than min_valid pairs | Statistically unreliable | | Single-value after filtering | Only one unique value remains post-NaN removal | Correlation undefined |

Decision Framework

Do you have missing values (NaN) in your feature matrix?
├── No NaN at all → Bulk methods are safe (corrwith, np.corrcoef)
└── Yes, NaN present
    ├── Same NaN pattern across all features? → Listwise deletion is acceptable
    └── Different NaN patterns per feature (typical)
        ├── < 10,000 features → Per-feature loop with scipy.stats.spearmanr
        └── > 10,000 features → Parallelized per-feature loop (joblib)

| Scenario | Recommended Approach | Rationale | |----------|---------------------|-----------| | No missing data | DataFrame.corrwith() | Fast, correct when no NaN | | Sparse NaN, < 10K features | Per-feature spearmanr loop | Correct pairwise deletion, acceptable speed | | Sparse NaN, > 10K features | Parallelized per-feature loop | Same correctness, scales with cores | | Dense NaN (> 50% missing) | Per-feature loop + strict min_valid | Many features will be skipped; report skip count | | Uniform NaN pattern | Listwise deletion + bulk method | If all features share same NaN rows, pairwise = listwise |

Best Practices

Always print NaN summary before analysis: Report total NaN count, features with any NaN, and per-feature NaN distribution. This documents data quality and alerts you to severe missingness patterns.
Use scipy.stats.spearmanr per feature in a loop: This is the only method that guarantees correct pairwise NaN removal for each feature independently.
Set a minimum valid pair threshold (min_valid): Default to 10. Features with fewer valid pairs after NaN removal produce unreliable correlations and should be skipped with NaN.
Filter degenerate inputs before computing correlations: Remove constant features, near-constant features, and features with excessive NaN before the correlation loop. This avoids undefined results and speeds up computation.
Track n_valid per feature in the output: The number of valid pairs varies per feature. Report it alongside rho and p-value so downstream analysis can assess reliability.
Report how many features were skipped or filtered: Silent feature loss is a common source of confusion. Always print the count of filtered degenerate features and skipped low-data features.
Use parallelization for large datasets: For > 10,000 features, use joblib to distribute the per-feature loop across cores. The per-feature computation is embarrassingly parallel.

Common Pitfalls

Using bulk rank-then-correlate with NaN present: df.rank() followed by corrwith() silently mishandles NaN, producing incorrect correlations.
- How to avoid: Always use scipy.stats.spearmanr per feature when NaN is present.
Assuming uniform sample count across features: Different features have different NaN patterns, so each correlation is computed on a different number of samples.
- How to avoid: Track and report n_valid for every feature.
Not filtering degenerate inputs: Constant or near-constant features produce undefined correlations or divide-by-zero warnings that can silently corrupt results.
- How to avoid: Run filter_degenerate() before the correlation loop.
Using listwise deletion when NaN patterns differ: Listwise deletion removes any row with NaN in any feature, potentially discarding most of your data.
- How to avoid: Use pairwise deletion (per-feature NaN removal).
Ignoring the NaN summary step: Skipping the data quality report means you cannot verify whether the NaN pattern is severe enough to affect results.
- How to avoid: Always print NaN summary before correlation computation.
Setting min_valid too low: With fewer than ~10 valid pairs, Spearman correlation is unreliable and p-values are meaningless.
- How to avoid: Use min_valid >= 10; increase for high-dimensional studies.

Workflow

Step 1: Print NaN Summary
- Report dataset shape, total NaN, features with any NaN, mean/max NaN per feature
- Decision point: If > 50% NaN overall, reconsider data quality before proceeding
Step 2: Filter Degenerate Features
- Remove constant features (nunique < 3)
- Remove features with excessive NaN (< 50% non-NaN)
- Report count of removed features
Step 3: Compute Per-Feature Correlations
- Loop over features with scipy.stats.spearmanr
- Apply pairwise NaN removal per feature
- Skip features with < min_valid valid pairs (record as NaN)
Step 4: Assemble and Report Results
- Create DataFrame with rho, p-value, n_valid per feature
- Report count of skipped features
- Verify no silent data loss (input features = output features + skipped + filtered)

Reference Implementation

from scipy.stats import spearmanr
import numpy as np
import pandas as pd

def nan_summary(df):
    """Print NaN summary before correlation analysis."""
    print(f"Dataset shape: {df.shape}")
    print(f"Total NaN: {df.isna().sum().sum()}")
    print(f"Features with any NaN: {(df.isna().any()).sum()}")
    print(f"NaN per feature (mean): {df.isna().sum().mean():.1f}")
    print(f"NaN per feature (max): {df.isna().sum().max()}")

def filter_degenerate(df, min_unique=3, min_nonnan_frac=0.5):
    """Remove degenerate features before correlation analysis.

    Args:
        df: DataFrame (samples x features)
        min_unique: Minimum number of unique non-NaN values required
        min_nonnan_frac: Minimum fraction of non-NaN values required

    Returns:
        Filtered DataFrame, count of removed features
    """
    n_samples = len(df)
    keep = []
    for col in df.columns:
        values = df[col].dropna()
        if len(values) < n_samples * min_nonnan_frac:
            continue
        if values.nunique() < min_unique:
            continue
        keep.append(col)
    removed = len(df.columns) - len(keep)
    print(f"Filtered {removed} degenerate features out of {len(df.columns)}")
    return df[keep], removed

def pairwise_spearman(df_x, df_y, min_valid=10):
    """Compute per-feature Spearman correlation with pairwise NaN removal.

    Args:
        df_x: DataFrame (samples x features), aligned with df_y
        df_y: DataFrame (samples x features), same shape as df_x
        min_valid: Minimum number of valid (non-NaN) pairs required

    Returns:
        DataFrame with columns: rho, pvalue, n_valid
    """
    nan_summary(df_x)
    nan_summary(df_y)

    results = []
    for feature in df_x.columns:
        x = df_x[feature].values
        y = df_y[feature].values
        mask = ~(np.isnan(x) | np.isnan(y))
        n_valid = mask.sum()
        if n_valid < min_valid:
            results.append({'feature': feature, 'rho': np.nan,
                          'pvalue': np.nan, 'n_valid': n_valid})
            continue
        rho, pval = spearmanr(x[mask], y[mask])
        results.append({'feature': feature, 'rho': rho,
                       'pvalue': pval, 'n_valid': n_valid})

    result_df = pd.DataFrame(results).set_index('feature')
    skipped = result_df['rho'].isna().sum()
    if skipped > 0:
        print(f"Skipped {skipped} features with < {min_valid} valid pairs")
    return result_df

Anti-Patterns

# WRONG: Bulk rank-then-correlate
ranked_x = df_x.rank()
ranked_y = df_y.rank()
corrs = ranked_x.corrwith(ranked_y)

# WRONG: Bulk corrwith with method parameter
corrs = df_x.corrwith(df_y, method='spearman')

# WRONG: numpy corrcoef on ranked arrays (propagates NaN)
corrs = np.corrcoef(df_x.rank().values.T, df_y.rank().values.T)

Performance Optimization (> 10,000 features)

from joblib import Parallel, delayed

def parallel_spearman(df_x, df_y, min_valid=10, n_jobs=4):
    """Parallelized per-feature Spearman correlation."""
    def compute_one(feature):
        x = df_x[feature].values
        y = df_y[feature].values
        mask = ~(np.isnan(x) | np.isnan(y))
        n = mask.sum()
        if n < min_valid:
            return feature, np.nan, np.nan, n
        rho, pval = spearmanr(x[mask], y[mask])
        return feature, rho, pval, n

    results = Parallel(n_jobs=n_jobs)(
        delayed(compute_one)(f) for f in df_x.columns
    )
    return pd.DataFrame(
        results, columns=['feature', 'rho', 'pvalue', 'n_valid']
    ).set_index('feature')

Related Skills

statistical-analysis -- General statistical test selection and assumption checking
degenerate-input-filtering -- Broader guide on filtering uninformative data before any statistical test
scikit-learn-machine-learning -- Feature selection and preprocessing pipelines

NaN-Safe Correlation Computation

Overview

Key Concepts

Pairwise vs Listwise Deletion

Pairwise deletion: For each feature pair, remove only samples where either value is NaN. Each feature uses the maximum available data.
Listwise deletion: Remove any sample with NaN in any feature. Wastes valid data and biases results if missingness is not completely random.
Rule: Always use pairwise deletion for per-feature correlations.

Why Bulk Matrix Shortcuts Fail

Different features have different missing value patterns across samples. Bulk methods handle this inconsistently:

Impact of Incorrect Computation

Correlations can shift by 0.01-0.05 or more
Features near a threshold (e.g., 0.6) can be misclassified
Valid sample count per feature is unknown (may silently use fewer samples than expected)

Degenerate Inputs

Features that produce undefined or unstable correlations:

Decision Framework

Do you have missing values (NaN) in your feature matrix?
├── No NaN at all → Bulk methods are safe (corrwith, np.corrcoef)
└── Yes, NaN present
    ├── Same NaN pattern across all features? → Listwise deletion is acceptable
    └── Different NaN patterns per feature (typical)
        ├── < 10,000 features → Per-feature loop with scipy.stats.spearmanr
        └── > 10,000 features → Parallelized per-feature loop (joblib)

Best Practices

Always print NaN summary before analysis: Report total NaN count, features with any NaN, and per-feature NaN distribution. This documents data quality and alerts you to severe missingness patterns.
Use scipy.stats.spearmanr per feature in a loop: This is the only method that guarantees correct pairwise NaN removal for each feature independently.
Set a minimum valid pair threshold (min_valid): Default to 10. Features with fewer valid pairs after NaN removal produce unreliable correlations and should be skipped with NaN.
Filter degenerate inputs before computing correlations: Remove constant features, near-constant features, and features with excessive NaN before the correlation loop. This avoids undefined results and speeds up computation.
Track n_valid per feature in the output: The number of valid pairs varies per feature. Report it alongside rho and p-value so downstream analysis can assess reliability.
Report how many features were skipped or filtered: Silent feature loss is a common source of confusion. Always print the count of filtered degenerate features and skipped low-data features.
Use parallelization for large datasets: For > 10,000 features, use joblib to distribute the per-feature loop across cores. The per-feature computation is embarrassingly parallel.

Common Pitfalls

Using bulk rank-then-correlate with NaN present: df.rank() followed by corrwith() silently mishandles NaN, producing incorrect correlations.
- How to avoid: Always use scipy.stats.spearmanr per feature when NaN is present.
Assuming uniform sample count across features: Different features have different NaN patterns, so each correlation is computed on a different number of samples.
- How to avoid: Track and report n_valid for every feature.
Not filtering degenerate inputs: Constant or near-constant features produce undefined correlations or divide-by-zero warnings that can silently corrupt results.
- How to avoid: Run filter_degenerate() before the correlation loop.
Using listwise deletion when NaN patterns differ: Listwise deletion removes any row with NaN in any feature, potentially discarding most of your data.
- How to avoid: Use pairwise deletion (per-feature NaN removal).
Ignoring the NaN summary step: Skipping the data quality report means you cannot verify whether the NaN pattern is severe enough to affect results.
- How to avoid: Always print NaN summary before correlation computation.
Setting min_valid too low: With fewer than ~10 valid pairs, Spearman correlation is unreliable and p-values are meaningless.
- How to avoid: Use min_valid >= 10; increase for high-dimensional studies.

Workflow

Step 1: Print NaN Summary
- Report dataset shape, total NaN, features with any NaN, mean/max NaN per feature
- Decision point: If > 50% NaN overall, reconsider data quality before proceeding
Step 2: Filter Degenerate Features
- Remove constant features (nunique < 3)
- Remove features with excessive NaN (< 50% non-NaN)
- Report count of removed features
Step 3: Compute Per-Feature Correlations
- Loop over features with scipy.stats.spearmanr
- Apply pairwise NaN removal per feature
- Skip features with < min_valid valid pairs (record as NaN)
Step 4: Assemble and Report Results
- Create DataFrame with rho, p-value, n_valid per feature
- Report count of skipped features
- Verify no silent data loss (input features = output features + skipped + filtered)

Reference Implementation

from scipy.stats import spearmanr
import numpy as np
import pandas as pd

def nan_summary(df):
    """Print NaN summary before correlation analysis."""
    print(f"Dataset shape: {df.shape}")
    print(f"Total NaN: {df.isna().sum().sum()}")
    print(f"Features with any NaN: {(df.isna().any()).sum()}")
    print(f"NaN per feature (mean): {df.isna().sum().mean():.1f}")
    print(f"NaN per feature (max): {df.isna().sum().max()}")

def filter_degenerate(df, min_unique=3, min_nonnan_frac=0.5):
    """Remove degenerate features before correlation analysis.

    Args:
        df: DataFrame (samples x features)
        min_unique: Minimum number of unique non-NaN values required
        min_nonnan_frac: Minimum fraction of non-NaN values required

    Returns:
        Filtered DataFrame, count of removed features
    """
    n_samples = len(df)
    keep = []
    for col in df.columns:
        values = df[col].dropna()
        if len(values) < n_samples * min_nonnan_frac:
            continue
        if values.nunique() < min_unique:
            continue
        keep.append(col)
    removed = len(df.columns) - len(keep)
    print(f"Filtered {removed} degenerate features out of {len(df.columns)}")
    return df[keep], removed

def pairwise_spearman(df_x, df_y, min_valid=10):
    """Compute per-feature Spearman correlation with pairwise NaN removal.

    Args:
        df_x: DataFrame (samples x features), aligned with df_y
        df_y: DataFrame (samples x features), same shape as df_x
        min_valid: Minimum number of valid (non-NaN) pairs required

    Returns:
        DataFrame with columns: rho, pvalue, n_valid
    """
    nan_summary(df_x)
    nan_summary(df_y)

    results = []
    for feature in df_x.columns:
        x = df_x[feature].values
        y = df_y[feature].values
        mask = ~(np.isnan(x) | np.isnan(y))
        n_valid = mask.sum()
        if n_valid < min_valid:
            results.append({'feature': feature, 'rho': np.nan,
                          'pvalue': np.nan, 'n_valid': n_valid})
            continue
        rho, pval = spearmanr(x[mask], y[mask])
        results.append({'feature': feature, 'rho': rho,
                       'pvalue': pval, 'n_valid': n_valid})

    result_df = pd.DataFrame(results).set_index('feature')
    skipped = result_df['rho'].isna().sum()
    if skipped > 0:
        print(f"Skipped {skipped} features with < {min_valid} valid pairs")
    return result_df

Anti-Patterns

# WRONG: Bulk rank-then-correlate
ranked_x = df_x.rank()
ranked_y = df_y.rank()
corrs = ranked_x.corrwith(ranked_y)

# WRONG: Bulk corrwith with method parameter
corrs = df_x.corrwith(df_y, method='spearman')

# WRONG: numpy corrcoef on ranked arrays (propagates NaN)
corrs = np.corrcoef(df_x.rank().values.T, df_y.rank().values.T)

Performance Optimization (> 10,000 features)

from joblib import Parallel, delayed

def parallel_spearman(df_x, df_y, min_valid=10, n_jobs=4):
    """Parallelized per-feature Spearman correlation."""
    def compute_one(feature):
        x = df_x[feature].values
        y = df_y[feature].values
        mask = ~(np.isnan(x) | np.isnan(y))
        n = mask.sum()
        if n < min_valid:
            return feature, np.nan, np.nan, n
        rho, pval = spearmanr(x[mask], y[mask])
        return feature, rho, pval, n

    results = Parallel(n_jobs=n_jobs)(
        delayed(compute_one)(f) for f in df_x.columns
    )
    return pd.DataFrame(
        results, columns=['feature', 'rho', 'pvalue', 'n_valid']
    ).set_index('feature')

Related Skills

statistical-analysis -- General statistical test selection and assumption checking
degenerate-input-filtering -- Broader guide on filtering uninformative data before any statistical test
scikit-learn-machine-learning -- Feature selection and preprocessing pipelines

Adoption

jaechang-hits/nan-safe-correlation

$ install --global

Security Scan Results

SKILL.md

NaN-Safe Correlation Computation

Overview

Key Concepts

Pairwise vs Listwise Deletion

Why Bulk Matrix Shortcuts Fail

Impact of Incorrect Computation

Degenerate Inputs

Decision Framework

Best Practices

Common Pitfalls

Workflow

Reference Implementation

Anti-Patterns

Performance Optimization (> 10,000 features)

Further Reading

Related Skills

Related Skills

jaechang-hits/deseq2-differential-expression

jaechang-hits/vcf-variant-filtering

jaechang-hits/snpeff-variant-annotation

jaechang-hits/plink2-gwas-analysis

jaechang-hits/nan-safe-correlation

$ install --global

Security Scan Results

SKILL.md

NaN-Safe Correlation Computation

Overview

Key Concepts

Pairwise vs Listwise Deletion

Why Bulk Matrix Shortcuts Fail

Impact of Incorrect Computation

Degenerate Inputs

Decision Framework

Best Practices

Common Pitfalls

Workflow

Reference Implementation

Anti-Patterns

Performance Optimization (> 10,000 features)

Further Reading

Related Skills

Related Skills

jaechang-hits/deseq2-differential-expression

jaechang-hits/vcf-variant-filtering

jaechang-hits/snpeff-variant-annotation

jaechang-hits/plink2-gwas-analysis