skills/biostatistics/nan-safe-correlation/SKILL.md
Per-feature NaN-safe Spearman/Pearson correlation across many features (genes, proteins, variants) with missing values. Covers why bulk matrix shortcuts fail, correct pairwise deletion, degenerate input filtering, and large-dataset performance. Use statistical-analysis for test choice; shap-model-explainability for interpretability.
npx skillsauth add jaechang-hits/scicraft nan-safe-correlationInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Computing correlations across many features (genes, proteins, variants) when missing values are present is error-prone. The most common mistake is using bulk matrix shortcuts that silently mishandle NaN, producing incorrect correlation values. This guide covers correct per-feature pairwise computation, degenerate input filtering, and performance optimization.
Different features have different missing value patterns across samples. Bulk methods handle this inconsistently:
| Method | Problem |
|--------|---------|
| DataFrame.rank() then corrwith() | rank() assigns NaN ranks; corrwith() may drop globally or per-column inconsistently |
| DataFrame.corrwith(method='spearman') | Implementation varies by pandas version; may use listwise deletion |
| np.corrcoef on ranked data | Propagates NaN to entire result if any value is missing |
Features that produce undefined or unstable correlations:
| Type | Description | Effect | |------|-------------|--------| | Constant features | All values identical (variance = 0) | Correlation undefined (division by zero) | | Near-constant features | Very low variance | Correlation numerically unstable | | Too few valid values | After NaN removal, fewer than min_valid pairs | Statistically unreliable | | Single-value after filtering | Only one unique value remains post-NaN removal | Correlation undefined |
Do you have missing values (NaN) in your feature matrix?
├── No NaN at all → Bulk methods are safe (corrwith, np.corrcoef)
└── Yes, NaN present
├── Same NaN pattern across all features? → Listwise deletion is acceptable
└── Different NaN patterns per feature (typical)
├── < 10,000 features → Per-feature loop with scipy.stats.spearmanr
└── > 10,000 features → Parallelized per-feature loop (joblib)
| Scenario | Recommended Approach | Rationale |
|----------|---------------------|-----------|
| No missing data | DataFrame.corrwith() | Fast, correct when no NaN |
| Sparse NaN, < 10K features | Per-feature spearmanr loop | Correct pairwise deletion, acceptable speed |
| Sparse NaN, > 10K features | Parallelized per-feature loop | Same correctness, scales with cores |
| Dense NaN (> 50% missing) | Per-feature loop + strict min_valid | Many features will be skipped; report skip count |
| Uniform NaN pattern | Listwise deletion + bulk method | If all features share same NaN rows, pairwise = listwise |
Always print NaN summary before analysis: Report total NaN count, features with any NaN, and per-feature NaN distribution. This documents data quality and alerts you to severe missingness patterns.
Use scipy.stats.spearmanr per feature in a loop: This is the only method that guarantees correct pairwise NaN removal for each feature independently.
Set a minimum valid pair threshold (min_valid): Default to 10. Features with fewer valid pairs after NaN removal produce unreliable correlations and should be skipped with NaN.
Filter degenerate inputs before computing correlations: Remove constant features, near-constant features, and features with excessive NaN before the correlation loop. This avoids undefined results and speeds up computation.
Track n_valid per feature in the output: The number of valid pairs varies per feature. Report it alongside rho and p-value so downstream analysis can assess reliability.
Report how many features were skipped or filtered: Silent feature loss is a common source of confusion. Always print the count of filtered degenerate features and skipped low-data features.
Use parallelization for large datasets: For > 10,000 features, use joblib to distribute the per-feature loop across cores. The per-feature computation is embarrassingly parallel.
Using bulk rank-then-correlate with NaN present: df.rank() followed by corrwith() silently mishandles NaN, producing incorrect correlations.
scipy.stats.spearmanr per feature when NaN is present.Assuming uniform sample count across features: Different features have different NaN patterns, so each correlation is computed on a different number of samples.
n_valid for every feature.Not filtering degenerate inputs: Constant or near-constant features produce undefined correlations or divide-by-zero warnings that can silently corrupt results.
filter_degenerate() before the correlation loop.Using listwise deletion when NaN patterns differ: Listwise deletion removes any row with NaN in any feature, potentially discarding most of your data.
Ignoring the NaN summary step: Skipping the data quality report means you cannot verify whether the NaN pattern is severe enough to affect results.
Setting min_valid too low: With fewer than ~10 valid pairs, Spearman correlation is unreliable and p-values are meaningless.
Step 1: Print NaN Summary
Step 2: Filter Degenerate Features
Step 3: Compute Per-Feature Correlations
scipy.stats.spearmanrStep 4: Assemble and Report Results
from scipy.stats import spearmanr
import numpy as np
import pandas as pd
def nan_summary(df):
"""Print NaN summary before correlation analysis."""
print(f"Dataset shape: {df.shape}")
print(f"Total NaN: {df.isna().sum().sum()}")
print(f"Features with any NaN: {(df.isna().any()).sum()}")
print(f"NaN per feature (mean): {df.isna().sum().mean():.1f}")
print(f"NaN per feature (max): {df.isna().sum().max()}")
def filter_degenerate(df, min_unique=3, min_nonnan_frac=0.5):
"""Remove degenerate features before correlation analysis.
Args:
df: DataFrame (samples x features)
min_unique: Minimum number of unique non-NaN values required
min_nonnan_frac: Minimum fraction of non-NaN values required
Returns:
Filtered DataFrame, count of removed features
"""
n_samples = len(df)
keep = []
for col in df.columns:
values = df[col].dropna()
if len(values) < n_samples * min_nonnan_frac:
continue
if values.nunique() < min_unique:
continue
keep.append(col)
removed = len(df.columns) - len(keep)
print(f"Filtered {removed} degenerate features out of {len(df.columns)}")
return df[keep], removed
def pairwise_spearman(df_x, df_y, min_valid=10):
"""Compute per-feature Spearman correlation with pairwise NaN removal.
Args:
df_x: DataFrame (samples x features), aligned with df_y
df_y: DataFrame (samples x features), same shape as df_x
min_valid: Minimum number of valid (non-NaN) pairs required
Returns:
DataFrame with columns: rho, pvalue, n_valid
"""
nan_summary(df_x)
nan_summary(df_y)
results = []
for feature in df_x.columns:
x = df_x[feature].values
y = df_y[feature].values
mask = ~(np.isnan(x) | np.isnan(y))
n_valid = mask.sum()
if n_valid < min_valid:
results.append({'feature': feature, 'rho': np.nan,
'pvalue': np.nan, 'n_valid': n_valid})
continue
rho, pval = spearmanr(x[mask], y[mask])
results.append({'feature': feature, 'rho': rho,
'pvalue': pval, 'n_valid': n_valid})
result_df = pd.DataFrame(results).set_index('feature')
skipped = result_df['rho'].isna().sum()
if skipped > 0:
print(f"Skipped {skipped} features with < {min_valid} valid pairs")
return result_df
# WRONG: Bulk rank-then-correlate
ranked_x = df_x.rank()
ranked_y = df_y.rank()
corrs = ranked_x.corrwith(ranked_y)
# WRONG: Bulk corrwith with method parameter
corrs = df_x.corrwith(df_y, method='spearman')
# WRONG: numpy corrcoef on ranked arrays (propagates NaN)
corrs = np.corrcoef(df_x.rank().values.T, df_y.rank().values.T)
from joblib import Parallel, delayed
def parallel_spearman(df_x, df_y, min_valid=10, n_jobs=4):
"""Parallelized per-feature Spearman correlation."""
def compute_one(feature):
x = df_x[feature].values
y = df_y[feature].values
mask = ~(np.isnan(x) | np.isnan(y))
n = mask.sum()
if n < min_valid:
return feature, np.nan, np.nan, n
rho, pval = spearmanr(x[mask], y[mask])
return feature, rho, pval, n
results = Parallel(n_jobs=n_jobs)(
delayed(compute_one)(f) for f in df_x.columns
)
return pd.DataFrame(
results, columns=['feature', 'rho', 'pvalue', 'n_valid']
).set_index('feature')
statistical-analysis -- General statistical test selection and assumption checkingdegenerate-input-filtering -- Broader guide on filtering uninformative data before any statistical testscikit-learn-machine-learning -- Feature selection and preprocessing pipelinestools
Fast short-read DNA aligner for WGS/WES/ChIP-seq. 2× faster BWA-MEM successor; outputs SAM/BAM with read group headers for GATK. Primary plus supplementary records for chimeric reads. Use STAR for RNA-seq splice-aware alignment; Bowtie2 is a comparable alternative.
tools
smina molecular docking CLI. AutoDock Vina fork with customizable scoring functions, native SDF/MOL2/PDB ligand input, autoboxing, local energy minimization, and per-atom score breakdowns. Pipeline: receptor PDBQT prep -> ligand prep (RDKit/OpenBabel) -> dock via autobox or explicit grid -> rescore/minimize with custom scoring -> rank poses by affinity. Choose smina over Vina when you need custom scoring terms (--custom_scoring), local optimization of an existing pose (--local_only), per-atom contributions (--atom_term_data), or SDF/MOL2 ligands without manual PDBQT conversion. For unknown binding sites use diffdock-blind-docking; for the Python-bindings/Vinardo workflow use autodock-vina-docking.
development
mdtraj molecular dynamics trajectory analysis (Python). Reads DCD/XTC/TRR/NetCDF/H5/PDB topologies and trajectories; computes RMSD vs time, radius of gyration, per-residue RMSF, residue-residue contact frequency maps, phi/psi torsions for Ramachandran plots (general + Gly/Pro), and 8-state DSSP secondary structure. Modules: trajectory I/O, geometry (distances/angles/dihedrals), structural analysis (RMSD/Rg/RMSF/SASA), contacts, hydrogen bonds, secondary structure (DSSP), NMR observables. For broader atom-selection grammar use mdanalysis-trajectory; for running MD simulations use OpenMM/GROMACS.
development
Programmatic PubMed access via NCBI E-utilities REST API. Covers Boolean/MeSH queries, field-tagged search, endpoints (ESearch, EFetch, ESummary, EPost, ELink), history server for batches, citation matching, systematic review strategies. Use for biomedical literature search or automated pipelines.