proteomics/differential-abundance/SKILL.md
Statistical testing for differentially abundant proteins between conditions. Covers preprocessing (log2 transformation, normalization), limma and DEqMS workflows with empirical Bayes moderation, fold change shrinkage for accurate effect size estimation, and Python alternatives. Use when identifying proteins with significant abundance changes between experimental groups.
npx skillsauth add GPTomics/bioSkills bio-proteomics-differential-abundanceInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: limma 3.58+, DEqMS 1.20+, ashr 2.2+, proDA 1.20+, numpy 1.26+, pandas 2.2+, scipy 1.12+, statsmodels 0.14+
Before using code patterns, verify installed versions match. If versions differ:
pip show <package> then help(module.function) to check signaturespackageVersion('<pkg>') then ?function_name to verify parametersIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
"Find differentially abundant proteins between my conditions" -> Perform statistical testing on protein intensities to identify significant abundance changes.
limma::eBayes() for empirical Bayes moderated t-tests (preferred for small n)DEqMS::spectraCounteBayes() when PSM/peptide count metadata is availableproDA::test_diff() when missing values are extensive (label-free)scipy.stats.ttest_ind(equal_var=False) with statsmodels BH correctionRaw mass spectrometry intensities require log2 transformation and normalization before statistical testing.
Goal: Convert right-skewed raw intensities to approximately normal distributions with stabilized variance.
Approach: Apply log2 to all intensity values. Replace zeros (undetected values) with NaN before transformation to avoid -inf.
log2_data = np.log2(intensities.replace(0, np.nan))
log2_matrix <- log2(intensity_matrix)
log2_matrix[!is.finite(log2_matrix)] <- NA
Goal: Remove systematic technical biases (sample loading, instrument drift) so that observed differences reflect biology.
Approach: Choose a normalization method based on data characteristics. All methods assume the majority of proteins are not differentially abundant.
Median normalization -- subtract per-sample median so all samples share a common center:
sample_medians = log2_data.median(axis=0)
global_median = sample_medians.median()
normalized = log2_data - sample_medians + global_median
normalized <- normalizeBetweenArrays(log2_matrix, method = 'scale')
| Method | When to use | R function |
|--------|-------------|------------|
| Median centering | Default for most analyses; robust to missing values | Manual or normalizeBetweenArrays(method='scale') |
| Cyclic loess | Unbalanced DE (more up- than down-regulated) | normalizeBetweenArrays(method='cyclicloess') |
| VSN | Heteroscedastic data; operates on raw intensities (skip log2) | vsn::justvsn(raw_matrix) |
| Quantile | TMT with complete data; avoid with many missing values | normalizeBetweenArrays(method='quantile') |
| Scenario | Recommended | Rationale | |----------|-------------|-----------| | Small n (3-5 per group), protein-level data | limma | Borrows variance across proteins via empirical Bayes; adds ~10-20 effective df | | PSM/peptide count metadata available | DEqMS | Weights variance by quantification depth per protein | | Label-free with many missing values (>20%) | proDA | Models abundance-dependent dropout; no imputation needed | | Large n (>10 per group), Python-only environment | Welch's t-test + BH | Variance estimates reliable at larger sample sizes | | Complex designs (nested, multiple comparisons) | MSstats | Feature-level mixed models; handles technical replicates |
With small sample sizes (n=3-5), simple t-tests have only 4-6 degrees of freedom for variance estimation, making per-protein variances extremely noisy. Some proteins get artificially low variance (false positives), others artificially high (false negatives). limma's empirical Bayes shrinks each variance toward the global trend, dramatically improving calibration.
Goal: Identify differentially abundant proteins using moderated statistics that borrow information across all proteins.
Approach: Build a linear model from the design matrix, fit contrasts, apply empirical Bayes moderation with intensity-dependent variance trend and robust fitting, then extract BH-corrected results.
library(limma)
design <- model.matrix(~0 + condition, data = sample_info)
colnames(design) <- levels(factor(sample_info$condition))
fit <- lmFit(protein_matrix, design)
contrast_matrix <- makeContrasts(Treatment - Control, levels = design)
fit2 <- contrasts.fit(fit, contrast_matrix)
fit2 <- eBayes(fit2, trend = TRUE, robust = TRUE)
results <- topTable(fit2, coef = 1, number = Inf, adjust.method = 'BH')
trend=TRUE allows the prior variance to depend on mean intensity. robust=TRUE protects against hyper-variable outlier proteins. Results columns: logFC, AveExpr, t, P.Value, adj.P.Val, B.
For batch effects, include batch in the design matrix (do NOT use removeBatchEffect before testing -- that function is for visualization only):
design <- model.matrix(~0 + condition + batch, data = sample_info)
Goal: Improve upon limma by accounting for the relationship between quantification depth and variance -- proteins quantified by more PSMs/peptides have more precise abundance estimates.
Approach: Run the standard limma pipeline, attach PSM/peptide counts, then apply DEqMS's count-aware empirical Bayes that fits a variance-vs-count regression.
library(DEqMS)
# Standard limma pipeline through eBayes (see above), then:
fit2$count <- psm_count_per_protein[rownames(fit2$coefficients)]
fit3 <- spectraCounteBayes(fit2)
results <- outputResult(fit3, coef_col = 1)
Results include limma columns plus DEqMS-specific: sca.t, sca.P.Value, sca.adj.pval.
library(proDA)
fit <- proDA(protein_matrix, design = ~condition, col_data = sample_info,
reference_level = 'Control')
results <- test_diff(fit, conditionTreatment - conditionControl)
Results columns: name, pval, adj_pval, diff (log2FC), t_statistic, se.
Goal: Perform the full differential abundance pipeline in Python: preprocessing, statistical testing, and multiple testing correction.
Approach: Log2-transform and median-normalize raw intensities, run per-protein Welch's t-tests, and apply Benjamini-Hochberg correction.
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.multitest import multipletests
def preprocess(intensities):
log2_data = np.log2(intensities.replace(0, np.nan))
sample_medians = log2_data.median(axis=0)
global_median = sample_medians.median()
return log2_data - sample_medians + global_median
def differential_abundance(normalized, case_cols, ctrl_cols):
results = []
for protein in normalized.index:
case = normalized.loc[protein, case_cols].dropna()
ctrl = normalized.loc[protein, ctrl_cols].dropna()
if len(case) >= 2 and len(ctrl) >= 2:
log2fc = case.mean() - ctrl.mean()
_, pval = stats.ttest_ind(case, ctrl, equal_var=False)
results.append({'protein': protein, 'log2fc': log2fc, 'pvalue': pval})
df = pd.DataFrame(results)
df['padj'] = multipletests(df['pvalue'], method='fdr_bh')[1]
return df
Key details:
equal_var=False selects Welch's t-test (scipy defaults to Student's with equal_var=True)multipletests defaults to Holm-Sidak -- always pass method='fdr_bh' explicitlyRaw fold changes from the linear model or t-test are the best unbiased point estimates of the true effect, but they are noisy -- proteins with no real abundance change still show small nonzero estimates from measurement noise. How to handle this depends on the downstream use case.
Report the unmodified log2 fold change from the statistical test when:
Apply shrinkage when the goal is to recover which proteins truly changed and by how much -- i.e., effect size accuracy matters more than preserving the full distribution. ashr (R) fits a mixture prior with a point mass at zero and estimates posterior means, smoothly shrinking uncertain effects toward zero while preserving well-supported ones:
library(ashr)
se <- sqrt(fit2$s2.post) * fit2$stdev.unscaled[, 1]
shrunk <- ash(fit2$coefficients[, 1], se, mixcompdist = 'normal')
shrunken_fc <- shrunk$result$PosteriorMean
lfsr <- shrunk$result$lfsr
ashr is preferred over hard-thresholding (zeroing FCs at a p-value cutoff) because it shrinks smoothly based on per-protein uncertainty rather than applying an arbitrary step function at padj = 0.05. Hard zeroing discards information and creates artificial discontinuities -- a protein at padj 0.049 keeps its full FC while one at 0.051 is set to zero.
In Python without ashr, there is no mature equivalent. If effect size accuracy is critical, use R with ashr. For Python-only environments, report raw fold changes with adjusted p-values and let the downstream analysis handle thresholding.
To test whether fold changes exceed a biologically meaningful threshold (rather than just differ from zero), use treat() + topTreat() instead of post-hoc FC filtering with topTable(lfc=...), which can inflate FDR:
fit2 <- treat(fit2, lfc = log2(1.2))
results <- topTreat(fit2, coef = 1, number = Inf)
library(ggplot2)
ggplot(results, aes(x = logFC, y = -log10(adj.P.Val))) +
geom_point(aes(color = significant), alpha = 0.6) +
geom_hline(yintercept = -log10(0.05), linetype = 'dashed') +
geom_vline(xintercept = c(-1, 1), linetype = 'dashed') +
scale_color_manual(values = c('grey60', 'firebrick')) +
theme_minimal() + labs(x = 'Log2 Fold Change', y = '-Log10 Adjusted P-value')
scipy.stats.ttest_ind defaults to equal_var=True; always set equal_var=False (Welch's) since treatment can affect both mean and varianceremoveBatchEffect() before testing -- this function is for visualization only; include batch as a covariate in the design matrix for statistical testingtopTable(lfc=...) -- can inflate FDR; use treat() + topTreat() for minimum-effect-size testingtools
--- name: bio-phasing-imputation-foundations description: Frames the phasing/imputation pipeline before any tool runs: phasing and imputation are one Li-Stephens copying HMM (recombination is the transition, mutation the emission, the genetic map and Ne set the rates), imputation's honest output is a dosage with a self-estimated quality (INFO/R2/DR2) not a hard genotype, and the stages are ordered and each fails silently (QC, align build and strand to the panel, phase, impute per chromosome, fil
tools
Chooses the enrichment generation before any tool runs, mapping the input shape to a method class - a pre-selected gene list plus a background to over-representation analysis (ORA, hypergeometric), a ranked statistic for all genes to gene set enrichment (GSEA), a signed signaling topology to pathway-topology (SPIA) - then making the null explicit (competitive vs self-contained, gene vs subject sampling) and running a trustworthiness checklist (testable-gene universe, FDR, redundancy collapse, leading-edge check, version reporting). Covers why every clusterProfiler GSEA is the inter-gene-correlation-uncorrected competitive null, why the background not the gene list decides ORA significance, and why no method is universally best. Use when deciding ORA vs GSEA vs topology, which gene-set DB, whether a result is trustworthy, or which null a tool computes. For ORA see go-enrichment, GSEA see gsea, databases kegg-pathways/reactome-pathways/wikipathways; the ranking comes from differential-expression/de-results.
testing
End-to-end GWAS workflow from VCF to association results. Covers PLINK QC, population structure correction, and association testing for case-control or quantitative traits. Use when running genome-wide association studies.
development
Orchestrates the full path from differential expression results to redundancy-collapsed functional enrichment: choose ORA vs GSEA, convert gene IDs per method, run enrichGO/enrichKEGG/enrichPathway/enrichWP or gseGO/gseKEGG (clusterProfiler, ReactomePA, rWikiPathways), and visualize. Routes the ORA-vs-GSEA generation fork and the null/universe/reproducibility theory to pathway-analysis/enrichment-foundations. Use when a DESeq2/edgeR/limma result must become enriched GO terms, KEGG/Reactome/WikiPathways pathways, or a GSEA leading edge; when deciding whether a ranking exists for all genes (GSEA, named decreasing vector) or only a pre-selected list (ORA plus a defensible background universe); or when assembling DE-to-pathway end to end. The DE list and ranking statistic come from differential-expression/de-results; per-method nuance lives in the pathway-analysis skills.