flow-cytometry/differential-analysis/SKILL.md
Differential abundance (DA) and differential state (DS) analysis for flow and mass cytometry - tests which cell populations change in frequency or marker expression between conditions using diffcyt (edgeR/voom/GLMM for DA, limma/LMM for DS), with cydar, CITRUS, and compositional methods (sccomp, scCODA, DCATS) as alternatives. Covers the sample-is-the-experimental-unit principle, design/contrast and mixed-model formulas, compositionality of cluster proportions, and FDR across clusters. Use when comparing populations between groups, choosing a DA method, handling paired/batch designs, or deciding whether compositional correction is needed.
Reference examples tested with: diffcyt 1.22+, CATALYST 1.26+, edgeR 4.0+, limma 3.58+.
Before using code patterns, verify installed versions match. If versions differ:

- packageVersion('<pkg>') then ?function_name to verify parameters

testDA_edgeR/testDS_limma are diffcyt functions operating on count/median objects from calcCounts/calcMedians; the CATALYST-integrated path is the diffcyt() wrapper on the SCE. Confirm the signature with ?diffcyt before relying on it.
"Compare cell populations between my conditions" -> Test cluster frequencies (DA) and within-cluster marker expression (DS) between groups, with the sample (not the cell) as the unit.

    diffcyt::diffcyt(sce, analysis_type = 'DA', method_DA = 'diffcyt-DA-edgeR', design = design, contrast = contrast)
    diffcyt::diffcyt(sce, analysis_type = 'DS', method_DS = 'diffcyt-DS-limma', ...)

Tens of thousands of cells from one donor are technical PSEUDOREPLICATES, not independent observations. A per-cell test (Wilcoxon across all cells) treats them as n = cells and produces astronomically significant p-values from two mice. It is the single most common statistical sin in modern cytometry (Hurlbert 1984 Ecol Monogr 54:187; the cytometry mirror of the scRNA-seq pseudobulk lesson).

The correct unit is the SAMPLE/subject: diffcyt aggregates cells to PER-SAMPLE-PER-CLUSTER counts (DA) and PER-SAMPLE-PER-CLUSTER arcsinh-MEDIANS (DS), then tests across samples with edgeR/limma/GLMM (Weber 2019 Commun Biol 2:183). Biological replication is mandatory (>= 2-3 per group); DA from a single sample per condition has no valid test.

A second, related point: cluster proportions are COMPOSITIONAL (they sum to 1), so a real increase in one population mechanically forces apparent depletion in the others. This is a source of false DA calls in "unchanged" clusters.
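The pseudoreplication problem described above can be made concrete with a small base R sketch. The data are entirely hypothetical (two donors per group, 5000 cells each, one marker, deterministic values so the result is reproducible), and the per-sample t-test stands in for the edgeR/limma machinery that diffcyt actually uses; it only illustrates the change of experimental unit.

```r
# Hypothetical toy data: 2 donors per group, 5000 cells each, one marker.
# Donors differ from each other within a group, so donor is the real source of variation.
n_cells <- 5000
offsets <- c(ctrl_1 = 0.0, ctrl_2 = 1.0, trt_1 = 0.6, trt_2 = 1.6)

cells <- do.call(rbind, lapply(names(offsets), function(s) {
  data.frame(
    sample_id = s,
    condition = sub("_.*", "", s),  # "ctrl" or "trt"
    # deterministic within-donor spread around the donor-level offset
    expr = offsets[[s]] + seq(-1, 1, length.out = n_cells)
  )
}))

# Per-cell test: treats 20000 cells as independent observations (pseudoreplication)
p_cell <- wilcox.test(expr ~ condition, data = cells)$p.value

# Per-sample test: one median per sample, then compare across the 4 samples
per_sample <- aggregate(expr ~ sample_id + condition, data = cells, FUN = median)
p_sample <- t.test(expr ~ condition, data = per_sample)$p.value

print(c(per_cell = p_cell, per_sample = p_sample))
```

With these values the per-cell p-value is vanishingly small while the per-sample test, which sees only four observations, is far from significant. The same cells support opposite conclusions depending on which unit is treated as independent.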
| Method | Citation | Mechanism | When to use |
|--------|----------|-----------|-------------|
| diffcyt-DA-edgeR / voom | Weber 2019 Commun Biol 2:183 | edgeR/voom empirical-Bayes on per-sample counts; optional TMM | standard 2+ group with replicates (DEFAULT) |
| diffcyt-DA-GLMM / DS-LMM | Weber 2019 | random effects in the formula | paired/repeated-measures/nested (subject random effect) |
| cydar | Lun 2017 Nat Methods 14:707 | overlapping hyperspheres + edgeR + spatial FDR | continuum, avoid hard clusters |
| CITRUS | Bruggner 2014 PNAS 111:E2770 | hierarchical clustering + LASSO | predictive signature, LARGE n; correlated-not-causal; largely superseded |
| sccomp / scCODA / DCATS | Mangiola 2023 PNAS 120:e2203828120 / Buttner 2021 Nat Commun 12:6876 / Lin 2023 Genome Biol 24:151 | simplex-aware compositional models | strong compositional shift (one pop dominates); DCATS for assignment uncertainty |
Goal: Test abundance and state on a CATALYST-clustered SCE.
Approach: Build design + contrast from ei(sce); the diffcyt() wrapper uses the stored clustering. State markers are tested in DS, type markers define DA clusters.

    library(CATALYST)
    library(diffcyt)
    library(SummarizedExperiment)  # for rowData()

    sce <- readRDS('sce_clustered.rds')  # SCE already clustered with CATALYST

    # One design row per sample, built from the experiment info table
    design <- createDesignMatrix(ei(sce), cols_design = 'condition')
    # Tests the 2nd design column; this is Treatment vs Control only if
    # Control is the reference (first) level of 'condition'
    contrast <- createContrast(c(0, 1))

    # DA: per-sample, per-cluster counts tested with edgeR
    res_DA <- diffcyt(sce, clustering_to_use = 'meta20',
                      analysis_type = 'DA', method_DA = 'diffcyt-DA-edgeR',
                      design = design, contrast = contrast)

    # DS: per-sample, per-cluster medians of state markers tested with limma
    res_DS <- diffcyt(sce, clustering_to_use = 'meta20',
                      analysis_type = 'DS', method_DS = 'diffcyt-DS-limma',
                      design = design, contrast = contrast)

    rowData(res_DA$res)  # cluster_id, logFC, p_val, p_adj (BH across clusters)
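Because the contrast `c(0, 1)` refers to design columns by position, its meaning depends on the factor level order of `condition`. The base R sketch below uses `model.matrix()` as an analogue of the design step (an assumption for illustration, not a statement about diffcyt's internals) on a hypothetical sample table shaped like `ei(sce)`, and shows how releveling changes which coefficient the second column represents.

```r
# Hypothetical sample table in the shape of ei(sce): one row per sample
sample_info <- data.frame(
  sample_id = paste0("s", 1:6),
  condition = factor(rep(c("Control", "Treatment"), each = 3),
                     levels = c("Control", "Treatment"))  # reference level first
)

# Base R analogue of a one-factor design matrix
design <- model.matrix(~ condition, data = sample_info)
print(colnames(design))  # "(Intercept)" "conditionTreatment"

# A contrast needs one entry per design column; c(0, 1) selects column 2
contrast <- matrix(c(0, 1), ncol = 1)
stopifnot(nrow(contrast) == ncol(design))

# If Treatment were the reference level, the same c(0, 1) would test
# Control vs Treatment instead, flipping the sign of every logFC
flipped <- sample_info
flipped$condition <- relevel(flipped$condition, ref = "Treatment")
design_flipped <- model.matrix(~ condition, data = flipped)
print(colnames(design_flipped)[2])  # "conditionControl"
```

Checking `levels(ei(sce)$condition)` and the design column names before running the test is a cheap way to make sure the reported logFC has the intended direction.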
Goal: Account for within-subject correlation (e.g. pre/post on the same donor).
Approach: Use a GLMM/LMM method with a random effect for subject via a formula.

    # Fixed effect for condition, random effect for each patient
    formula <- createFormula(ei(sce), cols_fixed = 'condition', cols_random = 'patient_id')
    res_DA <- diffcyt(sce, clustering_to_use = 'meta20',
                      analysis_type = 'DA', method_DA = 'diffcyt-DA-GLMM',
                      formula = formula, contrast = createContrast(c(0, 1)))
Goal: Confirm a headline single-population shift is not inducing artifactual reciprocal depletion.
Approach: Re-test with a simplex-aware model when one cluster changes a lot or total yield differs by group.

    # If a dominant population expands, the apparent depletion of others may be a simplex artifact.
    # Re-test with sccomp / scCODA (reference cell type) / DCATS (assignment uncertainty)
    # before reporting reciprocal depletion as independent biology.
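The simplex artifact is plain arithmetic, which the following base R sketch works through on hypothetical absolute counts. Only one population changes, yet every other proportion drops. The reference-ratio step at the end is a simplified illustration of why compositional methods anchor on a reference population; it is not the model that sccomp, scCODA, or DCATS actually fit, and a single before/after pair is not a statistical test.

```r
# Hypothetical absolute cell counts per cluster, before and after a perturbation
before <- c(T_cells = 4000, B_cells = 3000, NK = 2000, Mono = 1000)
after <- before
after["T_cells"] <- 16000  # only this population truly expands (4-fold)

# Proportions are forced to sum to 1 in each state
prop_before <- before / sum(before)
prop_after <- after / sum(after)

# Proportion-based log2 fold changes: unchanged clusters appear depleted
log2fc_prop <- log2(prop_after / prop_before)
print(round(log2fc_prop, 2))

# Ratios to an (assumed stable) reference population recover the truth:
# 2 for T_cells, 0 for everything else
log2fc_ref <- log2((after / after["Mono"]) / (before / before["Mono"]))
print(log2fc_ref)
```

The choice of `Mono` as the reference is an assumption of this toy example. In real data the reference must itself be justified as unchanged, which is exactly the judgment the compositional methods formalize.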
Trigger: Wilcoxon/t-test across all cells. Mechanism: cells aren't independent. Symptom: p ~ 1e-40 from few subjects. Fix: aggregate to per-sample summaries (diffcyt).
Trigger: one population expands strongly. Mechanism: proportions sum to 1. Symptom: significant "depletion" of unrelated clusters. Fix: use TMM normalization only when total cell abundance is NOT itself the biological signal (otherwise it removes real signal), or switch to a compositional method (sccomp/scCODA/DCATS); in either case, report total-yield differences.
Trigger: normalizing batch out then testing naively. Mechanism: over-correction removes real signal. Symptom: attenuated effects. Fix: include batch in the design; if batch == condition, no rescue - design it out.
Trigger: 1 sample per condition. Mechanism: no error term. Symptom: uninterpretable p. Fix: require >= 2-3 biological replicates per group.
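For the batch failure mode above, whether the condition effect is estimable at all can be checked before any test is run. The sketch below is a base R illustration on two hypothetical sample tables; `is_estimable` is a helper name invented here, and `model.matrix()` stands in for whatever design step is used in practice.

```r
# Hypothetical layout 1: every batch contains both conditions (balanced)
balanced <- data.frame(
  condition = factor(rep(c("Control", "Treatment"), times = 3)),
  batch = factor(rep(c("b1", "b2", "b3"), each = 2))
)

# Hypothetical layout 2: batch is identical to condition (fully confounded)
confounded <- data.frame(
  condition = factor(rep(c("Control", "Treatment"), each = 3)),
  batch = factor(rep(c("b1", "b2"), each = 3))
)

# Hypothetical helper: a design is usable only if it has full column rank
is_estimable <- function(info) {
  X <- model.matrix(~ condition + batch, data = info)
  qr(X)$rank == ncol(X)
}

print(is_estimable(balanced))    # TRUE: batch can be modeled alongside condition
print(is_estimable(confounded))  # FALSE: condition and batch columns are identical
```

When the check fails, no model or correction can separate the two effects; the remedy is in the experimental design, as the failure mode states.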
| Threshold | Source | Rationale |
|-----------|--------|-----------|
| >= 2-3 biological replicates per group | Weber 2019 | minimum for a valid DA/DS error term |
| BH FDR across clusters (and clusters x markers for DS) | diffcyt | high-resolution grids have many tests |
| arcsinh median as DS statistic | Nowicka 2017 | robust per-cluster per-sample summary |
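The BH threshold in the table can be illustrated with base R's `p.adjust()`. The raw p-values below are hypothetical, one per cluster; for DS the same adjustment would run over every cluster x marker combination, so the number of tests grows quickly.

```r
# Hypothetical raw p-values, one per cluster
p_val <- c(c1 = 0.001, c2 = 0.012, c3 = 0.027, c4 = 0.045, c5 = 0.6)

# Benjamini-Hochberg adjustment across all clusters tested
p_adj <- p.adjust(p_val, method = "BH")
print(round(p_adj, 4))

# Clusters passing a 5% cutoff before and after adjustment
n_raw <- sum(p_val < 0.05)  # 4
n_bh <- sum(p_adj < 0.05)   # 3
print(c(raw = n_raw, bh = n_bh))
```

Cluster `c4` passes an unadjusted 0.05 cutoff but not the BH-adjusted one, which is the kind of call that multiplies on a high-resolution clustering grid.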
| Error / symptom | Cause | Solution |
|-----------------|-------|----------|
| testDA_edgeR(sce, ...) fails | wrong signature | use the diffcyt() wrapper on the SCE, or calcCounts first |
| results empty | wrong clustering_to_use name | match the stored clustering id (e.g. meta20) |
| no DS results | state markers not flagged | set marker_class='state' in the panel |
| paired design ignored | used fixed-effect method | use diffcyt-DA-GLMM with a random effect |