atac-seq/differential-accessibility/SKILL.md
Identify differentially accessible chromatin regions across conditions using DiffBind, csaw, DESeq2, or edgeR. Use when comparing ATAC-seq accessibility between treatment groups, choosing between consensus-peak vs sliding-window approaches, picking the correct normalization (full library vs reads-in-peaks), correcting batch with SVA/RUVseq, or interpreting log2FC and FDR thresholds in a chromatin context.
npx skillsauth add GPTomics/bioSkills bio-atac-seq-differential-accessibilityInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: DiffBind 3.12+, DESeq2 1.42+, edgeR 4.0+, csaw 1.36+, limma 3.58+, GenomicRanges 1.54+, ChIPseeker 1.38+, Subread 2.0+ (featureCounts), sva 3.50+, RUVSeq 1.36+.
Before using code patterns, verify installed versions match:
packageVersion('<pkg>') then ?function_name to verify parametersIf code throws unexpected errors, introspect the installed package and adapt rather than retrying.
"Find chromatin regions that change accessibility between my conditions" -> Build a sample-by-region count matrix, normalize for library size and chromatin compaction, fit a generalized linear model (negative-binomial), and extract regions with significant accessibility change.
DiffBind -> count -> normalize -> contrast -> analyzecsaw::windowCounts + filterWindowsGlobal + edgeR QL F-testDESeq2 or edgeR directly on featureCounts outputDiffBind is a wrapper around DESeq2 / edgeR with ATAC-aware defaults. csaw is the only peak-free option; it tests fixed-width sliding windows. The choice depends on whether peaks are stable across conditions (use DiffBind) or whether some condition has dramatically different peak structure (use csaw or rebuild consensus peaks).
| Tool | Model | Input | Min reps | Strength | Fails when |
|------|-------|-------|----------|----------|------------|
| DiffBind 3.x (default DESeq2) | NB GLM via DESeq2 on consensus peaks | BAM + peak files | 2-3 per group | ATAC-aware defaults; built-in QC; blocking factors. Default in 3.x is normalize=DBA_NORM_LIB with library=DBA_LIBSIZE_FULL (full library size, background-included) | Peaks differ dramatically between conditions (closed -> open shifts width); fewer than 2 reps per group |
| DiffBind with edgeR backend | NB GLM via edgeR-QL on consensus peaks | Same | 2-3 per group | Robust at low replicates (n=2 OK); QL test calibrates dispersion better than DESeq2 at small n | When global accessibility shifts dominate, switch to spike-in or library=DBA_LIBSIZE_PEAKREADS (RiP) |
| DESeq2 directly on peak counts | NB GLM with shrinkage | featureCounts SAF | 3+ | Maximum control; integrates with apeglm shrinkage; modern interface | Need to manually build consensus peakset; per-region pre-filter required (low counts inflate dispersion) |
| edgeR QL F-test on peak counts | NB QL (quasi-likelihood) | featureCounts | 2 | Calibrated FDR at low n (n=2 viable); robust to outlier reps | Manual consensus peakset; small library bias unless normalization explicit |
| csaw (windows) | edgeR-QL on sliding windows | BAM only | 2 | No peak set required; detects diffuse changes peaks miss; merges adjacent windows | Computationally heavy; window size choice biases results; harder to annotate downstream |
| limma-voom | linear model with mean-variance trend | log2(CPM+offset) | 3 | Fast; good calibration at moderate count | Mis-calibrated at very low counts (atac peaks often have dropouts); needs explicit voom normalization |
Methodology evolves; verify the current consensus practice (Schep & Greenleaf 2017; Reske 2020 benchmark) before locking pipelines.
| Scenario | Recommended workflow | Why |
|----------|---------------------|-----|
| 3+ reps, similar peak structure across conditions | DiffBind (DESeq2 backend), summits=250, normalize=NATIVE | Standard pattern; peak-level inference is interpretable |
| 2 reps per condition | DiffBind with edgeR backend OR raw edgeR QL | DESeq2 underpowered at n=2; QL is robust |
| Peak structure differs dramatically (e.g., differentiation, KO of pioneer TF) | csaw windows OR rebuild consensus peakset post-hoc per condition then take union | Stable consensus peakset is invalid when chromatin landscape shifts |
| Multi-factor design (batch, sex, time) | DiffBind with dba.contrast(..., design='~Batch + Condition') | Standard linear model adjustment |
| Hidden batch / unknown variance | DESeq2 + SVA OR RUVseq before fitting | Empirical surrogate variables capture unknown nuisance |
| Long timecourse (5+ time points) | DESeq2 LRT (likelihood ratio test) on ~time + condition + time:condition | Captures temporal interaction; use differential-expression/timeseries-de patterns |
| Diffuse / broad accessibility change (super-enhancers) | csaw with merged windows OR call broad peaks first | Narrow peaks fragment broad domains -> inflated peak count, deflated effect |
| Single-cell ATAC pseudobulk | DESeq2 on aggregated counts OR Signac::FindMarkers | See atac-seq/single-cell-atac |
| Allele-specific accessibility | csaw on heterozygous SNPs OR HOMER tagDir | Peak-level invalid because alleles share peaks |
| Plant / non-model organism | DiffBind works; just provide custom genome and disable annotation | Annotation step assumes UCSC TxDb; bypass if absent |
The consensus peakset choice drives FDR calibration. DiffBind defaults rarely match what a chromatin biologist wants.
| Strategy | Implementation | When to use |
|----------|---------------|-------------|
| Intersection (peak in all reps) | dba.count(minOverlap=N) with N = total reps | Strict; for high-confidence reproducible analysis (matches IDR philosophy) |
| Union (peak in any rep) | minOverlap=1 | Maximum sensitivity; risks single-rep artefact peaks |
| Majority rule (peak in >= half reps) | minOverlap=ceiling(N/2) | DiffBind default-ish; balance |
| Per-condition union, then union of unions | Compute consensus per group, then merge | Best when conditions have very different peak counts |
| Iterative overlap removal (Corces 2018) | Sort peaks by significance; greedily keep non-overlapping; fixed-width 501 bp | Standard for fixed-width consensus; required for peak-count matrices used in machine learning |
Refer to atac-seq/consensus-peakset for full coverage of fixed-width re-centering and the iterative overlap algorithm. For DiffBind, the key parameter is summits=250 (re-center peaks on summit +/- 250 bp = 501 bp fixed width).
DiffBind 3.x conflates two orthogonal choices: the normalization method (normalize=) and the library-size definition (library=). The defaults are normalize=DBA_NORM_LIB with library=DBA_LIBSIZE_FULL (full mapped-read total).
| Choice | DiffBind argument | What it does | When to use |
|--------|-------------------|--------------|-------------|
| Normalize by library size only | normalize=DBA_NORM_LIB (default) | Scale counts by the chosen library size | Standard; pairs with full or RiP library |
| Reads-in-peaks library size | library=DBA_LIBSIZE_PEAKREADS | Library size = reads in consensus peaks (RiP) | When background varies independently of biology (protects against background drift) |
| Full mapped library size | library=DBA_LIBSIZE_FULL (default) | Library size = total mapped reads | When global accessibility shifts must remain visible (e.g., chromatin compaction) |
| Native per-tool default | normalize=DBA_NORM_NATIVE | DESeq2 RLE or edgeR TMM, depending on backend | Use DESeq2/edgeR conventions directly |
| TMM (edgeR) | normalize=DBA_NORM_TMM | Trimmed mean of M-values | Robust to a few highly-DA peaks dominating |
| RLE (DESeq2) | normalize=DBA_NORM_RLE | DESeq2 geometric-mean size factors | DESeq2-conventional analysis |
| Spike-in / external | not built-in; pre-scale counts | Exogenous reference (e.g., spike-in chromatin) | Required when global scaling is biological |
Trigger: Treatment causes global chromatin compaction (e.g., HDAC inhibitor, DNMT inhibitor).
Mechanism: Full library-size normalization is robust to background but the default still scales background reads in; under uniform global compaction the magnitudes can collapse. RiP scaling (library=DBA_LIBSIZE_PEAKREADS) makes the opposite assumption (peak signal is stable, background absorbs the shift) and so erases the very biology of interest.
Symptom: Volcano plot is symmetric about zero; PCA shows treatment effect that vanishes after normalization.
Fix: Use spike-in normalization (add exogenous chromatin pre-Tn5; scale by spike-in reads), or keep the default library=DBA_LIBSIZE_FULL but interpret with the global shift in mind. Reske 2020 documented this with HDAC inhibitor.
Trigger: Treatment causes whole-genome accessibility shift; cell-cycle synchronized samples; differentiation timecourse.
Mechanism: Setting library=DBA_LIBSIZE_PEAKREADS (RiP-based) assumes total reads-in-peaks is comparable across samples. Global accessibility shifts break this assumption.
Symptom: Conditions clearly differ in PCA before normalization; after normalization PC1 is nearly noise.
Fix: Keep the default library=DBA_LIBSIZE_FULL (or use spike-in scaling) and re-run dba.contrast and dba.analyze. Re-inspect PCA; if treatment now drives PC1, the global-shift biology is preserved.
Trigger: Per-rep peaks have very different widths; consensus uses union.
Mechanism: Without summits=250, DiffBind counts reads in the original peak intervals. A peak called as 200 bp in one rep and 800 bp in another inflates the count for the wider rep.
Symptom: Top differential peaks track peak width, not signal intensity.
Fix: Always set summits=250 (or 100, depending on resolution). This re-centers all peaks on the summit and uses identical 501 (or 201) bp windows.
Trigger: Default width=10 (bp) windows; default filter.global cutoff.
Mechanism: Narrow windows have very low counts and inflated dispersion; default global filter discards too many windows. Results are extremely sensitive to these.
Symptom: Number of significant windows ranges from 200 to 200,000 across reasonable parameter sweeps.
Fix: Use width=150 for ATAC (matches typical NFR fragment); set filter.global(stat, log2(min.fold))=log2(3) to discard low-signal windows. Validate by running on technical replicates -- ~0 differential windows is the expected outcome.
Trigger: n=2 per condition; using lfcShrink(type='apeglm').
Mechanism: Apeglm shrinks log2FC toward zero based on dispersion estimate; at n=2 dispersion is unreliable, shrinkage is over-aggressive, and biology is masked.
Symptom: All log2FC values cluster near zero post-shrinkage; FDR list has high p-values across the board.
Fix: Skip shrinkage at n=2 OR switch to edgeR QL test. If n=2 is unavoidable, report unshrunken log2FC alongside FDR; do not use shrunken FCs as the effect size.
Trigger: Including peaks with mean count < 5 across all samples.
Mechanism: The QL F-test calibrates dispersion across all features. Including very-low-count peaks pulls dispersion estimates and inflates FDR.
Fix: filterByExpr(y, group=group) removes low-count peaks; restore peaks one at a time only if they are biologically critical and supported by at least one rep at depth.
| Pattern | Likely cause | Action |
|---------|--------------|--------|
| DiffBind + DESeq2 differ wildly | Different normalization (DiffBind default = RiP, DESeq2 default = RLE) | Force same normalization; differences should shrink |
| DiffBind + csaw differ | csaw catches diffuse changes peaks miss; DiffBind catches narrow peaks csaw smooths | Both can be correct; report intersection as high-confidence |
| Top hits in DiffBind have FDR > 0.5 in DESeq2 | DiffBind's blacklist filter or width re-centering changes the per-region count | Re-run DESeq2 on the exact DiffBind consensus matrix (dba.peakset extract) |
| Effect-size ranking differs across reps | One rep is an outlier -- check PCA | Drop or block as covariate; never silently include |
| No significant peaks despite obvious browser-track differences | Library-size normalization eaten the global shift | Switch to spike-in or full-library normalization |
Operational rule: For high-confidence reporting, require concordant detection in two methods from different families (DiffBind/DESeq2-style on consensus peaks AND csaw-style sliding windows agreeing within +/- 500 bp). Report the intersection as primary; the union as exploratory.
| Question | Threshold | Rationale | |----------|-----------|-----------| | Statistical significance | FDR < 0.05 | Standard BH FDR (DESeq2 / edgeR / DiffBind default) | | Stringent biological change | abs(log2FC) >= 1 (= 2-fold) | Within-noise effects below 2-fold are unreliable in chromatin | | Conservative reporting | FDR < 0.01 AND abs(log2FC) >= 1 | Per ENCODE differential reporting guidance | | Exploratory / discovery | FDR < 0.1 OR shrunken log2FC >= 0.585 (1.5x) | For follow-up validation, not final claim | | Proper effect-size reporting | Use shrunken log2FC (apeglm or DESeq2 lfcShrink) when n >= 3 | Raw log2FC at low counts is volatile |
abs(log2FC) >= 1 is not universal. ATAC effects in primary cells (immune subsets, neurons) often max at 1.5-fold; require log2FC >= 0.585 with FDR < 0.05 for those settings.
Goal: Recover differential accessibility signal when unknown batch effects swamp the contrast.
Approach: Estimate surrogate variables on normalized counts via svaseq, append them to the DESeq2 design, refit the model, and extract the contrast.
library(DESeq2); library(sva)
dds <- DESeqDataSetFromMatrix(countData=counts, colData=coldata, design=~condition)
dds <- estimateSizeFactors(dds)
dat <- counts(dds, normalized=TRUE)
dat <- dat[rowMeans(dat) > 1, ]
mod <- model.matrix(~condition, colData(dds))
mod0 <- model.matrix(~1, colData(dds))
svobj <- svaseq(dat, mod, mod0, n.sv=2)
dds$SV1 <- svobj$sv[, 1]; dds$SV2 <- svobj$sv[, 2]
design(dds) <- ~SV1 + SV2 + condition
dds <- DESeq(dds)
res <- results(dds, contrast=c('condition', 'treated', 'control'))
RUVseq is the alternative when negative-control regions (ChrM peaks NOT changing) or technical replicates are available. SVA is preferred when no controls exist.
Trigger: Treatment causes whole-genome accessibility shift (HDAC inhibitor, DNMT inhibitor); RPM/CPM/RiP normalization erases the global biology.
Mechanism: Exogenous chromatin spike-in (Drosophila S2 nuclei) is added at constant cell number ratio pre-Tn5; reads aligning to dm6 quantify the constant exogenous baseline. Sample-level scaling factor = inverse of dm6 reads per sample, applied to human-aligned counts.
Goal: Preserve global accessibility shifts that RiP / library-size normalization would erase.
Approach: Compute per-sample size factors from inverse spike-in read counts, override DESeq2's default size factors, then run the standard DESeq2 fit and contrast.
library(DESeq2)
# spike_counts: per-sample dm6 read counts (one column per sample)
sf_spike <- 1 / spike_counts
sf_spike <- sf_spike / mean(sf_spike) # Geometric mean = 1 for stability
dds <- DESeqDataSetFromMatrix(countData=counts, colData=coldata, design=~condition)
sizeFactors(dds) <- sf_spike # Override library-size factors
dds <- DESeq(dds)
res <- results(dds, contrast=c('condition', 'treated', 'control'))
After spike-in normalization, log2FC reflects absolute accessibility change (not just relative redistribution). ENCODE 4 increasingly recommends spike-in for global-shift biology.
Trigger: n=2 per condition; parametric NB tests give over-confident p-values.
Mechanism: csaw provides a permutation framework: the null is generated by shuffling sample labels; test statistic is the count-difference per window; per-region p is the rank under permutation.
Goal: Generate empirical per-region p-values when parametric NB tests are over-confident at low replicate counts.
Approach: Fit the observed edgeR QL F statistic, repeatedly shuffle group labels and refit, then compute per-region p as the rank of the observed F under the shuffled null.
library(csaw); library(edgeR)
# Standard csaw counts (windows or peaks)
counts <- regionCounts(bam_files, regions, ext=200)
# Standard NB fit
y <- DGEList(counts=assay(counts), group=condition)
y <- calcNormFactors(y, method='TMM')
design <- model.matrix(~condition)
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)
# Permutation: shuffle group labels n_perms times; track per-region rank statistic
n_perms <- 1000
perm_p <- replicate(n_perms, {
shuffled <- sample(condition)
design_p <- model.matrix(~shuffled)
fit_p <- glmQLFit(estimateDisp(y, design_p), design_p)
glmQLFTest(fit_p, coef=2)$table$F
})
observed_F <- glmQLFTest(fit, coef=2)$table$F
permp <- rowMeans(perm_p >= observed_F)
Permutation requires ~1000 shuffles for stable per-region p; computationally expensive but essential when parametric tests cannot be trusted.
Goal: Identify peaks whose accessibility trajectory differs between conditions across a timecourse.
Approach: Fit a DESeq2 LRT comparing a full model with a spline-by-condition interaction against a reduced model lacking the interaction; significant peaks have time-dependent condition response.
library(DESeq2); library(splines)
# Spline-modeled time course (5+ time points)
dds <- DESeqDataSetFromMatrix(countData=counts, colData=coldata,
design=~ns(timepoint, df=3) + condition + ns(timepoint, df=3):condition)
dds_full <- DESeq(dds, test='LRT', reduced=~ns(timepoint, df=3) + condition)
res <- results(dds_full)
The LRT compares the full model (with time:condition interaction) to a reduced model without; significant peaks have time-dependent condition response. Use df=3 natural splines for typical 5-7 timepoints; df=4-5 for >= 8.
Trigger: Combined ATAC-seq + Hi-C/HiChIP datasets; want to test enhancer-promoter pair-level differential.
Mechanism: Aggregate peak-level differential signal at HiCCUPS loop anchors (or ABC-predicted enhancer-gene pairs). Combined enhancer + promoter accessibility change has more statistical power than either alone.
Goal: Test enhancer-promoter pair-level differential accessibility by aggregating peak-level signal at loop anchors.
Approach: Import HiCCUPS loops, map consensus peaks to anchor positions, then aggregate per-peak log2FC across both anchors of each loop to get loop-level effect sizes.
# Pseudo-pattern: per loop, sum DESeq2 log2FC at both anchors
loops <- import('hiccups_loops.bedpe', format='bedpe')
peak_to_loop <- findOverlaps(consensus_peaks, c(anchorOne(loops), anchorTwo(loops)))
loop_lfc <- aggregate(res$log2FoldChange[queryHits(peak_to_loop)],
by=list(loop=ceiling(subjectHits(peak_to_loop) / 2)),
FUN=function(x) sum(x, na.rm=TRUE))
For implementation, use the InteractionSet Bioconductor package which preserves loop-pair structure during testing. Reference: Mumbach 2017 Nat Genet and Hwang 2024 (Hi-C loop-aggregated DA).
Goal: Assign each differentially accessible peak to its nearest gene and feature class for downstream interpretation.
Approach: Pull DiffBind / DESeq2 results as GRanges, annotate via ChIPseeker against a TxDb with a custom promoter window, then plot annotation distribution and extract gene IDs for enrichment.
library(ChIPseeker); library(TxDb.Hsapiens.UCSC.hg38.knownGene)
diff_peaks <- dba.report(dba)
peakAnno <- annotatePeak(diff_peaks, TxDb=TxDb.Hsapiens.UCSC.hg38.knownGene,
tssRegion=c(-2000, 500), level='gene')
plotAnnoPie(peakAnno); plotDistToTSS(peakAnno)
genes <- as.data.frame(peakAnno)$geneId # for GO enrichment via pathway-analysis/go-enrichment
tssRegion=c(-2000, 500) defines promoter as TSS-2kb to TSS+500bp; ChIPseeker default (-3000, 3000) over-counts promoter assignments. Adjust per cell type / organism.
| Error / symptom | Cause | Solution |
|-----------------|-------|----------|
| DiffBind very slow | Counting all peaks across all BAMs sequentially | dba.count(..., bParallel=TRUE) and provide BPPARAM |
| unable to use the provided design matrix (DESeq2) | Confounded design (e.g., batch perfectly aligns with condition) | Replicate in a way that breaks the confound, or drop the batch term |
| FDR list empty despite obvious differences | RiP scaling (library=DBA_LIBSIZE_PEAKREADS) removed global biology | Use spike-in or keep default library=DBA_LIBSIZE_FULL; verify with browser tracks |
| Top peaks all on chrM | chrM not removed from BAM before counting | Always strip chrM upstream |
| dispersion estimate failure (DESeq2) | Too few peaks pass filter; too few reps | filterByExpr less aggressively; check rep count |
| Error in if (any(out))(csaw) | Window count below threshold | Reduce bin.size; check BAM is paired-end |
| ChIPseeker error on non-human TxDb | Wrong organism db loaded | Use make_org_db from biomartr or AnnotationDbi for non-model |
| Volcano plot symmetric about zero with no significant peaks | Hidden batch swamping signal | Run SVA/RUVseq |
testing
Analyze multi-modal single-cell data (CITE-seq, Multiome, spatial). Use when working with data that measures multiple modalities per cell like RNA + protein or RNA + ATAC. Use when analyzing CITE-seq, Multiome, or other multi-modal single-cell data.
data-ai
Analyze metabolite-mediated cell-cell communication using MeboCost for metabolic signaling inference between cell types. Predict metabolite secretion and sensing patterns from scRNA-seq data. Use when studying metabolic crosstalk between cell populations or metabolite-receptor interactions.
development
Find marker genes and annotate cell types in single-cell RNA-seq using Seurat (R) and Scanpy (Python). Use for differential expression between clusters, identifying cluster-specific markers, scoring gene sets, and assigning cell type labels. Use when finding marker genes and annotating clusters.
development
Reconstruct cell lineage trees from CRISPR barcode tracing or mitochondrial mutations. Use when studying clonal dynamics, cell fate decisions, or developmental trajectories.