Version Compatibility

Reference examples tested with: bedtools 2.31+, samtools 1.19+, BEDOPS 2.4.41+, GenomicRanges 1.54+, DiffBind 3.12+, Subread 2.0.2+ (featureCounts; --countReadPairs requires >= 2.0.2), pybedtools 0.10+.

Before using code patterns, verify installed versions match. If versions differ:

Python: pip show <package> then help(module.function) to check signatures
R: packageVersion('<pkg>') then ?function_name to verify parameters
CLI: <tool> --version then <tool> --help to confirm flags

If code throws unexpected errors, introspect the installed package and adapt rather than retrying.

Consensus Peakset Construction

"Build a single peakset to count reads against for all my samples" -> Combine per-replicate or per-condition peak calls into a non-redundant, fixed-width set of regions. The strategy chosen drives FDR calibration, peak-width fairness, and reproducibility downstream.

CLI: bedtools merge (simple union) and bedtools multiinter (per-sample membership columns emitted by default)
CLI: Corces 2018 iterative overlap removal (custom shell)
R: DiffBind::dba.count(summits=250) (built-in fixed-width)
Python: pybedtools for programmatic merging

The peakset choice is rarely default-correct. Wrong width or wrong overlap rule propagates to every downstream analysis (differential, motif, footprint, ML).

Why a Consensus Peakset Matters

ATAC peaks vary in width across replicates: same regulatory element might be called 200 bp in rep1 and 800 bp in rep2 because of stochastic Tn5 cuts at edges. Counting reads in different-width intervals confounds peak width with biological signal. A fixed-width consensus avoids this.

For ENCODE-style differential analysis: ALL samples must be counted against the SAME peak coordinates; otherwise the count matrix is non-rectangular and statistical models are misspecified.

Strategy Taxonomy

| Strategy | Implementation | Width | When to use | Fails when | |----------|---------------|-------|-------------|------------| | Naive union | bedtools merge of all peaks | Variable, tends wide | Quick exploratory; never for differential | Width inflation drives spurious differential | | Naive intersection | bedtools multiinter requiring all samples | Variable | High-stringency reproducibility | Loses real condition-specific peaks | | Majority-rule overlap | multiinter requiring >= n/2 samples | Variable | Balance; DiffBind default with minOverlap | Width still varies; counts are width-biased | | Iterative overlap removal (Corces 2018) | Sort by significance, greedily keep non-overlapping at fixed width | 501 bp fixed | ML features; cross-study comparison; modern ATAC standard | Loses sub-501bp resolution; overweights high-significance peaks | | Summit-centered fixed width (DiffBind) | dba.count(summits=250) re-centers all peaks on summit +/- 250 bp | 501 bp fixed | Matches the Corces 501 bp convention (note: dba.count default is summits=200 -> 401 bp); integrates with replicate counts | Requires summit info (MACS narrowPeak); broad peaks lose width info | | IDR-filtered union | Union of IDR-passed peaks across rep pairs | Variable | ENCODE pipeline-compliant; reproducibility-aware | Requires running IDR per pair; computationally heavier | | Per-condition union, then global union | Each group consensus separately, then merge | Variable | Different cell types / strong condition shift | Same width issues as naive union | | Width-controlled extension | Extend each peak to median width centered on midpoint | User-set | Quick fixed-width without summit info | Midpoint != summit; can shift biology |

Methodology evolves; verify against current ENCODE 4 ATAC standards (encodeproject.org/atac-seq) and the Corces 2018 iterative overlap algorithm before locking pipelines.

Iterative Overlap Removal (Corces 2018)

This is the modern standard for fixed-width consensus peaksets used in cross-study comparison and machine-learning feature matrices.

Algorithm:

Pool all peaks from all samples; re-center each on its summit; extend +/- 250 bp -> 501 bp fixed-width peaks.
Sort by significance (narrowPeak column 7, signalValue, descending).
Walk down the sorted list. Keep each peak if it does not overlap any previously kept peak. Drop if overlap.
Output the kept peaks as the consensus.

Goal: Produce a non-overlapping, fixed-width peakset weighted toward strongest evidence.

Approach: Greedy non-overlap on summit-centered fixed-width peaks ranked by signalValue.

#!/bin/bash
# Corces 2018 iterative overlap removal
PEAKS_DIR=peaks/per_sample
GENOME_SIZES=hg38.chrom.sizes
WIDTH_HALF=250                                       # 501 bp total width

# 1. Pool all peaks, re-center on summit, extend
awk -v w=$WIDTH_HALF 'BEGIN{OFS="\t"}
    {summit=$2+$10; print $1, summit-w, summit+w+1, $4, $7, $6}' \
    $PEAKS_DIR/*.narrowPeak | \
    awk '$2 >= 0' | \
    bedtools slop -i - -g $GENOME_SIZES -b 0 | \
    sort -k1,1 -k2,2n > pooled_recentered.bed

# 2. Sort by signalValue descending (column 5 in our BED -- which was column 7 of narrowPeak)
sort -k5,5gr pooled_recentered.bed > pooled_by_sig.bed

# 3. Iterative greedy non-overlap (process in significance order)
python3 - <<'EOF'
import sys
kept = []
with open('pooled_by_sig.bed') as f:
    for line in f:
        chrom, start, end, name, sig, strand = line.strip().split('\t')[:6]
        start, end = int(start), int(end)
        overlap = any(c == chrom and not (end <= s or start >= e) for c, s, e in kept)
        if not overlap:
            kept.append((chrom, start, end))

with open('consensus_iterative.bed', 'w') as f:
    for c, s, e in sorted(kept):
        f.write(f'{c}\t{s}\t{e}\n')
EOF

sort -k1,1 -k2,2n consensus_iterative.bed > consensus_final.bed
echo "Consensus peakset: $(wc -l < consensus_final.bed) fixed-width 501bp peaks"

For a faster approximation at scale, bedtools cluster + per-cluster top-significance selection is tempting, but it is NOT equivalent to greedy iterative overlap: cluster groups transitively-overlapping peaks (a chain can span well beyond 501 bp), so keeping one peak per cluster discards non-overlapping peaks the greedy algorithm would retain. Use it only as an approximation.

Per-Strategy Failure Modes

Naive union -- Width-driven differential

Trigger: Using bedtools merge of all per-rep peaks; counting reads in merged intervals.

Mechanism: A peak appearing as 200 bp in one rep but 800 bp in another merges to 800 bp in the union. Read count in 800 bp interval is biased high; differential analysis flags it as condition-specific even when it's just width difference.

Symptom: Top differential peaks track peak width (mean width different by 100+ bp between conditions).

Fix: Use fixed-width strategy (Corces iterative or DiffBind summits=250). Never use merged variable-width peaks for differential.

Intersection (peak in all reps) -- Loses condition-specific biology

Trigger: Using bedtools multiinter -i ... (which emits per-sample membership columns by default) and requiring all samples.

Mechanism: Peaks present only in one condition fail intersection requirement and are excluded from the consensus, even though they are the biology of interest.

Symptom: Differential analysis returns near-zero significant peaks; the condition-specific peaks were filtered out before testing.

Fix: Use per-condition consensus, then union of consensus. Or majority rule with per-condition minOverlap.

Majority rule (DiffBind default-ish) -- Borderline peaks dropped

Trigger: dba.count(minOverlap=ceiling(N/2)) where N = total replicates and one condition has fewer reps.

Mechanism: A peak in 2/2 reps of cond1 but 0/3 reps of cond2 is in 2/5 = 40% < 50% -> dropped.

Fix: Compute consensus per condition first (e.g. peak in >= 2/3 reps), then union across conditions. This preserves condition-specific peaks.

Width-controlled extension -- Shifts off summit

Trigger: Extending peak to fixed width using midpoint ((start+end)/2).

Mechanism: Midpoint of the called peak is rarely at the actual summit; particularly for asymmetric peaks (TSS-flanking) or wide broadPeaks.

Fix: Use summit position from narrowPeak column 10 (start + summit_offset). If summit unavailable (broadPeak), use peak start + half median width as approximation.

IDR-filtered union -- Computational cost; threshold mismatch

Trigger: Running IDR for each rep pair, then unioning IDR-passed peaks.

Mechanism: IDR thresholds differ (true reps 0.05; pseudoreps 0.10). Mixing both into a union biases the consensus toward looser pseudorep peaks unless filtering carefully.

Fix: Decide upfront -- use only true-rep IDR (stricter, fewer peaks) OR pseudorep IDR (looser, more peaks). Don't mix.

Decision Tree by Goal

| Goal | Strategy | |------|----------| | Standard 2-3 rep DA analysis with DiffBind | DiffBind summits=250, minOverlap=2 | | Modern ATAC publication / ML features | Corces 2018 iterative overlap (501 bp fixed) | | ENCODE-compliant differential | IDR-passed per-rep-pair union (true-rep threshold 0.05) | | Multi-condition with strong biology shift | Per-condition consensus then union | | Single-cell ATAC pseudobulk | MACS3 per cluster -> iterative overlap across clusters | | Cross-study peak comparison | Iterative overlap on shared genome build; lift over if needed | | Quick exploratory | bedtools merge (acknowledge width bias; not for stats) |

Reconciliation Across Methods

| Pattern | Likely cause | Action | |---------|--------------|--------| | Iterative overlap peakset much smaller than DiffBind summits=250 | Iterative removes overlapping summits within 500 bp; DiffBind keeps them | Both valid; iterative is sparser, DiffBind preserves resolution | | Per-condition union then merged peak count >> within-condition consensus | Strong condition-specific peaks; biology is real | Use this strategy; do NOT collapse to global majority rule | | IDR-pass peaks much fewer than majority-rule | IDR is stricter (reproducibility-aware) than overlap counting | Use IDR for high-confidence reporting; majority for exploratory | | Same biology gives different peak counts on different genome builds | Lift-over noise; build-specific blacklist differences | Match builds; use ENCODE blacklist v2 specific to build |

Operational rule: Document the strategy explicitly in methods. The strategy choice can shift downstream results by 30-50% in peak count and FDR. Reproducibility requires explicit specification.

Width-Re-centering Patterns

# pybedtools: re-center on summit and fix width to 501 bp
import pybedtools as pbt

def fix_width_recenter(narrowpeak_path, half_width=250, out_path='consensus.bed'):
    """Re-center each narrowPeak on its summit and fix width to 2 * half_width + 1 bp."""
    rows = []
    for line in open(narrowpeak_path):
        f = line.strip().split('\t')
        chrom = f[0]; start = int(f[1]); summit_off = int(f[9])
        signal = float(f[6])
        summit = start + summit_off
        rows.append((chrom, max(0, summit - half_width), summit + half_width + 1,
                     f[3], signal, f[5]))
    rows.sort(key=lambda r: -r[4])               # Sort by signalValue desc
    bt = pbt.BedTool([list(map(str, r)) for r in rows])
    return bt.saveas(out_path)

Per-Condition Consensus Then Union

# For each condition, build a within-condition consensus first
for cond in cond1 cond2; do
    bedtools multiinter -i <(sort -k1,1 -k2,2n ${cond}_rep1.narrowPeak) \
                            <(sort -k1,1 -k2,2n ${cond}_rep2.narrowPeak) \
                            <(sort -k1,1 -k2,2n ${cond}_rep3.narrowPeak) \
        -names rep1 rep2 rep3 | \
        awk '$4 >= 2 {print $1"\t"$2"\t"$3}' > ${cond}_consensus.bed
done

# Union across condition consensuses (preserves condition-specific peaks)
cat cond1_consensus.bed cond2_consensus.bed | \
    sort -k1,1 -k2,2n | \
    bedtools merge -i - > all_conditions_consensus.bed

This is the right pattern when conditions differ enough that a global majority-rule would discard condition-specific peaks.

R-native Fixed-Width Re-centering (GenomicRanges)

For Bioconductor pipelines that avoid intermediate BED files. The iterative-overlap step requires a loop; !duplicated(findOverlaps(...)) does NOT implement greedy non-overlap (every peak self-overlaps).

library(GenomicRanges); library(rtracklayer)

peaks <- import('peaks.narrowPeak')                    # GRanges
# trim() below only clamps ranges when seqlengths are set, so attach them from the genome index
chrom_sizes <- read.table('hg38.chrom.sizes', col.names=c('chrom', 'len'))
seqlengths(peaks) <- setNames(chrom_sizes$len, chrom_sizes$chrom)[seqlevels(peaks)]
peaks_summit <- peaks                                  # column 10 of narrowPeak is summit offset
start(peaks_summit) <- start(peaks) + peaks$peak       # move start to the summit position
end(peaks_summit) <- start(peaks_summit)               # collapse to 1 bp at the summit before centering
peaks_fixed <- resize(peaks_summit, width=501, fix='center')
peaks_fixed <- trim(peaks_fixed)                       # clamp to chromosome bounds (requires seqlengths, set above)

# Sort by signalValue (column 7) descending; greedy iterative non-overlap
peaks_sorted <- peaks_fixed[order(-peaks_fixed$signalValue)]
kept <- logical(length(peaks_sorted))
already_used <- GRanges()
for (i in seq_along(peaks_sorted)) {
    if (!any(overlapsAny(peaks_sorted[i], already_used))) {
        kept[i] <- TRUE
        already_used <- c(already_used, peaks_sorted[i])
    }
}
consensus <- peaks_sorted[kept]
export(consensus, 'consensus_iterative.bed')

For very large peaksets (>500k), use bedtools shell pipeline (faster) or precompute the per-rank reduce in chunks. The greedy loop is the canonical Corces 2018 algorithm; approximations via reduce() or disjoin() are NOT equivalent.

Cross-Organism / Cross-Build Consensus

When merging peaks across genome builds (hg19 -> hg38; human -> mouse for cross-species analysis):

# Lift over hg19 peaks to hg38
liftOver published_peaks.hg19.bed hg19ToHg38.over.chain.gz \
    published_peaks.hg38.bed published_peaks.unmapped.bed

# Then merge with current hg38 peaks
cat published_peaks.hg38.bed my_peaks.hg38.bed | \
    sort -k1,1 -k2,2n | \
    bedtools merge -i - > merged_consensus.bed

Trigger: Cross-cohort meta-analysis or comparing to published datasets on older builds.

Mechanism: Genome assemblies differ in coordinates, gap regions, alt-haplotypes; chain files (UCSC) provide the position translation but ~3-5% of regions fail liftover (especially complex regions, sex chromosomes, MHC).

Fix: Always document liftOver chain version; report unmapped fraction; for high-stakes cross-build comparisons, re-align rather than liftover.

For human -> mouse comparison, use synteny-based mapping (bnMapper or halLiftover) rather than chain liftOver because conservation is non-trivial.

Peak-Width Adaptive (When ATAC + Broad Histone Marks Co-analyzed)

Trigger: Combined ATAC narrow peaks + H3K27ac (broadPeak) or H3K4me1 in the same workflow.

Mechanism: Fixed 501 bp from ATAC under-counts H3K27ac which spreads across 2-5 kb domains. Counting H3K27ac reads in 501 bp windows misses most signal.

Fix: Build a dual-resolution consensus: 501 bp for ATAC narrow peaks; broadPeak union (2-5 kb) for H3K27ac. Run differential separately on each resolution. Or use a unified per-feature width derived from the broader signal.

ENCODE-rE2G Candidate Element Lists

Alternative to iterative-overlap: ENCODE-rE2G (atac-seq/enhancer-gene-linking) predicts scored enhancer-gene regulatory links for ENCODE cell types, prioritizing elements by predicted regulatory function rather than just ATAC signal strength (the genome-wide cCRE registry itself is a distinct product, SCREEN/ENCODE cCREs). Its predicted regulatory elements serve as an alternative consensus peakset for ML feature engineering. Trade-off: limited to ENCODE cell types; cell-type-specific elements may not align with the actual peakset of any specific dataset.

Counting Reads in Consensus Peaks

# Convert BED to SAF for featureCounts (BED start is 0-based half-open; SAF Start is 1-based inclusive, so +1)
awk 'BEGIN{OFS="\t"; print "GeneID","Chr","Start","End","Strand"}
     {print $1"_"$2"_"$3, $1, $2+1, $3, "+"}' consensus.bed > consensus.saf

featureCounts -F SAF -a consensus.saf \
    -o consensus_counts.tsv \
    -p --countReadPairs \
    -T 8 \
    sample1.bam sample2.bam sample3.bam

-p --countReadPairs is required for paired-end ATAC; -T 8 parallelizes across 8 threads. Output is the count matrix consumed by DESeq2 / edgeR / DiffBind.

Blacklist Filtering

Blacklist filtering belongs after consensus construction, before or alongside differential testing.

# ENCODE blacklist v2 (Amemiya 2019)
bedtools intersect -v -a consensus_final.bed -b hg38-blacklist.v2.bed.gz \
    > consensus_no_blacklist.bed
echo "After blacklist: $(wc -l < consensus_no_blacklist.bed) peaks"

Common Errors

| Error / symptom | Cause | Solution | |-----------------|-------|----------| | Differential analysis flags hundreds of peaks; biology is implausible | Variable-width consensus driving width effects | Switch to fixed-width (iterative or summits=250) | | Negative coordinates in fixed-width output | Peak summit too close to chromosome start | Clip with awk '$2 >= 0' or bedtools slop -g chrom.sizes | | featureCounts SAF complaint about non-unique GeneID | Duplicate peak names in SAF | Use chr_start_end as GeneID; ensure peakset is non-redundant | | Iterative overlap output much smaller than expected | Too many peaks within 500 bp; strict non-overlap | Try smaller half-width (e.g. 100) for higher resolution | | multiinter output empty | Peak files not sorted; or wrong chromosome naming | Sort all inputs first; confirm chr name consistency | | Per-condition union has 2x peaks of any per-condition | Strong condition-specific biology -- expected | Confirm with browser tracks; use this consensus | | DiffBind takes hours on consensus | All-by-all counting is N samples x M peaks | Use bParallel=TRUE (default DBA$config$RunParallel); limit cores via DBA$config$cores |

References

Corces MR et al 2018 Science 362:eaav1898 (Iterative overlap, fixed-width 501 bp consensus standard)
Stark R & Brown G 2011 DiffBind (Bioconductor; canonical reference; summits=250 parameter)
Quinlan AR & Hall IM 2010 Bioinformatics 26:841 (bedtools)
Liao Y et al 2014 Bioinformatics 30:923 (featureCounts)
Amemiya HM et al 2019 Sci Rep 9:9354 (ENCODE blacklist v2)
Li Q et al 2011 Ann Appl Stat 5:1752 (IDR framework)
ENCODE 4 ATAC-seq Standards (encodeproject.org/atac-seq)

Related Skills

atac-seq/atac-peak-calling - Generate per-replicate peaks
atac-seq/differential-accessibility - Use consensus for DA testing
atac-seq/atac-qc - Filter samples before consensus construction
atac-seq/single-cell-atac - Per-cluster consensus for pseudobulk DA
genome-intervals/bed-file-basics - bedtools operations on the consensus
genome-intervals/interval-arithmetic - merge / intersect / subtract patterns
chip-seq/peak-calling - Same consensus strategy applies to ChIP-seq

Version Compatibility

Before using code patterns, verify installed versions match. If versions differ:

Python: pip show <package> then help(module.function) to check signatures
R: packageVersion('<pkg>') then ?function_name to verify parameters
CLI: <tool> --version then <tool> --help to confirm flags

If code throws unexpected errors, introspect the installed package and adapt rather than retrying.

Consensus Peakset Construction

CLI: bedtools merge (simple union) and bedtools multiinter (per-sample membership columns emitted by default)
CLI: Corces 2018 iterative overlap removal (custom shell)
R: DiffBind::dba.count(summits=250) (built-in fixed-width)
Python: pybedtools for programmatic merging

The peakset choice is rarely default-correct. Wrong width or wrong overlap rule propagates to every downstream analysis (differential, motif, footprint, ML).

Why a Consensus Peakset Matters

For ENCODE-style differential analysis: ALL samples must be counted against the SAME peak coordinates; otherwise the count matrix is non-rectangular and statistical models are misspecified.

Strategy Taxonomy

Methodology evolves; verify against current ENCODE 4 ATAC standards (encodeproject.org/atac-seq) and the Corces 2018 iterative overlap algorithm before locking pipelines.

Iterative Overlap Removal (Corces 2018)

This is the modern standard for fixed-width consensus peaksets used in cross-study comparison and machine-learning feature matrices.

Algorithm:

Pool all peaks from all samples; re-center each on its summit; extend +/- 250 bp -> 501 bp fixed-width peaks.
Sort by significance (narrowPeak column 7, signalValue, descending).
Walk down the sorted list. Keep each peak if it does not overlap any previously kept peak. Drop if overlap.
Output the kept peaks as the consensus.

Goal: Produce a non-overlapping, fixed-width peakset weighted toward strongest evidence.

Approach: Greedy non-overlap on summit-centered fixed-width peaks ranked by signalValue.

#!/bin/bash
# Corces 2018 iterative overlap removal
PEAKS_DIR=peaks/per_sample
GENOME_SIZES=hg38.chrom.sizes
WIDTH_HALF=250                                       # 501 bp total width

# 1. Pool all peaks, re-center on summit, extend
awk -v w=$WIDTH_HALF 'BEGIN{OFS="\t"}
    {summit=$2+$10; print $1, summit-w, summit+w+1, $4, $7, $6}' \
    $PEAKS_DIR/*.narrowPeak | \
    awk '$2 >= 0' | \
    bedtools slop -i - -g $GENOME_SIZES -b 0 | \
    sort -k1,1 -k2,2n > pooled_recentered.bed

# 2. Sort by signalValue descending (column 5 in our BED -- which was column 7 of narrowPeak)
sort -k5,5gr pooled_recentered.bed > pooled_by_sig.bed

# 3. Iterative greedy non-overlap (process in significance order)
python3 - <<'EOF'
import sys
kept = []
with open('pooled_by_sig.bed') as f:
    for line in f:
        chrom, start, end, name, sig, strand = line.strip().split('\t')[:6]
        start, end = int(start), int(end)
        overlap = any(c == chrom and not (end <= s or start >= e) for c, s, e in kept)
        if not overlap:
            kept.append((chrom, start, end))

with open('consensus_iterative.bed', 'w') as f:
    for c, s, e in sorted(kept):
        f.write(f'{c}\t{s}\t{e}\n')
EOF

sort -k1,1 -k2,2n consensus_iterative.bed > consensus_final.bed
echo "Consensus peakset: $(wc -l < consensus_final.bed) fixed-width 501bp peaks"

Per-Strategy Failure Modes

Naive union -- Width-driven differential

Trigger: Using bedtools merge of all per-rep peaks; counting reads in merged intervals.

Symptom: Top differential peaks track peak width (mean width different by 100+ bp between conditions).

Fix: Use fixed-width strategy (Corces iterative or DiffBind summits=250). Never use merged variable-width peaks for differential.

Intersection (peak in all reps) -- Loses condition-specific biology

Trigger: Using bedtools multiinter -i ... (which emits per-sample membership columns by default) and requiring all samples.

Mechanism: Peaks present only in one condition fail intersection requirement and are excluded from the consensus, even though they are the biology of interest.

Symptom: Differential analysis returns near-zero significant peaks; the condition-specific peaks were filtered out before testing.

Fix: Use per-condition consensus, then union of consensus. Or majority rule with per-condition minOverlap.

Majority rule (DiffBind default-ish) -- Borderline peaks dropped

Trigger: dba.count(minOverlap=ceiling(N/2)) where N = total replicates and one condition has fewer reps.

Mechanism: A peak in 2/2 reps of cond1 but 0/3 reps of cond2 is in 2/5 = 40% < 50% -> dropped.

Fix: Compute consensus per condition first (e.g. peak in >= 2/3 reps), then union across conditions. This preserves condition-specific peaks.

Width-controlled extension -- Shifts off summit

Trigger: Extending peak to fixed width using midpoint ((start+end)/2).

Mechanism: Midpoint of the called peak is rarely at the actual summit; particularly for asymmetric peaks (TSS-flanking) or wide broadPeaks.

Fix: Use summit position from narrowPeak column 10 (start + summit_offset). If summit unavailable (broadPeak), use peak start + half median width as approximation.

IDR-filtered union -- Computational cost; threshold mismatch

Trigger: Running IDR for each rep pair, then unioning IDR-passed peaks.

Mechanism: IDR thresholds differ (true reps 0.05; pseudoreps 0.10). Mixing both into a union biases the consensus toward looser pseudorep peaks unless filtering carefully.

Fix: Decide upfront -- use only true-rep IDR (stricter, fewer peaks) OR pseudorep IDR (looser, more peaks). Don't mix.

Decision Tree by Goal

Reconciliation Across Methods

Operational rule: Document the strategy explicitly in methods. The strategy choice can shift downstream results by 30-50% in peak count and FDR. Reproducibility requires explicit specification.

Width-Re-centering Patterns

# pybedtools: re-center on summit and fix width to 501 bp
import pybedtools as pbt

def fix_width_recenter(narrowpeak_path, half_width=250, out_path='consensus.bed'):
    """Re-center each narrowPeak on its summit and fix width to 2 * half_width + 1 bp."""
    rows = []
    for line in open(narrowpeak_path):
        f = line.strip().split('\t')
        chrom = f[0]; start = int(f[1]); summit_off = int(f[9])
        signal = float(f[6])
        summit = start + summit_off
        rows.append((chrom, max(0, summit - half_width), summit + half_width + 1,
                     f[3], signal, f[5]))
    rows.sort(key=lambda r: -r[4])               # Sort by signalValue desc
    bt = pbt.BedTool([list(map(str, r)) for r in rows])
    return bt.saveas(out_path)

Per-Condition Consensus Then Union

# For each condition, build a within-condition consensus first
for cond in cond1 cond2; do
    bedtools multiinter -i <(sort -k1,1 -k2,2n ${cond}_rep1.narrowPeak) \
                            <(sort -k1,1 -k2,2n ${cond}_rep2.narrowPeak) \
                            <(sort -k1,1 -k2,2n ${cond}_rep3.narrowPeak) \
        -names rep1 rep2 rep3 | \
        awk '$4 >= 2 {print $1"\t"$2"\t"$3}' > ${cond}_consensus.bed
done

# Union across condition consensuses (preserves condition-specific peaks)
cat cond1_consensus.bed cond2_consensus.bed | \
    sort -k1,1 -k2,2n | \
    bedtools merge -i - > all_conditions_consensus.bed

This is the right pattern when conditions differ enough that a global majority-rule would discard condition-specific peaks.

R-native Fixed-Width Re-centering (GenomicRanges)

library(GenomicRanges); library(rtracklayer)

peaks <- import('peaks.narrowPeak')                    # GRanges
# trim() below only clamps ranges when seqlengths are set, so attach them from the genome index
chrom_sizes <- read.table('hg38.chrom.sizes', col.names=c('chrom', 'len'))
seqlengths(peaks) <- setNames(chrom_sizes$len, chrom_sizes$chrom)[seqlevels(peaks)]
peaks_summit <- peaks                                  # column 10 of narrowPeak is summit offset
start(peaks_summit) <- start(peaks) + peaks$peak       # move start to the summit position
end(peaks_summit) <- start(peaks_summit)               # collapse to 1 bp at the summit before centering
peaks_fixed <- resize(peaks_summit, width=501, fix='center')
peaks_fixed <- trim(peaks_fixed)                       # clamp to chromosome bounds (requires seqlengths, set above)

# Sort by signalValue (column 7) descending; greedy iterative non-overlap
peaks_sorted <- peaks_fixed[order(-peaks_fixed$signalValue)]
kept <- logical(length(peaks_sorted))
already_used <- GRanges()
for (i in seq_along(peaks_sorted)) {
    if (!any(overlapsAny(peaks_sorted[i], already_used))) {
        kept[i] <- TRUE
        already_used <- c(already_used, peaks_sorted[i])
    }
}
consensus <- peaks_sorted[kept]
export(consensus, 'consensus_iterative.bed')

Cross-Organism / Cross-Build Consensus

When merging peaks across genome builds (hg19 -> hg38; human -> mouse for cross-species analysis):

# Lift over hg19 peaks to hg38
liftOver published_peaks.hg19.bed hg19ToHg38.over.chain.gz \
    published_peaks.hg38.bed published_peaks.unmapped.bed

# Then merge with current hg38 peaks
cat published_peaks.hg38.bed my_peaks.hg38.bed | \
    sort -k1,1 -k2,2n | \
    bedtools merge -i - > merged_consensus.bed

Trigger: Cross-cohort meta-analysis or comparing to published datasets on older builds.

Fix: Always document liftOver chain version; report unmapped fraction; for high-stakes cross-build comparisons, re-align rather than liftover.

For human -> mouse comparison, use synteny-based mapping (bnMapper or halLiftover) rather than chain liftOver because conservation is non-trivial.

Peak-Width Adaptive (When ATAC + Broad Histone Marks Co-analyzed)

Trigger: Combined ATAC narrow peaks + H3K27ac (broadPeak) or H3K4me1 in the same workflow.

Mechanism: Fixed 501 bp from ATAC under-counts H3K27ac which spreads across 2-5 kb domains. Counting H3K27ac reads in 501 bp windows misses most signal.

ENCODE-rE2G Candidate Element Lists

Counting Reads in Consensus Peaks

# Convert BED to SAF for featureCounts (BED start is 0-based half-open; SAF Start is 1-based inclusive, so +1)
awk 'BEGIN{OFS="\t"; print "GeneID","Chr","Start","End","Strand"}
     {print $1"_"$2"_"$3, $1, $2+1, $3, "+"}' consensus.bed > consensus.saf

featureCounts -F SAF -a consensus.saf \
    -o consensus_counts.tsv \
    -p --countReadPairs \
    -T 8 \
    sample1.bam sample2.bam sample3.bam

-p --countReadPairs is required for paired-end ATAC; -T 8 parallelizes across 8 threads. Output is the count matrix consumed by DESeq2 / edgeR / DiffBind.

Blacklist Filtering

Blacklist filtering belongs after consensus construction, before or alongside differential testing.

# ENCODE blacklist v2 (Amemiya 2019)
bedtools intersect -v -a consensus_final.bed -b hg38-blacklist.v2.bed.gz \
    > consensus_no_blacklist.bed
echo "After blacklist: $(wc -l < consensus_no_blacklist.bed) peaks"

Common Errors

References

Corces MR et al 2018 Science 362:eaav1898 (Iterative overlap, fixed-width 501 bp consensus standard)
Stark R & Brown G 2011 DiffBind (Bioconductor; canonical reference; summits=250 parameter)
Quinlan AR & Hall IM 2010 Bioinformatics 26:841 (bedtools)
Liao Y et al 2014 Bioinformatics 30:923 (featureCounts)
Amemiya HM et al 2019 Sci Rep 9:9354 (ENCODE blacklist v2)
Li Q et al 2011 Ann Appl Stat 5:1752 (IDR framework)
ENCODE 4 ATAC-seq Standards (encodeproject.org/atac-seq)

Related Skills

atac-seq/atac-peak-calling - Generate per-replicate peaks
atac-seq/differential-accessibility - Use consensus for DA testing
atac-seq/atac-qc - Filter samples before consensus construction
atac-seq/single-cell-atac - Per-cluster consensus for pseudobulk DA
genome-intervals/bed-file-basics - bedtools operations on the consensus
genome-intervals/interval-arithmetic - merge / intersect / subtract patterns
chip-seq/peak-calling - Same consensus strategy applies to ChIP-seq

Adoption

GPTomics/bio-atac-seq-consensus-peakset

$ install --global

Security Scan Results

SKILL.md

Version Compatibility

Consensus Peakset Construction

Why a Consensus Peakset Matters

Strategy Taxonomy

Iterative Overlap Removal (Corces 2018)

Per-Strategy Failure Modes

Naive union -- Width-driven differential

Intersection (peak in all reps) -- Loses condition-specific biology

Majority rule (DiffBind default-ish) -- Borderline peaks dropped

Width-controlled extension -- Shifts off summit

IDR-filtered union -- Computational cost; threshold mismatch

Decision Tree by Goal

Reconciliation Across Methods

Width-Re-centering Patterns

Per-Condition Consensus Then Union

R-native Fixed-Width Re-centering (GenomicRanges)

Cross-Organism / Cross-Build Consensus

Peak-Width Adaptive (When ATAC + Broad Histone Marks Co-analyzed)

ENCODE-rE2G Candidate Element Lists

Counting Reads in Consensus Peaks

Blacklist Filtering

Common Errors

References

Related Skills

Related Skills

GPTomics/bio-workflows-clip-pipeline

GPTomics/bio-comparative-genomics-whole-genome-duplication

GPTomics/bio-comparative-genomics-whole-genome-alignment

GPTomics/bio-comparative-genomics-synteny-analysis

GPTomics/bio-atac-seq-consensus-peakset

$ install --global

Security Scan Results

SKILL.md

Version Compatibility

Consensus Peakset Construction

Why a Consensus Peakset Matters

Strategy Taxonomy

Iterative Overlap Removal (Corces 2018)

Per-Strategy Failure Modes

Naive union -- Width-driven differential

Intersection (peak in all reps) -- Loses condition-specific biology

Majority rule (DiffBind default-ish) -- Borderline peaks dropped

Width-controlled extension -- Shifts off summit

IDR-filtered union -- Computational cost; threshold mismatch

Decision Tree by Goal

Reconciliation Across Methods

Width-Re-centering Patterns

Per-Condition Consensus Then Union

R-native Fixed-Width Re-centering (GenomicRanges)

Cross-Organism / Cross-Build Consensus

Peak-Width Adaptive (When ATAC + Broad Histone Marks Co-analyzed)

ENCODE-rE2G Candidate Element Lists

Counting Reads in Consensus Peaks

Blacklist Filtering

Common Errors

References

Related Skills

Related Skills

GPTomics/bio-workflows-clip-pipeline

GPTomics/bio-comparative-genomics-whole-genome-duplication

GPTomics/bio-comparative-genomics-whole-genome-alignment

GPTomics/bio-comparative-genomics-synteny-analysis