atac-seq/allele-specific-accessibility/SKILL.md
Detect allele-specific chromatin accessibility from ATAC-seq using WASP, GATK ASEReadCounter, or RASQUAL. Use when mapping cis-regulatory genetic variants from heterozygous SNPs, separating cis from trans regulation, building chromatin QTL (caQTL) maps, validating GWAS variant function with allelic imbalance, or detecting reference allele mapping bias before downstream analysis.
npx skillsauth add GPTomics/bioSkills bio-atac-seq-allele-specific-accessibilityInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: WASP 0.3.4+, GATK 4.4+, RASQUAL 1.1+, samtools 1.19+, bcftools 1.19+, vcftools 0.1.16+, plink 2.00+, MatrixEQTL 2.3+, QuASAR 0.1+, bowtie2 2.5+, bwa-mem2 2.2.1+.
Verify before use:
<tool> --version then <tool> --help to confirm flagspip show <package> then help(module.function) to check signaturespackageVersion('<pkg>') then ?function_name to verify parametersIf code throws unexpected errors, introspect the installed package and adapt rather than retrying.
"Does this heterozygous SNP affect chromatin accessibility on its allele?" -> Count ATAC reads supporting reference vs alternative allele at heterozygous sites in the same individual; significant deviation from 50:50 indicates cis-regulatory effect. Requires careful handling of reference-allele mapping bias (WASP filtering) and within-individual binomial testing.
WASP (Geijn 2015) for de-biased reference mappinggatk ASEReadCounter for allele-specific count tablesRASQUAL (Kumasaka 2016) for joint cis-mapping with allelic countsQuASAR (Harvey 2015) for genotype + ASE inference simultaneouslyASE/ASB analysis is fundamentally different from cohort-level differential. Statistical framework is binomial within-individual; sample size is the count of heterozygous SNPs in accessible regions, not the number of individuals.
| Tool | Method | Input | Strength | Fails when | |------|--------|-------|----------|------------| | WASP (Geijn 2015) | Realign reads where alt allele swap could change mapping; filter mapping-bias affected sites | BAM + VCF + reference | Mandatory for any ASE/ASB analysis; controls reference-allele bias | Slow on deep coverage; requires re-alignment step | | GATK ASEReadCounter | Count REF and ALT reads at known heterozygous sites | BAM + VCF | Mature; integrates with GATK ecosystem; standard counter | Doesn't fix mapping bias (needs WASP first); single-sample | | RASQUAL (Kumasaka 2016) | Joint cis-eQTL/caQTL model: total counts + allelic imbalance | BAM + VCF + peak counts | Best statistical power for caQTL when sample size is moderate (N=20-100); models genotype uncertainty | Complex setup; per-feature regression (slow); requires LD computation | | QuASAR (Harvey 2015) | Genotype + ASE inference from RNA-seq or ATAC alone | BAM (no VCF needed) | Useful when genotypes are limited; integrates phasing | Less accurate than WASP+GATK when genotypes are known | | MatrixEQTL on per-feature counts | Linear model fit on accessibility per peak | Genotypes + peak counts | Standard cohort-level caQTL; well-supported | No allelic imbalance information; needs sample size N >= 50 | | Allelic Imbalance from Bayesian models (MAJIQ-style) | Bayesian beta-binomial | BAM + VCF | Models overdispersion appropriately | Niche; less mature than WASP+GATK |
Methodology evolves; verify against current Geijn 2015, Kumasaka 2016, Buchkovich 2015 (review) before locking pipelines. Modern caQTL studies (Kumasaka 2019, Anderson 2024) typically combine WASP + RASQUAL at modest N or WASP + GATK + linear caQTL at large N.
When aligning reads to the reference genome, reads carrying the reference allele align with 0 mismatches; reads carrying the alternative allele have 1 mismatch and may fail to align (especially with bwa-mem -k or bowtie2 --very-sensitive thresholds). This inflates the apparent reference-allele frequency at every SNP and confounds ASE.
Trigger: Always (mapping bias is universal at heterozygous SNPs).
Mechanism: Aligners are mismatch-penalized. Without correction, ALT reads systematically under-align. Effect size: 1-5% bias toward reference at typical mismatch penalties.
Symptom: Per-SNP REF allele fraction skews above 50% even at sites without true allelic imbalance.
Fix: WASP. Pseudo-alleles every read at heterozygous sites; re-aligns swapped reads; keeps only reads that map identically to both haplotypes. Mandatory for any ASE/ASB analysis.
Goal: Remove reference-allele mapping bias before counting allele-specific ATAC reads.
Approach: Identify reads overlapping heterozygous SNPs, re-align allele-swapped versions, keep only reads consistent across both haplotypes, then count REF/ALT with GATK ASEReadCounter.
# WASP read-correction pipeline (Geijn 2015)
WASP_DIR=/path/to/WASP
PEAKS=peaks.bed # ATAC peaks for filtering
SAMPLE=sample1
OUT=$SAMPLE.wasp_filtered.bam
# 1. Find reads at SNP sites
python $WASP_DIR/mapping/find_intersecting_snps.py \
--is_paired_end \
--is_sorted \
--output_dir wasp_out/ \
--snp_dir snp_h5/ \
$SAMPLE.bam
# 2. Re-align flipped-allele reads
bowtie2 -x hg38_idx -1 wasp_out/$SAMPLE.remap.fq1.gz \
-2 wasp_out/$SAMPLE.remap.fq2.gz \
-S wasp_out/$SAMPLE.remap.sam
samtools view -bS wasp_out/$SAMPLE.remap.sam | samtools sort -o wasp_out/$SAMPLE.remap.bam
samtools index wasp_out/$SAMPLE.remap.bam
# 3. Keep only consistently-mapped reads
python $WASP_DIR/mapping/filter_remapped_reads.py \
wasp_out/$SAMPLE.to.remap.bam \
wasp_out/$SAMPLE.remap.bam \
wasp_out/$SAMPLE.kept.bam
# 4. Merge kept reads with non-overlapping reads
samtools merge $OUT \
wasp_out/$SAMPLE.kept.bam \
wasp_out/$SAMPLE.keep.bam
# After WASP filtering, GATK ASEReadCounter is safe
gatk ASEReadCounter \
-I $OUT \
-V heterozygous_snps.vcf \
-R hg38.fa \
-O $SAMPLE.ase_counts.tsv
Trigger: Running ASEReadCounter directly on a standard ATAC BAM without WASP filtering.
Mechanism: Reference allele over-counts due to alignment bias.
Symptom: Per-SNP reference fraction systematically > 0.5; aggregate plots show ~0.51-0.55 instead of 0.5.
Fix: WASP filter first, ALWAYS. There are no exceptions.
Trigger: Single-individual ATAC; per-SNP heterozygous coverage typically 10-100 reads.
Mechanism: Per-SNP binomial test has limited power; 10 reads at p=0.5 has 95% CI from 0.18 to 0.82 -- effectively no power for moderate effects.
Fix: Aggregate across many SNPs in the same peak (within-peak ASE); aggregate across replicates at same SNP; combine with cis-caQTL across cohort.
Trigger: Running RASQUAL without pre-computing LD matrix.
Mechanism: RASQUAL needs a per-feature linkage disequilibrium calculation; missing causes the joint model to fail.
Fix: Pre-compute LD with plink --ld-window-kb 5000 per region; supply via --correlation-bias.
Trigger: Using unphased genotypes for ASE.
Mechanism: ASE requires knowing which allele is on which haplotype to assign reads. Unphased het sites ambiguously assign reads.
Fix: Phase genotypes with SHAPEIT5, BEAGLE 5.4, or whatshap (read-based) before running WASP/ASE counter.
Trigger: MatrixEQTL on peak counts without ASE.
Mechanism: MatrixEQTL maps cohort-level associations; misses cis-mode that ASE captures within individual.
Symptom: Power to detect caQTL is low (typical N=50-100 cohort gives ~hundreds of caQTLs vs ASE-augmented can give thousands).
Fix: Use RASQUAL (joint total + ASE) when N <= 100; or combine MatrixEQTL with separate ASE per individual.
Trigger: Per-peak coverage < 30 reads at SNP site.
Mechanism: Binomial test power at p=0.5, n=30 yields detectable shifts only at |delta_p| >= 0.2.
Fix: Pool replicates if available; or restrict to peaks with sufficient coverage; or aggregate to per-individual peak-level ASE rather than per-SNP.
| Setting | Recommended pipeline | |---------|---------------------| | Single individual, ATAC + genotypes | WASP + GATK ASEReadCounter -> per-SNP and per-peak ASE; within-peak aggregation | | Cohort N >= 100, want caQTL | WASP + GATK + MatrixEQTL on peak counts; supplement with ASE for cis-effects | | Cohort N = 20-100 | WASP + RASQUAL (joint total + ASE) for max power | | Cohort with no genotypes | QuASAR (infers genotypes from data) | | Validating GWAS variant function | Look up het samples in cohort; aggregate ASE at the variant; ASB ratio | | Trios or quartets | Per-trio phasing then ASE per individual | | iPSC line genotype validation | Single-individual ASE at known SNPs |
Goal: Build cohort-level chromatin QTLs by combining WASP-corrected per-sample counts with cis-genotype association.
Approach: WASP-correct each BAM, build consensus peakset, count reads in peaks per sample, then test cis-genotype association via MatrixEQTL (cohort) or RASQUAL (joint total + allelic).
# 1. WASP-correct each individual's BAM
for sample in $(cat samples.txt); do
bash wasp_pipeline.sh $sample.bam $sample.vcf
done
# 2. Build consensus peakset (atac-seq/consensus-peakset)
# 3. Count reads in peaks per sample (featureCounts)
featureCounts -F SAF -a consensus.saf -o counts.tsv -p --countReadPairs *.wasp.bam
# 4. Cohort-level caQTL via MatrixEQTL
# (see R script in examples/)
# 5. Per-individual ASE (RASQUAL alternative)
# Parallel: gatk ASEReadCounter per sample, then merge for QuASAR meta-analysis
Goal: Boost per-SNP ASE power by pooling allele counts across heterozygous SNPs in the same peak.
Approach: Map each het SNP to its containing peak, sum REF and ALT counts per peak, run pooled binomial test against 50:50, apply BH FDR, and threshold on effect size.
import pandas as pd, numpy as np
from scipy import stats
# Per-SNP allele counts at heterozygous sites
ase = pd.read_csv('sample.ase_counts.tsv', sep='\t')
ase = ase.rename(columns={'refCount': 'REF', 'altCount': 'ALT'})
ase['totalCount'] = ase['REF'] + ase['ALT']
# Map each SNP to its containing peak
ase['peak'] = map_snps_to_peaks(ase, 'consensus_peaks.bed')
# Aggregate within peak: pooled binomial test
def peak_ase(group):
ref = group['REF'].sum()
total = group['totalCount'].sum()
if total < 30: return pd.Series({'ref_frac': np.nan, 'p_value': np.nan})
p = stats.binomtest(ref, total, p=0.5).pvalue
return pd.Series({'ref_frac': ref / total, 'p_value': p, 'snp_count': len(group)})
peak_ase_df = ase.groupby('peak').apply(peak_ase)
peak_ase_df['adj_p'] = stats.false_discovery_control(peak_ase_df['p_value'].dropna())
sig_ase = peak_ase_df[(peak_ase_df['adj_p'] < 0.05) & (abs(peak_ase_df['ref_frac'] - 0.5) >= 0.2)]
|ref_frac - 0.5| >= 0.2 is a 30:70 effect; smaller imbalances are detectable but biologically minor. Per-peak SNP count >= 2 strengthens the call.
RASQUAL uses a non-standard CLI: it reads the VCF from stdin via tabix and uses single-letter flags. The canonical invocation pattern is:
Goal: Combine total accessibility counts and allele-specific counts into one joint cis-caQTL test per peak.
Approach: Pre-build binary count and offset files via rasqualTools, then iterate per-feature, streaming the cis-window VCF through tabix into RASQUAL with feature coordinates and SNP counts.
# 1. Pre-compute genotype offsets and binary count files (rasqualTools R package)
# (see rasqualTools::saveRasqualMatricesAsBinary; produces .bin files for -y, -k)
# 2. Per-feature (peak) RASQUAL call: pipe tabix VCF in via stdin
# Per-line meaning: feature name, chromosome, start, end, n_total_SNPs, n_test_SNPs
while IFS=$'\t' read -r name chr start end n_all n_test; do
tabix cohort.vcf.gz $chr:$((start-500000))-$((end+500000)) | \
rasqual -y counts.bin \
-k offsets.bin \
-n $N_SAMPLES \
-j $FEATURE_INDEX \
-l $n_all -m $n_test \
-s $start -e $end \
-f $name \
> $name.rasqual.txt
done < features.tsv
-y is the binary count file; -k is the binary size-factor / offset file (both produced by rasqualTools::saveRasqualMatricesAsBinary from R); -j is the row index of the feature; -l/-m are total / test SNP counts in the feature window. RASQUAL does NOT use --features, --counts, --vcf flags; those are common in newer caQTL wrappers but not in stock RASQUAL.
For modern usage, the rasqualTools R wrapper (Kumasaka GitHub) handles this orchestration. Verify against rasqual --help because the flag set is unusual.
RASQUAL output includes joint p-values, total-only and ASE-only sub-tests; the joint test typically gains 1.5-3x power vs MatrixEQTL alone.
| Pattern | Likely cause | Action | |---------|--------------|--------| | GATK ASE shows REF > ALT systematically | WASP not run | Re-run with WASP filter; bias fixed | | RASQUAL joint p << ASE-only p | Cis effect dominated by allelic component | Confirms cis-regulatory mechanism | | ASE detected at SNP not in peak | Possible coding splice or 3' UTR effect | Check annotation; may not be regulatory | | Cohort caQTL doesn't replicate per-individual ASE | Trans effect or technical artifact | Consider trans-effect; or single-individual outliers |
Operational rule for high-confidence reporting: A cis-regulatory variant must show (a) WASP-filtered allelic imbalance with adjusted p < 0.05 and effect size >= 0.2, AND (b) cohort-level caQTL p < 1e-5 (or RASQUAL joint p < 1e-5), AND (c) accessibility peak overlap. Validation against MPRA or CRISPRi-FlowFISH increases confidence.
| Error / symptom | Cause | Solution | |-----------------|-------|----------| | Reference allele systematically over-represented | WASP not run | Mandatory WASP filter | | ASE per-SNP coverage too low | Sparse coverage; rare alleles | Aggregate across SNPs in peak; or filter SNPs with allele freq 0.1-0.9 | | RASQUAL crashes on small features | Per-feature LD missing | Pre-compute LD with plink | | caQTL replication low | Cohort-effect vs cis-effect confusion | Use RASQUAL joint or replicate ASE separately | | Phased vs unphased confusion | Different software expectations | Phase with SHAPEIT5/BEAGLE before any ASE | | GATK ASEReadCounter fails on multiallelic sites | Multi-allelic complications | Pre-filter VCF to biallelic only with bcftools | | WASP runs slow | Re-alignment step is dominant | Parallelize per-chromosome; or use samtools faidx + region-based parallelism |
testing
Analyze multi-modal single-cell data (CITE-seq, Multiome, spatial). Use when working with data that measures multiple modalities per cell like RNA + protein or RNA + ATAC. Use when analyzing CITE-seq, Multiome, or other multi-modal single-cell data.
data-ai
Analyze metabolite-mediated cell-cell communication using MeboCost for metabolic signaling inference between cell types. Predict metabolite secretion and sensing patterns from scRNA-seq data. Use when studying metabolic crosstalk between cell populations or metabolite-receptor interactions.
development
Find marker genes and annotate cell types in single-cell RNA-seq using Seurat (R) and Scanpy (Python). Use for differential expression between clusters, identifying cluster-specific markers, scoring gene sets, and assigning cell type labels. Use when finding marker genes and annotating clusters.
development
Reconstruct cell lineage trees from CRISPR barcode tracing or mitochondrial mutations. Use when studying clonal dynamics, cell fate decisions, or developmental trajectories.