methylation-analysis/methylation-calling/SKILL.md
Extracts per-cytosine methylation calls from aligned bisulfite/EM-seq reads with bismark_methylation_extractor (Bismark BAM) or the aligner-agnostic MethylDackel/BISCUIT (bwa-meth BAM), producing the beta value M/(M+U) as a coverage file, bedGraph, or genome-wide cytosine report across CpG/CHG/CHH context. Covers conversion-rate QC as the first gate, the 5mC vs 5hmC summed caveat, variant-aware calling so a C/T SNP does not masquerade as unmethylation, paired-end --no_overlap double-counting, symmetric CpG dyad collapse, and the 0-based vs 1-based coordinate trap. Use when extracting methylation levels from a bisulfite/EM-seq alignment, choosing an extractor for a non-Bismark BAM, QC-ing conversion efficiency, or producing coverage/cytosine-report input for testing. For long-read MM/ML modification calling see long-read-sequencing/nanopore-methylation; for the upstream BAM see bismark-alignment; for per-CpG statistics see differential-cpg-testing.
npx skillsauth add GPTomics/bioSkills bio-methylation-callingInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: Bismark 0.24+, MethylDackel 0.6+, samtools 1.19+.
Before using code patterns, verify installed versions match. If versions differ:
<tool> --version then <tool> --help to confirm flagspip show <package> then help(module.function) to check signaturesIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
The extractor must match the aligner: bismark_methylation_extractor reads Bismark's own XM call string and CANNOT process a bwa-meth BAM (no XM tag); MethylDackel and BISCUIT recompute the call from BAM + reference and are aligner-agnostic. The genome FASTA build (hg38 vs T2T-CHM13) defines every coordinate downstream, and BISCUIT/Bis-SNP need a SNP-aware reference workflow. Always run <tool> --help on the installed build before quoting a default; MethylDackel --maxVariantFrac and EM-seq end-trim recommendations have no universal value.
"Get methylation levels from my bisulfite BAM" -> Count converted-vs-unconverted bases at each reference cytosine and divide - because a methylation level is a per-cytosine COUNT RATIO, not a measurement, and every number is hostage to conversion completeness, SNPs, overlap, and the 5mC/5hmC conflation.
bismark_methylation_extractor -p --comprehensive --bedGraph --cytosine_report --genome_folder genome/ sample_pe.bamMethylDackel extract --mergeContext ref.fa sample.bamScope: per-cytosine M/U counting from a short-read bisulfite/EM-seq/TAPS alignment, the conversion/M-bias/variant QC gates, and the coverage/cytosine-report handoff. Native long-read MM/ML modification calling -> long-read-sequencing/nanopore-methylation. The aligned BAM -> bismark-alignment. Per-CpG statistics on the counts -> differential-cpg-testing. Region calling -> dmr-detection. Array (EPIC/450K) beta-from-intensity is a different readout entirely and is not this skill.
The extractor does not measure methylation. Bisulfite/EM-seq turns an epigenetic state into a sequence state (an unmethylated C converts to T; a methylated C stays C), and the extractor tallies, at each reference cytosine, M reads that show C and U reads that show T, then reports beta = M / (M + U). Every beta is therefore hostage to four upstream choices the extractor cannot fix:
Organize the analysis around defending these four (conversion QC -> overlap -> M-bias -> variant-awareness -> what context/chemistry am I even calling), not around listing flags. And one deeper caveat: beta is itself a lossy average - collapsing reads to a .cov matrix discards the read-level epiallele, the strand (hemimethylation), and the allele (ASM); see the last sections.
| Tool | Citation | Mechanism / role | When |
|------|----------|------------------|------|
| bismark_methylation_extractor | Krueger & Andrews 2011 Bioinformatics 27:1571 | reads Bismark's own XM call string; ecosystem standard | input is a Bismark BAM; want the genome-wide cytosine report / methylKit .cov; plant --CX |
| MethylDackel | github.com/dpryan79/MethylDackel (no journal) | recomputes calls from BAM + reference; fast; aligner-agnostic | bwa-meth BAM; want built-in M-bias --OT/--OB bounds and --maxVariantFrac SNP guard |
| BISCUIT | github.com/huishenlab/biscuit (no journal) | aligner + JOINT methylation/SNP/ASM caller; VCF + epiBED | human population / allele-specific work needing SNP-aware betas |
| Bis-SNP | Liu 2012 Genome Biol 13:R61 | Bayesian joint genotype + methylation | the original SNP-aware caller; mask C/T SNPs from betas |
| methylpy | github.com/yupenghe/methylpy (no journal) | allc format + binomial methylated-flag | the allc / ALLCools ecosystem; built-in per-site significance |
| asTair | bitbucket.org/bsblabludwig/astair (no journal) | polarity-aware caller (mCtoT / CtoT) | TAPS data (inverted polarity); explicit per-context stats |
EM-seq extraction arithmetic is identical to bisulfite (unmodified C reads as T either way) - the same Bismark/MethylDackel command works. TAPS INVERTS the polarity (modified C->T), so a bisulfite extractor reports 1-beta; use ASTAIR --mod_mapping mCtoT.
| Scenario | Recommended | Why |
|----------|-------------|-----|
| Bismark-aligned BAM, mammalian CpG | bismark_methylation_extractor -p --comprehensive --cytosine_report | reads the XM tag; cytosine report is the bsseq/methylKit input |
| bwa-meth-aligned WGBS/EM-seq | MethylDackel extract --mergeContext | bismark extractor cannot read a bwa-meth BAM (no XM tag) |
| Human population / polymorphic CpGs / ASM | BISCUIT (or Bis-SNP) | joint SNP+methylation; C/T SNP recognized as a variant, not unmethylation |
| Plant methylome (CHG/CHH real) | --CX (Bismark) / --CHG --CHH (MethylDackel) + lambda spike-in | non-CpG is biology; CHH-as-conversion-proxy fails |
| TAPS data | ASTAIR --mod_mapping mCtoT | polarity inverted; a bisulfite extractor reports 1-beta |
| Need 5hmC resolved from 5mC | second chemistry (oxBS/TAB/ACE) -> route, not a flag | WGBS/EM-seq sums them; the extractor cannot split |
| Read-level heterogeneity / clonality / cfDNA | keep reads (epiread/epiBED), do NOT collapse to .cov | the epiallele dies in the per-CpG average |
| Long-read modBAM with MM/ML tags | -> long-read-sequencing/nanopore-methylation | native modification tags, not conversion counting |
Goal: Reject a methylome whose betas are inflated by incomplete conversion before computing anything.
Approach: Measure non-conversion globally, two ways. In mammals, genome-wide CHH methylation is near-zero biology, so the observed CHH rate IS the non-conversion rate (~2% CHH implies ~98% conversion). Where non-CpG methylation is real (plants, ESCs, neurons - Schultz 2015), spike in unmethylated lambda phage DNA and align to it: all its cytosines are unmethylated, so its observed "methylation" is the non-conversion rate. Require conversion >=99% (community convention; Shirane 2013 is a representative source). Extract CHH at least once even for a CpG-only mammalian study, purely for this number.
# Bismark: extract all contexts, then read the CHH methylation rate from the splitting report
bismark_methylation_extractor -p --comprehensive --CX --genome_folder genome/ sample_pe.bam
# The *_splitting_report.txt prints C methylated in CHH context %; (100 - that) ~ conversion efficiency in mammals.
# Lambda spike-in (the gold standard, mandatory in plants/ESC/neurons):
# align to the lambda genome, extract, and treat its global methylation as the non-conversion rate.
bismark --genome lambda_genome/ -1 R1.fq.gz -2 R2.fq.gz -o lambda_qc/
MethylDackel --minConversionEfficiency can additionally drop individual reads whose own conversion (from non-CpG Cs) is too low - a per-read complement to the global gate, not a replacement for it.
bismark_methylation_extractor \
-p \ # paired-end (--no_overlap is ON BY DEFAULT for -p)
--comprehensive \ # merge OT/OB/CTOT/CTOB into one file per context
--bedGraph --cytosine_report \ # coverage/bedGraph + genome-wide every-C report
--genome_folder genome/ \ # MANDATORY for --cytosine_report (scans the FASTA for all Cs)
--ignore_r2 2 \ # trim R2 5' end-repair artifact (set from M-bias; near-universal for EM-seq/PBAT)
--parallel 4 --gzip \
-o methylation/ sample_pe.bam
# Outputs: sample_pe.bismark.cov.gz (1-BASED), sample_pe.bedGraph.gz (0-BASED), sample_pe.CpG_report.txt.gz (1-BASED).
Collapse the symmetric CpG dyad with coverage2cytosine --merge_CpG (NOT --merge_non_CpG, which merges the CHG+CHH files). --merge_CpG adds the + strand C at position p and the - strand C at p+1 into one dyad entry, doubling effective coverage; it is CpG-only and incompatible with --CX.
MethylDackel mbias ref.fa sample.bam mbias_prefix # inspect the SVGs; it SUGGESTS --OT/--OB bounds (do NOT accept blindly)
MethylDackel extract \
--mergeContext \ # collapse the symmetric CpG dyad (the MethylDackel equivalent of --merge_CpG)
--maxVariantFrac 0.25 \ # exclude a C if the opposite-strand non-G fraction exceeds this (cheap SNP guard)
--OT 3,0,0,98 --OB 3,0,0,98 \ # inclusion bounds from mbias: first/last bp to keep on read1,read2
ref.fa sample.bam
# Output sample_CpG.bedGraph is 0-BASED, half-open: chr start end round(%meth) count_M count_U. CpG ONLY by default;
# add --CHG --CHH for plants. Default --minDepth 1, -q (MAPQ) 10, -p (Phred) 5.
A beta is a binomial proportion: at coverage n its granularity is 1/n and its SE is ~sqrt(beta(1-beta)/n). At n=1 beta is only 0 or 1; at n=4 it lands on {0,.25,.5,.75,1}. A 1/2 site and a 50/100 site both read beta=0.5 but carry wildly different evidence. The extractors impose no biological floor (MethylDackel --minDepth and Bismark .cov both default to >=1). Do NOT pre-threshold to a single beta and t-test it - that discards the coverage information. Hand the COUNT data (M and total) forward; the downstream DMR callers (DSS/methylKit/bsseq) model the beta-binomial and USE n. A common per-CpG floor is >=10x AFTER symmetric merge, but it is a tradeoff (site count vs precision), not a magic constant.
Standard bisulfite AND standard EM-seq protect BOTH 5mC and 5hmC from conversion, so the "methylated" bin is 5mC+5hmC, never 5mC alone. Worse, bisulfite DEAMINATES the downstream TET-oxidation products 5fC and 5caC, so they read as T and land in the "unmethylated" bin - both bins are biochemically impure at TET-active loci (ESC, early embryo, neurons, some tumors). The full cascade is 5mC -> 5hmC -> 5fC -> 5caC. Resolving any derivative requires a SECOND wet-lab chemistry, not an extractor flag: 5hmC via oxBS-seq (Booth 2012; 5hmC = BS - oxBS by subtraction), TAB-seq (Yu 2012; direct 5hmC), or ACE-seq (Schutsky 2018; enzymatic, low-input); the bisulfite-free TAPS (Liu 2019) sidesteps the harsh chemistry but inverts polarity. Never let a plain WGBS/EM-seq beta be labeled "5mC" or its complement "unmodified C."
beta=0.5 is consistent with three opposite biologies that beta cannot distinguish: a 50/50 mixture of fully-methylated and fully-unmethylated cells, every cell ~50% methylated with CpGs scattered differently per molecule (stochastic disorder), or a true uniform intermediate. The discriminator lives in the JOINT CpG pattern on a single read - intramolecular co-methylation - which the per-CpG average integrates out. Read-level metrics (PDR, Landau 2014; epipolymorphism, Landan 2012; methylation entropy; MHL for cfDNA tissue-of-origin, Guo 2017; FDRP/qFDRP) measure clonal/epigenetic instability and power liquid-biopsy deconvolution, and they require a READ-PRESERVING format (BISCUIT epiread/epiBED or the raw BAM) plus tools like Metheor or methclone - none of which are extractors. The moment the pipeline collapses to a .cov beta matrix, the epiallele is gone. If the question is heterogeneity, clonality, or cfDNA deconvolution, do NOT collapse to beta; this deeper analysis may warrant its own workflow, but at minimum the calling stage must flag that beta discards it.
Hemimethylation (the strand axis). --merge_CpG / --mergeContext is not a neutral coverage optimization: it bakes in the assumption that the two strands of a dyad agree and silently zeroes the hemimethylation channel (one strand methylated, the other not). Hemimethylation is real biology - the obligate post-replication maintenance intermediate, and a stable heritable mark at CTCF/cohesin sites required for chromatin looping (Xu & Corces 2018). Default destranding ON for bulk symmetric-CpG DMR work; turn it OFF (strand-specific extraction) and budget high per-strand coverage the moment strand asymmetry is the question. Per-molecule dyad state needs hairpin-bisulfite (Laird 2004).
Allele-specific methylation (the allele axis). The same C/T SNP that corrupts a beta (insight #2) becomes the measurement axis once reads are phased: assign each read's methylation to the SNP allele it carries and compare beta per allele (BISCUIT epiread -B snps.bed then biscuit asm; or Bis-SNP). The headline trap: a stable ~50% beta at a KNOWN imprinted control region (H19/IGF2, KCNQ1OT1, SNRPN, GNAS, MEG3) is NOT intermediate methylation and NOT a conversion artifact - it is two superimposed monoallelic states (one allele ~100%, one ~0%) averaged into a deceptive midpoint, a signature to PHASE, not a value to model. Sequence-dependent ASM is the mQTL bridge to causal-genomics/mendelian-randomization; produce the allele-phased betas here, do the colocalization there.
Trigger: reporting betas without checking CHH rate or a lambda spike-in. Mechanism: an unconverted unmethylated C is identical to a methylated C; non-conversion inflates every beta. Symptom: uniformly elevated methylation, fabricated low-methylation regions. Fix: require >=99% conversion (CHH-proxy in mammals; lambda spike-in in plants/ESC/neurons).
Trigger: bismark_methylation_extractor on a non-Bismark BAM. Mechanism: it reads Bismark's XM tag, absent from bwa-meth output. Symptom: error or empty/garbage calls. Fix: use MethylDackel or BISCUIT, which recompute from BAM + reference.
Trigger: extracting human population data with no variant-awareness. Mechanism: a T allele at a reference C is scored as converted. Symptom: spurious hypomethylation at polymorphic CpGs; SNP-driven false DMRs. Fix: BISCUIT/Bis-SNP joint calling, or MethylDackel --maxVariantFrac, or mask dbSNP/sample-VCF CpGs.
Trigger: single-end mode on paired data, or a non-default tool. Mechanism: the R1/R2 overlap counts one molecule twice. Symptom: inflated coverage, biased beta on mate disagreement. Fix: -p (Bismark --no_overlap is on by default for paired-end); MethylDackel handles overlap automatically.
Trigger: extracting without inspecting the M-bias plot. Mechanism: end-repair fill-in introduces unmethylated Cs at fragment ends; the per-position methylation deviates near read ends. Symptom: a non-flat M-bias curve; biased calls in the trimmable region. Fix: trim with --ignore/--ignore_r2 (Bismark) or --OT/--OB bounds (MethylDackel). The diagnostic is FLATNESS/positional stability, not any particular global level.
Trigger: joining a MethylDackel bedGraph (0-based) to a Bismark .cov (1-based). Mechanism: the two formats index differently. Symptom: every site shifts by one; strands silently mismatch. Fix: pick one format end-to-end or convert explicitly; the cytosine report and allc are 1-based, both bedGraphs are 0-based.
Trigger: using --merge_non_CpG to merge the symmetric CpG strands. Mechanism: --merge_non_CpG merges the CHG+CHH output files, NOT the CpG dyad. Symptom: strand-specific CpG report unchanged; CpG coverage not doubled. Fix: symmetric CpG collapse is coverage2cytosine --merge_CpG / MethylDackel --mergeContext.
| Threshold | Source | Rationale |
|-----------|--------|-----------|
| Conversion efficiency >=99% | Shirane 2013 PLoS Genet 9:e1003439; community | below it, residual unconverted Cs inflate every beta |
| CHH rate ~= (1 - conversion) in mammals | Schultz 2015 Nature 523:212 | true mammalian CHH is near-zero, so it reads out non-conversion |
| Per-CpG coverage >=10x (after merge) | community | granularity 1/n; SE of an intermediate beta ~0.16 at 10x |
| --ignore_r2 ~2 for EM-seq/PBAT | M-bias plot | R2 5' end-repair fill-in adds unmethylated Cs; set from the plot |
| MethylDackel -q 10 / -p 5 | MethylDackel defaults | MAPQ/base-quality minima; defaults, confirm with --help |
| --maxVariantFrac no universal value | MethylDackel docs | set per-experiment against known-SNP density |
| --merge_CpG doubles dyad coverage | Bismark docs | + and - strand of a symmetric CpG are co-methylated; merge then threshold |
| Error / symptom | Cause | Solution |
|-----------------|-------|----------|
| Empty/garbage calls from bismark extractor | bwa-meth BAM (no XM tag) | use MethylDackel/BISCUIT |
| Uniformly high methylation | incomplete conversion | check CHH rate / lambda; require >=99% |
| Spurious hypomethylation at SNPs | C/T SNP read as conversion | --maxVariantFrac; BISCUIT/Bis-SNP; mask dbSNP |
| Inflated coverage on overlaps | single-end mode on paired data | -p; MethylDackel handles it automatically |
| Sites shifted by one after a join | 0-based vs 1-based format mix | reconcile coordinate base; one format end-to-end |
| Plant CHH/CHG methylome missing | CpG-only default | --CX (Bismark) / --CHG --CHH (MethylDackel) |
| coverage2cytosine: option --merge_CpG ignored | used --merge_non_CpG instead, or paired with --CX | --merge_CpG is CpG-only, incompatible with --CX |
| Inverted (1-beta) methylome | bisulfite extractor on TAPS data | ASTAIR --mod_mapping mCtoT |
tools
--- name: bio-phasing-imputation-foundations description: Frames the phasing/imputation pipeline before any tool runs: phasing and imputation are one Li-Stephens copying HMM (recombination is the transition, mutation the emission, the genetic map and Ne set the rates), imputation's honest output is a dosage with a self-estimated quality (INFO/R2/DR2) not a hard genotype, and the stages are ordered and each fails silently (QC, align build and strand to the panel, phase, impute per chromosome, fil
tools
Chooses the enrichment generation before any tool runs, mapping the input shape to a method class - a pre-selected gene list plus a background to over-representation analysis (ORA, hypergeometric), a ranked statistic for all genes to gene set enrichment (GSEA), a signed signaling topology to pathway-topology (SPIA) - then making the null explicit (competitive vs self-contained, gene vs subject sampling) and running a trustworthiness checklist (testable-gene universe, FDR, redundancy collapse, leading-edge check, version reporting). Covers why every clusterProfiler GSEA is the inter-gene-correlation-uncorrected competitive null, why the background not the gene list decides ORA significance, and why no method is universally best. Use when deciding ORA vs GSEA vs topology, which gene-set DB, whether a result is trustworthy, or which null a tool computes. For ORA see go-enrichment, GSEA see gsea, databases kegg-pathways/reactome-pathways/wikipathways; the ranking comes from differential-expression/de-results.
testing
End-to-end GWAS workflow from VCF to association results. Covers PLINK QC, population structure correction, and association testing for case-control or quantitative traits. Use when running genome-wide association studies.
development
Orchestrates the full path from differential expression results to redundancy-collapsed functional enrichment: choose ORA vs GSEA, convert gene IDs per method, run enrichGO/enrichKEGG/enrichPathway/enrichWP or gseGO/gseKEGG (clusterProfiler, ReactomePA, rWikiPathways), and visualize. Routes the ORA-vs-GSEA generation fork and the null/universe/reproducibility theory to pathway-analysis/enrichment-foundations. Use when a DESeq2/edgeR/limma result must become enriched GO terms, KEGG/Reactome/WikiPathways pathways, or a GSEA leading edge; when deciding whether a ranking exists for all genes (GSEA, named decreasing vector) or only a pre-selected list (ORA plus a defensible background universe); or when assembling DE-to-pathway end to end. The DE list and ranking statistic come from differential-expression/de-results; per-method nuance lives in the pathway-analysis skills.