liquid-biopsy/methylation-based-detection/SKILL.md
Detects cancer and infers tissue-of-origin from cfDNA methylation by choosing conversion chemistry (bisulfite vs EM-seq vs TAPS vs cfMeDIP), calling read-level methylation haplotypes rather than averaged beta values, and deconvolving a hematopoietic-dominated cfDNA mixture against a methylation atlas via NNLS/quadratic programming. Encodes the GRAIL/CCGA thesis that thousands of tissue-specific markers make methylation outperform sparse mutations for multi-cancer early detection (MCED) and localization, and that single concordantly-methylated fragments give ppm-level sensitivity. Uses MethylDackel for extraction (mbias-then-extract), MEDIPS/QSEA for enrichment data, scipy.optimize.nnls for deconvolution. Use when building an MCED or methylation-MRD assay, picking a conversion chemistry for low-input plasma, or deconvolving tissue-of-origin from cfDNA.
npx skillsauth add GPTomics/bioSkills bio-methylation-based-detectionInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: MethylDackel 0.6+, Bismark 0.24+, numpy 1.26+, pandas 2.2+, scipy 1.12+, statsmodels 0.14+
Before using code patterns, verify installed versions match. If versions differ:
pip show <package> then help(module.function) to check signatures<tool> --version then <tool> --help to confirm flagsIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
Notes specific to this skill: MethylDackel extract bedGraph column order is fixed (chrom / start / end / methylation-% rounded to integer / count-methylated / count-unmethylated); always run MethylDackel mbias first and feed its suggested --OT/--OB trimming into extract. cfMeDIP data are coverage, not conversion — do not feed them into per-CpG bisulfite pipelines.
"Detect cancer and find where it came from using cfDNA methylation" -> Call per-CpG (and read-level) methylation from converted plasma DNA, then deconvolve tissue-of-origin against a reference atlas.
MethylDackel extract for per-CpG methylation from bisulfite/EM-seq BAMsMethylDackel mbias to choose strand-specific trimming before extractionscipy.optimize.nnls for atlas-based tissue deconvolutionMethylation is the right altitude for multi-cancer early detection (MCED) and tissue-of-origin (TOO) in a single assay because the genome carries thousands of stable, cell-type-specific differentially methylated regions, whereas somatic mutations are sparse, recurrent only at a few driver loci, and carry no tissue label. On the same cfDNA inside CCGA, the methylation assay outperformed WGS-SNV/CNV approaches, which is why GRAIL down-selected to a targeted methylation panel (Liu 2020). One panel answers both "is there cancer?" and "where is it?" — mutations answer neither well.
The sensitivity engine is read-level, not site-level. Averaging beta across reads at a CpG discards phasing. A single tumor-derived fragment that is concordantly methylated across the k CpGs of a methylation haplotype block (Guo 2017) has a background probability of roughly p^k of arising from the hematopoietic ocean; for a block of 5-8 CpGs that is small enough that ONE such fragment is strong evidence, independent of tumor fraction. Per-site beta dilutes that signal into sampling noise and clonal-hematopoiesis variance and has essentially no power at parts-per-million tumor fraction. The correct primitive for detection is molecule counting over haplotype blocks, not site averaging.
The conversion step is chosen on the worst possible substrate — already-fragmented, low-input plasma DNA — so destructiveness is load-bearing, not a footnote.
| Method | Destructiveness | Min input | Base resolution | 5mC readout | Key bias / caveat | |--------|-----------------|-----------|-----------------|-------------|-------------------| | Bisulfite (WGBS/targeted) | Severe — depurinates/fragments, >90% loss possible | High (degradation eats low input) | Yes | C->T after conversion; remaining C = methylated | Complexity collapse on cfDNA; incomplete conversion -> false methylation; GC/coverage bias | | EM-seq (Vaisvila 2021) | Much gentler — enzymatic, no chemical fragmentation | Picograms demonstrated | Yes | Same C->T readout as bisulfite, milder | Reads 5mC+5hmC together unless separated; APOBEC over/under-deamination edge cases | | TAPS (Liu 2019) | Non-destructive — mild | Low / cfDNA-friendly | Yes | Direct: 5mC/5hmC -> T, unmethylated C untouched | Only a few % of Cs convert -> preserves complexity, lower seq cost; needs TET + pyridine borane | | cfMeDIP-seq (Shen 2018) | No conversion (antibody enrichment) | Very low (>=5-10 ng) | No — region/enrichment-level only | Antibody pulls down methylated fragments | CpG-density bias; no single-CpG quantitation; needs MEDIPS/QSEA density modeling |
Bisulfite is gold standard for cell-line gDNA, not for low-input fragmented plasma; EM-seq and TAPS exist precisely to recover ctDNA molecules bisulfite destroys. Neither bisulfite nor EM-seq separates 5mC from 5hmC without added oxBS/TAB steps; TAPS variants (TAPSbeta, CAPS) can split the marks. cfMeDIP coverage is enrichment, not quantitation — density bias must be modeled before any absolute-methylation claim.
| Scenario | Recommended | Why | |----------|-------------|-----| | MCED + tissue-of-origin in one assay | Targeted methylation panel + atlas deconvolution | Thousands of tissue markers carry both cancer and organ signal (Liu 2020; Loyfer 2023) | | Ultra-low-input plasma, genome-wide | cfMeDIP-seq | Antibody enrichment works at ng-to-low input where conversion destroys the library (Shen 2018) | | Need base resolution at low input | EM-seq or TAPS, not bisulfite | Gentle/non-destructive conversion preserves complexity bisulfite collapses (Vaisvila 2021; Liu 2019) | | MRD / ppm-level detection | Read-level haplotype counting over pre-defined blocks | One concordant fragment is decisive; averaged beta has no power at low tumor fraction (Guo 2017) | | Absolute methylation level from enrichment data | QSEA (Bayesian density + CNV + TMM) | Converts cfMeDIP coverage to BS-comparable values; MEDIPS gives differential coverage only (Lienhard 2017) |
Methodology evolves; verify current atlas versions and panel-marker coverage against live tool docs before committing — atlas markers are platform-specific and do not transfer across assays.
Goal: Produce per-CpG methylation calls from a bisulfite/EM-seq cfDNA BAM, with end-repair artifacts trimmed.
Approach: Run mbias first to read the suggested strand-specific trimming, then extract with --mergeContext and that trimming so each CpG is one row; parse the fixed 6-column bedGraph.
# Step 1: choose trimming. mbias prints a suggestion like --OT 2,0,0,98 and writes M-bias SVGs.
MethylDackel mbias ref.fa sample.bam sample_mbias
# Step 2: extract per-CpG (one row per CpG) with the suggested trimming.
MethylDackel extract ref.fa sample.bam -o sample --mergeContext --minDepth 1 --OT 2,0,0,98
# Output sample_CpG.bedGraph columns (fixed order):
# chrom start end methylation%(integer, rounded) count_methylated count_unmethylated
Add --CHG --CHH only to audit non-CpG methylation (a conversion-failure check); CpG is the default context. Low per-CpG depth gates can erase cfDNA signal — prefer region/molecule aggregation over a high --minDepth.
Goal: Attribute cfDNA to its cell types of origin and surface a solid-tissue coefficient elevated above the hematopoietic baseline.
Approach: Model the observed methylation vector m ~ A*w with atlas A (rows = markers, cols = cell types), solve for non-negative mixing fractions w with NNLS over atlas-covered markers, then renormalize so the fractions sum to one.
from scipy.optimize import nnls
def deconvolve_tissue(sample_beta, atlas):
'sample_beta: Series indexed by marker; atlas: DataFrame markers x cell_types.'
markers = sample_beta.index.intersection(atlas.index)
w, _ = nnls(atlas.loc[markers].values, sample_beta.loc[markers].values)
w = w / w.sum()
return dict(zip(atlas.columns, w))
The simplex constraint (w >= 0, sum w = 1) is mandatory — unconstrained regression gives nonsense fractions (Moss 2018). Use Loyfer 2023's fragment-level WGBS atlas (39 cell types from 205 healthy samples) where the assay covers its markers; a generic atlas does not transfer, because WGBS-fragment markers differ from 450K/EPIC probes and from capture-panel coverage.
Goal: Define a discriminating panel by finding regions (not single CpGs) that separate cancer from normal cfDNA.
Approach: Aggregate per-CpG beta into pre-defined regions/blocks, test cancer vs normal per region, and control FDR with Benjamini-Hochberg specified explicitly — naming method='fdr_bh' rather than relying on the statsmodels default ('hs', Holm-Sidak).
from scipy import stats
from statsmodels.stats.multitest import multipletests
def region_dmrs(cancer, normal, region_col='region'):
'cancer/normal: long DataFrames with [region_col, beta]; one row per sample-region.'
out = []
for region, c in cancer.groupby(region_col)['beta']:
n = normal.loc[normal[region_col] == region, 'beta'].dropna()
c = c.dropna()
if len(c) < 3 or len(n) < 3:
continue
_, p = stats.mannwhitneyu(c, n, alternative='two-sided')
out.append((region, c.mean() - n.mean(), p))
import pandas as pd
res = pd.DataFrame(out, columns=['region', 'delta_beta', 'pvalue'])
res['fdr'] = multipletests(res['pvalue'], method='fdr_bh')[1]
return res.sort_values('fdr')
Goal: Get density-corrected differential methylation from antibody-enrichment coverage rather than conversion data.
Approach: Use MEDIPS (Lienhard 2014) for CpG-density-corrected differential coverage, or QSEA (Lienhard 2017) when absolute, BS-comparable methylation levels are needed — QSEA adds a Bayesian CpG-density model, CNV correction, and TMM effective-library-size normalization. Both are R/Bioconductor; do not pass cfMeDIP coverage through MethylDackel.
Trigger: Reporting region beta means / per-CpG DMR t-tests for an MCED or MRD assay. Mechanism: Averaging across reads discards fragment-level concordance, the exact signal that lets one tumor fragment be called. Symptom: No power at low tumor fraction despite deep coverage. Fix: Count concordantly-methylated molecules over haplotype blocks; reserve beta for discovery and QC.
Trigger: Genome-wide per-CpG Welch t-tests with plain Benjamini-Hochberg. Mechanism: ~28M correlated CpGs violate BH independence (anticonservative) and per-site estimates are coverage-starved. Symptom: Inflated "DMR" lists that do not replicate. Fix: Region/block methods (dmrseq/metilene/methylKit windows) with permutation or correlation-aware FDR.
Trigger: WGBS on low-input plasma. Mechanism: Chemical depurination/fragmentation destroys input, collapsing unique molecules. Symptom: Low library complexity, duplicate-heavy, lost ctDNA molecules. Fix: EM-seq or TAPS; track conversion completeness via CHH methylation.
Trigger: Deconvolving with a mis-specified or unmatched hematopoietic reference. Mechanism: >90% of cfDNA is leukocyte/megakaryocyte-derived; reference error leaks variance into the small tumor term. Symptom: False or unstable TOO; a methylation analog of CHIP (clonal hematopoiesis/age/inflammation shifts the WBC methylome). Fix: Age/condition-matched background, fine-grained atlas, treat tumor as a small residual.
Trigger: Reading cfMeDIP coverage as methylation level, or running it through a per-CpG pipeline. Mechanism: Antibody enriches CpG-dense regions; there is no single-CpG quantitation. Symptom: Apparent hypermethylation tracking CpG density, not biology. Fix: Model density with MEDIPS coupling factor or QSEA's Bayesian model.
| Threshold | Source | Rationale |
|-----------|--------|-----------|
| Specificity 99.3%, sensitivity 54.9%, TOO 93% (among detected) | Liu 2020 Ann Oncol 31(6):745 | CCGA2 targeted-methylation operating point; screening fixes high specificity and accepts modest sensitivity |
| Specificity 99.5%, sensitivity 51.5%; stage I 16.8% -> IV 90.1%; TOO 88.7% | Klein 2021 Ann Oncol 32(9):1167 | CCGA3 clinical validation; sensitivity is stage- and tumor-type-dominated, never quote it as uniform |
| cfMeDIP input >= 5-10 ng | Shen 2018 Nature 563:579 | Enrichment works below conversion-assay input floors but still needs ng-scale material |
| Deconvolution: w >= 0 and sum w = 1 | Moss 2018 Nat Commun 9:5068; Loyfer 2023 Nature 613:355 | Simplex constraint mandatory; unconstrained regression yields nonsense fractions |
| Avoid high per-CpG --minDepth for cfDNA | MethylDackel docs; community | Per-CpG depth gates erase low-coverage cfDNA signal; aggregate over regions/molecules instead |
tools
--- name: bio-phasing-imputation-foundations description: Frames the phasing/imputation pipeline before any tool runs: phasing and imputation are one Li-Stephens copying HMM (recombination is the transition, mutation the emission, the genetic map and Ne set the rates), imputation's honest output is a dosage with a self-estimated quality (INFO/R2/DR2) not a hard genotype, and the stages are ordered and each fails silently (QC, align build and strand to the panel, phase, impute per chromosome, fil
tools
Chooses the enrichment generation before any tool runs, mapping the input shape to a method class - a pre-selected gene list plus a background to over-representation analysis (ORA, hypergeometric), a ranked statistic for all genes to gene set enrichment (GSEA), a signed signaling topology to pathway-topology (SPIA) - then making the null explicit (competitive vs self-contained, gene vs subject sampling) and running a trustworthiness checklist (testable-gene universe, FDR, redundancy collapse, leading-edge check, version reporting). Covers why every clusterProfiler GSEA is the inter-gene-correlation-uncorrected competitive null, why the background not the gene list decides ORA significance, and why no method is universally best. Use when deciding ORA vs GSEA vs topology, which gene-set DB, whether a result is trustworthy, or which null a tool computes. For ORA see go-enrichment, GSEA see gsea, databases kegg-pathways/reactome-pathways/wikipathways; the ranking comes from differential-expression/de-results.
testing
End-to-end GWAS workflow from VCF to association results. Covers PLINK QC, population structure correction, and association testing for case-control or quantitative traits. Use when running genome-wide association studies.
development
Orchestrates the full path from differential expression results to redundancy-collapsed functional enrichment: choose ORA vs GSEA, convert gene IDs per method, run enrichGO/enrichKEGG/enrichPathway/enrichWP or gseGO/gseKEGG (clusterProfiler, ReactomePA, rWikiPathways), and visualize. Routes the ORA-vs-GSEA generation fork and the null/universe/reproducibility theory to pathway-analysis/enrichment-foundations. Use when a DESeq2/edgeR/limma result must become enriched GO terms, KEGG/Reactome/WikiPathways pathways, or a GSEA leading edge; when deciding whether a ranking exists for all genes (GSEA, named decreasing vector) or only a pre-selected list (ORA plus a defensible background universe); or when assembling DE-to-pathway end to end. The DE list and ranking statistic come from differential-expression/de-results; per-method nuance lives in the pathway-analysis skills.