methylation-analysis/methylkit-analysis/SKILL.md
--- name: bio-methylation-methylkit description: Imports Bismark coverage or cytosine-report files into the methylKit object model, then runs the import-to-results spine - filterByCoverage, normalizeCoverage, unite/destrand, calculateDiffMeth, getMethylDiff - for both per-CpG (DMC) and fixed-tile (DMR) differential methylation, plus tileMethylCounts, PCA/correlation/clustering QC, and assocComp/removeComp batch handling. Covers the silent default traps that shape the false-positive rate: overdis
npx skillsauth add GPTomics/bioSkills methylation-analysis/methylkit-analysisInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: methylKit 1.28+, GenomicRanges 1.54+.
Before using code patterns, verify installed versions match. If versions differ:
packageVersion('<pkg>') then ?function_name to verify parametersIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
The assembly= string (e.g. hg38) is metadata only - methylKit never checks it. The genome build is real elsewhere: coordinates must match the alignment genome, and annotation packages (TxDb.Hsapiens.UCSC.hg38.knownGene, annotatr build_annotations(genome='hg38')) are genome-build-specific. methylKit's overdispersion, test, and adjust defaults have shifted across Bioconductor releases - run ?calculateDiffMeth on the installed build before trusting any default.
"Analyze methylation across my samples" -> Import per-cytosine counts into a methylRawList, then filter, normalize, unite, and test - because each of those steps is a modeling decision that sets which CpGs survive and how many false positives the test emits, not boilerplate.
methRead(pipeline='bismarkCoverage') -> filterByCoverage() -> normalizeCoverage() -> unite(destrand=) -> calculateDiffMeth(overdispersion='MN') -> getMethylDiff()Scope: the methylKit OBJECT MODEL and the short-read bisulfite import-to-results workflow, for BOTH per-CpG (DMC) and fixed-tile (DMR) results. Which per-site test to use (count vs continuous, beta vs M) -> differential-cpg-testing. Selection-aware region callers (dmrseq/DSS/metilene) and region FDR -> dmr-detection. Long-read MM/ML modBAM input -> long-read-sequencing/nanopore-methylation (its counts pipe back into this object model).
Coverage filtering, normalization, destranding, and the overdispersion model are not setup before the "real" test - they ARE the test. Each silently changes which CpGs exist and what the p-value means, and methylKit's defaults are tuned for nothing in particular. Three corollaries every misuse violates:
calculateDiffMeth defaults to overdispersion='none' - a plain logistic LRT that assumes binomial-only variance and over-calls under biological replication. Two healthy replicates differ at a CpG far more than coin-flip sampling predicts; only overdispersion='MN' adds the between-replicate (beta-binomial) layer. The paper discusses overdispersion; the function does not apply it unless told.overdispersion='MN' automatically switches to the F-test, so a passed test='Chisq' is ignored with no warning. adjust defaults to SLIM (methylKit's own q-method), not BH, so counts are not comparable to a DSS/limma BH analysis. tileMethylCounts(cov.bases=0) lets a one-CpG window become a "region." None of these throw an error; the result just quietly changes.Organize the workflow around defending these, not around calling functions in order.
Run these in order. Skipping or reordering them changes the result silently.
Goal: Load per-cytosine counts into a methylRawList, choosing the parser that matches the Bismark output format.
Approach: pipeline='bismarkCoverage' reads .cov/.cov.gz (chr/start/end/%meth/numC/numT - NO strand, so destranding is limited); pipeline='bismarkCytosineReport' reads the CX/CpG report (carries strand + context, enables proper destranding). treatment is an integer vector (0/1, or 0/1/2 for multi-group). context='CpG' only - never destrand CHG/CHH downstream.
library(methylKit)
file_list <- list('ctrl1.cov.gz', 'ctrl2.cov.gz', 'treat1.cov.gz', 'treat2.cov.gz')
sample_ids <- list('ctrl_1', 'ctrl_2', 'treat_1', 'treat_2')
meth_obj <- methRead(file_list, sample.id=sample_ids, treatment=c(0,0,1,1),
assembly='hg38', context='CpG', pipeline='bismarkCoverage')
# dbtype='tabix', save.db=TRUE gives disk-backed methylRawDB objects for large WGBS
Goal: Drop unreliable and artifactual CpGs per sample, then remove library-size-driven coverage differences so a deeper sample does not look more "confident."
Approach: filterByCoverage(lo.count, hi.perc) removes the noisy low tail and the artifactual high tail; normalizeCoverage scales coverage between samples. Both are per-sample and must precede unite and tileMethylCounts.
meth_filt <- filterByCoverage(meth_obj, lo.count=10, lo.perc=NULL, hi.count=NULL, hi.perc=99.9)
meth_norm <- normalizeCoverage(meth_filt, method='median')
Goal: Build the per-base table of CpGs covered across samples for testing.
Approach: unite keeps CpGs covered in ALL samples; min.per.group=2L relaxes that to >=2 per group (keeps more sites, allows missingness). destrand=TRUE merges the + and - strand counts of a CpG dyad - valid ONLY for symmetric CpG context AND only meaningful when strand is present (cytosine report). On .cov (bismarkCoverage, no strand) destranding is limited; on CHG/CHH it is wrong.
meth_united <- unite(meth_norm, destrand=TRUE) # destrand only if strand info present
meth_united <- unite(meth_norm, min.per.group=2L) # allow missingness across replicates
Goal: Confirm samples cluster by biology, not by batch, before believing any DMC.
Approach: Run correlation, PCA, and clustering on the united (% methylation) object. A control clustering with the treated group, or PC1 tracking sequencing batch, means the contrast is confounded.
getCorrelation(meth_united, plot=TRUE)
PCASamples(meth_united)
clusterSamples(meth_united, dist='correlation', method='ward.D', plot=TRUE)
Goal: Test each CpG for a group difference using a model that accounts for between-replicate overdispersion.
Approach: With replicates, set overdispersion='MN', which automatically uses the F-test (the test= argument is then ignored - passing test='Chisq' alongside MN does not produce a chi-square test). Set adjust='BH' if the q-values must be comparable to other tools; the default SLIM is methylKit-specific. getMethylDiff filters by effect size AND q.
diff_meth <- calculateDiffMeth(meth_united, overdispersion='MN', adjust='BH', mc.cores=4)
dmcs <- getMethylDiff(diff_meth, difference=25, qvalue=0.01) # all DMCs
dmcs_hyper <- getMethylDiff(diff_meth, difference=25, qvalue=0.01, type='hyper')
# positive meth.diff = hyper in the higher-treatment group
Goal: Aggregate CpGs into fixed windows for a quick region-level scan.
Approach: Tile AFTER filter/normalize, raise cov.bases so a window needs real CpG support, then flow through the same unite -> calculateDiffMeth -> getMethylDiff path. The same getMethylDiff returns DMCs on a per-base object and (window) DMRs on a tiled object.
tiles <- tileMethylCounts(meth_norm, win.size=1000, step.size=1000, cov.bases=3) # cov.bases>=3
tiles_united <- unite(tiles, destrand=FALSE) # tiles are not strand objects
diff_tiles <- calculateDiffMeth(tiles_united, overdispersion='MN', adjust='BH', mc.cores=4)
dmrs <- getMethylDiff(diff_tiles, difference=25, qvalue=0.01)
The per-tile q is a per-test SLIM/BH value: it does NOT model correlation between tiles and is NOT corrected for the region-selection step. Fixed windows also split or merge true DMRs at arbitrary boundaries. Treat methylKit tiles as a defensible screen; for rigorous region FDR (a permutation null that survives region selection) go to dmr-detection (dmrseq).
methylKit's calculateDiffMeth is a single-factor 2-group test. For known batch, remove the associated principal components before testing; for >2 groups, subset to pairwise contrasts. Complex designs (covariates, multi-factor) exceed what methylKit models - move to dmr-detection (DSS multiFactor / dmrseq covariates) or a continuous limma-on-M path (differential-cpg-testing).
sample_anno <- data.frame(batch=c('a','a','b','b'))
as_comp <- assocComp(meth_united, sample_anno) # which PCs track the covariate
meth_corrected <- removeComp(meth_united, comp=1) # drop the batch PC, then test
meth_AB <- reorganize(meth_united, sample.ids=c('ctrl_1','ctrl_2','treat_1','treat_2'),
treatment=c(0,0,1,1)) # subset/relabel for a pairwise contrast
pool(meth_united, sample.ids=...) sums replicate counts into one pseudo-sample per group. This DESTROYS biological replication - the test then has no within-group variance estimate and its p-values are meaningless for inference. Use it only for no-replicate exploratory visualization, never for the reported test.
Trigger: calculateDiffMeth with replicates and no overdispersion= argument. Mechanism: the default 'none' is a binomial-only logistic LRT that ignores between-replicate variance. Symptom: implausibly many significant CpGs; q-values far smaller than a beta-binomial tool gives on the same data. Fix: overdispersion='MN' (which uses the F-test) whenever replicates exist.
Trigger: passing both overdispersion='MN' and test='Chisq'. Mechanism: MN forces the F-test; test= is silently ignored. Symptom: the reported "chi-square test" was never run. Fix: drop test= when using MN; test= is honored only with overdispersion='none'.
Trigger: comparing methylKit q-value counts to a DSS/limma BH analysis. Mechanism: adjust defaults to SLIM, methylKit's own sliding-linear-model q-method. Symptom: DMC counts disagree with another tool's BH results at the "same" q. Fix: set adjust='BH' for cross-tool comparability and state the method used.
Trigger: tileMethylCounts at the default cov.bases=0. Mechanism: a window with one covered CpG becomes a "region." Symptom: thousands of single-CpG "DMRs," many noisy. Fix: raise cov.bases to >=3.
Trigger: tileMethylCounts(meth_obj, ...) or uniting before filtering/normalizing. Mechanism: low-coverage and library-size artifacts propagate into the tiles and the test. Symptom: artifactual regions; deeper samples look hyper-confident. Fix: filterByCoverage then normalizeCoverage BEFORE tiling/uniting.
Trigger: unite(destrand=TRUE) on .cov input or on CHG/CHH context. Mechanism: .cov carries no strand; non-CpG dyads are not symmetric. Symptom: double-counting or wrong merges. Fix: destrand CpG only, ideally from the cytosine report.
Trigger: pool() followed by calculateDiffMeth. Mechanism: pooling removes within-group variance. Symptom: tiny p-values with no biological meaning. Fix: never pool for the reported test; keep replicates separate.
| Threshold | Source | Rationale |
|-----------|--------|-----------|
| lo.count = 10 | convention (methylKit tutorial) | below ~10x a single-CpG percentage is a coin flip; not a derived value |
| hi.perc = 99.9 | convention | drops the top 0.1% coverage (PCR/repeat artifacts) |
| overdispersion = 'MN' with replicates | Akalin 2012 Genome Biol 13:R87 | adds the beta-binomial between-replicate layer; default 'none' over-calls |
| adjust = 'BH' (default SLIM) | Akalin 2012 Genome Biol 13:R87 | BH for comparability; SLIM is methylKit-specific |
| getMethylDiff difference=25, qvalue=0.01 | methylKit defaults | 25% is tutorial convention, NOT derived; justify per feature/coverage/purity and report it |
| tileMethylCounts cov.bases >= 3 | nuance | default 0 admits single-CpG tiles; require real CpG support |
| win.size=step.size=1000 (default) | methylKit defaults | step < win gives overlapping (sliding) tiles |
| Error / symptom | Cause | Solution |
|-----------------|-------|----------|
| Implausibly many DMCs | overdispersion='none' under replication | set overdispersion='MN' |
| Reported chi-square never ran | test='Chisq' with MN | MN forces F; drop test= |
| Counts disagree with another tool at same q | default adjust='SLIM' | set adjust='BH', state the method |
| Thousands of single-CpG "regions" | cov.bases=0 | raise cov.bases to >=3 |
| Destrand error / double counts | destrand on .cov or non-CpG | destrand CpG only, from cytosine report |
| Meaningless tiny p-values | pool() before testing | keep replicates; never pool for inference |
| Deeper sample looks more confident | no normalizeCoverage | normalize before unite/test |
tools
--- name: bio-phasing-imputation-foundations description: Frames the phasing/imputation pipeline before any tool runs: phasing and imputation are one Li-Stephens copying HMM (recombination is the transition, mutation the emission, the genetic map and Ne set the rates), imputation's honest output is a dosage with a self-estimated quality (INFO/R2/DR2) not a hard genotype, and the stages are ordered and each fails silently (QC, align build and strand to the panel, phase, impute per chromosome, fil
tools
Chooses the enrichment generation before any tool runs, mapping the input shape to a method class - a pre-selected gene list plus a background to over-representation analysis (ORA, hypergeometric), a ranked statistic for all genes to gene set enrichment (GSEA), a signed signaling topology to pathway-topology (SPIA) - then making the null explicit (competitive vs self-contained, gene vs subject sampling) and running a trustworthiness checklist (testable-gene universe, FDR, redundancy collapse, leading-edge check, version reporting). Covers why every clusterProfiler GSEA is the inter-gene-correlation-uncorrected competitive null, why the background not the gene list decides ORA significance, and why no method is universally best. Use when deciding ORA vs GSEA vs topology, which gene-set DB, whether a result is trustworthy, or which null a tool computes. For ORA see go-enrichment, GSEA see gsea, databases kegg-pathways/reactome-pathways/wikipathways; the ranking comes from differential-expression/de-results.
testing
End-to-end GWAS workflow from VCF to association results. Covers PLINK QC, population structure correction, and association testing for case-control or quantitative traits. Use when running genome-wide association studies.
development
Orchestrates the full path from differential expression results to redundancy-collapsed functional enrichment: choose ORA vs GSEA, convert gene IDs per method, run enrichGO/enrichKEGG/enrichPathway/enrichWP or gseGO/gseKEGG (clusterProfiler, ReactomePA, rWikiPathways), and visualize. Routes the ORA-vs-GSEA generation fork and the null/universe/reproducibility theory to pathway-analysis/enrichment-foundations. Use when a DESeq2/edgeR/limma result must become enriched GO terms, KEGG/Reactome/WikiPathways pathways, or a GSEA leading edge; when deciding whether a ranking exists for all genes (GSEA, named decreasing vector) or only a pre-selected list (ORA plus a defensible background universe); or when assembling DE-to-pathway end to end. The DE list and ranking statistic come from differential-expression/de-results; per-method nuance lives in the pathway-analysis skills.