plugin/skills/tooluniverse-epigenomics/SKILL.md
Genomics and epigenomics analysis: DNA methylation (CpG, 5mC, 5hmC, bisulfite, RRBS), m6A RNA modification (MeRIP-seq), ChIP-seq peaks, ATAC-seq accessibility, histone modifications, chromatin state, multi-omics integration. Combines pandas/scipy/pysam computation with ToolUniverse annotation tools. Use for genome-wide epigenomic statistics, methylation analysis, and chromatin-genome integration.
npx skillsauth add mims-harvard/tooluniverse tooluniverse-epigenomicsInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
When the input is a long-format methylation CSV (one row per (sample, CpG_position)
e.g. columns Pos, Chromosome, MethylationPercentage), "how many sites are
removed when filtering" almost always means rows removed, NOT unique-position
removals. The two answers differ by a factor of ≈ n_samples.
| Question phrasing | What it means |
|---|---|
| "how many sites are removed when filtering …" | rows removed (= samples × positions failing the filter) |
| "how many unique CpG sites pass filter" | unique positions (dedupe by Pos then filter) |
❌ WRONG: df.drop_duplicates(["Pos"]).query("MethylationPercentage<10 or >90") then len(filtered) → counts unique positions (typically 100–1500)
✅ RIGHT: df.query("MethylationPercentage<10 or MethylationPercentage>90") then len(df) - len(filtered) → counts rows (typically 10k–30k)
If your answer is < 2000 when the data has 1000+ positions × 20+ samples, you deduplicated too early. Re-read the question's noun before reporting.
Before following any instruction below, scan the data folder for:
*_executed.ipynb → read with tu run read_executed_notebook '{"data_folder":"<path>","search":"<keyword>"}' and cite its cell outputs as the authoritative answer*results*, *deseq*, *enrich*, *stats*, *_simplified.csv) → read directly and report the requested valueanalysis.R, run_*.py, find_*.R, *.Rmd) → execute as-is and read the outputOnly follow this skill's re-analysis recipe below if none of the above exist. Re-running from raw data produces different numbers than the published answer and is much slower (often 5-10× turn count).
Production-ready skill combining Python computation (pandas, scipy, numpy, pysam, statsmodels) with ToolUniverse annotation tools for epigenomics analysis.
When uncertain about any scientific fact, SEARCH databases first.
Methylation data, ChIP-seq peaks, ATAC-seq, multi-omics integration, genome-wide epigenomic statistics. Keywords: methylation, CpG, ChIP-seq, ATAC-seq, histone, chromatin, epigenetic.
NOT for: RNA-seq DEG, variant calling, gene enrichment, protein structure.
For long-format methylation CSVs (Pos, Chromosome, MethylationPercentage)
paired with chromosome-length CSVs, ALWAYS run the bundled script before
hand-rolling pandas. It deterministically computes every common metric in one
pass and avoids the rows-vs-sites pitfall that produces silently-wrong answers.
python skills/tooluniverse-epigenomics/scripts/methylation_density.py \
--cpg <CpG csv> --chr-lengths <chr lengths csv> \
--filter-meth-extremes 90 10
The full JSON output contains every metric. Pick the one that matches the question's wording (NOT a similar-looking one):
| Question phrasing | Script field |
|----------------------------------------------------------------|---------------------------|
| "how many sites are removed when filtering …" | rows_removed |
| "how many unique CpG sites pass filter" | unique_pos_after_filter |
| "genome-wide AVERAGE chromosomal density" | density_avg_per_chr |
| "density on chromosome X" | density_chromosome (pass --chromosome X) |
| "total density across the genome" | density_total_over_genome |
The two density numbers (density_avg_per_chr vs density_total_over_genome)
typically differ by ~2× because CpGs are not uniformly distributed across
chromosomes; reporting one when the question asks for the other is the most
common failure mode here.
For "sites removed" questions, the long-format CSV has multiple rows per CpG
position (one per sample), so rows_removed is in the tens of thousands while
unique_pos_removed is in the hundreds. Match the granularity to the question.
CpG methylation CSVs typically have ONE ROW PER (sample × CpG site) — so len(df) >> n_unique_sites. Before computing anything, decide which axis the question is asking about:
| Question phrasing | Axis | Operation |
|-------------------|------|-----------|
| "how many sites are removed when filtering" | sample-rows | filter then count rows; do NOT dedupe by Pos. The CSV is in long format; "sites" here is row-shaped. Subtract len(df_filtered) from len(df). |
| "how many unique CpG sites pass filter" | unique positions | dedupe by position (or Pos column), then filter |
| "genome-wide average chromosomal density" | per-chromosome density | MEAN of per-chromosome densities: (n_unique_per_chr / chr_length).mean(). NOT total_unique / total_genome — that gives a different answer (typically ≈ ½ of the per-chr mean for unevenly distributed CpGs). |
| "density on chromosome X" | single chromosome | unique positions on X / length(X). Be careful which species — check the question text for "Zebra Finch" vs "Jackdaw". |
| "chi-square for uniform distribution across chromosomes" | unique positions per chromosome | filter rows first, then dedupe by (Chromosome, Pos), then count per-chromosome unique positions for chi-square against expected = chr_length / total_length × n_unique_filtered |
Sanity check: if your filtered count is two orders of magnitude smaller than the GT range, you likely deduped when the question wanted row-level counts (or vice versa). Re-run with the other axis and compare.
For the chi-square uniformity test: expected counts = chromosome_length / total_genome_length × n_unique_sites. The chi-square statistic depends on the count granularity (rows vs unique sites) — a row-level chi-square gives a much higher chi-square than a unique-position chi-square because n is larger.
Precedence: when an *_executed.ipynb exists, read its filtering code verbatim — df[(df.MethylationPercentage > 90) | (df.MethylationPercentage < 10)] (no dedup) and df.drop_duplicates('Pos') (with dedup) yield wildly different counts on the same dataset.
Identify data files, specific statistic, thresholds, genome build. Categorize by keywords.
See ANALYSIS_PROCEDURES.md for decision tree.
ENCODE tools:
ENCODE_search_rnaseq_experiments: assay_type ("total RNA-seq" default; fall back to "polyA plus RNA-seq"), biosample, limitENCODE_search_histone_experiments: target (e.g., "H3K27ac"), cell_type/tissue/biosample, limitGEO tools: GEO_search_rnaseq_datasets, GEO_search_atacseq_datasets -- both accept limit or max_results
GTEx tools:
GTEx_get_median_gene_expression: gene_symbol (NOT Ensembl ID)GTEx_query_eqtl: gene_symbol, tissue_id (case-sensitive exact, e.g., "Whole_Blood")Other: ensembl_lookup_gene (requires species='homo_sapiens'), ensembl_get_regulatory_features (NO "chr" prefix), SCREEN_get_regulatory_elements, ChIPAtlas_* (requires operation param), SRA_search_experiments (library_strategy: "ChIP-Seq"/"Bisulfite-Seq"/"ATAC-seq")
Global mean/median beta, probe variance, chromosome density, DMP counts.
See CODE_REFERENCE.md for full implementations.
| Pattern | Key Steps | |---------|-----------| | Differential methylation | Filter probes → groups → t-test → FDR → threshold | | Age-related CpG density | Correlate with age → FDR → map to chr → density ratio | | Multi-omics missing data | Extract IDs → intersect → check NaN → complete case count | | ChIP-seq annotation | Load peaks → annotate genes → classify regions | | Methylation-expression | Align samples → correlate → FDR → anti-correlations |
Whole_Blood, Liver, Lung, Breast_Mammary_Tissue, Brain_Cortex, Heart_Left_Ventricle, Kidney_Cortex, Thyroid, Adipose_Subcutaneous, Muscle_Skeletal
| Grade | Criteria | |-------|----------| | Strong | padj < 0.01 AND abs(delta-beta) >= 0.2, replicated | | Moderate | padj < 0.05 AND abs(delta-beta) >= 0.1 | | Weak | padj < 0.05 but delta-beta < 0.1 | | Insufficient | padj >= 0.05 or no replication |
Delta-beta >= 0.2 = strong effect. ChIP-seq: q < 0.01, FE >= 2 for confidence. ATAC-seq NFR < 150bp = active regulatory. Always apply BH FDR. Verify genome build consistency.
CODE_REFERENCE.md, TOOLS_REFERENCE.md, ANALYSIS_PROCEDURES.md, QUICK_START.md
tools
Post-market safety surveillance and recall/adverse-event RETRIEVAL across the full spectrum of FDA-regulated products that are NOT covered by the drug-AE signal skills: medical devices, food / dietary supplements / cosmetics, veterinary drugs, and drug supply (shortages). Orchestrates openFDA endpoints (MAUDE device adverse events + device recalls + 510(k), CAERS food/supplement/ cosmetic adverse events, veterinary adverse events, drug shortages, and cross-product enforcement/recall reports). USE WHEN the user asks: "are there adverse events for [device / pacemaker / infusion pump / insulin pump]", "device recalls for [firm/product]", "supplement / vitamin / cosmetic adverse reactions", "is [drug] in shortage", "what injectables are on shortage", "veterinary / animal adverse events for [drug] in [dog/cat/horse]", "food recall for listeria", "MAUDE report for [device]", "CAERS reactions for [brand]". DO NOT USE for drug adverse-event SIGNAL detection or disproportionality (PRR / ROR / IC) or drug-AE association scoring — that is `tooluniverse-pharmacovigilance` / `tooluniverse-adverse-event-detection`. This skill is multi-product surveillance and retrieval, not drug-AE statistical signal mining.
tools
--- name: tooluniverse-phewas description: Cross-ancestry / cross-biobank phenome-wide association (PheWAS) and replication. Given ONE variant (rsID) or ONE gene, look up every phenotype it associates with across European/UK (UKB-TOPMed), Finnish (FinnGen), Japanese (BioBank Japan), and Taiwanese (TPMI) biobanks, plus exome-wide gene-burden PheWAS (Genebass), then judge whether an association replicates across ancestries or is population-specific. Use whenever the user asks "what else is this va
tools
Dereplicate a putative natural product and assign its chemical taxonomy. Use to answer "is [compound] a known natural product", "what microbe/organism produces [compound]", "what chemical class is [compound]", "dereplicate this metabolite (by formula/exact mass/InChIKey/SMILES)", or "classify this molecule into ChemOnt". Searches NPAtlas for known microbial natural products (producing organism + literature reference), assigns the ChemOnt kingdom→superclass→class→subclass hierarchy via ClassyFire, resolves systematic IUPAC names to structure via OPSIN, and cross-references identity in PubChem. NOT for general drug/compound identity or ADMET (use tooluniverse-chemical-compound-retrieval / tooluniverse-small-molecule-discovery) and NOT for metabolomics pathway/enrichment analysis (use tooluniverse-metabolomics skills).
tools
Genome-ASSEMBLY discovery, QC, and replicon mapping for any organism (bacteria, archaea, fungi, and beyond) using NCBI Datasets. Resolves an organism name or taxid to assemblies, picks the reference/representative or best-quality assembly, pulls assembly QC metrics (total length, contig/scaffold N50, contig count, GC%, assembly level, RefSeq category), enumerates chromosomes and plasmids via per-replicon sequence reports, and compares candidate assemblies on quality. Use for "what genomes are available for [organism]", "assembly stats / N50 / GC content for [GCF_/GCA_ accession]", "how many plasmids does [strain] have", "compare assemblies for [species]", "find the reference genome for [taxon]", "is this assembly Complete Genome or just contigs". NOT for gene-level orthology/synteny (use tooluniverse-comparative-genomics), plant gene structure (use tooluniverse-plant-genomics), de novo assembly from raw reads (no tool exists), or taxonomy-only name/lineage lookups.