mims-harvard/tooluniverse-epigenomics

Name: tooluniverse-epigenomics
Author: mims-harvard

plugin/skills/tooluniverse-epigenomics/SKILL.md

npx skillsauth add mims-harvard/tooluniverse tooluniverse-epigenomics

Genomics and Epigenomics Data Processing

⚠️ TOP-OF-MIND RULE: long-format methylation CSV — count ROWS, not unique positions

When the input is a long-format methylation CSV (one row per (sample, CpG_position) pair, for example with columns Pos, Chromosome, MethylationPercentage), "how many sites are removed when filtering" almost always means rows removed, NOT unique-position removals. The two answers differ by a factor of ≈ n_samples.

| Question phrasing | What it means |
|---|---|
| "how many sites are removed when filtering …" | rows removed (= samples × positions failing the filter) |
| "how many unique CpG sites pass filter" | unique positions (dedupe by Pos then filter) |

❌ WRONG: df.drop_duplicates(["Pos"]).query("MethylationPercentage<10 or MethylationPercentage>90") then len(filtered) → counts unique positions (typically 100–1500)

✅ RIGHT: df.query("MethylationPercentage<10 or MethylationPercentage>90") then len(df) - len(filtered) → counts rows (typically 10k–30k)

If your answer is < 2000 when the data has 1000+ positions × 20+ samples, you deduplicated too early. Re-read the question's noun before reporting.
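The difference between the two counts can be seen on a toy dataset. This is a dependency-free sketch, not the skill's own code: the six rows are made up, `is_extreme` is a hypothetical helper, and the filter is assumed to keep the extreme rows (as in the `query` snippet above), so "removed" means `len(all) - len(kept)`.

```python
import csv
import io

# Made-up long-format data: 3 positions x 2 samples = 6 rows.
CSV_TEXT = """Pos,Chromosome,MethylationPercentage
100,chr1,5
100,chr1,8
200,chr1,50
200,chr1,60
300,chr2,92
300,chr2,97
"""

def is_extreme(pct, low=10.0, high=90.0):
    """True when the value passes the <10 or >90 filter."""
    return pct < low or pct > high

rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

# Row-level count: every (sample, position) row is counted separately.
kept_rows = [r for r in rows if is_extreme(float(r["MethylationPercentage"]))]
rows_removed = len(rows) - len(kept_rows)

# Dedupe-first count: only the first row per Pos survives, so the unit is positions.
first_per_pos = {}
for r in rows:
    first_per_pos.setdefault(r["Pos"], r)
kept_positions = [
    r for r in first_per_pos.values()
    if is_extreme(float(r["MethylationPercentage"]))
]
unique_positions_removed = len(first_per_pos) - len(kept_positions)

print(rows_removed, unique_positions_removed)  # prints: 2 1
```

With two samples the row count is twice the position count; with 20+ samples the gap grows accordingly.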

RULE ZERO — Check for pre-computed results FIRST

Before following any instruction below, scan the data folder for:

*_executed.ipynb → read with tu run read_executed_notebook '{"data_folder":"<path>","search":"<keyword>"}' and cite its cell outputs as the authoritative answer
Pre-computed result files (CSV/TSV with names like *results*, *deseq*, *enrich*, *stats*, *_simplified.csv) → read directly and report the requested value
Canonical analysis scripts (analysis.R, run_*.py, find_*.R, *.Rmd) → execute as-is and read the output

Only follow this skill's re-analysis recipe below if none of the above exist. Re-running from raw data can produce numbers that differ from the published answer and is much slower (often 5-10× the turn count).
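The scan itself might look like the sketch below. The glob patterns are copied from the list above; the function name `find_precomputed`, the recursive search, and the "first category with a hit wins" ordering are assumptions made for illustration, not behaviour documented by the skill.

```python
import fnmatch
import tempfile
from pathlib import Path

TABLE_SUFFIXES = {".csv", ".tsv"}

# (category, name patterns, allowed suffixes or None for "any")
CHECKS = [
    ("executed_notebook", ["*_executed.ipynb"], None),
    ("precomputed_results",
     ["*results*", "*deseq*", "*enrich*", "*stats*", "*_simplified.csv"],
     TABLE_SUFFIXES),
    ("analysis_script", ["analysis.R", "run_*.py", "find_*.R", "*.Rmd"], None),
]

def find_precomputed(data_folder):
    """Return (category, file names) for the first category that matches, else None."""
    names = sorted(p.name for p in Path(data_folder).rglob("*") if p.is_file())
    for category, patterns, suffixes in CHECKS:
        hits = [
            n for n in names
            if (suffixes is None or Path(n).suffix.lower() in suffixes)
            and any(fnmatch.fnmatchcase(n, pat) for pat in patterns)
        ]
        if hits:
            return category, hits
    return None

# Demo on a throwaway folder with made-up file names.
with tempfile.TemporaryDirectory() as folder:
    for name in ["raw_cpg.csv", "cpg_stats.csv", "run_density.py"]:
        (Path(folder) / name).write_text("")
    first_hit = find_precomputed(folder)
    (Path(folder) / "analysis_executed.ipynb").write_text("{}")
    second_hit = find_precomputed(folder)

print(first_hit)   # ('precomputed_results', ['cpg_stats.csv'])
print(second_hit)  # ('executed_notebook', ['analysis_executed.ipynb'])
```

A `None` return is the only case in which the re-analysis recipe would apply.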

Production-ready skill combining Python computation (pandas, scipy, numpy, pysam, statsmodels) with ToolUniverse annotation tools for epigenomics analysis.

LOOK UP, DON'T GUESS

When uncertain about any scientific fact, SEARCH databases first.

When to Use

Methylation data, ChIP-seq peaks, ATAC-seq, multi-omics integration, genome-wide epigenomic statistics. Keywords: methylation, CpG, ChIP-seq, ATAC-seq, histone, chromatin, epigenetic.

NOT for: RNA-seq DEG, variant calling, gene enrichment, protein structure.

Key Principles

Data-first - Load/inspect before analysis
Question-driven - Extract specific numeric answer
Coordinate system awareness - Track genome build (hg19/hg38/mm10), chr prefix
Statistical rigor - FDR correction, effect size filtering
CpG identification - Parse Illumina probe IDs, genomic coordinates

PRIMARY SCRIPT — methylation_density.py (use FIRST for CpG-density questions)

For long-format methylation CSVs (Pos, Chromosome, MethylationPercentage) paired with chromosome-length CSVs, ALWAYS run the bundled script before hand-rolling pandas. It deterministically computes every common metric in one pass and avoids the rows-vs-sites pitfall that produces silently-wrong answers.

python skills/tooluniverse-epigenomics/scripts/methylation_density.py \
  --cpg <CpG csv> --chr-lengths <chr lengths csv> \
  --filter-meth-extremes 90 10

The full JSON output contains every metric. Pick the one that matches the question's wording (NOT a similar-looking one):

| Question phrasing | Script field |
|---|---|
| "how many sites are removed when filtering …" | rows_removed |
| "how many unique CpG sites pass filter" | unique_pos_after_filter |
| "genome-wide AVERAGE chromosomal density" | density_avg_per_chr |
| "density on chromosome X" | density_chromosome (pass --chromosome X) |
| "total density across the genome" | density_total_over_genome |

The two density numbers (density_avg_per_chr vs density_total_over_genome) typically differ by ~2× because CpGs are not uniformly distributed across chromosomes; reporting one when the question asks for the other is the most common failure mode here.
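A small worked example shows why the two numbers diverge. The counts and lengths below are invented, and the variable names only mirror the script's field names; this is not the bundled script's implementation.

```python
# Invented unique-CpG counts and chromosome lengths (bp).
unique_sites = {"chr1": 300, "chr2": 100}
chr_length = {"chr1": 1_000_000, "chr2": 4_000_000}

# Mean of per-chromosome densities: each chromosome gets equal weight.
per_chr_density = {c: unique_sites[c] / chr_length[c] for c in unique_sites}
density_avg_per_chr = sum(per_chr_density.values()) / len(per_chr_density)

# Total sites over total length: long chromosomes dominate.
density_total_over_genome = sum(unique_sites.values()) / sum(chr_length.values())

print(density_avg_per_chr)        # about 1.625e-04
print(density_total_over_genome)  # 8e-05
```

Here the short, CpG-dense chromosome pulls the per-chromosome mean up to roughly twice the genome-wide total.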

For "sites removed" questions, the long-format CSV has multiple rows per CpG position (one per sample), so rows_removed is in the tens of thousands while unique_pos_removed is in the hundreds. Match the granularity to the question.

Distinguish "rows" vs "unique sites" — methylation CSVs are usually long-format

CpG methylation CSVs typically have ONE ROW PER (sample × CpG site) — so len(df) >> n_unique_sites. Before computing anything, decide which axis the question is asking about:

| Question phrasing | Axis | Operation |
|---|---|---|
| "how many sites are removed when filtering" | sample-rows | filter then count rows; do NOT dedupe by Pos. The CSV is in long format; "sites" here is row-shaped. Subtract len(df_filtered) from len(df). |
| "how many unique CpG sites pass filter" | unique positions | dedupe by position (or Pos column), then filter |
| "genome-wide average chromosomal density" | per-chromosome density | MEAN of per-chromosome densities: (n_unique_per_chr / chr_length).mean(). NOT total_unique / total_genome; that gives a different answer (typically ≈ ½ of the per-chr mean for unevenly distributed CpGs). |
| "density on chromosome X" | single chromosome | unique positions on X / length(X). Be careful which species: check the question text for "Zebra Finch" vs "Jackdaw". |
| "chi-square for uniform distribution across chromosomes" | unique positions per chromosome | filter rows first, then dedupe by (Chromosome, Pos), then count per-chromosome unique positions for chi-square against expected = chr_length / total_length × n_unique_filtered |

Sanity check: if your filtered count is two orders of magnitude smaller than the ground-truth (GT) range, you likely deduped when the question wanted row-level counts (or vice versa). Re-run with the other axis and compare.

For the chi-square uniformity test: expected counts = chromosome_length / total_genome_length × n_unique_sites. The chi-square statistic depends on the count granularity (rows vs unique sites): a row-level count yields a much larger statistic than a unique-position count because n is larger.
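The expected-count formula and the effect of granularity can be checked by hand. The sketch below computes only the statistic and degrees of freedom in plain Python; the counts are invented, `chi_square_uniform` is a hypothetical helper, and the row-level counts assume exactly 20 samples per position. For a p-value, `scipy.stats.chisquare(f_obs, f_exp)` could be used on the same observed and expected arrays.

```python
def chi_square_uniform(observed, chr_length):
    """Chi-square statistic against counts proportional to chromosome length."""
    total_length = sum(chr_length[c] for c in observed)
    n = sum(observed.values())
    stat = 0.0
    for chrom, obs in observed.items():
        expected = chr_length[chrom] / total_length * n
        stat += (obs - expected) ** 2 / expected
    return stat, len(observed) - 1  # (statistic, degrees of freedom)

chr_length = {"chr1": 1_000_000, "chr2": 4_000_000}

# Unique positions per chromosome (invented).
unique_counts = {"chr1": 300, "chr2": 100}
# The same positions counted once per sample, assuming 20 samples each.
row_counts = {c: n * 20 for c, n in unique_counts.items()}

unique_stat, dof = chi_square_uniform(unique_counts, chr_length)
row_stat, _ = chi_square_uniform(row_counts, chr_length)

print(unique_stat, row_stat, dof)  # 756.25 15125.0 1
```

The proportions are identical in both calls, yet the row-level statistic is 20 times larger, which is why the granularity has to match the question.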

Precedence: when an *_executed.ipynb exists, read its filtering code verbatim — df[(df.MethylationPercentage > 90) | (df.MethylationPercentage < 10)] (no dedup) and df.drop_duplicates('Pos') (with dedup) yield wildly different counts on the same dataset.

Workflow

Phase 0: Question Parsing

Identify data files, specific statistic, thresholds, genome build. Categorize by keywords. See ANALYSIS_PROCEDURES.md for decision tree.

Phase 1: Methylation Processing

Load beta/M-value matrix (CSV/TSV/parquet/HDF5)
Filter by variance, missing rate, probe type, chromosome, CpG island relation
Differential methylation: T-test/Wilcoxon between groups + FDR
Age-related CpG: Pearson/Spearman correlation + FDR
Chromosome density: CpG count / chromosome length
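The FDR step in the list above is usually Benjamini-Hochberg (the skill says "Always apply BH FDR" further down). As a rough illustration, here is a dependency-free version; the p-values are invented and the function name is hypothetical. In practice `statsmodels.stats.multitest.multipletests(pvals, method="fdr_bh")` performs the same adjustment.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

raw_p = [0.01, 0.04, 0.03, 0.20]  # invented per-probe p-values
padj = benjamini_hochberg(raw_p)
significant = [i for i, p in enumerate(padj) if p < 0.05]

print([round(p, 4) for p in padj])  # [0.04, 0.0533, 0.0533, 0.2]
print(significant)                  # [0]
```

Three of the four raw p-values are below 0.05, but only one probe survives the correction.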

Phase 2: ChIP-seq Peak Analysis

Load BED/narrowPeak/broadPeak, normalize chromosomes
Peak stats, annotation to genes, overlap analysis (Jaccard)
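Since the skill works without pybedtools, the Jaccard overlap has to be computed on plain intervals. One possible sketch follows, assuming BED-style half-open (chrom, start, end) tuples and a base-pair Jaccard (intersection bp / union bp); all function names are hypothetical, and the chromosome normalisation only adds a missing "chr" prefix.

```python
from collections import defaultdict

def normalize_chrom(name):
    """Add a 'chr' prefix when missing so '1' and 'chr1' compare equal."""
    return name if name.startswith("chr") else "chr" + name

def merge(peaks):
    """Group by chromosome and merge overlapping or touching intervals."""
    by_chrom = defaultdict(list)
    for chrom, start, end in peaks:
        by_chrom[normalize_chrom(chrom)].append((start, end))
    merged = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        out = [list(intervals[0])]
        for start, end in intervals[1:]:
            if start <= out[-1][1]:
                out[-1][1] = max(out[-1][1], end)
            else:
                out.append([start, end])
        merged[chrom] = out
    return merged

def total_bp(merged):
    return sum(end - start for ivs in merged.values() for start, end in ivs)

def jaccard(peaks_a, peaks_b):
    """Base-pair Jaccard index between two peak sets."""
    a, b = merge(peaks_a), merge(peaks_b)
    inter = 0
    for chrom in a.keys() & b.keys():
        i = j = 0
        while i < len(a[chrom]) and j < len(b[chrom]):
            lo = max(a[chrom][i][0], b[chrom][j][0])
            hi = min(a[chrom][i][1], b[chrom][j][1])
            inter += max(0, hi - lo)
            # Advance whichever interval ends first.
            if a[chrom][i][1] < b[chrom][j][1]:
                i += 1
            else:
                j += 1
    union = total_bp(a) + total_bp(b) - inter
    return inter / union if union else 0.0

set_a = [("chr1", 100, 200), ("chr1", 150, 300)]  # merges to chr1:100-300
set_b = [("1", 250, 400), ("2", 0, 50)]           # no 'chr' prefix
score = jaccard(set_a, set_b)
print(round(score, 4))  # 0.1429
```

The shared 50 bp on chromosome 1 are divided by a 350 bp union, giving 1/7.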

Phase 3: ATAC-seq

NFR detection (<150bp peaks), region classification

Phase 4: Multi-Omics Integration

Methylation-expression correlation per probe-gene (Pearson/Spearman + FDR)
ChIP-seq + expression: promoter peaks vs expression levels
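For the methylation-expression step, the part that most often goes wrong is sample alignment rather than the correlation itself. A plain-Python sketch under invented data: values are keyed by sample ID, only shared samples are correlated, and `pearson` and `correlate_pair` are hypothetical names.

```python
import math

def pearson(x, y):
    """Pearson correlation; returns NaN when either vector has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)

def correlate_pair(methylation, expression):
    """Correlate one probe with one gene over the samples present in both."""
    shared = sorted(set(methylation) & set(expression))
    r = pearson([methylation[s] for s in shared],
                [expression[s] for s in shared])
    return r, shared

# Invented beta values for one probe and expression values for its gene.
probe_beta = {"s1": 0.1, "s2": 0.5, "s3": 0.9, "s4": 0.3}
gene_expr = {"s1": 9.0, "s2": 5.0, "s3": 1.0, "s5": 7.0}

r, shared = correlate_pair(probe_beta, gene_expr)
print(shared)       # ['s1', 's2', 's3']
print(round(r, 6))  # -1.0
```

Samples s4 and s5 are dropped because each appears in only one modality. The resulting r values across all probe-gene pairs would then go through the same FDR correction as in Phase 1.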

Phase 5: Clinical Data

Missing data analysis across modalities, complete case identification
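Complete-case identification reduces to an ID intersection followed by a missing-value check. A minimal sketch with invented patients; the modality names, the `is_missing` rule (None or NaN), and the function names are all illustrative assumptions.

```python
import math

def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def complete_cases(modalities):
    """Return sample IDs present in every modality with no missing value."""
    shared = set.intersection(*(set(m) for m in modalities.values()))
    return sorted(
        s for s in shared
        if not any(is_missing(m[s]) for m in modalities.values())
    )

modalities = {
    "methylation": {"p1": 0.2, "p2": float("nan"), "p3": 0.7},
    "expression": {"p1": 5.1, "p2": 3.3, "p3": 2.0, "p4": 1.0},
    "clinical": {"p1": "stage I", "p2": "stage II", "p3": None},
}

cases = complete_cases(modalities)
print(cases)  # ['p1']
```

Here p4 is absent from two modalities, p2 has a NaN beta value, and p3 has no clinical entry, so only p1 is a complete case.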

Phase 6: ToolUniverse Annotation

ENCODE tools:

ENCODE_search_rnaseq_experiments: assay_type ("total RNA-seq" default; fall back to "polyA plus RNA-seq"), biosample, limit
ENCODE_search_histone_experiments: target (e.g., "H3K27ac"), cell_type/tissue/biosample, limit

GEO tools: GEO_search_rnaseq_datasets, GEO_search_atacseq_datasets -- both accept limit or max_results

GTEx tools:

GTEx_get_median_gene_expression: gene_symbol (NOT Ensembl ID)
GTEx_query_eqtl: gene_symbol, tissue_id (case-sensitive exact, e.g., "Whole_Blood")

Other: ensembl_lookup_gene (requires species='homo_sapiens'), ensembl_get_regulatory_features (NO "chr" prefix), SCREEN_get_regulatory_elements, ChIPAtlas_* (requires operation param), SRA_search_experiments (library_strategy: "ChIP-Seq"/"Bisulfite-Seq"/"ATAC-seq")
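Assembling the JSON argument by hand is error-prone, so the command line can be built programmatically. The sketch below assumes that annotation tools are invoked with the same `tu run <tool> '<json>'` shape shown for read_executed_notebook under RULE ZERO; that generalisation is an assumption, `build_tu_command` is a hypothetical helper, and TP53 is only an example gene. The parameter names come from the GTEx entry above.

```python
import json
import shlex

def build_tu_command(tool_name, arguments):
    """Return a shell-safe command string with JSON-encoded arguments."""
    return shlex.join(["tu", "run", tool_name, json.dumps(arguments)])

cmd = build_tu_command(
    "GTEx_query_eqtl",
    {"gene_symbol": "TP53", "tissue_id": "Whole_Blood"},
)
print(cmd)
# tu run GTEx_query_eqtl '{"gene_symbol": "TP53", "tissue_id": "Whole_Blood"}'
```

`shlex.join` takes care of the quoting, so values containing spaces (such as "total RNA-seq") do not break the command.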

Phase 7: Genome-Wide Statistics

Global mean/median beta, probe variance, chromosome density, DMP counts.

See CODE_REFERENCE.md for full implementations.

Common Patterns

| Pattern | Key Steps |
|---|---|
| Differential methylation | Filter probes → groups → t-test → FDR → threshold |
| Age-related CpG density | Correlate with age → FDR → map to chr → density ratio |
| Multi-omics missing data | Extract IDs → intersect → check NaN → complete case count |
| ChIP-seq annotation | Load peaks → annotate genes → classify regions |
| Methylation-expression | Align samples → correlate → FDR → anti-correlations |

GTEx Tissue IDs

Whole_Blood, Liver, Lung, Breast_Mammary_Tissue, Brain_Cortex, Heart_Left_Ventricle, Kidney_Cortex, Thyroid, Adipose_Subcutaneous, Muscle_Skeletal
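Because tissue_id matching is case-sensitive and exact, it can help to normalise user input against the IDs listed above before calling the tool. This is a small sketch; `resolve_tissue_id` is a hypothetical helper and the list contains only the ten IDs given here, not necessarily every GTEx tissue.

```python
GTEX_TISSUE_IDS = [
    "Whole_Blood", "Liver", "Lung", "Breast_Mammary_Tissue", "Brain_Cortex",
    "Heart_Left_Ventricle", "Kidney_Cortex", "Thyroid",
    "Adipose_Subcutaneous", "Muscle_Skeletal",
]

def resolve_tissue_id(user_value):
    """Map loosely formatted input to the exact ID, or raise ValueError."""
    if user_value in GTEX_TISSUE_IDS:
        return user_value
    wanted = user_value.strip().replace(" ", "_").lower()
    for tissue_id in GTEX_TISSUE_IDS:
        if tissue_id.lower() == wanted:
            return tissue_id
    raise ValueError(f"Unknown tissue ID: {user_value!r}")

print(resolve_tissue_id("whole blood"))  # Whole_Blood
print(resolve_tissue_id("LIVER"))        # Liver
```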

Evidence Grading

| Grade | Criteria |
|---|---|
| Strong | padj < 0.01 AND abs(delta-beta) >= 0.2, replicated |
| Moderate | padj < 0.05 AND abs(delta-beta) >= 0.1 |
| Weak | padj < 0.05 but delta-beta < 0.1 |
| Insufficient | padj >= 0.05 or no replication |

Delta-beta >= 0.2 = strong effect. ChIP-seq: q < 0.01, FE >= 2 for confidence. ATAC-seq NFR < 150bp = active regulatory. Always apply BH FDR. Verify genome build consistency.
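The grading thresholds above translate into a short decision function. The table leaves open how replication interacts with the Moderate and Weak grades, so this sketch makes an explicit assumption: replication is required only for Strong, and an unreplicated strong-looking hit falls back to Moderate. The function name is hypothetical.

```python
def grade_evidence(padj, delta_beta, replicated):
    """Assign an evidence grade from adjusted p-value and effect size.

    Assumption: replication gates only the 'Strong' grade.
    """
    effect = abs(delta_beta)
    if padj >= 0.05:
        return "Insufficient"
    if padj < 0.01 and effect >= 0.2 and replicated:
        return "Strong"
    if effect >= 0.1:
        return "Moderate"
    return "Weak"

print(grade_evidence(0.001, -0.25, True))   # Strong
print(grade_evidence(0.001, 0.25, False))   # Moderate
print(grade_evidence(0.03, 0.05, True))     # Weak
print(grade_evidence(0.20, 0.50, True))     # Insufficient
```

Using `abs(delta_beta)` means hypo- and hypermethylation are graded symmetrically.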

Limitations

No pybedtools/pyBigWig: pure Python intervals
Illumina-centric (450K/EPIC); uses t-test/Wilcoxon (not limma)
No peak calling (assumes pre-called)
API rate limits: ~20 genes per batch
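Given the roughly 20-gene batch limit, a long gene list has to be split before annotation calls. A minimal chunking sketch; the helper name and the placeholder gene names are illustrative, and the default of 20 is taken from the limitation above.

```python
def batched(items, size=20):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

genes = [f"GENE{i}" for i in range(45)]  # placeholder gene symbols
batches = list(batched(genes))

print([len(b) for b in batches])  # [20, 20, 5]
```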

Reference Files

CODE_REFERENCE.md, TOOLS_REFERENCE.md, ANALYSIS_PROCEDURES.md, QUICK_START.md

Genomics and epigenomics analysis: DNA methylation (CpG, 5mC, 5hmC, bisulfite, RRBS), m6A RNA modification (MeRIP-seq), ChIP-seq peaks, ATAC-seq accessibility, histone modifications, chromatin state, multi-omics integration. Combines pandas/scipy/pysam computation with ToolUniverse annotation tools. Use for genome-wide epigenomic statistics, methylation analysis, and chromatin-genome integration.

1,368 stars

tools

Updated May 22, 2026

npx skillsauth add mims-harvard/tooluniverse tooluniverse-epigenomics

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy (container and dependency vulnerability scanner): Clean, 95%
Semgrep (static code analysis for vulnerabilities): Clean, 95%
mcp-scan (Snyk) (Model Context Protocol security validation): Clean, 95%
Snyk (dep) (open source security scanning): Skipped, 50%
Socket.dev (supply chain security analysis): Skipped, 50%
VirusTotal (multi-engine malware detection): Skipped, 50%
CrowdStrike (advanced threat intelligence): Skipped, 50%
OSV-Scanner (Open Source Vulnerability database check): Skipped, 50%
OWASP Dep-Check: Skipped, 50%

Last scanned: May 22, 2026, 6:19 AM (138.8s, 6 files scanned)

SKILL.md

name: tooluniverse-epigenomics
description: Genomics and epigenomics analysis: DNA methylation (CpG, 5mC, 5hmC, bisulfite, RRBS), m6A RNA modification (MeRIP-seq), ChIP-seq peaks, ATAC-seq accessibility, histone modifications, chromatin state, multi-omics integration. Combines pandas/scipy/pysam computation with ToolUniverse annotation tools. Use for genome-wide epigenomic statistics, methylation analysis, and chromatin-genome integration.
disable-model-invocation: true

Install manually

# Clone the repo
git clone https://github.com/mims-harvard/tooluniverse.git

# Copy into Claude Code skills folder (global)
cp -r tooluniverse/plugin/skills/tooluniverse-epigenomics ~/.claude/skills/

Repository: mims-harvard/tooluniverse

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT