aipoch/geniml

Name: geniml
Author: aipoch

scientific-skills/Data Analysis/geniml/SKILL.md

npx skillsauth add aipoch/medical-research-skills geniml

When to Use

You have many BED files and need numeric features for clustering, similarity search, or downstream supervised learning (e.g., ChIP-seq/ATAC-seq region sets).
You want unsupervised embeddings of genomic regions to compare region sets across experiments (Region2Vec).
You need joint embeddings of regions and metadata labels (e.g., tissue/cell type/condition) to enable cross-modal queries like Region → Label or Label → Region (BEDspace).
You are analyzing single-cell ATAC-seq and want cell embeddings for clustering/annotation and integration with Scanpy workflows (scEmbed).
You need a consensus peak set (“universe”) built from multiple BED files to standardize tokenization and region definitions across datasets (Universe construction).

Key Features

Region2Vec: Word2vec-style unsupervised embeddings for genomic regions from tokenized BED data.
BEDspace: StarSpace-based joint embedding space for region sets and metadata labels; supports similarity search and cross-modal retrieval.
scEmbed: Single-cell ATAC-seq embedding workflow (tokenize cells → train → encode cells) compatible with Scanpy.
Universe (Consensus Peaks) Builder: Generates reference peak sets using multiple statistical approaches (CC, CCF, ML, HMM).
Utilities:
- Tokenization: Universe-based tokenization (hard/soft tokenization patterns).
- Evaluation: Embedding quality metrics (e.g., silhouette, Davies–Bouldin).
- BEDshift: Region randomization/null-model generation while preserving genomic context.
- BBClient / caching: Faster repeated access to BED resources.
- Text2BedNN: Neural search backend for genomic queries.

Additional details are documented in the skill's reference files: references/region2vec.md, references/bedspace.md, references/scembed.md, references/consensus_peaks.md, references/utilities.md.

Dependencies

Python: 3.9+ (recommended)
geniml: latest from PyPI (or GitHub main)
Optional ML extras: geniml[ml] (typically pulls PyTorch and related ML dependencies)
Scanpy stack (for scEmbed workflows): scanpy (plus anndata, numpy, scipy)
StarSpace (for BEDspace training): external binary from https://github.com/facebookresearch/StarSpace
Universe coverage generation: uniwig (used to generate coverage tracks in universe workflows)

Example Usage

1) Install

# Base install
uv pip install geniml

# With ML extras (e.g., PyTorch and related dependencies)
uv pip install "geniml[ml]"

# Development version
uv pip install git+https://github.com/databio/geniml.git

2) End-to-end: Build a universe → tokenize BEDs → train Region2Vec → evaluate

# (A) Build coverage tracks (example pattern)
cat bed_files/*.bed > combined.bed
uniwig -m 25 combined.bed chrom.sizes coverage/

# (B) Build a universe (coverage cutoff method)
geniml build-universe cc \
  --coverage-folder coverage/ \
  --output-file universe.bed \
  --cutoff 5 \
  --merge 100 \
  --filter-size 50

# (C) Tokenize BED files, train Region2Vec, and evaluate embeddings
from geniml.tokenization import hard_tokenization
from geniml.region2vec import region2vec
from geniml.evaluation import evaluate_embeddings

# 1) Tokenize BED files against the universe
hard_tokenization(
    src_folder="bed_files/",
    dst_folder="tokens/",
    universe_file="universe.bed",
    p_value_threshold=1e-9,
)

# 2) Train Region2Vec
region2vec(
    token_folder="tokens/",
    save_dir="model/",
    num_shufflings=1000,
    embedding_dim=100,
)

# 3) Evaluate (requires labels/metadata aligned to embeddings)
metrics = evaluate_embeddings(
    embeddings_file="model/embeddings.npy",
    labels_file="metadata.csv",
)

print(metrics)
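
The evaluation step reports clustering-quality metrics such as the silhouette score listed under Key Features. As a rough, library-independent illustration of what that metric measures, the sketch below computes the mean silhouette coefficient in plain Python. The function name `silhouette_score` and the toy points are assumptions made for illustration; this is not geniml's evaluation code.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            # Singleton cluster: the coefficient is defined as 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return sum(scores) / len(scores)


# Toy 2-D "embeddings" (made-up values): two tight, well-separated groups.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = silhouette_score(points, [0, 0, 1, 1])  # labels match the geometry
bad = silhouette_score(points, [0, 1, 0, 1])   # labels cut across the groups
print(good > bad)  # True
```

Scores close to 1 indicate that labels line up with the geometry of the embedding space; scores near or below 0 indicate that they do not.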

3) Single-cell ATAC-seq: tokenize cells → train scEmbed → cluster with Scanpy

import scanpy as sc
from geniml.scembed import ScEmbed
from geniml.io import tokenize_cells

# 1) Load AnnData
adata = sc.read_h5ad("scatac_data.h5ad")

# 2) Tokenize cells using a universe
tokenize_cells(
    adata="scatac_data.h5ad",
    universe_file="universe.bed",
    output="tokens.parquet",
)

# 3) Train scEmbed
model = ScEmbed(embedding_dim=100)
model.train(dataset="tokens.parquet", epochs=100)

# 4) Encode cells and attach embeddings to AnnData
embeddings = model.encode(adata)
adata.obsm["scembed_X"] = embeddings

# 5) Standard Scanpy neighborhood graph + clustering + UMAP
sc.pp.neighbors(adata, use_rep="scembed_X")
sc.tl.leiden(adata)
sc.tl.umap(adata)

Implementation Details

Tokenization (Universe-based)

Goal: Convert genomic intervals into discrete “tokens” defined by a reference universe (consensus peak set).
Hard tokenization: Assigns intervals to universe bins/peaks deterministically (commonly used for Region2Vec/scEmbed pipelines).
Key parameter: p_value_threshold controls stringency of mapping/overlap significance (lower is stricter; overly strict thresholds can reduce coverage).
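
To make the mapping concrete, here is a minimal, library-independent sketch of overlap-based hard tokenization. It is not geniml's implementation: the helper name `hard_tokenize`, the any-overlap rule, and the assumption that universe intervals do not overlap each other are all choices made for illustration, and the sketch ignores the p-value filtering described above.

```python
from bisect import bisect_left
from collections import defaultdict


def hard_tokenize(regions, universe):
    """Return the universe intervals ("tokens") overlapped by any query region.

    regions, universe: lists of (chrom, start, end), half-open as in BED.
    Assumes universe intervals on a chromosome do not overlap each other.
    Hypothetical helper for illustration only.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in universe:
        by_chrom[chrom].append((start, end))
    for intervals in by_chrom.values():
        intervals.sort()

    tokens = set()
    for chrom, start, end in regions:
        intervals = by_chrom.get(chrom, [])
        # Index of the first universe interval starting at or after the query start.
        i = bisect_left(intervals, (start, start))
        # The interval just before it may still extend into the query.
        if i > 0 and intervals[i - 1][1] > start:
            i -= 1
        # Collect every universe interval that starts before the query ends.
        while i < len(intervals) and intervals[i][0] < end:
            tokens.add((chrom,) + intervals[i])
            i += 1
    return sorted(tokens)


universe = [("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 50, 150)]
regions = [("chr1", 150, 320), ("chr2", 150, 160), ("chr1", 500, 600)]
print(hard_tokenize(regions, universe))
# [('chr1', 100, 200), ('chr1', 300, 400)]
```

The chr2 query starts exactly where the chr2 universe interval ends, so under half-open coordinates it produces no token.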

Region2Vec (Region Embeddings)

Core idea: Treat each BED file (or region set) like a “document” and each universe peak like a “word”; learn embeddings using a word2vec-style objective.
Important knobs:
- embedding_dim: dimensionality of learned vectors (e.g., 50–300).
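- num_shufflings: number of times each region set is shuffled to generate training contexts (regions in a set have no inherent order, so shuffling supplies varied co-occurrence windows); higher values increase runtime.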
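
The document/word analogy and the role of shuffling can be sketched without any ML library. The code below only generates skip-gram style (center, context) pairs from shuffled region sets; it does not train a model. The function names, the `window` parameter, and the token strings are assumptions for illustration, not geniml's API.

```python
import random


def shuffled_sentences(region_sets, num_shufflings, seed=0):
    """Yield every region set num_shufflings times, each time in a new random order."""
    rng = random.Random(seed)
    for _ in range(num_shufflings):
        for tokens in region_sets:
            sentence = list(tokens)
            rng.shuffle(sentence)
            yield sentence


def skipgram_pairs(sentence, window):
    """Return (center, context) pairs for tokens within +/- window positions."""
    pairs = []
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        pairs.extend((center, sentence[j]) for j in range(lo, hi) if j != i)
    return pairs


# Two toy region sets ("documents") made of universe tokens ("words").
region_sets = [
    ["chr1_100_200", "chr1_300_400", "chr2_50_150"],
    ["chr1_300_400", "chr3_10_90"],
]
sentences = list(shuffled_sentences(region_sets, num_shufflings=3))
print(len(sentences))  # 6: each of the 2 sets appears 3 times
print(skipgram_pairs(["a", "b", "c"], window=1))
# [('a', 'b'), ('b', 'a'), ('b', 'c'), ('c', 'b')]
```

Because each pass reorders the tokens, a larger num_shufflings exposes more distinct co-occurrence pairs to the training objective, at the cost of proportionally more training data to process.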

BEDspace (Joint Region + Label Embeddings)

Core idea: Learn a shared vector space for region sets and metadata labels using StarSpace, enabling:
- Region → Label retrieval (predict likely labels for a query region set)
- Label → Region retrieval (find region sets associated with a label)
Operational requirement: StarSpace must be installed and its path provided/configured for training.
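
Once region sets and labels live in one vector space, both retrieval directions reduce to a nearest-neighbor lookup. The sketch below shows this with cosine similarity over made-up 2-D vectors; the names `cosine` and `nearest` and all values are hypothetical, and no StarSpace training is involved.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(query, candidates, k=1):
    """Return the names of the k candidates most similar to the query vector."""
    ranked = sorted(
        candidates,
        key=lambda name: cosine(query, candidates[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy shared embedding space (values invented for illustration).
label_vecs = {"liver": (1.0, 0.1), "brain": (0.0, 1.0)}
region_set_vecs = {"sample_A.bed": (0.9, 0.2), "sample_B.bed": (0.1, 0.8)}

# Region → Label: which label sits closest to a query region set?
print(nearest(region_set_vecs["sample_A.bed"], label_vecs))  # ['liver']
# Label → Region: which region set sits closest to a label?
print(nearest(label_vecs["brain"], region_set_vecs))  # ['sample_B.bed']
```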

scEmbed (Single-cell Embeddings)

Core idea: Apply Region2Vec-like training on tokenized single-cell accessibility profiles to produce cell embeddings.
Best practice: Pre-tokenize cells (e.g., to Parquet) to reduce repeated preprocessing and speed up training.
Downstream: Use embeddings as adata.obsm[...] and run standard Scanpy steps (neighbors, Leiden, UMAP).
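
One simple way to turn region vectors into a cell vector is mean pooling over the regions accessible in that cell. The sketch below illustrates that idea only; it is an assumption about the pooling step, not a statement of what ScEmbed's encode method does internally, and `pool_cell_embedding` is a hypothetical helper.

```python
def pool_cell_embedding(cell_tokens, region_vectors):
    """Average the vectors of the regions accessible in one cell.

    Tokens missing from region_vectors (outside the vocabulary) are skipped.
    """
    vecs = [region_vectors[t] for t in cell_tokens if t in region_vectors]
    if not vecs:
        raise ValueError("cell has no tokens in the vocabulary")
    dim = len(vecs[0])
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]


# Toy region embeddings (invented values) and two cells' accessible regions.
region_vectors = {"r1": [1.0, 0.0], "r2": [0.0, 1.0], "r3": [1.0, 1.0]}
print(pool_cell_embedding(["r1", "r2"], region_vectors))       # [0.5, 0.5]
print(pool_cell_embedding(["r3", "unknown"], region_vectors))  # [1.0, 1.0]
```

Stacking one such vector per cell gives a cells-by-dimensions matrix, which is the shape Scanpy expects for an entry in adata.obsm.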

Universe Construction (Consensus Peaks)

Purpose: Create a stable reference peak set for tokenization and cross-dataset comparability.
Methods:
- CC (Coverage Cutoff): threshold-based peak calling from coverage.
- CCF (Coverage Cutoff Flexible): cutoff with flexible boundaries/confidence intervals.
- ML (Maximum Likelihood): probabilistic modeling of peak positions.
- HMM (Hidden Markov Model): state-based segmentation; typically most computationally intensive.
Typical parameters:
- --cutoff: minimum coverage to call peaks (CC/CCF).
- --merge: merge distance for nearby peaks.
- --filter-size: minimum peak length to keep.

Machine learning toolkit for genomic interval (BED) data; use it when you need to tokenize BED collections and train embeddings for regions/cells/labels, build consensus peak universes, or run similarity search and downstream ML on chromatin accessibility datasets.

Stars: 37
Category: tools
Updated: Mar 26, 2026

The npx skillsauth command shown above installs this skill globally. It works with Claude Code, Cursor, and Windsurf.

Security Scan Results

4 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

- Trivy: container and dependency vulnerability scanner; Clean; 95%
- Semgrep: static code analysis for vulnerabilities; Clean; 95%
- mcp-scan (Snyk): Model Context Protocol security validation; Clean; 95%
- VirusTotal: multi-engine malware detection; Clean; 95%
- Snyk (dep): open source security scanning; Skipped; 50%
- Socket.dev: supply chain security analysis; Skipped; 50%
- CrowdStrike: advanced threat intelligence; Skipped; 50%
- OSV-Scanner: Open Source Vulnerability database check; Skipped; 50%
- OWASP Dep-Check: Skipped; 50%

Last scanned: Apr 5, 2026, 12:53 AM (95.9s, 1 file scanned)

SKILL.md

name: geniml
description: Machine learning toolkit for genomic interval (BED) data; use it when you need to tokenize BED collections and train embeddings for regions/cells/labels, build consensus peak universes, or run similarity search and downstream ML on chromatin accessibility datasets.
license: MIT
skill-author: AIPOCH

Install manually

# Clone the repo
git clone https://github.com/aipoch/medical-research-skills.git

# Copy into Claude Code skills folder (global).
# The source path is quoted because "Data Analysis" contains a space.
mkdir -p ~/.claude/skills
cp -r "medical-research-skills/scientific-skills/Data Analysis/geniml" ~/.claude/skills/

Repository: aipoch/medical-research-skills

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT