chip-seq/chromatin-state-segmentation/SKILL.md
Segments the genome into chromatin states from combinatorial histone modification and chromatin factor ChIP-seq data. Uses ChromHMM (multivariate HMM on binarized signal, v1.27), Segway (Dynamic Bayesian Network on continuous signal), EpiSegMix (flexible-distribution HMM with duration modeling, 2024), EpiLogos (multi-biosample visualization), IDEAS (cell-type-aware joint), and full-stack ChromHMM (Vu Ernst 2022) for cross-cell-type segmentations. Handles state-count selection (15 vs 18 vs 25 states), binarization choice, OverlapEnrichment / NeighborhoodEnrichment downstream analysis, and cross-biosample integration. Use when learning chromatin states from a histone mark panel, characterizing learned states by genomic feature enrichment, or comparing chromatin landscapes across cell types.
Reference examples tested with: ChromHMM 1.27+, Segway 3.0+, EpiSegMix 1.0+, EpiLogos (Meuleman lab), IDEAS 1.20+, samtools 1.19+, bedtools 2.31+. ChromHMM requires Java 8+; runs as java -mx<MEMORY> -jar ChromHMM.jar <command>.
"Integrate multiple histone modification ChIP-seq tracks into chromatin states" -> Learn a small set of recurring combinatorial patterns of histone marks (active promoter, active enhancer, poised enhancer, polycomb-repressed, heterochromatic, transcribed, etc.) and segment the genome by which state each region belongs to. Output: per-state genomic intervals, state-by-mark emission matrix, and state-state transition matrix.
ChromHMM: BinarizeBam -> LearnModel -> OverlapEnrichment / NeighborhoodEnrichment

Segway: train -> posterior -> identify

Chromatin state segmentation requires a panel of histone marks; a minimum of 4-5 marks (e.g., H3K4me3, H3K27ac, H3K4me1, H3K36me3, H3K27me3) is needed for meaningful states. With fewer marks, simpler peak-based annotation (chipseq/peak-annotation) is more appropriate.
| Tool | Method | Strength | Limitations |
|------|--------|----------|-------------|
| ChromHMM (Ernst & Kellis 2012; v1.27 current) | Multivariate HMM on binarized 200 bp bins | Canonical; widely used; integrated with Roadmap Epigenomics 15-state model; mature toolchain | Binarization throws away signal quantitation; default 200 bp bins may be too coarse for sharp boundaries |
| Segway (Hoffman 2012) | Dynamic Bayesian Network on continuous signal | Higher resolution; uses signal magnitudes rather than binarized calls | More complex setup; slower; less standardized output |
| EpiSegMix (Schmitz, Aggarwal, Laufer, Walter, Salhab, Rahmann 2024 Bioinformatics 40:btae178) | HMM with flexible read-count distributions + duration modeling | Modern; handles both narrow and broad mark distributions in one model | Newer; smaller user base |
| EpiLogos (Meuleman lab) | Multi-biosample visualization tool | Built on top of ChromHMM/Segway segmentations; compares chromatin states across hundreds of biosamples | Visualization tool, not a segmentation method itself |
| IDEAS (Zhang 2016) | Cell-type-aware joint inference | Across-cell-type segmentation respecting cell-type identity | Slower; complex parameter tuning |
| EpiCSeg (Mammana 2015) | Negative binomial mixture | Read-count-based; doesn't need binarization | Less standardized output |
| GenoSTAN | HMM with various emission distributions | Flexible | Less actively developed |
| Roadmap 25-state model (Kundaje 2015) | ChromHMM 25-state precomputed model | Reference for cross-cell-type interpretation | Tied to the Roadmap imputed mark panel (12 marks), not the 5 core marks |
| Full-stack ChromHMM (Vu & Ernst 2022) | 100-state segmentation across 1032 datasets / 127 reference epigenomes | Comprehensive cross-tissue annotation | Computationally intensive to retrain |
ChromHMM is the de facto standard. The workflow has 4 stages:
# Build cellMarkFileTable (tab-delimited): cell_type<TAB>mark<TAB>file<TAB>(optional control)
# printf repeats the format for every four arguments and writes literal tabs,
# so the columns stay tab-separated even if an editor converts tabs to spaces
printf '%s\t%s\t%s\t%s\n' \
GM12878 H3K4me3 gm12878_h3k4me3.bam gm12878_input.bam \
GM12878 H3K27me3 gm12878_h3k27me3.bam gm12878_input.bam \
GM12878 H3K27ac gm12878_h3k27ac.bam gm12878_input.bam \
GM12878 H3K4me1 gm12878_h3k4me1.bam gm12878_input.bam \
GM12878 H3K36me3 gm12878_h3k36me3.bam gm12878_input.bam \
> cellMarkFileTable.txt
# Binarize BAMs into 200 bp bins; a bin is 1 for a mark when its read count
# exceeds a Poisson-based threshold (relative to control, if given), else 0
java -mx16G -jar ChromHMM.jar BinarizeBam \
-b 200 \
chromsizes_hg38.txt \
bam_dir/ \
cellMarkFileTable.txt \
binarized_output/
Output: per-chromosome _binary.txt files, one row per 200 bp bin, columns = marks, values 0/1.
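Before training, it can help to check what fraction of bins each mark was called present in, since a mark that is almost never (or almost always) 1 contributes little to the model. The following is a minimal sketch, assuming the binary file layout described above with two header lines (cell and chromosome, then mark names); `mark_fractions` and the toy file are hypothetical names used only for illustration.

```bash
# Sketch: fraction of bins called 1 for each mark in one ChromHMM binary file.
# Assumed layout: line 1 = cell<TAB>chromosome, line 2 = mark names,
# remaining lines = one row of 0/1 calls per bin.
mark_fractions() {
  awk -F'\t' '
    NR == 2 { n = NF; for (i = 1; i <= NF; i++) name[i] = $i; next }
    NR > 2  { bins++; for (i = 1; i <= n; i++) if ($i == 1) ones[i]++ }
    END     { for (i = 1; i <= n; i++) printf "%s\t%.3f\n", name[i], (bins ? ones[i] / bins : 0) }
  ' "$1"
}

# Toy input with made-up values so the function can be tried without ChromHMM.
demo_dir=$(mktemp -d)
printf 'GM12878\tchr21\nH3K4me3\tH3K27me3\n1\t0\n0\t0\n1\t1\n0\t0\n' \
  > "$demo_dir/GM12878_chr21_binary.txt"
mark_fractions "$demo_dir/GM12878_chr21_binary.txt"
```

On real data the function would be run on each per-chromosome file in `binarized_output/`; what counts as a suspicious fraction depends on the mark and is left to the reader.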
# Train HMM with N states; common choices: 15, 18, 25
# 15 states: Ernst & Kellis 2011 model; canonical
# 18 states: extends with additional regulatory states
# 25 states: Roadmap Epigenomics extended model
java -mx16G -jar ChromHMM.jar LearnModel \
-p 8 \
binarized_output/ \
model_15state/ \
15 \
hg38
# Output (inside model_15state/): model_15.txt (emission + transition parameters),
# emissions_15.png and transitions_15.png (visualizations),
# GM12878_15_segments.bed (state assignments, one file per cell type)
# LearnModel also runs OverlapEnrichment + NeighborhoodEnrichment automatically
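A quick sanity check on a learned segmentation is how much of the genome each state covers. The sketch below assumes a four-column, tab-delimited segments file (chromosome, start, end, state); `state_coverage` and the toy intervals are hypothetical and only illustrate the calculation.

```bash
# Sketch: base pairs and genome fraction covered by each state in a segments BED.
# Assumes columns: chrom, start, end, state (tab-delimited).
state_coverage() {
  awk -F'\t' '
    { bp[$4] += $3 - $2; total += $3 - $2 }
    END { for (s in bp) printf "%s\t%d\t%.4f\n", s, bp[s], bp[s] / total }
  ' "$1" | sort -k2,2nr
}

# Toy segments (made-up coordinates) so the function can be tried stand-alone.
demo_seg=$(mktemp)
printf 'chr1\t0\t600\tE15\nchr1\t600\t1000\tE1\nchr1\t1000\t2000\tE15\n' > "$demo_seg"
state_coverage "$demo_seg"
```

States are listed from largest to smallest coverage; a quiescent-like state dominating the list is typical, while an active-promoter-like state covering a large share of the genome would suggest a problem upstream.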
Roadmap 15-state assignments (canonical):

| State | Description (characteristic marks) |
|-------|------------------------------------|
| 1_TssA | Active TSS (high H3K4me3, H3K27ac) |
| 2_TssAFlnk | Flanking TSS (H3K4me3, H3K27ac, lower) |
| 3_TxFlnk | Transcript flanking |
| 4_Tx | Strong transcription (H3K36me3, H3K79me2 if available) |
| 5_TxWk | Weak transcription |
| 6_EnhG | Enhancer in gene body (H3K4me1, H3K27ac) |
| 7_Enh | Generic enhancer (H3K4me1, H3K27ac) |
| 8_ZNF/Rpts | Zinc-finger / repeats |
| 9_Het | Heterochromatin (H3K9me3) |
| 10_TssBiv | Bivalent TSS (H3K4me3 + H3K27me3) |
| 11_BivFlnk | Bivalent flanking |
| 12_EnhBiv | Bivalent enhancer (H3K4me1 + H3K27me3) |
| 13_ReprPC | Polycomb-repressed (H3K27me3) |
| 14_ReprPCWk | Weak Polycomb |
| 15_Quies | Quiescent (no signal) |

The Roadmap core 15-state model itself was learned from five marks (H3K4me3, H3K4me1, H3K36me3, H3K27me3, H3K9me3); H3K27ac and H3K79me2 are listed here as marks that typically accompany those states when they are profiled.
# OverlapEnrichment: enrichment of each state for external feature sets
# (ChromHMM options such as -labels go before the positional arguments)
java -mx16G -jar ChromHMM.jar OverlapEnrichment \
-labels \
model_15state/GM12878_15_segments.bed \
/path/to/anchor_files/ \
enrichment_output/GM12878
# NeighborhoodEnrichment: enrichment relative to anchor positions (e.g., TSS)
java -mx16G -jar ChromHMM.jar NeighborhoodEnrichment \
-labels \
model_15state/GM12878_15_segments.bed \
/path/to/tss_anchors.txt \
enrichment_output/GM12878_TSS
Anchor files: BED files of features (CGIs, repeats, conserved elements, etc.) for OverlapEnrichment; position files for NeighborhoodEnrichment.
| States | Use case | Mark panel size |
|--------|----------|-----------------|
| 8-10 | Initial exploration; small mark panel | 3-5 marks |
| 15 | Roadmap Epigenomics core model | 5 core marks (H3K4me3, H3K4me1, H3K36me3, H3K27me3, H3K9me3) |
| 18 | Roadmap expanded model (adds H3K27ac; finer enhancer and TSS-flanking states) | 5-7 marks (Roadmap: core 5 + H3K27ac) |
| 25 | Roadmap Epigenomics extended; cross-cell-type compatibility | 6+ marks (Roadmap: 12 imputed marks) |
| 50+ | Full-stack model (Vu & Ernst 2022; 100 states) | Many marks across many cell types |

Practical workflow: Train at N=15, 18, 25; compare emission matrices; choose the smallest N where the biology is interpretable. Higher N risks over-segmentation, where states split on random variation rather than distinct biology.
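Comparing emission matrices by eye gets tedious at higher N, so one option is to flag state pairs whose emission profiles are nearly identical. This is a minimal sketch, assuming a tab-delimited emissions table with one header row (a state column followed by mark names) and one row of emission probabilities per state; the exact layout of ChromHMM's emissions text output is an assumption here, and `similar_states`, the threshold, and the toy values are hypothetical.

```bash
# Sketch: list state pairs whose emission probabilities differ by at most
# a threshold on every mark (a hint that N may be too high).
similar_states() {   # usage: similar_states emissions.txt [max_abs_diff]
  awk -F'\t' -v thr="${2:-0.1}" '
    NR == 1 { next }                       # skip header row with mark names
    { n++; id[n] = $1; m = NF; for (j = 2; j <= NF; j++) e[n, j] = $j }
    END {
      for (a = 1; a < n; a++) for (b = a + 1; b <= n; b++) {
        worst = 0                          # largest per-mark difference
        for (j = 2; j <= m; j++) {
          d = e[a, j] - e[b, j]; if (d < 0) d = -d
          if (d > worst) worst = d
        }
        if (worst <= thr) printf "%s\t%s\t%.3f\n", id[a], id[b], worst
      }
    }
  ' "$1"
}

# Toy 3-state, 2-mark table with made-up probabilities.
demo_em=$(mktemp)
printf 'state\tH3K4me3\tH3K27me3\n1\t0.95\t0.02\n2\t0.93\t0.05\n3\t0.01\t0.80\n' > "$demo_em"
similar_states "$demo_em" 0.1
```

Pairs reported here are candidates for merging, not proof of redundancy: two states with similar emissions can still differ in their transition patterns or genomic positions.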
# Segway reads a Genomedata archive rather than bigWig files directly; build one
# first with genomedata-load (signal.genomedata is a placeholder archive name)
# Train Segway model; set --num-labels explicitly
segway train \
--num-labels=25 \
--num-instances=3 \
--resolution=100 \
signal.genomedata traindir/
# Posterior probability of each label at each position
segway posterior signal.genomedata traindir/ posteriordir/
# Viterbi segmentation
segway identify signal.genomedata traindir/ identifydir/
# Output: identifydir/segway.bed.gz (state assignments)
Segway models continuous signal (loaded into a Genomedata archive), whereas ChromHMM models binarized presence/absence calls. Trade-off: more information per region (continuous) but more complex training.
EpiLogos doesn't perform segmentation; it visualizes existing ChromHMM/Segway segmentations across many biosamples (epilogos.altius.org).
# Use precomputed ChromHMM segmentations across multiple cell types
# Web interface: https://epilogos.altius.org/
# Local: github.com/meuleman/epilogos
Useful for: cross-cell-type comparison; identifying tissue-specific regulatory states; cohort-level chromatin landscape summaries.
The full-stack model was trained on 1032 datasets from 127 reference epigenomes:
# Use precomputed model from Ernst lab
# github.com/ernstlab/full_stack_ChromHMM_annotations
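# MakeSegmentation applies a model to binarized data; the input must contain the
# same tracks, in the same order, as the model was trained on (for full-stack,
# the full 1032-track panel), so most analyses use the precomputed annotation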
java -mx16G -jar ChromHMM.jar MakeSegmentation \
full_stack_model_100states.txt \
binarized_sample/ \
full_stack_output/
Useful for: annotating loci with a comprehensive, cell-type-independent cross-tissue annotation; comparing to canonical Roadmap states.
Trigger: Studying TF binding boundaries or sharp enhancer transitions at 200 bp resolution.
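Mechanism: ChromHMM assigns one state per 200 bp bin by default, so a true boundary can fall anywhere inside a bin and its position is uncertain by up to 200 bp.
Fix: Reduce to -b 100 or -b 50 (smaller bin); this increases memory and compute time but improves boundary resolution. Re-run BinarizeBam and re-train the model with the same -b value.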
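To gauge the cost of a smaller bin size before re-running, the number of bins can be estimated from the chromosome sizes file. The sketch below assumes a two-column, tab-delimited sizes file and ignores any partial bin at the end of each chromosome (an assumption about how trailing bases are handled); `bin_count` is a hypothetical helper, and the two chromosome lengths are the hg38 values for chr21 and chr22.

```bash
# Sketch: approximate number of bins the model must process at a given bin size.
bin_count() {   # usage: bin_count chromsizes.txt bin_size
  awk -F'\t' -v b="$2" '{ total += int($2 / b) } END { print total + 0 }' "$1"
}

# Two hg38 chromosomes as a small stand-alone example.
demo_sizes=$(mktemp)
printf 'chr21\t46709983\nchr22\t50818468\n' > "$demo_sizes"
for b in 200 100 50; do
  printf '%s bp bins: %s\n' "$b" "$(bin_count "$demo_sizes" "$b")"
done
```

Halving the bin size roughly doubles the bin count, which is a reasonable first approximation for how memory and training time will grow.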
Trigger: Distinguishing low- from high-signal regions of the same state.
Mechanism: ChromHMM binarizes each 200 bp bin to 0/1 per mark; state assignment uses combinatorial pattern, not magnitude.
Fix: Use Segway (continuous signal) or EpiSegMix (flexible distributions) for magnitude-aware segmentation.
Trigger: Training with N=50 states on a 4-mark panel; or N=10 on a 7-mark panel.
Mechanism: Excess states fragment biology; insufficient states force unrelated regions into the same state.
Symptom: Emission matrix shows redundant states (multiple states with same emission profile) at high N; or biologically distinct regions lumped together at low N.
Fix: Train at N=15, 18, 25; inspect emission matrix similarity; choose the smallest N where states are interpretable as distinct biology.
Trigger: Applying Roadmap 25-state model to a sample with different mark panel.
Mechanism: Model was trained on specific marks; emission probabilities are mark-specific. Applying to different mark panel produces nonsensical state assignments.
Fix: Either train a new model on the available mark panel; OR ensure the exact same marks (and ordering) as used in the model.
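A simple guard against a mark panel mismatch is to compare the mark header of a binarized file with the mark order the model expects. This is a minimal sketch, assuming the mark names sit on the second line of the binary file; `check_mark_order` is a hypothetical helper, and the expected mark list has to be supplied by the user from the model's documentation.

```bash
# Sketch: confirm a binary file lists exactly the expected marks, in order.
check_mark_order() {   # usage: check_mark_order binary_file "mark1 mark2 ..."
  found=$(awk -F'\t' 'NR == 2 { gsub(/\t/, " "); print; exit }' "$1")
  if [ "$found" = "$2" ]; then
    echo "OK: $found"
  else
    echo "MISMATCH: file has [$found], model expects [$2]" >&2
    return 1
  fi
}

# Toy binary file (made-up content) to try the check stand-alone.
demo_bin=$(mktemp)
printf 'GM12878\tchr21\nH3K4me3\tH3K27me3\n1\t0\n' > "$demo_bin"
check_mark_order "$demo_bin" "H3K4me3 H3K27me3"
```

The function returns a non-zero status on mismatch, so it can gate a later MakeSegmentation call in a script.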
Trigger: Using IgG controls in BinarizeBam for histone marks.
Mechanism: ChromHMM's binarization thresholds mark signal against the control track; for histone marks the appropriate background is input (sonicated chromatin), not IgG.
Fix: Use sonicated input as control for histone mark ChIP. IgG is not appropriate for ChromHMM binarization of histone marks.
Trigger: Training separate models per cell type and trying to compare state assignments.
Mechanism: State 5 in cell type A may not correspond to state 5 in cell type B if trained independently.
Fix: Train one model on concatenated data from all cell types (joint segmentation), or apply a single precomputed model (e.g., the Roadmap 15-state model) to all samples so that state labels are consistent.
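For a joint model, all cell types go into one cellMarkFileTable, and it is worth confirming that every cell type carries the same marks before binarizing. The following is a sketch under that assumption; `missing_marks`, the K562 rows, and the BAM file names are hypothetical and only illustrate the check.

```bash
# Sketch: report marks that are present for some cell types but missing for others.
missing_marks() {   # prints cell<TAB>mark for every gap in the table
  awk -F'\t' '
    { cells[$1]; marks[$2]; have[$1, $2] = 1 }
    END { for (c in cells) for (m in marks) if (!((c, m) in have)) print c "\t" m }
  ' "$1" | sort
}

# Toy two-cell-type table; K562 deliberately lacks H3K27ac.
demo_table=$(mktemp)
{
  for mark in H3K4me3 H3K27ac H3K27me3; do
    printf 'GM12878\t%s\tgm12878_%s.bam\tgm12878_input.bam\n' "$mark" "$mark"
  done
  for mark in H3K4me3 H3K27me3; do
    printf 'K562\t%s\tk562_%s.bam\tk562_input.bam\n' "$mark" "$mark"
  done
} > "$demo_table"
missing_marks "$demo_table"
```

An empty result means the panel is complete across cell types; any line printed identifies a cell type and mark to add or to drop from the joint model.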
| Pattern | Likely cause | Action |
|---------|--------------|--------|
| ChromHMM and Segway segments differ | Different bin sizes / binarization vs continuous | Both can be valid; inspect emission matrices; pick the tool matching the resolution needs |
| State assignment varies wildly between replicates | Insufficient marks; over-binned | Increase mark panel; reduce state count |
| Active TSS state overlaps polycomb state at promoters | Bivalent biology (Bernstein 2006) | Expected for ESC-like cells; not an error; consider bivalent-specific state in N=18 model |
| Roadmap 25-state model annotates unknown cell type | Cross-cell-type generalization | Use cautiously; verify against tissue-specific tracks |
| Heterochromatin (H3K9me3) state has too many bins | H3K9me3 covers large fraction of genome | Expected; heterochromatin is genome-wide |
| Error / symptom | Cause | Solution |
|-----------------|-------|----------|
| java.lang.OutOfMemoryError | Insufficient JVM heap | java -mx32G -jar ChromHMM.jar ... |
| BinarizeBam very slow | Large BAMs without index | samtools index all BAMs first |
| All states have similar emissions | Mark panel too small | Need at least 5 marks for canonical 15-state model |
| Segments file empty for some chromosomes | Chromosome not in chromsizes file | Add or use -chrom flag to restrict |
| State labels don't match Roadmap | Trained model independently | Use Roadmap precomputed model OR map states by emission similarity |
| ChromHMM "no signal in marks" | All bins binarized to 0 | Check signal quality; verify control normalization |
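Several of the symptoms above trace back to chromosome naming (for example `chr1` versus `1`). A quick check is to list chromosome names that appear in the data but not in the chromosome sizes file. This is a minimal sketch; `unknown_chroms` is a hypothetical helper, and the second argument can be any tab-delimited file whose first column is a chromosome name, such as a BED file or a name list exported from the BAM headers.

```bash
# Sketch: print chromosome names in a data file that are absent from the sizes file.
unknown_chroms() {   # usage: unknown_chroms chromsizes.txt bed_or_name_list
  awk -F'\t' '
    NR == FNR { known[$1]; next }              # first file: record known names
    !($1 in known) && !seen[$1]++ { print $1 } # second file: report each new name once
  ' "$1" "$2"
}

# Toy inputs: hg38 lengths for chr1/chr2 and a name list with two mismatches.
demo_sz=$(mktemp); demo_names=$(mktemp)
printf 'chr1\t248956422\nchr2\t242193529\n' > "$demo_sz"
printf 'chr1\n2\nchrM\nchr1\n' > "$demo_names"
unknown_chroms "$demo_sz" "$demo_names"
```

Any name printed is a chromosome the pipeline would skip or fail on; the usual remedies are to harmonize the naming or to add the missing entries to the sizes file.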