hi-c-analysis/hic-data-io/SKILL.md
Loads, converts, and manipulates Hi-C contact matrices in cooler format (.cool/.mcool/.scool) and Juicer .hic, using cooler (Python + CLI), hic2cool, and hictk. Covers the single-resolution mcool URI (file.mcool::/resolutions/<bp>), the load-bearing divisive-vs-multiplicative weight-naming rule (KR/VC/VC_SQRT auto-divisive vs cooler's multiplicative weight), what survives .hic<->.cool conversion (FRAG matrices and norm vectors do not), raw-vs-balanced coarsening, the .pairs upper-triangle/chromsize-order contract, and chrom-naming/bin-table provenance. Use when loading a cooler, converting .hic to .mcool, selecting a resolution, building a cooler from pairs or a matrix, coarsening/zoomifying, importing Juicer norm vectors, or debugging all-NaN balanced matrices and chr1-vs-1 empty fetches.
npx skillsauth add GPTomics/bioSkills bio-hi-c-analysis-hic-data-ioInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: cooler 0.10+, hic2cool 1.0+, hictk 1.0+, bioframe 0.7+
Before using code patterns, verify installed versions match. If versions differ:
pip show <package> then help(module.function) to check signatures<tool> --version then <tool> --help to confirm flagsIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
Two version boundaries change BEHAVIOUR, not just signatures: hic2cool >= 0.5.0 stores Juicer norms un-inverted (divisive); < 0.5.0 inverted them to multiplicative, so two coolers made from one .hic by different hic2cool versions disagree numerically. cooler standardized creation/balance signatures around 0.8-0.10 (balance_cooler keyword-only after clr; store=False default). Record the converter version in provenance.
"Load my Hi-C contact matrix, convert it, and pull out a region." -> Open the cooler at one resolution, fetch raw or balanced pixels, and convert across .hic/.cool/.mcool while knowing what does not survive the round trip.
cooler.Cooler('file.mcool::/resolutions/10000').matrix(balance=True).fetch('chr1')hic2cool convert in.hic out.mcool -r 0 / hictk convert in.hic out.mcool / cooler cload pairs ...A .cool is a thin, open, HDF5-native store: three tables (chroms, bins, pixels) holding raw observed integer counts in an upper-triangle COO layout, plus optional cached weight bias columns. Balancing, expected, O/E, eigenvectors -- everything else is computed downstream on demand. Juicer's .hic is the opposite philosophy: a sealed binary deliverable with all resolutions, precomputed normalization vectors, and expected vectors welded in. Every footgun in this skill descends from that split:
.hic -> .cool loses FRAG (restriction-fragment) matrices (cooler has no FRAG concept) and Juicer's precomputed expected; a norm whose vector is missing arrives as all-NaN. .cool -> .hic loses asymmetric matrices, arbitrary extra pixel value columns, and non-Hi-C labeled arrays.weight column NAME is load-bearing. cooler's own ICE weight is applied MULTIPLICATIVELY by matrix(balance=True). Juicer KR/VC norms are DIVISIVE. Cooler.matrix(divisive_weights=None) (the default) decides by column name: weights named KR, VC, or VC_SQRT are auto-treated as divisive; everything else (including weight) is multiplicative. Import a KR vector under the name weight and it is applied the wrong way -- garbage, no error..mcool is a container of resolutions, not a matrix. Every downstream tool wants a single-resolution URI file.mcool::/resolutions/<bp>, never the bare .mcool. Each resolution is balanced from scratch; the 100kb weight is NOT derivable from the 10kb weight.| Format / Tool | Role | Mechanism | When |
|---------------|------|-----------|------|
| .cool | single-resolution store | HDF5 chroms/bins/pixels, raw counts + optional weight | one resolution; the analysis unit |
| .mcool | multi-resolution container | /resolutions/<bp>/ each a full cooler | HiGlass tilesets; pick a resolution via URI |
| .scool | single-cell container | shared bins, /cells/<name>/pixels | scHi-C; avoids duplicating the bin table per cell |
| .hic (Juicer) | sealed deliverable | binary, all resolutions + baked norm/expected | Juicer/Juicebox ecosystem; BP or FRAG bins |
| .pairs (4DN) | upstream contact list | bgzip + pairix index; upper-triangle, flipped | input to cooler cload; provenance of the matrix |
| cooler (CLI+Py) | the open standard store/API | pandas/scipy selectors; cload/balance/zoomify/dump | the default; cooltools integration |
| hic2cool | .hic -> .cool/.mcool | 4DN-canonical norm handler (>=0.5.0 un-inverted) | importing Juicer norms; BP only |
| hictk | fast cross-format convert/dump | C++; reads .hic v6-9 + cooler, writes .hic v9 + cooler | large files; faster than hic2cool/straw; no FRAG, no asymmetric |
| Scenario | Recommended | Why |
|----------|-------------|-----|
| Bare .mcool, tool errors / wrong scale | append ::/resolutions/<bp> URI | the .mcool is a container; tools need one resolution |
| .hic -> .mcool, want Juicer KR/SCALE preserved | hic2cool convert -r 0 | 4DN-canonical norm handling; keeps divisive names |
| .hic -> cooler, large file, speed matters | hictk convert | C++, order-of-magnitude faster; but no FRAG |
| .hic was FRAG-binned | re-bin from pairs in BP | FRAG does not survive any converter -> contact-pairs |
| Build a cooler from .pairs | cooler cload pairs -c1 -p1 -c2 -p2 sizes.txt:bp | needs flipped/deduped pairs -> contact-pairs |
| Need lower resolution | cooler zoomify --balance (sum RAW, re-ICE) | cannot sum balanced pixels; re-balance per resolution |
| Imported KR/VC norms | keep their original names | the KR/VC/VC_SQRT auto-divisive rule fires only on those names |
| matrix(balance=True) raises / all NaN | balance first; NaN rows = masked bins | unbalanced file has no weight; masked bins are expected NaN |
| fetch('1') returns empty on a chr1 file | harmonize chrom naming first | chr1-vs-1 silently zeros every join, no error |
| Balance/normalize for analysis | -> matrix-operations | ICE/KR mechanics, O/E, expected live there |
| Two coolers, compare pixels | confirm bin tables byte-identical | different contigs/order shift every bin_id |
| scHi-C many cells | cooler.create_scool (.scool) -> single-cell | shared bin table; per-cell pixels |
import cooler
cooler.fileops.list_coolers('matrix.mcool') # e.g. ['/resolutions/1000', '/resolutions/10000', ...]
clr = cooler.Cooler('matrix.mcool::/resolutions/10000') # single-resolution URI, never the bare .mcool
clr.binsize, clr.chromnames, clr.info['sum'] # info is unvalidated metadata, not a guarantee
'weight' in clr.bins().columns # is this resolution balanced?
clr.matrix().fetch(region) takes a UCSC region string or a bare chrom name; one arg -> symmetric square, two args -> rectangular submatrix (incl. trans). clr.bins().fetch('chr1'), clr.pixels().fetch(region) slice the tables.
Goal: Pull a chromosome submatrix as a dense array, correctly balanced.
Approach: Confirm the file carries a weight column, then matrix(balance=True); on a balanced file, all-NaN rows are masked low-coverage bins (correct), not a bug. For Juicer-imported KR/VC/VC_SQRT weights, name them correctly and the divisive auto-rule fires; force it with divisive_weights= only if a custom column is misnamed.
balanced = clr.matrix(balance=True).fetch('chr1') # multiplicative cooler weight
raw = clr.matrix(balance=False).fetch('chr1') # observed counts
kr = clr.matrix(balance='KR').fetch('chr1') # KR/VC/VC_SQRT auto-treated as divisive by name
sparse = clr.matrix(balance=True, sparse=True).fetch('chr1') # scipy COO for large chromosomes
hic2cool convert in.hic out.mcool -r 0 # -r 0 = all resolutions -> .mcool; norm vectors un-inverted (>=0.5.0)
hic2cool convert in.hic out.cool -r 10000 # single resolution -> .cool
hic2cool extract-norms in.hic out.mcool # add Juicer norm vectors to an existing matching cooler
hictk convert in.hic out.mcool # fast C++ path; --resolutions to subset (single -> .cool)
hictk convert in.mcool out.hic # cooler -> .hic v9 ONLY; drops asymmetric/extra columns
Neither converter reads/writes FRAG-binned matrices; a FRAG .hic yields only its BP resolutions. Record the hic2cool version: the 0.5.0 inversion boundary changes weight values.
Goal: Turn an in-memory numpy contact matrix into a cooler without a hand-rolled O(n^2) loop.
Approach: Binnify the chromsizes, take only the upper triangle (cooler stores symmetric_upper), pull the nonzero coordinates with a single vectorized np.triu + np.nonzero, and assemble the pixel DataFrame in one shot.
import cooler
import bioframe
import numpy as np
import pandas as pd
chromsizes = bioframe.fetch_chromsizes('hg38') # pin the assembly + chrom naming up front
bins = cooler.binnify(chromsizes, 10000) # 10kb bins; bin table identity defines pixel comparability
upper = np.triu(matrix) # cooler stores the upper triangle only
i, j = np.nonzero(upper) # vectorized; never loop over all bin pairs
pixels = pd.DataFrame({'bin1_id': i, 'bin2_id': j, 'count': upper[i, j]})
cooler.create_cooler('new.cool', bins, pixels, assembly='hg38', symmetric_upper=True)
For pairs, prefer the CLI: cooler cload pairs -c1 2 -p1 3 -c2 4 -p2 5 chromsizes.txt:10000 in.pairs out.cool (the pairs must already be flipped/deduped -> contact-pairs).
Goal: Produce a lower-resolution or multi-resolution file whose weights are valid.
Approach: Coarsen the RAW counts then re-run ICE at each new resolution; zoomify does exactly this. Never sum balanced pixels -- weights are resolution-specific and summed balanced values are silently wrong.
cooler.zoomify_cooler('hires.cool', 'out.mcool', resolutions=[10000, 50000, 100000, 500000], chunksize=10_000_000)
cooler.coarsen_cooler('hires.cool', 'lowres.cool', factor=5, chunksize=10_000_000) # raw sum; re-balance after
cooler zoomify -r 10000,50000,100000,500000 --balance hires.cool # raw-coarsen then ICE afresh per level
cooler.merge_coolers('merged.cool', ['rep1.cool', 'rep2.cool'], mergebuf=20_000_000) # bin tables MUST match
np.save('chr1.npy', clr.matrix(balance=True).fetch('chr1'))
cooler dump -t pixels --join --balanced in.cool > pixels.tsv # --balanced requires a balanced file
cooler dump -t bins in.cool > bins.tsv
cooler info in.cool ; cooler ls -l in.mcool
The bin table IS the identity of a cooler. Two coolers are pixel-comparable only if their bin tables are byte-identical: same chroms, same order, same contigs present, same binsize. Dropping chrM/scaffolds in one pipeline shifts every bin_id and makes pixel comparison nonsense. chr1 (UCSC) vs 1 (Ensembl) silently zeros every cross-tool join and fetch('1') on a chr-named file returns nothing with no error -- harmonize naming (and the assembly) across the cooler, the genome FASTA, any phasing/annotation track, and any blacklist BED (genome-intervals/bed-file-basics). info['genome-assembly'] is unvalidated metadata, not a checksum. The .pairs #chromsize header ORDER defines the upper-triangle convention and the cooler's assembly/chromsizes must match the order used to flip the pairs upstream (contact-pairs), or bin assignment and the triangle disagree.
Standard Hi-C is pairwise. Multi-way assays (Pore-C, SPRITE, GAM) record higher-order co-occurrence; the common path is to DECOMPOSE concatemers into pairwise contacts and store them in a normal cooler, deferring true multi-way analysis to assay-specific tools rather than forcing it into the COO model. For single-cell Hi-C, .scool shares one bin table across /cells/<name>/pixels (cooler.create_scool); per-cell sparsity and imputation are a distinct world -> single-cell/scatac-analysis for the single-cell chromatin context.
Trigger: cooler.Cooler('f.mcool') or a cooltools call on the bare .mcool. Mechanism: an .mcool is a group of resolutions, not one matrix. Symptom: KeyError, wrong/aggregate resolution, or a tool error. Fix: use f.mcool::/resolutions/<bp>; list with cooler.fileops.list_coolers.
weightTrigger: renaming a divisive Juicer norm to weight. Mechanism: divisive_weights=None treats only KR/VC/VC_SQRT as divisive; weight is multiplicative. Symptom: balanced values are wrong, no error. Fix: keep the original KR/VC/VC_SQRT name, or pass divisive_weights=True explicitly.
Trigger: two coolers from one .hic made by hic2cool <0.5.0 and >=0.5.0. Mechanism: pre-0.5.0 inverted norms to multiplicative; >=0.5.0 keeps them divisive. Symptom: the same norm gives different balanced values. Fix: regenerate both with one version; record the version in provenance.
Trigger: building a coarse balanced matrix by summing finer balanced values. Mechanism: balancing weights are resolution-specific. Symptom: plausible-looking but wrong coarse values. Fix: sum RAW then re-ICE per resolution (cooler zoomify --balance).
Trigger: matrix(balance=True) raises or returns all NaN. Mechanism: an unbalanced file has no weight; or those bins were masked during balancing. Symptom: error (unbalanced) or NaN rows/cols (masked). Fix: balance first (cooler balance / balance_cooler(..., store=True)); accept masked-bin NaNs as correct.
.hic "lost resolution" after conversionTrigger: converting a FRAG-binned .hic. Mechanism: cooler/hictk/hic2cool have no FRAG concept. Symptom: only BP resolutions appear; FRAG matrix missing. Fix: re-bin in BP from the pairs.
Trigger: cooler is chr1, a track/blacklist is 1 (or vice versa). Mechanism: chrom names never match. Symptom: empty fetch, zero overlap, no error. Fix: harmonize naming across cooler, FASTA, tracks, blacklist.
| Threshold | Source | Rationale |
|-----------|--------|-----------|
| hic2cool >= 0.5.0 (un-inverted norms) | hic2cool README | the 0.5.0 boundary flips divisive-vs-multiplicative storage; pin it |
| Cooler weight named KR/VC/VC_SQRT -> divisive | cooler divisive_weights rule | only these names auto-trigger divisive; all else multiplicative |
| ignore_diags=2 (balance default) | cooler.balance_cooler default | drop the main + first diagonal (self/near-diagonal artifacts) before ICE |
| mad_max=5 (balance default) | cooler.balance_cooler default | mask bins >5 MAD from the median coverage marginal |
| min_nnz=10 (balance default) | cooler.balance_cooler default | mask sparse bins with <10 nonzero pixels |
| mcool resolution ladder = integer multiples of base | HiGlass tiling | non-integer-multiple levels break tile aggregation; 4DN uses a nice-number series |
| Error / symptom | Cause | Solution |
|-----------------|-------|----------|
| clr.matrix(balance=True) raises / all NaN | unbalanced file, or masked bins | balance first; masked-bin NaN is expected |
| Empty / wrong-resolution result on .mcool | bare .mcool passed | use f.mcool::/resolutions/<bp> |
| Balanced values look wrong, no error | KR/VC norm renamed to weight (treated multiplicative) | keep KR/VC/VC_SQRT name or set divisive_weights=True |
| Two coolers from one .hic disagree | hic2cool 0.5.0 inversion boundary | regenerate with one version; record it |
| fetch('1') returns nothing | chr1-vs-1 naming mismatch | harmonize chrom naming across all inputs |
| FRAG resolutions missing after convert | no FRAG concept in cooler | re-bin from pairs in BP |
| Coarse matrix values wrong | summed balanced pixels | sum raw then re-ICE (zoomify --balance) |
| AttributeError on a cooler function | pre-0.8/0.10 API change | help(cooler.<fn>); balance_cooler is keyword-only after clr |
development
Find restriction enzyme cut sites in DNA sequences using Biopython Bio.Restriction. Search with single enzymes, batches of enzymes, or commercially available enzyme sets. Returns cut positions for linear or circular DNA. Use when finding restriction enzyme cut sites in sequences.
development
Create restriction maps showing enzyme cut positions on DNA sequences using Biopython Bio.Restriction. Visualize cut sites, calculate distances between sites, and generate text or graphical maps. Use when creating or analyzing restriction maps.
development
Analyze restriction digest fragments using Biopython Bio.Restriction. Predict fragment sizes, get fragment sequences, simulate gel electrophoresis patterns, and perform double digests. Use when analyzing restriction digest fragment patterns.
development
Select restriction enzymes by criteria using Biopython Bio.Restriction. Find enzymes that cut once, don't cut, produce specific overhangs, are commercially available, or have compatible ends for cloning. Use when selecting restriction enzymes for cloning or analysis.