hi-c-analysis/matrix-operations/SKILL.md
Balances Hi-C contact matrices (ICE via cooler.balance_cooler, KR/SCALE/VC context), computes distance-decay expected with cooltools (expected_cis per-diagonal P(s), expected_trans scalar), builds observed/expected (O/E) matrices, and diagnoses polymer state from the P(s) log-derivative. Covers the within-matrix-vs-cross-sample distinction (balancing is NOT a normalizer), the equal-visibility assumption that CNV/aneuploidy violates (use raw counts for copy-number), cis-only balancing, mad_max/blacklist masking before balancing, multiplicative cooler weights vs divisive juicer weights, and the resolution-vs-depth budget. Use when balancing a .cool/.mcool, computing expected or P(s), making O/E matrices for compartments/loops, deciding ICE vs KR vs SCALE, choosing a resolution for a given depth, or troubleshooting NaN/all-NaN balanced matrices; route cross-sample comparison to hic-differential.
npx skillsauth add GPTomics/bioSkills bio-hi-c-analysis-matrix-operationsInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: cooler 0.10+, cooltools 0.7+, bioframe 0.7+
Before using code patterns, verify installed versions match. If versions differ:
pip show <package> then help(module.function) to check signatures<tool> --version then <tool> --help to confirm flagsIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
cooltools standardized its API around 0.7 (functions take a view_df viewframe; expected_cis defaults to smooth=True, aggregate_smoothed=True). .mcool is multi-resolution: analysis functions take a single-resolution URI (file.mcool::/resolutions/10000), never the bare .mcool. A matrix must be balanced (a stored weight column) before O/E, compartments, insulation, or dots; clr.matrix(balance=True) on an unbalanced cooler returns all-NaN.
"Make the pixels of my Hi-C matrix comparable to each other." -> Balance (remove per-bin coverage bias under equal-visibility), then divide by distance-matched expected (remove the polymer P(s) background) to get O/E.
cooler.balance_cooler(clr, cis_only=True, store=True), then cooltools.expected_cis(clr) and divide observed by per-diagonal expected.Balancing (ICE/KR) is a within-matrix operation: it solves for per-bin bias weights so every bin has equal genome-wide visibility, making a single map internally consistent. It does nothing to relate map A to map B. Two balanced matrices at different sequencing depth still differ in absolute magnitude, dynamic range, and noise floor -- and rescale_marginals makes the absolute balanced values arbitrary-scaled anyway. "I balanced both, now I'll subtract/log2-ratio them" is the single most common error in the field: the depth difference is read as biology. Cross-sample comparison requires downsampling to equal valid-pair count, distance-matched O/E, and a replicate-aware differential tool (multiHiCcompare, HiCcompare loess-over-distance, dcHiC) -- route to hic-differential.
Two corollaries that flow from the same equal-visibility model:
| Method | What it does | Mechanism | When | |--------|-------------|-----------|------| | ICE (cooler native) | true matrix balancing | iterative proportional fitting (Sinkhorn); equalizes all marginals | default; robust, converges on sparse/low-depth where KR fails | | KR (juicer) | true matrix balancing | Knight-Ruiz Newton solver; SAME fixed point as ICE | fast (few iterations); fails to converge on sparse/high-res maps | | SCALE (juicer) | true matrix balancing | modern KR-family solver, more robust | juicer's default; converges where KR diverges on sparse maps | | VC / vanilla coverage | NOT true balancing | single pass: divide by row-coverage * col-coverage (one ICE iteration) | fast robust fallback; leaves residual bias | | VC_SQRT | NOT true balancing | divide by sqrt of coverage product (gentler than VC) | very sparse data where full balancing overfits | | LOIC / CAIC (Servant 2018) | CNV-aware balancing | condition on copy-number; LOIC keeps the CN effect, CAIC removes it | aneuploid/tumor genomes (plain ICE is wrong here) |
KR and ICE reach the same balanced map -- choose by convergence, not quality: KR is faster but blows up on sparse/low-depth/high-resolution matrices; ICE is the robust default; SCALE is the juicer-side answer when KR fails.
| Scenario | Recommended | Why |
|----------|-------------|-----|
| Need to balance a diploid map | cooler.balance_cooler(cis_only=True, store=True) (ICE) | robust default; cis-only is the analysis convention |
| KR failed to converge (sparse/high-res) | fall back to ICE, or SCALE on the juicer side | same fixed point; ICE/SCALE are the robust solvers |
| Tumor / aneuploid genome, 3D structure | CNV-aware LOIC/CAIC (Servant 2018) | plain ICE erases real copy-number |
| CNV / SV calling from Hi-C | RAW counts (no balancing) | coverage is the signal -> copy-number |
| Compartments at 100kb-1Mb | balance -> expected_cis -> O/E -> Pearson -> eigenvector | O/E removes P(s) so the plaid is visible -> compartment-analysis |
| Focal loops / TADs | balance -> expected_cis -> O/E | local enrichment needs the distance background removed -> loop-calling, tad-detection |
| P(s) / polymer-state diagnostic | expected_cis(smooth=True) -> log-derivative | the derivative reads out loop-extrusion machinery |
| Imported a juicer KR/VC weight column | check divisive_weights before applying | cooler weights are multiplicative, juicer's are divisive |
| Compare two conditions | downsample to equal depth, then -> hic-differential | balancing is within-matrix, not a cross-sample normalizer |
| Bare .mcool passed and KeyError | use file.mcool::/resolutions/<bp> URI | .mcool is a container of resolutions |
Goal: Remove one-dimensional per-bin coverage bias so every bin has equal genome-wide visibility within this single map.
Approach: Mask low-coverage and blacklisted bins FIRST (mad_max on log-marginals + explicit blacklist of centromere/rDNA/unmappable), drop the first two diagonals (ligation chemistry, not 3D contact), then run cis-only ICE; the multiplicative weight vector is stored in the weight column.
import cooler
clr = cooler.Cooler('matrix.mcool::/resolutions/10000')
bias, stats = cooler.balance_cooler(clr, cis_only=True, mad_max=5, ignore_diags=2, blacklist=None, store=True)
print('converged:', stats['converged'], 'scale:', stats['scale']) # stats also reports var, divisive_weights
clr = cooler.Cooler('matrix.mcool::/resolutions/10000') # re-open to see the stored weights
balanced = clr.matrix(balance=True).fetch('chr1') # raw[i,j] * w[i] * w[j]; masked bins -> NaN
cis_only=True is the convention for compartment/TAD/loop work -- trans signal is weak, noisy ambient ligation that pulls the bias estimates toward trans noise. ignore_diags=2 drops the main diagonal (self-ligation/dangling ends) and first off-diagonal (undigested/religated fragments): huge untrustworthy counts that would dominate the marginals. mad_max=5 filters bins whose log-marginal is >5 MAD below the median; without it a near-empty unmappable/centromeric bin gets a gigantic weight and ICE diverges. Masking (mad_max + blacklist) MUST precede balancing -- balancing cannot rescue a no-signal bin, it amplifies it.
CLI equivalent:
cooler balance --cis-only --mad-max 5 --ignore-diags 2 matrix.mcool::/resolutions/10000
Goal: Build the distance-decay background (the denominator for O/E) and the P(s) curve for diagnostics.
Approach: cis expected is a per-diagonal curve (one value per separation s -- this IS P(s)); trans expected is a single scalar per chromosome-pair block (trans contacts are ~distance-independent). cooltools enforces the split with two functions.
import cooltools
import bioframe
clr = cooler.Cooler('matrix.mcool::/resolutions/10000')
view_df = bioframe.make_viewframe(clr.chromsizes) # whole-chromosome regions; or arms for acrocentric genomes
cvd = cooltools.expected_cis(clr, view_df=view_df, smooth=True, aggregate_smoothed=True, ignore_diags=2)
# columns include: region1, region2, dist, dist_bp, n_valid, count.avg, balanced.avg, balanced.avg.smoothed.agg
trans_exp = cooltools.expected_trans(clr, view_df=view_df) # one balanced.avg per region1-region2 block
smooth=True smooths P(s) in log10(distance) space (smooth_sigma=0.1). Pre-0.7.0 cooltools errored on raw smoothing (clr_weight_name=None, smooth=True, issue #456); that was fixed in 0.7.0, and raw smoothing now returns count.avg.smoothed. Regardless of version, balance first: a raw expected still carries per-bin coverage bias, so it is not a clean P(s)/O/E denominator.
Goal: Divide out the polymer distance-decay so enrichment (loops, plaid) stands above the local background.
Approach: Map the per-diagonal cis expected (balanced.avg, keyed by dist) onto a dense balanced matrix by diagonal offset -- vectorized with numpy diagonal indexing, NOT an O(n^2) Python loop.
import numpy as np
def oe_matrix(clr, region, cvd):
obs = clr.matrix(balance=True).fetch(region)
chrom = region.split(':')[0] if isinstance(region, str) else region[0]
exp = cvd[cvd['region1'] == chrom].set_index('dist')['balanced.avg']
exp_by_dist = exp.reindex(range(obs.shape[0])).to_numpy() # one expected per separation s
expected = exp_by_dist[np.abs(np.subtract.outer(np.arange(obs.shape[0]), np.arange(obs.shape[0])))]
return obs / expected # NaN where expected is NaN (masked diags)
oe = oe_matrix(clr, 'chr1', cvd)
log_oe = np.log2(oe) # symmetric around 0 for display
cooltools also ships cooltools.lib.numutils.observed_over_expected(matrix, mask) (returns a 4-tuple (OE, dist_bins, sum_pixels, n_pixels)) for a self-contained dense O/E without a precomputed cvd.
Goal: Read out chromatin polymer state and loop-extrusion machinery from the shape of the contact-decay curve.
Approach: Take the slope of P(s) in log-log space; a reference slope near -1 over 0.1-1 Mb is the crumpled-globule background, and a loop-extrusion bump (~100kb interphase) appears as a peak in the derivative. Smooth in logspace FIRST or the derivative is pure noise.
agg = cvd[(cvd['region1'] == cvd['region2']) & (cvd['dist'] > 0)].drop_duplicates('dist_bp')
slope = np.gradient(np.log(agg['balanced.avg.smoothed.agg']), np.log(agg['dist_bp']))
# slope ~ -1 over 0.1-1 Mb; a bump toward 0 near ~100kb flags cohesin loop extrusion (flattens on WAPL/RAD21 loss)
The achievable resolution is a function of depth and genome size, not a free choice. Rule of thumb: a bin needs ~1000 contacts to be reliably populated, and the number of bins scales as N^2 with the number of genomic bins -- so halving bin size quarters per-bin coverage. Coarse features are cheap, focal features are expensive:
| Feature | Resolution | Approximate depth (human) | Why | |---------|-----------|---------------------------|-----| | A/B compartments | 100kb-1Mb | tens of millions of valid pairs | chromosome-scale, few large bins -> cheap | | TADs / insulation | 10-50kb | hundreds of millions | sub-Mb domains; window 5-25x the bin | | Loops / dots | <=10kb | billions (Rao 2014 in-situ Hi-C ~ billions) | focal kb-scale pixels; coarse bins blur anchors |
Calling 10kb loops from a shallow library binned at 50kb is not a resolution choice -- there is no signal there. Choose the finest resolution where median per-bin contacts stay near ~1000.
Trigger: balancing two libraries then log2-ratioing/subtracting. Mechanism: balancing is within-matrix; balanced magnitude still scales with depth and rescale_marginals makes it arbitrary. Symptom: systematic genome-wide "differences" that track sequencing depth. Fix: downsample to equal valid pairs, compare O/E, use a replicate-aware tool -> hic-differential.
Trigger: balance_cooler on a genome with large copy-number swings. Mechanism: equal-visibility forces equal marginals, erasing the real ~CN-fold coverage. Symptom: high-copy regions look cis-depleted/trans-enriched (Servant 2018); CNV vanishes. Fix: raw counts for CNV calling; LOIC/CAIC for 3D structure.
Trigger: low-coverage centromere/rDNA/unmappable bins left in before ICE. Mechanism: a near-empty bin gets a gigantic bias weight. Symptom: ICE fails to converge, or stripe artifacts radiate from a few bins. Fix: set mad_max and pass blacklist; masking precedes balancing.
Trigger: compartment calling skips the expected/O/E step. Mechanism: the distance-decay dominates the balanced cis map. Symptom: top eigenvector is the P(s) curve, not the A/B plaid. Fix: divide by expected_cis (O/E) before correlating/eigendecomposing.
Trigger: applying an imported juicer KR/VC vector as if it were a cooler weight. Mechanism: cooler weights are multiplicative (rawww), juicer's are divisive (raw/w/w). Symptom: correction inverts -- high-bias bins get MORE extreme; nothing errors. Fix: check the weight column's divisive_weights attribute before applying.
Trigger: computing P(s)/O/E from expected_cis(clr_weight_name=None). Mechanism: raw expected still carries per-bin coverage bias, so dividing by it does not cleanly remove the polymer background; pre-0.7.0 cooltools additionally errored on smooth=True with raw (issue #456, fixed in 0.7.0). Symptom: O/E still shows coverage stripes; on old cooltools, an error demanding balanced data. Fix: balance first, then expected_cis on the balanced weight column.
| Threshold | Source | Rationale |
|-----------|--------|-----------|
| ignore_diags=2 | cooler default; ICE convention | drops self-ligation/dangling (diag 0) + undigested/religated (diag 1); ligation chemistry, not 3D contact |
| mad_max=5 | cooler default | drops bins >5 MAD below median LOG-marginal; near-empty bins otherwise get exploding weights |
| min_nnz=10 | cooler default | <10 nonzero pixels per row is too sparse to estimate a bias reliably |
| tol=1e-5, max_iters=200 | cooler defaults | convergence = variance of balanced marginals < tol; non-convergence usually = a masking problem, not a tol problem |
| smooth_sigma=0.1 | cooltools default | Gaussian std in log10(distance) units for P(s) smoothing |
| ~1000 contacts/bin | depth-budget convention | per-bin coverage floor for a reliably populated bin; bins scale ~N^2 |
| Compartment res 100kb-1Mb | compartment scale | finer bins mix in TAD/loop structure |
| TAD/insulation res 10-50kb | domain scale | sub-Mb domains; window 5-25x the bin |
| Loop res <=10kb (needs ~billions of pairs) | Rao 2014 in-situ Hi-C | focal kb-scale; coarse bins blur loop anchors |
| P(s) slope ~ -1 over 0.1-1 Mb | crumpled/fractal globule | reference background; deviations/derivative read out polymer state |
| Error / symptom | Cause | Solution |
|-----------------|-------|----------|
| clr.matrix(balance=True) all NaN | cooler not balanced | run cooler.balance_cooler(..., store=True) first |
| Empty / wrong-resolution result on .mcool | bare .mcool passed | use file.mcool::/resolutions/<bp> URI |
| ICE not converging | low-coverage bins not masked | raise/set mad_max, pass blacklist; do not just raise max_iters |
| expected_cis(smooth=True) errors on raw (pre-0.7.0 only) | clr_weight_name=None on old cooltools (issue #456, fixed 0.7.0) | upgrade to cooltools 0.7+, or balance first / smooth=False |
| O/E enrichment inverted on a tumor | ICE applied to an aneuploid | use raw counts / LOIC/CAIC; equal-visibility is violated |
| Imported weight makes bias worse | divisive juicer weight applied as multiplicative | check divisive_weights; reciprocate if needed |
| Cross-condition "difference" tracks depth | subtracting balanced matrices | downsample + O/E + replicate-aware test -> hic-differential |
tools
--- name: bio-phasing-imputation-foundations description: Frames the phasing/imputation pipeline before any tool runs: phasing and imputation are one Li-Stephens copying HMM (recombination is the transition, mutation the emission, the genetic map and Ne set the rates), imputation's honest output is a dosage with a self-estimated quality (INFO/R2/DR2) not a hard genotype, and the stages are ordered and each fails silently (QC, align build and strand to the panel, phase, impute per chromosome, fil
tools
Chooses the enrichment generation before any tool runs, mapping the input shape to a method class - a pre-selected gene list plus a background to over-representation analysis (ORA, hypergeometric), a ranked statistic for all genes to gene set enrichment (GSEA), a signed signaling topology to pathway-topology (SPIA) - then making the null explicit (competitive vs self-contained, gene vs subject sampling) and running a trustworthiness checklist (testable-gene universe, FDR, redundancy collapse, leading-edge check, version reporting). Covers why every clusterProfiler GSEA is the inter-gene-correlation-uncorrected competitive null, why the background not the gene list decides ORA significance, and why no method is universally best. Use when deciding ORA vs GSEA vs topology, which gene-set DB, whether a result is trustworthy, or which null a tool computes. For ORA see go-enrichment, GSEA see gsea, databases kegg-pathways/reactome-pathways/wikipathways; the ranking comes from differential-expression/de-results.
testing
End-to-end GWAS workflow from VCF to association results. Covers PLINK QC, population structure correction, and association testing for case-control or quantitative traits. Use when running genome-wide association studies.
development
Orchestrates the full path from differential expression results to redundancy-collapsed functional enrichment: choose ORA vs GSEA, convert gene IDs per method, run enrichGO/enrichKEGG/enrichPathway/enrichWP or gseGO/gseKEGG (clusterProfiler, ReactomePA, rWikiPathways), and visualize. Routes the ORA-vs-GSEA generation fork and the null/universe/reproducibility theory to pathway-analysis/enrichment-foundations. Use when a DESeq2/edgeR/limma result must become enriched GO terms, KEGG/Reactome/WikiPathways pathways, or a GSEA leading edge; when deciding whether a ranking exists for all genes (GSEA, named decreasing vector) or only a pre-selected list (ORA plus a defensible background universe); or when assembling DE-to-pathway end to end. The DE list and ranking statistic come from differential-expression/de-results; per-method nuance lives in the pathway-analysis skills.