hi-c-analysis/contact-pairs/SKILL.md
Turns Hi-C/Micro-C FASTQ into a deduplicated, filtered .pairs file with pairtools and decides whether the library worked. Covers the bwa mem -SP5M / bwa-mem2 / chromap --preset hic alignment idiom (mates mapped as independent single-end reads), pairtools parse vs parse2 and the walks-policy choice (5unique pairwise vs all for Pore-C/Micro-C concatemers), pair-type classification (keep UU and rescued UC), dedup (PCR vs optical/by-tile), select by pair_type/MAPQ/distance, restriction-fragment handling (restrict, Arima dual-enzyme, Micro-C/DNase fragment-free), and allele-specific phasing (pairtools phase to two coolers). The library-QC decision uses % long-range cis as the one-number quality metric, trans as the noise floor, orientation balance as fragment-map-free dangling-end/self-circle QC, and % duplicates as a complexity proxy. Use when processing Hi-C/Micro-C/Omni-C reads into pairs, judging library quality, handling multi-enzyme or restriction-agnostic protocols, or generating allele-specific contacts.
npx skillsauth add GPTomics/bioSkills bio-hi-c-analysis-contact-pairsInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: pairtools 1.1+, bwa 0.7.17+ (or bwa-mem2 2.2+), chromap 0.2+, samtools 1.19+, cooler 0.10+
Before using code patterns, verify installed versions match. If versions differ:
<tool> --version then <tool> --help to confirm flagspairtools defaults have shifted across releases (e.g. parse --max-molecule-size is 750 bp in 1.1.x; dedup --backend defaults to scipy). parse and parse2 report DIFFERENT positions by default - confirm --report-position before mixing outputs. If a command errors, introspect with pairtools <subcommand> --help and adapt rather than retrying.
"Turn my Hi-C reads into clean contacts and tell me if the library worked" -> Align mates independently through ligation junctions, classify and deduplicate pairs to a 5'-canonical .pairs file, then read the cis/trans and orientation statistics to decide library quality before any matrix is built.
bwa mem -SP5M ref.fa R1.fq R2.fq | pairtools parse -c chrom.sizes | pairtools sort | pairtools dedup | pairtools statsA Hi-C library's usable signal is not "reads sequenced" - it is the uniquely-mapped, deduplicated, long-range cis contacts. Everything between FASTQ and the matrix exists to strip a specific artifact of proximity-ligation chemistry, and the diagnostic ratios from pairtools stats reveal whether the experiment succeeded before any compute is spent binning it. Three load-bearing consequences:
% long-range cis is the one-number quality metric; trans is the noise floor. True crosslink-ligation contacts are overwhelmingly cis and distance-decaying. Random ligation between two unrelated molecules in solution is as likely to be trans as cis-far, so trans% is a direct readout of the spurious-ligation floor. A good in-situ human library runs cis>=1kb ~50-65%, inter-chromosomal <10%. But these numbers are genome-size dependent - a yeast or bacterial genome legitimately has higher expected trans (more inter-chromosomal volume per cis distance). Never apply a human trans threshold to a microbe.
The ligation junction lives INSIDE the read, so a local/split aligner aligning mates independently is required. A single read often sequences through a ligation junction (locus A | locus B within one read). An end-to-end aligner soft-clips or mis-maps it and the contact is lost. bwa mem -SP5M aligns R1 and R2 as independent single-end reads (-SP skips mate rescue and pairing - proper-pair logic would destroy every long-range and trans contact) and marks the 5'-most chimeric segment primary (-5, the anchor for pairtools' 5' convention). The chimera fraction rises with read length, so on 150bp PE and on Micro-C long reads this is a large, real chunk of contacts.
Keep UU AND rescued UC; selecting only UU silently discards every rescued ligation. A naive select pair_type=="UU" throws away the chimeric reads pairtools successfully reconstructed (UC = combined-unique) - a meaningful fraction on long reads. The 4DN standard keeps UU and UC.
| Aligner | Role | Hi-C invocation | When |
|---------|------|-----------------|------|
| bwa mem -SP5M | reference standard; local/split alignment reconstructs in-read junctions | bwa mem -SP5M -t N ref.fa R1 R2 | default; best inter-contig accuracy in benchmarks |
| bwa-mem2 | drop-in faster reimplementation, identical output, same flags | bwa-mem2 mem -SP5M -t N ref.fa R1 R2 | when speed matters and the larger index fits RAM |
| chromap --preset hic | ultrafast integrated align + dedup + pairs (4DN .pairs out) | chromap --preset hic -x idx -r ref.fa -1 R1 -2 R2 -o out.pairs | ~10x faster scans; trades fine walk-policy control for speed |
-SP5M, letter by letter: -S skip mate rescue; -P skip pairing (no proper-pair rescue); together -SP align mates as independent single-end reads. -5 mark the 5'-most split segment primary (anchors the 5' convention). -M is legacy compatibility only (secondary flag 256 vs supplementary 2048); pairtools handles either - never agonize over -M, never drop -SP5.
| Scenario | Recommended | Why |
|----------|-------------|-----|
| Standard in-situ/Omni-C, FASTQ -> matrix | bwa mem -SP5M -> parse (5unique) -> sort -> dedup -> stats | the 4DN/distiller default; restriction-agnostic |
| Fastest scan, control not critical | chromap --preset hic | integrated align+dedup+pairs ~10x faster |
| Multi-way contacts (Pore-C, MC-3C, Micro-C walks) | parse --walks-policy all or parse2 --expand | 5unique COLLAPSES concatemers to pairwise silently |
| Micro-C / DNase Hi-C | NO fragment map; do NOT apply a 1kb min-distance cut | sub-1kb (nucleosome ladder) is signal, not artifact |
| Arima / Hi-C 3.0 dual-enzyme, fragment-level | fragment file must encode BOTH motifs (restrict -f) | a single-enzyme digest file is silently wrong |
| Repeat-heavy genome / stringent loop anchors | raise parse --min-mapq 30 | reclassifies borderline reads M, drops repeat false anchors |
| Allele-specific / diploid folding | diploid ref + -XA suboptimal hits -> pairtools phase -> two coolers | needs the suboptimal-score gap to resolve haplotypes |
| Decide whether to sequence deeper | complexity curve from dup model / preseq lc_extrap | a bare dup% without depth is meaningless |
| Build the matrix from clean pairs | -> hic-data-io (cooler cload pairs) | binning happens after classification/dedup |
| Annotate boundary/anchor coordinates | -> genome-intervals/bed-file-basics | pairs are 1-based, half-open conventions differ |
bwa index ref.fa # or bwa-mem2 index ref.fa
bwa mem -SP5M -t 16 ref.fa R1.fq.gz R2.fq.gz | \ # -SP: independent SE; -5: 5' segment primary
samtools view -b -@ 8 - > aligned.bam
# chromap fast path (integrated align + dedup + 4DN pairs, no separate pairtools needed):
# chromap -i -r ref.fa -o idx && \
# chromap --preset hic -x idx -r ref.fa -1 R1.fq.gz -2 R2.fq.gz -o sample.pairs
# Parse alignments into a 5'-canonical .pairs. min-mapq 1 (default) is permissive: only MAPQ-0 is "multi".
pairtools parse -c chrom.sizes --walks-policy 5unique --min-mapq 1 \
--add-columns mapq --drop-sam aligned.bam | \
pairtools sort --nproc 8 | \ # block-sort; flips to upper-triangular (5'-canonical)
pairtools dedup --max-mismatch 3 --mark-dups \ # within 3bp on both sides = duplicate; tag DD
--output-stats sample.dedup.stats | \
pairtools select '(pair_type=="UU") or (pair_type=="UC")' \ # keep both-unique AND rescued chimeric
-o sample.valid.pairs.gz
Dedup MUST run on a flipped, 5'-canonical file (sort does the flip); on a non-canonical file dedup under-collapses and dup% reads falsely low. --max-molecule-size (750 bp in 1.1.x) governs single-ligation chimera rescue; --max-inter-align-gap (20 bp) sets when a coverage gap becomes a null alignment.
pairtools stats is the canonical readout. Read it as a funnel, not a single number.
pairtools stats --bytile-dups -o sample.stats.tsv sample.valid.pairs.gz
# Key fields: frac_dups; frac_cis; cis_1kb+/cis_20kb+; trans; pair_types; dist_freq orientation FF/FR/RF/RR.
--bytile-dups separates OPTICAL dups (patterned NovaSeq flowcells, same tile, adjacent coordinates) from PCR dups. Only the PCR fraction reflects library complexity; reading total dup% as over-amplification wrongly condemns a good library. A bare dup% without the depth it was measured at is meaningless - use the complexity/yield curve (preseq lc_extrap) to decide whether deeper sequencing buys uniques or duplicates.Apply the QC-derived distance cut without a fragment map:
pairtools select '(chrom1!=chrom2) or (abs(pos2-pos1) > 1000)' \ # keep trans + cis beyond the orientation-equalization distance
-o sample.filtered.pairs.gz sample.valid.pairs.gz
Modern pipelines SKIP fragment filtering on purpose - the distance cut + dedup + UU/UC filter + balancing absorb the residual, and a digest file is one genome-specific place to mis-specify the enzyme. pairtools restrict -f frags.bed is opt-in for: sub-kb / restriction-fragment-resolution maps, capture-Hi-C / 4C-style fragment analysis, and bench QC where the dangling-end fraction is the digestion-efficiency readout.
cooler digest --out frags.bed chrom.sizes ref.fa DpnII # single-enzyme: DpnII ^GATC
pairtools restrict -f frags.bed -o restricted.pairs.gz parsed.pairs.gz
Arima dual-enzyme has FOUR junction motifs (GATCGATC, GANTGATC, GANTANTC, GATCANTC); a single-enzyme (DpnII-only) digest file silently mis-assigns fragments. Micro-C / DNase Hi-C have NO fragment map (MNase/DNase cut sequence-nonspecifically) - any tool requiring a restriction file cannot process them, which is exactly why the restriction-agnostic pairtools path became the field default.
Goal: Resolve each contact to maternal vs paternal haplotype for allele-specific 3D folding.
Approach: Align to a diploid reference reporting suboptimal hits, so each read carries its best alignment on both homologs; pairtools phase reads the gap between the two best alignment scores to tag each side resolved-hap1 / resolved-hap2 / non-resolved / multi-mapper - the score gap is what separates a genuinely uninformative read from an actual repeat.
bcftools consensus -H 1 -f ref.fa phased.vcf.gz > hap1.fa # build a diploid (two-homolog) reference
bcftools consensus -H 2 -f ref.fa phased.vcf.gz > hap2.fa
bwa mem -SP5M ref_diploid.fa R1.fq R2.fq | \ # report suboptimal hits so both homologs are kept
pairtools parse -c chrom.sizes --add-columns XB,AS,XS | \
pairtools phase --phase-suffixes _hap1 _hap2 --tag-mode XB | \
pairtools sort | pairtools dedup -o phased.pairs.gz
# Then split into hap1/hap2/trans pairs and cload each into a SEPARATE cooler.
Trigger: plain bwa mem without -SP, or proper-pair logic. Mechanism: mate rescue forces the shotgun insert model on mates from different loci. Symptom: trans% collapses, compartments vanish, sparse map. Fix: bwa mem -SP5M, mates aligned independently.
-5Trigger: copying a pre-2016 -SP command lacking -5. Mechanism: the 5'-most chimeric segment is not primary, so pairtools picks an inconsistent anchor. Symptom: degraded flip/dedup, smeared loops. Fix: always -SP5M (or -SP5).
Trigger: Pore-C/MC-3C/Micro-C walks parsed with the default. Mechanism: 5unique reports only the two 5'-most alignments, collapsing concatemers to pairwise. Symptom: "we found few multi-contacts." Fix: --walks-policy all or parse2 --expand.
Trigger: combining a parse2 (.pairs, default outer/junction-anchored) file with a parse (5'-anchored) file. Mechanism: the two report positions by different conventions. Symptom: coordinates shift by the alignment length; dedup under-collapses; loops smear. Fix: one parser/convention per project; prefer plain parse if walks are not needed.
Trigger: select pair_type=="UU". Mechanism: discards rescued chimeric (UC) pairs. Symptom: lower valid-pair yield, especially on long reads. Fix: keep (pair_type=="UU") or (pair_type=="UC").
Trigger: dedup run before sort/flip, or on mixed parse/parse2 positions. Mechanism: duplicates are not in canonical coordinates. Symptom: dup% reads falsely low - library looks better than it is. Fix: sort (flips to 5'-canonical) before dedup.
Trigger: total dup% on a patterned flowcell taken as library complexity. Mechanism: optical (same-tile) dups inflate apparent PCR rate. Symptom: a good library condemned as over-amplified. Fix: --bytile-dups / --output-bytile-stats; judge complexity on the PCR fraction only.
Trigger: applying Hi-C's >1kb min-distance cut to Micro-C. Mechanism: Micro-C's nucleosome-ladder signal lives below 1kb. Symptom: the structure Micro-C exists to capture is erased. Fix: derive the cut from the orientation-vs-distance plot per library.
| Threshold | Source | Rationale |
|-----------|--------|-----------|
| cis>=1kb ~50-65% of nodup pairs (good in-situ human) | Dovetail/Arima QC guidance (~approx) | long-range cis is the signal; library- and genome-size dependent |
| inter-chromosomal (trans) <10% (clean), 20-30% acceptable | in-situ Hi-C practice | trans is the spurious-ligation floor; NEVER apply to small genomes |
| FF/FR/RF/RR -> ~25% each above ~1kb | random strand combination at true contacts | convergence is the fragment-map-free positive QC signal |
| parse --min-mapq 1 default, raise to 30 for stringency | pairtools default | 1 drops only MAPQ-0; 30 removes repeat-driven false anchors |
| dedup --max-mismatch 3 bp | pairtools default | tolerates mapping wobble; 0 over-splits, larger over-collapses complexity |
| parse --max-molecule-size 750 bp (1.1.x) | pairtools default | bound on single-ligation chimera rescue; revisit for unusual size selection |
| min-distance cut ~1kb (Hi-C), derive from orientation plot | orientation-equalization distance | the cut is protocol-specific; Micro-C signal is sub-1kb |
| Error / symptom | Cause | Solution |
|-----------------|-------|----------|
| trans% high, no compartments | aligned without -SP (proper-pair logic) | re-align bwa mem -SP5M |
| Few multi-way contacts on Pore-C/Micro-C | default --walks-policy 5unique collapsed walks | --walks-policy all / parse2 --expand |
| Loops smeared, dedup under-collapses | mixed parse/parse2 position conventions | one parser per project; sort before dedup |
| Valid-pair yield lower than expected | select kept only UU | keep UU and UC |
| dup% suspiciously low | dedup ran before sort/flip | sort to 5'-canonical first |
| dup% high on NovaSeq, complexity looks bad | optical dups counted as PCR | --bytile-dups; judge on PCR fraction |
| Micro-C structure disappears after filtering | fixed 1kb min-distance cut | derive cut from orientation-vs-distance |
| Fragment assignment wrong on Arima data | single-enzyme digest file | encode all four Arima junction motifs |
| phase resolves nothing | aligned to a haploid reference | diploid ref + suboptimal (-XA) alignments |
-SP5M idiomtools
--- name: bio-phasing-imputation-foundations description: Frames the phasing/imputation pipeline before any tool runs: phasing and imputation are one Li-Stephens copying HMM (recombination is the transition, mutation the emission, the genetic map and Ne set the rates), imputation's honest output is a dosage with a self-estimated quality (INFO/R2/DR2) not a hard genotype, and the stages are ordered and each fails silently (QC, align build and strand to the panel, phase, impute per chromosome, fil
tools
Chooses the enrichment generation before any tool runs, mapping the input shape to a method class - a pre-selected gene list plus a background to over-representation analysis (ORA, hypergeometric), a ranked statistic for all genes to gene set enrichment (GSEA), a signed signaling topology to pathway-topology (SPIA) - then making the null explicit (competitive vs self-contained, gene vs subject sampling) and running a trustworthiness checklist (testable-gene universe, FDR, redundancy collapse, leading-edge check, version reporting). Covers why every clusterProfiler GSEA is the inter-gene-correlation-uncorrected competitive null, why the background not the gene list decides ORA significance, and why no method is universally best. Use when deciding ORA vs GSEA vs topology, which gene-set DB, whether a result is trustworthy, or which null a tool computes. For ORA see go-enrichment, GSEA see gsea, databases kegg-pathways/reactome-pathways/wikipathways; the ranking comes from differential-expression/de-results.
testing
End-to-end GWAS workflow from VCF to association results. Covers PLINK QC, population structure correction, and association testing for case-control or quantitative traits. Use when running genome-wide association studies.
development
Orchestrates the full path from differential expression results to redundancy-collapsed functional enrichment: choose ORA vs GSEA, convert gene IDs per method, run enrichGO/enrichKEGG/enrichPathway/enrichWP or gseGO/gseKEGG (clusterProfiler, ReactomePA, rWikiPathways), and visualize. Routes the ORA-vs-GSEA generation fork and the null/universe/reproducibility theory to pathway-analysis/enrichment-foundations. Use when a DESeq2/edgeR/limma result must become enriched GO terms, KEGG/Reactome/WikiPathways pathways, or a GSEA leading edge; when deciding whether a ranking exists for all genes (GSEA, named decreasing vector) or only a pre-selected list (ORA plus a defensible background universe); or when assembling DE-to-pathway end to end. The DE list and ranking statistic come from differential-expression/de-results; per-method nuance lives in the pathway-analysis skills.