genome-assembly/scaffolding/SKILL.md
Orders and orients assembled contigs into chromosome-scale scaffolds from long-range linking data, inserting N-gap spacers (adds no sequence). Covers Hi-C/Omni-C scaffolding (YaHS, SALSA2, 3D-DNA/Juicer), Hi-C read-mapping prerequisites (map each end separately, no mate rescue, dedup, enzyme-aware), reading the contact map for misjoins/inversions/false-duplications, manual curation in Juicebox/PretextView (the VGP/DToL standard), reference-guided scaffolding (RagTag) and its karyotype-erasure hazard, genetic-map (ALLMAPS) and Bionano optical-map integration, chimera-breaking before scaffolding, gap-filling, and telomere/contig-vs-scaffold-N50 QC (tidk). Use when turning contigs into chromosomes with Hi-C, integrating a linkage map or optical map, choosing a scaffolder by available linking data, or judging whether a chromosome-scale assembly is trustworthy.
npx skillsauth add GPTomics/bioSkills bio-genome-assembly-scaffoldingInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: YaHS 1.2+, SALSA2 2.3+, juicer_tools 1.22/2.0+, bwa 0.7.17+, chromap 0.2+, samtools 1.19+, RagTag 2.1+, tidk 0.2.3+, seqkit 2.6+.
Before using code patterns, verify installed versions match. If versions differ:
<tool> --version then <tool> --help to confirm flagshifiasm output filenames and YaHS/SALSA2 enzyme and resolution defaults have changed across releases; confirm output names and -e/-r behaviour against the installed version. The Hi-C-mapping flag set (bwa mem -5SP vs the Arima/VGP per-end pipeline) varies by vendor (Arima, Dovetail/Omni-C, Phase Genomics) - check the current mapping guide before pasting flags. If a command errors, introspect the tool and adapt rather than retrying.
"Turn my contigs into chromosomes" -> Order and orient contigs by long-range linking signal, insert N-gap spacers, then curate the contact map - scaffolding adds no sequence, so the output is a draft chromosome structure, not finished sequence.
yahs contigs.fa hic_to_contigs.bam (Hi-C default), run_pipeline.py -a contigs.fa -b sorted.bed -m yes (SALSA2), ragtag.py scaffold ref.fa contigs.fa (reference-guided), allmaps path (linkage maps)YaHS finishes in minutes and emits a file literally named _scaffolds_final.agp. It is not final. Hi-C scaffolding is a contact-frequency inference - it orders/orients contigs by the polymer-physics signal that contact frequency decays with 1D distance - and it is error-prone exactly at the joins. Every reference-grade pipeline (VGP, Darwin Tree of Life, Earth BioGenome) treats the automated AGP as a first pass that a human curator then corrects in PretextView or Juicebox: breaking misjoins, flipping inversions, removing false duplications, assigning chromosomes by eye. Howe 2021 quantified the hidden labor across 111 VGP/DToL assemblies: on average 221 interventions per Gb (67 breaks, 105 joins, 49 false-duplication removals). Three load-bearing consequences:
The contact map is the QC, not a decoration. A correct chromosome shows one bright diagonal with smooth off-diagonal decay; off-diagonal blocks, anti-diagonal "bowties," and bleeding between chromosomes are errors to inspect and break (see Reading the Contact Map). An assembly paper whose methods say "scaffolded with [tool]" but never mention curation/Juicebox/PretextView is shipping a draft - downgrade any claim that depends on large-scale order (synteny, fusions, structural variants).
Scaffold N50 is not contig N50, and conflating them is the classic reviewer catch. Scaffolding inserts runs of N (estimated, often arbitrary length) and adds zero sequence. A genome can post "scaffold N50 = 60 Mb, chromosome-scale!" while its contig N50 is 200 kb - the contiguity is borne by gaps, not finished sequence. Always report contig N50, scaffold N50, gap count, and total N bases separately.
Reordering cannot rescue a bad contig. Scaffolders take contigs as given; a chimeric contig drags the wrong sequence into the wrong chromosome. Chimeras must be broken before scaffolding (see Chimera-Breaking).
| Modality | How it links | Scale | Status | |----------|--------------|-------|--------| | Hi-C / Omni-C | proximity ligation; contact frequency ~ 1/(1D distance) | chromosome-scale | Dominant modern method. Omni-C is enzyme-free/sequence-agnostic; Arima uses fixed enzyme motifs | | Bionano optical maps (DLS) | labeled motif patterns aligned to in-silico contig maps | megabase | Resolves large SVs and bridges complex repeats; separate instrument + library (cost decision) | | 10x linked reads | barcoded short reads from one long molecule | ~10-100 kb | DISCONTINUED by 10x (2020). Legacy data only; Tigmint/ARCS still run, no new data | | Genetic / linkage maps | marker recombination order = chromosome order | chromosome-scale, low resolution | ALLMAPS integrates multiple maps; orthogonal validation of Hi-C | | Reference (homology) | align contigs to a related genome, copy its order | as good as synteny | RagTag. Fast, but imposes the reference's karyotype - see hazard below | | Long reads as linkers | k-mer pairs spanning junctions | read length | LINKS; mostly superseded by assembling with the long reads directly | | Mate-pair / jumping | large-insert paired short reads | 2-40 kb | Legacy (SSPACE); chimeric-insert artifacts; never recommend today |
| Available linking data | Use | Why |
|-------------|-----|-----|
| Hi-C/Omni-C, want fast contiguous default | YaHS | community default; fast, high N90, AGP + Juicebox outputs in one command |
| Hi-C, want graph-aware conservative joins + input-error correction | SALSA2 -m yes (-g graph.gfa) | iterative; breaks chimeric input; uses assembly graph to avoid orientation errors |
| Hi-C, interactive curation central to workflow | 3D-DNA + Juicer + Juicebox (JBAT) | the Aiden-lab .assembly <-> Juicebox round-trip is built around hand-editing |
| Hi-C, vertebrate/reference-grade | YaHS or SALSA2 -> PretextView/JBAT curation | VGP/DToL standard: automate then curate |
| Diploid, Hi-C, want both haplotypes as chromosomes | hifiasm --h1/--h2 to PHASE first, then YaHS to SCAFFOLD each haplotype | same Hi-C, two jobs - phasing picks the homolog, scaffolding orders along it (see Phasing vs Scaffolding) |
| Closely related reference, karyotype NOT a question | RagTag correct then scaffold | homology ordering in minutes - but never if karyotype is the biology |
| One or more genetic/linkage maps | ALLMAPS | integrates multiple maps; robust to marker errors; validates/anchors Hi-C |
| Bionano optical maps + sequence assembly | Bionano Solve hybrid scaffold (before Hi-C) | megabase maps bridge complex repeats and large SVs |
| 10x linked reads (legacy data) | Tigmint (break) -> ARCS/ARKS (link) | break misassemblies first; no new 10x data exists |
| Contigs not yet QC'd / haplotigs present | -> assembly-qc, purge_dups | scaffolding haplotigs strings them up as fake chromosomes |
| Hi-C contact map for TADs/loops, not chromosomes | -> hi-c-analysis/matrix-operations | different use of the same assay |
Modern reference-grade recipe (vertebrate/eukaryote): HiFi -> hifiasm contigs -> (optional Bionano hybrid scaffold) -> Hi-C scaffold (YaHS/SALSA2) -> manual curation in PretextView/JBAT -> gap-fill (TGS-GapCloser) -> QC (tidk telomeres, contact-map diagonal, contig-vs-scaffold N50).
Hi-C reads are not a normal paired-end library: the two ends come from different genomic loci ligated together, so standard PE proper-pair/insert-size logic mis-handles them.
# Map each end SEPARATELY with no mate rescue/pairing; -5 reports the 5' (junction) portion as primary.
bwa index contigs.fa
bwa mem -5SP -T0 -t 16 contigs.fa hic_R1.fq.gz hic_R2.fq.gz | \
samtools view -@ 8 -b - > aligned.bam
# Mandatory: mark/remove PCR + optical duplicates (Hi-C is duplicate-rich) before scaffolding.
samtools sort -@ 8 -n aligned.bam | samtools fixmate -m - - | \
samtools sort -@ 8 - | samtools markdup -@ 8 - hic_to_contigs.bam
samtools index hic_to_contigs.bam
-5SP): -5 = 5' portion of a chimeric junction read as primary; -S/-P skip mate rescue and pairing. The Arima/VGP per-end pipeline (bwa mem on R1 and R2 independently, filter_five_end.pl, two_read_bam_combiner.pl) is the higher-fidelity equivalent.-q, SALSA2 reads MAPQ) to drop repeat-ambiguous reads.GATC); the scaffolder uses -e to model legitimate read starts. Omni-C / DNase Hi-C is enzyme-free -> omit -e in YaHS, use -e DNASE in SALSA2.chromap --preset hic -r contigs.fa -x index -1 R1 -2 R2 ... aligns + dedups Hi-C ~10x faster; YaHS accepts its output.samtools faidx contigs.fa
# Enzyme: -e GATC (DpnII/Dovetail), -e GATC,GANTC (Arima 2-enzyme); OMIT -e for Omni-C/enzyme-free.
yahs -e GATC -o out contigs.fa hic_to_contigs.bam
# Outputs: out_scaffolds_final.agp + out_scaffolds_final.fa, intermediate out_rNN.agp, out.bin
# Contig error-correction is ON by default (--no-contig-ec to disable; usually a mistake to disable).
# Juicebox prep for curation (.hic + .assembly to load in JBAT):
# NOTE: `juicer` here is the small utility BUNDLED with YaHS (operates on the .bin), NOT Aiden-lab Juicer;
# `juicer_tools.jar` below IS the separate Aiden-lab jar. Do not confuse the two.
juicer pre -a -o out_JBAT out.bin out_scaffolds_final.agp contigs.fa.fai 2>tmp_assembly.log
java -Xmx48G -jar juicer_tools.jar pre out_JBAT.txt out_JBAT.hic <(cat tmp_assembly.log | grep PRE_C_SIZE | awk '{print $2" "$3}')
# After hand-editing in Juicebox, export out_JBAT.review.assembly, then:
juicer post -o out_curated out_JBAT.review.assembly out_JBAT.liftover.agp contigs.fa
YaHS also accepts a BED (bamToBed) or PA5/pairs input. The _scaffolds_final.agp is the editable, reviewable object - curation edits the AGP, then regenerates FASTA. --telo-motif lets YaHS use telomere signal during scaffolding.
bamToBed -i aligned.bam > alignment.bed
sort -k4 alignment.bed > sorted.bed # MUST sort by read name (column 4)
python run_pipeline.py -a contigs.fa -l contigs.fa.fai -b sorted.bed \
-e GATC -o salsa_out -m yes -g contigs_graph.gfa -p yes
# -e DNASE for Omni-C/enzyme-free; -m yes enables input-error (chimera) correction; -p yes writes AGP+FASTA per iteration
The signature -m yes step breaks input contigs where the Hi-C signal contradicts the assembler's join before scaffolding. Assembly headers must not contain :. -i sets iterations (default 3), -c min contig length (default 1000).
ragtag.py correct ref.fa contigs.fa # break query at putative misassemblies vs reference FIRST
ragtag.py scaffold ref.fa ragtag_output/ragtag.correct.fasta -t 16 # order/orient by homology (minimap2 default)
# Outputs ragtag.scaffold.agp + .fasta; inserts 100 bp N gaps by default; -C collapses unplaced into chr0.
The peril (load-bearing): RagTag orders contigs to match the reference, so by construction the output looks like the reference. Every real biological rearrangement that distinguishes the target organism - a chromosome fusion, fission, translocation, or inversion polymorphism - is silently erased and replaced by the reference's structure. Never reference-scaffold a genome whose karyotype or large-scale structure is a biological question - the pipeline "discovers" the reference's karyotype because it was imposed (this has happened in the literature). Acceptable uses: a structurally conserved close relative needing only a coordinate system for gene content; a scaffold hypothesis to then test/correct against Hi-C (ragtag.py merge -b hic.bam lets Hi-C arbitrate). On a genome with interesting structure the honest move is Hi-C + curation, full stop.
ALLMAPS (Tang 2015 Genome Biol 16:3) integrates one or more genetic maps to order/orient scaffolds: allmaps merge map1.csv map2.csv -o maps.bed then allmaps path maps.bed contigs.fa. It is the gold standard for validating chromosome assignment and for species where Hi-C is hard; weight multiple maps by quality. Bionano Solve hybrid scaffolding (no journal paper; cite the Bionano Solve Theory of Operation: Hybrid Scaffold technical document) aligns DLS optical maps to in-silico contig maps, resolves conflicts, and merges into hybrid scaffolds - run it before Hi-C on large/repetitive genomes where megabase maps bridge repeat arrays Hi-C fumbles. The decision is economic (separate instrument + library), not purely technical.
The map displays its own errors to a trained eye - render with PretextMap -> PretextView (interactive, emits a curated AGP), PretextSnapshot (static PNG), or Juicebox/JBAT.
-m yes and YaHS default contig-EC break chimeric input where Hi-C contradicts the join; Tigmint (Jackman 2018 BMC Bioinformatics 19:393) breaks before ARCS for linked reads; RagTag correct for the reference-based version. Disabling these to "preserve the assembly" is a common self-inflicted wound.--h1/--h2 during assembly). When handed "Hi-C, make chromosomes," the first question is haploid scaffolding or diploid phasing+scaffolding? - the tools and failure modes differ entirely.Trigger: publishing _scaffolds_final without curation. Mechanism: automated joins are inferences error-prone at junctions (~221 edits/Gb in VGP/DToL). Symptom: misjoins/inversions in the contact map; later-discovered wrong synteny/fusions. Fix: curate in PretextView/JBAT before any large-scale-order claim.
Trigger: reporting only scaffold N50. Mechanism: scaffolding adds N-gaps, not sequence. Symptom: spectacular scaffold N50 over a small contig N50 - contiguity is nitrogen. Fix: report both N50s, gap count, total N bases separately.
Trigger: RagTag on a non-model genome whose karyotype is studied. Mechanism: ordering to the reference imposes the reference's structure. Symptom: fusions/translocations/inversions silently erased; circular "discovery" of the reference karyotype. Fix: Hi-C + curation; use RagTag only as a hypothesis to test.
Trigger: scaffolding contigs as given, contig-EC off. Mechanism: a chimeric contig drags region B into region A's chromosome; every join to it is wrong. Symptom: misjoins the contact map blames on the scaffolder. Fix: keep SALSA2 -m yes / YaHS contig-EC on; Tigmint for linked reads.
Trigger: bwa mem without -5SP, no dedup, or proper-pair logic. Mechanism: the two ends are different loci; mate rescue and insert-size filtering corrupt placement. Symptom: sparse/noisy contact map, weak joins. Fix: map each end separately (-5SP or per-end pipeline), dedup, MAPQ filter, enzyme-aware.
Trigger: running a gap-closer over the whole assembly to "finish" it. Mechanism: a long read bridges into a different copy of the repeat the gap sits in. Symptom: filled length far from the gap estimate; locally clean, globally mis-assembled. Fix: gap-fill after curation, sanity-check lengths, report closed-vs-left.
| Threshold | Source | Rationale |
|-----------|--------|-----------|
| ~221 curation interventions/Gb (67 breaks, 105 joins, 49 false-dup) | Howe 2021 GigaScience | expect the automated draft to be substantially editable, not finished |
| Hi-C ~50-100M valid pairs (vertebrate); rule-of-thumb ~1x coverage per ~1 Mb | VGP practice (~approx) | too little -> weak/missing joins, sparse noisy map; depends on genome size/repeats |
| Scaffold N50 / contig N50 > ~10x | curator reflex | contiguity is gap-borne; flag genome as contiguous-on-paper but unfinished |
| T2T chromosome = telomere at BOTH ends + zero internal gaps | T2T convention | tidk both-end recovery with internal Ns = chromosome-scale but not finished |
| RagTag default gap = 100 bp N | Alonge 2022 | placeholder length is arbitrary; never a real distance estimate |
| MAPQ filter on Hi-C reads (e.g. YaHS -q 10) | mapping practice | drop repeat-ambiguous reads that create spurious joins |
| Error / symptom | Cause | Solution |
|-----------------|-------|----------|
| _scaffolds_final looks chromosome-scale but synteny is wrong | shipped uncurated draft | curate the contact map in PretextView/JBAT |
| Huge scaffold N50, small contig N50 | contiguity borne by N-gaps | report both; do not conflate |
| Reference-scaffolded genome "has" the reference's karyotype | RagTag imposed the structure | re-scaffold with Hi-C + curation |
| Sparse/noisy contact map, few joins | Hi-C mapped as normal PE / under-sequenced | remap with -5SP+dedup; add Hi-C coverage |
| YaHS makes spurious joins | enzyme/MAPQ misconfigured; chimeric contigs | set correct -e (omit for Omni-C), raise -q, keep contig-EC on |
| SALSA2 errors on input | : in FASTA headers or BED not name-sorted | rename headers; sort -k4 the BED |
| Filled gaps far from estimated length | gap-filler bridged wrong repeat copy | gap-fill post-curation; sanity-check lengths; report closed-vs-left |
--h1/--h2) before each is scaffoldeddevelopment
Find restriction enzyme cut sites in DNA sequences using Biopython Bio.Restriction. Search with single enzymes, batches of enzymes, or commercially available enzyme sets. Returns cut positions for linear or circular DNA. Use when finding restriction enzyme cut sites in sequences.
development
Create restriction maps showing enzyme cut positions on DNA sequences using Biopython Bio.Restriction. Visualize cut sites, calculate distances between sites, and generate text or graphical maps. Use when creating or analyzing restriction maps.
development
Analyze restriction digest fragments using Biopython Bio.Restriction. Predict fragment sizes, get fragment sequences, simulate gel electrophoresis patterns, and perform double digests. Use when analyzing restriction digest fragment patterns.
development
Select restriction enzymes by criteria using Biopython Bio.Restriction. Find enzymes that cut once, don't cut, produce specific overhangs, are commercially available, or have compatible ends for cloning. Use when selecting restriction enzymes for cloning or analysis.