genome-assembly/assembly-polishing/SKILL.md
Decides whether and how to polish a draft genome assembly to raise consensus accuracy (QV) with read-type-matched tools - Racon and medaka (ONT consensus), dorado polish, Polypolish and pypolca (Illumina, repeat-aware), Pilon (legacy short-read), NextPolish/NextPolish2, Hapo-G (haplotype-aware), ntEdit, and DeepPolisher/PEPPER-Margin-DeepVariant for human. Covers the do-not-polish-HiFi rule, the medaka basecaller-model footgun, held-out Merqury QV as the only honest stop signal, and the haplotype-collapse trap. Use when correcting homopolymer indels or residual SNPs in a long-read assembly, deciding if a HiFi assembly needs polishing, or choosing an ONT vs hybrid vs short-read polishing chain.
npx skillsauth add GPTomics/bioSkills bio-genome-assembly-assembly-polishingInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: Racon 1.5+, medaka 2.0+, minimap2 2.26+, bwa 0.7.17+, samtools 1.19+, Polypolish 0.6+, pypolca 0.3+, Pilon 1.24+, Merqury 1.3+, meryl 1.4+.
Before using code patterns, verify installed versions match. If versions differ:
<tool> --version then <tool> --help to confirm flagsThe medaka consensus model matters more than the medaka binary version: the model must match the basecaller + pore chemistry + caller mode + version (-m r1041_e82_400bps_sup_v5.0.0); a mismatched model silently degrades the consensus. List options with medaka tools list_models. medaka's status is a moving target - ONT now steers toward dorado polish and has deprecated medaka's diploid-variant workflow in favour of Clair3; verify current guidance and the medaka_consensus wrapper vs newer subcommands. If code throws an error, introspect the installed tool and adapt rather than retrying.
"Polish my genome assembly" -> Decide whether base-level consensus correction is warranted, then apply a read-type-matched polisher and measure the gain with held-out k-mers - or, for an already-accurate assembly, decline.
racon reads.fq aln.sam draft.fa then medaka_consensus -i reads.fq -d draft.fa -o out -m <model> (ONT); polypolish polish draft.fa f1.sam f2.sam + pypolca run (hybrid/short); merqury.sh reads.meryl asm.fa out (measure QV before/after)Polishing exists to fix the characteristic error of noisy long reads: indels in homopolymers and low-complexity tracts (the pore/RT stutters on AAAAAA) plus residual substitutions. As read accuracy rose (PacBio HiFi ~Q40-50 reads, ONT R10.4.1 + Dorado sup/duplex), the assembly consensus is already near-perfect, and a mapping-based polisher's read-pileup step injects more errors than it removes. Two load-bearing rules dominate:
Polishing is transitional infrastructure: indispensable in the CLR/early-ONT error era, shrinking toward irrelevance as raw accuracy climbs upstream (better basecalling, hifiasm's own consensus). The skill's job is to say when not to polish as firmly as how to.
| Tool | Citation | Role | When |
|------|----------|------|------|
| Racon | Vaser 2017 Genome Res | POA read-to-assembly consensus; fast workhorse | first-pass ONT/CLR self-polish (1-4 rounds); needs minimap2 SAM/PAF |
| medaka | nanoporetech/medaka (software) | ONT neural-network consensus | one pass after Racon; model MUST match basecaller |
| dorado polish | ONT/dorado (software) | modern ONT consensus, owns basecalling context | the emerging medaka replacement; verify current status |
| Polypolish | Wick & Holt 2022 PLoS Comput Biol | Illumina polish using ALL alignments (bwa mem -a) | repeat-rich long-read assemblies; best-in-class short-read polish |
| pypolca | Bouras 2024 Microb Genom (POLCA: Zimin 2020) | fast Illumina SNP/indel correction | pair with Polypolish (modern bacterial practice) |
| Pilon | Walker 2014 PLoS ONE | all-in-one SNP/indel/local short-read fix | legacy; small genomes only; mismaps in repeats, ~1 GB heap/Mb |
| NextPolish / NextPolish2 | Hu 2020 Bioinformatics / Hu 2024 GPB | iterative LR+SR polish / HiFi repeat+phase-aware | NextPolish2 is the HiFi-safe successor |
| Hapo-G | Aury & Istace 2021 NARGAB | haplotype-aware short-read polish | heterozygous diploids; preserves both alleles |
| ntEdit | Warren 2019 Bioinformatics | Bloom-filter k-mer polish, no mapping | very large (>3 Gb) genomes; scales where alignment can't |
| DeepPolisher | Mastoras 2025 Genome Res | transformer correction (PHARAOH ONT phasing) | HiFi/T2T human; halves errors without overcorrection |
| PEPPER-Margin-DeepVariant | Shafin 2021 Nat Methods | haplotype-aware ONT consensus via variant calling | human ONT-only polishing |
| DeepConsensus | Baid 2023 Nat Biotechnol | improves CCS reads UPSTREAM of assembly | NOT a polisher - operates before assembly (common category error) |
| Scenario (data type) | Recommended | Why |
|----------------------|-------------|-----|
| PacBio HiFi only | Do NOT polish by default -> check Merqury QV first | already ~Q40-50; Racon/Pilon lower QV + collapse haplotypes |
| HiFi with a real Merqury QV deficit | NextPolish2 or DeepPolisher | repeat/phase-aware; won't overcorrect het sites |
| ONT-only, noisy (R9 / R10 hac) | Racon (1-4 rounds) -> medaka (1 pass), or dorado polish | canonical ONT chain; neural consensus on the ONT error model |
| ONT-only, human/diploid | PEPPER-Margin-DeepVariant or DeepPolisher | haplotype-aware; preserves both alleles |
| Long-read draft + Illumina (hybrid) | long-read pass, then Polypolish + pypolca | SR fixes residual homopolymer indels Racon/medaka missed |
| Short-read-only assembly (SPAdes) | usually no polishing needed | Illumina is already ~Q40+; structural sins survive anyway |
| HiFi + Illumina | use Illumina k-mers for evaluation, not correction | Merqury hybrid QV, not a polishing pass |
| Heterozygous / repeat-rich genome | Polypolish (-a), Hapo-G, NextPolish2 - or none | non-haplotype-aware polishers erase het / homogenize paralogs |
| Reads not yet QC'd / wrong basecaller | -> read-qc/quality-reports, -> long-read-sequencing/long-read-alignment | garbage-in or model-mismatch makes polishing worse than skipping |
| Complaint is "fragmented" not "low QV" | -> not polishing (scaffolding/gap-filling) | polishing fixes bases, not contiguity |
| Map reads before polishing | short -> read-alignment/bwa-alignment; long -> long-read-sequencing/long-read-alignment | polishers consume a BAM/SAM/PAF, not raw reads |
Racon does the bulk cheap consensus from the long reads themselves; it is a standalone consensus module and does NOT map - the SAM/PAF is supplied. Re-map every round (1-4 rounds; gains decay fast).
minimap2 -t 16 -ax map-ont draft.fasta reads.fq.gz > aln.sam # map-pb for PacBio CLR
racon -t 16 reads.fq.gz aln.sam draft.fasta > racon1.fasta
# re-map reads to racon1.fasta and repeat for round 2... measure QV each round, stop at plateau
medaka applies an ONT-trained neural model for the final consensus - exactly ONE pass after the Racon rounds (it is not an iterate-many-times tool; running medaka twice is a tell). For medaka mechanics and model tables see long-read-sequencing/medaka-polishing; this skill owns the strategy.
medaka_consensus -i reads.fq.gz -d racon_final.fasta -o medaka_out -t 16 \
-m r1041_e82_400bps_sup_v5.0.0 # model MUST match basecaller+chemistry+caller+version
Recent basecallers embed the model in the FASTQ so medaka auto-selects; if the data was basecalled with an old/unknown caller, pick manually from medaka tools list_models, and treat a stale/deprecated model name as a reason to rebasecall rather than proceed.
Polypolish aligns short reads to all locations (bwa mem -a) so it can disambiguate which repeat copy a read belongs to instead of forcing one placement - this is how it beats Pilon in repeats and why it almost never introduces errors.
bwa index draft.fasta
bwa mem -t 16 -a draft.fasta reads_1.fq.gz > aln_1.sam # -a = ALL alignments (the whole point)
bwa mem -t 16 -a draft.fasta reads_2.fq.gz > aln_2.sam
polypolish filter --in1 aln_1.sam --in2 aln_2.sam --out1 filt_1.sam --out2 filt_2.sam
polypolish polish draft.fasta filt_1.sam filt_2.sam > polypolish.fasta # --careful (v0.6+) for low depth
pypolca run -a polypolish.fasta -1 reads_1.fq.gz -2 reads_2.fq.gz -o pypolca_out -t 16 --careful
Bouras 2024 depth-tiered bacterial recommendation: depth <5x -> Polypolish --careful alone; 5-25x -> Polypolish --careful + pypolca --careful; >25x -> Polypolish (default) + pypolca --careful. Modern bacterial best practice is Polypolish + pypolca, NOT Pilon.
bwa mem -t 16 draft.fasta r1.fq r2.fq | samtools sort -o frags.bam
samtools index frags.bam
java -Xmx16G -jar pilon.jar --genome draft.fasta --frags frags.bam --output pilon --fix all --changes
--fix modes: snps, indels, bases (=snps+indels), gaps, local, all (default), none. Pilon needs a sorted+indexed BAM (route mapping to read-alignment/bwa-alignment). It is legacy: it uses best-placement alignments so it mismaps in repeats (miscorrects toward paralogs), and its ~1 GB heap per Mb of genome OOMs on large eukaryotes - both reasons Polypolish/pypolca displaced it.
Goal: Decide whether a polish actually helped, without fooling yourself.
Approach: Build a meryl k-mer DB from an independent / different-platform read set (k from Merqury's best_k.sh, not hardcoded), then run Merqury on the pre- and post-polish assemblies and compare QV. A polish that does not raise QV did not help; one that lowers it must be reverted.
K=$(sh $MERQURY/best_k.sh 5000000 | tail -n1 | awk '{print int($1+0.5)}') # K from genome size; round float->int
meryl count k=$K output reads.meryl illumina_reads.fq.gz # eval reads != polishing reads
merqury.sh reads.meryl draft.fasta qv_before # QV of the input
merqury.sh reads.meryl polished.fasta qv_after # QV after polishing
Use a held-out set or a different platform than was polished with (polish with ONT, evaluate with Illumina k-mers); the CHM13 effort triangulated with both HiFi and Illumina k-mers (Mc Cartney 2022). Do NOT use BUSCO as the polishing metric - it measures gene-space completeness and barely moves with the homopolymer-indel QV that polishing changes, so a flat BUSCO masks a silent QV drop. A per-gene internal-stop / frameshift count is a useful secondary readout. See assembly-qc for the full QV/spectra-cn workflow.
Trigger: specifying or defaulting to a model that does not match the actual basecaller+chemistry+version. Mechanism: the network applies corrections calibrated for errors that aren't there and misses the ones that are. Symptom: medaka completes with no warning, but Merqury QV is lower than the input. Fix: let medaka auto-detect from the FASTQ; if manual, derive the model from the real caller and confirm in medaka tools list_models; treat a stale model as a reason to rebasecall.
Trigger: running Racon/Pilon/GCpp on a hifiasm assembly. Mechanism: at het sites the pileup mixes both true alleles; the polisher overwrites toward the majority and homogenizes near-identical paralogs. Symptom: lost heterozygosity, haplotype-switch errors, QV unchanged or down. Fix: don't; if Merqury shows a real deficit, use NextPolish2/DeepPolisher only.
Trigger: measuring QV with the same read set used to polish. Mechanism: the polisher already maximized agreement with those reads. Symptom: every polish "improves QV," monotonically. Fix: evaluate with held-out / different-platform k-mers.
Trigger: "a few more rounds to be safe." Mechanism: once resolvable errors are fixed, extra rounds flip correct bases at het/repeat sites. Symptom: "N changes made" keeps reporting while QV oscillates or falls. Fix: track Merqury QV per round; stop at plateau. Treat a large change count on an accurate assembly as a risk signal, not success.
Trigger: Pilon on a repeat-rich genome. Mechanism: best-placement alignment forces a read onto one repeat copy; Pilon corrects that copy toward the wrong one. Symptom: repeat copies homogenized, paralog differences erased. Fix: Polypolish (bwa mem -a, uses all alignments).
Trigger: running DeepConsensus on the assembly. Mechanism: DeepConsensus improves CCS reads upstream, before assembly. Symptom: tool expects subreads/CCS, not a FASTA. Fix: run it in the read-prep stage; for assembly polishing use a post-assembly tool.
| Threshold | Source | Rationale |
|-----------|--------|-----------|
| HiFi assembly ~Q40-50 already | HiFi read accuracy | starting point too high for mapping-based polishing to help |
| Racon 1-4 rounds, stop at QV plateau | Vaser 2017 + practice | gains decay after ~1-2; extra rounds flip correct bases |
| medaka exactly 1 pass after Racon | medaka design | trained-model pass, not an iterative tool |
| Polypolish --careful <5x; +pypolca 5-25x; default >25x | Bouras 2024 Microb Genom | depth-tiered to avoid false-positive repeat edits |
| QV40 (~1 err/10 kb) "reference-grade"; Q50 (~1/100 kb) modern aspiration; CHM13 ~Q73 | field convention; Mc Cartney 2022 | QV = -10*log10(error rate); report measured QV + method |
| Merqury k from best_k.sh (k=21 human-scale) | Rhie 2020 Genome Biol | wrong k silently degrades QV/completeness |
| Pilon ~1 GB heap per Mb of genome | Walker 2014 | OOM risk on large eukaryotes; set -Xmx |
| Error / symptom | Cause | Solution |
|-----------------|-------|----------|
| QV drops after polishing a HiFi assembly | over-polishing an already-accurate assembly | stop; HiFi rarely needs short-read polish |
| medaka output worse than input, no warning | wrong/stale basecaller model | auto-detect or match model exactly; else rebasecall |
| Every polishing round "improves" QV | measured with the polishing reads (circular) | evaluate with held-out / different-platform k-mers |
| Lost heterozygosity / switch errors | non-haplotype-aware polisher on a diploid | Hapo-G / NextPolish2 / Polypolish -a, or skip |
| Repeat copies homogenized | Pilon best-placement mismapping | Polypolish with bwa mem -a |
| Pilon OOM / crash on large genome | ~1 GB/Mb heap blowup | raise -Xmx; prefer Polypolish/ntEdit |
| Polishing didn't fix fragmentation | wrong operation | fragmentation is scaffolding/gap-filling, not polishing |
testing
Analyze multi-modal single-cell data (CITE-seq, Multiome, spatial). Use when working with data that measures multiple modalities per cell like RNA + protein or RNA + ATAC. Use when analyzing CITE-seq, Multiome, or other multi-modal single-cell data.
data-ai
Analyze metabolite-mediated cell-cell communication using MeboCost for metabolic signaling inference between cell types. Predict metabolite secretion and sensing patterns from scRNA-seq data. Use when studying metabolic crosstalk between cell populations or metabolite-receptor interactions.
development
Find marker genes and annotate cell types in single-cell RNA-seq using Seurat (R) and Scanpy (Python). Use for differential expression between clusters, identifying cluster-specific markers, scoring gene sets, and assigning cell type labels. Use when finding marker genes and annotating clusters.
development
Reconstruct cell lineage trees from CRISPR barcode tracing or mitochondrial mutations. Use when studying clonal dynamics, cell fate decisions, or developmental trajectories.