long-read-sequencing/basecalling/SKILL.md
Basecalls raw Oxford Nanopore signal (POD5/FAST5) into reads with Dorado, choosing the chemistry-matched model and accuracy tier (fast/hac/sup), requesting modified bases (5mCG_5hmCG, 6mA, m6A) at basecall time, and handling duplex, demultiplexing, trimming, and HERRO read correction. Covers why the model+version is an irreversible analysis decision, why methylation cannot be recovered later, and why downstream polish/variant models must match the basecaller. Use when converting POD5/FAST5 to reads, picking a Dorado model for R9/R10 or RNA004, enabling methylation calling, basecalling duplex, demultiplexing barcoded runs, or correcting reads for assembly.
npx skillsauth add GPTomics/bioSkills bio-long-read-sequencing-basecallingInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: Dorado 1.0+, pod5 0.3+, samtools 1.19+, chopper 0.7+.
Before using code patterns, verify installed versions match. If versions differ:
<tool> --version then <tool> --help to confirm flagsResults depend on inputs that outlive the binary version - record them:
[email protected]) sets the entire error profile and must be propagated to every downstream tool. Pin it.[email protected]_5mCG_5hmCG@v3); the mod version can lag the simplex version - check dorado download --list.If code throws an error, introspect the installed tool (dorado --help, dorado basecaller --help) and adapt the example to the actual API rather than retrying.
"Basecall my Nanopore data" -> Convert raw signal (POD5) into reads with Dorado using the chemistry-matched model, deciding the accuracy tier and whether to call modifications now - because the model choice is baked irreversibly into the output.
dorado basecaller sup pod5s/ > calls.bam (simplex), dorado basecaller sup,5mCG_5hmCG pod5s/ > calls.bam (with methylation), dorado duplex sup pod5s/ > duplex.bam (duplex)PacBio note: PacBio "basecalling" (CCS -> HiFi reads) runs on-instrument/in SMRT Link; users receive HiFi BAMs already at Q20-Q30+. This skill is Oxford Nanopore / Dorado. HiFi assembly lives in genome-assembly/hifi-assembly.
Basecalling is not fixed preprocessing that yields a neutral FASTQ. The model and version chosen are an analysis decision written permanently into the BAM, with three consequences a naive user misses:
sup,5mCG_5hmCG) and KEEP the POD5. See nanopore-methylation.r1041_e82_400bps_sup_v500; medaka the dotted r1041_e82_400bps_sup_v5.2.0). A mismatched model silently degrades accuracy with no error. Propagate the basecaller model name to every downstream step.Dorado (one GPU-first executable) replaced Guppy, which is end-of-life. Bonito is ONT's research/training basecaller (not production); Rerio hosts research-release models (niche mods, bacterial methylation).
| Subcommand | Purpose | Canonical invocation |
|------------|---------|----------------------|
| basecaller | simplex basecalling | dorado basecaller hac pod5s/ > calls.bam |
| duplex | template+complement duplex | dorado duplex sup pod5s/ > duplex.bam |
| demux | barcode classification/split | dorado demux --kit-name SQK-NBD114-24 --output-dir out/ calls.bam |
| trim | standalone adapter/primer trim | dorado trim reads.bam > trimmed.bam |
| aligner | minimap2 alignment (carries MM/ML) | dorado aligner ref.mmi reads.bam > aln.bam |
| correct | HERRO single-read correction | dorado correct reads.fastq > corrected.fasta |
| summary | sequencing-summary TSV from BAM | dorado summary calls.bam > summary.tsv |
| download | model management | dorado download --model <name> / --list |
Format {analyte}_{pore}_{chemistry}_{speed}@v{ver} + optional mod suffix, e.g. [email protected]_5mCG_5hmCG@v3.
| Token | Meaning | Examples |
|-------|---------|----------|
| analyte | molecule | dna, rna004 |
| pore | flow-cell generation | r10.4.1 (current), r9.4.1 (legacy) |
| chemistry | kit chemistry | e8.2 (Kit 14) |
| speed | translocation speed -> sampling rate | 400bps (5 kHz DNA), 130bps (RNA004, 4 kHz) |
| tier | model size/accuracy | fast, hac, sup |
| version | model version | @v4.3.0, @v5.2.0, @v6.0.0 |
Passing the bare tier (sup) lets Dorado auto-detect chemistry from POD5 metadata and fetch the matching latest model; pin a version ([email protected]) or a full path for reproducibility. Append mods comma-separated (sup,5mCG_5hmCG,6mA); only one mod model per canonical base may be active.
| Scenario | Recommended | Why |
|----------|-------------|-----|
| Any analysis (variant/assembly/methylation) | sup + matched model, pinned version | fast/hac error profile leaks into calls |
| Live run / adaptive sampling / quick QC only | fast | speed; never for downstream analysis |
| Routine work, compute-limited | hac | strong accuracy/compute balance (v5.2 closed much of the gap to sup) |
| Methylation wanted now or maybe later | sup,5mCG_5hmCG (DNA), keep POD5 | mods are unrecoverable from a plain BAM -> nanopore-methylation |
| Per-molecule accuracy, low input, phasing | dorado duplex sup | ~Q30 reads, but expect <10% duplex yield |
| Diploid/phased T2T assembly from simplex | dorado correct (HERRO) before assembler | haplotype-aware Q22->Q40 -> genome-assembly/long-read-assembly |
| Barcoded multiplexed run | basecall --no-trim, then dorado demux | trimming first strips barcodes before demux sees them |
| Legacy R9.4.1 / RNA002 data | explicit archived model path | removed from Dorado v1.0 default downloads |
| PacBio data | already HiFi; no Dorado step | CCS runs on-instrument -> genome-assembly/hifi-assembly |
# Simplex, super-accuracy, auto-detected chemistry-matched model (BAM is the default output)
dorado basecaller sup pod5s/ > calls.bam
# Pin the model version for reproducibility
dorado basecaller [email protected] pod5s/ > calls.bam
# Call methylation AT basecall time (CpG 5mC + 5hmC); KEEP pod5s/ - mods are unrecoverable later
dorado basecaller sup,5mCG_5hmCG pod5s/ > calls.bam
dorado basecaller sup,6mA pod5s/ > calls.bam # all-context 6mA
# RNA004 direct RNA (cDNA CANNOT call mods - PCR erases the signal):
dorado basecaller [email protected],m6A_DRACH pod5s/ > rna_mods.bam
# FASTQ output and a per-read quality floor (relative filter, not a calibrated accuracy)
dorado basecaller sup pod5s/ --emit-fastq --min-qscore 10 > calls.fastq
# Duplex (needs raw POD5; cannot be recovered from simplex FASTQ); dx tag marks read types
dorado duplex sup pod5s/ > duplex.bam
# Demultiplex: basecall WITHOUT trimming, then demux (demux trims barcodes itself)
dorado basecaller sup pod5s/ --no-trim > calls.bam
dorado demux --kit-name SQK-NBD114-24 --output-dir demux/ calls.bam
dorado demux --kit-name SQK-NBD114-24 --barcode-both-ends --output-dir demux/ calls.bam # stringent
# HERRO read correction for diploid/phased assembly (input FASTQ of HAC/SUP R10 reads >=10kb -> FASTA)
dorado download --model herro-v1
dorado correct reads.fastq > corrected.fasta
POD5 is ONT's default raw format (faster random access than FAST5). Convert FAST5 first:
pod5 convert fast5 raw/*.fast5 --output pod5s/ # FAST5 is legacy; basecalling it directly is slow
pod5 view pod5s/ # summary table (replaces deprecated `pod5 inspect reads`)
pod5 merge pod5s/*.pod5 --output merged.pod5
Trigger: basecalling without a mod model, then wanting 5mC later. Mechanism: Remora infers mods from raw signal at basecall time; a plain BAM has only bases. Symptom: no MM/ML tags; modkit pileup returns nothing. Fix: re-basecall from POD5 with sup,5mCG_5hmCG; keep POD5 archives.
Trigger: default --trim all basecall, then a separate dorado demux. Mechanism: trimming removes the barcode before demux can read it. Symptom: most reads in unclassified.bam, low classification rate. Fix: basecall --no-trim, then demux (it trims barcodes itself).
Trigger: polishing/calling with a medaka/Clair3 model that doesn't match the basecaller model+version. Mechanism: per-model neural weights expect a specific error profile. Symptom: no error, just quietly worse consensus/calls. Fix: propagate the basecaller model name; use medaka tools resolve_model --auto_model; pick the matching Clair3 model dir.
Trigger: treating every read in a duplex BAM as an independent molecule. Mechanism: a simplex parent and its duplex offspring both appear. Symptom: inflated coverage/allele counts. Fix: the dx:i:-1 tag marks simplex parents of duplex reads - filter them when counting molecules (dx:i:1 = duplex, dx:i:0 = simplex-only).
Trigger: runs basecalled with different model versions joined for analysis. Mechanism: version-specific identity/indel error profiles confound a technical batch with biology. Symptom: spurious between-run differences. Fix: re-basecall the whole cohort with one model version.
| Threshold | Source | Rationale |
|-----------|--------|-----------|
| sup for any analysis | ONT model guidance | fast/hac error profiles contaminate variant/assembly/methylation calls |
| R10.4.1 SUP modal accuracy ~Q20 (99%) | Sereika 2022 | dual-reader head fixes homopolymers; enables nanopore-only near-finished genomes |
| Duplex read ~Q30; yield typically <10% of reads | community benchmarks | duplex is library-prep/loading-limited, not free accuracy |
| A "Q20" base errs at ~Q12.5 empirically | Delahaye 2021 | nanopore qscores >Q10 are overconfident posteriors; use for relative filtering only |
| HERRO input reads >=10 kbp, HAC/SUP R10 | Dorado correct docs | HERRO operates on 4096-bp chunks; shorter reads dropped |
| --min-qscore 10 as a permissive QC floor | convention | Q10 ~ 90% nominal; a starting filter, not a hard rule |
| Error / symptom | Cause | Solution |
|-----------------|-------|----------|
| "Failed to determine sequencing chemistry from data" | R9/RNA002 or non-standard kit; bare tier can't auto-resolve | pass an explicit model path; for legacy chemistry use an archived model |
| No MM/ML tags in BAM | basecalled without a mod model | re-basecall from POD5 with sup,5mCG_5hmCG |
| Most reads unclassified after demux | trimmed before demux | basecall --no-trim, then demux |
| --model sup errors | model is the positional arg, not a flag | dorado basecaller sup pod5s/ |
| dorado correct reads.bam fails | input is FASTQ(.gz), output FASTA | dorado correct reads.fastq > corrected.fasta |
| Out of GPU memory | batch too large for VRAM (sup is heaviest) | lower --batchsize; or drop to hac |
| cDNA m6A calling returns nothing | PCR erased native modifications | use direct RNA (RNA004), not cDNA |
-y to carry MM/ML tags through alignmenttools
--- name: bio-phasing-imputation-foundations description: Frames the phasing/imputation pipeline before any tool runs: phasing and imputation are one Li-Stephens copying HMM (recombination is the transition, mutation the emission, the genetic map and Ne set the rates), imputation's honest output is a dosage with a self-estimated quality (INFO/R2/DR2) not a hard genotype, and the stages are ordered and each fails silently (QC, align build and strand to the panel, phase, impute per chromosome, fil
tools
Chooses the enrichment generation before any tool runs, mapping the input shape to a method class - a pre-selected gene list plus a background to over-representation analysis (ORA, hypergeometric), a ranked statistic for all genes to gene set enrichment (GSEA), a signed signaling topology to pathway-topology (SPIA) - then making the null explicit (competitive vs self-contained, gene vs subject sampling) and running a trustworthiness checklist (testable-gene universe, FDR, redundancy collapse, leading-edge check, version reporting). Covers why every clusterProfiler GSEA is the inter-gene-correlation-uncorrected competitive null, why the background not the gene list decides ORA significance, and why no method is universally best. Use when deciding ORA vs GSEA vs topology, which gene-set DB, whether a result is trustworthy, or which null a tool computes. For ORA see go-enrichment, GSEA see gsea, databases kegg-pathways/reactome-pathways/wikipathways; the ranking comes from differential-expression/de-results.
testing
End-to-end GWAS workflow from VCF to association results. Covers PLINK QC, population structure correction, and association testing for case-control or quantitative traits. Use when running genome-wide association studies.
development
Orchestrates the full path from differential expression results to redundancy-collapsed functional enrichment: choose ORA vs GSEA, convert gene IDs per method, run enrichGO/enrichKEGG/enrichPathway/enrichWP or gseGO/gseKEGG (clusterProfiler, ReactomePA, rWikiPathways), and visualize. Routes the ORA-vs-GSEA generation fork and the null/universe/reproducibility theory to pathway-analysis/enrichment-foundations. Use when a DESeq2/edgeR/limma result must become enriched GO terms, KEGG/Reactome/WikiPathways pathways, or a GSEA leading edge; when deciding whether a ranking exists for all genes (GSEA, named decreasing vector) or only a pre-selected list (ORA plus a defensible background universe); or when assembling DE-to-pathway end to end. The DE list and ranking statistic come from differential-expression/de-results; per-method nuance lives in the pathway-analysis skills.