epidemiological-genomics/variant-surveillance/SKILL.md
Assigns pathogen lineages (SARS-CoV-2 Pangolin via UShER mode; Nextclade clade + QC; pango-designation alias_key.json resolution) and tracks variant frequencies over time using Nextstrain (Augur + Auspice), wastewater deconvolution (Freyja, COJAC, alcov, lineagespot), lineage fitness modelling (Wenseleers / Bedford-Figgins multinomial logistic), and recombinant detection (3SEQ, RDP4, Bolotie). Covers Pangolin pangolin-data version pinning (mandatory for reproducibility), Nextclade dataset versioning (lineage-defining mutations change with dataset), Freyja barcode forward-only date constraint, ARTIC primer scheme version churn (V3 / V4 / V4.1 / V5.3.2 / Midnight 1200) with documented dropout regions, recombinant X-prefix Pango designation lag, GISAID vs INSDC dual-deposition tensions, and the Karthikeyan 2022 wastewater early-detection signal with explicit reproducibility caveats. Use when assigning Pango lineages and Nextclade clades to viral consensus sequences, building Nextstrain Augur surveillance pipelines, deconvolving wastewater pooled samples into lineage frequencies with Freyja, tracking lineage frequencies and growth advantages over time, pinning pangolin-data / Nextclade dataset versions for reproducibility, handling ARTIC primer dropouts (V4.1 amplicons 64 / 76 / 88-90), or running variant surveillance for SARS-CoV-2 / influenza / Mpox / RSV / H5N1 / measles.
npx skillsauth add GPTomics/bioSkills bio-epidemiological-genomics-variant-surveillanceInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: pangolin 4.3+ (pangolin-data 1.30+), nextclade 3.8+, augur 24.0+, freyja 1.4+, cojac 0.9+, lineagespot 1.6+ (Bioconductor), usher 0.6+, matUtils 0.6+, samtools 1.20+, lofreq 2.1+, ivar 1.4+, ARTIC pipeline 1.3+, snakemake 8.5+, pandas 2.2+, BioPython 1.84+, jq 1.7+.
Before using code patterns, verify installed versions match. If versions differ:
pangolin --all-versions -- prints pangolin + pangolin-data + scorpio + constellations versionsnextclade dataset list --tag latest sars-cov-2 -- list current dataset tagsfreyja --version; freyja barcode-build --help (note: HYPHEN, not underscore; some legacy docs show barcode_build)nextclade run --help -- v3+ syntax replaced v2; old nextclade invocation no longer worksaugur --version; augur refine --help for current root-strategy flagsIf pangolin --inference usher is rejected, the flag is --analysis-mode usher (no --inference). If nextclade --input-dataset DIR works, the installed version may be v2; v3 accepts both but --dataset NAME is the modern form for built-in datasets. Pangolin and Nextclade output column names differ between major releases -- introspect rather than retry.
"Which lineages are circulating, and how fast are they growing?" -> Assign consensus or wastewater samples to a curated lineage / clade nomenclature, then track frequencies over time with explicit version pinning. The lineage assignment is NOT a stable property of the sequence; it is a property of the sequence interpreted by a specific pangolin-data / Nextclade-dataset version. Two labs running the same Pangolin binary with different pangolin-data versions can produce different calls on the same genome. For published or regulatory output, pin BOTH the executable AND the dataset version (pangolin --all-versions; nextclade dataset list --tag latest), and re-run the whole archive after every dataset update -- comparing today's BA.2.86 call to last month's "Unassigned" call is invalid.
pangolin sequences.fasta --analysis-mode usher --outfile lineage_report.csv -- UShER mode is the default since v4 (pangoLEARN deprecated mid-2023)nextclade run --input-dataset nc_dataset/sars-cov-2 --output-tsv nc.tsv sequences.fasta -- clade + Pango + QC + mutationsfreyja variants sample.bam --variants sample.variants.tsv --depths sample.depths.tsv --ref reference.fa then freyja demix sample.variants.tsv sample.depths.tsv --output sample.demix.tsv -- wastewater lineage deconvolutionaugur refine --tree tree.nwk --alignment aln.fasta --metadata meta.tsv --output-tree refined.nwk --root oldest --timetree -- Nextstrain time-scalingA SARS-CoV-2 sequence called BA.5 today might be called BA.5.2.1 next week and KP.3 a month after that. pangolin-data and nextclade-dataset are updated weekly; lineage definitions evolve through pango-designation GitHub issues, often days-to-weeks before pangolin-data releases include the lineage. During the lag window, the same genome submitted in lab A (older pangolin-data) and lab B (current) gets different calls. The cross-lab "different lineage" result is then misread as biology. For any report, pin BOTH the executable AND the dataset version with pangolin --all-versions and nextclade dataset list --tag latest recorded alongside the call. For longitudinal studies, re-run the WHOLE archive after every dataset update -- comparing today's BA.2.86 call against last month's "Unassigned" call is invalid. Second-order insight: Pangolin's pangoLEARN mode was officially deprecated mid-2023 in favour of UShER mode (Pongmoragot 2024 Virus Evol 10:vead085); cross-study comparison of XBB sub-lineage prevalence from 2022 - mid-2023 is contaminated by the pangoLEARN -> UShER mode switch even when the same pangolin-data version is used.
| Tool | Mechanism | Inputs | Output | Strength | Fails when | |------|-----------|--------|--------|----------|------------| | Pangolin UShER mode (O'Toole 2021 Virus Evol 7:veab064; Pongmoragot 2024 Virus Evol 10:vead085) | Parsimony placement on daily-updated UShER mutation-annotated tree | SARS-CoV-2 consensus | Pango lineage call | UShER is the default since v4; more accurate than pangoLEARN for recent / divergent lineages | Designation lag for emerging lineages; recombinants require manual Pango-X designation | | Pangolin pangoLEARN mode | Random-forest classifier trained on pangolin-data | SARS-CoV-2 consensus | Pango lineage call | Fast | DEPRECATED mid-2023; less accurate than UShER for novel sub-lineages | | Nextclade (Aksamentov 2021 JOSS 6:3773) | Reference-tree placement + clade assignment + mutation calling + QC | Viral consensus (multi-pathogen) | Clade + Pango + QC + mutations | Integrated alignment QC; mutation outliers; recombination indicators | Dataset version drift changes lineage-defining mutations | | Nextstrain Augur (Huddleston 2021 JOSS 6:2906) | Python CLI for subsampling + alignment + tree + ancestral-trait + time-tree | Genomes + metadata + sampling config | Auspice JSON for visualization | End-to-end pipeline for curated surveillance builds | Subsampling configuration drives results more than data; nextstrain.org subsamples ~3000-5000 of millions | | UShER + matUtils + matOptimize + RIPPLES (Turakhia 2021 Nat Genet 53:809) | Parsimony placement on daily MAT; SPR refinement; recombination detection | New consensus + existing MAT | Updated MAT, subtrees, recombinant calls | Pandemic-scale (millions of genomes) | Parsimony branch lengths systematically shorter than ML; re-estimate before downstream R_e | | Freyja (Karthikeyan 2022 Nature 609:101) | Depth-weighted LAD regression on barcode-matrix mutation frequencies | Wastewater BAM + barcode | Per-lineage abundance | Recovers expected abundances down to ~5%; quantitative | Lineages absent from barcode invisible; barcode is forward-only -- cannot deconvolve lineages designated after barcode date | | COJAC (Jahn 2022 Nat Microbiol 7:1151) | Co-occurrence of signature mutations on the same read pair | Wastewater BAM | Per-lineage presence / absence | More robust than per-site frequencies; detected Alpha 13 days before clinical | Single-read amplicons (no co-occurrence) cannot resolve; requires paired-end or long-read | | alcov | Lineage deconvolution similar paradigm to Freyja | Wastewater BAM | Per-lineage abundance | Alternative to Freyja | Less benchmarked | | lineagespot (Pechlivanis 2022 Sci Rep 12:2659) | R/Bioconductor lineage deconvolution from VCF + signature mutations | VCF + reference lineage mutations | Per-lineage abundance | R / Bioconductor integration | Less ML-driven; smaller community | | Wenseleers / Bedford-Figgins multinomial logistic (Abousamra, Figgins, Bedford 2024 PLoS Comput Biol 20:e1012443) | Multinomial logistic regression on lineage frequencies over time | Lineage frequencies + dates | Growth advantage per lineage with 95% CI | Standard for outbreak.info / cov-lineages.org | Marginal CI for one lineage hides covariance with all others; early estimates inflated | | 3SEQ (Boni 2007 Genetics 176:1035) | Triplet-based recombination detection | Aligned sequences | Recombinant candidates | General-purpose | High false-positive rate at low divergence | | RDP4 / RDP5 (Martin 2015 Virus Evol 1:vev003) | Multiple-method recombination detection | Aligned sequences | Recombinant candidates | Multi-method consensus | Slow; parameter-sensitive | | Bolotie (Varabyou 2021 Bioinformatics 37:2298) | SARS-CoV-2-specific recombination detection | SARS-CoV-2 consensus | Recombinant candidates | Tuned for SARS-CoV-2 sub-lineage divergence | Specialist tool |
| Scenario | Recommended | Why wrong choices fail |
|----------|-------------|------------------------|
| Assign lineage to a SARS-CoV-2 consensus | Pangolin with --analysis-mode usher (UShER default since v4) + Nextclade cross-check; pin pangolin-data and Nextclade dataset versions | pangoLEARN alone (deprecated); not pinning version (cross-lab calls diverge) |
| Detect emerging variants in wastewater | COJAC for early detection (co-occurrence on amplicon) + Freyja for quantitative tracking; pin Freyja barcode version | Naive site-frequency aggregation; comparing across barcode versions |
| Track lineage frequencies over time | Multinomial logistic regression (Wenseleers / Bedford-Figgins) OR Bayesian renewal equation; report covariance among lineages | Plotting raw counts without CI; reporting single-lineage growth advantage without covariance |
| Build a regional surveillance phylogeny | Nextstrain Augur pipeline; subsample to manageable size; TreeTime for dates; document subsampling | BEAST on raw 10k+ samples (intractable); not documenting subsampling |
| Compare wastewater results across labs | Same primer scheme + same Freyja barcode + same Pangolin / Nextclade version | Mixing primer schemes; mixing barcode versions; comparing across pangolin-data versions |
| QC a new SARS-CoV-2 genome | Nextclade (alignment QC; mutation outliers; recombination indicators) | Pangolin alone (passes confidently on bad genomes) |
| Detect recombinant lineages | Trust Pango-designation X-prefix assignments; for novel candidates use RDP5 / 3SEQ / Bolotie + manual review | Manual eyeballing of mutation patterns; ignoring designation lag |
| Phylogenetic context for outbreak | UShER + matUtils subtree extraction; re-estimate branch lengths via TreeTime for downstream R_e | Re-treeing from scratch every time |
| Estimate vaccine-escape risk | Lab assays (neutralisation, escape mutants) + structural prediction; genomic surveillance flags candidates | Pure genomic prediction without lab validation |
| Wastewater-to-cases conversion | Variant-specific shedding rate (Omicron BA.1 shed less per case than Delta); pin barcode + report uncertainty | Fixed RNA-to-cases ratio across variants is wrong; variant-specific shedding has been documented in the wastewater literature |
Methodology evolves; before any high-stakes lineage report, verify Pangolin's current default analysis-mode and Nextclade's bundled dataset against pango-designation issues for any emerging lineage.
Goal: Assign Pango lineages to SARS-CoV-2 consensus sequences using UShER mode (the default since v4; pangoLEARN deprecated mid-2023), with full pangolin-data version provenance preserved for reproducibility.
Approach: Always pass --analysis-mode usher; record pangolin --all-versions output alongside every lineage call; for published or regulatory output, pin pangolin-data to a specific release tag and re-run the whole archive whenever the version is updated.
pangolin sequences.fasta --analysis-mode usher --outfile lineage_report.csv
pangolin --all-versions > pangolin_versions.txt
pangolin --all-versions prints: pangolin executable version, pangolin-data version (weekly updated; mandatory pin for reproducibility), scorpio version, and constellations version. All four are version-sensitive; in published surveillance reports, pin all four.
Goal: Assign Nextstrain clade, Pango lineage, mutations, and QC flags to SARS-CoV-2 consensus sequences with explicit dataset version provenance.
Approach: Fetch the current dataset with nextclade dataset get --name sars-cov-2 --output-dir nc_dataset/sars-cov-2; record the pathogen.json tag / commit hash; run nextclade run --input-dataset on the pre-downloaded folder so the dataset version is locked in for the analysis.
nextclade dataset get --name sars-cov-2 --output-dir nc_dataset/sars-cov-2
NC_DATASET_TAG=$(jq -r '.tag // .version' nc_dataset/sars-cov-2/pathogen.json)
nextclade run \
--input-dataset nc_dataset/sars-cov-2 \
--output-tsv nextclade.tsv \
--output-json nextclade.json \
sequences.fasta
echo "nextclade_dataset_tag: ${NC_DATASET_TAG}" > nextclade.metadata
Different dataset versions assign different mutations as "lineage-defining" because internal-node placement can shift as the tree grows. Cross-version comparison of mutation reports is therefore method-dependent.
Goal: Estimate per-lineage abundance in a wastewater pooled sample with explicit handling of the barcode forward-only date constraint, primer-scheme awareness, and residual mass interpretation.
Approach: Confirm barcode date postdates sample collection; if not, freyja barcode-build from the current UShER tree; variant call with freyja variants then deconvolve with freyja demix; inspect the resid column (residual mass NOT assigned to known lineages; high resid indicates a novel lineage is invisible); apply primer-scheme-aware coverage masking; report variant-specific uncertainty.
freyja update
freyja variants \
sample.bam \
--variants sample.variants.tsv \
--depths sample.depths.tsv \
--ref reference.fa
freyja demix \
sample.variants.tsv \
sample.depths.tsv \
--output sample.demix.tsv
The Freyja --barcodes (or default bundled) date MUST postdate the sample collection date. Lineages designated after the barcode date cannot be detected -- the demixing silently fails and presents as elevated abundance of the closest parent lineage. For samples potentially containing emerging lineages, regenerate barcodes:
freyja barcode-build \
--pb-and-meta usher_tree.pb \
--output-dir custom_barcodes
Subsequent methodological extensions to Karthikeyan have appeared in the wastewater literature, and recent benchmarks comparing the major deconvolution tools (Freyja, COJAC, alcov, lineagespot, LCS) confirm Freyja and COJAC consistently perform well, with performance degrading at low coverage and for divergent lineages.
Goal: Detect emerging variants in wastewater earlier than per-site frequency methods by requiring co-occurrence of two signature mutations on the same amplicon (read pair).
Approach: COJAC checks read pairs for joint occurrence of variant-defining mutations; the inferential leap is more robust because a single site can have shared mutations across lineages, but two signature mutations on the same read pair strongly imply a single lineage. Detected Alpha 13 days before clinical samples in Swiss data (Jahn 2022 Nat Microbiol 7:1151).
cojac cooc-mutbamscan \
-a primer_scheme.bed \
-m variants_definitions.yaml \
-b sample.bam \
-o sample.cooc.tsv
Goal: Build a curated regional surveillance phylogeny with subsampling, alignment, tree, ancestral-trait inference, and time-scaling -- in Auspice-visualisable format. The Nextstrain platform was introduced by Hadfield 2018 Bioinformatics 34:4121; Augur is the Python CLI (Huddleston 2021 JOSS 6:2906).
Approach: Pull the latest official pathogen build from github.com/nextstrain/<pathogen>; subsample to manageable size (typically 3000-5000 genomes per global build; smaller regional); document subsampling configuration explicitly (it drives the result more than the underlying data per Hodcroft 2021).
augur align --sequences seqs.fasta --reference-sequence ref.gb --output aligned.fasta
augur tree --alignment aligned.fasta --output tree.nwk
augur refine \
--tree tree.nwk \
--alignment aligned.fasta \
--metadata meta.tsv \
--output-tree refined.nwk \
--output-node-data branch_lengths.json \
--timetree \
--root oldest \
--coalescent skyline
augur ancestral --tree refined.nwk --alignment aligned.fasta --output-node-data nt_muts.json
augur translate --tree refined.nwk --ancestral-sequences nt_muts.json --reference-sequence ref.gb --output-node-data aa_muts.json
augur traits --tree refined.nwk --metadata meta.tsv --columns country region --output-node-data traits.json
augur export v2 \
--tree refined.nwk \
--metadata meta.tsv \
--node-data branch_lengths.json nt_muts.json aa_muts.json traits.json \
--output auspice.json
Hodcroft 2021 Nature 591:30 documented that Nextstrain subsampling configurations drive lineage-time estimates more than the underlying data. Two researchers using the official pipeline with different subsampling can get different MRCA dates and migration patterns from the same raw genomes.
Trigger: Two labs submit the same consensus genome to Pangolin with different pangolin-data versions; the lineage call differs.
Mechanism: Lineage designation happens through pango-designation GitHub issues -- days-to-weeks before pangolin-data releases include the lineage. During the lag, the same genome is callable as the parent (older pangolin-data) or the child (current). pangolin-data is updated weekly.
Symptom: Cross-lab lineage prevalence comparisons over time show implausible jumps coinciding with pangolin-data release dates rather than biology.
Fix: Pin pangolin-data version explicitly with pangolin --all-versions recorded alongside every call. For published or regulatory output, re-run the WHOLE archive against a single pangolin-data version before reporting.
Trigger: Wastewater sample collected after a new lineage was designated; Freyja barcode built before that designation.
Mechanism: Freyja barcodes are built from the UShER tree at a specific date; lineages designated AFTER the barcode date cannot be detected. The demixing silently fails -- the new lineage's signal is misassigned to its closest parent.
Symptom: Wastewater sample shows implausibly high abundance of a single parent lineage; new lineage that should be present is reported as 0%.
Fix: Run freyja update regularly; for samples potentially containing emerging lineages, regenerate barcodes with freyja barcode-build from the current UShER tree. Report resid (residual mass not assigned to known lineages); high resid indicates a novel lineage is being missed.
Trigger: SARS-CoV-2 surveillance using ARTIC V4.1 amplicons; new variant has mutation at primer site; amplicons 64 / 76 / 88-90 silently drop out.
Mechanism: When a primer fails to bind, the amplicon doesn't amplify; consensus calling produces N's or reference-derived calls in that region. This LOOKS LIKE a deletion in downstream analysis but is actually missing data. Itokawa 2020 PLoS ONE 15:e0239403 documented primer interactions specifically.
Symptom: "Deletion" calls cluster in known dropout amplicons; Pangolin / Nextclade lineage call shifts when masked positions are filled with reference.
Fix: Inspect per-amplicon coverage with samtools depth -aa; mask consensus positions in dropped amplicons (use Ns -- Pangolin and Nextclade handle Ns gracefully). Document primer scheme version (V3 / V4 / V4.1 / V5.3.2 / Midnight) per isolate.
Trigger: A SARS-CoV-2 recombinant (e.g., XEC = KS.1.1 x KP.3.3) emerges; pango-designation has not yet issued the X-prefix designation; Pangolin assigns to one of the parents.
Mechanism: Pangolin in either mode assigns a recombinant to one parent lineage if no Pango-X designation exists yet. Identifying recombinants requires breakpoint detection (3SEQ, Bolotie, RDP4) and manual designation through pango-designation; the designation can lag emergence by weeks-to-months for novel recombinants.
Symptom: Outbreak interpretation conflates a recombinant lineage with its parent; transmissibility / immune-escape claims are wrong.
Fix: For any candidate emerging lineage with unusual mutations, run Bolotie or 3SEQ for recombination detection; cross-check Pangolin vs Nextclade lineage call; submit candidate recombinants to cov-lineages issue tracker if novel.
Trigger: Pangolin run with --analysis-mode pangolearn (or via legacy Docker image that defaults to pangoLEARN); user reports the call.
Mechanism: Pongmoragot 2024 Virus Evol 10:vead085 demonstrated UShER mode is significantly more accurate for recent / divergent lineages. pangoLEARN was officially deprecated mid-2023.
Symptom: Cross-lab comparison reveals one lab using pangoLEARN (legacy) and another using UShER; calls differ at borderline lineages.
Fix: Switch to --analysis-mode usher (default since v4). For longitudinal datasets crossing the mid-2023 mode-switch, re-run the historical archive against UShER mode.
Trigger: Script written from older Freyja documentation using freyja barcode_build (underscore).
Mechanism: Current Freyja versions use barcode-build (hyphen); the underscore form may not be recognised.
Symptom: Subprocess fails with "unrecognized command".
Fix: Use freyja barcode-build (hyphen). Verify with freyja --help.
Trigger: Reporting a single lineage's growth advantage 95% CI from a multinomial logistic regression.
Mechanism: The CI for any single lineage is conditional on all other lineages being held at their estimated growth rates; the marginal CI hides covariance among lineages. Early growth-advantage estimates are systematically too large; they shrink as more time passes (alternative explanations become identifiable).
Symptom: Initial published growth advantage > later refined estimate; "outlier-fast" lineages later moderated.
Fix: Report the full multinomial covariance matrix or at minimum the rank-ordered growth advantages with simultaneous CIs. Cite Abousamra 2024 PLoS Comput Biol 20:e1012443.
Trigger: Nextstrain Augur build with default subsampling at 3000-5000 genomes from millions; user interprets the tree topology as authoritative.
Mechanism: Hodcroft 2021 Nature 591:30 commented that subsampling decisions drive lineage-time estimates more than the underlying data. Two researchers using the official Nextstrain pipeline with different subsampling configurations get different MRCA dates and migration patterns from the same raw genomes.
Symptom: Published Nextstrain tree differs from another analysis on the same raw data; conclusions sensitive to subsampling.
Fix: Document subsampling configuration explicitly in any Nextstrain build; run sensitivity analysis with alternative subsampling; treat MRCA dates and migration calls with appropriate uncertainty.
| Pattern | Likely cause | Action |
|---------|--------------|--------|
| Pangolin "BA.2.86", Nextclade clade "23I" | Equivalent at different resolutions -- BA.2.86 is within 23I | Report both; Pango lineage for sub-clade resolution |
| Pangolin "BA.5.2", Nextclade "Unassigned" | Nextclade dataset older than pangolin-data; OR Nextclade QC failed | Update Nextclade dataset; re-run; inspect QC fields |
| Pangolin UShER and pangoLEARN disagree | pangoLEARN is the deprecated decision-tree classifier | Trust UShER call |
| Freyja shows 0% of expected lineage | Lineage absent from current barcode (post-dates barcode) | Rebuild barcodes (freyja barcode-build); confirm lineage is in the UShER tree the barcode is built from |
| Freyja confidence < 0.7 on dominant lineage | Sub-100x coverage OR amplicon dropout | Inspect per-amplicon coverage; consider re-sequencing; report as indeterminate |
| Nextclade and Pangolin disagree on recombinant | Recombinants inherently ambiguous; depends on which parent's SNPs dominate | Report as recombinant candidate; submit to cov-lineages if novel |
| Two consecutive pangolin-data releases call the same consensus differently | Lineage definitions revised between releases | Pin pangolin-data; record version + date alongside lineage |
| COJAC detects a variant Freyja does not | COJAC's co-occurrence requirement more sensitive at low abundance | Trust COJAC for early detection; Freyja for quantitative tracking |
| Wastewater Freyja result conflicts with clinical lineage prevalence | Barcode staleness; primer dropout in wastewater; faecal shedding rate varies by variant | Update barcode; check per-amplicon coverage; flag variant-specific shedding |
| Quantity | Threshold | Source / rationale |
|----------|-----------|--------------------|
| Pangolin min coverage for lineage call | >=50% genome coverage (~14kb) | Pangolin convention |
| Nextclade QC stop-codon threshold | Per dataset; check pathogen.json | Nextclade dataset-specific |
| Freyja minimum coverage per site | >=10x typical | Freyja convention; per-site weighting accounts for variance |
| Freyja resid flag threshold | Project-specific; >0.1 typically indicates novel lineage missed | Freyja documentation |
| COJAC early-detection lead time vs clinical | Up to 13 days in Swiss data | Jahn 2022 Nat Microbiol 7:1151 |
| Karthikeyan 2022 wastewater Omicron lead | 11 days before clinical detection (San Diego) | Most-favourable configuration; subsequent retrospective analyses produced detection lags ranging from -5 to +3 days |
| ARTIC V4.1 known chronic dropouts | Amplicons 64, 76, 88-90 | Itokawa 2020 / community documentation |
| Augur subsampling typical | 3000-5000 genomes per global build | Nextstrain convention; document explicitly |
| Multinomial logistic growth-advantage early estimate inflation | Systematically too large; shrinks over time | Abousamra 2024 PLoS Comput Biol 20:e1012443 |
| GISAID 2024-2025 weekly submission rate | ~5,000-20,000/week (down from ~500,000/week peak early 2022) | Community-documented; emerging-lineage detection lag increased |
| Error / symptom | Cause | Solution |
|-----------------|-------|----------|
| pangolin --inference usher rejected | Flag is --analysis-mode usher | --analysis-mode usher |
| nextclade run --input-dataset DIR rejected on v2 | v2 used --input-dataset differently | Verify nextclade --version; v3+ accepts pre-downloaded dataset folder |
| freyja barcode_build rejected | Current is barcode-build (hyphen) | Use hyphen form |
| Pangolin output column not present | Column names changed between major releases | Introspect output schema; pangolin --all-versions |
| Freyja silently misassigns new lineage | Barcode predates lineage designation | Rebuild barcodes; check resid |
| Nextclade and Pangolin disagree | Different versions; recombinant; QC | Update both; reconcile per table |
| ARTIC consensus has Ns clustered in one region | Primer dropout in that amplicon | Mask the amplicon; document scheme version |
| Lineage frequency shows implausible jump | pangolin-data version drift | Pin version; re-run archive |
| Augur tree topology changes between runs | Subsampling randomness | Pin random seed; document subsampling |
| Augur refine requires --root | Shallow tree without explicit root strategy | --root best / oldest / residual |
| Wastewater Freyja result differs from clinical | Barcode staleness or primer-scheme mismatch | Update barcode; document scheme; check coverage |
| COJAC misses a known variant | Single-read amplicons (no co-occurrence) | Re-sequence with paired-end or long-read |
| Pushback | Response |
|----------|----------|
| "Pangolin version?" | pangolin --all-versions recorded; pinned for the analysis; archive re-run on dataset update |
| "Nextclade dataset version?" | Dataset tag from pathogen.json recorded; pre-downloaded folder used to lock the version |
| "Why UShER not pangoLEARN?" | pangoLEARN deprecated mid-2023 (Pongmoragot 2024); UShER default since v4 |
| "How were ARTIC dropouts handled?" | Per-amplicon coverage checked; failed amplicons masked; primer scheme documented per isolate |
| "Were recombinants checked for?" | Bolotie / 3SEQ run on candidates; cross-checked Pangolin vs Nextclade; submitted to cov-lineages for novel candidates |
| "Wastewater barcode date?" | Barcode date postdates sample collection; freyja barcode-build from current UShER tree if needed |
| "Wastewater-to-cases conversion?" | Variant-specific shedding rate flagged in the wastewater literature; not assumed constant |
| "Lineage growth-advantage CI?" | Multinomial covariance reported; early estimates noted as inflated (Abousamra 2024) |
| "Nextstrain subsampling?" | Configuration explicit; sensitivity analysis run; MRCA / migration treated with uncertainty (Hodcroft 2021) |
development
Find restriction enzyme cut sites in DNA sequences using Biopython Bio.Restriction. Search with single enzymes, batches of enzymes, or commercially available enzyme sets. Returns cut positions for linear or circular DNA. Use when finding restriction enzyme cut sites in sequences.
development
Create restriction maps showing enzyme cut positions on DNA sequences using Biopython Bio.Restriction. Visualize cut sites, calculate distances between sites, and generate text or graphical maps. Use when creating or analyzing restriction maps.
development
Analyze restriction digest fragments using Biopython Bio.Restriction. Predict fragment sizes, get fragment sequences, simulate gel electrophoresis patterns, and perform double digests. Use when analyzing restriction digest fragment patterns.
development
Select restriction enzymes by criteria using Biopython Bio.Restriction. Find enzymes that cut once, don't cut, produce specific overhangs, are commercially available, or have compatible ends for cloning. Use when selecting restriction enzymes for cloning or analysis.