epidemiological-genomics/transmission-inference/SKILL.md
Infers person-to-person transmission from pathogen genomes using outbreaker2 (Campbell 2018), TransPhylo (Didelot 2017), phybreak (Klinkenberg 2017), BadTrIP (De Maio 2018), SCOTTI (De Maio 2016), BEASTLIER (Hall 2015), and SNP-distance / cluster-picker approaches (HIV-TRACE for HIV; transcluster). Defines outbreak clusters using pathogen-specific SNP thresholds (NOT a universal cutoff -- TB <=12 SNPs / Walker 2013; MRSA <=15 / Coll 2017; C. difficile <=2 / Eyre 2013; Klebsiella <=21 / Snitkin 2012), models within-host diversity and transmission bottlenecks (Worby-Lipsitch-Hanage 2014; McCrone 2018; Sobel Leonard 2017; Lythgoe 2021 SARS-CoV-2), integrates contact-tracing data, distinguishes generation interval from serial interval (Britton & Scalia Tomba 2019; Ali 2020), attributes source via Bayesian source attribution (Mather 2013 DT104; islandR), and reconciles transmission-network reconstruction with epi metadata. Use when investigating outbreaks for who-infected-whom, defining SNP-cluster outbreak definitions, accounting for unsampled intermediates, choosing between outbreaker2 (rich epi data) and TransPhylo (genomic-only after dated phylogeny), running source attribution between host populations, calling HIV-TRACE thresholds appropriate to the local subtype, or distinguishing recent transmission from reactivation in TB / chronic HIV.
npx skillsauth add GPTomics/bioSkills bio-epidemiological-genomics-transmission-inferenceInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: TransPhylo 1.4+ (R), outbreaker2 1.2+ (R), phybreak 0.5+ (R), BadTrIP via BEAST 2.7+ package manager, BEASTLIER via BEAST 1.10+, transcluster 1.0+ (R), HIV-TRACE 1.5+, snp-dists 0.8+, ape 5.8+ (R), igraph 1.6+ (R), TreeTime 0.11+, BactDating 1.1+ (R), BEAST 2.7.6+, lofreq 2.1+, deepSNV via Bioconductor 3.18+, pandas 2.2+, BioPython 1.84+.
Before using code patterns, verify installed versions match. If versions differ:
packageVersion('TransPhylo'); ?inferTTree to confirm arg namespackageVersion('outbreaker2'); ?create_config -- iteration count is set via n_iter in the config object, NOT as iters to outbreaker()pip show lofreq; check whether deep variant calling supports the target MAFsnp-dists --help; hiv-trace --helpIf R rejects an argument, the function signature changed between minor releases; ?function_name is authoritative.
"Who infected whom in this outbreak, and is this even an outbreak?" -> Pick the question first (cluster definition vs WIWS who-infected-whom vs source attribution), then the method that fits the data (rich epi + dense sampling -> outbreaker2; sparse sampling + good dated tree -> TransPhylo; longitudinal within-host samples -> BEASTLIER / BadTrIP; rapid surveillance triage -> SNP-distance with pathogen-tuned threshold). Genomic distance is necessary but not sufficient for direction: two isolates 3 SNPs apart could be A->B, B->A, A->Unknown->B, or two-from-one common source. Direction inference requires temporal data, within-host diversity, contact-tracing data, or all three.
outbreaker2::outbreaker(data=outbreaker_data(dates=..., dna=..., w_dens=..., f_dens=..., ctd=...), config=create_config(n_iter=1e6)) -- dense outbreak with contact dataTransPhylo::inferTTree(ptree, mcmcIterations=1e5, w.shape=1.3, w.scale=10) -- sparse outbreak from a dated treesnp-dists -c gubbins.filtered_polymorphic_sites.fasta > pairwise.csv -- pairwise SNP for cluster triagehiv-trace --threshold 0.015 -- HIV cluster definition at the US-CDC default (subtype B); reconsider for non-B subtypesThe pathogen-specific SNP threshold varies by 10x across taxa (TB <=12 SNPs, C. difficile <=2, MRSA <=15, Salmonella cgMLST <=5, Klebsiella <=21, SARS-CoV-2 not defined by SNP alone). Substitution rate, recombination, generation time, within-host diversity, and (for Mpox) APOBEC3 editing all vary by 100x. Walker 2013 Lancet Infect Dis 13:137 derived the TB <=12 SNP cutoff from UK Oxfordshire (low-transmission, contact-traced, household settings); applying the same threshold in Cape Town or Mumbai inflates apparent recent-transmission rates 2-5x because clonal isolates linked through long-past common ancestors get pooled with truly recent transmissions. Worby, Lipsitch & Hanage 2014 PLoS Comput Biol 10:e1003549 formally showed that within-host bacterial diversity puts an irreducible upper bound on the resolution of SNP-distance transmission-network reconstruction even with repeated sampling. Always cite the pathogen-specific source AND its derivation population; never apply a threshold outside its validated context without an explicit caveat. For TB / HIV / chronic infections, naive SNP cutoffs fail because of reactivation and within-host coalescence -- use TransPhylo or outbreaker2 with within-host-aware priors.
| Tool | Mechanism | Inputs | Output | Strength | Fails when | |------|-----------|--------|--------|----------|------------| | Pairwise SNP threshold (snp-dists; cluster picker) | Count SNPs between pairs; threshold + linkage | Core-SNP alignment | Adjacency at threshold | Fast; intuitive; standard for surveillance triage | Pathogen-specific cutoff; convergent evolution and recombination violate distance assumptions | | HIV-TRACE (Kosakovsky Pond 2018 Mol Biol Evol 35:1812) | TN93 pairwise distance + threshold (default 1.5%) | HIV-1 pol or other gene | Cluster membership | CDC standard for US HIV surveillance | 1.5% threshold is US-CDC subtype B specific; under-clusters subtype C in southern Africa | | outbreaker2 (Campbell 2018 BMC Bioinformatics 19:363) | MCMC; sequence + generation-interval + sampling-time + contact-tracing | Dated genomes + epi data | Posterior WIWS + unsampled intermediates + R_e | Integrates epi data explicitly; modular likelihood | ~100-200 cases practical limit; assumes one infection event per case (no within-host populations) | | TransPhylo (Didelot 2017 Mol Biol Evol 34:997) | Coalescent within-host + birth-death between-host; colours a dated tree | Time-scaled tree + sampling dates | Posterior transmission tree + R_t + unsampled cases | Works from a tree, not raw genomes; scales to ~1000 tips; explicit within-host coalescence | Sensitive to within-host effective population size prior; requires good dated phylogeny | | phybreak (Klinkenberg 2017 PLoS Comput Biol 13:e1005495) | Joint phylogeny + transmission inference via MCMC | Dated genomes | Posterior transmission tree | Proper within-host handling; fast for small outbreaks | <=100 cases; less benchmarked than outbreaker2/TransPhylo | | BadTrIP (De Maio 2018 PLoS Comput Biol 14:e1006117) | Bayesian; explicit handling of multi-strain infections | Dated genomes | Posterior transmission tree with strain-level resolution | Handles within-host diversity / mixed infections (TB, HIV) | Slow; specialist tool | | SCOTTI (De Maio 2016 PLoS Comput Biol 12:e1005130) | Structured-coalescent transmission inference (BEAST 2 package) | Dated genomes | Posterior transmission tree under structured coalescent | Sampling-aware; correctly models unsampled intermediates | Computationally heavy; specialist setup | | BEASTLIER (Hall 2015 PLoS Comput Biol 11:e1004613) | Joint phylogeny + transmission partitioning | Dated genomes; ideally with multiple isolates per host | Posterior transmission tree with within-host partition | Postdoc-grade identifiability with within-host samples | Single-isolate-per-host data is under-identified | | transcluster (Stimson 2019 Mol Biol Evol 36:587) | Per-pair posterior probability under SNP + time prior | Dated genomes | Per-pair cluster membership probability | Probabilistic; pathogen-tuned priors | Pair-level only; no full transmission tree | | Sobel Leonard 2017 J Virol 91:e00171-17 beta-binomial bottleneck | Estimate transmission bottleneck size from donor-recipient deep sequencing | Donor + recipient deep-sequence allele frequencies | Bottleneck Nb posterior | Estimates an otherwise unobservable quantity | Requires deep-sequenced donor-recipient pairs | | islandR / Bayesian source attribution (Mather 2013 Science 341:1514) | Bayesian per-population allele-frequency model | Reference collections per host source + query genome | Per-source posterior probability | Standard in Salmonella / Campylobacter food-safety surveillance | Source-attribution circularity: trained-on-distribution reproduces that distribution |
| Scenario | Recommended approach | Why wrong choices fail |
|----------|----------------------|------------------------|
| "Is this even an outbreak?" routine surveillance triage | snp-dists after Gubbins; pathogen-tuned threshold (Walker 2013 for TB, Eyre 2013 for C. diff, EFSA cgMLST <=5 for Salmonella); cross-check cgMLST distance | Universal SNP threshold across pathogens (10x variation) |
| Densely sampled outbreak with contact-tracing data | outbreaker2 with ctd contact matrix + generation-time prior + sampling-time prior | TransPhylo without epi data (loses information from contacts); naive SNP threshold (ignores within-host diversity) |
| Sparsely sampled, longer-time-scale outbreak | TransPhylo on a BactDating-derived dated tree | outbreaker2 (sampling-completeness assumption broken); SNP threshold inflates clusters with unsampled intermediates |
| TB outbreak with possible reactivation | TransPhylo + transcluster with TB-tuned priors; long within-host coalescent matters | SNP cutoff insufficient -- reactivation can have 0 SNPs from years-old strains |
| Hospital outbreak with possible mixed infection | BadTrIP / SCOTTI | Consensus-only methods (SNP distance, outbreaker2) ambiguous on mixed-strain |
| Multi-site outbreak with import suspected | TransPhylo + MASCOT-derived migration; source-attribution as separate analysis | Source attribution needs phylogeographic component beyond TransPhylo alone |
| Food-vehicle / environmental source attribution | islandR / Bayesian source attribution (Mather 2013 framework); manual cluster + phylogeographic plot | Naive phylogenetic placement loses the per-source priors |
| Sub-sampled outbreak (<50% cases sequenced) | outbreaker2 (handles unsampled cases explicitly with pi sampling parameter) | Raw SNP cutoff -- unsampled intermediates break SNP-distance reasoning |
| Recombining pathogen (S. pneumo, E. coli STEC, K. pneumoniae) | Gubbins / ClonalFrameML mask FIRST; then any of the above | Recombination inflates apparent SNP distance and creates false convergent transmission inference |
| HIV cluster definition | HIV-TRACE 1.5% for subtype B (US-CDC standard); reconsider for non-B subtypes | Applying 1.5% threshold globally without subtype caveat |
| Estimate transmission bottleneck | Sobel Leonard 2017 beta-binomial on deep-sequenced donor-recipient pairs | Consensus-only sequences cannot quantify bottleneck size |
Methodology evolves; before any high-stakes who-infected-whom claim, web-search "outbreak transmission inference benchmark <pathogen> 2025" for current best practice.
Goal: Infer who-infected-whom posterior for a densely sampled outbreak with epi metadata, jointly estimating generation interval and unsampled-case proportion.
Approach: Build outbreaker_data with sampling dates, DNA alignment, generation-time density w_dens, sampling-time density f_dens, and contact-tracing matrix ctd; configure MCMC via create_config(n_iter=N); run; summarise posterior over WIWS.
library(outbreaker2)
library(ape)
dna <- read.dna('alignment.fasta', format='fasta')
dates <- read.csv('sampling_dates.csv')
ctd_matrix <- as.matrix(read.csv('contact_matrix.csv', row.names=1))
w_dens <- dgamma(1:30, shape=2.5, scale=2) # generation time prior
f_dens <- dgamma(1:30, shape=2, scale=3) # sampling-time prior
data <- outbreaker_data(dates=dates$collection_date, dna=dna,
w_dens=w_dens, f_dens=f_dens, ctd=ctd_matrix)
cfg <- create_config(n_iter=1e6, sample_every=200, find_import=TRUE)
res <- outbreaker(data=data, config=cfg)
summary(res)
w_dens is the generation-time distribution (time from infection of A to infection of B) -- NOT the serial interval (time between symptom onsets); using one in place of the other biases inference. Britton & Scalia Tomba J R Soc Interface 16:20180670 (2019) formalised the bias for emerging epidemics; for SARS-CoV-2 with substantial pre-symptomatic transmission (Ali 2020 Science 369:1106), the serial interval shortened from 7.8 to 2.2 days under NPI, and naive SI-based inference was biased.
Goal: Infer transmission tree posterior from a time-scaled phylogeny when raw genomes are not directly usable or when the outbreak is too large for outbreaker2 (>200 cases).
Approach: Time-scale the tree first (BactDating after Gubbins for bacteria; BEAST or TreeTime for viruses); convert to TransPhylo ptree with ptreeFromPhylo; run inferTTree with generation-time prior and within-host effective population size prior; summarise via medTTree (medoid transmission tree) and posterior probabilities per WIWS pair.
library(TransPhylo)
library(ape)
tree <- read.nexus('dated_tree.nexus')
date_last_sample <- 2024.95
ptree <- ptreeFromPhylo(tree, dateLastSample=date_last_sample)
w.shape <- 1.3
w.scale <- 10
ws.shape <- 1.1
ws.scale <- 7
neg <- 0.5
res <- inferTTree(ptree, mcmcIterations=1e5,
w.shape=w.shape, w.scale=w.scale,
ws.shape=ws.shape, ws.scale=ws.scale,
startNeg=neg, dateT=date_last_sample + 0.1)
med_tree <- medTTree(res)
pairs <- extractTTree(med_tree)$ttree
w.* is the generation-time Gamma prior; ws.* is the sampling-time Gamma prior. Both must reflect the pathogen's biology (e.g., TB w.scale = months; SARS-CoV-2 w.scale = days). Wrong priors silently bias the transmission-tree posterior.
Goal: Define outbreak clusters from a recombination-masked core-SNP alignment using the published pathogen-specific threshold, with the threshold's source population caveated.
Approach: Snippy -> snippy-core -> Gubbins on core.full.aln for bacteria -> snp-dists -> single-linkage clustering at the pathogen-specific threshold; cite Walker 2013 (TB), Eyre 2013 (C. diff), Coll 2017 (MRSA), Snitkin 2012 (Klebsiella) per organism; flag any extrapolation outside the threshold's validation population.
snippy-core --ref reference.fa --prefix core snippy_out/*
run_gubbins.py --prefix gubbins core.full.aln
snp-dists -c gubbins.filtered_polymorphic_sites.fasta > pairwise.csv
import pandas as pd
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
dist = pd.read_csv('pairwise.csv', index_col=0)
condensed = dist.values[np.triu_indices(len(dist), k=1)]
THRESHOLD_TB = 12 # Walker 2013 Lancet Infect Dis 13:137 -- UK low-transmission
THRESHOLD_MRSA = 15 # Coll 2017 Clin Infect Dis 65:1781
THRESHOLD_CDIFF = 2 # Eyre 2013 NEJM 369:1195
THRESHOLD_KPNEUMO = 21 # Snitkin 2012 Sci Transl Med 4:148ra116
linkage_matrix = linkage(condensed, method='single')
clusters = fcluster(linkage_matrix, t=THRESHOLD_TB, criterion='distance')
Trigger: Walker 2013 UK 5/12-SNP TB threshold applied to Cape Town or Mumbai high-transmission settings.
Mechanism: Walker 2013 Lancet Infect Dis 13:137 calibrated the 5/12 SNP threshold on Oxfordshire community / household contact-traced data (low-transmission). In high-prevalence settings, clonal isolates linked through long-past common ancestors fall within the threshold without recent direct transmission.
Symptom: Country-level Mtb genomic-epi report shows 60-80% of cases in "transmission clusters", far exceeding clinical contact-tracing rates.
Fix: Cite the threshold's source population; for high-prevalence settings, derive a local threshold from epidemiologically-anchored case pairs in the local cohort rather than importing a UK-low-transmission cutoff. For transmission-direction claims, supplement with TransPhylo / outbreaker2.
Trigger: Outbreak report concluding "A -> B" because A has earlier sampling date and 3 SNPs from B.
Mechanism: A 3-SNP pairwise difference is consistent with A->B, B->A, Unknown->both, or A->Unknown->B. Worby, Lipsitch & Hanage 2014 PLoS Comput Biol 10:e1003549 formalised the irreducible uncertainty. Earlier sampling date does not establish earlier infection date because of within-host evolution and asymptomatic carriage.
Symptom: Outbreak conclusions claim directionality without within-host data or contact tracing; reviewers from the Didelot / Worby groups push back.
Fix: Use "transmission consistent with genomics" not "transmission demonstrated". For direction claims, require within-host samples (BEASTLIER), contact-tracing data (outbreaker2 with ctd), or both. Cite Worby 2014 as the upper bound on what SNP distance can establish.
Trigger: Outbreak with <50% sequencing coverage; transmission inference assumes all cases sampled.
Mechanism: When sampling is incomplete, inferred A->B "direct" transmissions are routinely A->Unknown->B chains. This systematically inflates inferred R_e (longer chains compressed), underestimates generation interval, and biases topology toward bushy trees.
Symptom: Inferred R_e is implausibly high (each "tip" appears to spawn extra children once unsampled intermediates collapse into apparent direct links); generation interval estimate is implausibly short; topology appears bushier than expected.
Fix: Use outbreaker2 with explicit pi (sampling proportion) parameter, or TransPhylo / SCOTTI which model unsampled intermediates explicitly. Cite the unsampled-intermediates caveat in every transmission-inference report.
Trigger: Consensus-genome transmission-pair inference for a pathogen with documented narrow bottleneck (influenza 1-2 virions per McCrone 2018 eLife 7:e35962; SARS-CoV-2 <10 virions per Lythgoe 2021 Science 372:eabg0821).
Mechanism: When the transmission bottleneck is narrow, donor and recipient consensus genomes are near-identical by default -- the bottleneck strips most within-host diversity. Near-identity therefore does NOT discriminate direct transmission from infection by an unsampled intermediate or from a shared common source. Naive coalescent intuition predicts that "more transmissions = more divergence"; the opposite is true under a narrow bottleneck.
Symptom: Most pairs in a dense outbreak appear identical or 1 SNP apart; SNP-distance-based cluster definitions become uninformative; transmission-direction claims based on consensus difference are unfalsifiable.
Fix: For narrow-bottleneck pathogens, supplement consensus-based methods with deep within-host variant calling (lofreq / deepSNV / VarScan2 at MAF >= 1%) on donor-recipient pairs; estimate bottleneck size explicitly via Sobel Leonard 2017 J Virol 91:e00171-17 beta-binomial estimator; report transmission claims as "consistent with" rather than "demonstrated by" consensus identity. Pair-level resolution requires within-host data; without it, claim only cluster membership, not direction.
Trigger: outbreaker2 / EpiNow2 / similar tools fed the serial-interval distribution (w_dens set from symptom-to-symptom data) when the model wants generation-interval (infection-to-infection).
Mechanism: Generation interval = time from infection of A to infection of B; serial interval = time from symptom onset of A to symptom onset of B. They differ when incubation periods vary or pre-symptomatic transmission is substantial. Britton & Scalia Tomba 2019 J R Soc Interface 16:20180670 formalised the bias for emerging epidemics; Ali 2020 Science 369:1106 showed for SARS-CoV-2 the SI shortened from 7.8 to 2.2 days under NPI.
Symptom: Inferred R_e is biased; comparison to case-based R_t (also often SI-based) shows compounding bias.
Fix: Document which distribution w_dens actually encodes. For SARS-CoV-2 with substantial pre-symptomatic transmission, generation interval is ~5 days in the ancestral-strain literature; serial interval was ~4-5 days early but shortened to 2-3 under NPI. Cite Britton 2019.
Trigger: HIV-TRACE run on subtype C sequences from southern Africa with the default 1.5% TN93 threshold.
Mechanism: Kosakovsky Pond et al 2018 Mol Biol Evol 35:1812 documented HIV-TRACE methodology; the 1.5% threshold is the US-CDC default tuned for subtype B in MSM cohorts. Subtype C in southern Africa has higher diversity per unit time and more recent epidemics; the 1.5% threshold under-clusters there.
Symptom: Cluster definitions in southern African subtype C HIV surveillance under-detect transmission; comparison to US surveillance literature shows incompatible cluster sizes.
Fix: Tune threshold for the local subtype and population; cite the local validation. UKHSA / ECDC use different thresholds; document which.
Trigger: Bayesian source attribution model (Mather 2013 Science 341:1514 framework) trained on a reference collection that over-represents one host population.
Mechanism: Source-attribution models reproduce the host-distribution of their training data unless explicitly corrected. If 80% of training isolates are from cattle, the model will tend to attribute new isolates to cattle even when the true source is poultry.
Symptom: Source attribution reproduces the sampling intensity of the reference collection; conclusions are circular.
Fix: Weight by inverse sampling intensity per source category; use rarefied reference collections; report attribution alongside the reference-collection composition as a caveat.
Trigger: SARS-CoV-2 outbreak comparison across samples sequenced with different ARTIC primer schemes (V3 / V4 / V4.1 / V5.3.2); "differences" concentrated in one amplicon are interpreted as real SNPs.
Mechanism: ARTIC primer dropouts produce N's or reference-derived consensus calls in failed amplicons (Itokawa 2020 PLoS ONE 15:e0239403); these LOOK LIKE deletions or reference matches in downstream analysis but are missing data. Cross-scheme comparison without masking failed amplicons produces spurious transmission differences.
Symptom: Cluster definitions differ implausibly between ARTIC-V3 and ARTIC-V4.1 samples; "differences" cluster in known dropout amplicons (V4.1 amplicons 64, 76, 88-90).
Fix: Mask failed amplicons per sample (samtools depth + per-amplicon coverage); document primer scheme version per isolate; for transmission inference, exclude positions in any sample's dropout regions.
| Pattern | Likely cause | Action | |---------|--------------|--------| | outbreaker2 and TransPhylo disagree on WIWS | Different sampling-completeness assumptions; outbreaker2 expects ~dense sampling, TransPhylo handles sparse | Pick the method whose assumption matches the data; cite the choice | | SNP threshold cluster and outbreaker2 cluster differ | SNP threshold ignores temporal data and contacts | Trust outbreaker2 (integrates more evidence); SNP cluster is triage only | | Two consecutive Pangolin versions give different lineage for a "transmission pair" | Lineage definitions revised | Re-run both samples against a single Pango / pangolin-data version | | TB cluster definition flips between 5 and 12 SNP threshold | Walker 2013 ambiguous range | Run TransPhylo for transmission-direction posterior; report SNP-distance with cluster picker certainty | | HIV cluster differs between HIV-TRACE 1.5% and 2.0% | Threshold sensitivity at boundary | Subtype-specific calibration; cite the chosen threshold's validation | | Source attribution differs between islandR runs with different reference panels | Sampling-intensity bias | Re-run with rarefied or inverse-weighted reference; report multiple scenarios |
| Pathogen | "Outbreak cluster" threshold | Source / rationale | |----------|------------------------------|--------------------| | Mycobacterium tuberculosis (whole-genome core SNP) | <=12 SNPs (likely transmission); <=5 SNPs (recent transmission) | Walker 2013 Lancet Infect Dis 13:137 (UK low-transmission setting) | | Staphylococcus aureus (core genome) | <=15 SNPs (within hospital outbreak); <=40 SNPs (broader temporal cluster) | Coll 2017 Clin Infect Dis 65:1781 | | Klebsiella pneumoniae (KPC outbreak) | <=21 SNPs | Snitkin 2012 Sci Transl Med 4:148ra116 | | Salmonella enterica (cgMLST EnteroBase) | <=5 allelic differences (cluster); <=7 (extended cluster) | EnteroBase / EFSA harmonised | | Listeria monocytogenes (PulseNet cgMLST) | <=4 allelic differences | PulseNet protocol convention | | E. coli (cgMLST, EnteroBase) | <=10 allelic differences (STEC outbreak) | EnteroBase convention | | Neisseria gonorrhoeae | <=25 core SNPs (transmission) | UKHSA STI framework | | Clostridioides difficile (core SNP, recombination-masked) | <=2 SNPs (likely direct); <=10 (plausible within 6 months) | Eyre 2013 NEJM 369:1195 | | SARS-CoV-2 (whole-genome) | No fixed cutoff; 0-2 SNPs + epi link + sampling window | Lythgoe 2021 Science 372:eabg0821 | | HIV-1 subtype B (TN93 distance) | 1.5% genetic distance (HIV-TRACE default; US-CDC standard) | Kosakovsky Pond 2018 Mol Biol Evol 35:1812 | | Mpox clade IIb | <=2 SNPs cluster threshold; APOBEC3 editing inflates apparent distance | Mpox 2022 outbreak APOBEC3-editing literature | | Transmission bottleneck -- influenza | ~1-2 virions (narrow) | McCrone 2018 eLife 7:e35962 | | Transmission bottleneck -- SARS-CoV-2 | <10 virions (tight) | Lythgoe 2021 Science 372:eabg0821 | | Generation interval -- SARS-CoV-2 ancestral | ~5 days | SARS-CoV-2 ancestral-strain literature |
CRITICAL: a number from one pathogen does NOT transfer to another. Always cite the source population.
| Error / symptom | Cause | Solution |
|-----------------|-------|----------|
| outbreaker2 rejects iters arg | Iterations set via n_iter in the config object | create_config(n_iter=N) |
| TransPhylo MCMC fails to converge | Within-host Ne prior misspecified; bad input tree | Tune startNeg; verify tree dating quality |
| Cluster definition flips between linkage methods | Single-linkage vs complete-linkage on borderline pairs | Document; sensitivity analysis |
| outbreaker2 estimates implausible R_e | Sampling proportion mis-specified | Set pi based on epi knowledge or estimate within outbreaker2 |
| Transmission inferred between two distant lineages | Recombination unmasked | Run Gubbins on core.full.aln first |
| HIV-TRACE clusters incompatible across labs | Different subtype calibration | Document subtype; use locally validated threshold |
| Source attribution always pointing at one host | Reference-collection bias | Re-weight or rarify reference panel |
| snp-dists -t rejected | -t flag doesn't exist; default IS tab; -c for CSV | Use -c for CSV; default for TSV |
| Snippy outputs disagree across samples | Different reference; reference mismatch silently shifts SNP coordinates | Always document reference; use same reference cross-lab |
| Pushback | Response |
|----------|----------|
| "What SNP threshold and on what population?" | Cite Walker 2013 / Eyre 2013 / Coll 2017 per pathogen; caveat the population if extrapolating |
| "Were unsampled intermediates handled?" | outbreaker2 pi parameter or TransPhylo / SCOTTI explicit modelling; never a raw SNP-distance method on sub-sampled data |
| "Direction of transmission inference?" | Within-host samples + contact tracing required for direction; otherwise "consistent with" phrasing |
| "Generation interval vs serial interval?" | Documented w_dens source; cite Britton 2019 if SI used as approximation for GI |
| "Why TransPhylo / outbreaker2 / phybreak?" | Decision tree based on sampling completeness, dataset size, contact-tracing availability |
| "Was within-host diversity considered?" | TransPhylo's within-host coalescent OR BadTrIP for mixed-strain; bottleneck size from Sobel Leonard 2017 if relevant |
| "HIV-TRACE 1.5% threshold outside subtype B?" | Acknowledged US-CDC subtype B origin; either use locally validated threshold or document caveat |
| "Source attribution sampling-intensity bias?" | Re-weighted reference collection or rarified; cite Mather 2013 limitation |
| "Was forward simulation run as a sanity check?" | SLiM / FAVITES / SEEDY if claims are high-stakes; routinely under-done in published transmission inference |
development
Find restriction enzyme cut sites in DNA sequences using Biopython Bio.Restriction. Search with single enzymes, batches of enzymes, or commercially available enzyme sets. Returns cut positions for linear or circular DNA. Use when finding restriction enzyme cut sites in sequences.
development
Create restriction maps showing enzyme cut positions on DNA sequences using Biopython Bio.Restriction. Visualize cut sites, calculate distances between sites, and generate text or graphical maps. Use when creating or analyzing restriction maps.
development
Analyze restriction digest fragments using Biopython Bio.Restriction. Predict fragment sizes, get fragment sequences, simulate gel electrophoresis patterns, and perform double digests. Use when analyzing restriction digest fragment patterns.
development
Select restriction enzymes by criteria using Biopython Bio.Restriction. Find enzymes that cut once, don't cut, produce specific overhangs, are commercially available, or have compatible ends for cloning. Use when selecting restriction enzymes for cloning or analysis.