phasing-imputation/foundations/SKILL.md
--- name: bio-phasing-imputation-foundations description: Frames the phasing/imputation pipeline before any tool runs: phasing and imputation are one Li-Stephens copying HMM (recombination is the transition, mutation the emission, the genetic map and Ne set the rates), imputation's honest output is a dosage with a self-estimated quality (INFO/R2/DR2) not a hard genotype, and the stages are ordered and each fails silently (QC, align build and strand to the panel, phase, impute per chromosome, fil
npx skillsauth add GPTomics/bioSkills phasing-imputation/foundationsInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: bcftools 1.19+.
Before using code patterns, verify installed versions match. If versions differ:
<tool> --version then <tool> --help to confirm flagsIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
This skill is the conceptual spine for the category; it sequences the pipeline and routes the per-stage work to the sibling skills rather than running the engines, so the binding versions are the bcftools/Beagle/SHAPEIT5/Minimac4/GLIMPSE stack the siblings share. A reference panel is a dated release on a specific genome build (GRCh37 vs GRCh38); record the panel name, build, and release date with every result, because the same code against a different panel release imputes different sites.
"Phase and impute my genotypes" -> Decide the genotyping strategy, order the stages, and carry a dosage (not a hard call) downstream - because phasing and imputation are one statistical machine whose output is a posterior, and every stage of the pipeline fails silently when misordered or misaligned.
Scope: the meta-decisions - which genotyping strategy (array vs low-coverage WGS), the stage ordering, why dosages carry the uncertainty, why the panel is the prior, and where each stage silently fails. Panel acquisition/selection -> reference-panels. Statistical phasing engines -> haplotype-phasing. Imputation engines and dosage fields -> genotype-imputation. Quality metrics and filtering -> imputation-qc. Read-backed / molecular single-sample phasing (long reads, Hi-C, 10x) -> long-read-sequencing/haplotype-phasing. The input VCF and its QC -> variant-calling/vcf-basics. The GWAS test on the dosages -> population-genetics/association-testing.
There is no separate "phasing algorithm" and "imputation algorithm" at the core: both model a target haplotype as an imperfect mosaic of reference haplotypes, where switching which reference is copied is a recombination (an HMM transition whose rate is set by the genetic map and Ne) and a site where the copy disagrees is a mutation (an emission whose rate shrinks as the panel grows) (Li & Stephens 2003 Genetics 165:2213). Phasing asks which haplotype pair explains the diploid genotypes; imputation aligns those haplotypes onto the panel and reads the copied allele at untyped sites. Three facts drive every decision:
E[#alt] in [0,2] (= P(het) + 2*P(hom-alt)), and the per-variant INFO/R2/DR2 is the model's self-estimate of how much real information that dosage carries (1 = perfect, 0 = nothing beyond allele frequency). Hard-thresholding the dosage into 0/1/2 and analyzing it as observed throws away the uncertainty imputation exists to quantify - it loses power at exactly the rare, low-R2 variants and, when imputation quality differs across batches, manufactures false associations that fail to replicate. Carry dosage (or the full genotype-probability triple) into a dosage-aware regression.A new haplotype is modeled as copying a panel of K reference haplotypes, switching templates along the chromosome. The hidden state at each marker is which template is being copied; the transition probability of switching rises with the genetic distance between markers (in centimorgans, NOT base pairs) and with Ne through the rho ~ 4*Ne recombination scaling; the emission allows a mismatch with a probability ~ theta/(theta+K) that shrinks as K grows (a bigger panel almost always has a near-perfect template, which is the quantitative root of "bigger panels impute better"). The forward-backward algorithm gives, at every site, the posterior over which template is copied; imputation sums that posterior to a dosage and reports its variance ratio as INFO/R2. IMPUTE/IMPUTE2, MaCH/minimac, Beagle, SHAPEIT, Eagle, and GLIMPSE are all this model with different approximations (which haplotypes to condition on, how PBWT makes the K-quadratic match-finding near-linear, EM vs MCMC, genotype-likelihood vs hard-call emission) - the differences are engineering, not different theories.
The upstream design decision, made before any engine is chosen, is how to generate the genotypes that will be imputed. It is not primarily an accuracy question; it is an ascertainment question.
| | SNP array + pre-phase + impute | Low-coverage WGS (~0.5-4x) + impute from genotype likelihoods | |---|---|---| | Input to the HMM | hard genotype calls (array error is tiny) | genotype LIKELIHOODS (PL/GL); a hard call at 1x is mostly noise | | Ascertainment | FIXED - only the SNPs the array was designed for, biased to the design population | UNBIASED - whatever is in the genome is observed | | Common variants | excellent and cheap | excellent | | Rare variants | limited by the array scaffold and panel | matches or beats dense arrays (Rubinacci 2021 Nat Genet 53:120) | | Under-represented ancestry | poor (no good array, panel-mismatched) | the main route around array/panel bias | | Tools | Beagle / Minimac4 / IMPUTE5 -> genotype-imputation | GLIMPSE2 (panel) / STITCH (no panel) -> genotype-imputation |
The judgment: common-variant GWAS in a well-paneled ancestry -> array plus imputation is cheap and adequate; rare variants, under-represented ancestry, or a need for unbiased genome-wide ascertainment -> low-coverage WGS plus genotype-likelihood imputation, the direction the field is moving as sequencing costs fall. Low-coverage WGS is not magic: it is only as good as its panel and its likelihoods (bad mapping, contamination, or damage produce garbage GLs that impute garbage).
| Tool / panel | Citation | Role | Owns / route to | |--------------|----------|------|------------------| | Beagle | Browning 2018 Am J Hum Genet 103:338 | one tool that BOTH phases and imputes; bref3 panel format | haplotype-phasing, genotype-imputation | | SHAPEIT5 | Hofmeister 2023 Nat Genet 55:1243 | scalable phasing incl. rare/singleton; split into phase_common/phase_rare/ligate | haplotype-phasing | | Eagle2 | Loh 2016 Nat Genet 48:1443 | PBWT reference-based phasing; the Michigan-server phaser | haplotype-phasing | | Minimac4 | Das 2016 Nat Genet 48:1284 | imputation engine; msav/m3vcf; the Michigan/TOPMed server engine | genotype-imputation | | IMPUTE5 | Rubinacci 2020 PLoS Genet 16:e1009049 | PBWT-based imputation for large panels | genotype-imputation | | GLIMPSE2 | Rubinacci 2023 Nat Genet 55:1088 | low-coverage WGS imputation from genotype likelihoods | genotype-imputation | | STITCH | Davies 2016 Nat Genet 48:965 | low-coverage imputation with NO reference panel (latent panel via EM) | genotype-imputation | | PBWT | Durbin 2014 Bioinformatics 30:1266 | the data structure making biobank-scale haplotype matching near-linear | (underlies the phasers/imputers) | | 1000G / HRC / TOPMed | 1000G 2015 Nature 526:68; McCarthy 2016 Nat Genet 48:1279; Taliun 2021 Nature 590:290 | the reference panels (the prior) | reference-panels |
| Scenario | Recommended | Why | |----------|-------------|-----| | SNP array, common-variant GWAS, well-paneled ancestry | array -> pre-phase -> impute against a matched panel | mature, cheap, accurate for common variants | | Rare variants, under-represented ancestry, or unbiased ascertainment wanted | low-coverage WGS -> GLIMPSE2 (or STITCH if no panel) -> genotype-imputation | removes array ascertainment bias; wins at rare variants | | Moderate cohort, want one tool for phase and impute | Beagle -> haplotype-phasing, genotype-imputation | does both with the bref3 panel format | | Biobank-scale cohort (100k+), phasing | SHAPEIT5 -> haplotype-phasing | scales; phases rare variants on a common scaffold | | Choosing/preparing the panel for a target ancestry | -> reference-panels | panel ancestry-match is the first-order accuracy lever | | Filtering imputed output before analysis | -> imputation-qc | INFO/R2/DR2 + MAF floor on the imputer's own field | | Single individual, long reads / Hi-C / linked reads | -> long-read-sequencing/haplotype-phasing | read-backed phasing, a physically different signal | | The association test on the dosages | -> population-genetics/association-testing | the GWAS test is downstream | | Establishing target ancestry for panel choice | -> population-genetics/population-structure | PCA, not phasing |
Default when uncertain: for a common-variant GWAS in a well-served ancestry, array plus imputation against a large matched panel; otherwise consider low-coverage WGS plus GLIMPSE2.
Phase information comes from three non-interchangeable physical sources; they resolve different things and the modern answer is to combine, not rank, them. This skill and its phasing sibling own the first; the third is a mode of the same tools; the second is a different category routed out.
Trigger: imputing GRCh37 array data against a GRCh38 panel, or leaving A/T and C/G (palindromic) SNPs unflipped. Mechanism: positions or alleles silently disagree with the panel, so the HMM copies the wrong templates. Symptom: imputation accuracy near zero over whole regions, with no error and a plausible-looking VCF. Fix: confirm the build, lift over if needed, and align strand/alleles to the panel (bcftools +fixref or the imputation-server check tool) before phasing -> reference-panels.
Trigger: thresholding DS into 0/1/2 and analyzing those as observed genotypes. Mechanism: thresholding discards the posterior uncertainty, worst at low-R2 rare variants. Symptom: lost power read as a true null, or false hits from batch-differential imputation quality. Fix: regress on dosage or the genotype-probability triple -> imputation-qc, population-genetics/association-testing.
Trigger: reporting a confident INFO/R2 for a sample whose ancestry the panel barely contains. Mechanism: R2 is the model grading its own posterior; it cannot see that no panel haplotype resembled the target. Symptom: high self-graded quality on systematically wrong copies. Fix: match the panel ancestry, prefer low-coverage WGS for under-served ancestries, and validate with masked/held-out truth stratified by frequency -> reference-panels, imputation-qc.
Trigger: reporting phasing by tool name with no switch-error rate. Mechanism: a switch error changes which haplotype an allele sits on without changing any genotype, so per-site QC is structurally blind to it. Symptom: genotype-concordant data that is phased garbage; a single switch inverts a compound-het (cis vs trans) conclusion. Fix: report a switch-error rate against a trio or read-backed truth set for any phase-dependent claim -> haplotype-phasing.
Trigger: assuming a large panel "recovers" ultra-rare or sample-private variants. Mechanism: a private allele is carried by no panel haplotype, so there is no template to copy - it is un-imputable by construction. Symptom: the variant is absent or imputed at near-zero R2. Fix: sequence (do not impute) private variants; use larger panels only for the merely-rare.
| Threshold | Source | Rationale | |-----------|--------|-----------| | Dosage range E[#alt] in [0,2], not {0,1,2} | Li & Stephens 2003 Genetics 165:2213; Marchini & Howie 2010 Nat Rev Genet 11:499 | the dosage is the posterior mean alt-count; analyze on dosage to propagate uncertainty | | INFO/R2 filter at 0.3 (common variants), 0.7-0.8 stricter | Marchini & Howie 2010 Nat Rev Genet 11:499 (framing) | R2<0.3 means the dosage carries little real information; the exact cutoff is a study choice, not a law -> imputation-qc | | Genetic distance in centimorgans (1 cM ~ 1% recombination/meiosis; 1 Morgan = 100 cM) | Li & Stephens 2003 Genetics 165:2213 | LS transitions use genetic, not physical, distance; a wrong-build or flat map silently degrades phasing | | Low-coverage WGS sweet spot ~0.5-4x (often ~1x) with a large panel | Rubinacci 2021 Nat Genet 53:120 | GLIMPSE operating range; below ~0.5x accuracy degrades, above ~4x hard-calling becomes viable | | HRC accurate to MAF ~0.1% | McCarthy 2016 Nat Genet 48:1279 | 64,976 haplotypes; larger panels (TOPMed) push the frequency floor lower | | Pre-phase the cohort once, re-impute against any panel | Howie 2012 Nat Genet 44:955 | imputation collapses from the diploid (K^2) to the haploid (K) problem; phasing is committed, so phase well |
| Error / symptom | Cause | Solution |
|-----------------|-------|----------|
| Imputation accuracy near zero across a region | genome build or strand mismatch to the panel | confirm build, lift over, align strand before phasing -> reference-panels |
| Hits do not replicate across batches | hard-called dosages plus batch-differential imputation quality | analyze dosages; impute batches together -> imputation-qc |
| All R2 low for a cohort | panel ancestry mismatch or wrong-build genetic map | use a matched panel; check the map build -> reference-panels |
| Rare-variant burden differs across cohorts | a single global R2 cutoff removes more rare variants than common, differently per panel | use frequency-stratified accuracy; match panels -> imputation-qc |
| Engine errors or mis-imputes at a site | multiallelic / non-biallelic input | bcftools norm -m -any to split and left-align first -> variant-calling/variant-normalization |
| Expected variant missing after imputation | private / panel-absent allele | un-imputable by construction; sequence it |
tools
Chooses the enrichment generation before any tool runs, mapping the input shape to a method class - a pre-selected gene list plus a background to over-representation analysis (ORA, hypergeometric), a ranked statistic for all genes to gene set enrichment (GSEA), a signed signaling topology to pathway-topology (SPIA) - then making the null explicit (competitive vs self-contained, gene vs subject sampling) and running a trustworthiness checklist (testable-gene universe, FDR, redundancy collapse, leading-edge check, version reporting). Covers why every clusterProfiler GSEA is the inter-gene-correlation-uncorrected competitive null, why the background not the gene list decides ORA significance, and why no method is universally best. Use when deciding ORA vs GSEA vs topology, which gene-set DB, whether a result is trustworthy, or which null a tool computes. For ORA see go-enrichment, GSEA see gsea, databases kegg-pathways/reactome-pathways/wikipathways; the ranking comes from differential-expression/de-results.
testing
End-to-end GWAS workflow from VCF to association results. Covers PLINK QC, population structure correction, and association testing for case-control or quantitative traits. Use when running genome-wide association studies.
development
Orchestrates the full path from differential expression results to redundancy-collapsed functional enrichment: choose ORA vs GSEA, convert gene IDs per method, run enrichGO/enrichKEGG/enrichPathway/enrichWP or gseGO/gseKEGG (clusterProfiler, ReactomePA, rWikiPathways), and visualize. Routes the ORA-vs-GSEA generation fork and the null/universe/reproducibility theory to pathway-analysis/enrichment-foundations. Use when a DESeq2/edgeR/limma result must become enriched GO terms, KEGG/Reactome/WikiPathways pathways, or a GSEA leading edge; when deciding whether a ranking exists for all genes (GSEA, named decreasing vector) or only a pre-selected list (ORA plus a defensible background universe); or when assembling DE-to-pathway end to end. The DE list and ranking statistic come from differential-expression/de-results; per-method nuance lives in the pathway-analysis skills.
development
End-to-end pooled and single-cell CRISPR screen analysis from FASTQ to hit genes. Orchestrates library design QC, guide counting, six-stage screen QC (plasmid Gini, replicate Pearson, CEGv2 PR-AUC, copy-number artifact), method-appropriate hit calling across MAGeCK RRA/MLE, BAGEL2, drugZ, JACKS, and Chronos, cancer-cell-line copy-number correction (CRISPRcleanR / Chronos), batch correction for multi-batch screens, and the specialized branches for combinatorial paralog screens, single-cell Perturb-seq, base-editor variant-function screens, prime-editor screens, and in vivo bottleneck-aware screens. Use when analyzing any pooled CRISPR screen end-to-end, choosing the correct hit-calling method by experimental design, integrating copy-number correction into the pipeline, or branching the workflow for single-cell, combinatorial, base-editor, prime-editor, or in vivo variants.