phasing-imputation/reference-panels/SKILL.md
Selects and prepares the reference panel that phasing/imputation copies haplotypes from (1000 Genomes, HRC, TOPMed, HGDP+1kGP/gnomAD, CAAPA), matching panel ancestry to the target, reconciling genome build and chromosome naming, and running the strand/allele harmonization gate. Covers why ancestry-match beats panel size (imputation can only copy haplotypes the panel contains), why palindromic A/T and C/G SNPs flip strand without erroring, why liftover is a strand-flip generator in between-build inverted regions, that HRC is SNP-only and TOPMed is never downloadable (governance can override accuracy), and panel formats (msav, bref3, imp5). Use when choosing a panel for a target ancestry, preparing or converting a panel, aligning study data, or deciding between downloadable and server-only panels. Phasing is haplotype-phasing; imputation is genotype-imputation; strategy is foundations; PCA for ancestry is population-genetics/population-structure; HLA panels are clinical-databases/hla-typing.
npx skillsauth add GPTomics/bioSkills bio-phasing-imputation-reference-panelsInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: bcftools 1.19+, PLINK 1.9+.
Before using code patterns, verify installed versions match. If versions differ:
<tool> --version then <tool> --help to confirm flagsIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
A reference panel is DATA, not a tool: it has a dated release and a fixed genome build (GRCh37 or GRCh38). Record the exact panel name, version, and build with every result; "1000 Genomes" without a version is unreproducible because Phase 3 (GRCh37, low-coverage) and the high-coverage NYGC 3202 release (GRCh38, 30x) are different call sets. HRC and TOPMed are server-only (not downloadable); panel-build tools (Minimac4 --compress-reference, bref3.jar, imp5Converter) and the Will Rayner harmonization check are separate downloads.
"Which reference panel should I use, and how do I prepare it?" -> Match the panel's ancestry to the target population, reconcile build and strand, then convert to the engine's format - because imputation can only copy haplotypes the panel contains, so the panel IS the prior, and a mismatched ancestry or a flipped strand corrupts the result without any error.
bcftools norm -m -any -f ref.fa then the strand/allele harmonization check against the panel sites, then minimac4 --compress-reference / bref3.jar / imp5Converter to build the engine formatScope: panel selection (ancestry-match, build, access), strand/allele harmonization, build/liftover, and format conversion. The pipeline order and strategy fork -> foundations. The phasing engine that consumes the panel -> haplotype-phasing. Imputation -> genotype-imputation. PCA to establish the target ancestry -> population-genetics/population-structure. Classical HLA-allele imputation needs a dedicated HLA panel -> clinical-databases/hla-typing. VCF normalization mechanics -> variant-calling/variant-normalization.
The reflex "TOPMed has 97k samples, HRC has 32k, so use TOPMed" is right for a European cohort and wrong for an ancestry-mismatched one, and the reason is mechanical, not statistical: imputation copies haplotype segments from panel samples that resemble the target, so if no panel sample carries the target population's haplotypes there is nothing to copy, and adding ten thousand more European haplotypes does nothing for an East African sample (Marchini & Howie 2010 Nat Rev Genet 11:499). The binding resource is not panel size but how many panel samples share the target's ancestry. Three facts follow:
Numbers are routinely misquoted; these are the verified figures. State the exact version and build in any method.
| Panel | Samples | Build | Indels | Diversity | Access | |-------|---------|-------|--------|-----------|--------| | 1000G Phase 3 (Auton 2015) | 2,504 | GRCh37 (GRCh38 lift) | yes | 26 pops, broad but shallow | public download | | 1000G high-cov NYGC (Byrska-Bishop 2022) | 3,202 (incl. 602 trios) | GRCh38 | yes | same 26 pops, 30x | public download | | HRC r1.1 (McCarthy 2016) | 32,470 (64,940 haps) | GRCh37 only | NO - SNP-only, MAF floor ~5e-4 | European-heavy | server-only (Michigan) | | TOPMed r2 (Taliun 2021) | 97,256 (~308M sites) | GRCh38 only | yes | very diverse (large AA, Hispanic) | server-only, never downloadable | | HGDP+1kGP / gnomAD (Koenig 2024) | 4,094 (76-80 pops) | GRCh38 | yes | maximally diverse per-sample | public download | | CAAPA (Mathias 2016) | 883 African-ancestry | GRCh37 | (SNP) | African / African-American | server-supported |
The "1000G" trap: Phase 3 (Auton 2015, GRCh37, low-coverage) and the high-coverage NYGC 3202 release (Byrska-Bishop 2022, GRCh38, 30x, with 602 trios) are different call sets; the NYGC release is strictly better for rare variants. State which.
| Scenario | Recommended | Why | |----------|-------------|-----| | Study is GRCh37 | HRC r1.1, 1000G Phase 3, or CAAPA (all GRCh37) | match the build; avoid liftover | | Study is GRCh38 | TOPMed, 1000G NYGC, or HGDP+1kGP (all GRCh38) | match the build; avoid liftover | | Build mismatch unavoidable | lift over ONCE, strand-aware, then re-run the harmonization check | liftover flips strand in inverted regions (see failure modes) | | European cohort, common-variant GWAS | HRC (server) or 1000G | large and European-rich | | African / admixed / Hispanic / multi-ancestry | TOPMed (server) | diversity wins; contains the matching haplotypes | | Need a downloadable, diverse, local panel | HGDP+1kGP (gnomAD) | global, jointly-called, and not server-gated | | Data cannot leave the institution / country | downloadable panels only (1000G, HGDP+1kGP) | governance overrides accuracy (see failure modes) | | Need indels imputed | 1000G, TOPMed, or HGDP+1kGP | HRC is SNP-only | | Classical HLA alleles | -> clinical-databases/hla-typing (dedicated HLA panel) | standard SNP panels cannot impute HLA alleles | | Establish the target ancestry first | -> population-genetics/population-structure | PCA, not a panel operation |
The governing principle: ancestry match beats panel size, and governance (can the data be uploaded to a US server?) often narrows the field before accuracy does.
This is the step that silently corrupts results when skipped. The job: align every study variant's alleles to the panel REF/ALT, fix strand, and drop the variants that cannot be safely resolved. Normalize first (bcftools norm -m -any -f ref.fa), because the same indel represented two ways will not match.
The field-standard gate is Will Rayner's check (HRC-1000G-check-bim.pl and its bgen/VCF variants): it compares a QC'd PLINK .bim plus an allele-frequency file against the panel's sites list and EMITS a Run-plink.sh that updates positions, ref/alt, and strand, removes unresolvable SNPs, and splits by chromosome. The check diagnoses; the script fixes - both are required, and a surprising number of pipelines run the check and never execute the script.
The allele-frequency concordance plot (study AF vs panel AF) is the visual gate, not decoration: a tight diagonal is good; points on the y = 1 - x anti-diagonal are strand flips; a general smear is sample mislabeling or the wrong panel ancestry. Compare against the ancestry-MATCHED sub-panel's frequencies - an African cohort vs a European panel's AF smears even with perfect strand. Read this plot before uploading, every time.
| Engine | Native format | Build command |
|--------|---------------|---------------|
| Minimac4 | .msav (current) or legacy .m3vcf | minimac4 --compress-reference ref.vcf.gz > ref.msav (legacy: Minimac3 --processReference) |
| Beagle 5.x | .bref3 or plain VCF | java -jar bref3.jar ref.vcf.gz > ref.bref3 (needs fully phased, non-missing, |-separated, per chromosome) |
| IMPUTE5 | .imp5 or VCF/BCF | imp5Converter --h ref.vcf.gz --r chr20 --o ref.chr20.imp5 |
A panel is half the input; phasing/imputation also needs a genetic (recombination) map matched to the panel build. A panel VCF comes site-only (the legend, used for the harmonization check) or full (the haplotypes, used to impute) - obtain both. The map is build-specific: an hg19 map with a GRCh38 panel silently mis-places recombination rates.
Trigger: keeping strand-ambiguous SNPs without a frequency-based strand check. Mechanism: A/T and C/G alleles are their own reverse complements, so opposite-strand study and panel still "match" on alleles; the variant passes every join and imputes cleanly while allele-swapped. Symptom: flipped effect direction at that locus and everything imputed in LD with it; no error. Fix: resolve strand by allele frequency; drop palindromic SNPs with MAF > 0.4 (cannot be disambiguated near 0.5); treat "I kept all palindromic SNPs" as proof strand was never checked.
Trigger: running liftOver to reach a panel in the other build and imputing without re-checking. Mechanism: ~2-5 Mb of the genome is inverted between GRCh37 and GRCh38 (BBIS regions); lifting a variant there changes its strand, and the allele-based check cannot see it on a palindrome (Sheng & Chiang 2023 HGG Adv 4:100159). Symptom: silent allele-swaps in inverted regions; the TOPMed server's own internal conversion had this bug. Fix: prefer a panel native to the study build; if forced, lift once with a strand-aware method, then re-run the harmonization check against the new build.
Trigger: 1 vs chr1 between study and panel. Mechanism: GRCh38/TOPMed use chr prefixes, GRCh37 panels do not, so the join matches nothing. Symptom: "0 variants matched" and a wasted day; does not corrupt, just fails. Fix: bcftools annotate --rename-chrs before any check.
Trigger: imputing an indel against HRC, or a variant rarer than the panel floor. Mechanism: HRC is SNP-only; its MAC>=5 cutoff means nothing below MAF ~5e-4 is in the panel; un-present variants cannot be imputed at any quality. Symptom: the variant is absent or near-zero R2, misread as "imputed poorly." Fix: use a panel that contains the variant class (1000G/TOPMed/gnomAD for indels; a WGS panel for rarer variants).
Trigger: planning to use TOPMed/HRC for data that cannot be uploaded. Mechanism: TOPMed is never downloadable and both are server-only; consent/data-residency/IRB rules may forbid uploading participant genotypes to a US server. Symptom: the best panel is legally unusable. Fix: use a downloadable panel (1000G, HGDP+1kGP) and impute locally.
| Threshold | Source | Rationale | |-----------|--------|-----------| | Drop palindromic (A/T, C/G) SNPs with MAF > 0.4 | Rayner check default | strand unresolvable from alleles and frequency too near 0.5 to disambiguate | | Allele-frequency concordance flag > 0.2 (stringent 0.1) | Rayner check default | a large study-vs-panel AF gap signals a strand, build, or ancestry problem | | HRC MAF floor ~5e-4 (MAC>=5 / 32,470) | McCarthy 2016 Nat Genet 48:1279 | nothing rarer is in the panel and cannot be imputed | | Male nonPAR chrX coded haploid | biological ploidy | a het call in male nonPAR is an error and mis-models every male | | Genetic map must match the panel build | Li & Stephens 2003 Genetics 165:2213 | an hg19 map on a GRCh38 panel mis-places recombination silently | | Match AF comparison to the ancestry-matched sub-panel | Marchini & Howie 2010 Nat Rev Genet 11:499 | true frequencies differ by ancestry; a mismatch smears the plot even with perfect strand |
| Error / symptom | Cause | Solution |
|-----------------|-------|----------|
| "0 variants matched" the panel | chr naming (1 vs chr1) | bcftools annotate --rename-chrs |
| Flipped effect direction at some loci | unresolved palindromic strand | run the harmonization check; drop A/T,C/G MAF>0.4; execute Run-plink.sh |
| Indels missing after imputation | HRC is SNP-only | use 1000G/TOPMed/gnomAD |
| AF concordance plot smears off-diagonal | wrong panel ancestry or sample mislabel | compare to the matched sub-panel; check sample labels |
| Cannot download HRC/TOPMed (403) | server-only / access-controlled | use the imputation server, or a downloadable panel |
| Engine errors building bref3 | unphased or missing genotypes in the panel VCF | bref3 needs fully phased, non-missing, |-separated input |
| Imputation degraded in one region after liftover | BBIS inverted region strand flip | use a native-build panel; re-check after any liftover |
| Tempted to impute ancestry subgroups of one cohort separately | re-creates the differential-imputation confound (batch-differential quality) | impute all samples together against one large diverse panel (TOPMed or HGDP+1kGP), not per-stratum -> imputation-qc |
tools
--- name: bio-phasing-imputation-foundations description: Frames the phasing/imputation pipeline before any tool runs: phasing and imputation are one Li-Stephens copying HMM (recombination is the transition, mutation the emission, the genetic map and Ne set the rates), imputation's honest output is a dosage with a self-estimated quality (INFO/R2/DR2) not a hard genotype, and the stages are ordered and each fails silently (QC, align build and strand to the panel, phase, impute per chromosome, fil
tools
Chooses the enrichment generation before any tool runs, mapping the input shape to a method class - a pre-selected gene list plus a background to over-representation analysis (ORA, hypergeometric), a ranked statistic for all genes to gene set enrichment (GSEA), a signed signaling topology to pathway-topology (SPIA) - then making the null explicit (competitive vs self-contained, gene vs subject sampling) and running a trustworthiness checklist (testable-gene universe, FDR, redundancy collapse, leading-edge check, version reporting). Covers why every clusterProfiler GSEA is the inter-gene-correlation-uncorrected competitive null, why the background not the gene list decides ORA significance, and why no method is universally best. Use when deciding ORA vs GSEA vs topology, which gene-set DB, whether a result is trustworthy, or which null a tool computes. For ORA see go-enrichment, GSEA see gsea, databases kegg-pathways/reactome-pathways/wikipathways; the ranking comes from differential-expression/de-results.
testing
End-to-end GWAS workflow from VCF to association results. Covers PLINK QC, population structure correction, and association testing for case-control or quantitative traits. Use when running genome-wide association studies.
development
Orchestrates the full path from differential expression results to redundancy-collapsed functional enrichment: choose ORA vs GSEA, convert gene IDs per method, run enrichGO/enrichKEGG/enrichPathway/enrichWP or gseGO/gseKEGG (clusterProfiler, ReactomePA, rWikiPathways), and visualize. Routes the ORA-vs-GSEA generation fork and the null/universe/reproducibility theory to pathway-analysis/enrichment-foundations. Use when a DESeq2/edgeR/limma result must become enriched GO terms, KEGG/Reactome/WikiPathways pathways, or a GSEA leading edge; when deciding whether a ranking exists for all genes (GSEA, named decreasing vector) or only a pre-selected list (ORA plus a defensible background universe); or when assembling DE-to-pathway end to end. The DE list and ranking statistic come from differential-expression/de-results; per-method nuance lives in the pathway-analysis skills.