skills/tooluniverse-population-genetics/SKILL.md
Population genetics analysis — allele frequencies (gnomAD, 1000 Genomes), Hardy-Weinberg equilibrium testing, Fst between populations, GWAS associations, evolutionary constraint scores. Use for cross-population variant comparison, ancestry-aware allele frequency lookups, and population-level evolutionary analysis.
MC Strategy: Population genetics MC questions often test whether you know a specific theorem or result. COMPUTE the answer first (use popgen_calculator.py or write Python), then match to options. Don't try to reason about which option "sounds right."
Analyze population-level genetic variation, allele frequencies, GWAS associations, clinical significance, and evolutionary constraints using ToolUniverse tools.
Activate this skill when the user asks about:
Query gnomAD/1000Genomes/GWAS Catalog FIRST for allele frequencies and associations. Preferred: use the PopGen_hwe_test, PopGen_fst, PopGen_inbreeding, and PopGen_haplotype_count tools for HWE, Fst, inbreeding, and haplotype calculations. Fallback: run popgen_calculator.py directly. For theoretical problems (delta-q, drift, LD decay), apply the formulas in the Theoretical Reasoning section below.
| Tool | Key Parameters | Notes |
|------|---------------|-------|
| gnomad_search_variants | query (REQUIRED) | Resolve rsID to variant_id format "CHR-POS-REF-ALT" |
| gnomad_get_variant | variant_id (REQUIRED), dataset | Population frequencies. Default dataset: gnomad_r3; use gnomad_r4 for latest |
| gnomad_get_gene_constraints | gene_symbol (REQUIRED) | pLI, o/e ratios. May timeout -- retry once |
| MyVariant_query_variants | query (REQUIRED) | Aggregated: ClinVar + dbSNP + gnomAD + CADD. Uses hg19 coordinates |
| EnsemblVEP_annotate_rsid | variant_id (REQUIRED) | Functional consequence, SIFT, PolyPhen. Param is "variant_id" NOT "rsid" |
| EnsemblVEP_variant_recoder | variant_id (REQUIRED) | Convert between rsID/HGVS/VCF/SPDI |
| gwas_get_snps_for_gene | gene_symbol (REQUIRED) | All GWAS SNPs for a gene |
| gwas_search_associations | query (REQUIRED) | GWAS for a disease/trait (NOT gene name -- use gwas_get_snps_for_gene for genes) |
| gwas_get_variants_for_trait | trait (REQUIRED) | Variants associated with a trait |
| ClinVar_search_variants | gene, condition, significance | At least one filter required |
| RegulomeDB_query_variant | rsid (REQUIRED) | Regulatory scoring (1a=strongest to 7=minimal) |
- gnomAD variant IDs use the format "CHR-POS-REF-ALT" (no "chr" prefix). Always resolve rsIDs via gnomad_search_variants first.
- Use gwas_get_snps_for_gene for gene-based lookups.
- gwas_search_associations expects a disease or trait, not a gene name; for genes, use gwas_get_snps_for_gene instead.
- Tool responses come in different shapes (…, {data, metadata}, or {error}). Handle all three.

Variant frequency: gnomad_search_variants -> gnomad_get_variant(dataset="gnomad_r4") -> MyVariant_query_variants (1000G pop breakdowns) -> EnsemblVEP_annotate_rsid
GWAS for disease: gwas_search_associations -> gwas_get_variants_for_trait -> gnomad_get_variant for top hits -> EuropePMC_search_articles
Gene characterization: gnomad_get_gene_constraints -> gwas_get_snps_for_gene -> ClinVar_search_variants -> PubMed_search_articles
Pathogenicity assessment: EnsemblVEP_annotate_rsid -> MyVariant_query_variants (CADD, ClinVar) -> gnomad_get_variant (frequency) -> RegulomeDB_query_variant (if non-coding)
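The gnomAD ID convention used in these workflows ("CHR-POS-REF-ALT", no "chr" prefix) is easy to get wrong when coordinates come from a VCF or another tool. A minimal sketch of a normalizer follows; the function name `to_gnomad_variant_id` and the example coordinates are illustrative assumptions, not part of ToolUniverse.

```python
def to_gnomad_variant_id(chrom, pos, ref, alt):
    """Build a "CHR-POS-REF-ALT" string, dropping any leading "chr" prefix.

    Hypothetical helper for illustration; it only formats the string and
    does not check that the variant exists in gnomAD.
    """
    chrom = str(chrom)
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]
    return f"{chrom}-{int(pos)}-{ref.upper()}-{alt.upper()}"


# Made-up coordinates, used only to show the formatting.
print(to_gnomad_variant_id("chr7", 12345, "a", "t"))
print(to_gnomad_variant_id("X", "6789", "G", "GA"))
```

The resulting string is what gnomad_get_variant expects as variant_id; rsIDs still need to go through gnomad_search_variants first.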
These formulas are needed for quantitative population genetics problems. Work through step by step, showing intermediate values.
For a recessive deleterious allele (fitness: AA=1, Aa=1, aa=1-s):
delta_q = -s * q^2 * p / (1 - s * q^2)
where p = freq(A), q = freq(a), s = selection coefficient.
For dominant deleterious (AA=1, Aa=1-s, aa=1-s):
delta_q = -s * q * p^2 / (1 - s * q * (2 - q))
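One way to check a delta_q expression is to compare it with a single generation of selection computed directly from the genotype fitnesses. The sketch below does that for the recessive and dominant cases; for the dominant case the direct recursion gives a numerator of -s * q * p^2 over a mean fitness of 1 - s * q * (2 - q). All function names here are illustrative.

```python
def next_q(q, w_AA, w_Aa, w_aa):
    """Frequency of allele a after one generation of viability selection."""
    p = 1.0 - q
    w_bar = p * p * w_AA + 2 * p * q * w_Aa + q * q * w_aa  # mean fitness
    return (p * q * w_Aa + q * q * w_aa) / w_bar


def delta_q_recessive(q, s):
    """Closed form for fitnesses AA=1, Aa=1, aa=1-s."""
    p = 1.0 - q
    return -s * q**2 * p / (1.0 - s * q**2)


def delta_q_dominant(q, s):
    """Closed form for fitnesses AA=1, Aa=1-s, aa=1-s."""
    p = 1.0 - q
    return -s * q * p**2 / (1.0 - s * q * (2.0 - q))


q, s = 0.2, 0.1  # arbitrary illustrative values
print("recessive:", delta_q_recessive(q, s), next_q(q, 1, 1, 1 - s) - q)
print("dominant: ", delta_q_dominant(q, s), next_q(q, 1, 1 - s, 1 - s) - q)
```

Each printed pair should agree to floating-point precision, which is a quick sanity check before plugging question values into a closed form.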
For heterozygote advantage (AA=1-s1, Aa=1, aa=1-s2):
equilibrium: q_hat = s1 / (s1 + s2)
Example: plug in s1 and s2 from the question; for s1 = 0.1 and s2 = 0.3, q_hat = 0.1 / (0.1 + 0.3) = 0.25.
Selection against recessives is slow at low q because most a alleles hide in heterozygotes. Time to reduce q from q0 to qt: t ~ (1/qt - 1/q0) / s generations.
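The equilibrium and time-to-change results above can be sketched in a few lines. This is an illustrative check, not the skill's calculator: it computes q_hat for heterozygote advantage and compares the approximation t ~ (1/qt - 1/q0) / s with a generation-by-generation iteration of the recessive delta_q formula. Function names and parameter values are assumptions chosen for the example.

```python
def q_hat_overdominance(s1, s2):
    """Equilibrium freq(a) for fitnesses AA=1-s1, Aa=1, aa=1-s2."""
    return s1 / (s1 + s2)


def generations_approx(q0, qt, s):
    """Approximate generations to push a recessive deleterious allele from q0 to qt."""
    return (1.0 / qt - 1.0 / q0) / s


def generations_iterated(q0, qt, s):
    """Count generations by repeatedly applying the recessive delta_q formula."""
    q, t = q0, 0
    while q > qt:
        p = 1.0 - q
        q += -s * q * q * p / (1.0 - s * q * q)
        t += 1
    return t


print("q_hat:", q_hat_overdominance(0.1, 0.3))
print("approx generations:", generations_approx(0.01, 0.005, 0.1))
print("iterated generations:", generations_iterated(0.01, 0.005, 0.1))
```

Halving a rare recessive allele at s = 0.1 takes on the order of a thousand generations in this example, which illustrates why selection against recessives is slow at low q.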
Variance in allele frequency per generation: Var(delta_p) = pq / (2Ne)
Probability of fixation of a new neutral mutation: 1/(2*Ne)
Time to fixation (given it fixes): ~4*Ne generations for neutral alleles
Heterozygosity decay: H_t = H_0 * (1 - 1/(2*Ne))^t
After t generations, fraction of heterozygosity lost ~ 1 - e^(-t/(2*Ne))
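The drift quantities above translate directly into code. The following is a minimal sketch with illustrative function names; it also compares the exact heterozygosity loss with the exponential approximation.

```python
import math


def drift_variance(p, ne):
    """Var(delta_p) per generation: pq / (2Ne)."""
    return p * (1.0 - p) / (2.0 * ne)


def neutral_fixation_probability(ne):
    """Fixation probability of a new neutral mutation: 1 / (2Ne)."""
    return 1.0 / (2.0 * ne)


def neutral_fixation_time(ne):
    """Approximate generations to fixation, given that it fixes: 4Ne."""
    return 4.0 * ne


def heterozygosity(h0, ne, t):
    """H_t = H_0 * (1 - 1/(2Ne))^t."""
    return h0 * (1.0 - 1.0 / (2.0 * ne)) ** t


def fraction_lost_approx(ne, t):
    """Approximate fraction of heterozygosity lost: 1 - e^(-t/(2Ne))."""
    return 1.0 - math.exp(-t / (2.0 * ne))


ne, t = 100, 50  # illustrative values
exact_loss = 1.0 - heterozygosity(1.0, ne, t)
print("Var(delta_p) at p=0.5:", drift_variance(0.5, ne))
print("P(fix):", neutral_fixation_probability(ne))
print("exact loss:", exact_loss, "approx loss:", fraction_lost_approx(ne, t))
```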
Effective population size (Ne) adjustments:
Drift vs selection: Drift dominates when |s| < 1/(2*Ne). A variant with s=0.01 behaves neutrally in a population of Ne < 50.
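The drift-versus-selection rule of thumb is a one-line comparison. A small sketch, with hypothetical function names, that also returns the Ne below which a given s is effectively neutral:

```python
def drift_dominates(s, ne):
    """True when |s| < 1/(2Ne), i.e. the variant behaves as effectively neutral."""
    return abs(s) < 1.0 / (2.0 * ne)


def critical_ne(s):
    """Ne below which a variant with selection coefficient s is effectively neutral."""
    return 1.0 / (2.0 * abs(s))


print("critical Ne for s=0.01:", critical_ne(0.01))
print("Ne=40:", drift_dominates(0.01, 40))
print("Ne=1000:", drift_dominates(0.01, 1000))
```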
D = freq(AB) - freq(A)*freq(B), where A and B are alleles at two loci.
Decay with recombination: D_t = D_0 * (1 - r)^t, where r = recombination fraction, t = generations.
Half-life of LD: t_half = -ln(2) / ln(1-r) ~ 0.693/r generations (for small r).
r-squared (normalized LD): r^2 = D^2 / (pA * pa * pB * pb). Range 0-1.
Expected r^2 in finite population at equilibrium: E[r^2] = 1 / (1 + 4Ner) (for drift-recombination balance).
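A compact sketch of the LD formulas above, assuming haplotype and allele frequencies are already known. Function names are illustrative; the example values are chosen so the results are easy to verify by hand.

```python
import math


def ld_d(freq_ab, p_a, p_b):
    """D = freq(AB) - freq(A) * freq(B)."""
    return freq_ab - p_a * p_b


def ld_decay(d0, r, t):
    """D_t = D_0 * (1 - r)^t."""
    return d0 * (1.0 - r) ** t


def ld_half_life(r):
    """Generations for D to halve: -ln(2) / ln(1 - r)."""
    return -math.log(2.0) / math.log(1.0 - r)


def r_squared(d, p_a, p_b):
    """r^2 = D^2 / (pA * pa * pB * pb)."""
    return d**2 / (p_a * (1.0 - p_a) * p_b * (1.0 - p_b))


def expected_r2(ne, r):
    """Drift-recombination balance: E[r^2] = 1 / (1 + 4*Ne*r)."""
    return 1.0 / (1.0 + 4.0 * ne * r)


d = ld_d(0.5, 0.5, 0.5)  # only AB and ab haplotypes present: complete LD
print("D:", d, "r^2:", r_squared(d, 0.5, 0.5))
print("half-life at r=0.01:", ld_half_life(0.01), "vs 0.693/r:", 0.693 / 0.01)
print("E[r^2] for Ne=10000, r=1e-4:", expected_r2(10_000, 1e-4))
```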
Practical implications:
For alleles A (freq p) and a (freq q=1-p): expected genotypes AA=p^2, Aa=2pq, aa=q^2.
Chi-square test: df=1 (2 alleles). Preferred: use PopGen_hwe_test tool. Fallback: popgen_calculator.py --type hwe --AA N1 --Aa N2 --aa N3.
Causes of HWE departure: non-random mating, selection, migration, drift, genotyping error. Excess homozygotes -> inbreeding or population structure (Wahlund effect). Excess heterozygotes -> overdominant selection or negative assortative mating.
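For cases where neither the tool nor the script is available, the HWE chi-square test can be written in plain Python. This is a sketch under stated assumptions, not a copy of what popgen_calculator.py does: it assumes both alleles are present (otherwise an expected count is zero), uses no continuity correction, and gets the df=1 p-value from the identity P(chi2 > x) = erfc(sqrt(x/2)). The function name is hypothetical.

```python
import math


def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Return (chi2, p_value, p) for a biallelic HWE test with df=1."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)  # freq(A) from genotype counts
    q = 1.0 - p
    observed = (n_AA, n_Aa, n_aa)
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # chi-square survival function, df=1
    return chi2, p_value, p


# Illustrative counts: a heterozygote deficit (excess homozygotes).
chi2, p_value, p = hwe_chi_square(40, 20, 40)
print(f"chi2={chi2:.3f}, p-value={p_value:.3g}, freq(A)={p:.2f}")
```

A heterozygote deficit like this one points toward inbreeding or population structure, as listed above.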
For n SNPs between two inbred (homozygous) strains:
Use Python (python3 -c "...") for these counts. Never enumerate by hand.

Equilibrium frequency of a deleterious allele:
Preferred: use the PopGen_fst tool. Fallback: popgen_calculator.py --type fst --p1 X --p2 Y --n1 N1 --n2 N2

For any genetics cross problem, follow these steps IN ORDER. Do not skip steps.
For bacterial conjugation and Hfr mapping problems:
These are specific patterns that have caused reasoning failures in hard genetics questions. Review before answering genetics MCQs.
For "necessarily true" questions about PGS and heritability: a statement is necessarily true only if it holds when V_D=0 AND when V_D=V_G. Test the extremes.
Do NOT guess path signs from general knowledge. Signs may differ from well-known systems. Follow this protocol:
Compute chi-square from the expected ratio given in the question. Compare to chi-square-critical at df = (number of phenotype classes - 1). Pick the answer choice with the highest chi-square, but also check which pattern is biologically diagnostic of the alternative hypothesis.
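The goodness-of-fit computation described above can be sketched as follows. The function name is illustrative, the critical values are the standard alpha = 0.05 table entries for df 1 to 3, and the example counts are Mendel's classic dihybrid data tested against 9:3:3:1.

```python
# Standard chi-square critical values at alpha = 0.05.
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815}


def chi_square_gof(observed, ratio):
    """Chi-square and df for observed class counts against an expected ratio."""
    total = sum(observed)
    ratio_sum = sum(ratio)
    expected = [total * r / ratio_sum for r in ratio]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1  # number of phenotype classes minus 1
    return chi2, df


chi2, df = chi_square_gof([315, 108, 101, 32], [9, 3, 3, 1])
print(f"chi2={chi2:.3f}, df={df}, reject at 0.05: {chi2 > CRITICAL_05[df]}")
```

Running the same function once per answer choice makes it straightforward to compare chi-square values across candidate ratios.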
LD block boundaries at recombination hotspots are a source of GWAS false localization — strong signal in the block does not guarantee the causal variant is in the block.
Duplex sequencing (unique molecular identifiers + double-strand consensus) detects alleles at 0.01% frequency — far below standard NGS even at 80X depth. Simply increasing read depth does NOT help for ultra-rare variants because the Illumina error rate (~0.1%) masks variants rarer than ~1% regardless of depth. Error correction methods (UMIs, duplex consensus) are needed to distinguish true rare variants from sequencing errors.
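A back-of-the-envelope sketch of why depth alone does not help: the expected number of reads carrying a true variant and the expected number of error reads both scale linearly with depth, so their ratio stays fixed. This is a deliberately simplified model (every sequencing error is treated as a miscall at the site, and sampling noise is ignored); the function name and depths are illustrative.

```python
def expected_read_counts(depth, variant_freq, error_rate):
    """Expected true-variant reads and error reads at one site (simplified model)."""
    true_reads = depth * variant_freq
    error_reads = depth * error_rate
    return true_reads, error_reads


# Variant at 0.01% frequency against a ~0.1% per-base error rate.
for depth in (80, 8_000, 800_000):
    true_reads, error_reads = expected_read_counts(depth, 1e-4, 1e-3)
    print(depth, true_reads, error_reads, true_reads / error_reads)
```

The signal-to-error ratio is the same at every depth, which is why error-correcting approaches that lower the effective error rate are needed for ultra-rare variants.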
Script: skills/tooluniverse-population-genetics/scripts/popgen_calculator.py
Preferred: Use ToolUniverse tools (via MCP/SDK) instead of the script when possible:
- PopGen_hwe_test tool -- HWE chi-square test. Fallback: popgen_calculator.py --type hwe
- PopGen_fst tool -- Weir-Cockerham Fst. Fallback: popgen_calculator.py --type fst
- PopGen_inbreeding tool -- Inbreeding coefficient from pedigree. Fallback: popgen_calculator.py --type inbreeding
- PopGen_haplotype_count tool -- Expected haplotype diversity. Fallback: popgen_calculator.py --type haplotypes

Fallback script modes (all require --type):
- hwe: --AA N --Aa N --aa N -- chi-square HWE test with p-value
- fst: --p1 F --p2 F --n1 N --n2 N -- Weir-Cockerham Fst
- inbreeding: --pedigree TYPE --generations G -- F from pedigree (self, full-sib, half-sib, first-cousin, etc.)
- haplotypes: --snps N --generations G --recomb_rate R -- expected haplotype diversity