plugin/skills/tooluniverse-population-genetics/SKILL.md
Population genetics analysis — allele frequencies (gnomAD, 1000 Genomes), Hardy-Weinberg equilibrium testing, Fst between populations, GWAS associations, evolutionary constraint scores. Use for cross-population variant comparison, ancestry-aware allele frequency lookups, and population-level evolutionary analysis.
Install: npx skillsauth add mims-harvard/tooluniverse tooluniverse-population-genetics
Installs this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
MC Strategy: Population genetics MC questions often test whether you know a specific theorem or result. COMPUTE the answer first (use popgen_calculator.py or write Python), then match to options. Don't try to reason about which option "sounds right."
Analyze population-level genetic variation, allele frequencies, GWAS associations, clinical significance, and evolutionary constraints using ToolUniverse tools.
Activate this skill when the user asks about:
Query gnomAD/1000Genomes/GWAS Catalog FIRST for allele frequencies and associations. Preferred: use the PopGen_hwe_test, PopGen_fst, PopGen_inbreeding, and PopGen_haplotype_count tools for HWE, Fst, inbreeding, and haplotype calculations. Fallback: run popgen_calculator.py directly. For theoretical problems (delta-q, drift, LD decay), apply the formulas in the Theoretical Reasoning section below.
| Tool | Key Parameters | Notes |
|------|---------------|-------|
| gnomad_search_variants | query (REQUIRED) | Resolve rsID to variant_id format "CHR-POS-REF-ALT" |
| gnomad_get_variant | variant_id (REQUIRED), dataset | Population frequencies. Default dataset: gnomad_r3; use gnomad_r4 for latest |
| gnomad_get_gene_constraints | gene_symbol (REQUIRED) | pLI, o/e ratios. May timeout -- retry once |
| MyVariant_query_variants | query (REQUIRED) | Aggregated: ClinVar + dbSNP + gnomAD + CADD. Uses hg19 coordinates |
| EnsemblVEP_annotate_rsid | variant_id (REQUIRED) | Functional consequence, SIFT, PolyPhen. Param is "variant_id" NOT "rsid" |
| EnsemblVEP_variant_recoder | variant_id (REQUIRED) | Convert between rsID/HGVS/VCF/SPDI |
| gwas_get_snps_for_gene | gene_symbol (REQUIRED) | All GWAS SNPs for a gene |
| gwas_search_associations | query (REQUIRED) | GWAS for a disease/trait (NOT gene name -- use gwas_get_snps_for_gene for genes) |
| gwas_get_variants_for_trait | trait (REQUIRED) | Variants associated with a trait |
| ClinVar_search_variants | gene, condition, significance | At least one filter required |
| RegulomeDB_query_variant | rsid (REQUIRED) | Regulatory scoring (1a=strongest to 7=minimal) |
"CHR-POS-REF-ALT" (no "chr" prefix). Always resolve rsIDs via gnomad_search_variants first.gwas_get_snps_for_gene for gene-based lookups.gwas_get_snps_for_gene instead.{data, metadata}, or {error}). Handle all three.Variant frequency: gnomad_search_variants -> gnomad_get_variant(dataset="gnomad_r4") -> MyVariant_query_variants (1000G pop breakdowns) -> EnsemblVEP_annotate_rsid
GWAS for disease: gwas_search_associations -> gwas_get_variants_for_trait -> gnomad_get_variant for top hits -> EuropePMC_search_articles
Gene characterization: gnomad_get_gene_constraints -> gwas_get_snps_for_gene -> ClinVar_search_variants -> PubMed_search_articles
Pathogenicity assessment: EnsemblVEP_annotate_rsid -> MyVariant_query_variants (CADD, ClinVar) -> gnomad_get_variant (frequency) -> RegulomeDB_query_variant (if non-coding)
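The variant ID format and the response-shape caveat can be handled with two small helpers. This is a minimal sketch: `to_gnomad_variant_id` and `unwrap_response` are hypothetical names, not ToolUniverse functions, and the assumption that a response is a plain dict carrying a `data` or `error` key is inferred from the shapes this skill lists.

```python
def to_gnomad_variant_id(chrom, pos, ref, alt):
    """Build a "CHR-POS-REF-ALT" ID, dropping any "chr" prefix."""
    chrom = str(chrom)
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]
    return f"{chrom}-{int(pos)}-{ref}-{alt}"


def unwrap_response(response):
    """Return the payload of a tool response, or raise on an error shape.

    Assumption: responses are dicts with "error" or "data" keys;
    anything else is treated as the payload itself.
    """
    if isinstance(response, dict):
        if "error" in response:
            raise RuntimeError(f"tool error: {response['error']}")
        if "data" in response:
            return response["data"]
    return response


# Illustrative coordinates only, not a lookup result.
variant_id = to_gnomad_variant_id("chr17", 7579472, "G", "C")
```

Raising on the error shape keeps a failed lookup from being passed silently to the next tool in a workflow chain.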
These formulas are needed for quantitative population genetics problems. Work through step by step, showing intermediate values.
For a recessive deleterious allele (fitness: AA=1, Aa=1, aa=1-s):
delta_q = -s * q^2 * p / (1 - s * q^2)
where p = freq(A), q = freq(a), s = selection coefficient.
For dominant deleterious (AA=1, Aa=1-s, aa=1-s):
delta_q = -s * q * p^2 / (1 - s * q * (2 - q))
For heterozygote advantage (AA=1-s1, Aa=1, aa=1-s2):
equilibrium: q_hat = s1 / (s1 + s2)
Example: with s1 = 0.1 and s2 = 0.3, q_hat = 0.1 / (0.1 + 0.3) = 0.25. In a question, plug in the given s1 and s2 the same way.
Selection against recessives is slow at low q because most a alleles hide in heterozygotes. Time to reduce q from q0 to qt: t ~ (1/qt - 1/q0) / s generations.
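The selection results above can be checked numerically from the general one-locus recursion rather than from memorized closed forms. The sketch below assumes discrete generations, random mating, and viability selection; `delta_q` is a hypothetical helper name and the parameter values are illustrative.

```python
def delta_q(q, w_AA, w_Aa, w_aa):
    """One-generation change in freq(a) given genotype fitnesses."""
    p = 1.0 - q
    mean_w = p * p * w_AA + 2 * p * q * w_Aa + q * q * w_aa
    q_next = (p * q * w_Aa + q * q * w_aa) / mean_w
    return q_next - q


# Recessive deleterious allele: recursion vs. the closed form.
s, q = 0.2, 0.1
p = 1.0 - q
from_recursion = delta_q(q, 1.0, 1.0, 1.0 - s)
closed_form = -s * q**2 * p / (1 - s * q**2)

# Heterozygote advantage: iterate until q settles near s1 / (s1 + s2).
s1, s2 = 0.2, 0.3
q_eq = 0.05
for _ in range(2000):
    q_eq += delta_q(q_eq, 1.0 - s1, 1.0, 1.0 - s2)

print(from_recursion, closed_form, q_eq, s1 / (s1 + s2))
```

Working from the recursion means any fitness scheme in a question can be plugged in directly, and the closed forms serve as a cross-check.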
Variance in allele frequency per generation: Var(delta_p) = pq / (2Ne)
Probability of fixation of a new neutral mutation: 1/(2*Ne)
Time to fixation (given it fixes): ~4*Ne generations for neutral alleles
Heterozygosity decay: H_t = H_0 * (1 - 1/(2*Ne))^t
After t generations, fraction of heterozygosity lost ~ 1 - e^(-t/(2*Ne))
Effective population size (Ne) adjustments:
Drift vs selection: Drift dominates when |s| < 1/(2*Ne). A variant with s=0.01 behaves neutrally in a population of Ne < 50.
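The drift formulas above translate directly into a few helpers. This is a sketch with hypothetical function names and illustrative values for Ne and t; it compares the exact heterozygosity decay with the exponential approximation.

```python
import math


def heterozygosity(h0, ne, t):
    """H_t = H_0 * (1 - 1/(2*Ne))^t."""
    return h0 * (1.0 - 1.0 / (2.0 * ne)) ** t


def fraction_lost_approx(ne, t):
    """Approximate fraction of heterozygosity lost after t generations."""
    return 1.0 - math.exp(-t / (2.0 * ne))


def neutral_fixation_probability(ne):
    """Fixation probability of a new neutral mutation, 1/(2*Ne)."""
    return 1.0 / (2.0 * ne)


def drift_dominates(s, ne):
    """True when |s| < 1/(2*Ne), i.e. the variant is effectively neutral."""
    return abs(s) < 1.0 / (2.0 * ne)


ne, t = 100, 50
exact_lost = 1.0 - heterozygosity(1.0, ne, t)
approx_lost = fraction_lost_approx(ne, t)
print(exact_lost, approx_lost, drift_dominates(0.01, 40))
```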
D = freq(AB) - freq(A)*freq(B), where A and B are alleles at two loci.
Decay with recombination: D_t = D_0 * (1 - r)^t, where r = recombination fraction, t = generations.
Half-life of LD: t_half = -ln(2) / ln(1-r) ~ 0.693/r generations (for small r).
r-squared (normalized LD): r^2 = D^2 / (pA * pa * pB * pb). Range 0-1.
Expected r^2 in a finite population at equilibrium (drift-recombination balance): E[r^2] = 1 / (1 + 4*Ne*r), where r on the right-hand side is the recombination fraction.
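The LD quantities above can be computed as follows. This is a sketch with hypothetical function names; the haplotype and allele frequencies are illustrative.

```python
import math


def ld_stats(f_AB, p_A, p_B):
    """Return (D, r^2) from the AB haplotype frequency and allele freqs."""
    d = f_AB - p_A * p_B
    r2 = d * d / (p_A * (1.0 - p_A) * p_B * (1.0 - p_B))
    return d, r2


def ld_after(d0, r, t):
    """D_t = D_0 * (1 - r)^t."""
    return d0 * (1.0 - r) ** t


def ld_half_life(r):
    """Generations for D to halve: -ln(2) / ln(1 - r)."""
    return -math.log(2.0) / math.log(1.0 - r)


def expected_r2(ne, r):
    """Drift-recombination balance: 1 / (1 + 4*Ne*r)."""
    return 1.0 / (1.0 + 4.0 * ne * r)


d, r2 = ld_stats(f_AB=0.5, p_A=0.5, p_B=0.5)  # complete LD
half_life = ld_half_life(0.01)
print(d, r2, half_life, expected_r2(10000, 0.0001))
```

For r = 0.01 the exact half-life is close to the 0.693/r approximation of about 69 generations.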
Practical implications:
For alleles A (freq p) and a (freq q=1-p): expected genotypes AA=p^2, Aa=2pq, aa=q^2.
Chi-square test: df=1 (2 alleles). Preferred: use PopGen_hwe_test tool. Fallback: popgen_calculator.py --type hwe --AA N1 --Aa N2 --aa N3.
Causes of HWE departure: non-random mating, selection, migration, drift, genotyping error. Excess homozygotes -> inbreeding or population structure (Wahlund effect). Excess heterozygotes -> overdominant selection or negative assortative mating.
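When neither the tool nor the script is available, the HWE chi-square test can be reproduced in a few lines of standard-library Python. This is an independent sketch, not the implementation behind PopGen_hwe_test or popgen_calculator.py; the function name and genotype counts are illustrative. For df = 1 the p-value is erfc(sqrt(chi2 / 2)).

```python
import math


def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Return (p, chi2, p_value) for a biallelic HWE test with df = 1.

    Assumes both alleles are present, so no expected count is zero.
    """
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)  # freq(A) from genotype counts
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_AA, n_Aa, n_aa)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return p, chi2, p_value


p, chi2, p_value = hwe_chi_square(400, 400, 200)
excess_homozygotes = 400 < 2 * 1000 * p * (1 - p)  # observed Aa below expected
print(p, chi2, p_value, excess_homozygotes)
```

In this example the heterozygote count falls below the expected 480, the pattern associated above with inbreeding or population structure.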
For n SNPs between two inbred (homozygous) strains:
Use Python (python3 -c "...") for these counts. Never enumerate by hand.
Equilibrium frequency of a deleterious allele:
Fst: Preferred: use PopGen_fst tool. Fallback: popgen_calculator.py --type fst --p1 X --p2 Y --n1 N1 --n2 N2
For any genetics cross problem, follow these steps IN ORDER. Do not skip steps.
For bacterial conjugation and Hfr mapping problems:
These are specific patterns that have caused reasoning failures in hard genetics questions. Review before answering genetics MCQs.
For "necessarily true" questions about PGS and heritability: a statement is necessarily true only if it holds when V_D=0 AND when V_D=V_G. Test the extremes.
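One way to apply the test-the-extremes rule is to tabulate broad-sense and narrow-sense heritability at both ends. The sketch assumes V_G = V_A + V_D (no epistasis) and V_P = V_G + V_E; the function name and variance values are illustrative.

```python
def heritabilities(v_g, v_e, v_d):
    """Return (H2, h2) assuming V_G = V_A + V_D and V_P = V_G + V_E."""
    v_a = v_g - v_d
    v_p = v_g + v_e
    return v_g / v_p, v_a / v_p


v_g, v_e = 0.5, 0.5
H2_add, h2_add = heritabilities(v_g, v_e, v_d=0.0)  # purely additive
H2_dom, h2_dom = heritabilities(v_g, v_e, v_d=v_g)  # purely dominance

# A candidate statement is "necessarily true" only if it holds in both cases.
h2_equals_H2 = (h2_add == H2_add) and (h2_dom == H2_dom)
h2_at_most_H2 = (h2_add <= H2_add) and (h2_dom <= H2_dom)
print(h2_equals_H2, h2_at_most_H2)
```

Here "h2 equals H2" fails at the V_D = V_G extreme while "h2 is at most H2" survives both, so only the latter qualifies as necessarily true under these assumptions.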
Do NOT guess path signs from general knowledge. Signs may differ from well-known systems. Follow this protocol:
Compute chi-square from the expected ratio given in the question. Compare to chi-square-critical at df = (number of phenotype classes - 1). Pick the answer choice with the highest chi-square, but also check which pattern is biologically diagnostic of the alternative hypothesis.
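The goodness-of-fit calculation described above can be sketched as follows. The function name is hypothetical, the observed counts are illustrative, and the critical values are the standard 0.05 thresholds for df 1 to 3.

```python
# Chi-square critical values at alpha = 0.05.
CHI2_CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815}


def chi_square_from_ratio(observed, ratio):
    """Return (chi2, df) for observed class counts vs. an expected ratio."""
    total = sum(observed)
    ratio_sum = sum(ratio)
    expected = [total * r / ratio_sum for r in ratio]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1  # number of phenotype classes - 1
    return chi2, df


# Four phenotype classes tested against a 9:3:3:1 expectation.
chi2, df = chi_square_from_ratio([315, 108, 101, 32], [9, 3, 3, 1])
rejects = chi2 > CHI2_CRITICAL_05[df]
print(round(chi2, 3), df, rejects)
```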
LD block boundaries at recombination hotspots are a source of GWAS false localization — strong signal in the block does not guarantee the causal variant is in the block.
Duplex sequencing (unique molecular identifiers + double-strand consensus) detects alleles at 0.01% frequency — far below standard NGS even at 80X depth. Simply increasing read depth does NOT help for ultra-rare variants because the Illumina error rate (~0.1%) masks variants rarer than ~1% regardless of depth. Error correction methods (UMIs, duplex consensus) are needed to distinguish true rare variants from sequencing errors.
Script: skills/tooluniverse-population-genetics/scripts/popgen_calculator.py
Preferred: Use ToolUniverse tools (via MCP/SDK) instead of the script when possible:
- PopGen_hwe_test tool -- HWE chi-square test. Fallback: popgen_calculator.py --type hwe
- PopGen_fst tool -- Weir-Cockerham Fst. Fallback: popgen_calculator.py --type fst
- PopGen_inbreeding tool -- Inbreeding coefficient from pedigree. Fallback: popgen_calculator.py --type inbreeding
- PopGen_haplotype_count tool -- Expected haplotype diversity. Fallback: popgen_calculator.py --type haplotypes

Fallback script modes (all require --type):

- hwe: --AA N --Aa N --aa N -- chi-square HWE test with p-value
- fst: --p1 F --p2 F --n1 N --n2 N -- Weir-Cockerham Fst
- inbreeding: --pedigree TYPE --generations G -- F from pedigree (self, full-sib, half-sib, first-cousin, etc.)
- haplotypes: --snps N --generations G --recomb_rate R -- expected haplotype diversity
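For a quick sanity check on an Fst result, a simpler heterozygosity-based estimate can be computed by hand. Note that this is Wright's Fst = (H_T - H_S) / H_T for two equally weighted populations, not the Weir-Cockerham estimator that the tool and script report, so the values will generally differ, especially with small or unequal sample sizes. The function name is hypothetical.

```python
def wright_fst(p1, p2):
    """Wright's Fst for one biallelic locus in two equally weighted populations."""
    p_bar = (p1 + p2) / 2.0
    h_t = 2.0 * p_bar * (1.0 - p_bar)  # expected heterozygosity, pooled
    h_s = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0  # mean within
    if h_t == 0.0:
        return 0.0  # locus monomorphic overall; defined as 0 in this sketch
    return (h_t - h_s) / h_t


print(wright_fst(0.2, 0.8))
```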