clinical-databases/variant-prioritization/SKILL.md
Prioritizes rare-disease variants from trio/quad WES/WGS with de novo (DeNovoGear, Triodenovo), compound-heterozygous phasing (WhatsHap), mosaic VAF tiering, phenotype-driven ranking (Exomiser, Phen2Gene, AMELIE), ClinGen gene-disease validity gating, and ACMG SF v3.2 secondary findings reporting. Use when running diagnostic exome / genome pipelines, identifying candidate Mendelian disease genes, screening for incidental findings, or auditing VUS reclassification cycles. The ACMG/AMP classification framework (PVS1 decision tree, Pejaver PP3/BP4 calibration, Tavtigian point system) is in clinical-databases/acmg-classification.
npx skillsauth add GPTomics/bioSkills bio-clinical-databases-variant-prioritizationInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: pandas 2.2+, cyvcf2 0.30+, pyhgvs 0.12+, Exomiser 14.0+ (Smedley 2015), Phen2Gene 1.2+ (Zhao 2020), DeNovoGear 1.1.1+ (Ramu 2013), WhatsHap 2.0+ (Patterson 2015), HPO 2024+ (Human Phenotype Ontology). ACMG Secondary Findings list is v3.2 (Miller 2023): 81 genes.
Before using code patterns, verify installed versions match. If versions differ:
pip show <package> then help(module.function) to check signatures<tool> --versionIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying. Phenotype-driven prioritization REQUIRES high-quality HPO terms; without rich phenotypic input Exomiser/AMELIE degrade significantly.
'Prioritize candidate disease-causing variants from this trio exome' -> Filter to rare + functional + inheritance-consistent variants; rank by phenotype concordance; flag ACMG SF v3.2 incidental findings; report tiers with classification logic deferred to acmg-classification.
exomiser --analysis hiPHIVE-prioritised.ymlwhatshap phase --indels for singletons; trio-based for familieshttps://cspec.genome.network/cspec/ui/svi/allTypical trio exome enters as 40,000-100,000 variants per individual; reaches diagnostic candidate list of 1-10 variants through cascading filters:
| Stage | Filter | Variant count (typical trio) |
|-------|--------|------------------------------|
| Raw joint-called | -- | 100k-150k |
| QC filter (PASS, depth, GQ, missingness) | GATK best practices + Hail QC | 80k-120k |
| Population frequency | gnomAD grpmax_faf95 < 0.0001 (or disease-specific Whiffin max-credible-AF) | 5k-15k |
| Functional consequence | Coding / splice / regulatory | 1k-3k |
| Inheritance pattern | de novo / AR-hom / AR-compoundhet / X-linked / mosaic | 50-500 |
| Phenotype concordance | Exomiser hiPHIVE / Phen2Gene / AMELIE score | 5-50 |
| ACMG classification | Defer to acmg-classification | 1-10 |
| ACMG SF v3.2 cross-check | Miller 2023 (81 genes) | Separate output |
| Pattern | Filter | |---------|--------| | De novo (DNV) | Variant in proband, absent in both parents; needs trio | Apply DeNovoGear / Triodenovo / GATK PossibleDeNovo; visual IGV inspection (~10-30% false-positive rate without) | | Autosomal recessive; homozygous | Hom-alt in proband; het in both parents | gnomAD grpmax_faf95 < 0.005 recessive threshold (Whiffin formula) | | Autosomal recessive; compound het | Two het variants in same gene on opposite alleles | Trio-phased OR read-based phasing via WhatsHap (works within ~500 bp; longer needs parents or long-read) | | X-linked recessive | Male proband hemizygous; carrier mother het | chrX coords; check Klinefelter / mosaic XXY | | X-linked dominant | Het in affected; consider XCI skewing in females | Report XCI status if relevant | | Mitochondrial heteroplasmy | mtDNA variant present at varying heteroplasmy across tissues | Use MITOMAP + HmtVar; ACMG criteria do not apply directly | | Mosaic | Sub-clonal VAF in proband; absent in inherited transmissions | VAF 5-30% suggestive; tissue-dependent (blood vs buccal vs affected tissue) |
Goal: Identify variants present in proband but absent in both parents with high specificity.
Approach: Use specialized DNV callers; supplement with manual IGV inspection.
| Tool | Approach | Use case | |------|----------|----------| | DeNovoGear (Ramu 2013 Nat Methods) | Bayesian, considers parent-of-origin | Standard for trio WES | | Triodenovo (Wei 2015) | Bayesian + family-aware | Alternative | | GATK PossibleDeNovo annotation | Hard filter | Quick prefilter; not standalone | | DeNovoCNN (2024) | Deep learning trio caller | Most accurate as of 2024-2026 |
False-DNV rate: ~10-30% without manual IGV inspection; concentrated in:
| Tool | Approach | Performance (typical benchmark) | Fails when | |------|----------|--------------------------------|-----------| | Exomiser (Smedley 2015 Nat Protoc) | hiPHIVE: phenotype + interactome + sequence damage | 74% top-1; 94% top-5 (Cipriani 2020) | Sparse HPO (< 5 specific terms); novel-disease gene | | Phen2Gene (Zhao 2020 NARGAB) | HPO-to-gene mapping; faster than Exomiser | Similar top-5 | Phenotype-only filtering insufficient | | AMELIE (Birgmeier 2020 Sci Transl Med) | Literature-mining + phenotype | Best when literature is rich | New / rare disease without literature; specific patient HPO unmatched | | Phenolyzer (Yang 2015 Nat Methods) | Phenotype-based gene scoring | Legacy | Modern multi-feature tools (Exomiser, AMELIE) preferred | | GADO (Deelen 2019 Nat Commun) | Gene Network-based; HPO-free option | When HPO is sparse | Phenotype-rich cases where Exomiser hiPHIVE wins | | CADA (Peng 2021) | Cross-species gene prioritization | Animal model integration | Genes without orthologs; rare-disease without animal model |
Critical requirement: all phenotype-driven tools degrade significantly with sparse HPO terms. Capture 5-10 specific HPO terms; avoid generic "intellectual disability" alone.
Strong et al. 2017 AJHG + ClinGen ongoing curation: Limited / Moderate / Strong / Definitive evidence per gene-disease pair.
| Category | When to apply | |----------|---------------| | Definitive | Strong literature evidence + functional / population genetic evidence | Apply full ACMG framework | | Strong | -- | Apply full framework | | Moderate | -- | Apply framework but flag | | Limited | Single case report or weak segregation | Treat candidate cautiously; PP2 / BP1 should not apply | | Disputed | Contradicting evidence | Do not call pathogenic without VCEP curation | | No Known Disease Relationship | Gene not associated with the queried disease | Do not call |
Many commercial panels include genes with only Limited validity. ClinGen-curated https://search.clinicalgenome.org/kb/gene-validity is the authoritative directory.
81 genes for opt-in/opt-out reporting on clinical exome/genome. Growth: 56 -> 59 -> 73 -> 78 -> 81. v3.2 additions: CALM1, CALM2, CALM3 (calmodulinopathy; long QT / CPVT; high actionability via beta-blockade + ICD).
Inclusion criteria: ClinGen Strong or Definitive gene-disease validity + ClinGen ADWG actionability scoring.
ACMG_SF_V3_2_GENES = [
# Cardiomyopathies
'ACTA2', 'ACTC1', 'COL3A1', 'DES', 'FBN1', 'FLNC', 'GLA', 'LMNA', 'MYBPC3',
'MYH11', 'MYH7', 'MYL2', 'MYL3', 'PRKAG2', 'PKP2', 'RBM20', 'SCN5A', 'SMAD3',
'TGFBR1', 'TGFBR2', 'TMEM43', 'TNNI3', 'TNNT2', 'TPM1', 'TTN',
# CALM v3.2 additions (calmodulinopathies)
'CALM1', 'CALM2', 'CALM3',
# Arrhythmias and channelopathies
'CACNA1S', 'KCNH2', 'KCNQ1', 'RYR1', 'RYR2',
# Vascular
'ACVRL1', 'ENG',
# Cancer predisposition
'APC', 'ATM', 'BAP1', 'BMPR1A', 'BRCA1', 'BRCA2', 'BRIP1', 'CDH1', 'CDKN2A',
'CHEK2', 'GREM1', 'HOXB13', 'MAX', 'MEN1', 'MLH1', 'MSH2', 'MSH6', 'MUTYH',
'NF2', 'PALB2', 'PMS2', 'PTEN', 'RAD51C', 'RAD51D', 'RB1', 'RET', 'SDHAF2',
'SDHB', 'SDHC', 'SDHD', 'SMAD4', 'STK11', 'TMEM127', 'TP53', 'TSC1', 'TSC2',
'VHL', 'WT1',
# Other
'FH', 'GAA', 'HFE', 'HNF1A', 'LDLR', 'NTRK1', 'OTC', 'PCSK9', 'TTR'
]
# Note: above list is illustrative; pin to Miller 2023 supplement for exact set.
| Scenario | Recommended path | Why | |----------|------------------|-----| | Trio WES, suspected Mendelian | Full pipeline with DeNovoGear + Exomiser + HPO | Standard rare-disease workflow | | Singleton WES | WhatsHap read-based phasing + AR-hom + AR-compoundhet candidates | Compound het hard without trio | | Suspected mosaic | Lower VAF threshold (2-30%); deep coverage (>200x) | Standard tools miss mosaic | | Long-read genome | Add SV calling + STR repeat expansion | SVs miss in short-read | | Newborn screening (BabyScreen+) | 605-gene Mendelian panel with current ACMG SF v3.2 | Lynch / Brett 2025 Nat Med | | Cancer predisposition | ClinGen Hereditary Cancer VCEPs + ACMG SF cancer subset | Use VCEP CSpec | | Cardiomyopathy / arrhythmia | ClinGen HCM / DCM / LQT VCEPs | Strict gene-disease validity | | Population screening | ACMG SF v3.2 (81 genes) opt-in/opt-out | Miller 2023 |
Goal: From a trio joint-called VCF, output ranked candidate variants with inheritance pattern, phenotype concordance, and ACMG SF flags.
Approach: Cascading filters with QC, population frequency, functional consequence, inheritance, phenotype.
from cyvcf2 import VCF
import pandas as pd
from pathlib import Path
# Quality + population frequency filter (apply first)
def filter_qc_and_frequency(vcf_path, max_grpmax_faf95=0.0001, min_dp=10, min_gq=20):
'''Stage 1: QC + frequency filter. Reduces 100k -> ~5-15k variants.'''
vcf = VCF(vcf_path)
samples = vcf.samples # e.g., [proband, mother, father]
rows = []
for v in vcf:
if v.FILTER is not None:
continue
if min(v.gt_depths) < min_dp:
continue
if v.QUAL is not None and v.QUAL < min_gq:
continue
gnomad = (v.INFO.get('grpmax_faf95') or v.INFO.get('AF_grpmax') or
v.INFO.get('AF_popmax') or 0)
if gnomad > max_grpmax_faf95:
continue
rows.append({
'chrom': v.CHROM, 'pos': v.POS, 'ref': v.REF, 'alt': v.ALT[0],
'genotypes': dict(zip(samples, v.gt_types.tolist())),
'depth': dict(zip(samples, v.gt_depths.tolist())),
'gnomad_faf95': gnomad,
'consequence': v.INFO.get('CSQ', '').split('|')[1] if v.INFO.get('CSQ') else None
})
return pd.DataFrame(rows)
def call_de_novo(df, proband, mother, father):
'''Stage 2: identify DNV candidates: hom-ref both parents, het/hom-alt proband.
Implements Mendelian-violation logic; supplement with DeNovoGear or DeNovoCNN
for production (this implementation has 10-30% false-positive rate without IGV).
'''
is_dnv = []
for _, row in df.iterrows():
gts = row['genotypes']
if gts[mother] == 0 and gts[father] == 0 and gts[proband] in (1, 3):
# Mother hom-ref AND father hom-ref AND proband het OR hom-alt
# Confidence boost: depth at parent sites should be >= 10 to trust hom-ref
if row['depth'][mother] >= 10 and row['depth'][father] >= 10:
is_dnv.append(True)
continue
is_dnv.append(False)
df['is_de_novo_candidate'] = is_dnv
return df
def call_compound_het(df, proband, mother, father, gene_col='gene'):
'''Stage 3: identify compound het: two het variants in same gene, one from each parent.
Trio phasing is gold standard; singletons require WhatsHap read-based phasing.
'''
het_in_proband = df[df['genotypes'].apply(lambda gts: gts[proband] == 1)]
candidate_genes = []
for gene in het_in_proband[gene_col].unique():
if pd.isna(gene):
continue
gene_variants = het_in_proband[het_in_proband[gene_col] == gene]
# Need >= 2 variants; one inherited from each parent
maternal_het = gene_variants[gene_variants['genotypes'].apply(
lambda gts: gts[mother] == 1 and gts[father] == 0)]
paternal_het = gene_variants[gene_variants['genotypes'].apply(
lambda gts: gts[father] == 1 and gts[mother] == 0)]
if len(maternal_het) >= 1 and len(paternal_het) >= 1:
candidate_genes.append(gene)
df['is_compound_het_candidate'] = df[gene_col].isin(candidate_genes)
return df
def flag_acmg_sf(df, acmg_sf_genes, gene_col='gene', clnsig_col='clinvar_sig'):
'''Stage: flag ACMG Secondary Findings (Miller 2023 v3.2; 81 genes).
Only P/LP variants in SF genes are reportable as secondary findings.
'''
df['is_acmg_sf_candidate'] = (
df[gene_col].isin(acmg_sf_genes) &
df[clnsig_col].astype(str).str.contains('athogenic', na=False)
)
return df
def filter_by_clingen_validity(df, validity_table, gene_col='gene',
min_validity='Moderate'):
'''Gate on ClinGen gene-disease validity. Limited or Disputed -> low confidence.
validity_table: DataFrame from `https://search.clinicalgenome.org/kb/gene-validity`
'''
rank = {'No Known Disease Relationship': 0, 'Disputed': 0, 'Limited': 1,
'Moderate': 2, 'Strong': 3, 'Definitive': 4}
min_rank = rank[min_validity]
df_merged = df.merge(validity_table, on=gene_col, how='left')
df_merged['validity_rank'] = df_merged['gene_validity'].map(rank).fillna(0)
df_merged['pass_validity'] = df_merged['validity_rank'] >= min_rank
return df_merged
def phenotype_score_with_exomiser_yml(yml_path, vcf_path, hpo_terms, output_dir):
'''Emit Exomiser command for phenotype-driven ranking.
HPO terms (e.g., HP:0001250 for seizures) must be SPECIFIC.
Sparse generic HPO degrades Exomiser hiPHIVE accuracy significantly.
'''
return (f'java -jar exomiser-cli-14.0.0.jar --analysis {yml_path} '
f'--vcf {vcf_path} --hpo {",".join(hpo_terms)} '
f'--output-dir {output_dir}')
1. De novo with false-positive rate 10-30%
2. Compound het without phasing
3. Limited-validity gene reported as diagnostic
4. Sparse HPO terms degrading Exomiser
5. ACMG SF v3.1 used instead of v3.2
6. Mosaic variants below standard VAF threshold
7. ClinVar P variant in Limited-validity gene
8. VUS reclassification gaps
9. Inheritance pattern assumed wrong
| Pattern | Likely cause | Action | |---------|-------------|--------| | Exomiser ranks low; ClinVar says P | Sparse or wrong HPO terms; rare disease in atypical gene | Re-run with full HPO; manual review | | ClinVar P + ClinGen Limited validity | Variant-level vs gene-disease tension | Treat as candidate; require VCEP curation or functional evidence | | DeNovoGear high posterior; trio coverage uneven | Parental mosaicism or mapping error | IGV review; consider parent-of-origin testing | | Compound het in phasing-ambiguous gene | Distance > 500 bp; can't phase from reads | Trio phasing; long-read confirmation | | SF gene with V3.1 list; missing CALM | Miller 2023 v3.2 update | Re-run with v3.2 (81 genes) | | Phenotype tool disagrees with clinical | Tool-specific phenotype model; literature gap | Cross-check with AMELIE for literature-mining alternative | | Mosaic suspected but standard pipeline negative | VAF below 30% threshold | Deep targeted sequencing or affected tissue |
| Threshold | Convention | Source | |-----------|-----------|--------| | Rare-disease frequency filter | grpmax_faf95 < 0.0001 | ClinGen SVI | | Recessive disease filter | grpmax_faf95 < 0.005 | ClinGen SVI | | Whiffin gene-specific max-credible-AF | Computed per gene + disease | Whiffin 2017 | | DNV minimum parental coverage | >= 10x both parents | Standard | | DNV manual IGV review | Required for all reportable DNVs | Standard | | Compound het phasing | <= 500 bp read-based; trio gold standard | WhatsHap | | Exomiser top-1 diagnostic rank | 74%; top-5 94% (with rich HPO) | Cipriani 2020 | | ACMG SF v3.2 genes | 81 (Miller 2023) | Miller 2023 Genet Med | | VUS reclassification cycle | Median 5 years for active genes; up to 10 for orphan | Harrison 2017 follow-up | | Mosaic VAF threshold | 2-30% | Convention | | ClinGen gene-disease validity gate | Moderate or Strong minimum for diagnostic reporting | ClinGen SVI |
| Symptom | Cause | Solution |
|---------|-------|----------|
| Too many candidate variants (>50) | Frequency filter too loose | Tighten to grpmax_faf95 < 0.0001 (dominant) or 0.005 (recessive) |
| No DNV candidates in obvious DNV phenotype | False-negative DNV calling | DeNovoGear / DeNovoCNN; check parental sample swap |
| Compound het in gene known AD only | Phasing not validated | Confirm phase via trio or long-read |
| Exomiser top hit unrelated to phenotype | HPO too generic or wrong | Add specific HPO; check ontology version |
| Mosaic disease missed | VAF threshold too high | Deep coverage; affected tissue sampling; VAF 2-5% |
| SF gene match flagged but variant benign | Wrong variant classification | Apply ACMG framework via acmg-classification skill |
| Genotype-phenotype discordance | Locus heterogeneity OR multi-gene contribution | Run digenic / oligogenic analysis tools |
| Pushback | Standard response |
|----------|-------------------|
| "Why grpmax_faf95 instead of AF?" | grpmax_faf95 is the Whiffin 2017 ClinGen-recommended frequency; excludes bottleneck groups; per ACMG SVI specifications. |
| "Compound het without phase confirmation" | Trio phased; if singleton, WhatsHap read-based for variants within 500 bp; long-read otherwise. |
| "DNV call without IGV review?" | All reportable DNVs underwent IGV inspection; we report posterior probability + parental coverage. |
| "ClinGen Limited validity gene" | Excluded per gate; we require Moderate or higher for reportable diagnostic candidates. |
| "Why ACMG SF v3.2 not v3.1?" | v3.2 (Miller 2023) added CALM1/2/3 calmodulinopathies (high actionability). We use current. |
| "Phenotype-driven prioritization with single HPO term?" | We submit 5-10 specific HPO terms; sparse input degrades Exomiser. |
| "ACMG classification logic?" | Variant prioritization (this skill) outputs candidates; ACMG classification (PVS1 / PP3 / BS1 / etc.) is in acmg-classification skill. |
| "Why not VarSome / Franklin automated ACMG?" | We report aggregated annotations via myvariant.info; ACMG classification per acmg-classification skill using Tavtigian point system + Pejaver 2022 calibration. |
https://search.clinicalgenome.org/kb/gene-validityhttps://hpo.jax.org/https://www.gimjournal.org/article/S1098-3600(23)00879-1/fulltexttools
--- name: bio-phasing-imputation-foundations description: Frames the phasing/imputation pipeline before any tool runs: phasing and imputation are one Li-Stephens copying HMM (recombination is the transition, mutation the emission, the genetic map and Ne set the rates), imputation's honest output is a dosage with a self-estimated quality (INFO/R2/DR2) not a hard genotype, and the stages are ordered and each fails silently (QC, align build and strand to the panel, phase, impute per chromosome, fil
tools
Chooses the enrichment generation before any tool runs, mapping the input shape to a method class - a pre-selected gene list plus a background to over-representation analysis (ORA, hypergeometric), a ranked statistic for all genes to gene set enrichment (GSEA), a signed signaling topology to pathway-topology (SPIA) - then making the null explicit (competitive vs self-contained, gene vs subject sampling) and running a trustworthiness checklist (testable-gene universe, FDR, redundancy collapse, leading-edge check, version reporting). Covers why every clusterProfiler GSEA is the inter-gene-correlation-uncorrected competitive null, why the background not the gene list decides ORA significance, and why no method is universally best. Use when deciding ORA vs GSEA vs topology, which gene-set DB, whether a result is trustworthy, or which null a tool computes. For ORA see go-enrichment, GSEA see gsea, databases kegg-pathways/reactome-pathways/wikipathways; the ranking comes from differential-expression/de-results.
testing
End-to-end GWAS workflow from VCF to association results. Covers PLINK QC, population structure correction, and association testing for case-control or quantitative traits. Use when running genome-wide association studies.
development
Orchestrates the full path from differential expression results to redundancy-collapsed functional enrichment: choose ORA vs GSEA, convert gene IDs per method, run enrichGO/enrichKEGG/enrichPathway/enrichWP or gseGO/gseKEGG (clusterProfiler, ReactomePA, rWikiPathways), and visualize. Routes the ORA-vs-GSEA generation fork and the null/universe/reproducibility theory to pathway-analysis/enrichment-foundations. Use when a DESeq2/edgeR/limma result must become enriched GO terms, KEGG/Reactome/WikiPathways pathways, or a GSEA leading edge; when deciding whether a ranking exists for all genes (GSEA, named decreasing vector) or only a pre-selected list (ORA plus a defensible background universe); or when assembling DE-to-pathway end to end. The DE list and ranking statistic come from differential-expression/de-results; per-method nuance lives in the pathway-analysis skills.