clinical-databases/clinvar-lookup/SKILL.md
Queries ClinVar for variant pathogenicity classifications, ClinGen VCEP curations, and somatic-vs-germline interpretations via REST API, weekly VCF, or bulk XML. Use when determining clinical significance, triangulating conflicting interpretations, or aggregating evidence against the ACMG/AMP framework with ClinGen SVI specifications.
Install this skill globally with one command: `npx skillsauth add GPTomics/bioSkills bio-clinical-databases-clinvar-lookup`. Works with Claude Code, Cursor, and Windsurf.
Reference examples tested with: requests 2.31+, cyvcf2 0.30+, pandas 2.2+, bcftools 1.19+, Entrez Direct 21.0+, lxml 5.0+ (for v2 XML schema).
Before using code patterns, verify installed versions match. If versions differ:
- `pip show <package>` then `help(module.function)` to check signatures
- `<tool> --version` then `<tool> --help` to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying. ClinVar XML schema v2 (rolled out in 2024) replaces `<ClinVarSet>` with `<VariationArchive>` as the top-level anchor; XSLT or parsers targeting the legacy element silently emit zero records.
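The version check above can also be scripted. The following is a minimal sketch, assuming the Python packages are installed under the distribution names listed in the "tested with" line; `TESTED_WITH`, `version_tuple`, and `check_versions` are illustrative names, not part of any library. It does not cover the command-line tools (bcftools, Entrez Direct).

```python
from importlib import metadata

# Minimum versions copied from the "tested with" line (Python packages only).
TESTED_WITH = {"requests": "2.31", "cyvcf2": "0.30", "pandas": "2.2", "lxml": "5.0"}


def version_tuple(version):
    """Turn '2.31.0' into (2, 31, 0); stop at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def check_versions(minimums=TESTED_WITH):
    """Return {package: status}; status is 'ok', 'older (<installed>)' or 'missing'."""
    report = {}
    for package, minimum in minimums.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            report[package] = "missing"
            continue
        ok = version_tuple(installed) >= version_tuple(minimum)
        report[package] = "ok" if ok else f"older ({installed})"
    return report
```

Calling `check_versions()` before running the examples gives a quick picture of which packages need their signatures inspected by hand.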
'Look up the clinical significance of this variant' -> Retrieve ClinVar VCV-level aggregate, SCV-level submissions, ClinGen Variant Curation Expert Panel (VCEP) overrides, and conflict-resolution status.
- `requests.get()` against the E-utilities `clinvar` database
- `cyvcf2.VCF('clinvar.vcf.gz')` for batch queries against the weekly snapshot
- `bcftools annotate -a clinvar.vcf.gz -c INFO/CLNSIG,INFO/CLNREVSTAT,INFO/CLNDN`
- https://reg.clinicalgenome.org/

| Level | Format | What it aggregates | When to use | Fails when |
|-------|--------|--------------------|-------------|-----------|
| SCV | SCVxxxxxxxxx.N | One submitter, one variant, one condition (atomic submission unit) | Auditing who said what; conflict triangulation | Aggregated reporting (use VCV); cross-condition analysis |
| RCV | RCVxxxxxxxxx.N | All SCVs for a single (variant, condition) pair | Condition-stratified analysis; legacy aggregation | Variant-level reporting across all conditions (use VCV) |
| VCV | VCVxxxxxxxxx.N | All RCVs for one variant across all conditions | Canonical anchor since 2017; default API entrypoint | Condition-specific clinical action (use RCV); CLNSIG collapses multi-condition |
Operational footgun: the clinvar.vcf.gz CLNSIG field is the variant-level (VCV) aggregate. A variant Pathogenic for disease A but VUS for disease B collapses to "Pathogenic/Conflicting". For condition-stratified analysis, parse RCV-level XML, never CLNSIG alone.
2024 XML schema overhaul: ClinVar v2 XML separates GermlineClassification, SomaticClinicalImpact, and OncogenicityClassification under one <VariationArchive> anchor. The legacy <ClinicalSignificance> element is gone. Pipelines built before September 2024 against <ClinVarSet> silently emit zero records on new XML. The dual-release period ended December 2024.
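Because the failure mode described here is silent, a cheap guard is to count record anchors before running a full parse. The sketch below uses only the two element names mentioned in this document (`VariationArchive`, `ClinVarSet`); the function names and the 1000-record sampling limit are illustrative assumptions.

```python
import io
import xml.etree.ElementTree as ET


def count_anchors(xml_source, limit=1000):
    """Count legacy vs v2 record anchors among the first `limit` records of an XML stream."""
    counts = {"VariationArchive": 0, "ClinVarSet": 0}
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop an XML namespace prefix if present
        if tag in counts:
            counts[tag] += 1
            elem.clear()  # release the finished record; bulk files are very large
            if sum(counts.values()) >= limit:
                break
    return counts


def assert_v2(xml_source):
    """Raise instead of silently yielding zero records when no v2 anchors are seen."""
    counts = count_anchors(xml_source)
    if counts["VariationArchive"] == 0:
        raise ValueError(f"no <VariationArchive> records found: {counts}")
    return counts


# Synthetic input for illustration only; real files are passed as a path or file object.
sample = io.BytesIO(b"<Release><VariationArchive/><VariationArchive/></Release>")
print(assert_v2(sample))
```

Failing loudly at this step is a design choice: an empty output table from a legacy parser is otherwise indistinguishable from a gene with no records.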
| Stars | Review status | What it means operationally |
|-------|--------------|---------------------------|
| 4 | Practice guideline | ACMG/CAP CFTR-level (vanishingly rare) |
| 3 | Expert panel reviewed (ClinGen VCEP) | FDA-recognized tier; overrides lower-star records for clinical action |
| 2 | Multiple submitters, criteria provided, no conflicts | Reliable aggregate |
| 1 | Single submitter OR conflicting interpretations (often mis-reported as 2-star) | Use with scrutiny |
| 0 | No assertion criteria provided | Literature-only or legacy submissions |
ClinVar does NOT retract or hide lower-star records when a VCEP publishes; a variant can simultaneously display "Pathogenic (3-star VCEP)" and "Conflicting interpretations (1-star)". Tools handle this differently (VarSeq, Franklin, GenoOx each pick a winner via different rules); this is a major source of inter-tool disagreement.
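One way to make the "pick a winner" rule explicit, rather than inheriting a tool's hidden one, is to map review-status strings to stars and select the highest-star record, breaking ties by date. This is a minimal sketch: the string list is an assumption based on recent `CLNREVSTAT` values and should be checked against the release in use, and the record keys `review_status` and `last_evaluated` (ISO `YYYY-MM-DD`) are hypothetical.

```python
# Review-status strings mapped to stars. Assumption: reflects recent releases;
# unrecognised strings fall back to 0 stars.
REVSTAT_STARS = {
    "practice_guideline": 4,
    "reviewed_by_expert_panel": 3,
    "criteria_provided,_multiple_submitters,_no_conflicts": 2,
    "criteria_provided,_single_submitter": 1,
    "criteria_provided,_conflicting_classifications": 1,
    "criteria_provided,_conflicting_interpretations": 1,  # older wording
    "no_assertion_criteria_provided": 0,
    "no_classification_provided": 0,
}


def stars(review_status):
    """Accept VCF-style (underscores) or display-style (spaces) review-status strings."""
    key = (review_status or "").strip().lower().replace(" ", "_")
    return REVSTAT_STARS.get(key, 0)


def pick_by_stars(records):
    """Return the highest-star record; ties go to the most recent ISO date."""
    return max(records, key=lambda r: (stars(r["review_status"]), r.get("last_evaluated") or ""))


records = [
    {"label": "Conflicting", "review_status": "criteria provided, conflicting classifications",
     "last_evaluated": "2024-03-01"},
    {"label": "Pathogenic", "review_status": "reviewed by expert panel",
     "last_evaluated": "2021-06-15"},
]
print(pick_by_stars(records)["label"])
```

The lower-star record is not discarded by this function; callers that need to surface the disagreement should keep the full list alongside the selected record.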
As of 2025, ~80-90 VCEPs are approved or in progress across RASopathies, hereditary cancer (ENIGMA BRCA1/2, InSiGHT MMR), cardiomyopathy (sarcomere genes), hearing loss, RPE65/IRD, inborn errors of metabolism, and FH. The current count is moving; the authoritative directory is the Criteria Specification Registry at https://cspec.genome.network/cspec/ui/svi/all.
Each VCEP publishes a gene-disease-specific CSpec that re-weights ACMG/AMP criteria. The Hearing Loss VCEP downgrades PM2 to supporting by default and upgrades PS3 thresholds for OTOF. Treating "ACMG/AMP" as a single rubric across all genes is the most common error in non-specialist tooling.
The Richards 2015 28-criterion framework is the foundation, but every modern automated classifier (InterVar, GeneBe, Franklin, VarSome) implements the Tavtigian 2018/2020 Bayesian point system, not the original combining rules. Strengths map to points: Supporting=1, Moderate=2, Strong=4, Very Strong=8 (benign codes negative). Final categories: P >=10, LP 6-9, VUS 0-5, LB -1 to -6, B <=-7.
For variant interpretation framework details, calibrated in-silico thresholds, and PVS1 decision-tree logic, defer to clinical-databases/acmg-classification. This skill focuses on querying ClinVar; it intentionally does not re-implement classification.
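Purely to make the point arithmetic concrete, and not as a classifier, the mapping and cut-offs quoted above can be written out as follows. The function names are illustrative; standalone criteria such as BA1, VCEP-specific re-weighting, and the criteria themselves are deliberately not modelled.

```python
# Points per evidence strength, as quoted above (benign evidence is subtracted).
POINTS = {"supporting": 1, "moderate": 2, "strong": 4, "very_strong": 8}


def total_points(pathogenic=(), benign=()):
    """Sum points for lists of evidence strengths, e.g. ['strong', 'moderate']."""
    return sum(POINTS[s] for s in pathogenic) - sum(POINTS[s] for s in benign)


def category(points):
    """Map a point total to P / LP / VUS / LB / B using the thresholds above."""
    if points >= 10:
        return "P"
    if points >= 6:
        return "LP"
    if points >= 0:
        return "VUS"
    if points >= -6:
        return "LB"
    return "B"


# One very strong plus one moderate pathogenic criterion: 8 + 2 = 10 points.
print(category(total_points(pathogenic=["very_strong", "moderate"])))
```

A single strong plus a single moderate criterion totals 6 points and lands in LP, which shows how the boundaries work without invoking the original combining rules.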
Harrison 2017 Genet Med 19:1096 (PMID 28301460) showed 87% of inter-lab conflicts were resolvable by reassessment plus data sharing. As of 2024, only 3.8% of conflicting BRCA1 missense VUS reached consensus despite years of effort; conflict resolution is slow even in best-curated genes.
Submission staleness is non-trivial: ClinVar does not push reclassifications to submitters; a 2017 SCV can persist on an active label in 2026 if the lab has not re-submitted. Genome Alert! (Yauy 2022 Genet Med) was built specifically to detect classification drift between weekly releases. The median delta is ~1,247 classification changes per month with potential clinical impact.
| Scenario | Recommended path | Why |
|----------|------------------|-----|
| Single variant, known gene/condition | E-utilities esummary against clinvar DB | Lowest latency, returns VCV-level summary |
| Batch (10-1000 variants) by HGVS or rsID | myvariant.info with fields=clinvar | Aggregated, includes ClinVar review status |
| Batch (>1000) or coordinate-based | Local clinvar.vcf.gz with bcftools annotate or cyvcf2 | No rate limits; weekly snapshot |
| Condition-stratified (variant in disease A vs B) | Bulk XML VariationArchive parsing | RCV is the only level that preserves per-condition classification |
| Cross-database join with gnomAD / dbSNP / COSMIC | ClinGen Allele Registry CA ID | Build-agnostic, transcript-agnostic canonical identifier |
| Reproducible analysis with citable date | First-Thursday-of-month archive on FTP | Only monthly snapshots are archived; weekly releases disappear |
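The myvariant.info row in the table above has no accompanying code elsewhere in this skill, so here is a minimal sketch of that path. It assumes the public myvariant.info v1 endpoints (`/variant` for HGVS IDs, `/query` with `scopes=dbsnp.rsid` for rsIDs) and a 1000-ID batch ceiling; verify both against the live service documentation, including which genome assembly the HGVS IDs are interpreted against. The standard library is used so the sketch runs without extra packages; `chunks`, `build_request`, and `fetch_clinvar` are illustrative names.

```python
import json
import urllib.parse
import urllib.request

MYVARIANT = "https://myvariant.info/v1"


def chunks(items, size=1000):
    """Yield successive batches of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_request(ids):
    """Return (url, form_fields) for a batch of rsIDs or of HGVS genomic IDs."""
    if all(i.lower().startswith("rs") for i in ids):
        return f"{MYVARIANT}/query", {"q": ",".join(ids), "scopes": "dbsnp.rsid", "fields": "clinvar"}
    return f"{MYVARIANT}/variant", {"ids": ",".join(ids), "fields": "clinvar"}


def fetch_clinvar(ids):
    """POST each batch and collect the returned hit list (performs network calls)."""
    hits = []
    for batch in chunks(list(ids)):
        url, form = build_request(batch)
        body = urllib.parse.urlencode(form).encode()
        request = urllib.request.Request(url, data=body)  # data present => POST
        with urllib.request.urlopen(request, timeout=60) as resp:
            hits.extend(json.load(resp))
    return hits
```

Keeping request construction separate from the network call makes the batching logic checkable offline, and makes it straightforward to swap in `requests` if that is already a dependency.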
Goal: Retrieve VCV-level ClinVar summary for a single variant by ID, gene, or HGVS.
Approach: Hit esummary.fcgi or esearch.fcgi against db=clinvar, parse JSON, then optionally hydrate to full record with efetch.
```python
import requests

EUTILS = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils'


def clinvar_summary(variation_id):
    '''Retrieve VCV-level summary by ClinVar VariationID (do not confuse with CA ID).

    The germline / somatic / oncogenicity classification nesting shown below
    follows the ClinVar 2024 eSummary v2 schema described in the data-access
    documentation. Field names have changed between API versions -- inspect
    the actual JSON returned by eSummary for the live ClinVar version before
    pinning these key paths in production.
    '''
    r = requests.get(f'{EUTILS}/esummary.fcgi',
                     params={'db': 'clinvar', 'id': variation_id, 'retmode': 'json'},
                     timeout=30)
    r.raise_for_status()
    record = r.json()['result'][str(variation_id)]
    germline = record.get('germline_classification', {})
    return {
        'vcv': record.get('accession'),
        'name': record.get('title'),
        'germline_class': germline.get('description'),
        'germline_review_status': germline.get('review_status'),
        'somatic_clinical': record.get('clinical_impact_classification', {}).get('description'),
        'oncogenicity': record.get('oncogenicity_classification', {}).get('description'),
        'last_evaluated': germline.get('last_evaluated'),
    }


def clinvar_search_gene(gene, pathogenic_only=False, retmax=500):
    '''Return ClinVar VariationIDs (as strings) for a gene symbol.'''
    term = f'{gene}[gene]'
    if pathogenic_only:
        term += ' AND (clinsig_pathogenic[Properties] OR clinsig_likely_pathogenic[Properties])'
    r = requests.get(f'{EUTILS}/esearch.fcgi',
                     params={'db': 'clinvar', 'term': term, 'retmax': retmax, 'retmode': 'json'},
                     timeout=30)
    r.raise_for_status()  # fail loudly on HTTP errors instead of a confusing KeyError below
    return r.json()['esearchresult']['idlist']
```
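When these calls are made in a loop, NCBI's published E-utilities limits apply: 3 requests per second without an API key and 10 per second with one. A minimal sketch of a client-side throttle follows; `Throttle` is an illustrative helper, and the injectable `clock` and `sleep` arguments exist only so the spacing logic can be checked without real waiting.

```python
import time


class Throttle:
    """Space successive calls at least 1/per_second seconds apart."""

    def __init__(self, per_second=3, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / per_second
        self.clock = clock
        self.sleep = sleep
        self.last = None  # time of the previous call, None before the first one

    def wait(self):
        now = self.clock()
        if self.last is not None:
            remaining = self.interval - (now - self.last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last = now


# Usage: call throttle.wait() immediately before each E-utilities request.
throttle = Throttle(per_second=3)
```

Since `esummary` accepts a comma-separated list of IDs, batching several VariationIDs into one request reduces the number of calls the throttle has to space out.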
Goal: Annotate or look up thousands of variants without rate limits.
Approach: Download the weekly clinvar.vcf.gz (note: only first-Thursday-of-month is archived; for longitudinal stability pin to monthly archives), query by genomic coordinates with cyvcf2 or annotate VCFs with bcftools.
```bash
mkdir -p clinvar/$(date +%Y%m); cd clinvar/$(date +%Y%m)
wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz
wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz.tbi
```
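To pin an analysis to an archived snapshot, it helps to compute which release date that is. The sketch below takes the first-Thursday-of-month rule stated in this skill at face value and derives the most recent such date on or before a given day; the function names are illustrative, and the archive's file naming is not assumed here, so look up the actual path on the FTP site.

```python
import calendar
from datetime import date


def first_thursday(year, month):
    """Return the date of the first Thursday in the given month."""
    for day in range(1, 8):  # the first Thursday always falls within days 1-7
        candidate = date(year, month, day)
        if candidate.weekday() == calendar.THURSDAY:
            return candidate


def latest_archive_date(today):
    """Most recent first-Thursday date on or before `today`."""
    candidate = first_thursday(today.year, today.month)
    if candidate <= today:
        return candidate
    if today.month == 1:
        return first_thursday(today.year - 1, 12)
    return first_thursday(today.year, today.month - 1)


print(latest_archive_date(date.today()).isoformat())
```

Recording this date alongside results gives a citable release identifier even when the weekly file it was downloaded from is no longer available.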
```python
from cyvcf2 import VCF

clinvar = VCF('clinvar.vcf.gz')  # region queries need the .tbi index alongside


def lookup(chrom, pos, ref, alt):
    '''Look up by GRCh38 coords. Returns variant-level (VCV) aggregate; not RCV.

    ClinVar's VCF names contigs without a "chr" prefix ("17", not "chr17").
    '''
    for v in clinvar(f'{chrom}:{pos}-{pos}'):
        if v.REF == ref and alt in v.ALT:
            info = v.INFO
            return {
                'variation_id': v.ID,               # VCF ID column holds the ClinVar VariationID
                'allele_id': info.get('ALLELEID'),  # ClinVar AlleleID, which is not the VCV
                'clnsig': info.get('CLNSIG'),
                'clnsig_conf': info.get('CLNSIGCONF'),
                'clnrevstat': info.get('CLNREVSTAT'),
                'clndn': info.get('CLNDN'),
                'clnvc': info.get('CLNVC'),
                'clnhgvs': info.get('CLNHGVS'),
                'clndisdb': info.get('CLNDISDB'),
                'oncdn': info.get('ONCDN'),
                'scidn': info.get('SCIDN'),
            }
    return None
```
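The `CLNSIGCONF` value returned by a lookup like the one above is a packed string of per-classification submitter counts. A minimal sketch for unpacking it follows; the separator between entries has varied, so the regular expression keys on the `label(count)` shape rather than a specific delimiter. `parse_clnsigconf` and `conflict_kind` are illustrative helpers, and the conflict labels they return are this sketch's own shorthand.

```python
import re


def parse_clnsigconf(value):
    """Turn 'Pathogenic(1)|Uncertain_significance(2)' into {'Pathogenic': 1, ...}."""
    if not value:
        return {}
    pairs = re.findall(r"([^()]+)\((\d+)\)", value)
    return {label.strip("|,; "): int(count) for label, count in pairs}


def conflict_kind(counts):
    """Summarise which sides of the spectrum disagree."""
    labels = {label.lower() for label in counts}
    pathogenic = bool(labels & {"pathogenic", "likely_pathogenic"})
    benign = bool(labels & {"benign", "likely_benign"})
    if pathogenic and benign:
        return "P/LP vs B/LB"
    if pathogenic:
        return "P/LP vs other"
    if benign:
        return "B/LB vs other"
    return "other"


counts = parse_clnsigconf("Pathogenic(1)|Uncertain_significance(2)")
print(counts, conflict_kind(counts))
```

Separating a P/LP versus B/LB disagreement from a P/LP versus VUS one matters because the two call for different follow-up, and a bare "Conflicting" label hides which case applies.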
```bash
bcftools annotate \
  -a clinvar.vcf.gz \
  -c INFO/CLNSIG,INFO/CLNREVSTAT,INFO/CLNDN,INFO/CLNVC,INFO/CLNHGVS,INFO/CLNSIGCONF \
  input.vcf.gz -O z -o annotated.vcf.gz
bcftools index -t annotated.vcf.gz
```
ClinGen Allele Registry (https://reg.clinicalgenome.org/) computes a build-agnostic, transcript-agnostic CA ID (format CA######) for any allele projectable onto NCBI references (GRCh37, GRCh38, T2T-CHM13, any RefSeq transcript). The Registry covers ~700M+ alleles, vastly more than ClinVar. CA ID and ClinVar VariationID are one-to-one when a variant exists in ClinVar.
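```python
import requests


def car_id(hgvs_g):
    '''Resolve HGVS-g to canonical ClinGen Allele Registry CA ID.

    Uses the read-only GET query (?hgvs=...); PUT on this endpoint is the
    registration call and requires Registry credentials.
    '''
    r = requests.get('https://reg.clinicalgenome.org/allele',
                     params={'hgvs': hgvs_g},  # requests URL-encodes '>' and ':'
                     timeout=30)
    if not r.ok:
        return None
    ca = r.json().get('@id', '').split('/')[-1]
    # Alleles not yet registered come back without a real CA number.
    return ca if ca.startswith('CA') else None
```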
Use CA ID for any join touching non-ClinVar resources (gnomAD, dbSNP, COSMIC, MAVEdb). VariationID was renumbered during the 2017 ClinVar schema redesign; treating it as a stable cross-build identifier is unsafe.
1. Treating CLNSIG as gospel for condition-specific work
- Wrong: read `CLNSIG=Pathogenic` from `clinvar.vcf.gz` and report the variant as pathogenic for the patient's specific phenotype.
- Fix: parse RCV-level XML (`<RCVAccession>` per condition); cross-check `CLNDN` and report per-condition classifications.

2. Parsing legacy XML against 2024 schema
- Wrong: a parser that targets `<ClinVarSet>` or `<ClinicalSignificance>`.
- Why: the 2024 schema uses `<VariationArchive>` + germline/somatic/oncogenicity tripartite classifications.
- Fix: re-target to `<VariationArchive>` and read `GermlineClassification`, `SomaticClinicalImpact`, `OncogenicityClassification` separately.

3. Counting variant_summary.txt rows naively
- Wrong: `wc -l variant_summary.txt` to estimate variant count.
- Fix: restrict to one assembly first: `awk -F'\t' '$17=="GRCh38"' variant_summary.txt | wc -l`.

4. Trusting VariationID as a stable cross-build identifier
- Fix: use the ClinGen Allele Registry CA ID for any join that touches non-ClinVar resources.

5. Ignoring star-rating override hierarchy
- Fix: rank records by `review_status` (4>3>2>1>0); use the highest-star record. For ties, sort by date.

6. Aggregating "Conflicting" without inspecting the conflict
- Wrong: treating `CLNSIG=Conflicting interpretations` as VUS.
- Fix: read `CLNSIGCONF` to see the exact conflict; weight by submitter star.

7. Missing somatic interpretations
- Wrong: reading only `CLNSIG`.
- Why: somatic classifications are carried in separate fields (`ONCDN`, `SCIDN`, `CLNSIGSOMATIC`) since 2024.
- Fix: also read `ONCDN` (oncogenicity disease name), `SCIDN` (somatic clinical impact disease name), and the somatic-specific significance fields.

| Pattern | Likely cause | Action |
|---------|-------------|--------|
| ClinVar P vs gnomAD AF > 1% | Variant is true founder allele in unstratified gnomAD subset, OR ClinVar P is a stale low-star assertion | Check grpmax_faf95 excluding bottleneck groups; check ClinVar star rating |
| ClinVar P vs AlphaMissense < 0.1 | Variant in NMD-escape region, alternative isoform, or ClinVar P is mis-curated | Check Pejaver 2022 calibration in acmg-classification skill; cross-check VCEP |
| VCEP 3-star P vs commercial-lab 1-star B | VCEP supersedes for clinical action | Use VCEP; flag submitter for resubmission |
| ClinVar VCV-level P vs RCV-level VUS for actual condition | VCV averages across conditions | Always report at RCV level for clinical action |
| ClinVar P vs LOVD/HGMD discordant | LOVD/HGMD use different classification systems; HGMD "DM" != ACMG P | Triangulate against published evidence; do not auto-translate labels |
| ClinVar P missing for a known disease variant | Submission lag (~6-12 months typical for new findings) | Check published literature; flag for ClinVar submission |
| Threshold | Convention | Source |
|-----------|-----------|--------|
| Star >= 2 | Acceptable confidence for clinical action without further review | ClinGen SVI operational guidance |
| Star = 3 | VCEP-curated; supersedes lower-star records | ClinGen FDA Recognition 2018 |
| CLNSIG includes 'Pathogenic' OR 'Likely_pathogenic' | Treat as actionable for ACMG | ClinVar field schema |
| CLNSIGCONF present | Multiple SCVs disagree; do NOT auto-action | ClinVar field schema |
| Monthly archive | Use first-Thursday-of-month FTP snapshot for reproducible analyses | NCBI FTP retention policy |
| Submission staleness | Re-check classification annually for active diagnostic variants | Yauy 2022 Genet Med (Genome Alert!) |
| AF > 5% in gnomAD | BA1 standalone benign per ClinGen SVI default (VCEP overrides exist) | Richards 2015; SVI specs |
| 1247 changes/month | Median variants with classification change per release | Yauy 2022 |
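The first four rows of the table above combine into a simple gate for whether a VCF annotation can be acted on without manual review. This is a minimal sketch under that reading of the table; `auto_actionable` is an illustrative name, the star count is assumed to have been derived from `CLNREVSTAT` beforehand, and terms are matched exactly after splitting so that compound or conflicting labels are not picked up by substring.

```python
import re

ACTIONABLE_TERMS = {"Pathogenic", "Likely_pathogenic"}


def auto_actionable(clnsig, star_count, clnsigconf=None):
    """True only if P/LP is asserted, stars >= 2, and no conflict detail is present."""
    if clnsigconf:        # CLNSIGCONF present: submitters disagree, never auto-action
        return False
    if star_count < 2:    # below the confidence threshold in the table
        return False
    terms = set(re.split(r"[/|,]", clnsig or ""))
    return bool(terms & ACTIONABLE_TERMS)


print(auto_actionable("Pathogenic/Likely_pathogenic", 2))             # True
print(auto_actionable("Pathogenic", 1))                               # False: single submitter
print(auto_actionable("Pathogenic", 3, "Pathogenic(1)|Benign(1)"))    # False: conflict recorded
```

A `False` result here means "route to review", not "benign"; the gate only decides whether a human needs to look.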
The 2024 schema separates three orthogonal classifications, each with its own ReviewStatus and DateLastEvaluated:

- GermlineClassification
- SomaticClinicalImpact
- OncogenicityClassification

A single VCV can carry all three with distinct evaluations; the legacy "Pathogenic" label is now ambiguous if not qualified by classification type.
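Reading the three classification types separately might look like the sketch below. The element names come from this document, but the nesting (`ClassifiedRecord/Classifications`), the `Description` and `ReviewStatus` child elements, and `DateLastEvaluated` being an attribute are assumptions about the v2 layout that should be confirmed against the current XSD or a real record before use. The XML here is synthetic.

```python
import xml.etree.ElementTree as ET

KINDS = ("GermlineClassification", "SomaticClinicalImpact", "OncogenicityClassification")


def classifications(archive):
    """Return {classification type: fields} for one <VariationArchive> element."""
    out = {}
    # Assumed location of the variant-level aggregate classifications.
    container = archive.find("./ClassifiedRecord/Classifications")
    if container is None:
        return out
    for elem in container:
        if elem.tag in KINDS:
            out[elem.tag] = {
                "description": elem.findtext("Description"),
                "review_status": elem.findtext("ReviewStatus"),
                "last_evaluated": elem.get("DateLastEvaluated"),
            }
    return out


# Synthetic record for illustration; not copied from ClinVar.
SAMPLE = """
<VariationArchive>
  <ClassifiedRecord>
    <Classifications>
      <GermlineClassification DateLastEvaluated="2023-05-01">
        <ReviewStatus>reviewed by expert panel</ReviewStatus>
        <Description>Pathogenic</Description>
      </GermlineClassification>
      <OncogenicityClassification DateLastEvaluated="2024-02-10">
        <ReviewStatus>criteria provided, single submitter</ReviewStatus>
        <Description>Oncogenic</Description>
      </OncogenicityClassification>
    </Classifications>
  </ClassifiedRecord>
</VariationArchive>
"""

result = classifications(ET.fromstring(SAMPLE))
print(sorted(result))
```

Returning a dictionary keyed by classification type forces downstream code to say which "Pathogenic" it means, which is the ambiguity the paragraph above warns about.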
| Symptom | Cause | Solution |
|---------|-------|----------|
| Empty result from efetch db=clinvar | rsID passed where VariationID expected | Use esearch first to resolve rsID to VariationID |
| CLNSIG is _None or comma-separated mess | Variant has multi-condition RCVs; VCF collapses them | Parse RCV XML for per-condition values |
| Variant present in ClinVar XML but absent from VCF | Variant lacks GRCh38 coordinates (legacy GRCh37-only submission) | Check <VariationArchive><SequenceLocation> per assembly |
| 2024-format XML parser silently emits zero records | XML schema v2 incompatibility | Re-target to <VariationArchive> |
| Conflicting interpretations with same star rating across two submitters | True scientific disagreement; sometimes resolved by VCEP later | Apply Tavtigian point system to manually reconcile; flag for VCEP review |
| Variant has CA ID but no VariationID | Variant in Allele Registry but never submitted to ClinVar | Use AlleleRegistry as canonical; submit to ClinVar if novel pathogenic |
| CLNSIG says Pathogenic but no associated condition CLNDN | Orphan classification (older submissions) | Treat as low confidence; cross-check publication |
| Variants pulled by gene return only some isoforms | RefSeq transcript priority differences | Use MANE Select transcript explicitly; cross-check with VEP --mane_select |
| Pushback | Standard response |
|----------|-------------------|
| "Why is this pathogenic variant 1-star?" | We report star rating per submission; clinical action requires star >=2 OR VCEP curation per ClinGen SVI 2018. |
| "ClinVar says P but gnomAD AF = 2%" | Reconciled via Whiffin FAF95 max-credible-AF framework; bottleneck-group rule applied. |
| "This VCV count differs from ClinVar.gov" | We pulled from the monthly archive (first-Thursday-of-month) for reproducibility; the live web is post-most-recent-weekly. |
| "Why wasn't the somatic variant flagged?" | Pre-2024 XML schema had no separate somatic field; we now read ONCDN/SCIDN/SomaticClinicalImpact per v2 schema. |
| "VarSome says LP but this says VUS" | Tool-specific aggregation rule differences; VarSome auto-applies PP3+PM2 by default per Tavtigian point system; we apply VCEP-specific PP3 calibration per CSpec. |
| "rsID match returned wrong variant" | rsID is a cluster identifier; multi-allelic rsIDs require allele-level resolution; we use SPDI or CA ID. |
| "Why retest a 2022-curated variant?" | Submissions go stale, with a median 5-year reclassification cycle in active genes (Harrison 2017); ClinGen recommends annual re-review for active diagnostic variants. |
- https://reg.clinicalgenome.org/docs/cg-car/
- https://cspec.genome.network/cspec/ui/svi/all