clinical-databases/myvariant-queries/SKILL.md
Queries myvariant.info BioThings aggregator for ClinVar, gnomAD, dbSNP, dbNSFP, COSMIC, CADD, and CIViC annotations in batched, version-tracked requests. Use when annotating variant lists from multiple databases simultaneously without managing per-source APIs, and when reproducibility-grade analyses require recording source data versions via _meta.
npx skillsauth add GPTomics/bioSkills bio-clinical-databases-myvariant-queriesInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: myvariant 1.0.0+, requests 2.31+, pandas 2.2+. myvariant.info aggregates >=21 sources; the operative version of each source is queryable via the _meta field and the /v1/metadata endpoint.
Before using code patterns, verify installed versions match. If versions differ:
pip show <package> then help(module.function) to check signaturesIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying. dbNSFP version drift is the dominant staleness vector: AlphaMissense was added to dbNSFP v4.4 (~2024); querying dbnsfp.alphamissense.score returns whatever version of dbNSFP is currently loaded; check _meta.src.dbnsfp.version.
'Annotate my variants with ClinVar + gnomAD + CADD + AlphaMissense in one batch' -> Query the BioThings myvariant.info aggregator with field selection and version tracking, then parse nested responses.
myvariant.MyVariantInfo().getvariant(hgvs_or_rsid, fields=['clinvar', 'gnomad_exome', 'dbnsfp'])mv.getvariants(ids_list, fields=...); up to 1000 IDs per requestmv.query('clinvar.gene.symbol:BRCA1 AND clinvar.clinical_significance:Pathogenic')GET https://myvariant.info/v1/variant/{hgvs_or_id}?fields=...POST https://myvariant.info/v1/variant with comma-separated IDsmyvariant.info is one of three flagship BioThings APIs (with MyGene.info and MyChem.info). All three share the BioThings SDK, which auto-deploys an Elasticsearch index from heterogeneous source files via per-source dataloaders. The 2022 paper formalized the SDK; the architecture itself is older (Xin 2016 Genome Biol).
dbnsfp.cadd.phred:>20)_id field is canonical HGVS-g per record (e.g., chr7:g.117199644G>A)| Source | What | Notes | |--------|------|-------| | ClinVar | Pathogenicity | Weekly refresh | | gnomAD v4 exomes + genomes | Population AF | grpmax_faf95 surfaced | | dbSNP Build 156 | rsID + alleles | RsMergeArch resolved | | dbNSFP v4.x | Meta-aggregator of 40+ in silico predictors | Includes AlphaMissense, REVEL, BayesDel | | CADD | Deleteriousness | Genome-wide | | CIViC | Cancer interpretation | Per-disease | | COSMIC | Somatic variants | Catalogue of Somatic Mutations | | EVS | Exome Variant Server | Legacy (deprecated by gnomAD) | | ExAC | ExAC frequencies | Legacy (superseded by gnomAD) | | GRASP | GWAS associations | -- | | GWAS Catalog | Curated GWAS | -- | | Wellderly | Disease-resistant elderly cohort | -- | | EMV | -- | -- | | DOCM | Database of Curated Mutations | -- | | ICGC | International cancer | -- | | MutDB | -- | -- | | GO | Gene Ontology | -- | | Snpeff | snpEff annotations | -- | | GeneReviews | Disease/gene reviews | -- | | MutPred | Functional impact | -- |
dbNSFP is itself an aggregator. Querying dbnsfp.alphamissense.score returns the version that dbNSFP loaded, not AlphaMissense direct. The lag from publication (Cheng 2023 Science) to integration into myvariant.info is typically 6-18 months via dbNSFP.
| Endpoint | Method | Use |
|----------|--------|-----|
| /v1/variant/{id} | GET | Single canonical-ID lookup |
| /v1/variant | POST (batched IDs) | Batch lookup, up to 1000 IDs |
| /v1/query?q={lucene} | GET | Flexible Elasticsearch search |
| /v1/metadata | GET | Per-source versions |
Scopes (the scopes parameter on /v1/query POST) specifies which fields to match an input ID against: hgvs, rsid, dbsnp.rsid, dbnsfp.genename, chrom, _id. The _id is canonical HGVS-g.
_meta FieldEvery record carries _meta.src showing per-source version:
mv = myvariant.MyVariantInfo()
record = mv.getvariant('chr7:g.140453136A>T', fields=['_meta', 'clinvar', 'dbnsfp.alphamissense'])
print(record['_meta']['src']['dbnsfp']['version']) # e.g., '4.7a'
print(record['_meta']['src']['clinvar']['version']) # e.g., '20250901'
For reproducibility, record per-source versions in analysis output alongside results.
| Tool | Approach | When to use | |------|----------|-------------| | myvariant.info | Cloud aggregator, ES-backed | Quick batch annotation, no local setup | | OpenCRAVAT (Pagel 2020 Cancer Res) | Local install, modular annotators | Offline / PHI-sensitive | | VarSome (Kopanos 2019 Bioinformatics 35:1978; commercial) | Hosted, 22 sources | 82% ACMG criteria auto-application (highest); clinical labs | | Franklin / Genoox | Commercial hosted | 59 data sources; family/cohort analysis | | GeneBe.net (Stawinski 2024 Clin Genet) | Open-source web + API | Free Tavtigian-point-system-based ACMG; comparable to VarSome | | ANNOVAR / VEP / snpEff | Local annotation tools | Pipeline integration, batch annotation, no ACMG |
myvariant.info does NOT produce ACMG calls; it is purely an annotation aggregator. Pair with InterVar, GeneBe, or the acmg-classification skill for classification.
| Scenario | Recommended path | Why |
|----------|------------------|-----|
| Single variant batch annotation | getvariant(hgvs, fields=...) | One call, all aggregated sources |
| 10-1000 variants | getvariants(list, fields=...) | Batch endpoint, up to 1000 |
| > 1000 variants | Chunk to 1000 + sleep | Rate limit + JSON size |
| Search by gene + pathogenicity | mv.query('clinvar.gene.symbol:BRCA1 AND clinvar.clinical_significance:Pathogenic', size=200) | Elasticsearch Lucene |
| ACMG-grade pipeline | myvariant for annotation -> InterVar / GeneBe for classification | myvariant does not produce ACMG calls |
| Offline / PHI-sensitive | OpenCRAVAT or VEP locally | myvariant requires HTTP |
| Reproducibility | Always record _meta.src.<source>.version | dbNSFP version is the dominant staleness vector |
| Source-specific deep dive | Use the source-specific skill (clinvar-lookup, gnomad-frequencies) | myvariant is aggregator-grade, not source-deep |
Goal: Annotate a list of variants with the canonical clinical fields for downstream prioritization.
Approach: Batch getvariants with explicit field list; record _meta versions; convert to DataFrame.
import myvariant
import pandas as pd
mv = myvariant.MyVariantInfo()
CLINICAL_FIELDS = [
'clinvar.clinical_significance',
'clinvar.review_status',
'clinvar.variant_id',
'gnomad_exome.faf95',
'gnomad_exome.af.af',
'gnomad_exome.an.an',
'gnomad_genome.faf95',
'gnomad_genome.af.af',
'dbsnp.rsid',
'dbnsfp.alphamissense.score',
'dbnsfp.alphamissense.pred',
'dbnsfp.revel.score',
'dbnsfp.cadd.phred',
'dbnsfp.spliceai.master_pred',
'dbnsfp.spliceai.ds_max',
'cosmic.cosmic_id',
'civic.openCravatUrl',
'_meta'
]
def annotate_variant_list(hgvs_list):
'''Batch-annotate variants with ClinVar / gnomAD / dbNSFP / COSMIC / CIViC fields.'''
chunked = [hgvs_list[i:i+1000] for i in range(0, len(hgvs_list), 1000)]
rows = []
versions = None
for chunk in chunked:
results = mv.getvariants(chunk, fields=CLINICAL_FIELDS)
for r in results:
if versions is None and r.get('_meta'):
versions = {src: meta.get('version') for src, meta in r['_meta'].get('src', {}).items()}
clinvar = r.get('clinvar', {}) or {}
gnomad_e = r.get('gnomad_exome', {}) or {}
gnomad_g = r.get('gnomad_genome', {}) or {}
dbnsfp = r.get('dbnsfp', {}) or {}
faf95 = (gnomad_e.get('faf95', {}) or gnomad_g.get('faf95', {})) or {}
rows.append({
'variant': r.get('query'),
'clinvar_sig': clinvar.get('clinical_significance'),
'clinvar_review': clinvar.get('review_status'),
'gnomad_grpmax_faf95': faf95.get('popmax'),
'grpmax_ancestry': faf95.get('popmax_population'),
'gnomad_af': gnomad_e.get('af', {}).get('af') or gnomad_g.get('af', {}).get('af'),
'rsid': r.get('dbsnp', {}).get('rsid'),
'alphamissense': dbnsfp.get('alphamissense', {}).get('score'),
'revel': dbnsfp.get('revel', {}).get('score'),
'cadd_phred': dbnsfp.get('cadd', {}).get('phred'),
'spliceai_ds_max': dbnsfp.get('spliceai', {}).get('ds_max')
})
return pd.DataFrame(rows), versions
Goal: Search beyond canonical IDs; e.g., all pathogenic variants in a gene, all variants in a genomic region with CADD > 20.
Approach: Lucene syntax in mv.query(); support boolean operators, ranges, wildcards.
def find_pathogenic_in_gene(gene_symbol, max_results=500):
'''Find ClinVar P/LP variants in a gene.'''
query = f'clinvar.gene.symbol:{gene_symbol} AND '\
'clinvar.clinical_significance:(Pathogenic OR "Likely pathogenic")'
hits = mv.query(query, size=max_results, fields=['_id', 'clinvar.clinical_significance',
'clinvar.review_status'])
return hits.get('hits', [])
def find_high_cadd_in_region(chrom, start, end, min_cadd=25):
'''Find variants in region with CADD phred above threshold.'''
query = f'chrom:{chrom} AND hg19.start:[{start} TO {end}] AND '\
f'dbnsfp.cadd.phred:>{min_cadd}'
return mv.query(query, size=500, fields=['_id', 'dbnsfp.cadd.phred', 'clinvar.clinical_significance'])
def find_alphamissense_pathogenic(gene, min_score=0.564):
'''Find AlphaMissense pathogenic missense in a gene.
Note: Cheng 2023 dev cutoff is 0.564 BUT this is NOT the Pejaver-style calibrated
PP3 threshold. ClinGen has not endorsed AlphaMissense thresholds as of May 2026;
use AlphaMissense as supporting evidence only.
'''
query = f'dbnsfp.genename:{gene} AND dbnsfp.alphamissense.score:>{min_score}'
return mv.query(query, size=500, fields=['_id', 'dbnsfp.alphamissense', 'clinvar.clinical_significance'])
1. Stale dbNSFP version
dbnsfp.alphamissense.score and trust as current._meta.src.dbnsfp.version; for cutting-edge predictions query AlphaMissense API directly.2. Treating AlphaMissense dev threshold as PP3-calibrated
clinical-databases/acmg-classification for calibrated thresholds.3. Stacking REVEL + BayesDel + AlphaMissense as independent evidence
4. Rate-limit ignorance
getvariant().getvariants(chunk, fields=...) with chunk size 1000; sleep ~0.5s between chunks.5. Field-path errors silently return None
gnomad_exome.faf95.popmax but typo as gnomad_exome.faf or gnomad.exomes.faf95.print(mv.getvariant(test_id)) first to inspect actual field structure; check /v1/metadata/fields.6. Multi-allelic rsID returns one variant only
rs12345 and treat returned variant as the variant of interest.7. Sample overlap between sources
| Pattern | Likely cause | Action |
|---------|-------------|--------|
| dbNSFP REVEL != ClinVar PP3 strength | Different curation cohort | Use Pejaver 2022 calibrated thresholds (see acmg-classification) |
| ClinVar P + AlphaMissense benign | NMD-escape region, alternative isoform, ClinVar P stale | Cross-check with conservation, splicing predictions |
| gnomAD AF differs across exome vs genome | Sample sizes differ; exome has 730k, genome 76k | Use exome FAF95 when available; genome as fallback |
| COSMIC + ClinVar overlap | True dual-classification (germline + somatic) | Report both contexts |
| Variant missing from one source | Source-specific coverage gaps | Cross-check directly with primary source skill (clinvar-lookup, gnomad-frequencies) |
| Threshold | Convention | Source |
|-----------|-----------|--------|
| Batch endpoint cap | 1000 IDs per POST | myvariant.info docs |
| Rate limit | ~1000 req/sec aggregate; lower per IP | myvariant.info docs |
| dbNSFP refresh lag | 6-18 months from primary source release | dbNSFP release history |
| _meta.src field | Per-source version is always available | BioThings SDK convention |
| Lucene escape | Special chars need \ (e.g., chr7\:140453136) | Elasticsearch convention |
| Multi-allelic rsID | ~6-8% of dbSNP rsIDs are multi-allelic | Phan 2025 NAR |
| AlphaMissense PP3 calibration | NOT yet ClinGen-endorsed (as of May 2026) | ClinGen SVI |
| REVEL PP3_Strong calibration | >= 0.932 per Pejaver 2022 | Pejaver 2022 AJHG |
| Symptom | Cause | Solution |
|---------|-------|----------|
| KeyError: 'gnomad_exome' | Variant absent from gnomAD exome dataset | Use .get('gnomad_exome', {}) defensively |
| None for AlphaMissense on rare variants | dbNSFP coverage gap; variant in alt-spliced isoform | Query AlphaMissense API directly, or accept None |
| Search returns 0 hits despite known matches | Lucene escape on : in chr coords | Quote the chrom-position term or escape : |
| Batch returns < input IDs | Some IDs not in any source | Check notfound field in response |
| Different AF in myvariant vs gnomAD browser | dbNSFP version != current gnomAD release | Check _meta.src.gnomad_exome.version |
| 503 on bulk query | Rate limit | Reduce chunk to 500; sleep 1s between |
| _id doesn't match input | myvariant uses canonical HGVS-g; input was rsID or non-canonical | Re-query by _id after first resolution |
| Pushback | Standard response |
|----------|-------------------|
| "myvariant.info is just an aggregator; why not query sources directly?" | Aggregator avoids per-source API setup; sufficient for single-source unique queries we defer to source-specific skills. |
| "This annotation differs from VarSome" | VarSome uses its own ACMG implementation; myvariant.info does NOT produce ACMG calls; we pair with acmg-classification. |
| "dbNSFP REVEL differs from REVEL website" | dbNSFP version is on the order of 1 year behind primary; check _meta.src.dbnsfp.version. |
| "AlphaMissense calibration thresholds were missed" | AlphaMissense is integrated via dbNSFP; PP3 calibration is in clinical-databases/acmg-classification skill. |
| "Why not OpenCRAVAT?" | OpenCRAVAT requires local install; myvariant is faster for batch annotation. Switch to OpenCRAVAT for PHI-sensitive or offline workflows. |
https://docs.myvariant.info/en/latest/https://myvariant.info/v1/metadata/fieldstesting
Analyze multi-modal single-cell data (CITE-seq, Multiome, spatial). Use when working with data that measures multiple modalities per cell like RNA + protein or RNA + ATAC. Use when analyzing CITE-seq, Multiome, or other multi-modal single-cell data.
data-ai
Analyze metabolite-mediated cell-cell communication using MeboCost for metabolic signaling inference between cell types. Predict metabolite secretion and sensing patterns from scRNA-seq data. Use when studying metabolic crosstalk between cell populations or metabolite-receptor interactions.
development
Find marker genes and annotate cell types in single-cell RNA-seq using Seurat (R) and Scanpy (Python). Use for differential expression between clusters, identifying cluster-specific markers, scoring gene sets, and assigning cell type labels. Use when finding marker genes and annotating clusters.
development
Reconstruct cell lineage trees from CRISPR barcode tracing or mitochondrial mutations. Use when studying clonal dynamics, cell fate decisions, or developmental trajectories.