clinical-databases/dbsnp-queries/SKILL.md
Resolves rsIDs, navigates RsMergeArch/SNPHistory merge chains, and converts between rsID, SPDI, HGVS, and VCF representations using the dbSNP Build 156 JSON architecture. Use when normalizing variant identifiers, joining variant databases by cluster ID, or tracking deprecated rsIDs through historical merges.
npx skillsauth add GPTomics/bioSkills bio-clinical-databases-dbsnp-queriesInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: myvariant 1.0+, requests 2.31+, biopython 1.83+, Entrez Direct 21.0+. dbSNP Build 156 (September 2022) is the current schema; Build 151 (2017) was the last with relational SQL dumps. Builds 152-155 dual-released JSON+SQL; 156+ is JSON-only.
Before using code patterns, verify installed versions match. If versions differ:
pip show <package> then help(module.function) to check signatures<tool> --version then <tool> --help to confirm flagsIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying. The Variation Services REST API uses path-based versioning (/variation/v0/); E-utilities db=snp returns thin legacy summaries missing build-156 schema fields.
'Look up this rsID / normalize variant representations' -> Resolve rsIDs through merge chains, compute canonical SPDI, and convert between rsID, HGVS-g, HGVS-c, and VCF allele representations.
myvariant.MyVariantInfo().getvariant(rsid, fields=['dbsnp', 'clinvar', 'gnomad_exome'])requests.get(f'https://api.ncbi.nlm.nih.gov/variation/v0/refsnp/{rsid_int}')Bio.Entrez.esearch(db='snp', term=rsid); returns thin summaryftp.ncbi.nlm.nih.gov/snp/latest_release/JSON/refsnp-chr{N}.json.bz2This is the load-bearing concept. dbSNP cluster definition: ss records (submitted SNPs) are mapped to the genome and clustered into RefSNPs by position + variant type, not by allele. A single rsID can point to a locus with multiple alleles:
rs12345 may resolve to {A>G, A>T, A>C} at one position; the RefSNP JSON primary_snapshot_data.placements_with_allele[*].alleles enumerates them.Rule: Use rsID as a human-facing label only; use SPDI or ClinGen Allele Registry CA ID for joins.
| Aspect | Build 151 (2017) | Build 156 (2022) and current |
|--------|------------------|------------------------------|
| Distribution | Relational SQL dumps + XML | JSON per RefSNP, partitioned by chromosome |
| FTP path | ftp/snp/organisms/human_9606/ | ftp.ncbi.nlm.nih.gov/snp/latest_release/JSON/ |
| Primary key | snp_id, ss_id | refsnp_id, with primary_snapshot_data block |
| Frequency data | Embedded sparse | ALFA aggregated populations |
| Merge tracking | RsMergeArch.bcp.gz | refsnp-merged.json.bz2 (also RsMergeArch.bcp.gz retained for legacy) |
| Withdrawn | SNPHistory.bcp.gz | refsnp-withdrawn.json.bz2 |
| API access | Legacy E-utilities db=snp only | Variation Services REST /v0/refsnp/{id} returns the full JSON |
E-utilities still works for db=snp but returns a thin pre-156 summary missing key fields like primary_snapshot_data.placements_with_allele; pipelines reliant on Entrez get out-of-date data.
When two rsIDs are found to refer to the same allele cluster, the higher (later-assigned) rsID is merged into the lower. RsMergeArch.bcp.gz stores (rsHigh, rsLow, rsCurrent) tuples.
The trap: rsCurrent in any given row is the merge target at the time of that merge event, NOT the current dbSNP rsID. A multi-merge chain (rs3 -> rs2 -> rs1, then later rs1 -> rs0) appears as multiple rows. Naive one-hop lookup resolves to a stale ID.
Withdrawn rsIDs (submitter-withdrawn or QC-failed) live in SNPHistory.bcp.gz, not RsMergeArch. Both tables must be consulted to resolve any historical rsID.
SPDI (Sequence:Position:Deletion:Insertion) format: NC_000017.11:43044294:G:A. Position is 0-based, half-open (differs from HGVS's 1-based, fully-closed).
The Contextual Allele transformation (Variant Overprecision Correction Algorithm) returns the right-aligned, normalized canonical form across left-aligned VCFs and right-aligned HGVS conventions. This is the basis for ClinGen Allele Registry CA ID computation.
| Representation | Build dependency | Transcript dependency | Bijective? | Best for |
|----------------|-----------------|----------------------|------------|----------|
| VCF (chrom-pos-ref-alt) | Yes | No | Yes (same build) | Pipelines, bulk |
| SPDI | Yes (via RefSeq accession) | No | Yes for SNV/small indel | Canonical normalization |
| HGVS-g | Yes (NC_xxxxx.N) | No | Yes for SNV/small indel | Human-readable genomic |
| HGVS-c | Indirect (via transcript) | Yes | No (one HGVS-c -> many HGVS-g) | Clinical reporting |
| HGVS-p | Indirect | Yes | Degenerate (one HGVS-p -> many HGVS-c) | Protein-level annotation |
| rsID | None (cluster identifier) | None | NO (multi-allelic) | Human label only |
| CA ID | None (canonical) | None | Yes | Cross-database join |
| Scenario | Recommended path | Why |
|----------|------------------|-----|
| Resolve single rsID to coordinates + alleles | Variation Services /v0/refsnp/{id} | Returns full Build 156 JSON, including merge history |
| Resolve historical/deprecated rsID | Variation Services /v0/refsnp/{id} -> follow merged_snapshot_data chain | Single-hop RsMergeArch lookup misses multi-hop chains |
| Batch query 100-10k rsIDs | myvariant.info getvariants(rsids) | Aggregated with ClinVar/gnomAD overlay; rate-limit safe |
| Convert coords <-> rsID | myvariant.info HGVS query or Variation Services /spdi/{spdi}/rsid | SPDI is the canonical bridge |
| Normalize variant representations | Variation Services /hgvs/{hgvs}/contextuals | Returns canonical SPDI, right-aligned |
| Bulk genomic-wide rsID -> coords | Local download of refsnp-chr{N}.json.bz2 + parser | No rate limits; weekly snapshots |
| Joining dbSNP with gnomAD by ID | Use SPDI or CA ID, never rsID alone | rsID is a cluster; alleles may not match |
| Get population AF for common variant | ALFA (via Variation Services) for array-genotyped variants; gnomAD for sequencing-derived | Different sample compositions |
Goal: Resolve an rsID to full Build 156 RefSNP JSON, including coordinates, alleles, gene context, and merge history.
Approach: Hit Variation Services /v0/refsnp/{id_without_rs}; the response includes primary_snapshot_data (current) and merged_snapshot_data (if this rsID is itself a merge target).
import requests
VARSVC = 'https://api.ncbi.nlm.nih.gov/variation/v0'
def refsnp(rsid):
'''Fetch full Build 156 RefSNP JSON. rsid can be 'rs121913529' or 121913529.'''
rs_int = str(rsid).lstrip('rs')
r = requests.get(f'{VARSVC}/refsnp/{rs_int}', timeout=30)
if r.status_code == 404:
return None
r.raise_for_status()
return r.json()
def summarize_refsnp(payload):
'''Extract minimal fields. Handles multi-allelic cluster correctly.
The placement JSON nests assembly metadata; the precise path varies by
Build / API version. Common variants seen in the wild:
placement['seq_id_traits_by_assembly'][0]['assembly_name']
placement['placement_annot']['seq_id_traits_by_assembly'][0]['assembly_name']
Inspect the actual JSON returned for the current dbSNP Build before
relying on either path in production.
'''
if payload is None or payload.get('is_withdrawn'):
return None
primary = payload.get('primary_snapshot_data', {})
placements = primary.get('placements_with_allele', [])
def assembly_name(p):
traits = (p.get('placement_annot') or p).get('seq_id_traits_by_assembly') or []
return traits[0].get('assembly_name') if traits else ''
grch38 = next((p for p in placements if 'GRCh38' in (assembly_name(p) or '')), None)
if grch38 is None:
return None
alleles = []
for allele in grch38.get('alleles', []):
spdi = allele.get('allele', {}).get('spdi', {})
alleles.append({
'ref': spdi.get('deleted_sequence'),
'alt': spdi.get('inserted_sequence'),
'seq_id': spdi.get('seq_id'),
'pos_0based': spdi.get('position')
})
return {
'rsid': payload.get('refsnp_id'),
'gene': primary.get('allele_annotations', [{}])[0].get('assembly_annotation', [{}])[0].get('genes', [{}])[0].get('locus'),
'placements_grch38': alleles,
'is_multiallelic': len(alleles) > 2,
'merge_history': payload.get('merged_snapshot_data', [])
}
Goal: Resolve a possibly-deprecated rsID to the current canonical rsID, following the full merge chain.
Approach: Recursively follow merged_snapshot_data until the response has no further merge entries, with cycle detection.
def resolve_merge_chain(rsid, max_hops=10):
'''Follow multi-hop merge chain. Cycle-safe with max_hops cap.'''
seen = set()
current = str(rsid).lstrip('rs')
for _ in range(max_hops):
if current in seen:
return {'error': 'merge cycle detected', 'chain': list(seen)}
seen.add(current)
payload = refsnp(current)
if payload is None:
return {'error': 'not found', 'final_rsid': current, 'chain': list(seen)}
if payload.get('is_withdrawn'):
return {'status': 'withdrawn', 'final_rsid': current, 'chain': list(seen)}
primary = payload.get('primary_snapshot_data')
if primary is not None:
return {'status': 'resolved', 'final_rsid': payload.get('refsnp_id'), 'chain': list(seen)}
merged = payload.get('merged_snapshot_data', [])
if not merged:
return {'status': 'orphan', 'final_rsid': current, 'chain': list(seen)}
current = str(merged[0].get('merged_into', ''))
return {'error': 'hop limit', 'chain': list(seen)}
Goal: Move between variant representations using Variation Services as the canonical bridge.
Approach: SPDI endpoints handle build resolution and right-alignment; HGVS contextuals applies the Variant Overprecision Correction Algorithm.
def hgvs_to_spdi_canonical(hgvs):
'''Resolve HGVS to canonical SPDI via the Variant Overprecision Correction Algorithm.'''
r = requests.get(f'{VARSVC}/hgvs/{hgvs}/contextuals', timeout=30)
if not r.ok:
return None
contextuals = r.json().get('data', {}).get('spdis', [])
return contextuals[0] if contextuals else None
def spdi_to_rsid(spdi_str):
'''SPDI 'NC_000017.11:43044294:G:A' -> rsID if a cluster exists.'''
r = requests.get(f'{VARSVC}/spdi/{spdi_str}/rsids', timeout=30)
if not r.ok:
return None
rsids = r.json().get('data', {}).get('rsids', [])
return rsids[0] if rsids else None
def vcf_to_canonical_spdi(chrom, pos, ref, alt, assembly='GRCh38'):
'''VCF (1-based) -> SPDI (0-based, right-aligned).'''
refseq_map = {('1', 'GRCh38'): 'NC_000001.11', ('17', 'GRCh38'): 'NC_000017.11'}
refseq = refseq_map.get((str(chrom).lstrip('chr'), assembly))
if refseq is None:
return None
raw_spdi = f'{refseq}:{pos - 1}:{ref}:{alt}'
r = requests.get(f'{VARSVC}/spdi/{raw_spdi}/canonical_representative', timeout=30)
return r.json().get('data', {}).get('spdi') if r.ok else None
| Source | Sample basis | Variants covered | When to use | |--------|-------------|------------------|-------------| | ALFA | ~1M dbGaP subjects (array + WGS) across 12 ancestry groups | 447M+ sites; broader (includes array-only) | Common variants, dbGaP-deposited cohorts | | gnomAD v4 | 807k WGS+WES individuals across 9 ancestry groups | Sequencing-derived (deeper at rare variants) | Rare-variant FAF95, ACMG BS1/BA1 |
ALFA does NOT provide FAF95-style upper-bound CIs; raw AF only. ALFA captures consent-tier metadata enabling consent-respecting lookups for variants gnomAD doesn't carry.
def alfa_frequency(rsid, ancestry='Total'):
'''Pull ALFA per-population AF via Variation Services.'''
payload = refsnp(rsid)
if payload is None:
return None
freq_records = payload.get('primary_snapshot_data', {}).get('allele_annotations', [{}])[0].get('frequency', [])
alfa_records = [f for f in freq_records if 'ALFA' in f.get('study_name', '')]
for record in alfa_records:
if record.get('common_name') == ancestry:
return {
'allele': record.get('observation', {}).get('inserted_sequence'),
'count': record.get('allele_count'),
'total': record.get('total_count'),
'freq': record.get('allele_count') / record.get('total_count') if record.get('total_count') else None
}
return None
1. Treating rsID as a unique variant identifier
2. Single-hop RsMergeArch lookup
RsMergeArch.bcp.gz and treat rsCurrent as the final answer.merged_snapshot_data (handles multi-hop in one call).3. Confusing withdrawn vs merged
SNPHistory.bcp.gz, not RsMergeArch. They are NOT merged into anything; the cluster was QC-failed or submitter-retracted.is_withdrawn field in RefSNP JSON; flag the variant for manual review.4. E-utilities thin summary
Entrez.esummary(db='snp') and treat output as authoritative.db=snp returns a pre-Build-156 summary missing primary_snapshot_data.placements_with_allele, frequency data, and merge history./v0/refsnp/{id} for full JSON.5. SPDI 0-based vs HGVS 1-based mismatch
6. Strand/orientation ambiguity for A/T C/G variants
| Pattern | Likely cause | Action |
|---------|-------------|--------|
| dbSNP rsID returns 404 in current build | Withdrawn (in refsnp-withdrawn.json.bz2) | Check withdrawal reason; consider manual curation |
| rsID resolves to different coords across builds | Genome assembly change (GRCh37 -> GRCh38) | Use SPDI with explicit RefSeq accession; lift over via pyliftover |
| ALFA AF and gnomAD AF disagree by >2x | Different sample compositions; ALFA includes array-only sites under-represented in gnomAD | Trust gnomAD for sequencing data; ALFA for array-derived; use the more relevant source |
| Multiple rsIDs map to one SPDI | True duplicates from independent submissions; rare since Build 152 enforced cluster merging | Pick the lowest rsID per RsMergeArch convention |
| One rsID has different ref allele in dbSNP vs gnomAD | dbSNP uses NCBI's ref; gnomAD aligns to its build assembly | Normalize to SPDI before joining |
| Threshold | Convention | Source |
|-----------|-----------|--------|
| Build 156 | Current as of Sep 2022; JSON-only distribution | dbSNP NCBI |
| Multi-allelic rate | ~6-8% of dbSNP rsIDs are multi-allelic | Phan 2025 NAR |
| Variation Services rate limit | 10 req/s with API key; 3 req/s without | NCBI E-utilities policy |
| SPDI position | 0-based, half-open | NCBI SPDI specification |
| HGVS position | 1-based, fully-closed | HGVS nomenclature |
| ALFA samples | ~1M individuals across 12 ancestries (2024 release) | NCBI dbGaP aggregation |
| Bulk download chunks | One file per chromosome (refsnp-chr{N}.json.bz2) | NCBI FTP |
| Symptom | Cause | Solution |
|---------|-------|----------|
| 404 from Variation Services on valid rsID | rsID is withdrawn or never assigned | Check refsnp-withdrawn.json.bz2; consider strand-flipped equivalent |
| Merge chain resolves but final rsID has different alleles | Multi-allelic cluster; pick allele matching the variant | Filter placements_with_allele.alleles[*] by allele match |
| ALFA frequency missing for common variant | Variant not in dbGaP-deposited studies | Fall back to gnomAD (sequencing-derived) |
| Entrez.esummary returns old data | Legacy E-utilities, not Build 156 schema | Switch to Variation Services REST /v0/refsnp/{id} |
| HGVS-c conversion fails for synonymous variants | Some HGVS-c rely on non-MANE transcripts not in NCBI default | Specify transcript explicitly; use VEP --mane_select |
| SPDI for indel does not round-trip | Left/right alignment mismatch | Use /spdi/{spdi}/canonical_representative for normalization |
| Bulk JSON parse OOM | refsnp-chr1.json.bz2 is ~20GB uncompressed | Stream parse with bz2.BZ2File + line-by-line JSON; do not load whole file |
| Pushback | Standard response |
|----------|-------------------|
| "Why not just use rsID for the join?" | rsID is a cluster identifier; ~6-8% of clusters are multi-allelic, causing silent mismatches. We use SPDI / CA ID. |
| "This annotation says rs12345 but the literature says rs67890" | rsIDs are merged; we resolved through RsMergeArch / merged_snapshot_data to the current canonical rsID. |
| "dbSNP frequency != gnomAD frequency" | ALFA (dbSNP-embedded) and gnomAD use different sample sets and different ascertainment (array vs sequencing); reconciled per use case. |
| "Why wasn't Entrez used?" | Entrez db=snp returns the pre-Build-156 summary missing key fields; we use Variation Services REST for the full JSON. |
| "Coordinate off-by-one in the SPDI" | SPDI is 0-based half-open; VCF is 1-based; intentional conversion applied. |
https://api.ncbi.nlm.nih.gov/variation/v0/ftp.ncbi.nlm.nih.gov/snp/latest_release/JSON/https://www.ncbi.nlm.nih.gov/snp/docs/gsr/alfa/https://reg.clinicalgenome.org/docs/cg-car/testing
Analyze multi-modal single-cell data (CITE-seq, Multiome, spatial). Use when working with data that measures multiple modalities per cell like RNA + protein or RNA + ATAC. Use when analyzing CITE-seq, Multiome, or other multi-modal single-cell data.
data-ai
Analyze metabolite-mediated cell-cell communication using MeboCost for metabolic signaling inference between cell types. Predict metabolite secretion and sensing patterns from scRNA-seq data. Use when studying metabolic crosstalk between cell populations or metabolite-receptor interactions.
development
Find marker genes and annotate cell types in single-cell RNA-seq using Seurat (R) and Scanpy (Python). Use for differential expression between clusters, identifying cluster-specific markers, scoring gene sets, and assigning cell type labels. Use when finding marker genes and annotating clusters.
development
Reconstruct cell lineage trees from CRISPR barcode tracing or mitochondrial mutations. Use when studying clonal dynamics, cell fate decisions, or developmental trajectories.