skills/genomics-bioinformatics/dbsnp-database/SKILL.md
Query NCBI dbSNP for SNP records by rsID, gene, or region via E-utilities and Variation Services REST API. Retrieve alleles, MAF, variant class (SNV/indel/MNV), clinical links, cross-DB IDs (ClinVar, dbVar, 1000G). Free; 3 req/sec (10 with key). For clinical pathogenicity use clinvar-database; for population frequencies use gnomad-database.
npx skillsauth add jaechang-hits/sciagent-skills dbsnp-databaseInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
NCBI dbSNP is the primary public repository for short human genetic variants, cataloguing over 1 billion SNPs, indels, and MNVs with allele frequencies, functional annotations, and cross-references to ClinVar, gnomAD, and 1000 Genomes. Variants are identified by stable rsIDs (reference SNP cluster IDs). Access is free via two APIs: the legacy NCBI E-utilities and the newer NCBI Variation Services REST API, which returns structured JSON.
clinvar-database; dbSNP provides IDs and frequency but not curated clinical significancegnomad-database; dbSNP MAF is a single aggregate frequencyrequests, pandas, matplotlib, xml.etree.ElementTree (stdlib)rs80357906), gene symbols, or chromosomal coordinatesemail parameter)&api_key=YOUR_KEY to all requestspip install requests pandas matplotlib
# xml.etree.ElementTree is part of Python stdlib — no additional install needed
import requests
import json
EMAIL = "[email protected]" # required by NCBI policy
BASE_EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
BASE_VARIATION = "https://api.ncbi.nlm.nih.gov/variation/v0"
def fetch_snp_by_rsid(rsid: str) -> dict:
"""Fetch a dbSNP record by rsID using the NCBI Variation Services API (structured JSON)."""
rs_num = str(rsid).lstrip("rs")
r = requests.get(f"{BASE_VARIATION}/refsnp/{rs_num}", timeout=15)
r.raise_for_status()
return r.json()
record = fetch_snp_by_rsid("rs1800497") # DRD2 Taq1A
print(f"rsID: rs{record['refsnp_id']}")
print(f"Variant type: {record['primary_snapshot_data'].get('variant_type')}")
# Top-level keys: citations, create_date, dbsnp1_merges, last_update_build_id,
# last_update_date, lost_obs_movements, mane_select_ids, present_obs_movements,
# primary_snapshot_data, refsnp_id. (No top-level `organism` field.)
# rsID: rs1800497
# Variant type: snv
Fetch the full SNP record for a single rsID using efetch with db=snp. Returns an XML document with alleles, placements, and frequency data.
import requests
import xml.etree.ElementTree as ET
EMAIL = "[email protected]"
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
def efetch_snp_xml(rsid: str) -> ET.Element:
"""Fetch dbSNP XML record for a single rsID via the docsum rettype.
Note: rettype="xml" returns a namespaced ExchangeSet; rettype="docsum"
returns the simpler eSummaryResult/DocumentSummary tree without namespaces."""
rs_num = str(rsid).lstrip("rs")
r = requests.get(f"{BASE}/efetch.fcgi",
params={"db": "snp", "id": rs_num,
"rettype": "docsum", "retmode": "xml",
"email": EMAIL},
timeout=20)
r.raise_for_status()
return ET.fromstring(r.text)
root = efetch_snp_xml("rs80357906")
# Parse the DocumentSummary record (MAF/MAFALLELE were removed in 2024;
# GLOBAL_MAFS is a sub-tree — use ESummary JSON below for easier access)
for docsum in root.iter("DocumentSummary"):
rs_id = docsum.get("uid")
snp_class = docsum.findtext("SNP_CLASS", "Unknown")
chr_pos = docsum.findtext("CHRPOS", "N/A")
clin_sig = docsum.findtext("CLINICAL_SIGNIFICANCE", "N/A")
print(f"rs{rs_id} | Class: {snp_class} | Position: {chr_pos}")
print(f" ClinSig: {clin_sig}")
# rs80357906 | Class: delins | Position: 17:43057062
# ClinSig: pathogenic,risk-factor,uncertain-significance
# Fetch using ESummary for structured JSON (preferred for batch)
def esummary_snp(rsid: str) -> dict:
rs_num = str(rsid).lstrip("rs")
r = requests.get(f"{BASE}/esummary.fcgi",
params={"db": "snp", "id": rs_num,
"retmode": "json", "email": EMAIL},
timeout=15)
r.raise_for_status()
result = r.json()["result"]
return result.get(rs_num, {})
rec = esummary_snp("rs80357906")
print(f"rs{rec.get('snp_id')}:")
print(f" Class : {rec.get('snp_class')}") # e.g., 'delins'
# `maf`/`mafallele` were removed from ESummary in 2024 — use `global_mafs`
# (list of {study, freq}) and pick a study (e.g., 'GnomAD_genomes') or the
# global aggregate ('TOPMED'/'1000Genomes').
for m in rec.get('global_mafs', [])[:4]:
print(f" MAF[{m['study']:18s}]: {m['freq']}")
print(f" ChrPos : {rec.get('chrpos')}") # 17:43057062
print(f" ClinSig : {rec.get('clinical_significance')}")
print(f" FxnClass : {rec.get('fxn_class')}")
Search dbSNP for all variants in a gene using esearch. Returns a list of rsIDs matching the gene.
import requests
EMAIL = "[email protected]"
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
def esearch_snp(query: str, retmax: int = 100) -> tuple[list, int]:
"""Search dbSNP using a query string. Returns (id_list, total_count)."""
r = requests.get(f"{BASE}/esearch.fcgi",
params={"db": "snp", "term": query,
"retmax": retmax, "retmode": "json",
"email": EMAIL},
timeout=15)
r.raise_for_status()
result = r.json()["esearchresult"]
return result["idlist"], int(result["count"])
# All variants in BRCA1
ids, total = esearch_snp("BRCA1[gene] AND human[orgn]", retmax=20)
print(f"BRCA1 variants in dbSNP: {total:,} total")
print(f"First 5 rsIDs: {['rs' + i for i in ids[:5]]}")
# Only clinical variants (linked to ClinVar)
ids_clin, total_clin = esearch_snp(
"BRCA1[gene] AND human[orgn] AND clinsig[filter]", retmax=50)
print(f"BRCA1 variants with clinical significance: {total_clin:,}")
Search for all variants in a genomic region using chromosome coordinates.
import requests
EMAIL = "[email protected]"
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
def search_region(chrom: str, start: int, stop: int,
assembly: str = "GRCh38", retmax: int = 200) -> tuple[list, int]:
"""Find all dbSNP variants in a chromosomal region."""
query = f"{chrom}[CHR] AND {start}:{stop}[CHRPOS37]" if assembly == "GRCh37" else \
f"{chrom}[CHR] AND {start}:{stop}[CHRPOS]"
r = requests.get(f"{BASE}/esearch.fcgi",
params={"db": "snp", "term": query,
"retmax": retmax, "retmode": "json",
"email": EMAIL},
timeout=20)
r.raise_for_status()
result = r.json()["esearchresult"]
return result["idlist"], int(result["count"])
# PCSK9 exon 4 region (GRCh38)
ids, total = search_region("1", 55039700, 55040200)
print(f"Variants in chr1:55039700-55040200: {total:,} total")
print(f"Retrieved {len(ids)} rsIDs: {['rs' + i for i in ids[:5]]}")
Retrieve structured summary data for variant records using ESummary, extracting MAF, alleles, and database cross-links.
import requests, json
EMAIL = "[email protected]"
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
def fetch_snp_summaries(rsids: list) -> dict:
"""Fetch ESummary records for a list of rsIDs. Returns dict keyed by rs number."""
ids_str = ",".join(str(r).lstrip("rs") for r in rsids)
r = requests.post(f"{BASE}/esummary.fcgi",
data={"db": "snp", "id": ids_str,
"retmode": "json", "email": EMAIL},
timeout=20)
r.raise_for_status()
return r.json()["result"]
rsids = ["rs80357906", "rs80357220", "rs28897672", "rs1801133"]
result = fetch_snp_summaries(rsids)
for rs_num in rsids:
uid = str(rs_num).lstrip("rs")
rec = result.get(uid, {})
# `global_mafs` is a list of {study, freq}; pick the first or filter by study
gmafs = rec.get("global_mafs", [])
maf_str = gmafs[0]["freq"] if gmafs else "N/A"
maf_study = gmafs[0]["study"] if gmafs else ""
print(f"\n{rs_num}:")
print(f" Class : {rec.get('snp_class', 'N/A')}")
print(f" MAF : {maf_str} (from {maf_study})")
print(f" Location : {rec.get('chrpos', 'N/A')}")
print(f" ClinSig : {rec.get('clinical_significance', 'N/A')}")
print(f" Function : {rec.get('fxn_class', 'N/A')}")
Efficiently upload hundreds of rsIDs to the NCBI history server using EPost, then retrieve them in batches with EFetch.
import requests, time, pandas as pd
EMAIL = "[email protected]"
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
def epost_ids(id_list: list) -> tuple[str, str]:
"""Upload rsIDs to NCBI history server. Returns (WebEnv, query_key)."""
ids_str = ",".join(str(i).lstrip("rs") for i in id_list)
r = requests.post(f"{BASE}/epost.fcgi",
data={"db": "snp", "id": ids_str, "email": EMAIL},
timeout=30)
r.raise_for_status()
import xml.etree.ElementTree as ET
root = ET.fromstring(r.text)
webenv = root.findtext("WebEnv")
query_key = root.findtext("QueryKey")
return webenv, query_key
def efetch_history(webenv: str, query_key: str,
retstart: int = 0, retmax: int = 100) -> dict:
"""Retrieve records from NCBI history server using ESummary."""
r = requests.get(f"{BASE}/esummary.fcgi",
params={"db": "snp", "WebEnv": webenv,
"query_key": query_key, "retstart": retstart,
"retmax": retmax, "retmode": "json",
"email": EMAIL},
timeout=30)
r.raise_for_status()
return r.json()["result"]
rsid_batch = ["rs80357906", "rs80357220", "rs28897672", "rs1801133",
"rs429358", "rs7412", "rs1800497"]
webenv, query_key = epost_ids(rsid_batch)
print(f"Posted {len(rsid_batch)} IDs | WebEnv: {webenv[:40]}...")
records = []
for start in range(0, len(rsid_batch), 100):
result = efetch_history(webenv, query_key, retstart=start, retmax=100)
for uid in result.get("uids", []):
rec = result[uid]
# global_mafs replaced maf/mafallele in 2024 — flatten the first entry
gmafs = rec.get("global_mafs", [])
records.append({
"rsid": f"rs{uid}",
"snp_class": rec.get("snp_class"),
"maf": gmafs[0]["freq"] if gmafs else None,
"maf_study": gmafs[0]["study"] if gmafs else None,
"chrpos": rec.get("chrpos"),
"clinical_sig": rec.get("clinical_significance"),
})
time.sleep(0.5)
df = pd.DataFrame(records)
print(f"\nRetrieved {len(df)} records:")
print(df.to_string(index=False))
The newer REST API returns structured JSON with detailed allele placements, frequencies, and variant type annotations. Preferred for programmatic rsID resolution.
import requests
import pandas as pd
BASE_VARIATION = "https://api.ncbi.nlm.nih.gov/variation/v0"
def fetch_refsnp(rsid: str) -> dict:
"""Fetch structured JSON from NCBI Variation Services API."""
rs_num = str(rsid).lstrip("rs")
r = requests.get(f"{BASE_VARIATION}/refsnp/{rs_num}", timeout=15)
r.raise_for_status()
return r.json()
def parse_allele_frequencies(record: dict) -> list:
"""Extract allele frequencies from a Variation Services record.
freq[i] keys: study_name, study_version, local_row_id, observation, allele_count, total_count.
The asserted allele identifier lives in freq.observation.inserted_sequence (NOT at ann.allele)."""
freqs = []
snapshot = record.get("primary_snapshot_data", {})
for ann in snapshot.get("allele_annotations", []):
for f in ann.get("frequency", []):
obs = f.get("observation", {})
freqs.append({
"allele": obs.get("inserted_sequence"),
"study": f.get("study_name"),
"allele_count": f.get("allele_count"),
"total_count": f.get("total_count"),
"freq": (f["allele_count"] / f["total_count"]
if f.get("total_count") else None),
})
return freqs
record = fetch_refsnp("rs1800497") # DRD2 Taq1A variant
print(f"rsID: rs{record['refsnp_id']}")
print(f"Variant type: {record['primary_snapshot_data'].get('variant_type')}") # 'snv'
placements = record["primary_snapshot_data"].get("placements_with_allele", [])
for placement in placements[:2]:
seq_id = placement.get("seq_id")
for allele in placement.get("alleles", []):
spdi = allele.get("allele", {}).get("spdi", {})
print(f" Placement: {seq_id} | SPDI: {spdi.get('inserted_sequence')}")
freqs = parse_allele_frequencies(record)
if freqs:
df_freq = pd.DataFrame(freqs[:5])
print(f"\nAllele frequencies ({len(freqs)} entries):")
print(df_freq.to_string(index=False))
dbSNP uses two IDs: rs IDs (Reference SNP cluster IDs) are stable public identifiers assigned after clustering submitted variants. ss IDs (Submitted SNP IDs) are assigned to individual laboratory submissions before clustering. Use rs IDs for all queries — ss IDs are internal and submission-specific. A single rs ID may cluster multiple ss IDs from different submissions.
clinvar-database for the full pathogenicity record.| Class | Description | Example |
|-------|-------------|---------|
| snv | Single nucleotide variant (A>T) | rs80357906 |
| indel | Insertion or deletion | rs786201005 |
| mnv | Multi-nucleotide variant | rs1057519737 |
| ins | Pure insertion | rs113993960 |
| del | Pure deletion | rs66767301 |
| microsatellite | STR (short tandem repeat) | rs5030655 |
Goal: Given a list of rsIDs from a variant call pipeline, retrieve MAF, position, and clinical significance for all variants in one run.
import requests, time, pandas as pd
EMAIL = "[email protected]"
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
def epost_snp(rsids):
ids = ",".join(str(r).lstrip("rs") for r in rsids)
r = requests.post(f"{BASE}/epost.fcgi",
data={"db": "snp", "id": ids, "email": EMAIL}, timeout=30)
r.raise_for_status()
import xml.etree.ElementTree as ET
root = ET.fromstring(r.text)
return root.findtext("WebEnv"), root.findtext("QueryKey")
def esummary_history(webenv, query_key, start, retmax=100):
r = requests.get(f"{BASE}/esummary.fcgi",
params={"db": "snp", "WebEnv": webenv,
"query_key": query_key, "retstart": start,
"retmax": retmax, "retmode": "json", "email": EMAIL},
timeout=30)
r.raise_for_status()
return r.json()["result"]
# Example: VCF post-processing — annotate a list of called variants
variant_rsids = [
"rs80357906", "rs80357220", "rs28897672", "rs1801133",
"rs429358", "rs7412", "rs1800497", "rs2230199",
]
print(f"Posting {len(variant_rsids)} rsIDs to NCBI history server...")
webenv, query_key = epost_snp(variant_rsids)
records = []
for start in range(0, len(variant_rsids), 100):
result = esummary_history(webenv, query_key, start=start, retmax=100)
for uid in result.get("uids", []):
rec = result[uid]
gmafs = rec.get("global_mafs", []) # 2024 schema (replaced maf/mafallele)
records.append({
"rsid": f"rs{uid}",
"snp_class": rec.get("snp_class"),
"maf": gmafs[0]["freq"] if gmafs else None,
"maf_study": gmafs[0]["study"] if gmafs else None,
"chrpos_grch38": rec.get("chrpos"),
"gene": rec.get("genes", [{}])[0].get("name") if rec.get("genes") else None,
"clinical_significance": rec.get("clinical_significance"),
"fxn_class": rec.get("fxn_class"),
})
time.sleep(0.5)
df = pd.DataFrame(records)
df.to_csv("variant_annotations.csv", index=False)
print(f"Annotated {len(df)} variants → variant_annotations.csv")
print(df[["rsid", "snp_class", "maf", "chrpos_grch38", "clinical_significance"]].to_string(index=False))
Goal: Search all dbSNP variants in a gene and plot their variant class distribution.
import requests, time
import pandas as pd
import matplotlib.pyplot as plt
EMAIL = "[email protected]"
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
def search_gene_variants(gene: str, retmax: int = 500) -> list:
r = requests.get(f"{BASE}/esearch.fcgi",
params={"db": "snp", "term": f"{gene}[gene] AND human[orgn]",
"retmax": retmax, "retmode": "json", "email": EMAIL},
timeout=20)
r.raise_for_status()
return r.json()["esearchresult"]["idlist"]
def fetch_summaries_batch(ids: list, batch_size: int = 100) -> list:
records = []
for i in range(0, len(ids), batch_size):
batch = ids[i:i+batch_size]
ids_str = ",".join(batch)
r = requests.post(f"{BASE}/esummary.fcgi",
data={"db": "snp", "id": ids_str,
"retmode": "json", "email": EMAIL},
timeout=30)
r.raise_for_status()
result = r.json()["result"]
for uid in result.get("uids", []):
rec = result[uid]
records.append({"rsid": f"rs{uid}", "snp_class": rec.get("snp_class", "unknown")})
time.sleep(0.5)
return records
gene = "CFTR"
print(f"Searching dbSNP for {gene} variants...")
ids = search_gene_variants(gene, retmax=300)
print(f"Found {len(ids)} IDs; fetching summaries...")
records = fetch_summaries_batch(ids)
df = pd.DataFrame(records)
class_counts = df["snp_class"].value_counts()
# Plot variant class distribution
fig, ax = plt.subplots(figsize=(8, 5))
colors = ["#4472C4", "#ED7D31", "#A9D18E", "#FF0000", "#FFC000", "#7030A0"]
bars = ax.bar(class_counts.index, class_counts.values,
color=colors[:len(class_counts)], edgecolor="white")
ax.bar_label(bars, padding=3, fontsize=9)
ax.set_xlabel("Variant Class")
ax.set_ylabel("Count")
ax.set_title(f"dbSNP Variant Classes in {gene} (n={len(df)})")
plt.tight_layout()
plt.savefig(f"{gene}_variant_classes.png", dpi=150, bbox_inches="tight")
print(f"Saved {gene}_variant_classes.png")
print(class_counts.to_string())
# snv 241
# indel 38
# mnv 12
# del 7
# ins 2
| Parameter | Function/Endpoint | Default | Range / Options | Effect |
|-----------|-------------------|---------|-----------------|--------|
| db | All E-utilities | required | "snp" | Database selector; must be "snp" for dbSNP queries |
| id | efetch, esummary, epost | required | rsID number(s) without rs prefix | Variant identifier(s) to fetch |
| term | esearch | required | dbSNP query string | Search expression with field tags: [gene], [CHR], [CHRPOS], [rs], clinsig[filter] |
| retmax | esearch | 20 | 1–10000 | Maximum records returned per search |
| retmode | esearch, esummary | "xml" | "json", "xml" | Response format; use "json" for easy parsing |
| rettype | efetch | "docsum" | "docsum", "xml" | Record type for efetch responses |
| WebEnv + query_key | esummary, efetch | — | from epost response | History server tokens for batch retrieval |
| email | All E-utilities | required | valid email string | NCBI policy; used for rate attribution |
| api_key | All E-utilities | optional | NCBI API key string | Raises rate limit from 3 to 10 req/sec |
Register for a free NCBI API key: Adds api_key=YOUR_KEY to requests and triples your rate limit (3 → 10 req/sec) with no other changes. Register at https://www.ncbi.nlm.nih.gov/account/.
Use epost+esummary for batches of more than 10 rsIDs: Avoid looping individual efetch calls. EPost uploads all IDs in one request to the history server; subsequent ESummary calls retrieve them in configurable batches of up to 500.
Prefer NCBI Variation Services API for structured JSON: The /variation/v0/refsnp/{rs_num} endpoint returns a fully structured JSON with SPDI allele representations, placements, and frequency tables. Easier to parse than E-utilities XML for modern applications.
Check snp_class before interpreting MAF: Indels and MNVs use different allele counting conventions than SNVs. Treat multi-allelic sites carefully — the reported MAF may refer to one allele among several.
Combine dbSNP with ClinVar lookups: dbSNP records the clinical_significance field as a string (e.g., "pathogenic") but does not contain submitter details, review status, or HGVS details. Use clinvar-database for the full pathogenicity record when clinical interpretation is required.
When to use: Check whether a variant is registered in dbSNP before downstream annotation.
import requests
EMAIL = "[email protected]"
def check_rsid(rsid: str) -> dict:
"""Check if an rsID exists in dbSNP and return basic info."""
rs_num = str(rsid).lstrip("rs")
r = requests.get(
"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi",
params={"db": "snp", "id": rs_num, "retmode": "json", "email": EMAIL},
timeout=10
)
result = r.json().get("result", {})
rec = result.get(rs_num, {})
if not rec or "error" in rec:
return {"rsid": rsid, "found": False}
gmafs = rec.get("global_mafs", [])
return {
"rsid": rsid,
"found": True,
"snp_class": rec.get("snp_class"),
"chrpos": rec.get("chrpos"),
"maf": gmafs[0]["freq"] if gmafs else None,
"maf_study": gmafs[0]["study"] if gmafs else None,
"clinical_significance": rec.get("clinical_significance"),
}
for rsid in ["rs80357906", "rs9999999999", "rs1800497"]:
info = check_rsid(rsid)
if info["found"]:
print(f"{rsid}: {info['snp_class']} | pos={info['chrpos']} | MAF={info['maf']} ({info['maf_study']})")
else:
print(f"{rsid}: NOT FOUND in dbSNP")
# rs80357906: delins | pos=17:43057062 | MAF=G=0.0008929/4 (Estonian)
# rs9999999999: NOT FOUND in dbSNP
# rs1800497: snv | pos=11:113400106 | MAF=G=0.42... (study varies)
When to use: Convert a gene name to a list of rsIDs for use in downstream tools (PLINK, ANNOVAR, etc.).
import requests, time, pandas as pd
EMAIL = "[email protected]"
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
def gene_to_rsids(gene: str, max_variants: int = 200) -> pd.DataFrame:
"""Search dbSNP for a gene and return rsIDs with GRCh38 coordinates."""
# Step 1: search
r = requests.get(f"{BASE}/esearch.fcgi",
params={"db": "snp", "term": f"{gene}[gene] AND human[orgn]",
"retmax": max_variants, "retmode": "json", "email": EMAIL},
timeout=15)
r.raise_for_status()
ids = r.json()["esearchresult"]["idlist"]
if not ids:
return pd.DataFrame()
time.sleep(0.4)
# Step 2: fetch summaries
r2 = requests.post(f"{BASE}/esummary.fcgi",
data={"db": "snp", "id": ",".join(ids),
"retmode": "json", "email": EMAIL},
timeout=30)
r2.raise_for_status()
result = r2.json()["result"]
rows = []
for uid in result.get("uids", []):
rec = result[uid]
gmafs = rec.get("global_mafs", [])
rows.append({
"rsid": f"rs{uid}",
"chrpos_grch38": rec.get("chrpos"),
"snp_class": rec.get("snp_class"),
"maf": gmafs[0]["freq"] if gmafs else None,
})
return pd.DataFrame(rows)
df = gene_to_rsids("APOE", max_variants=50)
print(f"APOE variants retrieved: {len(df)}")
print(df.head(8).to_string(index=False))
df.to_csv("APOE_rsids.csv", index=False)
| Problem | Cause | Solution |
|---------|-------|----------|
| HTTP 429 or connection refused | Rate limit exceeded (3 req/sec) | Add time.sleep(0.35) between requests; register for API key to get 10 req/sec |
| ESummary returns {"error": "Invalid uid"} | rsID does not exist in dbSNP | Check rsID spelling; verify with NCBI browser; variant may be a novel call not yet in dbSNP |
| esearch returns 0 results for a gene | Gene symbol mismatch or missing human[orgn] filter | Try adding AND human[orgn]; check NCBI gene symbol at https://www.ncbi.nlm.nih.gov/gene |
| KeyError: 'maf' / 'mafallele' in ESummary parsing | Fields removed in the 2024 dbSNP ESummary schema | Use rec["global_mafs"] — list of {"study": ..., "freq": "<allele>=<value>/<count>"}; pick a study (GnomAD_genomes, TOPMED, ALFA) and parse the freq string |
| global_mafs empty | Variant has no aggregated population frequency in dbSNP | Use gnomAD via gnomad-database directly for population frequencies |
| root.iter("DocumentSummary") returns 0 with rettype="xml" | EFetch XML root is namespaced ({https://www.ncbi.nlm.nih.gov/SNP/docsum}ExchangeSet) | Use rettype="docsum" (no namespace) or pass the full namespaced tag to iter() |
| Variation Services API returns 404 | rsID not found or wrong URL format | Confirm integer rs number (no rs prefix) in /refsnp/{rs_num} endpoint |
| EPost XML parsing fails | Non-XML response (rate limit HTML error page) | Check response status code first; add retry logic with time.sleep(1) |
| Batch efetch returns fewer records than posted | Some rsIDs were merged or retired | Cross-check against NCBI merge history; retired rsIDs redirect to current active rs |
clinvar-database — ClinVar pathogenicity classifications for variants identified by rsID (complement to dbSNP)gnomad-database — Population allele frequencies by ancestry group (more detailed than dbSNP MAF)gwas-database — GWAS Catalog for SNP-trait associations from published GWAS studiesensembl-database — Ensembl REST API for variant consequences and gene annotationssnpeff-variant-annotation — Annotate VCF files with SnpEff and SnpSift, which adds dbSNP rsIDs and functional predictionstools
Fast short-read DNA aligner for WGS/WES/ChIP-seq. 2× faster BWA-MEM successor; outputs SAM/BAM with read group headers for GATK. Primary plus supplementary records for chimeric reads. Use STAR for RNA-seq splice-aware alignment; Bowtie2 is a comparable alternative.
tools
smina molecular docking CLI. AutoDock Vina fork with customizable scoring functions, native SDF/MOL2/PDB ligand input, autoboxing, local energy minimization, and per-atom score breakdowns. Pipeline: receptor PDBQT prep -> ligand prep (RDKit/OpenBabel) -> dock via autobox or explicit grid -> rescore/minimize with custom scoring -> rank poses by affinity. Choose smina over Vina when you need custom scoring terms (--custom_scoring), local optimization of an existing pose (--local_only), per-atom contributions (--atom_term_data), or SDF/MOL2 ligands without manual PDBQT conversion. For unknown binding sites use diffdock-blind-docking; for the Python-bindings/Vinardo workflow use autodock-vina-docking.
development
mdtraj molecular dynamics trajectory analysis (Python). Reads DCD/XTC/TRR/NetCDF/H5/PDB topologies and trajectories; computes RMSD vs time, radius of gyration, per-residue RMSF, residue-residue contact frequency maps, phi/psi torsions for Ramachandran plots (general + Gly/Pro), and 8-state DSSP secondary structure. Modules: trajectory I/O, geometry (distances/angles/dihedrals), structural analysis (RMSD/Rg/RMSF/SASA), contacts, hydrogen bonds, secondary structure (DSSP), NMR observables. For broader atom-selection grammar use mdanalysis-trajectory; for running MD simulations use OpenMM/GROMACS.
development
Programmatic PubMed access via NCBI E-utilities REST API. Covers Boolean/MeSH queries, field-tagged search, endpoints (ESearch, EFetch, ESummary, EPost, ELink), history server for batches, citation matching, systematic review strategies. Use for biomedical literature search or automated pipelines.