skills/genomics-bioinformatics/gnomad-database/SKILL.md
gnomAD v4 population variant frequencies via GraphQL API. Allele counts and frequencies stratified by ancestry (AFR, AMR, EAS, NFE, SAS, FIN, ASJ, MID), gene-level constraint (pLI, LOEUF, missense z), and coverage. Identify rare or constrained variants. For clinical pathogenicity use clinvar-database; for GWAS use gwas-database.
npx skillsauth add jaechang-hits/sciagent-skills gnomad-databaseInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
The Genome Aggregation Database (gnomAD) is a resource of aggregated exome and genome sequencing data from 730,000+ individuals. It provides population variant frequencies stratified by 9 ancestry groups, gene-level constraint scores (pLI, LOEUF), and read coverage information. Access is free via a GraphQL API at https://gnomad.broadinstitute.org/api — no authentication required, no official SDK.
clinvar-database; gnomAD provides frequency evidence but does not classify pathogenicitygwas-database; gnomAD is for population frequency lookupsrequests, pandas, matplotlibBRCA1), variant IDs (1-69511-A-G format, or rsIDs)time.sleep(0.5) between requests for polite access; avoid bursts over 10 requests/secondpip install requests pandas matplotlib
import requests
import time
GNOMAD_API = "https://gnomad.broadinstitute.org/api"
def gnomad_query(query: str, variables: dict = None) -> dict:
"""Execute a gnomAD GraphQL query and return the data payload."""
payload = {"query": query, "variables": variables or {}}
r = requests.post(GNOMAD_API, json=payload, timeout=30)
r.raise_for_status()
result = r.json()
if "errors" in result:
raise ValueError(f"GraphQL errors: {result['errors']}")
return result["data"]
# Quick check: get pLI / LOEUF for BRCA1
# GnomadConstraint fields are FLAT (no nested `lof { oe_ci { upper } }` type).
# `pli` is the current field; `pLI` is preserved as a deprecated alias.
query = """
query GeneConstraint($gene_symbol: String!, $reference_genome: ReferenceGenomeId!) {
gene(gene_symbol: $gene_symbol, reference_genome: $reference_genome) {
gnomad_constraint { pli oe_lof_upper }
}
}
"""
data = gnomad_query(query, {"gene_symbol": "BRCA1", "reference_genome": "GRCh38"})
constraint = data["gene"]["gnomad_constraint"]
print(f"BRCA1 pLI: {constraint['pli']:.3e}") # ~5.5e-38 (very high LoF-intolerant)
print(f"BRCA1 LOEUF: {constraint['oe_lof_upper']:.3f}") # 0.928
Fetch all variants in a gene with population allele frequencies. Returns a list of variants with their genome-level frequencies.
import requests, time
GNOMAD_API = "https://gnomad.broadinstitute.org/api"
def gnomad_query(query, variables=None):
r = requests.post(GNOMAD_API, json={"query": query, "variables": variables or {}}, timeout=30)
r.raise_for_status()
result = r.json()
if "errors" in result:
raise ValueError(f"GraphQL errors: {result['errors']}")
return result["data"]
GENE_VARIANTS_QUERY = """
query GeneVariants($gene_symbol: String!, $reference_genome: ReferenceGenomeId!, $dataset: DatasetId!) {
gene(gene_symbol: $gene_symbol, reference_genome: $reference_genome) {
gene_id
symbol
variants(dataset: $dataset) {
variant_id
rsids
chrom
pos
ref
alt
consequence
lof
genome {
an
ac
af
faf95 { popmax popmax_population }
}
}
}
}
"""
data = gnomad_query(GENE_VARIANTS_QUERY, {
"gene_symbol": "PCSK9",
"reference_genome": "GRCh38",
"dataset": "gnomad_r4"
})
variants = data["gene"]["variants"]
print(f"Gene: {data['gene']['symbol']} ({data['gene']['gene_id']})")
print(f"Total variants: {len(variants)}")
# Filter to rare variants (AF < 0.001)
rare = [v for v in variants if v["genome"] and v["genome"]["af"] is not None and v["genome"]["af"] < 0.001]
print(f"Rare variants (AF < 0.1%): {len(rare)}")
for v in rare[:3]:
print(f" {v['variant_id']} | {v['consequence']} | AF={v['genome']['af']:.2e}")
Fetch detailed information for a single variant by its gnomAD variant ID (CHROM-POS-REF-ALT format) or search by rsID.
VARIANT_QUERY = """
query VariantDetails($variantId: String!, $dataset: DatasetId!) {
variant(variantId: $variantId, dataset: $dataset) {
variant_id
rsids
chrom
pos
ref
alt
transcript_consequences {
gene_symbol
transcript_id
is_canonical
major_consequence
lof
lof_filter
lof_flags
}
genome {
an
ac
af
faf95 { popmax popmax_population }
populations { id ac an homozygote_count }
}
}
}
"""
# Query.variant() arg is `variantId` (camelCase). The top-level deprecated
# `consequence`/`lof`/`lof_filter`/`lof_flags` fields on VariantDetails were
# removed — read them from `transcript_consequences` (plural list; pick the
# canonical transcript with is_canonical=True).
data = gnomad_query(VARIANT_QUERY, {
"variantId": "1-55039974-G-T", # PCSK9 p.Tyr142Ter (LoF)
"dataset": "gnomad_r4"
})
v = data["variant"]
canon = next((t for t in (v.get("transcript_consequences") or []) if t.get("is_canonical")),
(v.get("transcript_consequences") or [{}])[0])
print(f"Variant : {v['variant_id']}")
print(f"rsIDs : {v['rsids']}")
print(f"Gene : {canon.get('gene_symbol')}")
print(f"Consequence : {canon.get('major_consequence')} | LoF: {canon.get('lof')}")
g = v["genome"]
print(f"Genome AF : {g['af']:.2e} (AC={g['ac']}, AN={g['an']})")
print(f"FAF95 popmax: {g['faf95']['popmax']:.2e} in {g['faf95']['popmax_population']}")
Retrieve allele frequency broken down by ancestry group for a specific variant.
import pandas as pd
POPULATION_FREQ_QUERY = """
query PopFreqs($variantId: String!, $dataset: DatasetId!) {
variant(variantId: $variantId, dataset: $dataset) {
variant_id
genome {
populations {
id
ac
an
homozygote_count
}
}
}
}
"""
ANCESTRY_LABELS = {
"afr": "African/African American",
"amr": "Admixed American",
"eas": "East Asian",
"fin": "Finnish",
"nfe": "Non-Finnish European",
"sas": "South Asian",
"asj": "Ashkenazi Jewish",
"mid": "Middle Eastern",
"oth": "Other",
}
data = gnomad_query(POPULATION_FREQ_QUERY, {
"variantId": "1-55039974-G-T",
"dataset": "gnomad_r4"
})
pops = data["variant"]["genome"]["populations"]
# VariantPopulation no longer exposes `af` directly — compute from ac/an.
main_pops = [p for p in pops if p["id"] in ANCESTRY_LABELS and p["an"] > 0]
df = pd.DataFrame(main_pops)
df["af"] = df["ac"] / df["an"]
df["label"] = df["id"].map(ANCESTRY_LABELS)
df = df.sort_values("af", ascending=False)
print(df[["label", "ac", "an", "af", "homozygote_count"]].to_string(index=False))
Retrieve per-base read depth coverage for a gene region to assess data completeness.
COVERAGE_QUERY = """
query Coverage($chrom: String!, $start: Int!, $stop: Int!,
$reference_genome: ReferenceGenomeId!, $dataset: DatasetId!) {
region(chrom: $chrom, start: $start, stop: $stop, reference_genome: $reference_genome) {
coverage(dataset: $dataset) {
exome { mean median over_1 over_10 over_20 over_30 over_100 }
genome { mean median over_1 over_10 over_20 over_30 over_100 }
}
}
}
"""
# Coverage is no longer a top-level Query field — it lives under Region, and the
# `RegionCoverage` shape returns parallel `exome` / `genome` arrays (one entry per
# position; there is no per-row `pos` — the array index is implicit position).
data = gnomad_query(COVERAGE_QUERY, {
"chrom": "1",
"start": 55039700,
"stop": 55040200,
"reference_genome": "GRCh38",
"dataset": "gnomad_r4",
})
cov = data["region"]["coverage"]
exome = cov.get("exome") or []
genome = cov.get("genome") or []
print(f"Coverage positions: exome={len(exome)}, genome={len(genome)}")
if exome:
avg_mean = sum(c["mean"] for c in exome) / len(exome)
pct_20x = sum(1 for c in exome if (c.get("over_20") or 0) > 0.9) / len(exome) * 100
print(f"Exome — average mean depth: {avg_mean:.1f}x; {pct_20x:.1f}% positions >=90% at >=20x")
c = exome[0]
print(f" Sample position: mean={c['mean']:.1f}x, median={c['median']}x, "
f">=10x:{c['over_10']:.3f} >=20x:{c['over_20']:.3f} >=30x:{c['over_30']:.3f}")
Retrieve gene-level constraint scores: pLI (probability of loss-of-function intolerance), LOEUF (LoF observed/expected upper bound fraction), and missense z-score.
CONSTRAINT_QUERY = """
query GeneConstraint($gene_symbol: String!, $reference_genome: ReferenceGenomeId!) {
gene(gene_symbol: $gene_symbol, reference_genome: $reference_genome) {
gene_id
symbol
name
gnomad_constraint {
pli
mis_z
lof_z
obs_lof
exp_lof
oe_lof
oe_lof_lower
oe_lof_upper
}
}
}
"""
# GnomadConstraint fields are flat (no nested `lof { obs exp oe oe_ci }`).
# `gene_name` was removed from Gene — use `name` or `symbol`. `pNull`/`pRec` are gone.
genes = ["PCSK9", "BRCA1", "TP53", "TTN"]
print(f"{'Gene':<10} {'pLI':>10} {'LOEUF':>7} {'mis_z':>7}")
print("-" * 38)
for gene in genes:
data = gnomad_query(CONSTRAINT_QUERY, {"gene_symbol": gene, "reference_genome": "GRCh38"})
c = data["gene"]["gnomad_constraint"]
print(f"{gene:<10} {c['pli']:>10.3e} {c['oe_lof_upper']:>7.3f} {c['mis_z']:>7.2f}")
time.sleep(0.5)
# Gene pLI LOEUF mis_z
# PCSK9 4.27e-04 0.456 1.41
# BRCA1 5.52e-38 0.928 1.73
# TP53 8.86e-22 1.020 1.93
# TTN 1.00e+00 0.871 1.30
Fetch all variants in a chromosomal region, useful for targeted panels and regional analyses.
REGION_VARIANTS_QUERY = """
query RegionVariants($chrom: String!, $start: Int!, $stop: Int!,
$dataset: DatasetId!, $reference_genome: ReferenceGenomeId!) {
region(chrom: $chrom, start: $start, stop: $stop,
reference_genome: $reference_genome) {
variants(dataset: $dataset) {
variant_id
rsids
pos
consequence
lof
genome {
af
ac
an
faf95 { popmax }
}
}
}
}
"""
data = gnomad_query(REGION_VARIANTS_QUERY, {
"chrom": "1",
"start": 55039974,
"stop": 55064852, # PCSK9 coding region
"dataset": "gnomad_r4",
"reference_genome": "GRCh38"
})
variants = data["region"]["variants"]
print(f"Variants in region: {len(variants)}")
# Summarize by consequence
from collections import Counter
conseq_counts = Counter(v["consequence"] for v in variants if v["consequence"])
for c, n in conseq_counts.most_common(5):
print(f" {c}: {n}")
# Loss-of-function variants
lof_vars = [v for v in variants if v["lof"] == "HC"]
print(f"\nHigh-confidence LoF variants: {len(lof_vars)}")
for v in lof_vars[:3]:
af = v["genome"]["af"] if v["genome"] else None
print(f" {v['variant_id']} | AF={af:.2e}" if af else f" {v['variant_id']} | AF=NA")
gnomAD v4 has two datasets: gnomad_r4 (exomes + genomes, GRCh38, 730K+ individuals) and gnomad_r2_1 (GRCh37, 141K individuals). The API uses a GraphQL schema where variants are accessed either through gene(), region(), or direct variant() lookups. Each variant has separate exome and genome frequency objects; the genome object is preferred for population frequency comparisons.
gnomAD v4 reports frequencies for 9 top-level ancestry groups identified by genetic ancestry (not self-reported):
| Code | Population | Dataset size (approx) |
|------|-----------|----------------------|
| afr | African/African American | 76,000+ |
| amr | Admixed American | 45,000+ |
| eas | East Asian | 50,000+ |
| fin | Finnish | 24,000+ |
| nfe | Non-Finnish European | 400,000+ |
| sas | South Asian | 80,000+ |
| asj | Ashkenazi Jewish | 10,000+ |
| mid | Middle Eastern | 5,000+ |
| oth | Other/Unknown | varies |
The faf95 field provides a one-sided 95% confidence interval lower bound on the allele frequency in the population where the variant is most common. Use this for conservative variant filtering in clinical pipelines — a variant with faf95.popmax < 0.001 is likely rare enough to warrant clinical investigation.
| Score | Interpretation |
|-------|----------------|
| pLI > 0.9 | Gene is intolerant to LoF — likely essential |
| LOEUF < 0.35 | Strong LoF constraint (upper CI of oe ratio) |
| mis_z > 3.09 | Gene shows significant missense constraint |
| pLI < 0.1 | Gene tolerates LoF — homozygous LoF variants exist |
Goal: Retrieve all rare (AF < 1%) variants in a gene, stratified by consequence, exported to CSV.
import requests, time, pandas as pd
GNOMAD_API = "https://gnomad.broadinstitute.org/api"
def gnomad_query(query, variables=None):
r = requests.post(GNOMAD_API, json={"query": query, "variables": variables or {}}, timeout=30)
r.raise_for_status()
result = r.json()
if "errors" in result:
raise ValueError(result["errors"])
return result["data"]
GENE_VARIANTS_QUERY = """
query GeneVariants($gene_symbol: String!, $reference_genome: ReferenceGenomeId!, $dataset: DatasetId!) {
gene(gene_symbol: $gene_symbol, reference_genome: $reference_genome) {
gene_id symbol name
variants(dataset: $dataset) {
variant_id rsids chrom pos ref alt consequence lof lof_filter
genome {
an ac af
faf95 { popmax popmax_population }
populations { id ac an homozygote_count }
}
}
}
}
"""
gene = "LDLR"
data = gnomad_query(GENE_VARIANTS_QUERY, {
"gene_symbol": gene,
"reference_genome": "GRCh38",
"dataset": "gnomad_r4"
})
variants = data["gene"]["variants"]
rows = []
for v in variants:
g = v.get("genome") or {}
af = g.get("af")
if af is None or af >= 0.01: # keep only rare variants
continue
rows.append({
"variant_id": v["variant_id"],
"rsids": ";".join(v.get("rsids") or []),
"consequence": v.get("consequence"),
"lof": v.get("lof"),
"af_genome": af,
"ac": g.get("ac"),
"an": g.get("an"),
"faf95_popmax": g.get("faf95", {}).get("popmax"),
"faf95_pop": g.get("faf95", {}).get("popmax_population"),
})
df = pd.DataFrame(rows)
df = df.sort_values("af_genome")
df.to_csv(f"{gene}_rare_variants.csv", index=False)
print(f"{gene}: {len(variants)} total variants, {len(df)} rare (AF<1%)")
print(df.groupby("consequence")["variant_id"].count().sort_values(ascending=False).head(6))
# LDLR: 2847 total variants, 2631 rare (AF<1%)
# consequence
# missense_variant 1423
# synonymous_variant 512
# splice_region_variant 231
# stop_gained 198
Goal: Query a list of variants and produce a barplot of allele frequencies by ancestry group.
import requests, time
import pandas as pd
import matplotlib.pyplot as plt
GNOMAD_API = "https://gnomad.broadinstitute.org/api"
def gnomad_query(query, variables=None):
r = requests.post(GNOMAD_API, json={"query": query, "variables": variables or {}}, timeout=30)
r.raise_for_status()
result = r.json()
if "errors" in result:
raise ValueError(result["errors"])
return result["data"]
POPULATION_FREQ_QUERY = """
query PopFreqs($variantId: String!, $dataset: DatasetId!) {
variant(variantId: $variantId, dataset: $dataset) {
variant_id
genome {
populations { id ac an homozygote_count }
}
}
}
"""
ANCESTRY_LABELS = {
"afr": "AFR", "amr": "AMR", "eas": "EAS", "fin": "FIN",
"nfe": "NFE", "sas": "SAS", "asj": "ASJ", "mid": "MID",
}
variant_id = "1-55039974-G-T" # PCSK9 p.Tyr142Ter
data = gnomad_query(POPULATION_FREQ_QUERY, {
"variantId": variant_id,
"dataset": "gnomad_r4"
})
pops = data["variant"]["genome"]["populations"]
rows = [{"code": p["id"], "af": p["ac"] / p["an"], "ac": p["ac"], "an": p["an"]}
for p in pops if p["id"] in ANCESTRY_LABELS and p["an"] > 0]
df = pd.DataFrame(rows)
df["label"] = df["code"].map(ANCESTRY_LABELS)
df = df.sort_values("af", ascending=False)
fig, ax = plt.subplots(figsize=(9, 4))
bars = ax.bar(df["label"], df["af"] * 100, color="#4472C4", edgecolor="white")
ax.bar_label(bars, fmt="%.3f%%", fontsize=8, padding=2)
ax.set_xlabel("Ancestry Group")
ax.set_ylabel("Allele Frequency (%)")
ax.set_title(f"gnomAD v4 Population Frequencies\n{variant_id}")
ax.set_ylim(0, df["af"].max() * 150)
plt.tight_layout()
plt.savefig("gnomad_pop_frequencies.png", dpi=150, bbox_inches="tight")
print(f"Saved gnomad_pop_frequencies.png (n={len(df)} ancestry groups)")
print(df[["label", "af", "ac", "an"]].to_string(index=False))
Goal: Score a gene list by constraint metrics and flag LoF-intolerant genes.
import requests, time, pandas as pd
GNOMAD_API = "https://gnomad.broadinstitute.org/api"
def gnomad_query(query, variables=None):
r = requests.post(GNOMAD_API, json={"query": query, "variables": variables or {}}, timeout=30)
r.raise_for_status()
result = r.json()
if "errors" in result:
raise ValueError(result["errors"])
return result["data"]
CONSTRAINT_QUERY = """
query GeneConstraint($gene_symbol: String!, $reference_genome: ReferenceGenomeId!) {
gene(gene_symbol: $gene_symbol, reference_genome: $reference_genome) {
gene_id symbol
gnomad_constraint {
pli mis_z lof_z
obs_lof exp_lof oe_lof oe_lof_lower oe_lof_upper
}
}
}
"""
# GnomadConstraint is flat (no nested `lof` type); `pNull` / `pRec` are gone;
# `gene_name` is gone — use `symbol`. Filter LoF-intolerant by LOEUF < 0.35
# (the gnomAD recommendation), which is more robust than pLI > 0.9.
gene_list = ["BRCA1", "BRCA2", "PCSK9", "LDLR", "TTN", "CFTR", "HTT"]
records = []
for gene in gene_list:
try:
data = gnomad_query(CONSTRAINT_QUERY, {"gene_symbol": gene, "reference_genome": "GRCh38"})
c = data["gene"]["gnomad_constraint"]
records.append({
"gene": gene,
"pli": c["pli"],
"LOEUF": c["oe_lof_upper"],
"mis_z": c["mis_z"],
"lof_obs": c["obs_lof"],
"lof_exp": c["exp_lof"],
"lof_oe": c["oe_lof"],
})
except Exception as e:
print(f"Warning: {gene} failed — {e}")
time.sleep(0.5)
df = pd.DataFrame(records).sort_values("LOEUF")
df["lof_intolerant"] = df["LOEUF"] < 0.35
print(df[["gene", "pli", "LOEUF", "mis_z", "lof_intolerant"]].to_string(index=False))
df.to_csv("constraint_scores.csv", index=False)
print(f"\nLoF-intolerant genes (LOEUF < 0.35): {df['lof_intolerant'].sum()}/{len(df)}")
| Parameter | Function/Endpoint | Default | Range / Options | Effect |
|-----------|-------------------|---------|-----------------|--------|
| dataset | All variant queries | — | gnomad_r4, gnomad_r2_1, gnomad_r3 | Dataset version (GRCh38 for r4/r3, GRCh37 for r2_1) |
| reference_genome | gene(), region() | — | GRCh38, GRCh37 | Coordinate system; must match dataset |
| variant_id | variant() | — | CHROM-POS-REF-ALT string | Identifies the specific variant to query |
| gene_symbol | gene() | — | HGNC symbol string | Gene to retrieve; case-insensitive |
| chrom, start, stop | region() | — | valid genomic coordinates | Region boundaries for region queries |
| faf95.popmax | variant() genome | — | float 0–1 | Filtering allele frequency (95% CI upper bound); use < 0.001 for rare |
| lof filter field | gene() variants | — | "HC" (high-confidence), "LC" | LoF confidence level |
| populations.id | genome.populations | — | afr, amr, eas, fin, nfe, sas, asj, mid, oth | Per-ancestry frequency |
Use gnomad_r4 for GRCh38 analyses: gnomAD v4 is the most current dataset with 730K+ individuals. Use gnomad_r2_1 only when comparing to GRCh37-based variant calls.
Use faf95.popmax for clinical filtering, not overall AF: The filtering allele frequency accounts for maximum population stratification and provides a more conservative rarity estimate than the global AF.
Add time.sleep(0.5) in batch loops: gnomAD has no published rate limits but the API is shared infrastructure. Polite delays prevent server-side throttling.
Filter lof == "HC" for LoF burden analyses: Low-confidence LoF ("LC") annotations are often in repetitive regions or may be sequencing artifacts. High-confidence ("HC") calls are filtered by LOFTEE.
Check AN before interpreting AF: Low allele number (AN) means poor coverage in that population. A zero or near-zero AF may reflect absent data, not true rarity. Cross-reference with the coverage query when AN is unexpectedly low.
When to use: Quick check before clinical interpretation — confirm no ancestry group has AF > 1%.
import requests
GNOMAD_API = "https://gnomad.broadinstitute.org/api"
def is_common_in_any_population(variant_id, threshold=0.01, dataset="gnomad_r4"):
query = """
query($variantId: String!, $dataset: DatasetId!) {
variant(variantId: $variantId, dataset: $dataset) {
genome { faf95 { popmax popmax_population } af }
}
}
"""
r = requests.post(GNOMAD_API, json={"query": query,
"variables": {"variantId": variant_id, "dataset": dataset}},
timeout=15)
data = r.json()["data"]["variant"]
if not data or not data["genome"]:
return None, "Variant not found in gnomAD"
af = data["genome"]["af"]
popmax = data["genome"]["faf95"]["popmax"]
pop = data["genome"]["faf95"]["popmax_population"]
is_common = (popmax or 0) >= threshold
return is_common, f"overall AF={af:.2e}, FAF95 popmax={popmax:.2e} in {pop}"
common, info = is_common_in_any_population("1-55039974-G-T")
print(f"Common: {common} | {info}")
# Common: False | overall AF=3.2e-05, FAF95 popmax=6.4e-05 in nfe
When to use: Score multiple genes from a differential expression or GWAS gene list.
import requests, time, pandas as pd
GNOMAD_API = "https://gnomad.broadinstitute.org/api"
def get_constraint(gene_symbol, reference_genome="GRCh38"):
query = """
query($gene_symbol: String!, $reference_genome: ReferenceGenomeId!) {
gene(gene_symbol: $gene_symbol, reference_genome: $reference_genome) {
gnomad_constraint { pli mis_z oe_lof_upper }
}
}
"""
r = requests.post(GNOMAD_API, json={"query": query,
"variables": {"gene_symbol": gene_symbol, "reference_genome": reference_genome}},
timeout=15)
data = r.json().get("data", {}).get("gene", {})
if not data or not data.get("gnomad_constraint"):
return None
c = data["gnomad_constraint"]
return {"gene": gene_symbol, "pli": c["pli"], "LOEUF": c["oe_lof_upper"], "mis_z": c["mis_z"]}
genes = ["BRCA1", "BRCA2", "ATM", "CHEK2", "PALB2"]
rows = [r for g in genes for r in [get_constraint(g)] if r]
time.sleep(0.5) # polite delay per gene in real loop
df = pd.DataFrame(rows)
print(df.to_string(index=False))
# gene pli LOEUF mis_z
# BRCA1 5.52e-38 0.928 1.73
# BRCA2 6.34e-09 0.521 0.97
When to use: Get high-confidence LoF variants from gnomAD for downstream annotation.
import requests, pandas as pd
GNOMAD_API = "https://gnomad.broadinstitute.org/api"
def get_lof_variants(gene_symbol, dataset="gnomad_r4", max_af=0.001):
query = """
query($gene_symbol: String!, $reference_genome: ReferenceGenomeId!, $dataset: DatasetId!) {
gene(gene_symbol: $gene_symbol, reference_genome: $reference_genome) {
variants(dataset: $dataset) {
variant_id rsids chrom pos ref alt consequence lof
genome { af ac an }
}
}
}
"""
r = requests.post(GNOMAD_API, json={"query": query,
"variables": {"gene_symbol": gene_symbol, "reference_genome": "GRCh38", "dataset": dataset}},
timeout=60)
variants = r.json()["data"]["gene"]["variants"]
lof = [v for v in variants
if v.get("lof") == "HC"
and v.get("genome") and v["genome"].get("af") is not None
and v["genome"]["af"] < max_af]
return pd.DataFrame([{
"variant_id": v["variant_id"],
"rsids": ";".join(v.get("rsids") or []),
"consequence": v["consequence"],
"af": v["genome"]["af"],
"ac": v["genome"]["ac"],
} for v in lof])
df = get_lof_variants("CFTR", max_af=0.001)
print(f"High-confidence LoF variants in CFTR (AF<0.1%): {len(df)}")
print(df.head(5).to_string(index=False))
df.to_csv("CFTR_HC_lof_variants.csv", index=False)
| Problem | Cause | Solution |
|---------|-------|----------|
| {"errors": [...]} from GraphQL | Invalid field name, wrong dataset ID, or null gene | Check field names match gnomAD v4 schema; use gnomad_r4 not gnomad_v4 |
| Variant returns None genome object | Variant only in exome data, not genome | Try accessing exome field instead of genome; genome is absent for exome-only variants |
| Gene query returns empty variants list | Gene symbol not found or mismatch | Verify HGNC symbol (case-sensitive); use gene_id (ENSG ID) as fallback |
| faf95 returns null | Variant is absent or monomorphic in all populations | Check ac and an — variant may have AC=0 or be filtered |
| requests.exceptions.Timeout | Large gene (e.g., TTN) takes >30s | Increase timeout=120; for very large genes use region queries instead |
| Population AF is None for some groups | Variant not observed in that ancestry | Treat None AF as 0 for filtering; check an to confirm the group was sequenced |
| reference_genome mismatch error | Using GRCh37 coords with gnomad_r4 | Use GRCh38 for gnomad_r4/gnomad_r3; use GRCh37 only for gnomad_r2_1 |
clinvar-database — ClinVar pathogenicity classifications (complement to gnomAD population frequency data)gwas-database — GWAS Catalog for SNP-trait associations from published GWAS studiesensembl-database — Ensembl VEP for variant consequence prediction and gene annotationdbsnp-database — dbSNP for rsID lookup, variant classes, and cross-database ID mappingtools
Fast short-read DNA aligner for WGS/WES/ChIP-seq. 2× faster BWA-MEM successor; outputs SAM/BAM with read group headers for GATK. Primary plus supplementary records for chimeric reads. Use STAR for RNA-seq splice-aware alignment; Bowtie2 is a comparable alternative.
tools
smina molecular docking CLI. AutoDock Vina fork with customizable scoring functions, native SDF/MOL2/PDB ligand input, autoboxing, local energy minimization, and per-atom score breakdowns. Pipeline: receptor PDBQT prep -> ligand prep (RDKit/OpenBabel) -> dock via autobox or explicit grid -> rescore/minimize with custom scoring -> rank poses by affinity. Choose smina over Vina when you need custom scoring terms (--custom_scoring), local optimization of an existing pose (--local_only), per-atom contributions (--atom_term_data), or SDF/MOL2 ligands without manual PDBQT conversion. For unknown binding sites use diffdock-blind-docking; for the Python-bindings/Vinardo workflow use autodock-vina-docking.
development
mdtraj molecular dynamics trajectory analysis (Python). Reads DCD/XTC/TRR/NetCDF/H5/PDB topologies and trajectories; computes RMSD vs time, radius of gyration, per-residue RMSF, residue-residue contact frequency maps, phi/psi torsions for Ramachandran plots (general + Gly/Pro), and 8-state DSSP secondary structure. Modules: trajectory I/O, geometry (distances/angles/dihedrals), structural analysis (RMSD/Rg/RMSF/SASA), contacts, hydrogen bonds, secondary structure (DSSP), NMR observables. For broader atom-selection grammar use mdanalysis-trajectory; for running MD simulations use OpenMM/GROMACS.
development
Programmatic PubMed access via NCBI E-utilities REST API. Covers Boolean/MeSH queries, field-tagged search, endpoints (ESearch, EFetch, ESummary, EPost, ELink), history server for batches, citation matching, systematic review strategies. Use for biomedical literature search or automated pipelines.