skills/genomics-bioinformatics/regulomedb-database/SKILL.md
Query RegulomeDB v2 GET REST API to score variants for regulatory function and retrieve overlapping evidence (TF binding, histone marks, DNase peaks, footprints, motifs, eQTLs, chromatin state). Scores range 1a (strongest) to 7 (none). Use for GWAS hit prioritization, regulatory variant annotation, cis-regulatory discovery. Use clinvar-database for pathogenicity; gwas-database for trait associations.
npx skillsauth add jaechang-hits/sciagent-skills regulomedb-databaseInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
RegulomeDB integrates large-scale functional genomics data (ENCODE, Roadmap Epigenomics) to score genetic variants for regulatory potential. Each variant receives a ranking from 1a (highest regulatory confidence: eQTL + TF + DNase + motif + chromatin) to 7 (no known regulatory function). The v2 API is exposed as GET https://regulomedb.org/regulome-search/; the legacy POST /regulome-search/, POST /regulome-summary/, and GET /regulome-datasets/ JSON endpoints are no longer functional (return regulome-notfound stubs or 500). Access is free and requires no authentication.
@graph evidence rows)features.QTL)nearby_snps)clinvar-database instead when you need clinical pathogenicity classifications; RegulomeDB scores regulatory function, not germline disease associationgwas-database instead when you want published GWAS associations with traitsrequests, pandas, matplotlibrs4946036), genomic positions (chr1:1000000), or region coordinates (chr1:1000000-2000000)time.sleep(0.3) between requests in batch workflowspip install requests pandas matplotlib
import requests
BASE = "https://regulomedb.org"
def regulome_score(variant, genome="GRCh38"):
"""Score a single variant (rsID or chr:pos-pos) via the GET /regulome-search/ endpoint."""
r = requests.get(
f"{BASE}/regulome-search/",
params={"regions": variant, "genome": genome, "format": "json"},
timeout=30,
)
r.raise_for_status()
d = r.json()
rs = d.get("regulome_score", {})
vs = d.get("variants", [])
return {
"query": variant,
"ranking": rs.get("ranking"), # 1a / 1b / ... / 7
"probability": float(rs.get("probability", 0)),
"rsids": vs[0].get("rsids") if vs else [],
"chrom": vs[0].get("chrom") if vs else None,
"pos": vs[0].get("start") if vs else None,
}
print(regulome_score("rs4946036"))
# {'query': 'rs4946036', 'ranking': '7', 'probability': 0.18412,
# 'rsids': ['rs4946036'], 'chrom': 'chr6', 'pos': 114819799}
The GET /regulome-search/ endpoint accepts an rsID or coordinate as regions=. Returns a regulome_score block (probability, ranking, tissue-specific scores) plus features flags and the per-dataset @graph evidence rows.
import requests
BASE = "https://regulomedb.org"
def score_variant(variant, genome="GRCh38"):
"""Return the regulome_score block and resolved coordinates."""
r = requests.get(
f"{BASE}/regulome-search/",
params={"regions": variant, "genome": genome, "format": "json"},
timeout=30,
)
r.raise_for_status()
d = r.json()
rs = d.get("regulome_score", {})
vs = d.get("variants", [])
feats = d.get("features", {})
print(f"Variant : {variant}")
print(f"Resolved : {vs[0]['chrom']}:{vs[0]['start']} ({', '.join(vs[0].get('rsids', []))})")
print(f"Ranking : {rs.get('ranking')} prob={rs.get('probability')}")
print(f"Features : ChIP={feats['ChIP']} Chromatin_accessibility={feats['Chromatin_accessibility']} "
f"QTL={feats['QTL']} Footprint={feats['Footprint']} PWM_matched={feats['PWM_matched']}")
return d
# Strong-regulatory locus example
score_variant("chr11:5226739-5226740")
# Ranking: 1a (HBB beta-globin promoter, multi-evidence)
# Score by chromosomal position alone
score_variant("chr17:7670000-7670001") # TP53 region
A range query returns up to limit resolved variants (variants[]) and all @graph evidence rows in the window, plus nearby_snps (rsIDs adjacent to the resolved hits).
import requests, pandas as pd
BASE = "https://regulomedb.org"
def scan_region(chrom, start, end, genome="GRCh38", limit=200):
"""List variants in a region with their resolved positions and overlapping rsIDs."""
r = requests.get(
f"{BASE}/regulome-search/",
params={"regions": f"{chrom}:{start}-{end}", "genome": genome,
"format": "json", "limit": limit},
timeout=60,
)
r.raise_for_status()
d = r.json()
variants = d.get("variants", [])
print(f"Variants in {chrom}:{start}-{end}: {len(variants)} (total indexed = {d.get('total')})")
rows = [{"rsids": ", ".join(v.get("rsids", [])),
"chrom": v.get("chrom"),
"start": v.get("start"),
"end": v.get("end")} for v in variants]
return pd.DataFrame(rows)
df = scan_region("chr11", 5226000, 5227000)
print(df.head(10).to_string(index=False))
@graph RowsEach @graph[i] row is one experimental piece of evidence overlapping the query. Fields: method, target_label, biosample_ontology{term_name, organ_slims, classification}, dataset, file, value, chrom, start, end, strand, ancestry, disease_term_name.
import requests, pandas as pd
BASE = "https://regulomedb.org"
def evidence_rows(variant, genome="GRCh38"):
r = requests.get(
f"{BASE}/regulome-search/",
params={"regions": variant, "genome": genome, "format": "json"},
timeout=60,
)
r.raise_for_status()
g = r.json().get("@graph", [])
rows = []
for row in g:
bs = row.get("biosample_ontology") or {}
rows.append({
"method": row.get("method"),
"target_label": row.get("target_label"),
"biosample": bs.get("term_name"),
"organ_slims": ", ".join(bs.get("organ_slims") or []),
"dataset": row.get("dataset", "").split("/")[-2] if row.get("dataset") else None,
"value": row.get("value"),
})
return pd.DataFrame(rows)
df_evidence = evidence_rows("chr11:5226739-5226740")
# Each method is one of: ChIP-seq, Histone ChIP-seq, ATAC-seq, DNase-seq,
# footprints, PWMs, chromatin state, eQTLs
print(df_evidence["method"].value_counts())
To list the transcription factors binding near a variant, filter @graph rows where method == "ChIP-seq" and read target_label + biosample_ontology.term_name.
import requests, pandas as pd
BASE = "https://regulomedb.org"
def tf_binding(variant, genome="GRCh38"):
r = requests.get(f"{BASE}/regulome-search/",
params={"regions": variant, "genome": genome, "format": "json"},
timeout=60)
r.raise_for_status()
rows = []
for g in r.json().get("@graph", []):
if g.get("method") != "ChIP-seq":
continue
bs = g.get("biosample_ontology") or {}
rows.append({
"tf": g.get("target_label"),
"biosample": bs.get("term_name"),
"classification": bs.get("classification"),
})
return pd.DataFrame(rows)
df_tfs = tf_binding("chr11:5226739-5226740")
print(f"TF ChIP-seq peaks overlapping query: {len(df_tfs)}")
print(df_tfs.groupby("tf").size().sort_values(ascending=False).head(10))
regulome_score.tissue_specific_scores maps ~50 tissues to per-tissue regulatory probabilities (0–1). Rank tissues to identify where the variant has the strongest regulatory signal.
import requests, pandas as pd
BASE = "https://regulomedb.org"
def tissue_scores(variant, genome="GRCh38", top_n=10):
r = requests.get(f"{BASE}/regulome-search/",
params={"regions": variant, "genome": genome, "format": "json"},
timeout=30)
r.raise_for_status()
ts = r.json().get("regulome_score", {}).get("tissue_specific_scores", {})
s = pd.Series({k: float(v) for k, v in ts.items()})
return s.sort_values(ascending=False).head(top_n)
print("Top tissues by regulatory probability for chr11:5226739-5226740:")
print(tissue_scores("chr11:5226739-5226740"))
nearby_snps carries dbSNP rsIDs near the resolved coordinates, with reference/alt allele frequencies (when GnomAD-indexed).
import requests, pandas as pd
BASE = "https://regulomedb.org"
def nearby(variant, genome="GRCh38"):
r = requests.get(f"{BASE}/regulome-search/",
params={"regions": variant, "genome": genome, "format": "json"},
timeout=30)
r.raise_for_status()
rows = []
for s in r.json().get("nearby_snps", []):
rows.append({
"rsid": s.get("rsid"),
"chrom": s.get("chrom"),
"pos": s.get("coordinates", {}).get("gte"),
"type": s.get("variation_type"),
"maf": s.get("maf"),
})
return pd.DataFrame(rows)
df_nearby = nearby("rs4946036")
print(f"Nearby SNPs to rs4946036: {len(df_nearby)}")
print(df_nearby.head(10).to_string(index=False))
RegulomeDB ranks encode the strength of evidence overlapping a variant. The regulome_score.ranking string is one of:
| Ranking | Evidence | Confidence | |---------|----------|------------| | 1a | eQTL + TF + DNase + motif + matched footprint | Highest | | 1b–1f | Multi-evidence (sub-ranks reflect which inputs match) | Very high | | 2a | TF binding + DNase + motif | High | | 2b | TF binding + any DNase (no motif required) | High | | 2c | TF binding + DNase (limited) | Moderate-high | | 3a | DNase + motif (no TF ChIP-seq) | Moderate | | 3b | Motif only (no DNase) | Moderate | | 4 | Single TF binding evidence | Low-moderate | | 5 | DNase peak only | Low | | 6 | Other regulatory evidence | Minimal | | 7 | No known regulatory function | None |
regulome_score.probability is the numeric model score (0–1) underlying the discrete ranking.
features Booleans vs @graph Detailfeatures is a high-level summary — boolean flags indicating presence of ChIP, Chromatin_accessibility, Footprint, PWM, QTL, etc. For per-dataset detail (which exact TF / cell type / experiment), iterate @graph[] and filter by method.
regions= accepts:
# rsID — resolved server-side to current-build coordinates
"rs4946036"
# Single-position range
"chr11:5226739-5226740"
# Wider region (returns multiple variants[] entries + larger @graph)
"chr11:5226000-5227000"
Goal: Score a list of GWAS lead SNPs and rank by regulatory confidence.
import requests, time, pandas as pd
import matplotlib.pyplot as plt
BASE = "https://regulomedb.org"
gwas_snps = ["rs7903146", "rs10811661", "rs1801282", "rs4946036",
"rs2268177", "rs10830963", "rs1111875"]
records = []
for snp in gwas_snps:
r = requests.get(f"{BASE}/regulome-search/",
params={"regions": snp, "genome": "GRCh38", "format": "json"},
timeout=30)
r.raise_for_status()
d = r.json()
rs = d.get("regulome_score", {})
feats = d.get("features", {})
g = d.get("@graph", [])
tfs = sorted({row["target_label"] for row in g
if row.get("method") == "ChIP-seq" and row.get("target_label")})
records.append({
"snp": snp,
"ranking": rs.get("ranking"),
"probability": float(rs.get("probability", 0)),
"has_qtl": feats.get("QTL", False),
"tf_count": len(tfs),
"num_evidence_rows": len(g),
})
time.sleep(0.3)
df = pd.DataFrame(records).sort_values("probability", ascending=False)
print(df.to_string(index=False))
df.to_csv("gwas_regulatory_priority.csv", index=False)
fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(df["snp"], df["probability"], color="steelblue", edgecolor="black")
ax.set_ylabel("Regulatory probability")
ax.set_title("GWAS lead-SNP regulatory probabilities")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.savefig("gwas_score_distribution.png", dpi=150, bbox_inches="tight")
Goal: Summarize the methods underlying the score at a locus (e.g., HBB promoter).
import requests, pandas as pd
import matplotlib.pyplot as plt
BASE = "https://regulomedb.org"
def locus_profile(region, genome="GRCh38"):
r = requests.get(f"{BASE}/regulome-search/",
params={"regions": region, "genome": genome, "format": "json"},
timeout=60)
r.raise_for_status()
d = r.json()
rs = d.get("regulome_score", {})
g = d.get("@graph", [])
counts = pd.Series([row.get("method") for row in g]).value_counts()
print(f"\n=== {region} | ranking={rs.get('ranking')} prob={rs.get('probability')} ===")
print(counts.to_string())
return counts
counts = locus_profile("chr11:5226739-5226740") # HBB
fig, ax = plt.subplots(figsize=(8, 4))
counts.plot(kind="barh", color="seagreen", ax=ax)
ax.set_xlabel("Evidence rows in @graph")
ax.set_title("Regulatory evidence by method (HBB promoter)")
plt.tight_layout()
plt.savefig("locus_evidence_profile.png", dpi=150, bbox_inches="tight")
| Parameter | Endpoint | Default | Range / Options | Effect |
|-----------|----------|---------|-----------------|--------|
| regions | GET /regulome-search/ | required | rsID, chrN:start-end, or chrN:pos-pos | Variant/region to score |
| genome | GET /regulome-search/ | "GRCh38" | "GRCh38", "GRCh37" | Reference genome assembly |
| format | GET /regulome-search/ | "html" | "json", "tsv", "html" | Use "json" for programmatic access |
| limit | GET /regulome-search/ | 200 | 1–1000 | Max resolved variants in variants[] for region queries |
| from | GET /regulome-search/ | 0 | non-negative int | Offset for paging through large @graph lists |
Use GET, not POST. The POST /regulome-search/, POST /regulome-summary/, and GET /regulome-datasets/ JSON endpoints return a regulome-notfound stub or HTTP 500. Only GET /regulome-search/?regions=...&genome=...&format=json returns real data.
Read regulome_score.ranking, not regulomedb_score. The field used to be named regulomedb_score in legacy docs; the live API exposes it as regulome_score.ranking (string like "1a", "7").
Add time.sleep(0.3) between calls. RegulomeDB has no published rate limit, but polite spacing prevents intermittent 502s under load.
Score 7 ≠ "no regulation". Score 7 means absence of evidence in RegulomeDB's curated datasets, not biological absence of regulation. Cross-check with ENCODE/Roadmap directly via encode-database for negative results.
For tissue specificity, use tissue_specific_scores. The single ranking is an aggregate; the per-tissue probabilities reveal where the variant has the strongest regulatory signal.
Watch the GRCh37 vs GRCh38 build. rsIDs are auto-resolved to the requested build, but coordinate-based regions=chrN:start-end queries are build-specific — pass the matching genome= value.
When to use: One-off check before kicking off a larger pipeline.
import requests
def quick_score(variant, genome="GRCh38"):
r = requests.get("https://regulomedb.org/regulome-search/",
params={"regions": variant, "genome": genome, "format": "json"},
timeout=20)
r.raise_for_status()
rs = r.json().get("regulome_score", {})
print(f"{variant}: ranking={rs.get('ranking')} prob={rs.get('probability')}")
quick_score("rs4946036") # ranking=7 prob=0.18412
quick_score("chr11:5226739-5226740") # ranking=1a (HBB promoter)
import requests, time, pandas as pd
HIGH_CONF = {"1a", "1b", "1c", "1d", "1e", "1f", "2a", "2b"}
def high_conf_only(variants, genome="GRCh38"):
keep = []
for v in variants:
r = requests.get("https://regulomedb.org/regulome-search/",
params={"regions": v, "genome": genome, "format": "json"},
timeout=30)
ranking = r.json().get("regulome_score", {}).get("ranking")
if ranking in HIGH_CONF:
keep.append({"variant": v, "ranking": ranking})
time.sleep(0.3)
return pd.DataFrame(keep)
df = high_conf_only(["rs4946036", "rs7903146", "chr11:5226739-5226740"])
print(df.to_string(index=False))
When to use: Find variants with features.QTL == True, i.e. those overlapping a curated QTL row in @graph (method "QTLs").
import requests, time, pandas as pd
def qtl_overlap(variants, genome="GRCh38"):
rows = []
for v in variants:
r = requests.get("https://regulomedb.org/regulome-search/",
params={"regions": v, "genome": genome, "format": "json"},
timeout=30)
d = r.json()
if not d.get("features", {}).get("QTL"):
time.sleep(0.3); continue
for g in d.get("@graph", []):
if g.get("method") == "QTLs":
rows.append({
"variant": v,
"ranking": d.get("regulome_score", {}).get("ranking"),
"qtl_target": g.get("target_label"),
"value": g.get("value"),
"biosample": (g.get("biosample_ontology") or {}).get("term_name"),
})
time.sleep(0.3)
return pd.DataFrame(rows)
print(qtl_overlap(["rs4946036", "chr11:5226739-5226740"]))
| Problem | Cause | Solution |
|---------|-------|----------|
| Response body {"@id":"/regulome-notfound","@type":["regulome-help"]...} | Using the legacy POST /regulome-search/ with a JSON body | Switch to GET with regions= query param + format=json |
| HTTP 500 | Hitting the deprecated /regulome-summary/ endpoint | Endpoint is dead; aggregate counts client-side from @graph[].method |
| KeyError: 'regulomedb_score' | Field renamed | Use regulome_score.ranking (string) and regulome_score.probability (float-as-string) |
| peaks / eqtls / assay_type missing | Old schema | Iterate @graph[] and filter by method (ChIP-seq, DNase-seq, Histone ChIP-seq, ATAC-seq, footprints, PWMs, QTLs, chromatin state) |
| Empty variants[] for an rsID | rsID not in RegulomeDB index or build mismatch | Try the chr:pos form; check genome= matches the coordinates |
| Region search returns 0 @graph rows | Region size too small or in an uncharacterized chromosome | Widen the window to ≥ 200 bp; avoid alt contigs (chrUn_*, *_random) |
| Region query truncates at 200 results | Default limit=200 | Pass limit=1000 or page with from=0,200,400,... |
gwas-database — NHGRI-EBI GWAS Catalog for published SNP-trait associations; pair with RegulomeDB to prioritize GWAS hitsclinvar-database — Clinical pathogenicity classifications; complements RegulomeDB's functional regulatory evidenceencode-database — Direct ENCODE REST API access for the TF ChIP-seq / ATAC-seq peak sets that underlie RegulomeDB scoresensembl-database — Variant annotation and gene coordinate lookup; use to map rsIDs to genomic positions before region queriestools
Fast short-read DNA aligner for WGS/WES/ChIP-seq. 2× faster BWA-MEM successor; outputs SAM/BAM with read group headers for GATK. Primary plus supplementary records for chimeric reads. Use STAR for RNA-seq splice-aware alignment; Bowtie2 is a comparable alternative.
tools
smina molecular docking CLI. AutoDock Vina fork with customizable scoring functions, native SDF/MOL2/PDB ligand input, autoboxing, local energy minimization, and per-atom score breakdowns. Pipeline: receptor PDBQT prep -> ligand prep (RDKit/OpenBabel) -> dock via autobox or explicit grid -> rescore/minimize with custom scoring -> rank poses by affinity. Choose smina over Vina when you need custom scoring terms (--custom_scoring), local optimization of an existing pose (--local_only), per-atom contributions (--atom_term_data), or SDF/MOL2 ligands without manual PDBQT conversion. For unknown binding sites use diffdock-blind-docking; for the Python-bindings/Vinardo workflow use autodock-vina-docking.
development
mdtraj molecular dynamics trajectory analysis (Python). Reads DCD/XTC/TRR/NetCDF/H5/PDB topologies and trajectories; computes RMSD vs time, radius of gyration, per-residue RMSF, residue-residue contact frequency maps, phi/psi torsions for Ramachandran plots (general + Gly/Pro), and 8-state DSSP secondary structure. Modules: trajectory I/O, geometry (distances/angles/dihedrals), structural analysis (RMSD/Rg/RMSF/SASA), contacts, hydrogen bonds, secondary structure (DSSP), NMR observables. For broader atom-selection grammar use mdanalysis-trajectory; for running MD simulations use OpenMM/GROMACS.
development
Programmatic PubMed access via NCBI E-utilities REST API. Covers Boolean/MeSH queries, field-tagged search, endpoints (ESearch, EFetch, ESummary, EPost, ELink), history server for batches, citation matching, systematic review strategies. Use for biomedical literature search or automated pipelines.