skills/structural-biology-drug-discovery/pdb-database/SKILL.md
Query RCSB PDB (200K+ structures) via the public REST + GraphQL APIs with plain `requests` (no SDK). Search by text, attribute, sequence, or 3D structure similarity (Search API); retrieve metadata via GraphQL (Data API); download PDB/mmCIF from files.rcsb.org. For AlphaFold predictions use alphafold-database-access; for protein sequences only use uniprot-protein-database.
Why no SDK? The `rcsb-api` Python SDK is convenient sugar over three public, no-auth REST endpoints (`search.rcsb.org`, `data.rcsb.org`, `files.rcsb.org`). When the SDK is unavailable, every operation can be reproduced with plain `requests` and a small JSON payload. This SKILL.md uses the REST path throughout so the code runs in any environment with `requests` installed.
RCSB PDB is the worldwide repository for 3D structural data of biological macromolecules with 200,000+ experimentally determined structures. Programmatic access is via three free, no-auth endpoints:
| API | Base URL | Method | Purpose |
|---|---|---|---|
| Search | https://search.rcsb.org/rcsbsearch/v2/query | POST JSON | Find PDB IDs by text, attribute filters, sequence, or 3D similarity |
| Data | https://data.rcsb.org/graphql | POST GraphQL | Retrieve structured metadata (entries, polymer entities, assemblies, ligands) |
| Files | https://files.rcsb.org/download/{id}.{format} | GET | Download coordinate files (mmCIF, PDB, FASTA) |
Use this skill for programmatic structural biology queries, drug target analysis, and protein family comparisons.
- For AlphaFold predictions, use `alphafold-database-access` instead.
- For protein sequences only, use `uniprot-protein-database` instead.
- Dependencies: `requests` (only requirement). Optional: `biopython` for parsing downloaded coordinate files.
- Rate limits: short pauses (`time.sleep(0.2)` to `time.sleep(0.5)`) between requests are sufficient; implement exponential backoff on HTTP 429.

pip install requests
# Optional, for coordinate parsing:
pip install biopython
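The backoff advice above can be sketched as a small wrapper around `requests.post`. This is a minimal illustration, not part of the RCSB API: the helper name `post_with_backoff`, the set of retried status codes, and the delay schedule are all assumptions chosen for the example.

```python
import random
import time

RETRY_STATUSES = {429, 500, 502, 503}  # assumption: statuses treated as transient


def post_with_backoff(url, payload, max_tries=5, base_delay=0.5,
                      post=None, sleep=time.sleep):
    """POST JSON, retrying transient failures with exponential backoff plus jitter.

    `post` and `sleep` are injectable so the retry logic can be exercised offline;
    by default the function uses requests.post and time.sleep.
    """
    if post is None:
        import requests  # imported lazily so the helper can be tested without it
        post = requests.post
    for attempt in range(max_tries):
        response = post(url, json=payload, timeout=30)
        if response.status_code not in RETRY_STATUSES:
            response.raise_for_status()  # surface non-retryable errors such as HTTP 400
            return response
        if attempt < max_tries - 1:
            # Wait 0.5 s, 1 s, 2 s, ... plus up to 100 ms of random jitter
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
    response.raise_for_status()  # still failing after max_tries: raise the last error
    return response
```

Any of the `requests.post(SEARCH, json=payload, timeout=30)` calls in this document could be swapped for `post_with_backoff(SEARCH, payload)` in batch jobs.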
Typical search-then-fetch pattern: hit the Search API, get a list of PDB IDs, then resolve metadata via the GraphQL Data API.
import requests
SEARCH = "https://search.rcsb.org/rcsbsearch/v2/query"
DATA = "https://data.rcsb.org/graphql"
# 1. Search: human X-ray structures of "kinase" at resolution < 2.0 Å
payload = {
"query": {
"type": "group", "logical_operator": "and",
"nodes": [
{"type": "terminal", "service": "full_text",
"parameters": {"value": "kinase"}},
{"type": "terminal", "service": "text",
"parameters": {"attribute": "rcsb_entity_source_organism.scientific_name",
"operator": "exact_match", "value": "Homo sapiens"}},
{"type": "terminal", "service": "text",
"parameters": {"attribute": "rcsb_entry_info.resolution_combined",
"operator": "less", "value": 2.0}},
],
},
"return_type": "entry",
"request_options": {"paginate": {"rows": 10}},
}
r = requests.post(SEARCH, json=payload, timeout=30)
r.raise_for_status()
result = r.json()
pdb_ids = [hit["identifier"] for hit in result["result_set"]]
print(f"Total matches: {result['total_count']}, first batch: {pdb_ids}")
# 2. Fetch metadata for the first hit via GraphQL
gql = """{ entry(entry_id: "%s") {
struct { title }
exptl { method }
rcsb_entry_info { resolution_combined deposited_atom_count polymer_entity_count }
} }""" % pdb_ids[0]
r2 = requests.post(DATA, json={"query": gql}, timeout=30)
entry = r2.json()["data"]["entry"]
print(entry["struct"]["title"])
print(f"Method: {entry['exptl'][0]['method']}, Resolution: {entry['rcsb_entry_info']['resolution_combined']} Å")
Free-text search uses service: "full_text" and searches across all indexed fields.
import requests
SEARCH = "https://search.rcsb.org/rcsbsearch/v2/query"
def text_search(keyword, rows=25):
    payload = {
        "query": {"type": "terminal", "service": "full_text",
                  "parameters": {"value": keyword}},
        "return_type": "entry",
        "request_options": {"paginate": {"rows": rows}},
    }
    r = requests.post(SEARCH, json=payload, timeout=30)
    r.raise_for_status()
    data = r.json()
    return [hit["identifier"] for hit in data["result_set"]], data["total_count"]
ids, total = text_search("hemoglobin")
print(f"Found {total} structures; first batch: {ids[:5]}")
Attribute search uses service: "text" with structured attribute/operator/value parameters.
import requests
SEARCH = "https://search.rcsb.org/rcsbsearch/v2/query"
def attribute_search(attribute, operator, value, return_type="entry", rows=25):
    payload = {
        "query": {"type": "terminal", "service": "text",
                  "parameters": {"attribute": attribute,
                                 "operator": operator,
                                 "value": value}},
        "return_type": return_type,
        "request_options": {"paginate": {"rows": rows}},
    }
    r = requests.post(SEARCH, json=payload, timeout=30)
    r.raise_for_status()
    return r.json()
# Human proteins
human = attribute_search("rcsb_entity_source_organism.scientific_name",
"exact_match", "Homo sapiens", rows=5)
print(f"Human structures: {human['total_count']}")
# X-ray only
xray = attribute_search("exptl.method", "exact_match", "X-RAY DIFFRACTION", rows=5)
print(f"X-ray structures: {xray['total_count']}")
# Resolution range: 1.5–2.5 Å
res = attribute_search(
"rcsb_entry_info.resolution_combined", "range",
{"from": 1.5, "to": 2.5, "include_lower": True, "include_upper": True},
rows=5
)
print(f"1.5–2.5 Å: {res['total_count']}")
Find structures with similar sequences using MMseqs2. The service is "sequence"; in the v2 Search API, sequence_type selects protein vs. nucleic acid.
import requests
SEARCH = "https://search.rcsb.org/rcsbsearch/v2/query"
kras_seq = ("MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQ"
"EEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHHYREQIKRVKDSEDVPMVLVGNKCDLPS"
"RTVDTKQAQDLARSYGIPFIETSAKTRQGVDDAFYTLVREIRKHKEKMSK")
payload = {
    "query": {
        "type": "terminal", "service": "sequence",
        "parameters": {
            "sequence_type": "protein",  # or "dna", "rna" (v2 name for the old "target")
            "value": kras_seq,
            "evalue_cutoff": 0.1,
            "identity_cutoff": 0.9,
        },
    },
    "return_type": "polymer_entity",
    "request_options": {"paginate": {"rows": 10}},
}
r = requests.post(SEARCH, json=payload, timeout=30)
r.raise_for_status()
data = r.json()
print(f"KRAS-like hits: {data['total_count']}")
for hit in data["result_set"][:5]:
    print(f"  {hit['identifier']} score={hit.get('score', 'n/a')}")
Find structures with similar 3D geometry using BioZernike descriptors. Service is "structure"; pass the reference entry + assembly ID.
import requests
SEARCH = "https://search.rcsb.org/rcsbsearch/v2/query"
payload = {
"query": {
"type": "terminal", "service": "structure",
"parameters": {
"value": {"entry_id": "4HHB", "assembly_id": "1"},
"operator": "strict_shape_match", # or "relaxed_shape_match"
},
},
"return_type": "polymer_entity",
"request_options": {"paginate": {"rows": 10}},
}
r = requests.post(SEARCH, json=payload, timeout=30)
r.raise_for_status()
data = r.json()
print(f"Structurally similar to 4HHB: {data['total_count']}")
for hit in data["result_set"][:5]:
    print(f"  {hit['identifier']} score={hit.get('score', 'n/a')}")
The GraphQL endpoint at data.rcsb.org/graphql is the canonical way to retrieve structured metadata for known PDB IDs. One request can pull fields across the full data hierarchy (entry → polymer_entity → assembly → chem_comp).
import requests
DATA = "https://data.rcsb.org/graphql"
# Entry-level metadata
gql = """{ entry(entry_id: "4HHB") {
struct { title }
exptl { method }
rcsb_entry_info { resolution_combined deposited_atom_count polymer_entity_count nonpolymer_entity_count }
rcsb_accession_info { deposit_date initial_release_date }
} }"""
r = requests.post(DATA, json={"query": gql}, timeout=30)
entry = r.json()["data"]["entry"]
print(f"Title : {entry['struct']['title']}")
print(f"Method : {entry['exptl'][0]['method']}")
print(f"Resolution : {entry['rcsb_entry_info']['resolution_combined']} Å")
print(f"Atoms : {entry['rcsb_entry_info']['deposited_atom_count']}")
# Polymer entity (sequence, organism, MW)
gql = """{ polymer_entity(entry_id: "4HHB", entity_id: "1") {
entity_poly { pdbx_seq_one_letter_code }
rcsb_polymer_entity { formula_weight }
rcsb_entity_source_organism { scientific_name ncbi_taxonomy_id }
} }"""
r = requests.post(DATA, json={"query": gql}, timeout=30)
pe = r.json()["data"]["polymer_entity"]
print(f"Sequence (first 50): {pe['entity_poly']['pdbx_seq_one_letter_code'][:50]}")
print(f"Organism : {pe['rcsb_entity_source_organism'][0]['scientific_name']}")
print(f"MW : {pe['rcsb_polymer_entity']['formula_weight']}")
# Batch: pull metadata for many entries in one request
gql = """{ entries(entry_ids: ["4HHB", "1A3N", "1HHB"]) {
rcsb_id
struct { title }
exptl { method }
rcsb_entry_info { resolution_combined }
} }"""
r = requests.post(DATA, json={"query": gql}, timeout=30)
for e in r.json()["data"]["entries"]:
    res = e["rcsb_entry_info"]["resolution_combined"]
    print(f"  {e['rcsb_id']}: {e['exptl'][0]['method']:<25} {res} Å | {e['struct']['title'][:40]}")
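The examples above interpolate the PDB ID straight into the query string. A slightly safer pattern, sketched below, passes the ID as a GraphQL variable so that a stray quote in the input cannot change the query. The helper name `entry_request_body` is hypothetical, and the variable type `String!` is an assumption about the RCSB schema; if the server rejects it, the `errors` array in the response will name the expected type.

```python
import json

ENTRY_QUERY = """
query EntrySummary($id: String!) {
  entry(entry_id: $id) {
    struct { title }
    exptl { method }
    rcsb_entry_info { resolution_combined }
  }
}
"""


def entry_request_body(pdb_id):
    """Build the JSON body for a parameterized entry query (no string interpolation)."""
    pdb_id = pdb_id.strip().upper()
    # Only classic 4-character alphanumeric IDs are accepted in this sketch
    if not (len(pdb_id) == 4 and pdb_id.isalnum()):
        raise ValueError(f"not a 4-character PDB ID: {pdb_id!r}")
    return {"query": ENTRY_QUERY, "variables": {"id": pdb_id}}


body = entry_request_body("4hhb")
print(json.dumps(body["variables"]))
# Sending it would look like: requests.post(DATA, json=body, timeout=30)
```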
Coordinate files (mmCIF, PDB, FASTA, assembly variants) are served directly from files.rcsb.org.
import requests
def download_structure(pdb_id, fmt="cif", output_dir="."):
    """Download a coordinate file from files.rcsb.org.

    fmt: "cif" (mmCIF), "pdb", or "pdb1" (first assembly in PDB format).
    FASTA is served separately at https://www.rcsb.org/fasta/entry/{ID}.
    """
    url = f"https://files.rcsb.org/download/{pdb_id}.{fmt}"
    r = requests.get(url, timeout=60)
    if r.status_code == 200:
        path = f"{output_dir}/{pdb_id}.{fmt}"
        # mmCIF / PDB are text; assemblies and biological units are also text
        with open(path, "w") as f:
            f.write(r.text)
        print(f"Downloaded {path} ({len(r.text)/1024:.1f} KB)")
        return path
    print(f"HTTP {r.status_code} for {pdb_id}.{fmt}")
    return None
download_structure("4HHB", fmt="cif")
download_structure("4HHB", fmt="pdb")
# FASTA sequence for an entry
r = requests.get("https://www.rcsb.org/fasta/entry/4HHB", timeout=30)
r.raise_for_status()
print(r.text[:400])
Combine terminal queries with type: "group" and a logical_operator of "and" / "or". Nested groups give arbitrary boolean expressions. To negate a condition, set "negation": true in the parameters of a "text" terminal node; this is rarely needed, because most exclusions can be expressed as the inverse attribute filter.
import requests, datetime
SEARCH = "https://search.rcsb.org/rcsbsearch/v2/query"
# AND: high-resolution human structures
q_and = {
"type": "group", "logical_operator": "and",
"nodes": [
{"type": "terminal", "service": "text",
"parameters": {"attribute": "rcsb_entity_source_organism.scientific_name",
"operator": "exact_match", "value": "Homo sapiens"}},
{"type": "terminal", "service": "text",
"parameters": {"attribute": "rcsb_entry_info.resolution_combined",
"operator": "less", "value": 2.0}},
],
}
# OR: human or mouse
q_or = {
"type": "group", "logical_operator": "or",
"nodes": [
{"type": "terminal", "service": "text",
"parameters": {"attribute": "rcsb_entity_source_organism.scientific_name",
"operator": "exact_match", "value": "Homo sapiens"}},
{"type": "terminal", "service": "text",
"parameters": {"attribute": "rcsb_entity_source_organism.scientific_name",
"operator": "exact_match", "value": "Mus musculus"}},
],
}
# Combined: recent (last 30 days) + high-quality
one_month_ago = (datetime.date.today() - datetime.timedelta(days=30)).isoformat()
today = datetime.date.today().isoformat()
q_recent_hq = {
"type": "group", "logical_operator": "and",
"nodes": [
{"type": "terminal", "service": "text",
"parameters": {"attribute": "rcsb_entry_info.resolution_combined",
"operator": "less", "value": 2.0}},
{"type": "terminal", "service": "text",
"parameters": {"attribute": "refine.ls_R_factor_R_free",
"operator": "less", "value": 0.25}},
{"type": "terminal", "service": "text",
"parameters": {"attribute": "rcsb_accession_info.initial_release_date",
"operator": "range",
"value": {"from": one_month_ago, "to": today,
"include_lower": True, "include_upper": True}}},
],
}
payload = {"query": q_recent_hq, "return_type": "entry",
"request_options": {"paginate": {"rows": 5}}}
r = requests.post(SEARCH, json=payload, timeout=30)
print(f"Recent high-quality: {r.json()['total_count']} structures")
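Writing nested groups by hand gets verbose quickly. The sketch below uses two small builder functions to assemble the same node shapes shown above; `terminal` and `group` are hypothetical helper names for this example, not part of any RCSB library.

```python
def terminal(attribute, operator, value):
    """Build a structured attribute filter node ("service": "text")."""
    return {"type": "terminal", "service": "text",
            "parameters": {"attribute": attribute, "operator": operator, "value": value}}


def group(logical_operator, *nodes):
    """Combine nodes (terminals or other groups) with "and" / "or"."""
    if logical_operator not in ("and", "or"):
        raise ValueError("logical_operator must be 'and' or 'or'")
    return {"type": "group", "logical_operator": logical_operator, "nodes": list(nodes)}


ORGANISM = "rcsb_entity_source_organism.scientific_name"

# (human OR mouse) AND resolution < 2.0
query = group(
    "and",
    group("or",
          terminal(ORGANISM, "exact_match", "Homo sapiens"),
          terminal(ORGANISM, "exact_match", "Mus musculus")),
    terminal("rcsb_entry_info.resolution_combined", "less", 2.0),
)
payload = {"query": query, "return_type": "entry",
           "request_options": {"paginate": {"rows": 5}}}
```

The resulting `payload` can be posted to the Search endpoint exactly like the hand-written dictionaries.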
Search responses include total_count. Paginate with request_options.paginate.start and rows (max ~10000 per page in practice; 100–500 is a good batch size).
import requests, time
SEARCH = "https://search.rcsb.org/rcsbsearch/v2/query"
def search_all(query_node, return_type="entry", page=100, max_results=None, delay=0.3):
    """Paginate through every result; rate-limit between pages."""
    out, start = [], 0
    while True:
        payload = {"query": query_node, "return_type": return_type,
                   "request_options": {"paginate": {"start": start, "rows": page}}}
        r = requests.post(SEARCH, json=payload, timeout=30)
        r.raise_for_status()
        d = r.json()
        batch = [h["identifier"] for h in d.get("result_set", [])]
        if not batch:
            break
        out.extend(batch)
        if max_results and len(out) >= max_results:
            return out[:max_results]
        if len(batch) < page or len(out) >= d.get("total_count", 0):
            break
        start += page
        time.sleep(delay)
    return out
# Example: every insulin entry
ids = search_all(
{"type": "terminal", "service": "full_text", "parameters": {"value": "insulin"}},
max_results=300,
)
print(f"Insulin entries collected: {len(ids)}")
# Batch metadata fetch via GraphQL `entries(...)` to avoid one round-trip per ID
DATA = "https://data.rcsb.org/graphql"
def batch_metadata(pdb_ids, chunk=50):
    """Fetch (title, method, resolution) for many entries with one POST per chunk."""
    all_rows = []
    for i in range(0, len(pdb_ids), chunk):
        ids_arr = pdb_ids[i:i+chunk]
        ids_str = ", ".join(f'"{p}"' for p in ids_arr)
        gql = f"""{{ entries(entry_ids: [{ids_str}]) {{
          rcsb_id
          struct {{ title }}
          exptl {{ method }}
          rcsb_entry_info {{ resolution_combined }}
        }} }}"""
        r = requests.post(DATA, json={"query": gql}, timeout=60)
        r.raise_for_status()
        for e in r.json()["data"]["entries"]:
            if e is None:  # unknown IDs can come back as null; skip them
                continue
            res = e["rcsb_entry_info"]["resolution_combined"]
            all_rows.append({
                "pdb_id": e["rcsb_id"],
                "method": e["exptl"][0]["method"] if e["exptl"] else None,
                "resolution": res[0] if isinstance(res, list) and res else res,
                "title": e["struct"]["title"],
            })
    return all_rows

rows = batch_metadata(ids[:20])
for row in rows[:5]:
    # str() keeps the format spec valid when the method is None
    print(f"  {row['pdb_id']}: {str(row['method']):<25} {row['resolution']} Å | {row['title'][:50]}")
| Service | Use case | Required parameters |
|---|---|---|
| full_text | Free-text keyword across all indexed fields | value (string) |
| text | Structured attribute filter | attribute, operator, value |
| sequence | MMseqs2 sequence similarity | sequence_type ∈ {protein, dna, rna}, value (sequence), evalue_cutoff, identity_cutoff |
| seqmotif | Pattern / regex / PROSITE motif | value (pattern), pattern_type ∈ {simple, prosite, regex} |
| structure | 3D shape similarity (BioZernike) | value ({entry_id, assembly_id}), operator ∈ {strict_shape_match, relaxed_shape_match} |
| strucmotif | 3D residue-arrangement motif | value (residue list), rmsd_cutoff |
| chemical | Ligand similarity by SMILES/InChI | value, match_type ∈ {graph-exact, graph-relaxed, fingerprint-similarity, sub-struct-graph-relaxed-stereo} |

Attribute operators (for "service": "text"):

| Operator | Value shape | Example |
|---|---|---|
| exact_match | string | "Homo sapiens" |
| contains_words / contains_phrase | string | "tyrosine kinase" |
| equals / greater / less / greater_or_equal / less_or_equal | number | 2.0 |
| range | {from, to, include_lower, include_upper} | {"from": 1.5, "to": 2.5, "include_lower": True, "include_upper": True} |
| exists | (none) | — |
| in | array | ["X-RAY DIFFRACTION", "ELECTRON MICROSCOPY"] |
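Since a mismatched operator and value shape is a common source of HTTP 400 responses, it can help to check the pair locally before posting. The sketch below only mirrors the table above; `check_operator_value` is a hypothetical helper, and the real API may accept shapes that this table does not list (date strings, for instance), so treat it as a guard rather than a specification.

```python
import numbers

NUMERIC_OPS = {"equals", "greater", "less", "greater_or_equal", "less_or_equal"}
STRING_OPS = {"exact_match", "contains_words", "contains_phrase"}


def check_operator_value(operator, value=None):
    """Raise ValueError if `value` lacks the shape the table lists for `operator`."""
    if operator in STRING_OPS:
        ok = isinstance(value, str)
    elif operator in NUMERIC_OPS:
        # bool is a subclass of int in Python, so exclude it explicitly
        ok = isinstance(value, numbers.Real) and not isinstance(value, bool)
    elif operator == "range":
        ok = isinstance(value, dict) and {"from", "to"} <= set(value)
    elif operator == "in":
        ok = isinstance(value, (list, tuple)) and len(value) > 0
    elif operator == "exists":
        ok = value is None
    else:
        raise ValueError(f"unknown operator: {operator!r}")
    if not ok:
        raise ValueError(f"value {value!r} has the wrong shape for {operator!r}")


# These pass silently; a bare number with "range" would raise ValueError
check_operator_value("exact_match", "Homo sapiens")
check_operator_value("range", {"from": 1.5, "to": 2.5,
                               "include_lower": True, "include_upper": True})
```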
return_type controls the granularity of identifiers in result_set:
| return_type | Example identifier | Granularity |
|---|---|---|
| entry | 4HHB | One per PDB ID |
| polymer_entity | 4HHB_1 | One per polymer chain entity |
| non_polymer_entity | 4HHB_3 | Ligands, cofactors |
| assembly | 4HHB-1 | Biological unit |
| polymer_instance | 4HHB.A | Individual chain coordinates |
| mol_definition | HEM | Chemical component (PDB ligand code) |
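Because the separator character encodes the granularity, results from different `return_type` values can be normalized back to entry IDs with a small parser. This is a sketch under the assumption of classic 4-character entry IDs; `parse_identifier` is a hypothetical helper, and chemical component codes such as `HEM` are deliberately not handled.

```python
import re

# Patterns follow the identifier shapes in the table above (4-character entry IDs assumed)
_PATTERNS = [
    ("entry", re.compile(r"^(?P<entry>[0-9A-Za-z]{4})$")),
    ("entity", re.compile(r"^(?P<entry>[0-9A-Za-z]{4})_(?P<part>\d+)$")),
    ("assembly", re.compile(r"^(?P<entry>[0-9A-Za-z]{4})-(?P<part>\d+)$")),
    ("instance", re.compile(r"^(?P<entry>[0-9A-Za-z]{4})\.(?P<part>\w+)$")),
]


def parse_identifier(identifier):
    """Return (kind, entry_id, part) for a Search API identifier.

    "entity" covers both polymer and non-polymer entities, which share the
    ENTRY_N shape. `part` is None for plain entry IDs.
    """
    for kind, pattern in _PATTERNS:
        match = pattern.match(identifier)
        if match:
            return kind, match["entry"].upper(), match.groupdict().get("part")
    raise ValueError(f"unrecognized identifier: {identifier!r}")


hits = ["4HHB_1", "4HHB_2", "1A3N_1"]
entry_ids = sorted({parse_identifier(h)[1] for h in hits})
print(entry_ids)
```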
| Root field | Example arguments | Returns |
|---|---|---|
| entry(entry_id: ...) | "4HHB" | Entry-level metadata |
| entries(entry_ids: [...]) | array | Batch entry lookup |
| polymer_entity(entry_id: ..., entity_id: ...) | "4HHB", "1" | Sequence + organism |
| polymer_entity_instance(entry_id: ..., asym_id: ...) | "4HHB", "A" | Chain-level coords/metadata |
| assembly(entry_id: ..., assembly_id: ...) | "4HHB", "1" | Biological assembly |
| chem_comp(comp_id: ...) | "HEM" | Small molecule reference |
| Format | URL pattern | Notes |
|---|---|---|
| mmCIF | https://files.rcsb.org/download/{id}.cif | Recommended; no atom-count limit |
| PDB | https://files.rcsb.org/download/{id}.pdb | Legacy; 99,999 atom limit |
| Assembly (mmCIF) | https://files.rcsb.org/download/{id}-assembly{N}.cif | Biological unit |
| FASTA | https://www.rcsb.org/fasta/entry/{id} | Sequence only |
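The URL patterns in the table can be collected into one helper so that download code does not hard-code them in several places. The function name `structure_url` and its argument names are assumptions for this sketch; only the URL shapes come from the table.

```python
FILES = "https://files.rcsb.org/download"


def structure_url(pdb_id, fmt="cif", assembly=None):
    """Return the download URL for an entry, following the table above.

    fmt: "cif", "pdb", or "fasta". `assembly` (an integer) selects a
    biological assembly and is only supported for mmCIF here.
    """
    pdb_id = pdb_id.upper()
    if fmt == "fasta":
        return f"https://www.rcsb.org/fasta/entry/{pdb_id}"
    if fmt not in ("cif", "pdb"):
        raise ValueError(f"unsupported format: {fmt!r}")
    if assembly is not None:
        if fmt != "cif":
            raise ValueError("the table only lists an mmCIF assembly URL")
        return f"{FILES}/{pdb_id}-assembly{int(assembly)}.cif"
    return f"{FILES}/{pdb_id}.{fmt}"


print(structure_url("4hhb"))
print(structure_url("4hhb", assembly=1))
```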
Goal: Find high-resolution human EGFR structures with bound ligands.
import requests, time
SEARCH = "https://search.rcsb.org/rcsbsearch/v2/query"
DATA = "https://data.rcsb.org/graphql"
payload = {
"query": {
"type": "group", "logical_operator": "and",
"nodes": [
{"type": "terminal", "service": "full_text",
"parameters": {"value": "EGFR epidermal growth factor receptor"}},
{"type": "terminal", "service": "text",
"parameters": {"attribute": "rcsb_entity_source_organism.scientific_name",
"operator": "exact_match", "value": "Homo sapiens"}},
{"type": "terminal", "service": "text",
"parameters": {"attribute": "rcsb_entry_info.resolution_combined",
"operator": "less", "value": 2.5}},
],
},
"return_type": "entry",
"request_options": {"paginate": {"rows": 50}},
}
r = requests.post(SEARCH, json=payload, timeout=30)
r.raise_for_status()
pdb_ids = [h["identifier"] for h in r.json()["result_set"]]
print(f"EGFR ≤2.5 Å human structures: {len(pdb_ids)}")
# Filter to entries with bound ligands via batch GraphQL
ids_str = ", ".join(f'"{p}"' for p in pdb_ids[:20])
gql = f"""{{ entries(entry_ids: [{ids_str}]) {{
rcsb_id
struct {{ title }}
rcsb_entry_info {{ resolution_combined nonpolymer_entity_count }}
}} }}"""
r2 = requests.post(DATA, json={"query": gql}, timeout=60)
for e in r2.json()["data"]["entries"]:
    # nonpolymer_entity_count includes ions and cofactors as well as drug-like ligands
    n_lig = e["rcsb_entry_info"]["nonpolymer_entity_count"] or 0
    if n_lig > 0:
        res = e["rcsb_entry_info"]["resolution_combined"]
        res_v = res[0] if isinstance(res, list) else res
        print(f"  {e['rcsb_id']}: {res_v} Å, ligands={n_lig} | {e['struct']['title'][:60]}")
Goal: Find all PDB structures with sequence similar to a query (KRAS), then summarize their resolution + experimental method.
import requests, time
SEARCH = "https://search.rcsb.org/rcsbsearch/v2/query"
DATA = "https://data.rcsb.org/graphql"
kras_seq = ("MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQ"
"EEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHHYREQIKRVKDSEDVPMVLVGNKCDLPS"
"RTVDTKQAQDLARSYGIPFIETSAKTRQGVDDAFYTLVREIRKHKEKMSK")
payload = {
    "query": {
        "type": "terminal", "service": "sequence",
        "parameters": {"sequence_type": "protein", "value": kras_seq,
                       "evalue_cutoff": 1e-5, "identity_cutoff": 0.5},
    },
    "return_type": "polymer_entity",
    "request_options": {"paginate": {"rows": 20}},
}
r = requests.post(SEARCH, json=payload, timeout=30)
hits = r.json()["result_set"]
print(f"KRAS family hits: {len(hits)}")
# Unique PDB IDs from polymer_entity identifiers (e.g., "4OBE_1" -> "4OBE")
entry_ids = sorted({h["identifier"].split("_")[0] for h in hits})
# Batch metadata
ids_str = ", ".join(f'"{p}"' for p in entry_ids)
gql = f"""{{ entries(entry_ids: [{ids_str}]) {{
rcsb_id
struct {{ title }}
exptl {{ method }}
rcsb_entry_info {{ resolution_combined }}
}} }}"""
r2 = requests.post(DATA, json={"query": gql}, timeout=60)
def first_resolution(entry, default=99.0):
    """Return the first resolution value, or `default` when none is reported."""
    res = entry["rcsb_entry_info"]["resolution_combined"]
    if isinstance(res, list):
        res = res[0] if res else None
    return default if res is None else res

for e in sorted(r2.json()["data"]["entries"], key=first_resolution):
    res_v = first_resolution(e, default=None)
    print(f"  {e['rcsb_id']}: {res_v} Å  {e['exptl'][0]['method']:<25} {e['struct']['title'][:50]}")
Goal: Download mmCIF, then enumerate chains with BioPython.
import requests
from Bio.PDB import MMCIFParser
pdb_id = "4HHB"
r = requests.get(f"https://files.rcsb.org/download/{pdb_id}.cif", timeout=60)
r.raise_for_status()
with open(f"{pdb_id}.cif", "w") as f:
    f.write(r.text)
parser = MMCIFParser(QUIET=True)
structure = parser.get_structure(pdb_id, f"{pdb_id}.cif")
for model in structure:
    for chain in model:
        # A blank hetero flag marks standard residues (excludes waters and ligands)
        std_res = [res for res in chain if res.id[0] == " "]
        atoms = sum(len(list(res.get_atoms())) for res in std_res)
        print(f"Chain {chain.id}: {len(std_res)} residues, {atoms} atoms")
| Parameter | Endpoint | Default | Range / Options | Effect |
|-----------|----------|---------|-----------------|--------|
| value | search sequence | required | protein/DNA/RNA sequence string | Query sequence for MMseqs2 |
| evalue_cutoff | search sequence | 0.1 | 1e-10–10 | E-value threshold |
| identity_cutoff | search sequence | 0.9 | 0.0–1.0 | Minimum identity fraction |
| sequence_type | search sequence | "protein" | protein, dna, rna | Sequence type |
| operator | search text (attribute) | required | see Attribute Operators | Comparison kind |
| operator | search structure | strict_shape_match | strict_shape_match, relaxed_shape_match | 3D match stringency |
| return_type | all search | entry | entry, polymer_entity, assembly, polymer_instance, mol_definition, … | Identifier granularity |
| paginate.start / paginate.rows | request_options | 0 / 25 | up to ~10000 rows/page in practice | Pagination window |
| GraphQL field entries(entry_ids: [...]) | data.rcsb.org/graphql | — | array of PDB IDs | Batch entry metadata |
Search → fetch: Use the Search API to get a list of IDs, then GraphQL entries(entry_ids: [...]) for batch metadata. Avoid one GraphQL request per ID.
Use full_text vs text deliberately: free-text keyword search needs "service": "full_text". Structured attribute filters need "service": "text". They are not interchangeable.
mmCIF over PDB format: PDB format is being phased out and has a 99,999 atom limit. Always download .cif for new code.
Set realistic paginate.rows: rows: 100 is a good default for batch work; the API may slow down beyond ~10000. Loop with paginate.start for full traversal.
Rate limit with time.sleep(0.2) in batch loops: No published hard cap, but the public infrastructure is shared. On HTTP 429, back off exponentially.
Inspect the payload before posting: print(json.dumps(payload, indent=2)) is the cheapest way to debug HTTP 400 errors.
entries(entry_ids: [...]) does not fail loudly on a bad ID: an unknown ID shows up as a null (or missing) entry in the response rather than as an HTTP error. Check each returned entry for null, and validate IDs separately if you can't trust the source.
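One way to handle unresolved IDs from a batch lookup is to compare the requested IDs against what came back. The sketch below assumes that a missing entry appears either as a null element or is simply absent from the list; `split_entries` is a hypothetical helper, and the response fragment is hand-written for illustration, not real API output.

```python
def split_entries(requested_ids, entries):
    """Separate resolved entries from IDs that came back null or not at all."""
    found = {e["rcsb_id"].upper(): e for e in (entries or []) if e}
    missing = [pid for pid in requested_ids if pid.upper() not in found]
    return found, missing


# Offline illustration with a hand-written response fragment
entries = [{"rcsb_id": "4HHB", "struct": {"title": "example title"}}, None]
found, missing = split_entries(["4HHB", "XYZW"], entries)
print(sorted(found), missing)
```

In a real pipeline, `entries` would be `r.json()["data"]["entries"]`, and `missing` could be logged or retried one ID at a time.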
import requests
r = requests.get("https://www.rcsb.org/fasta/entry/4HHB", timeout=30)
print(r.text)
import requests
DATA = "https://data.rcsb.org/graphql"
gql = """{ entry(entry_id: "4HHB") {
polymer_entities {
rcsb_id
rcsb_polymer_entity_container_identifiers { auth_asym_ids }
entity_poly { rcsb_entity_polymer_type pdbx_seq_one_letter_code_can }
}
} }"""
r = requests.post(DATA, json={"query": gql}, timeout=30)
for pe in r.json()["data"]["entry"]["polymer_entities"]:
    chains = pe["rcsb_polymer_entity_container_identifiers"]["auth_asym_ids"]
    seq = pe["entity_poly"]["pdbx_seq_one_letter_code_can"][:50]
    print(f"  {pe['rcsb_id']} chains={chains} type={pe['entity_poly']['rcsb_entity_polymer_type']} seq={seq}…")
import requests
DATA = "https://data.rcsb.org/graphql"
gql = """{ entry(entry_id: "1IEP") {
nonpolymer_entities {
rcsb_id
nonpolymer_comp { chem_comp { id name formula } }
}
} }"""
r = requests.post(DATA, json={"query": gql}, timeout=30)
for npe in r.json()["data"]["entry"]["nonpolymer_entities"]:
    cc = npe["nonpolymer_comp"]["chem_comp"]
    print(f"  {npe['rcsb_id']}: {cc['id']} ({cc['name']}) {cc['formula']}")
The Search API exposes a JSON schema at https://search.rcsb.org/rcsbsearch/v2/metadata/schema. Use it to look up valid attribute paths.
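import requests
r = requests.get("https://search.rcsb.org/rcsbsearch/v2/metadata/schema", timeout=30)
schema = r.json()
def attribute_paths(node, prefix=""):
    """Yield dotted paths from nested JSON-schema "properties" (layout assumed)."""
    node = node.get("items", node)  # array fields nest their members under "items"
    for name, child in node.get("properties", {}).items():
        path = f"{prefix}.{name}" if prefix else name
        yield path
        yield from attribute_paths(child, path)
# Top-level keys are schema metadata, so walk the nested properties instead
sample_paths = [p for p in attribute_paths(schema) if "resolution" in p.lower()][:5]
print(sample_paths)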
| Problem | Cause | Solution |
|---------|-------|----------|
| HTTP 400 — Invalid request to the [ text ] service on a free-text query | Wrong service name | Use "service": "full_text" for keyword search; "service": "text" is for structured attribute filters |
| HTTP 400 with cryptic schema message | Bad operator/value shape | Check the AttributeQuery Operators table; range needs the {from,to,include_lower,include_upper} dict |
| Empty result_set | Filters too strict | Relax filters one at a time; verify attribute names via the schema endpoint |
| Missing data for entries(entry_ids: ["XYZW"]) | The entry doesn't exist | RCSB returns null inside the GraphQL response rather than an HTTP 404; check each data.entries[i] for null |
| HTTP 429 Too Many Requests | Burst pace | Add time.sleep(0.3) between requests; exponential backoff on 429 |
| HTTP 500 from search | Server-side glitch | Retry after 5–10 s; check status.rcsb.org |
| Downloaded .pdb file truncated | >99,999 atoms (legacy format limit) | Download .cif instead |
| GraphQL response has errors array | Field name typo or wrong root | Read the error message; the API is strict about field names — check the schema browser at https://data.rcsb.org/index.html#graphql-api |
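For the last row of the table: GraphQL servers generally report query problems in an `errors` array, often alongside an HTTP 200, so `raise_for_status()` alone will not catch them. A minimal sketch of a response check follows; `GraphQLError` and `unwrap_graphql` are hypothetical names, and the sample bodies are hand-written rather than captured from the API.

```python
class GraphQLError(RuntimeError):
    """Raised when a GraphQL response carries an "errors" array."""


def unwrap_graphql(body):
    """Return body["data"], raising GraphQLError with the server messages if any."""
    errors = body.get("errors")
    if errors:
        messages = "; ".join(
            str(err.get("message", err)) if isinstance(err, dict) else str(err)
            for err in errors
        )
        raise GraphQLError(messages)
    if body.get("data") is None:
        raise GraphQLError("response has neither data nor errors")
    return body["data"]


# Hand-written bodies for illustration
ok = unwrap_graphql({"data": {"entry": {"struct": {"title": "example title"}}}})
print(ok["entry"]["struct"]["title"])
```

With a live response, the call would be `unwrap_graphql(r.json())` in place of `r.json()["data"]`.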
rcsb-api PyPI package; this SKILL.md uses the underlying REST/GraphQL directly so no SDK install is needed.