skills/proteomics-protein-engineering/interpro-database/SKILL.md
Query InterPro REST API for protein domain architecture, family classification, and member-DB integration. Search entries, retrieve a protein's domains, list family members, get taxonomic distribution, link to PDB. Unifies Pfam, PANTHER, PIRSF, PRINTS, PROSITE, SMART, CDD, NCBIfam. Use uniprot-protein-database for sequences; pdb-database for 3D structures.
npx skillsauth add jaechang-hits/scicraft interpro-databaseInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
InterPro is the EBI's integrated protein family, domain, and functional site database. It consolidates signatures from 13 member databases (Pfam, PANTHER, PIRSF, PRINTS, PROSITE, SMART, CDD, NCBIfam, and others) into unified InterPro entries, each describing a homologous superfamily, domain, family, repeat, or conserved site. The REST API at https://www.ebi.ac.uk/interpro/api/ is free and requires no authentication.
uniprot-protein-databasePfam directly; InterPro is the meta-layerrequests, pandas, matplotlibP04637) or InterPro accessions (e.g., IPR011009)time.sleep(1.0) between requests for batch queries; paginate with ?cursor= or ?page_size=pip install requests pandas matplotlib
import requests
INTERPRO_BASE = "https://www.ebi.ac.uk/interpro/api"
def interpro_get(path: str, params: dict = None) -> dict:
"""Send a GET request to the InterPro API and return parsed JSON."""
r = requests.get(
f"{INTERPRO_BASE}/{path}",
params=params,
headers={"Accept": "application/json"},
timeout=30
)
r.raise_for_status()
return r.json()
# Get domain architecture for TP53 (P04637)
# Note: `protein/uniprot/{acc}/` returns only {metadata}; the entries-per-protein
# data lives at `entry/interpro/protein/uniprot/{acc}/` and is keyed `results`.
data = interpro_get("entry/interpro/protein/uniprot/P04637/")
entries = data.get("results", [])
print(f"InterPro entries for TP53: {data.get('count')} (this page: {len(entries)})")
for e in entries[:4]:
m = e["metadata"]
print(f" {m['accession']} {m['type']:<25} {m['name']}")
# InterPro entries for TP53: 9
# IPR002117 family p53 tumour suppressor family
# IPR036674 homologous_superfamily p53-like tetramerisation domain superfamily
Search for InterPro entries by name keyword or fetch a specific entry by accession.
import requests
INTERPRO_BASE = "https://www.ebi.ac.uk/interpro/api"
def search_entries(query: str, entry_type: str = None,
page_size: int = 20) -> list:
"""Search InterPro entries by keyword; optionally filter by type."""
params = {"search": query, "page_size": page_size}
if entry_type:
params["type"] = entry_type # family, domain, homologous_superfamily, repeat, site
r = requests.get(
f"{INTERPRO_BASE}/entry/interpro/",
params=params,
headers={"Accept": "application/json"},
timeout=30
)
r.raise_for_status()
return r.json().get("results", [])
hits = search_entries("serine kinase", entry_type="domain")
print(f"InterPro domain entries matching 'serine kinase': {len(hits)}")
for h in hits[:5]:
m = h["metadata"]
print(f" {m['accession']} {m['type']:<10} {m['name']}")
# InterPro domain entries matching 'serine kinase': 8
# IPR000719 domain Protein kinase domain
# IPR008271 domain Serine/threonine/tyrosine kinase, active site
# Fetch a specific InterPro entry by accession
r = requests.get(
f"{INTERPRO_BASE}/entry/interpro/IPR000719/",
headers={"Accept": "application/json"},
timeout=30
)
r.raise_for_status()
meta = r.json()["metadata"]
print(f"Accession : {meta['accession']}")
print(f"Name : {meta['name']}")
print(f"Type : {meta['type']}")
print(f"Member DBs : {list(meta.get('member_databases', {}).keys())}")
go_terms = meta.get("go_terms", [])
print(f"GO terms : {[g['identifier'] for g in go_terms[:3]]}")
# Accession : IPR000719
# Name : Protein kinase domain
# Type : domain
# Member DBs : ['pfam', 'smart', 'cdd', 'ncbifam', 'panther']
# GO terms : ['GO:0004672', 'GO:0005524', 'GO:0006468']
Retrieve all InterPro entries (domains, families, sites) matched in a protein by UniProt accession.
import requests
INTERPRO_BASE = "https://www.ebi.ac.uk/interpro/api"
def get_protein_domain_architecture(uniprot_acc: str) -> dict:
"""Return all InterPro entry matches for a protein. Uses the
`entry/interpro/protein/uniprot/{acc}/` endpoint, which returns
{count, next, previous, results}. Each result has metadata + a
nested `proteins[0].entry_protein_locations` for the per-protein match."""
r = requests.get(
f"{INTERPRO_BASE}/entry/interpro/protein/uniprot/{uniprot_acc}/",
headers={"Accept": "application/json"},
timeout=60
)
r.raise_for_status()
return r.json()
data = get_protein_domain_architecture("P04637") # TP53
results = data.get("results", [])
# Pull length/source from the first match's nested protein record
prot0 = results[0]["proteins"][0] if results and results[0].get("proteins") else {}
print(f"Protein length : {prot0.get('protein_length')}")
print(f"Source DB : {prot0.get('source_database')}")
print(f"InterPro entries: {data.get('count')}")
for entry in results[:6]:
m = entry["metadata"]
# Locations are nested under proteins[0].entry_protein_locations
locs = entry["proteins"][0].get("entry_protein_locations", []) if entry.get("proteins") else []
loc_str = ", ".join(
f"{frag['start']}-{frag['end']}"
for loc in locs for frag in loc.get("fragments", [])
)
print(f" {m['accession']} {m['type']:<25} {m['name'][:35]:<35} [{loc_str}]")
# Compare domain architectures of two proteins side-by-side
import pandas as pd
def domain_set(uniprot_acc: str) -> set:
data = get_protein_domain_architecture(uniprot_acc)
return {e["metadata"]["accession"] for e in data.get("results", [])}
brca1_domains = domain_set("P38398") # BRCA1
tp53_domains = domain_set("P04637") # TP53
shared = brca1_domains & tp53_domains
unique_brca1 = brca1_domains - tp53_domains
unique_tp53 = tp53_domains - brca1_domains
print(f"Shared InterPro entries: {len(shared)}")
print(f"BRCA1-unique : {len(unique_brca1)}")
print(f"TP53-unique : {len(unique_tp53)}")
List proteins that contain a specific InterPro entry (family or domain).
import requests, time
INTERPRO_BASE = "https://www.ebi.ac.uk/interpro/api"
def get_entry_proteins(interpro_acc: str, reviewed_only: bool = True,
page_size: int = 50) -> list:
"""Return proteins (UniProt) containing a given InterPro entry.
Path order is `/protein/{db}/entry/interpro/{IPR}/` — the inverse
`entry/interpro/{IPR}/protein/{db}/` times out (408) on large families."""
db = "reviewed" if reviewed_only else "uniprot"
r = requests.get(
f"{INTERPRO_BASE}/protein/{db}/entry/interpro/{interpro_acc}/",
params={"page_size": page_size},
headers={"Accept": "application/json"},
timeout=60
)
r.raise_for_status()
return r.json().get("results", [])
proteins = get_entry_proteins("IPR011009") # Protein kinase-like domain SF
print(f"Reviewed proteins with IPR011009 (page 1): {len(proteins)}")
for p in proteins[:4]:
m = p["metadata"]
# metadata fields: accession, gene, length, name, source_database, source_organism
print(f" {m['accession']} {(m.get('gene') or ''):<8} "
f"len={m.get('length', '?')} "
f"org={(m.get('source_organism') or {}).get('scientificName', '')[:30]}")
# Paginate all proteins for a family using cursor
def get_all_entry_proteins(interpro_acc: str,
reviewed_only: bool = True) -> list:
INTERPRO_BASE = "https://www.ebi.ac.uk/interpro/api"
db = "reviewed" if reviewed_only else "uniprot"
# Path-inverted: /protein/{db}/entry/interpro/{IPR}/ is the working order
url = f"{INTERPRO_BASE}/protein/{db}/entry/interpro/{interpro_acc}/"
all_proteins = []
params = {"page_size": 200}
while url:
r = requests.get(url, params=params,
headers={"Accept": "application/json"}, timeout=60)
r.raise_for_status()
data = r.json()
all_proteins.extend(data.get("results", []))
url = data.get("next")
params = None # next URL already has params encoded
if url:
time.sleep(1.0)
return all_proteins
proteins = get_all_entry_proteins("IPR000719") # Protein kinase domain
print(f"Total reviewed proteins with protein kinase domain: {len(proteins)}")
Get the taxonomic distribution of proteins annotated with a given InterPro entry.
import requests
INTERPRO_BASE = "https://www.ebi.ac.uk/interpro/api"
def get_entry_taxonomy(interpro_acc: str,
page_size: int = 50) -> list:
"""Return taxonomic summary for proteins in a given InterPro entry.
Path-inverted: `/taxonomy/uniprot/entry/interpro/{IPR}/`. Each result
has `metadata` (taxon: accession=taxId, name, parent, children, rank)
and `entries[]` (representative protein-match locations for that taxon)."""
r = requests.get(
f"{INTERPRO_BASE}/taxonomy/uniprot/entry/interpro/{interpro_acc}/",
params={"page_size": page_size},
headers={"Accept": "application/json"},
timeout=90
)
r.raise_for_status()
return r.json().get("results", [])
# Use a smaller entry (p53 DBD); IPR000719 (kinase) has ~270k taxa and times out.
taxa = get_entry_taxonomy("IPR011615")
print(f"Top taxa for IPR011615 (p53 DNA-binding domain):")
for t in taxa[:8]:
m = t["metadata"]
print(f" taxId={m['accession']:>10} {m.get('name', ''):<30} "
f"rank={m.get('rank') or 'n/a'}")
Retrieve PDB structures associated with an InterPro entry.
import requests
INTERPRO_BASE = "https://www.ebi.ac.uk/interpro/api"
def get_entry_structures(interpro_acc: str, page_size: int = 25) -> list:
"""Return PDB structures that include a match to a given InterPro entry.
Path-inverted: `/structure/pdb/entry/interpro/{IPR}/`. The flat form with
`?entry_interpro=...` is silently slow / 408s on this resource."""
r = requests.get(
f"{INTERPRO_BASE}/structure/pdb/entry/interpro/{interpro_acc}/",
params={"page_size": page_size},
headers={"Accept": "application/json"},
timeout=60
)
r.raise_for_status()
return r.json().get("results", [])
structures = get_entry_structures("IPR011009") # Protein kinase-like SF
print(f"PDB structures linked to IPR011009 (page 1): {len(structures)}")
for s in structures[:5]:
m = s["metadata"]
print(f" {m['accession'].upper()} resolution={m.get('resolution', 'N/A')} Å "
f"experiment={m.get('experiment_type', 'N/A')}")
# PDB structures linked to IPR011009: ~8,000+
# 1A06 resolution=2.5 Å experiment=x-ray
# ...
Download the FASTA sequences of proteins in an InterPro family for alignment or phylogenetics.
import requests, time
INTERPRO_BASE = "https://www.ebi.ac.uk/interpro/api"
def get_family_fasta(interpro_acc: str,
reviewed_only: bool = True,
max_sequences: int = 100) -> str:
"""Retrieve FASTA sequences for proteins in an InterPro entry."""
db = "reviewed" if reviewed_only else "uniprot"
proteins = []
# Path-inverted: protein-list-for-entry is /protein/{db}/entry/interpro/{IPR}/
url = f"{INTERPRO_BASE}/protein/{db}/entry/interpro/{interpro_acc}/"
params = {"page_size": min(max_sequences, 200)}
while url and len(proteins) < max_sequences:
r = requests.get(url, params=params,
headers={"Accept": "application/json"}, timeout=60)
r.raise_for_status()
data = r.json()
proteins.extend(data.get("results", []))
url = data.get("next") if len(proteins) < max_sequences else None
params = None
if url:
time.sleep(1.0)
# Fetch FASTA from UniProt for each accession
accessions = [p["metadata"]["accession"] for p in proteins[:max_sequences]]
fasta_url = "https://rest.uniprot.org/uniprotkb/stream"
query = " OR ".join(f"accession:{acc}" for acc in accessions)
r = requests.get(fasta_url,
params={"query": query, "format": "fasta"},
timeout=120)
r.raise_for_status()
return r.text
fasta = get_family_fasta("IPR000719", reviewed_only=True, max_sequences=20)
seq_count = fasta.count(">")
print(f"FASTA sequences retrieved: {seq_count}")
print(fasta[:300]) # preview first sequence header + start
InterPro classifies entries into five types. The type determines what biological relationship the match implies:
| Type | Description | Example |
|------|-------------|---------|
| family | Homologous group of proteins sharing common ancestry and function | IPR000719 (Protein kinase) |
| domain | Discrete structural and functional unit that can occur in multiple protein contexts | IPR011009 (Protein kinase-like SF) |
| homologous_superfamily | Structurally similar domains that may have diverged in sequence | IPR011993 (Pleckstrin-like) |
| repeat | Short, repeated sequence unit that occurs multiple times within a protein | IPR001440 (TPR repeat) |
| site | Short conserved motif: active site, binding site, or post-translational modification site | IPR008271 (Ser/Thr kinase active site) |
Each InterPro entry integrates signatures from one or more member databases. The InterPro accession (IPR...) is the unified meta-entry; member database accessions point to the underlying models:
| Member DB | Accession prefix | Modeling approach | |-----------|-----------------|-------------------| | Pfam | PF | Hidden Markov Models (profile HMMs) | | PANTHER | PTHR | Phylogenetic trees + HMMs | | PIRSF | PIRSF | Full-length HMMs | | PRINTS | PR | Fingerprint motif groups | | PROSITE | PS | Patterns and profiles | | SMART | SM | HMMs with database integration | | CDD | cd | Position-specific scoring matrices (PSSMs) | | NCBIfam | NF | NCBI-curated HMMs |
The InterPro API paginates results at the collection level. Each response includes a next URL (or null when exhausted) and a count field. For large families (e.g., kinases: 10,000+ proteins) always iterate using the next cursor.
import requests, time
def iterate_interpro(url: str, page_size: int = 200) -> list:
"""Generic paginator for any InterPro list endpoint."""
results = []
params = {"page_size": page_size}
while url:
r = requests.get(url, params=params,
headers={"Accept": "application/json"}, timeout=60)
r.raise_for_status()
data = r.json()
results.extend(data.get("results", []))
url = data.get("next")
params = None
if url:
time.sleep(1.0)
return results
Goal: Retrieve all InterPro domains for a list of proteins and produce a summary table showing which domains each protein carries.
import requests, time, pandas as pd
INTERPRO_BASE = "https://www.ebi.ac.uk/interpro/api"
def get_domains(uniprot_acc: str) -> list:
"""List InterPro entries for a protein. Uses the
`entry/interpro/protein/uniprot/{acc}/` endpoint (keyed `results`)."""
r = requests.get(
f"{INTERPRO_BASE}/entry/interpro/protein/uniprot/{uniprot_acc}/",
headers={"Accept": "application/json"}, timeout=60
)
if r.status_code == 404:
return []
r.raise_for_status()
data = r.json()
return [
{
"protein": uniprot_acc,
"accession": e["metadata"]["accession"],
"name": e["metadata"]["name"],
"type": e["metadata"]["type"],
"source_db": list(e["metadata"].get("member_databases", {}).keys()),
}
for e in data.get("results", [])
]
proteins = ["P04637", "P38398", "Q00987", "P10415"] # TP53, BRCA1, MDM2, BCL2
rows = []
for acc in proteins:
rows.extend(get_domains(acc))
time.sleep(1.0)
df = pd.DataFrame(rows)
print(f"Total domain matches: {len(df)}")
print(df.groupby(["protein", "type"])["accession"].count().unstack(fill_value=0))
# Pivot: proteins × domain accessions
pivot = df[df["type"] == "domain"].pivot_table(
index="protein", columns="accession", aggfunc="size", fill_value=0
)
pivot.to_csv("domain_architecture_matrix.csv")
print(f"\nDomain × protein matrix: {pivot.shape}")
Goal: Retrieve proteins in a kinase domain family that have experimental structures in the PDB, ranked by resolution.
import requests, time, pandas as pd
INTERPRO_BASE = "https://www.ebi.ac.uk/interpro/api"
# Step 1: Get PDB structures linked to the protein kinase-like SF entry.
# Use the path-inverted form; the flat `?entry_interpro=` filter 408s.
r = requests.get(
f"{INTERPRO_BASE}/structure/pdb/entry/interpro/IPR011009/",
params={"page_size": 200},
headers={"Accept": "application/json"}, timeout=60
)
r.raise_for_status()
structures = r.json().get("results", [])
print(f"PDB structures with IPR011009 (kinase-like SF, page 1): {len(structures)}")
rows = []
for s in structures:
m = s["metadata"]
rows.append({
"pdb_id": m["accession"],
"resolution": m.get("resolution"),
"experiment": m.get("experiment_type", ""),
"name": m.get("name", ""),
})
df = pd.DataFrame(rows)
df = df.dropna(subset=["resolution"]).sort_values("resolution")
print(f"\nTop 10 highest-resolution kinase structures:")
print(df[["pdb_id", "resolution", "experiment", "name"]].head(10).to_string(index=False))
df.to_csv("kinase_structures.csv", index=False)
print(f"\nSaved kinase_structures.csv ({len(df)} X-ray / cryo-EM structures)")
Goal: Visualize how many reviewed proteins in each major kingdom carry a given InterPro domain.
import requests, time
import pandas as pd
import matplotlib.pyplot as plt
INTERPRO_BASE = "https://www.ebi.ac.uk/interpro/api"
def get_taxonomy_counts(interpro_acc: str, page_size: int = 100,
max_pages: int = 5) -> pd.DataFrame:
"""Walk the taxonomy results for an InterPro entry. The API does not
expose a per-taxon protein count at this endpoint — instead each
paged record is one (taxon × representative protein-match) row.
Aggregate client-side by taxon name to approximate frequency."""
rows, url = [], f"{INTERPRO_BASE}/taxonomy/uniprot/entry/interpro/{interpro_acc}/"
params = {"page_size": page_size}
for _ in range(max_pages):
if not url:
break
r = requests.get(url, params=params,
headers={"Accept": "application/json"}, timeout=90)
r.raise_for_status()
data = r.json()
for t in data.get("results", []):
m = t["metadata"]
rows.append({
"taxon_id": m["accession"],
"name": m.get("name", ""),
"rank": m.get("rank") or "",
})
url = data.get("next")
params = None
if url:
time.sleep(1.0)
return pd.DataFrame(rows)
IPR_ACC = "IPR011615" # p53 DNA-binding domain (smaller; kinase 408s)
df = get_taxonomy_counts(IPR_ACC, max_pages=3)
print(f"Tax entries pulled for {IPR_ACC}: {len(df)}")
# Aggregate by name and take top 15 (each row = one rep. protein-match)
top = (df.groupby("name").size().sort_values(ascending=False).head(15)
.reset_index(name="rep_matches"))
fig, ax = plt.subplots(figsize=(10, 5))
bars = ax.barh(top["name"], top["rep_matches"], color="#2171B5")
ax.bar_label(bars, fmt="%d", padding=3, fontsize=8)
ax.set_xlabel("Representative protein-matches")
ax.set_title(f"Taxonomic distribution of {IPR_ACC} (p53 DNA-binding domain)")
ax.invert_yaxis()
plt.tight_layout()
plt.savefig(f"{IPR_ACC}_taxonomy.png", dpi=150, bbox_inches="tight")
print(f"Saved {IPR_ACC}_taxonomy.png")
| Parameter | Endpoint | Default | Range / Options | Effect |
|-----------|----------|---------|-----------------|--------|
| search | entry/interpro/ | — | free-text string | Keyword filter on entry name and short name |
| type | entry/interpro/ | all types | family, domain, homologous_superfamily, repeat, site | Filter entries by InterPro type |
| page_size | all list endpoints | 20 | 1–200 | Results returned per page |
| entry_interpro | structure/pdb/ | — | IPR###### | Filter structures by linked InterPro entry |
| source_database | protein/ | — | reviewed, uniprot, trembl | Filter proteins by UniProt curation level |
| reviewed (URL path) | entry/{ipr}/{acc}/protein/ | uniprot | reviewed, uniprot | Swiss-Prot reviewed only vs all UniProtKB |
| relations | entry/interpro/{acc}/ | — | contains, contained_by, child_of, parent_of | Navigate the InterPro hierarchy |
| next | all list endpoints | — | URL from response | Cursor-based pagination; use the full URL from the next field |
Use reviewed proteins for curated domain lists: The unreviewed TrEMBL set is 5–10× larger and contains automated predictions. For benchmarking, family analysis, or training sets, restrict to reviewed (Swiss-Prot) entries to avoid noise from unreviewed predictions.
Chunk large taxonomy or protein lists: Retrieving all 10,000+ proteins for a broad family like the protein kinase superfamily can take minutes and produce large payloads. Limit queries with page_size=200 and the next cursor; store intermediate results to disk.
Add time.sleep(1.0) between paginated calls: The InterPro API is shared EBI infrastructure with no published rate limit. A 1-second pause per page is a safe minimum for batch scripts.
Prefer InterPro accessions over member DB accessions for cross-database queries: A Pfam PF00069 and PANTHER PTHR24340 both model kinase domains but with different protein coverage. Using the parent InterPro IPR000719 gives the union of all member DB matches in one query.
Check type before interpreting entry_protein_locations: Only domain, repeat, and site entries carry meaningful position information. family and homologous_superfamily entries typically span the full protein and their coordinates are less informative.
When to use: Given a UniProt accession, rapidly list which InterPro domains it contains.
import requests
INTERPRO_BASE = "https://www.ebi.ac.uk/interpro/api"
def list_protein_domains(uniprot_acc: str) -> list:
"""Return list of (accession, type, name) tuples for a protein."""
r = requests.get(
f"{INTERPRO_BASE}/entry/interpro/protein/uniprot/{uniprot_acc}/",
headers={"Accept": "application/json"}, timeout=60
)
r.raise_for_status()
return [
(e["metadata"]["accession"], e["metadata"]["type"], e["metadata"]["name"])
for e in r.json().get("results", [])
]
domains = list_protein_domains("P00533") # EGFR
print(f"InterPro entries in EGFR (P00533): {len(domains)}")
for acc, etype, name in domains:
print(f" {acc} {etype:<25} {name}")
# InterPro entries in EGFR (P00533): 10
# IPR009030 homologous_superfamily Growth factor receptor, cysteine-rich
# IPR000719 domain Protein kinase domain
When to use: Map how many proteins in a domain family are covered by each member database (Pfam vs PANTHER vs SMART, etc.).
import requests, time
import pandas as pd
INTERPRO_BASE = "https://www.ebi.ac.uk/interpro/api"
interpro_acc = "IPR000719" # Protein kinase domain
r = requests.get(
f"{INTERPRO_BASE}/entry/interpro/{interpro_acc}/",
headers={"Accept": "application/json"}, timeout=30
)
r.raise_for_status()
member_dbs = r.json()["metadata"].get("member_databases", {})
print(f"Member databases for {interpro_acc}:")
for db, details in member_dbs.items():
print(f" {db}: {details}")
# Visualize member database source breakdown
labels = list(member_dbs.keys())
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(7, 4))
ax.bar(labels, [1] * len(labels), color="#4472C4") # presence/absence per DB
ax.set_ylabel("Integrated (1=yes)")
ax.set_title(f"Member databases in {interpro_acc}")
plt.tight_layout()
plt.savefig(f"{interpro_acc}_member_dbs.png", dpi=150, bbox_inches="tight")
When to use: Bridge from structural domain to functional GO annotation.
import requests
INTERPRO_BASE = "https://www.ebi.ac.uk/interpro/api"
def get_go_terms_for_entry(interpro_acc: str) -> list:
"""Return GO terms associated with an InterPro entry."""
r = requests.get(
f"{INTERPRO_BASE}/entry/interpro/{interpro_acc}/",
headers={"Accept": "application/json"}, timeout=30
)
r.raise_for_status()
go_terms = r.json()["metadata"].get("go_terms", [])
return [
{"id": g["identifier"], "name": g["name"],
"category": g.get("category", {}).get("name", "")}
for g in go_terms
]
go_terms = get_go_terms_for_entry("IPR000719")
print(f"GO terms for IPR000719 (protein kinase domain): {len(go_terms)}")
for g in go_terms:
print(f" {g['id']} [{g['category'][:2].upper()}] {g['name']}")
# GO terms for IPR000719 (protein kinase domain): 3
# GO:0004672 [MO] protein kinase activity
# GO:0005524 [MO] ATP binding
# GO:0006468 [BI] protein phosphorylation
| Problem | Cause | Solution |
|---------|-------|----------|
| HTTP 404 on protein lookup | Accession not found in InterPro | Verify the UniProt accession exists; isoform accessions (P12345-2) may not be indexed separately |
| Empty entries list for a protein | Protein has no InterPro matches (e.g., intrinsically disordered) | Check UniProt directly; not all proteins have classified domains |
| protein/uniprot/{acc}/ returns only metadata (no entries) | That endpoint is protein-only; entry matches live elsewhere | Use entry/interpro/protein/uniprot/{acc}/ and read the results[] key |
| entry/interpro/{IPR}/protein/{db}/ returns 408 / hangs | The path with entry/... first does a slow join | Invert the path: protein/{db}/entry/interpro/{IPR}/ |
| structure/pdb/?entry_interpro={IPR} times out (408) | Same join order issue | Use structure/pdb/entry/interpro/{IPR}/ |
| entry/interpro/{IPR}/taxonomy/uniprot/ 408s for large families | Same | Use taxonomy/uniprot/entry/interpro/{IPR}/; for very large entries (e.g. IPR000719 kinase) the inverted form may still 408 — fall back to a more specific sub-family entry |
| HTTP 400 on entry search | Invalid query parameters or unsupported type value | Use one of: family, domain, homologous_superfamily, repeat, site |
| Pagination stops early | next is null before expected count | This is correct; all results have been returned |
| Very slow response for large families | Protein set has thousands of members | Increase page_size to 200; persist results after each page |
| ConnectionError or Timeout | Transient network or server issue | Retry with exponential backoff; EBI services occasionally have brief downtimes |
| Member DB accessions missing | Entry is new and member DB integration is pending | Use the InterPro accession for queries; member DB-level details update with each release |
uniprot-protein-database — UniProt REST API for protein sequences, Swiss-Prot functional annotations (active sites, PTMs, disease associations), and ID mappingesm-protein-language-model — Generate protein language model embeddings for sequences; useful after identifying a protein family with InterPropdb-database — Retrieve and download experimental 3D structures by PDB ID; cross-reference structure IDs discovered via InterPro structure queriestools
Fast short-read DNA aligner for WGS/WES/ChIP-seq. 2× faster BWA-MEM successor; outputs SAM/BAM with read group headers for GATK. Primary plus supplementary records for chimeric reads. Use STAR for RNA-seq splice-aware alignment; Bowtie2 is a comparable alternative.
tools
smina molecular docking CLI. AutoDock Vina fork with customizable scoring functions, native SDF/MOL2/PDB ligand input, autoboxing, local energy minimization, and per-atom score breakdowns. Pipeline: receptor PDBQT prep -> ligand prep (RDKit/OpenBabel) -> dock via autobox or explicit grid -> rescore/minimize with custom scoring -> rank poses by affinity. Choose smina over Vina when you need custom scoring terms (--custom_scoring), local optimization of an existing pose (--local_only), per-atom contributions (--atom_term_data), or SDF/MOL2 ligands without manual PDBQT conversion. For unknown binding sites use diffdock-blind-docking; for the Python-bindings/Vinardo workflow use autodock-vina-docking.
development
mdtraj molecular dynamics trajectory analysis (Python). Reads DCD/XTC/TRR/NetCDF/H5/PDB topologies and trajectories; computes RMSD vs time, radius of gyration, per-residue RMSF, residue-residue contact frequency maps, phi/psi torsions for Ramachandran plots (general + Gly/Pro), and 8-state DSSP secondary structure. Modules: trajectory I/O, geometry (distances/angles/dihedrals), structural analysis (RMSD/Rg/RMSF/SASA), contacts, hydrogen bonds, secondary structure (DSSP), NMR observables. For broader atom-selection grammar use mdanalysis-trajectory; for running MD simulations use OpenMM/GROMACS.
development
Programmatic PubMed access via NCBI E-utilities REST API. Covers Boolean/MeSH queries, field-tagged search, endpoints (ESearch, EFetch, ESummary, EPost, ELink), history server for batches, citation matching, systematic review strategies. Use for biomedical literature search or automated pipelines.