skills/systems-biology-multiomics/string-database-ppi/SKILL.md
Query STRING REST API for PPIs (59M proteins, 20B interactions, 5000+ species). Retrieve networks, run GO/KEGG enrichment, find partners, test PPI significance, visualize networks, analyze homology. For chemical interactions use chembl-database-bioactivity; pathways use kegg-database.
Query the STRING protein-protein interaction database (59M proteins, 20B+ interactions, 5000+ species) via REST API. Covers network retrieval, functional enrichment (GO, KEGG, Pfam), interaction partner discovery, PPI enrichment testing, network visualization, and homology analysis.
uv pip install requests pandas
Rate limiting: No strict rate limit, but wait ~1 second between API calls. For proteome-scale analyses, use bulk downloads from https://string-db.org/cgi/download instead of the API.
import requests
import time

STRING_API = "https://string-db.org/api"

def string_query(endpoint, params, fmt="tsv"):
    """Reusable helper for all STRING API calls."""
    url = f"{STRING_API}/{fmt}/{endpoint}"
    params.setdefault("caller_identity", "python_script")
    response = requests.get(url, params=params)
    response.raise_for_status()
    return response.text

# Map gene names to STRING IDs (always do this first)
result = string_query("get_string_ids", {
    "identifiers": "TP53\nBRCA1\nEGFR",
    "species": 9606
})
print(result)

# Get interaction network
time.sleep(1)
network = string_query("network", {
    "identifiers": "TP53%0dBRCA1%0dMDM2",
    "species": 9606,
    "required_score": 400
})
print(network[:500])
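The ~1 second pause recommended above can be enforced in one place instead of scattering `time.sleep(1)` calls. The following is a minimal sketch, not part of the STRING API: `Throttle` is a hypothetical helper name, and the injectable `clock` and `sleep` arguments exist only so the behaviour can be checked without real waiting.

```python
import time


class Throttle:
    """Enforce a minimum interval between successive calls (hypothetical helper)."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call; None before the first call

    def wait(self):
        """Sleep just long enough to honour min_interval, then record the call time."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Usage (assumes the string_query helper defined earlier):
# throttle = Throttle(1.0)
# throttle.wait(); string_query("get_string_ids", {...})
# throttle.wait(); string_query("network", {...})
```

Compared with a fixed sleep, this only waits for whatever part of the interval has not already elapsed, so time spent parsing results between calls is not added on top.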
| Organism | Common Name | Taxon ID |
|----------|-------------|----------|
| Homo sapiens | Human | 9606 |
| Mus musculus | Mouse | 10090 |
| Rattus norvegicus | Rat | 10116 |
| Drosophila melanogaster | Fruit fly | 7227 |
| Caenorhabditis elegans | C. elegans | 6239 |
| Saccharomyces cerevisiae | Yeast | 4932 |
| Arabidopsis thaliana | Thale cress | 3702 |
| Escherichia coli K-12 | E. coli | 511145 |
| Danio rerio | Zebrafish | 7955 |
| Gallus gallus | Chicken | 9031 |
Full species list: https://string-db.org/cgi/input?input_page_active_form=organisms
STRING uses Ensembl protein IDs with taxon prefix: {taxonId}.{ensemblProteinId} (e.g., 9606.ENSP00000269305 for human TP53). Always map gene names to STRING IDs first via get_string_ids for faster subsequent queries.
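To make the `{taxonId}.{ensemblProteinId}` layout concrete, here is a small sketch that splits a STRING ID into its two parts. `StringId` and `parse_string_id` are hypothetical names introduced for illustration; the split is done at the first dot on the assumption that the protein part may itself contain dots for some organisms.

```python
from typing import NamedTuple


class StringId(NamedTuple):
    taxon_id: int     # NCBI taxon ID, e.g. 9606 for human
    protein_id: str   # protein identifier following the taxon prefix


def parse_string_id(string_id: str) -> StringId:
    """Split '<taxonId>.<proteinId>' at the first dot and validate both parts."""
    taxon, sep, protein = string_id.partition(".")
    if not sep or not taxon.isdigit() or not protein:
        raise ValueError(f"not a STRING identifier: {string_id!r}")
    return StringId(int(taxon), protein)


parsed = parse_string_id("9606.ENSP00000269305")
print(parsed.taxon_id, parsed.protein_id)
```

A check like this is a cheap way to catch unmapped gene names (such as a bare `TP53`) before they are sent to endpoints that expect STRING IDs.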
Combined scores (0-1000) integrating 7 evidence channels:
| Channel | Code | Source |
|---------|------|--------|
| Neighborhood | nscore | Conserved genomic neighborhood |
| Fusion | fscore | Gene fusion events |
| Phylogenetic profile | pscore | Co-occurrence across species |
| Coexpression | ascore | Correlated RNA expression |
| Experimental | escore | Biochemical/genetic experiments |
| Database | dscore | Curated pathway/complex databases |
| Text-mining | tscore | Literature co-occurrence and NLP |
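Recommended thresholds (the standard STRING confidence cutoffs): 150 (low), 400 (medium, the default), 700 (high), 900 (highest).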
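As an illustration of how a `required_score` value maps to a confidence tier, the sketch below labels a combined score on the 0-1000 scale. It assumes the cutoffs 150, 400, 700 and 900 used by the STRING web interface; `confidence_label` is a hypothetical helper, not an API function.

```python
def confidence_label(score):
    """Label a combined score (0-1000 scale) with an assumed STRING confidence tier."""
    if not 0 <= score <= 1000:
        raise ValueError("score must be between 0 and 1000")
    if score >= 900:
        return "highest"
    if score >= 700:
        return "high"
    if score >= 400:
        return "medium"
    if score >= 150:
        return "low"
    return "below low-confidence cutoff"


for s in (950, 700, 400, 200, 100):
    print(s, confidence_label(s))
```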
Replace /tsv/ in the URL with the desired format, for example /json/ for JSON or /image/ for a network image.
# Map gene names to STRING IDs
result = string_query("get_string_ids", {
    "identifiers": "TP53\nBRCA1\nEGFR",
    "species": 9606,
    "limit": 1,       # matches per identifier
    "echo_query": 1   # include query term in output
})

# Parse the mapping
import pandas as pd
import io

df = pd.read_csv(io.StringIO(result), sep='\t')
id_map = dict(zip(df['queryItem'], df['stringId']))
print(id_map)
# {'TP53': '9606.ENSP00000269305', 'BRCA1': '9606.ENSP00000...', ...}
# Get PPI network with confidence scores
network = string_query("network", {
    "identifiers": "TP53%0dBRCA1%0dMDM2%0dATM%0dCHEK2",
    "species": 9606,
    "required_score": 400,
    "network_type": "functional"  # or "physical"
})

# Parse network edges
df = pd.read_csv(io.StringIO(network), sep='\t')
print(f"Found {len(df)} interactions")
print(df[['preferredName_A', 'preferredName_B', 'score']].head())

# Expand network with additional interactors
time.sleep(1)  # pause before the next API call
expanded = string_query("network", {
    "identifiers": "TP53",
    "species": 9606,
    "add_nodes": 10,  # add 10 most connected proteins
    "required_score": 700
})
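Once an edge table has been retrieved, counting edges per protein is a quick way to spot hubs. The sketch below does this with the standard library on a tiny hand-written sample; it assumes the TSV carries the `preferredName_A` and `preferredName_B` columns used above, and the sample rows and scores are illustrative values, not real STRING output. `node_degrees` is a hypothetical helper.

```python
import csv
import io
from collections import Counter


def node_degrees(network_tsv):
    """Count distinct undirected edges per protein in a network TSV string."""
    reader = csv.DictReader(io.StringIO(network_tsv), delimiter="\t")
    # Store each edge as an unordered pair so A-B and B-A count only once
    edges = {
        frozenset((row["preferredName_A"], row["preferredName_B"]))
        for row in reader
    }
    degrees = Counter()
    for edge in edges:
        for node in edge:
            degrees[node] += 1
    return degrees


# Illustrative sample only (made-up scores); the third row repeats the first edge
sample = (
    "preferredName_A\tpreferredName_B\tscore\n"
    "TP53\tMDM2\t0.9\n"
    "TP53\tATM\t0.8\n"
    "MDM2\tTP53\t0.9\n"
)
print(node_degrees(sample).most_common())
```

For anything beyond simple counts (centrality, clustering, layout), a graph library is the better tool; this only shows the shape of the data.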
# Get PNG network image
url = f"{STRING_API}/image/network"
params = {
    "identifiers": "TP53%0dMDM2%0dATM%0dCHEK2%0dBRCA1",
    "species": 9606,
    "required_score": 700,
    "network_flavor": "evidence",  # "evidence", "confidence", or "actions"
    "caller_identity": "python_script"
}
response = requests.get(url, params=params)
response.raise_for_status()  # avoid writing an error page to disk as a PNG
with open("network.png", "wb") as f:
    f.write(response.content)
# Discover top interaction partners
partners = string_query("interaction_partners", {
    "identifiers": "TP53",
    "species": 9606,
    "limit": 20,
    "required_score": 700
})
df = pd.read_csv(io.StringIO(partners), sep='\t')
print("Top 10 of up to 20 TP53 interactors:")
print(df[['preferredName_B', 'score']].head(10))
# GO, KEGG, Pfam, InterPro, SMART, UniProt Keywords enrichment
# Statistical method: Fisher's exact test with Benjamini-Hochberg FDR correction
enrichment = string_query("enrichment", {
    "identifiers": "TP53%0dMDM2%0dATM%0dCHEK2%0dBRCA1%0dATR%0dTP73",
    "species": 9606
})
df = pd.read_csv(io.StringIO(enrichment), sep='\t')
significant = df[df['fdr'] < 0.05]
print(f"Significant terms: {len(significant)}")

# Group by annotation category
for cat, group in significant.groupby('category'):
    print(f"\n{cat}: {len(group)} terms")
    for _, row in group.head(3).iterrows():
        print(f"  {row['description']} (FDR={row['fdr']:.2e})")
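The `fdr` column used above is already corrected by STRING, so no further adjustment is needed. For readers who want to see what a Benjamini-Hochberg correction does to a list of raw p-values, here is a minimal standard-library sketch of the textbook procedure. It is not STRING's implementation, and `benjamini_hochberg` is a hypothetical helper name.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input list."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity with a running minimum
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(raw))
```

Each raw p-value is scaled by (number of tests / rank), and the running minimum keeps the adjusted values from decreasing as the raw p-values increase.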
import json

# Test if proteins form a significant functional module
result = string_query("ppi_enrichment", {
    "identifiers": "TP53%0dMDM2%0dATM%0dCHEK2%0dBRCA1",
    "species": 9606,
    "required_score": 400
}, fmt="json")
data = json.loads(result)
# The JSON response may be a one-element list wrapping the result record; unwrap it if so
if isinstance(data, list):
    data = data[0]
print(f"Observed edges: {data['number_of_edges']}")
print(f"Expected edges: {data['expected_number_of_edges']}")
print(f"P-value: {data['p_value']}")
# p < 0.05 → the proteins have more interactions among themselves than expected by chance
# Get homology/similarity between proteins
homology = string_query("homology", {
    "identifiers": "TP53%0dTP63%0dTP73",
    "species": 9606
})
print(homology)
import requests, pandas as pd, io, json, time

STRING_API = "https://string-db.org/api"

def string_query(endpoint, params, fmt="tsv"):
    url = f"{STRING_API}/{fmt}/{endpoint}"
    params.setdefault("caller_identity", "python_script")
    response = requests.get(url, params=params)
    response.raise_for_status()
    time.sleep(1)  # built-in pause so consecutive calls stay ~1 second apart
    return response.text

genes = "TP53%0dBRCA1%0dATM%0dCHEK2%0dMDM2%0dATR%0dBRCA2"

# Step 1: Map identifiers
mapping = string_query("get_string_ids", {"identifiers": genes.replace("%0d", "\n"), "species": 9606})

# Step 2: Get interaction network
network = string_query("network", {"identifiers": genes, "species": 9606, "required_score": 400})
net_df = pd.read_csv(io.StringIO(network), sep='\t')
print(f"Network: {len(net_df)} interactions")

# Step 3: Test PPI enrichment
ppi = json.loads(string_query("ppi_enrichment", {"identifiers": genes, "species": 9606}, fmt="json"))
if isinstance(ppi, list):  # unwrap if the result record arrives inside a list
    ppi = ppi[0]
print(f"PPI enrichment p-value: {ppi['p_value']}")

# Step 4: Functional enrichment
enrich = string_query("enrichment", {"identifiers": genes, "species": 9606})
enrich_df = pd.read_csv(io.StringIO(enrich), sep='\t')
sig = enrich_df[enrich_df['fdr'] < 0.05]
print(f"Significant enrichment terms (all categories): {len(sig)}")

# Step 5: Save network image
img_resp = requests.get(f"{STRING_API}/image/network", params={
    "identifiers": genes, "species": 9606, "required_score": 400,
    "network_flavor": "evidence", "caller_identity": "python_script"
})
img_resp.raise_for_status()
with open("protein_network.png", "wb") as f:
    f.write(img_resp.content)
# Start with seed proteins, discover connected functional modules
seed = "TP53"
# Step 1: Get high-confidence interaction partners
partners = string_query("interaction_partners", {
    "identifiers": seed, "species": 9606, "limit": 30, "required_score": 700
})
df = pd.read_csv(io.StringIO(partners), sep='\t')
all_proteins = list(set(df['preferredName_A'].tolist() + df['preferredName_B'].tolist()))
print(f"Expanded network: {len(all_proteins)} proteins")
# Step 2: Enrichment on expanded set
expanded_ids = "%0d".join(all_proteins[:50])
enrichment = string_query("enrichment", {"identifiers": expanded_ids, "species": 9606})
enrich_df = pd.read_csv(io.StringIO(enrichment), sep='\t')
modules = enrich_df[enrich_df['fdr'] < 0.001]
print(f"Highly significant terms: {len(modules)}")
# Compare protein interactions across species
for species, name, gene in [(9606, "Human", "TP53"), (10090, "Mouse", "Trp53")]:
    network = string_query("network", {
        "identifiers": gene, "species": species,
        "required_score": 700, "add_nodes": 5
    })
    df = pd.read_csv(io.StringIO(network), sep='\t')
    print(f"{name} ({gene}): {len(df)} interactions at score >= 700")
import pandas as pd, io
enrichment_tsv = string_query("enrichment", {
    "identifiers": "TP53%0dBRCA1%0dATM", "species": 9606
})
df = pd.read_csv(io.StringIO(enrichment_tsv), sep='\t')
# Columns: category, term, description, number_of_genes, p_value, fdr
kegg = df[df['category'] == 'KEGG'].sort_values('fdr')
print(kegg[['description', 'fdr']].head(5))
import time

protein_lists = [["TP53", "MDM2"], ["EGFR", "ERBB2"], ["BRCA1", "BRCA2"]]
results = []
for proteins in protein_lists:
    ids = "%0d".join(proteins)
    network = string_query("network", {"identifiers": ids, "species": 9606})
    results.append(network)
    time.sleep(1)  # respect rate limits
version = string_query("version", {})
print(f"STRING version: {version.strip()}")
# Include in methods section: "STRING v{version}, accessed {date}"
| Parameter | Endpoint | Default | Description |
|-----------|----------|---------|-------------|
| identifiers | All | — | Protein IDs, %0d-separated for URL or \n-separated for POST |
| species | All | — | NCBI taxon ID (9606=human, 10090=mouse) |
| required_score | network, partners, ppi_enrichment | 400 | Confidence threshold 0-1000 |
| network_type | network | functional | functional (all evidence) or physical (direct binding) |
| add_nodes | network, image | 0 | Additional connected proteins to include (0-10) |
| limit | get_string_ids, partners | 1/10 | Max results per query |
| network_flavor | image | evidence | evidence, confidence, or actions |
| Problem | Cause | Solution |
|---------|-------|----------|
| No proteins found | Wrong species or identifier typo | Verify species taxon ID; use get_string_ids to check identifier mapping |
| Empty network | Too strict confidence threshold | Lower required_score; verify proteins actually interact in STRING |
| Timeout on large queries | Too many proteins in single request | Split into batches of 50-100; use bulk downloads for proteome-scale |
| "Species required" error | Missing species for >10 protein networks | Always include species parameter |
| Unexpected results | Wrong network type or STRING version | Check network_type (functional vs physical); verify version with /version |
| 400 Bad Request | Malformed identifiers | Use %0d separator in URL or \n in POST body; URL-encode special characters |
| Enrichment returns no terms | Too few input proteins | Enrichment needs 5+ proteins for meaningful results |
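The "split into batches of 50-100" advice in the table can be sketched with a small generator. `batched` is a hypothetical helper, and the default batch size of 50 is taken from the lower end of that range.

```python
def batched(identifiers, size=50):
    """Yield consecutive slices of at most `size` identifiers."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(identifiers), size):
        yield identifiers[start:start + size]


# Illustrative placeholder names, not real gene symbols
genes = [f"GENE{i}" for i in range(120)]
for batch in batched(genes, 50):
    print(len(batch), batch[0], batch[-1])

# Usage with the string_query helper defined earlier (not executed here):
# for batch in batched(genes, 50):
#     string_query("get_string_ids", {"identifiers": "\n".join(batch), "species": 9606})
#     time.sleep(1)
```

Batching suits per-protein calls such as identifier mapping or partner lookup. For the network endpoint, note that interactions between proteins placed in different batches will not be returned, which is one reason the bulk downloads are preferable at proteome scale.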
- Call get_string_ids() before other operations; STRING IDs (e.g., 9606.ENSP00000269305) are faster than gene names
- Use time.sleep(1) between API calls
- Record the STRING version (string_version()) for reproducibility

- networkx-graph-analysis: Graph analysis and visualization of STRING interaction networks
- kegg-database: Pathway-centric queries complementary to STRING enrichment
- bioservices-multi-database: Alternative access to STRING via the PSICQUIC interface

Main SKILL.md + 1 reference file. Original total: 990 lines (SKILL.md 534 + string_reference.md 456). Scripts: 370 lines (string_api.py).
references/api_advanced.md: Advanced API features (values/ranks enrichment, bulk upload, R/Cytoscape integration), output format details, HTTP error codes, data license — content from original string_reference.md that exceeds Core API scope.
Original file disposition:
- SKILL.md (534 lines) → Core API modules 1-7, Workflows 1-3, Quick Start helper function, Key Concepts (species table, score thresholds, network types). "Common Use Cases" per-operation subsections consolidated into Core API module descriptions (rule 7b): each operation's "When to use" and "Use cases" → Core API intro text. "Detailed Reference" stub section → removed, content consolidated inline
- references/string_reference.md (456 lines) → Partially consolidated inline: API endpoints → Core API modules with code blocks; species table → Key Concepts; confidence scores → Key Concepts; identifier format → Key Concepts. Advanced features (values/ranks enrichment, bulk upload), integration examples (R STRINGdb, Cytoscape), output format details, HTTP error codes, data license → migrated to references/api_advanced.md
- scripts/string_api.py (370 lines) → Helper function pattern absorbed into Quick Start (string_query reusable function). Per-function disposition: string_map_ids → Core API Module 1; string_network → Module 2; string_network_image → Module 3; string_interaction_partners → Module 4; string_enrichment → Module 5; string_ppi_enrichment → Module 6; string_homology → Module 7; string_version → Recipe. All were thin wrappers around urllib; replaced with requests-based string_query helper

Retention: ~460 lines (SKILL.md) + ~180 lines (reference) = ~640 / 990 original = ~65%.