scientific-skills/alphafold-database/SKILL.md
Access AlphaFold 200M+ AI-predicted protein structures. Retrieve structures by UniProt ID, download PDB/mmCIF files, analyze confidence metrics (pLDDT, PAE), for drug discovery and structural biology.
npx skillsauth add K-Dense-AI/claude-scientific-skills alphafold-databaseInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
AlphaFold DB is a public repository of AI-predicted 3D protein structures for over 200 million proteins, maintained by DeepMind and EMBL-EBI. Access structure predictions with confidence metrics, download coordinate files, retrieve bulk datasets, and integrate predictions into computational workflows.
This skill should be used when working with AI-predicted protein structures in scenarios such as:
Using Biopython (Recommended):
The Biopython library provides the simplest interface for retrieving AlphaFold structures:
from Bio.PDB import alphafold_db
# Get all predictions for a UniProt accession
predictions = list(alphafold_db.get_predictions("P00520"))
# Download structure file (mmCIF format)
for prediction in predictions:
cif_file = alphafold_db.download_cif_for(prediction, directory="./structures")
print(f"Downloaded: {cif_file}")
# Get Structure objects directly
from Bio.PDB import MMCIFParser
structures = list(alphafold_db.get_structural_models_for("P00520"))
Direct API Access:
Query predictions using REST endpoints:
import requests
# Get prediction metadata for a UniProt accession
uniprot_id = "P00520"
api_url = f"https://alphafold.ebi.ac.uk/api/prediction/{uniprot_id}"
response = requests.get(api_url)
prediction_data = response.json()
# Extract AlphaFold ID
alphafold_id = prediction_data[0]['entryId']
print(f"AlphaFold ID: {alphafold_id}")
Using UniProt to Find Accessions:
Search UniProt to find protein accessions first:
import urllib.parse, urllib.request
def get_uniprot_ids(query, query_type='PDB_ID'):
"""Query UniProt to get accession IDs"""
url = 'https://www.uniprot.org/uploadlists/'
params = {
'from': query_type,
'to': 'ACC',
'format': 'txt',
'query': query
}
data = urllib.parse.urlencode(params).encode('ascii')
with urllib.request.urlopen(urllib.request.Request(url, data)) as response:
return response.read().decode('utf-8').splitlines()
# Example: Find UniProt IDs for a protein name
protein_ids = get_uniprot_ids("hemoglobin", query_type="GENE_NAME")
AlphaFold provides multiple file formats for each prediction:
File Types Available:
model_v4.cif): Atomic coordinates in mmCIF/PDBx formatconfidence_v4.json): Per-residue pLDDT scores (0-100)predicted_aligned_error_v4.json): PAE matrix for residue pair confidenceDownload URLs:
import requests
alphafold_id = "AF-P00520-F1"
version = "v4"
# Model coordinates (mmCIF)
model_url = f"https://alphafold.ebi.ac.uk/files/{alphafold_id}-model_{version}.cif"
response = requests.get(model_url)
with open(f"{alphafold_id}.cif", "w") as f:
f.write(response.text)
# Confidence scores (JSON)
confidence_url = f"https://alphafold.ebi.ac.uk/files/{alphafold_id}-confidence_{version}.json"
response = requests.get(confidence_url)
confidence_data = response.json()
# Predicted Aligned Error (JSON)
pae_url = f"https://alphafold.ebi.ac.uk/files/{alphafold_id}-predicted_aligned_error_{version}.json"
response = requests.get(pae_url)
pae_data = response.json()
PDB Format (Alternative):
# Download as PDB format instead of mmCIF
pdb_url = f"https://alphafold.ebi.ac.uk/files/{alphafold_id}-model_{version}.pdb"
response = requests.get(pdb_url)
with open(f"{alphafold_id}.pdb", "wb") as f:
f.write(response.content)
AlphaFold predictions include confidence estimates critical for interpretation:
pLDDT (per-residue confidence):
import json
import requests
# Load confidence scores
alphafold_id = "AF-P00520-F1"
confidence_url = f"https://alphafold.ebi.ac.uk/files/{alphafold_id}-confidence_v4.json"
confidence = requests.get(confidence_url).json()
# Extract pLDDT scores
plddt_scores = confidence['confidenceScore']
# Interpret confidence levels
# pLDDT > 90: Very high confidence
# pLDDT 70-90: High confidence
# pLDDT 50-70: Low confidence
# pLDDT < 50: Very low confidence
high_confidence_residues = [i for i, score in enumerate(plddt_scores) if score > 90]
print(f"High confidence residues: {len(high_confidence_residues)}/{len(plddt_scores)}")
PAE (Predicted Aligned Error):
PAE indicates confidence in relative domain positions:
import numpy as np
import matplotlib.pyplot as plt
# Load PAE matrix
pae_url = f"https://alphafold.ebi.ac.uk/files/{alphafold_id}-predicted_aligned_error_v4.json"
pae = requests.get(pae_url).json()
# Visualize PAE matrix
pae_matrix = np.array(pae['distance'])
plt.figure(figsize=(10, 8))
plt.imshow(pae_matrix, cmap='viridis_r', vmin=0, vmax=30)
plt.colorbar(label='PAE (Å)')
plt.title(f'Predicted Aligned Error: {alphafold_id}')
plt.xlabel('Residue')
plt.ylabel('Residue')
plt.savefig(f'{alphafold_id}_pae.png', dpi=300, bbox_inches='tight')
# Low PAE values (<5 Å) indicate confident relative positioning
# High PAE values (>15 Å) suggest uncertain domain arrangements
For large-scale analyses, use Google Cloud datasets:
Google Cloud Storage:
# Install gsutil
uv pip install gsutil
# List available data
gsutil ls gs://public-datasets-deepmind-alphafold-v4/
# Download entire proteomes (by taxonomy ID)
gsutil -m cp gs://public-datasets-deepmind-alphafold-v4/proteomes/proteome-tax_id-9606-*.tar .
# Download specific files
gsutil cp gs://public-datasets-deepmind-alphafold-v4/accession_ids.csv .
BigQuery Metadata Access:
from google.cloud import bigquery
# Initialize client
client = bigquery.Client()
# Query metadata
query = """
SELECT
entryId,
uniprotAccession,
organismScientificName,
globalMetricValue,
fractionPlddtVeryHigh
FROM `bigquery-public-data.deepmind_alphafold.metadata`
WHERE organismScientificName = 'Homo sapiens'
AND fractionPlddtVeryHigh > 0.8
LIMIT 100
"""
results = client.query(query).to_dataframe()
print(f"Found {len(results)} high-confidence human proteins")
Download by Species:
⚠️ Security Note: The example below uses
shell=Truefor simplicity. In production environments, prefer usingsubprocess.run()with a list of arguments to prevent command injection vulnerabilities. See Python subprocess security.
import subprocess
import shlex
def download_proteome(taxonomy_id, output_dir="./proteomes"):
"""Download all AlphaFold predictions for a species"""
# Validate taxonomy_id is an integer to prevent injection
if not isinstance(taxonomy_id, int):
raise ValueError("taxonomy_id must be an integer")
pattern = f"gs://public-datasets-deepmind-alphafold-v4/proteomes/proteome-tax_id-{taxonomy_id}-*_v4.tar"
# Use list form instead of shell=True for security
subprocess.run(["gsutil", "-m", "cp", pattern, f"{output_dir}/"], check=True)
# Download E. coli proteome (tax ID: 83333)
download_proteome(83333)
# Download human proteome (tax ID: 9606)
download_proteome(9606)
Work with downloaded AlphaFold structures using BioPython:
from Bio.PDB import MMCIFParser, PDBIO
import numpy as np
# Parse mmCIF file
parser = MMCIFParser(QUIET=True)
structure = parser.get_structure("protein", "AF-P00520-F1-model_v4.cif")
# Extract coordinates
coords = []
for model in structure:
for chain in model:
for residue in chain:
if 'CA' in residue: # Alpha carbons only
coords.append(residue['CA'].get_coord())
coords = np.array(coords)
print(f"Structure has {len(coords)} residues")
# Calculate distances
from scipy.spatial.distance import pdist, squareform
distance_matrix = squareform(pdist(coords))
# Identify contacts (< 8 Å)
contacts = np.where((distance_matrix > 0) & (distance_matrix < 8))
print(f"Number of contacts: {len(contacts[0]) // 2}")
Extract B-factors (pLDDT values):
AlphaFold stores pLDDT scores in the B-factor column:
from Bio.PDB import MMCIFParser
parser = MMCIFParser(QUIET=True)
structure = parser.get_structure("protein", "AF-P00520-F1-model_v4.cif")
# Extract pLDDT from B-factors
plddt_scores = []
for model in structure:
for chain in model:
for residue in chain:
if 'CA' in residue:
plddt_scores.append(residue['CA'].get_bfactor())
# Identify high-confidence regions
high_conf_regions = [(i, score) for i, score in enumerate(plddt_scores, 1) if score > 90]
print(f"High confidence residues: {len(high_conf_regions)}")
Process multiple predictions efficiently:
from Bio.PDB import alphafold_db
import pandas as pd
uniprot_ids = ["P00520", "P12931", "P04637"] # Multiple proteins
results = []
for uniprot_id in uniprot_ids:
try:
# Get prediction
predictions = list(alphafold_db.get_predictions(uniprot_id))
if predictions:
pred = predictions[0]
# Download structure
cif_file = alphafold_db.download_cif_for(pred, directory="./batch_structures")
# Get confidence data
alphafold_id = pred['entryId']
conf_url = f"https://alphafold.ebi.ac.uk/files/{alphafold_id}-confidence_v4.json"
conf_data = requests.get(conf_url).json()
# Calculate statistics
plddt_scores = conf_data['confidenceScore']
avg_plddt = np.mean(plddt_scores)
high_conf_fraction = sum(1 for s in plddt_scores if s > 90) / len(plddt_scores)
results.append({
'uniprot_id': uniprot_id,
'alphafold_id': alphafold_id,
'avg_plddt': avg_plddt,
'high_conf_fraction': high_conf_fraction,
'length': len(plddt_scores)
})
except Exception as e:
print(f"Error processing {uniprot_id}: {e}")
# Create summary DataFrame
df = pd.DataFrame(results)
print(df)
# Install Biopython for structure access
uv pip install biopython
# Install requests for API access
uv pip install requests
# For visualization and analysis
uv pip install numpy matplotlib pandas scipy
# For Google Cloud access (optional)
uv pip install google-cloud-bigquery gsutil
AlphaFold can also be accessed via the 3D-Beacons federated API:
import requests
# Query via 3D-Beacons
uniprot_id = "P00520"
url = f"https://www.ebi.ac.uk/pdbe/pdbe-kb/3dbeacons/api/uniprot/summary/{uniprot_id}.json"
response = requests.get(url)
data = response.json()
# Filter for AlphaFold structures
af_structures = [s for s in data['structures'] if s['provider'] == 'AlphaFold DB']
UniProt Accession: Primary identifier for proteins (e.g., "P00520"). Required for querying AlphaFold DB.
AlphaFold ID: Internal identifier format: AF-[UniProt accession]-F[fragment number] (e.g., "AF-P00520-F1").
pLDDT (predicted Local Distance Difference Test): Per-residue confidence metric (0-100). Higher values indicate more confident predictions.
PAE (Predicted Aligned Error): Matrix indicating confidence in relative positions between residue pairs. Low values (<5 Å) suggest confident relative positioning.
Database Version: Current version is v4. File URLs include version suffix (e.g., model_v4.cif).
Fragment Number: Large proteins may be split into fragments. Fragment number appears in AlphaFold ID (e.g., F1, F2).
pLDDT Thresholds:
PAE Guidelines:
Comprehensive API documentation covering:
Consult this reference for detailed API information, bulk download strategies, or when working with large-scale datasets.
_v4.cif)development
Spectral similarity and compound identification for metabolomics. Use for comparing mass spectra, computing similarity scores (cosine, modified cosine), and identifying unknown compounds from spectral libraries. Best for metabolite identification, spectral matching, library searching. For full LC-MS/MS proteomics pipelines use pyopenms.
development
Convert files and office documents to Markdown. Supports PDF, DOCX, PPTX, XLSX, images (with OCR), audio (with transcription), HTML, CSV, JSON, XML, ZIP, YouTube URLs, EPubs and more.
development
Generate comprehensive market research reports (50+ pages) in the style of top consulting firms (McKinsey, BCG, Gartner). Features professional LaTeX formatting, extensive visual generation with scientific-schematics and generate-image, deep integration with research-lookup for data gathering, and multi-framework strategic analysis including Porter Five Forces, PESTLE, SWOT, TAM/SAM/SOM, and BCG Matrix.
testing
Comprehensive markdown and Mermaid diagram writing skill. Use when creating any scientific document, report, analysis, or visualization. Establishes text-based diagrams as the default documentation standard with full style guides (markdown + mermaid), 24 diagram type references, and 9 document templates.