AlphaFold Database Access

Overview

AlphaFold DB is a public repository of AI-predicted 3D protein structures for over 200 million proteins, maintained by DeepMind and EMBL-EBI. Access predictions via BioPython or REST API, download coordinate files in multiple formats, analyze confidence metrics, and retrieve bulk proteome datasets via Google Cloud.

When to Use

Retrieving AI-predicted protein structures by UniProt accession
Downloading PDB/mmCIF coordinate files for structural analysis or docking
Analyzing prediction confidence (pLDDT per-residue, PAE domain-level)
Bulk-downloading entire proteome predictions via Google Cloud
Comparing predicted structures with experimental PDB structures
Building structural models for proteins lacking experimental data
Identifying high-confidence binding sites for drug discovery
For experimental structures only → use PDB directly
For running AlphaFold predictions → use ColabFold or local AlphaFold

Prerequisites

# Core (BioPython for structure access)
pip install biopython requests numpy matplotlib

# Optional: Google Cloud for bulk access
pip install google-cloud-bigquery google-cloud-storage

Quick Start

from Bio.PDB import alphafold_db, MMCIFParser
import requests, numpy as np

# 1. Get prediction for a protein
uniprot_id = "P00520"  # ABL1 kinase
predictions = list(alphafold_db.get_predictions(uniprot_id))
af_id = predictions[0]['entryId']  # AF-P00520-F1

# 2. Download structure
cif_file = alphafold_db.download_cif_for(predictions[0], directory="./structures")

# 3. Check confidence
conf = requests.get(f"https://alphafold.ebi.ac.uk/files/{af_id}-confidence_v4.json").json()
scores = conf['confidenceScore']
print(f"Mean pLDDT: {np.mean(scores):.1f}, High-conf residues: {sum(1 for s in scores if s > 90)}/{len(scores)}")

Core API

1. Prediction Retrieval

BioPython (recommended for single proteins):

from Bio.PDB import alphafold_db

# Get prediction metadata
predictions = list(alphafold_db.get_predictions("P00520"))
pred = predictions[0]
print(f"AlphaFold ID: {pred['entryId']}")
print(f"Gene: {pred['gene']}, Species: {pred['organismScientificName']}")

# Get Structure objects directly
structures = list(alphafold_db.get_structural_models_for("P00520"))

REST API (for metadata or integration):

import requests

uniprot_id = "P00520"
url = f"https://alphafold.ebi.ac.uk/api/prediction/{uniprot_id}"
response = requests.get(url)
data = response.json()

# Response includes download URLs for all file types
pred = data[0]
print(f"CIF: {pred['cifUrl']}")
print(f"PDB: {pred['pdbUrl']}")
print(f"PAE: {pred['paeDocUrl']}")

3D-Beacons federated API (query multiple structure providers):

url = f"https://www.ebi.ac.uk/pdbe/pdbe-kb/3dbeacons/api/uniprot/summary/{uniprot_id}.json"
data = requests.get(url).json()
af_structures = [s for s in data['structures'] if s['provider'] == 'AlphaFold DB']

2. Structure File Download

import requests

af_id = "AF-P00520-F1"
version = "v4"
base = "https://alphafold.ebi.ac.uk/files"

# mmCIF (recommended — full metadata, supports large structures)
cif = requests.get(f"{base}/{af_id}-model_{version}.cif")
with open(f"{af_id}.cif", "w") as f:
    f.write(cif.text)

# PDB format (legacy — limited to 99,999 atoms)
pdb = requests.get(f"{base}/{af_id}-model_{version}.pdb")
with open(f"{af_id}.pdb", "wb") as f:
    f.write(pdb.content)

# Confidence JSON (per-residue pLDDT scores)
conf = requests.get(f"{base}/{af_id}-confidence_{version}.json").json()

# PAE matrix JSON (inter-residue confidence)
pae = requests.get(f"{base}/{af_id}-predicted_aligned_error_{version}.json").json()

3. Confidence Metrics Analysis

pLDDT (per-residue confidence, 0–100):

import numpy as np

conf_url = f"https://alphafold.ebi.ac.uk/files/{af_id}-confidence_v4.json"
conf = requests.get(conf_url).json()
scores = conf['confidenceScore']

# Classify residues by confidence
very_high = sum(1 for s in scores if s > 90)
high = sum(1 for s in scores if 70 < s <= 90)
low = sum(1 for s in scores if 50 < s <= 70)
very_low = sum(1 for s in scores if s <= 50)
print(f"Very high (>90): {very_high}, High (70-90): {high}, Low (50-70): {low}, Very low (<50): {very_low}")

PAE (Predicted Aligned Error) visualization:

import matplotlib.pyplot as plt

pae_url = f"https://alphafold.ebi.ac.uk/files/{af_id}-predicted_aligned_error_v4.json"
pae = requests.get(pae_url).json()
pae_matrix = np.array(pae['distance'])

plt.figure(figsize=(10, 8))
plt.imshow(pae_matrix, cmap='viridis_r', vmin=0, vmax=30)
plt.colorbar(label='PAE (Å)')
plt.title(f'Predicted Aligned Error: {af_id}')
plt.xlabel('Residue')
plt.ylabel('Residue')
plt.savefig(f'{af_id}_pae.png', dpi=300, bbox_inches='tight')
# Low PAE (<5 Å) = confident relative positioning; >15 Å = uncertain domain arrangement

4. Bulk Data Access (Google Cloud)

# List available data
gsutil ls gs://public-datasets-deepmind-alphafold-v4/

# Download entire proteome by taxonomy ID
gsutil -m cp gs://public-datasets-deepmind-alphafold-v4/proteomes/proteome-tax_id-9606-*_v4.tar .

# Download accession index
gsutil cp gs://public-datasets-deepmind-alphafold-v4/accession_ids.csv .

BigQuery metadata queries:

from google.cloud import bigquery

client = bigquery.Client()
query = """
SELECT entryId, uniprotAccession, gene, organismScientificName,
       globalMetricValue, fractionPlddtVeryHigh
FROM `bigquery-public-data.deepmind_alphafold.metadata`
WHERE organismScientificName = 'Homo sapiens'
  AND fractionPlddtVeryHigh > 0.8
  AND isReviewed = TRUE
LIMIT 100
"""
df = client.query(query).to_dataframe()
print(f"Found {len(df)} high-confidence human proteins")

5. Structure Parsing & Analysis

from Bio.PDB import MMCIFParser
import numpy as np
from scipy.spatial.distance import pdist, squareform

parser = MMCIFParser(QUIET=True)
structure = parser.get_structure("protein", f"{af_id}-model_v4.cif")

# Extract alpha-carbon coordinates
coords, plddt_scores = [], []
for model in structure:
    for chain in model:
        for residue in chain:
            if 'CA' in residue:
                coords.append(residue['CA'].get_coord())
                plddt_scores.append(residue['CA'].get_bfactor())  # pLDDT stored as B-factor

coords = np.array(coords)
print(f"Residues: {len(coords)}, Mean pLDDT: {np.mean(plddt_scores):.1f}")

# Contact map (Cα-Cα < 8 Å)
dist_matrix = squareform(pdist(coords))
contacts = np.where((dist_matrix > 0) & (dist_matrix < 8))
print(f"Contacts: {len(contacts[0]) // 2}")

Key Concepts

Confidence Interpretation

| Metric | Range | Interpretation | Suitable For | |--------|-------|---------------|--------------| | pLDDT >90 | Very high | Backbone + side-chain reliable | Detailed analysis, docking | | pLDDT 70–90 | High | Backbone generally reliable | Fold analysis, domain ID | | pLDDT 50–70 | Low | Use with caution | May be flexible/disordered | | pLDDT <50 | Very low | Likely disordered | Exclude from analysis | | PAE <5 Å | Confident | Reliable relative domain positions | Multi-domain assembly | | PAE 5–10 Å | Moderate | Uncertain arrangement | Treat domains independently | | PAE >15 Å | Uncertain | Domains may be mobile | Do not trust orientation |

AlphaFold ID Format

Format: AF-{UniProt_accession}-F{fragment_number} (e.g., AF-P00520-F1). Large proteins may be split into fragments (F1, F2, ...). Current database version: v4 — include version suffix in all file URLs.

File Types

| File | URL Suffix | Format | Use | |------|-----------|--------|-----| | Model coordinates | -model_v4.cif | mmCIF | Structural analysis (recommended) | | Model coordinates | -model_v4.pdb | PDB | Legacy tools (<99,999 atoms) | | Model coordinates | -model_v4.bcif | Binary CIF | Compressed (~70% smaller) | | Confidence | -confidence_v4.json | JSON | Per-residue pLDDT array | | Aligned error | -predicted_aligned_error_v4.json | JSON | N×N PAE matrix | | PAE image | -predicted_aligned_error_v4.png | PNG | Quick visual assessment |

Common Workflows

Workflow 1: Single Protein Structure Analysis

from Bio.PDB import alphafold_db, MMCIFParser
import requests, numpy as np

uniprot_id = "P04637"  # p53 tumor suppressor

# Retrieve and download
predictions = list(alphafold_db.get_predictions(uniprot_id))
cif_file = alphafold_db.download_cif_for(predictions[0], directory="./structures")
af_id = predictions[0]['entryId']

# Parse structure
parser = MMCIFParser(QUIET=True)
structure = parser.get_structure("p53", cif_file)

# Extract pLDDT from B-factors
plddt = [r['CA'].get_bfactor() for m in structure for c in m for r in c if 'CA' in r]
print(f"Length: {len(plddt)}, Mean pLDDT: {np.mean(plddt):.1f}")

# Identify high-confidence regions for docking
high_conf_regions = [(i+1, s) for i, s in enumerate(plddt) if s > 90]
print(f"High-confidence residues: {len(high_conf_regions)}/{len(plddt)}")

Workflow 2: Batch Protein Processing

from Bio.PDB import alphafold_db
import requests, numpy as np, pandas as pd, time

uniprot_ids = ["P00520", "P12931", "P04637", "P38398"]
results = []

for uid in uniprot_ids:
    try:
        preds = list(alphafold_db.get_predictions(uid))
        if not preds:
            continue
        af_id = preds[0]['entryId']
        conf = requests.get(f"https://alphafold.ebi.ac.uk/files/{af_id}-confidence_v4.json").json()
        scores = conf['confidenceScore']
        results.append({
            'uniprot': uid, 'alphafold_id': af_id,
            'length': len(scores), 'mean_plddt': np.mean(scores),
            'frac_high_conf': sum(1 for s in scores if s > 90) / len(scores)
        })
        time.sleep(0.2)  # Rate limit: 100-200ms between requests
    except Exception as e:
        print(f"Error {uid}: {e}")

df = pd.DataFrame(results)
print(df.to_string(index=False))

Key Parameters

| Parameter | Module | Default | Range | Effect | |-----------|--------|---------|-------|--------| | uniprot_id | All | — | UniProt accession | Primary query identifier | | version | Download | v4 | v1–v4 | Database version (always use latest) | | directory | BioPython | "." | Path | Download destination | | QUIET | MMCIFParser | False | bool | Suppress parser warnings | | vmin/vmax | PAE plot | 0/30 | Å | PAE colormap range | | taxonomy_id | GCS bulk | — | NCBI tax ID | Species for proteome download | | fractionPlddtVeryHigh | BigQuery | — | 0.0–1.0 | Filter by high-confidence fraction | | Concurrent requests | API | — | ≤10 | Max parallel API requests | | Request delay | API | — | 100–200ms | Delay between sequential requests |

Best Practices

Use BioPython for single proteins, Google Cloud for bulk — individual API downloads are slow for >100 proteins; GCS parallel download is orders of magnitude faster
Always check pLDDT before downstream analysis — low-confidence regions (pLDDT <50) are likely disordered and should be excluded from docking, contact analysis, or binding site prediction
Anti-pattern — trusting all regions equally: AlphaFold predictions lack ligands, PTMs, cofactors, and multi-chain context. High pLDDT does not guarantee functional accuracy
Cache downloaded files locally — avoid re-downloading the same structures; AlphaFold files are static per version
Use PAE for multi-domain proteins — pLDDT tells you per-residue confidence, but PAE reveals whether domain orientations are reliable. Low inter-domain PAE (<5 Å) = trust the arrangement; high PAE (>15 Å) = treat domains independently
Pin database version in reproducible analyses — include _v4 in URLs and document which version was used

Common Recipes

Recipe 1: Proteome Download by Species

import subprocess

def download_proteome(taxonomy_id: int, output_dir: str = "./proteomes"):
    """Download all AlphaFold predictions for a species via GCS."""
    if not isinstance(taxonomy_id, int):
        raise ValueError("taxonomy_id must be an integer")
    pattern = f"gs://public-datasets-deepmind-alphafold-v4/proteomes/proteome-tax_id-{taxonomy_id}-*_v4.tar"
    subprocess.run(["gsutil", "-m", "cp", pattern, f"{output_dir}/"], check=True)

# Human (9606), E. coli (83333), Mouse (10090)
download_proteome(9606)

Recipe 2: High-Confidence Region Extraction

def extract_high_conf_residues(plddt_scores, threshold=90):
    """Extract contiguous high-confidence regions."""
    regions, start = [], None
    for i, score in enumerate(plddt_scores):
        if score > threshold and start is None:
            start = i
        elif score <= threshold and start is not None:
            regions.append((start + 1, i, i - start))  # 1-indexed
            start = None
    if start is not None:
        regions.append((start + 1, len(plddt_scores), len(plddt_scores) - start))
    return regions

# Usage: regions = extract_high_conf_residues(plddt_scores)
# Returns: [(start_res, end_res, length), ...]

Recipe 3: PAE-Based Domain Segmentation

import numpy as np

def segment_domains(pae_matrix, threshold=10.0):
    """Simple domain segmentation from PAE matrix."""
    n = pae_matrix.shape[0]
    # Average PAE for each residue pair -> symmetric
    sym_pae = (pae_matrix + pae_matrix.T) / 2
    # Cluster: residues with low mutual PAE are in the same domain
    domains, current_domain = [0] * n, 0
    for i in range(1, n):
        if sym_pae[i-1, i] > threshold:
            current_domain += 1
        domains[i] = current_domain
    return domains

Troubleshooting

| Problem | Cause | Solution | |---------|-------|---------| | 404 Not Found from API | No AlphaFold prediction for this UniProt ID | Check if protein is in AlphaFold DB; some organisms not covered | | 429 Too Many Requests | Exceeded rate limit | Add time.sleep(0.2) between requests; use GCS for bulk | | Empty predictions list | UniProt ID not in database | Verify ID at alphafold.ebi.ac.uk; try canonical isoform | | Large protein split into fragments | Protein >2700 residues | Check all fragments (F1, F2, ...); stitch manually if needed | | pLDDT values all low (<50) | Intrinsically disordered protein | Expected behavior; structure prediction unreliable for IDPs | | PAE matrix asymmetric | PAE[i][j] ≠ PAE[j][i] by design | PAE measures error when aligned on residue i; symmetrize for clustering | | ModuleNotFoundError: Bio.PDB.alphafold_db | BioPython version too old | Upgrade: pip install --upgrade biopython>=1.80 | | GCS download fails | gsutil not configured | Run gcloud auth login or use anonymous access for public data | | BigQuery quota exceeded | Free tier limit (1 TB/month) | Optimize queries with LIMIT; use GCS for bulk file access instead |

Bundled Resources

references/api_schemas_reference.md

Detailed API data schemas and lookup tables: REST API response fields, mmCIF data categories, confidence JSON schema, PAE JSON schema, BigQuery metadata table fields, HTTP error codes, rate limiting guidelines, and version history (v1–v4). Consult for field-level details when parsing API responses or building custom queries. Scripts functionality (none in original). Original api_reference.md content partially relocated to Core API (endpoints, common code patterns) and Key Concepts (confidence thresholds, file types table); schemas, field catalogs, and error codes retained in this reference.

Related Skills

autodock-vina — molecular docking using AlphaFold structures as receptor
biopython — general protein structure parsing and analysis beyond AlphaFold

References

AlphaFold DB: https://alphafold.ebi.ac.uk/
API Documentation: https://alphafold.ebi.ac.uk/api-docs
Jumper et al. (2021) Nature 596, 583–589: https://doi.org/10.1038/s41586-021-03819-2
Varadi et al. (2024) Nucleic Acids Res. 52, D368–D375: https://doi.org/10.1093/nar/gkad1011
BioPython AlphaFold module: https://biopython.org/docs/dev/api/Bio.PDB.alphafold_db.html
Google Cloud AlphaFold: https://console.cloud.google.com/marketplace/product/bigquery-public-data/deepmind-alphafold

AlphaFold Database Access

Overview

When to Use

Retrieving AI-predicted protein structures by UniProt accession
Downloading PDB/mmCIF coordinate files for structural analysis or docking
Analyzing prediction confidence (pLDDT per-residue, PAE domain-level)
Bulk-downloading entire proteome predictions via Google Cloud
Comparing predicted structures with experimental PDB structures
Building structural models for proteins lacking experimental data
Identifying high-confidence binding sites for drug discovery
For experimental structures only → use PDB directly
For running AlphaFold predictions → use ColabFold or local AlphaFold

Prerequisites

# Core (BioPython for structure access)
pip install biopython requests numpy matplotlib

# Optional: Google Cloud for bulk access
pip install google-cloud-bigquery google-cloud-storage

Quick Start

from Bio.PDB import alphafold_db, MMCIFParser
import requests, numpy as np

# 1. Get prediction for a protein
uniprot_id = "P00520"  # ABL1 kinase
predictions = list(alphafold_db.get_predictions(uniprot_id))
af_id = predictions[0]['entryId']  # AF-P00520-F1

# 2. Download structure
cif_file = alphafold_db.download_cif_for(predictions[0], directory="./structures")

# 3. Check confidence
conf = requests.get(f"https://alphafold.ebi.ac.uk/files/{af_id}-confidence_v4.json").json()
scores = conf['confidenceScore']
print(f"Mean pLDDT: {np.mean(scores):.1f}, High-conf residues: {sum(1 for s in scores if s > 90)}/{len(scores)}")

Core API

1. Prediction Retrieval

BioPython (recommended for single proteins):

from Bio.PDB import alphafold_db

# Get prediction metadata
predictions = list(alphafold_db.get_predictions("P00520"))
pred = predictions[0]
print(f"AlphaFold ID: {pred['entryId']}")
print(f"Gene: {pred['gene']}, Species: {pred['organismScientificName']}")

# Get Structure objects directly
structures = list(alphafold_db.get_structural_models_for("P00520"))

REST API (for metadata or integration):

import requests

uniprot_id = "P00520"
url = f"https://alphafold.ebi.ac.uk/api/prediction/{uniprot_id}"
response = requests.get(url)
data = response.json()

# Response includes download URLs for all file types
pred = data[0]
print(f"CIF: {pred['cifUrl']}")
print(f"PDB: {pred['pdbUrl']}")
print(f"PAE: {pred['paeDocUrl']}")

3D-Beacons federated API (query multiple structure providers):

url = f"https://www.ebi.ac.uk/pdbe/pdbe-kb/3dbeacons/api/uniprot/summary/{uniprot_id}.json"
data = requests.get(url).json()
af_structures = [s for s in data['structures'] if s['provider'] == 'AlphaFold DB']

2. Structure File Download

import requests

af_id = "AF-P00520-F1"
version = "v4"
base = "https://alphafold.ebi.ac.uk/files"

# mmCIF (recommended — full metadata, supports large structures)
cif = requests.get(f"{base}/{af_id}-model_{version}.cif")
with open(f"{af_id}.cif", "w") as f:
    f.write(cif.text)

# PDB format (legacy — limited to 99,999 atoms)
pdb = requests.get(f"{base}/{af_id}-model_{version}.pdb")
with open(f"{af_id}.pdb", "wb") as f:
    f.write(pdb.content)

# Confidence JSON (per-residue pLDDT scores)
conf = requests.get(f"{base}/{af_id}-confidence_{version}.json").json()

# PAE matrix JSON (inter-residue confidence)
pae = requests.get(f"{base}/{af_id}-predicted_aligned_error_{version}.json").json()

3. Confidence Metrics Analysis

pLDDT (per-residue confidence, 0–100):

import numpy as np

conf_url = f"https://alphafold.ebi.ac.uk/files/{af_id}-confidence_v4.json"
conf = requests.get(conf_url).json()
scores = conf['confidenceScore']

# Classify residues by confidence
very_high = sum(1 for s in scores if s > 90)
high = sum(1 for s in scores if 70 < s <= 90)
low = sum(1 for s in scores if 50 < s <= 70)
very_low = sum(1 for s in scores if s <= 50)
print(f"Very high (>90): {very_high}, High (70-90): {high}, Low (50-70): {low}, Very low (<50): {very_low}")

PAE (Predicted Aligned Error) visualization:

import matplotlib.pyplot as plt

pae_url = f"https://alphafold.ebi.ac.uk/files/{af_id}-predicted_aligned_error_v4.json"
pae = requests.get(pae_url).json()
pae_matrix = np.array(pae['distance'])

plt.figure(figsize=(10, 8))
plt.imshow(pae_matrix, cmap='viridis_r', vmin=0, vmax=30)
plt.colorbar(label='PAE (Å)')
plt.title(f'Predicted Aligned Error: {af_id}')
plt.xlabel('Residue')
plt.ylabel('Residue')
plt.savefig(f'{af_id}_pae.png', dpi=300, bbox_inches='tight')
# Low PAE (<5 Å) = confident relative positioning; >15 Å = uncertain domain arrangement

4. Bulk Data Access (Google Cloud)

# List available data
gsutil ls gs://public-datasets-deepmind-alphafold-v4/

# Download entire proteome by taxonomy ID
gsutil -m cp gs://public-datasets-deepmind-alphafold-v4/proteomes/proteome-tax_id-9606-*_v4.tar .

# Download accession index
gsutil cp gs://public-datasets-deepmind-alphafold-v4/accession_ids.csv .

BigQuery metadata queries:

from google.cloud import bigquery

client = bigquery.Client()
query = """
SELECT entryId, uniprotAccession, gene, organismScientificName,
       globalMetricValue, fractionPlddtVeryHigh
FROM `bigquery-public-data.deepmind_alphafold.metadata`
WHERE organismScientificName = 'Homo sapiens'
  AND fractionPlddtVeryHigh > 0.8
  AND isReviewed = TRUE
LIMIT 100
"""
df = client.query(query).to_dataframe()
print(f"Found {len(df)} high-confidence human proteins")

5. Structure Parsing & Analysis

from Bio.PDB import MMCIFParser
import numpy as np
from scipy.spatial.distance import pdist, squareform

parser = MMCIFParser(QUIET=True)
structure = parser.get_structure("protein", f"{af_id}-model_v4.cif")

# Extract alpha-carbon coordinates
coords, plddt_scores = [], []
for model in structure:
    for chain in model:
        for residue in chain:
            if 'CA' in residue:
                coords.append(residue['CA'].get_coord())
                plddt_scores.append(residue['CA'].get_bfactor())  # pLDDT stored as B-factor

coords = np.array(coords)
print(f"Residues: {len(coords)}, Mean pLDDT: {np.mean(plddt_scores):.1f}")

# Contact map (Cα-Cα < 8 Å)
dist_matrix = squareform(pdist(coords))
contacts = np.where((dist_matrix > 0) & (dist_matrix < 8))
print(f"Contacts: {len(contacts[0]) // 2}")

Key Concepts

Confidence Interpretation

AlphaFold ID Format

File Types

Common Workflows

Workflow 1: Single Protein Structure Analysis

from Bio.PDB import alphafold_db, MMCIFParser
import requests, numpy as np

uniprot_id = "P04637"  # p53 tumor suppressor

# Retrieve and download
predictions = list(alphafold_db.get_predictions(uniprot_id))
cif_file = alphafold_db.download_cif_for(predictions[0], directory="./structures")
af_id = predictions[0]['entryId']

# Parse structure
parser = MMCIFParser(QUIET=True)
structure = parser.get_structure("p53", cif_file)

# Extract pLDDT from B-factors
plddt = [r['CA'].get_bfactor() for m in structure for c in m for r in c if 'CA' in r]
print(f"Length: {len(plddt)}, Mean pLDDT: {np.mean(plddt):.1f}")

# Identify high-confidence regions for docking
high_conf_regions = [(i+1, s) for i, s in enumerate(plddt) if s > 90]
print(f"High-confidence residues: {len(high_conf_regions)}/{len(plddt)}")

Workflow 2: Batch Protein Processing

from Bio.PDB import alphafold_db
import requests, numpy as np, pandas as pd, time

uniprot_ids = ["P00520", "P12931", "P04637", "P38398"]
results = []

for uid in uniprot_ids:
    try:
        preds = list(alphafold_db.get_predictions(uid))
        if not preds:
            continue
        af_id = preds[0]['entryId']
        conf = requests.get(f"https://alphafold.ebi.ac.uk/files/{af_id}-confidence_v4.json").json()
        scores = conf['confidenceScore']
        results.append({
            'uniprot': uid, 'alphafold_id': af_id,
            'length': len(scores), 'mean_plddt': np.mean(scores),
            'frac_high_conf': sum(1 for s in scores if s > 90) / len(scores)
        })
        time.sleep(0.2)  # Rate limit: 100-200ms between requests
    except Exception as e:
        print(f"Error {uid}: {e}")

df = pd.DataFrame(results)
print(df.to_string(index=False))

Key Parameters

Best Practices

Use BioPython for single proteins, Google Cloud for bulk — individual API downloads are slow for >100 proteins; GCS parallel download is orders of magnitude faster
Always check pLDDT before downstream analysis — low-confidence regions (pLDDT <50) are likely disordered and should be excluded from docking, contact analysis, or binding site prediction
Anti-pattern — trusting all regions equally: AlphaFold predictions lack ligands, PTMs, cofactors, and multi-chain context. High pLDDT does not guarantee functional accuracy
Cache downloaded files locally — avoid re-downloading the same structures; AlphaFold files are static per version
Use PAE for multi-domain proteins — pLDDT tells you per-residue confidence, but PAE reveals whether domain orientations are reliable. Low inter-domain PAE (<5 Å) = trust the arrangement; high PAE (>15 Å) = treat domains independently
Pin database version in reproducible analyses — include _v4 in URLs and document which version was used

Common Recipes

Recipe 1: Proteome Download by Species

import subprocess

def download_proteome(taxonomy_id: int, output_dir: str = "./proteomes"):
    """Download all AlphaFold predictions for a species via GCS."""
    if not isinstance(taxonomy_id, int):
        raise ValueError("taxonomy_id must be an integer")
    pattern = f"gs://public-datasets-deepmind-alphafold-v4/proteomes/proteome-tax_id-{taxonomy_id}-*_v4.tar"
    subprocess.run(["gsutil", "-m", "cp", pattern, f"{output_dir}/"], check=True)

# Human (9606), E. coli (83333), Mouse (10090)
download_proteome(9606)

Recipe 2: High-Confidence Region Extraction

def extract_high_conf_residues(plddt_scores, threshold=90):
    """Extract contiguous high-confidence regions."""
    regions, start = [], None
    for i, score in enumerate(plddt_scores):
        if score > threshold and start is None:
            start = i
        elif score <= threshold and start is not None:
            regions.append((start + 1, i, i - start))  # 1-indexed
            start = None
    if start is not None:
        regions.append((start + 1, len(plddt_scores), len(plddt_scores) - start))
    return regions

# Usage: regions = extract_high_conf_residues(plddt_scores)
# Returns: [(start_res, end_res, length), ...]

Recipe 3: PAE-Based Domain Segmentation

import numpy as np

def segment_domains(pae_matrix, threshold=10.0):
    """Simple domain segmentation from PAE matrix."""
    n = pae_matrix.shape[0]
    # Average PAE for each residue pair -> symmetric
    sym_pae = (pae_matrix + pae_matrix.T) / 2
    # Cluster: residues with low mutual PAE are in the same domain
    domains, current_domain = [0] * n, 0
    for i in range(1, n):
        if sym_pae[i-1, i] > threshold:
            current_domain += 1
        domains[i] = current_domain
    return domains

Troubleshooting

Bundled Resources

references/api_schemas_reference.md

Related Skills

autodock-vina — molecular docking using AlphaFold structures as receptor
biopython — general protein structure parsing and analysis beyond AlphaFold

References

AlphaFold DB: https://alphafold.ebi.ac.uk/
API Documentation: https://alphafold.ebi.ac.uk/api-docs
Jumper et al. (2021) Nature 596, 583–589: https://doi.org/10.1038/s41586-021-03819-2
Varadi et al. (2024) Nucleic Acids Res. 52, D368–D375: https://doi.org/10.1093/nar/gkad1011
BioPython AlphaFold module: https://biopython.org/docs/dev/api/Bio.PDB.alphafold_db.html
Google Cloud AlphaFold: https://console.cloud.google.com/marketplace/product/bigquery-public-data/deepmind-alphafold

Adoption

jaechang-hits/alphafold-database-access

$ install --global

Security Scan Results

SKILL.md

AlphaFold Database Access

Overview

When to Use

Prerequisites

Quick Start

Core API

1. Prediction Retrieval

2. Structure File Download

3. Confidence Metrics Analysis

4. Bulk Data Access (Google Cloud)

5. Structure Parsing & Analysis

Key Concepts

Confidence Interpretation

AlphaFold ID Format

File Types

Common Workflows

Workflow 1: Single Protein Structure Analysis

Workflow 2: Batch Protein Processing

Key Parameters

Best Practices

Common Recipes

Recipe 1: Proteome Download by Species

Recipe 2: High-Confidence Region Extraction

Recipe 3: PAE-Based Domain Segmentation

Troubleshooting

Bundled Resources

references/api_schemas_reference.md

Related Skills

References

Related Skills

jaechang-hits/deseq2-differential-expression

jaechang-hits/vcf-variant-filtering

jaechang-hits/snpeff-variant-annotation

jaechang-hits/plink2-gwas-analysis

jaechang-hits/alphafold-database-access

$ install --global

Security Scan Results

SKILL.md

AlphaFold Database Access

Overview

When to Use

Prerequisites

Quick Start

Core API

1. Prediction Retrieval

2. Structure File Download

3. Confidence Metrics Analysis

4. Bulk Data Access (Google Cloud)

5. Structure Parsing & Analysis

Key Concepts

Confidence Interpretation

AlphaFold ID Format

File Types

Common Workflows

Workflow 1: Single Protein Structure Analysis

Workflow 2: Batch Protein Processing

Key Parameters

Best Practices

Common Recipes

Recipe 1: Proteome Download by Species

Recipe 2: High-Confidence Region Extraction

Recipe 3: PAE-Based Domain Segmentation

Troubleshooting

Bundled Resources

references/api_schemas_reference.md

Related Skills

References

Related Skills

jaechang-hits/deseq2-differential-expression

jaechang-hits/vcf-variant-filtering

jaechang-hits/snpeff-variant-annotation

jaechang-hits/plink2-gwas-analysis