skills/proteomics-protein-engineering/esm-protein-language-model/SKILL.md
Protein language models (ESM3, ESM C) for sequence generation, structure prediction, inverse folding, and embeddings. Design novel proteins, extract ML features, or fold sequences. Local GPU or EvolutionaryScale Forge API. Use AlphaFold for traditional folding; RDKit for small molecules.
ESM (Evolutionary Scale Modeling) provides pretrained protein language models for generative protein design and representation learning. ESM3 is a multimodal generative model conditioned on sequence, structure, and function simultaneously. ESM C is an efficient embedding model optimized for extracting protein representations for downstream ML tasks.
esm (EvolutionaryScale package)
pip install esm
# For Forge cloud API
pip install esm[forge]
from esm.models.esmc import ESMC
from esm.sdk.api import ESMProtein, LogitsConfig
# Load ESM C model for embeddings
model = ESMC.from_pretrained("esmc_600m")
# Create protein from sequence
protein = ESMProtein(sequence="MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKRQQIAATGFHIIPGDKPDNRAGGYDN")
# Tokenize the protein, then request per-residue embeddings
protein_tensor = model.encode(protein)
output = model.logits(protein_tensor, LogitsConfig(sequence=True, return_embeddings=True))
# shape: (1, num_tokens, embedding_dim); num_tokens counts residues plus start/end tokens,
# and embedding_dim is 1152 for esmc_600m
embeddings = output.embeddings
print(f"Embedding shape: {embeddings.shape}")
Generate novel protein sequences conditioned on structure, function, or partial sequence.
from esm.models.esm3 import ESM3
from esm.sdk.api import ESM3InferenceClient, ESMProtein, GenerationConfig
# Load ESM3 locally
model = ESM3.from_pretrained("esm3_sm_open_v1")
# Generate from partial sequence (fill in masked positions)
prompt = ESMProtein(sequence="MKTAYIAK____ISFVK____RQLEERLG") # ____ = positions to generate
config = GenerationConfig(track="sequence", num_steps=10, temperature=0.7)
generated = model.generate(prompt, config)
print(f"Generated sequence: {generated.sequence[:50]}...")
# Conditional generation: design sequence for a target structure
from esm.sdk.api import ESMProtein, GenerationConfig
from esm.utils.structure.protein_chain import ProteinChain
# Load target structure from PDB
chain = ProteinChain.from_pdb("target.pdb")
prompt = ESMProtein.from_protein_chain(chain)
prompt.sequence = None # Clear sequence, keep structure
config = GenerationConfig(track="sequence", num_steps=16, temperature=0.5)
designed = model.generate(prompt, config)
print(f"Designed sequence ({len(designed.sequence)} residues): {designed.sequence[:50]}...")
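The masked prompt above is written by hand. As a minimal sketch of building such a prompt programmatically, the helper below replaces chosen spans of a known sequence with `_` characters. The name `mask_spans` and the 0-based half-open span convention are assumptions made for this illustration; they are not part of the esm package.

```python
def mask_spans(sequence: str, spans: list[tuple[int, int]], mask_char: str = "_") -> str:
    """Replace each 0-based half-open [start, end) span with mask characters."""
    residues = list(sequence)
    for start, end in spans:
        if not 0 <= start < end <= len(residues):
            raise ValueError(f"span ({start}, {end}) is outside the sequence")
        # Assigning a string to a list slice writes one character per position
        residues[start:end] = mask_char * (end - start)
    return "".join(residues)


native = "MKTAYIAKQRQISFVKSHFSRQLEERLG"
# Mask two 4-residue windows; the rest of the sequence is kept as fixed context
prompt_sequence = mask_spans(native, [(8, 12), (17, 21)])
print(prompt_sequence)
```

The resulting string can be passed as `ESMProtein(sequence=prompt_sequence)` in the generation example above.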
Extract fixed-length representations for downstream ML tasks.
from esm.models.esmc import ESMC
from esm.sdk.api import ESMProtein, LogitsConfig
import torch
model = ESMC.from_pretrained("esmc_600m") # or "esmc_300m" for lighter model
sequences = [
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKRQQIAATGFHIIPGDKPDNRAGGYDN",
    "MKWVTFISLLFLFSSAYSRGVFRRDAHKSEVAHRFKDLGEENFKALVLIAFAQYLQQCPFEDHVKLVNEVTEFAKTCVADESAENCDKS",
]
embeddings = []
for seq in sequences:
    protein = ESMProtein(sequence=seq)
    # Tokenize, then request embeddings through the logits call
    protein_tensor = model.encode(protein)
    output = model.logits(protein_tensor, LogitsConfig(sequence=True, return_embeddings=True))
    # Mean-pool per-residue embeddings to get fixed-length vector
    mean_emb = output.embeddings.mean(dim=1) # shape: (1, embedding_dim)
    embeddings.append(mean_emb)
emb_matrix = torch.cat(embeddings, dim=0)
print(f"Embedding matrix: {emb_matrix.shape}") # (2, 1152)
# Compute pairwise similarity
similarity = torch.cosine_similarity(emb_matrix[0:1], emb_matrix[1:2])
print(f"Cosine similarity: {similarity.item():.4f}")
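The loop above embeds one sequence at a time. When many sequences are involved, a common preparatory step is to group them by length so that any later padding stays small. The sketch below shows only that grouping step under a token budget; `batch_by_length` and `max_tokens` are hypothetical names, and how each batch is then fed to the model depends on the API you use.

```python
from typing import Iterator


def batch_by_length(sequences: list[str], max_tokens: int = 4096) -> Iterator[list[str]]:
    """Yield batches whose padded size (batch size x longest sequence) fits max_tokens."""
    batch: list[str] = []
    longest = 0
    # Sorting by length puts similar-length sequences in the same batch
    for seq in sorted(sequences, key=len):
        new_longest = max(longest, len(seq))
        if batch and new_longest * (len(batch) + 1) > max_tokens:
            yield batch
            batch, new_longest = [], len(seq)
        batch.append(seq)
        longest = new_longest
    if batch:
        yield batch


# Placeholder sequences of different lengths, for illustration only
example = ["A" * 10, "A" * 300, "A" * 12, "A" * 290]
batches = list(batch_by_length(example, max_tokens=600))
print([[len(s) for s in b] for b in batches])
```

A sequence longer than `max_tokens` still gets a batch of its own, so the budget is a soft limit in this sketch.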
Predict 3D coordinates from amino acid sequence.
from esm.models.esm3 import ESM3
from esm.sdk.api import ESMProtein, GenerationConfig
model = ESM3.from_pretrained("esm3_sm_open_v1")
protein = ESMProtein(sequence="MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKRQQIAATGFHIIPGDKPDNRAGGYDN")
# Generate structure from sequence
config = GenerationConfig(track="structure", num_steps=16)
result = model.generate(protein, config)
# Save predicted structure
result.to_pdb("predicted.pdb")
print(f"Saved structure: {len(result.sequence)} residues → predicted.pdb")
Design amino acid sequences that fold into a target 3D structure.
from esm.models.esm3 import ESM3
from esm.sdk.api import ESMProtein, GenerationConfig
from esm.utils.structure.protein_chain import ProteinChain
model = ESM3.from_pretrained("esm3_sm_open_v1")
# Load target structure
chain = ProteinChain.from_pdb("target_structure.pdb")
prompt = ESMProtein.from_protein_chain(chain)
# Clear sequence but keep structure coordinates
prompt.sequence = None
# Generate multiple designs
designs = []
for i in range(5):
    config = GenerationConfig(track="sequence", num_steps=16, temperature=0.7)
    designed = model.generate(prompt, config)
    designs.append(designed.sequence)
    print(f"Design {i+1}: {designed.sequence[:40]}...")
print(f"Generated {len(designs)} sequence designs for target structure")
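Once several designs exist for the same target, it can help to check how different they are from one another. The following is a minimal sketch of pairwise sequence identity for equal-length designs (designs for one target structure share its length). The helper names and the short placeholder strings are assumptions for illustration, not output from the model.

```python
from itertools import combinations


def sequence_identity(a: str, b: str) -> float:
    """Fraction of positions with identical residues in two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    if not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def pairwise_identities(seqs: list[str]) -> dict[tuple[int, int], float]:
    """Identity for every unordered pair of designs, keyed by their list indices."""
    return {
        (i, j): sequence_identity(seqs[i], seqs[j])
        for i, j in combinations(range(len(seqs)), 2)
    }


# Placeholder strings standing in for generated designs
example_designs = ["MKTAYIAK", "MKTAYLAK", "MRSAYIAR"]
identities = pairwise_identities(example_designs)
for (i, j), value in identities.items():
    print(f"Design {i + 1} vs design {j + 1}: {value:.3f}")
```

Pairs with identity close to 1.0 are near-duplicates; raising the temperature is one way to spread designs further apart.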
Generate proteins with desired functional annotations (GO terms, enzyme activity).
from esm.models.esm3 import ESM3
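from esm.sdk.api import ESMProtein, GenerationConfig
from esm.utils.types import FunctionAnnotation
model = ESM3.from_pretrained("esm3_sm_open_v1")
# Condition on functional keywords. Annotations are span objects (label, start, end), not plain strings,
# and each label must exist in the model's function vocabulary (InterPro IDs or keywords).
# The 200-residue length and the full-length spans are illustrative choices, not recommended values.
length = 200
protein = ESMProtein(
    sequence="_" * length, # fully masked: generate de novo
    function_annotations=[
        FunctionAnnotation(label="ATP binding", start=1, end=length),
        FunctionAnnotation(label="kinase activity", start=1, end=length),
        FunctionAnnotation(label="protein phosphorylation", start=1, end=length),
    ],
)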
config = GenerationConfig(track="sequence", num_steps=32, temperature=0.7)
result = model.generate(protein, config)
print(f"Function-conditioned sequence: {result.sequence[:50]}...")
print(f"Length: {len(result.sequence)} residues")
Use EvolutionaryScale's cloud inference for large models without local GPU.
from esm.sdk.forge import ESM3ForgeInferenceClient
from esm.sdk.api import ESMProtein, GenerationConfig
# Authenticate (requires FORGE_API_TOKEN env var or explicit token)
client = ESM3ForgeInferenceClient(model="esm3-open-2024-03", token="your_token_here")
protein = ESMProtein(sequence="MKTAYIAKQRQISFVKSHFSRQLEERLG")
config = GenerationConfig(track="structure", num_steps=16)
result = client.generate(protein, config)
result.to_pdb("forge_predicted.pdb")
print("Predicted structure via Forge API → forge_predicted.pdb")
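The example above passes the token as a literal string. A minimal sketch of reading it from the `FORGE_API_TOKEN` environment variable instead, so the token stays out of source files, could look like this. The helper name `get_forge_token` is hypothetical and not part of the esm package.

```python
import os


def get_forge_token(env_var: str = "FORGE_API_TOKEN") -> str:
    """Return the Forge token from the environment, failing early if it is missing."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable to your Forge API token")
    return token


# Intended use (not executed here):
# client = ESM3ForgeInferenceClient(model="esm3-open-2024-03", token=get_forge_token())
```

Failing before the client is created gives a clearer message than an authentication error returned by the API.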
| Feature | ESM3 | ESM C |
|---------|------|-------|
| Primary use | Generative protein design | Embedding extraction |
| Capabilities | Sequence generation, structure prediction, inverse folding, function conditioning | Per-residue and mean-pooled embeddings |
| Model sizes | esm3_sm_open_v1 (~1.4B params) | esmc_300m, esmc_600m |
| GPU requirement | 8GB+ VRAM | 4GB+ VRAM (esmc_300m: 2GB) |
| Use case | Design new proteins, predict structures | Downstream ML (classification, clustering, regression) |
| Cloud option | Forge API (larger models available) | Local open weights; ESM C models are also served through the Forge API |
The GenerationConfig controls how ESM3 generates outputs:
- track: Which modality to generate ("sequence", "structure", "function")
- num_steps: Number of iterative refinement steps (higher = better quality, slower)
- temperature: Sampling temperature (0.0 = greedy, 0.5-0.7 = diverse, 1.0 = maximum diversity)

ESMProtein is the central data container holding sequence, structure coordinates, and functional annotations:
- .sequence: amino acid string (e.g., "MKTAY...")
- .coordinates: 3D atom positions (Nx3 tensor)
- .function_annotations: list of functional keywords
- ESMProtein.from_protein_chain() to load from PDB structures
- .to_pdb() to save predicted structures

Goal: Extract embeddings from protein sequences and train a downstream classifier.
from esm.models.esmc import ESMC
from esm.sdk.api import ESMProtein, LogitsConfig
import torch
import numpy as np
model = ESMC.from_pretrained("esmc_600m")
# Embed a set of sequences
sequences = ["MKTAY...", "MKWVT...", "MSGLI..."] # replace with actual sequences
labels = [0, 1, 0] # binary labels
embeddings = []
for seq in sequences:
    protein = ESMProtein(sequence=seq)
    protein_tensor = model.encode(protein)
    output = model.logits(protein_tensor, LogitsConfig(sequence=True, return_embeddings=True))
    # Cast to float32 before converting: NumPy has no bfloat16 dtype
    mean_emb = output.embeddings.mean(dim=1).detach().float().cpu().numpy()
    embeddings.append(mean_emb.squeeze())
X = np.array(embeddings)
y = np.array(labels)
print(f"Feature matrix: {X.shape}") # (n_samples, 1152)
# Train a simple classifier
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(f"Training accuracy: {clf.score(X, y):.2f}")
Goal: Design multiple novel sequences that fold into a target structure, then rank by predicted quality.
- Load the target structure with ProteinChain.from_pdb() (Core API module 4)
- Generate candidate sequences with temperature=0.7 for diversity (Core API module 1)

| Parameter | Module/Function | Default | Range / Options | Effect |
|-----------|----------------|---------|-----------------|--------|
| num_steps | GenerationConfig | varies | 1–64 | Iterative refinement steps; more = higher quality, slower |
| temperature | GenerationConfig | 1.0 | 0.0–1.5 | Sampling diversity; 0.0=greedy, 0.7=balanced, 1.0+=creative |
| track | GenerationConfig | — | "sequence", "structure", "function" | Which modality to generate |
| model name | from_pretrained | — | "esm3_sm_open_v1", "esmc_300m", "esmc_600m" | Model size/capability tradeoff |
| token | ESM3ForgeInferenceClient | env var | API token string | Forge cloud authentication |
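To make the temperature row in the table above concrete, the sketch below applies temperature scaling to a toy set of three logits. It is a generic temperature-scaled softmax written for illustration, not the sampler used inside ESM3, and the logits are made-up numbers.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        # Greedy limit: all probability mass goes to the highest logit
        best = max(range(len(logits)), key=logits.__getitem__)
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate residues
for t in (0.0, 0.5, 1.0, 1.5):
    probs = softmax_with_temperature(toy_logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

As the temperature rises, probability shifts from the top-scoring candidate toward the others, which is why higher values give more diverse but less conservative samples.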
Use ESM C for embedding tasks, ESM3 for generation: ESM C is smaller, faster, and optimized for representation quality. Only use ESM3 when you need generative capabilities (sequence design, structure prediction, inverse folding).
Mean-pool per-residue embeddings for fixed-length representations: ESM C outputs per-residue embeddings (seq_len × dim). For downstream ML that requires fixed-length input, average across the sequence dimension: embeddings.mean(dim=1).
Use temperature 0.5–0.7 for protein design: Temperature 1.0 produces very diverse but potentially non-functional sequences. Temperature 0.5–0.7 balances diversity with quality. Use temperature 0.0 only for deterministic structure prediction.
Increase num_steps for higher-quality generation: More iterative refinement steps improve output quality at the cost of computation time. Use 8–16 steps for quick exploration, 32+ for final designs.
Batch sequences to maximize GPU utilization: Processing one sequence at a time underutilizes the GPU. When embedding many sequences, batch them (limited by VRAM).
Use Forge API for large-scale or large-model inference: The open-weight ESM3 is a smaller variant. For production-quality protein design, the Forge API provides access to larger models.
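One way to picture the num_steps trade-off described above: with iterative decoding, each step commits only part of the masked positions, so more steps mean fewer positions decided at once and more context available for each decision. The sketch below splits masked positions evenly across steps. It is a simplified illustration with a hypothetical helper name; the schedule the library actually uses may differ.

```python
def unmask_counts(num_masked: int, num_steps: int) -> list[int]:
    """Split num_masked positions as evenly as possible over num_steps decoding steps."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    base, extra = divmod(num_masked, num_steps)
    # The first `extra` steps take one additional position each
    return [base + (1 if i < extra else 0) for i in range(num_steps)]


for steps in (8, 16, 32):
    counts = unmask_counts(100, steps)
    print(f"num_steps={steps}: at most {max(counts)} positions committed per step")
```

For a 100-residue design, going from 8 to 32 steps in this even split cuts the positions committed per step from 13 to 4, at roughly four times the number of forward passes.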
from esm.models.esmc import ESMC
from esm.sdk.api import ESMProtein, LogitsConfig
import torch
import numpy as np
model = ESMC.from_pretrained("esmc_300m")
sequences = {
    "Protein_A": "MKTAYIAKQRQISFVK...",
    "Protein_B": "MKWVTFISLLFLFSSAYS...",
    "Protein_C": "MSGLILQRAAVIAAGASSAG...",
}
# Extract embeddings
embs = {}
for name, seq in sequences.items():
    protein = ESMProtein(sequence=seq)
    protein_tensor = model.encode(protein)
    output = model.logits(protein_tensor, LogitsConfig(sequence=True, return_embeddings=True))
    embs[name] = output.embeddings.mean(dim=1).detach().float().squeeze()
# Compute similarity matrix
names = list(embs.keys())
sim_matrix = np.zeros((len(names), len(names)))
for i, n1 in enumerate(names):
    for j, n2 in enumerate(names):
        sim_matrix[i, j] = torch.cosine_similarity(embs[n1].unsqueeze(0), embs[n2].unsqueeze(0)).item()
print("Similarity matrix:")
for i, name in enumerate(names):
    print(f" {name}: {sim_matrix[i].round(3)}")
from esm.models.esmc import ESMC
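from esm.sdk.api import ESMProtein, LogitsConfig
import torch
import numpy as np
model = ESMC.from_pretrained("esmc_600m")
# Generate and save
protein = ESMProtein(sequence="MKTAYIAKQRQISFVK...")
protein_tensor = model.encode(protein)
output = model.logits(protein_tensor, LogitsConfig(sequence=True, return_embeddings=True))
# Cast to float32 before converting: NumPy has no bfloat16 dtype
np.save("embedding.npy", output.embeddings.detach().float().cpu().numpy())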
print("Saved embedding.npy")
# Load later (no GPU needed)
embedding = np.load("embedding.npy")
print(f"Loaded embedding: {embedding.shape}")
| Problem | Cause | Solution |
|---------|-------|----------|
| CUDA out of memory | Model too large for GPU | Use smaller model (esmc_300m), reduce batch size, or use Forge cloud API |
| RuntimeError: no CUDA device | No GPU available | Models work on CPU (slower). Load with device=torch.device("cpu") in from_pretrained, or use Forge API |
| Slow generation | Too many num_steps or CPU inference | Reduce num_steps (8 for drafts), use GPU, or use Forge API for large models |
| ImportError: esm | Package not installed | pip install esm (note: this is EvolutionaryScale's esm, not the older Facebook Research esm) |
| Low-quality generated sequences | Temperature too high or too few steps | Lower temperature to 0.5, increase num_steps to 32+ |
| Forge API authentication error | Invalid or missing API token | Set FORGE_API_TOKEN env var or pass token= explicitly; get token from forge.evolutionaryscale.ai |
| KeyError loading model weights | Wrong model name | Use exact names: "esm3_sm_open_v1", "esmc_300m", "esmc_600m" |