skills/structural-biology-drug-discovery/diffdock/SKILL.md
Diffusion-based docking that predicts protein-ligand poses without a predefined site. Use for blind docking, when traditional docking fails, or exploring multiple binding modes. Pipeline: prep protein (PDB) and ligand (SMILES/SDF), run inference, analyze confidence-ranked poses.
npx skillsauth add jaechang-hits/sciagent-skills diffdockInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
DiffDock uses a diffusion generative model to predict protein-ligand binding poses directly from protein structure and ligand SMILES, treating docking as a generative rather than a search problem. Unlike traditional docking tools (AutoDock Vina, Glide), DiffDock does not require a predefined binding site — it samples poses across the full protein surface. It outputs a ranked set of binding poses with associated confidence scores. DiffDock excels at blind docking tasks and produces diverse pose hypotheses, making it valuable for de novo binding site discovery and challenging targets.
diffdock (conda install recommended), rdkit, torch, biopython, nglview (visualization)# Recommended: clone and install from source
git clone https://github.com/gcorso/DiffDock.git
cd DiffDock
conda create -n diffdock python=3.9
conda activate diffdock
pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
# Download pretrained model weights
python -c "from utils.download import download_pretrained; download_pretrained()"
from Bio import PDB
from Bio.PDB import PDBParser, PDBIO, Select
class NonHetSelect(Select):
"""Remove HETATM records (ligands, water) — keep only protein atoms."""
def accept_residue(self, residue):
return residue.id[0] == " "
def clean_pdb(input_pdb: str, output_pdb: str):
parser = PDBParser(QUIET=True)
structure = parser.get_structure("protein", input_pdb)
io = PDBIO()
io.set_structure(structure)
io.save(output_pdb, NonHetSelect())
print(f"Cleaned PDB saved to: {output_pdb}")
clean_pdb("raw_protein.pdb", "protein_clean.pdb")
from rdkit import Chem
from rdkit.Chem import AllChem, SDWriter
def smiles_to_sdf(smiles: str, output_sdf: str, n_confs: int = 1):
"""Convert SMILES to 3D SDF for DiffDock input."""
mol = Chem.MolFromSmiles(smiles)
mol = Chem.AddHs(mol)
AllChem.EmbedMolecule(mol, AllChem.ETKDGv3())
AllChem.MMFFOptimizeMolecule(mol)
writer = SDWriter(output_sdf)
writer.write(mol)
writer.close()
print(f"Ligand SDF written to: {output_sdf}")
return mol
# Example: ibuprofen
smiles = "CC(C)Cc1ccc(cc1)C(C)C(=O)O"
mol = smiles_to_sdf(smiles, "ligand.sdf")
print(f"Ligand formula: {Chem.rdMolDescriptors.CalcMolFormula(mol)}")
# Command-line inference (run from the DiffDock directory)
python inference.py \
--protein_path protein_clean.pdb \
--ligand "CC(C)Cc1ccc(cc1)C(C)C(=O)O" \
--out_dir results/ \
--inference_steps 20 \
--samples_per_complex 40 \
--batch_size 10 \
--no_final_step_noise
import subprocess
def run_diffdock(protein_pdb: str, ligand_smiles: str, out_dir: str,
n_samples: int = 40, n_steps: int = 20):
cmd = [
"python", "inference.py",
"--protein_path", protein_pdb,
"--ligand", ligand_smiles,
"--out_dir", out_dir,
"--inference_steps", str(n_steps),
"--samples_per_complex", str(n_samples),
"--batch_size", "10",
"--no_final_step_noise",
]
result = subprocess.run(cmd, capture_output=True, text=True, cwd="DiffDock/")
if result.returncode == 0:
print(f"DiffDock complete. Results in: {out_dir}")
else:
print(f"Error: {result.stderr}")
return result
run_diffdock("protein_clean.pdb", "CC(C)Cc1ccc(cc1)C(C)C(=O)O", "results/")
import re
from pathlib import Path
import pandas as pd
def parse_diffdock_results(out_dir: str) -> pd.DataFrame:
"""Parse DiffDock output SDF files and confidence scores."""
out_path = Path(out_dir)
records = []
# DiffDock names output files: rank{N}_confidence{score}.sdf
for sdf_file in sorted(out_path.glob("rank*_confidence*.sdf")):
name = sdf_file.stem
# Extract rank and confidence from filename
rank_match = re.search(r"rank(\d+)", name)
conf_match = re.search(r"confidence(-?[\d.]+)", name)
if rank_match and conf_match:
records.append({
"rank": int(rank_match.group(1)),
"confidence": float(conf_match.group(1)),
"sdf_file": str(sdf_file),
})
df = pd.DataFrame(records).sort_values("rank")
print(f"Found {len(df)} poses")
print(df[["rank", "confidence", "sdf_file"]].head(10))
return df
df_results = parse_diffdock_results("results/")
from rdkit import Chem
from rdkit.Chem import AllChem
from Bio.PDB import PDBParser
import numpy as np
def get_binding_residues(protein_pdb: str, ligand_sdf: str, cutoff_angstrom: float = 4.0):
"""Find protein residues within cutoff distance of the top-ranked ligand pose."""
parser = PDBParser(QUIET=True)
structure = parser.get_structure("prot", protein_pdb)
prot_atoms = [(atom.get_coord(), residue.resname, residue.id[1])
for chain in structure for residue in chain
for atom in residue.get_atoms()]
mol = Chem.SDMolSupplier(ligand_sdf, removeHs=False)[0]
lig_coords = mol.GetConformer().GetPositions()
contacts = []
for prot_coord, resname, resnum in prot_atoms:
dists = np.linalg.norm(lig_coords - prot_coord, axis=1)
if dists.min() <= cutoff_angstrom:
contacts.append((resnum, resname))
contacts = sorted(set(contacts))
print(f"Binding site residues within {cutoff_angstrom} A: {contacts[:10]}")
return contacts
# Use top-ranked pose
top_sdf = df_results.loc[df_results.rank == 1, "sdf_file"].iloc[0]
contacts = get_binding_residues("protein_clean.pdb", top_sdf)
import nglview as nv
from rdkit import Chem
# Load protein + top pose in Jupyter notebook
view = nv.NGLWidget()
view.add_pdbfile("protein_clean.pdb")
top_sdf = df_results.loc[df_results.rank == 1, "sdf_file"].iloc[0]
view.add_component(top_sdf)
view.representations = [
{"type": "cartoon", "params": {"color": "chainindex"}},
{"type": "ball+stick", "params": {"sele": "ligand"}},
]
print(f"Visualizing top pose: confidence={df_results.confidence.iloc[0]:.3f}")
view
| Parameter | Default | Range / Options | Effect |
|-----------|---------|-----------------|--------|
| --inference_steps | 20 | 10–40 | Number of diffusion reverse steps; more steps = slower but more accurate |
| --samples_per_complex | 40 | 10–100 | Number of poses sampled; more = better coverage of binding modes |
| --batch_size | 10 | 1–32 | GPU batch size; reduce if OOM error |
| --no_final_step_noise | off | flag | Removes noise at last diffusion step; improves pose quality |
| --actual_steps | equals inference_steps | 1–inference_steps | Steps to actually run (can be fewer than total) |
| --save_visualisation | off | flag | Also saves PDB visualization files alongside SDF |
| cutoff_angstrom | 4.0 | 3.0–6.0 Å | Distance cutoff for defining binding site residues |
When to use: Dock a library of analogs to the same protein for SAR analysis.
import pandas as pd
import subprocess
smiles_list = [
("compound_1", "CC(C)Cc1ccc(cc1)C(C)C(=O)O"),
("compound_2", "CC(C)Cc1ccc(cc1)C(C)C(=O)N"),
("compound_3", "CC(C)Cc1ccc(cc1)C(C)C(=O)OC"),
]
results = []
for name, smiles in smiles_list:
out = f"results/{name}"
cmd = ["python", "inference.py",
"--protein_path", "protein_clean.pdb",
"--ligand", smiles,
"--out_dir", out,
"--inference_steps", "20",
"--samples_per_complex", "20"]
subprocess.run(cmd, cwd="DiffDock/", capture_output=True)
# Parse top confidence score
df_r = parse_diffdock_results(out)
if not df_r.empty:
top_conf = df_r.loc[df_r.rank == 1, "confidence"].iloc[0]
results.append({"name": name, "smiles": smiles, "top_confidence": top_conf})
df_batch = pd.DataFrame(results).sort_values("top_confidence", ascending=False)
df_batch.to_csv("batch_docking_results.csv", index=False)
print(df_batch)
When to use: Keep only high-confidence poses for further analysis or visualization.
# Confidence > 0 generally indicates a plausible binding pose
# DiffDock confidence scores: higher = more confident; ~0 is marginal; < -1 is poor
high_conf = df_results[df_results["confidence"] > 0.0]
print(f"High-confidence poses: {len(high_conf)} / {len(df_results)}")
print(high_conf[["rank", "confidence", "sdf_file"]])
When to use: Rescore DiffDock poses with AutoDock Vina's energy function.
# Convert SDF to PDBQT using OpenBabel
obabel rank1_confidence0.75.sdf -O rank1_ligand.pdbqt
obabel protein_clean.pdb -O protein.pdbqt -xr
# Rescore (no docking search, just energy evaluation)
vina --receptor protein.pdbqt --ligand rank1_ligand.pdbqt \
--score_only --out rank1_rescored.pdbqt
results/rank{N}_confidence{score}.sdf — 3D ligand poses ranked by confidence scoredf_results DataFrame with rank, confidence score, and file path per pose| Problem | Cause | Solution |
|---------|-------|----------|
| CUDA out of memory | Batch size too large for GPU | Reduce --batch_size to 4 or 2 |
| Empty results directory | Protein PDB parsing failed | Ensure PDB contains only ATOM records; remove HETATM with clean_pdb() |
| All confidence scores < -2 | Ligand or protein format issue | Validate SMILES with RDKit; ensure protein is protonated and complete |
| Very slow inference (>30 min) | Running on CPU | GPU is strongly recommended; CUDA environment must be correctly configured |
| ModuleNotFoundError: e3nn | Dependency not installed | pip install e3nn in the DiffDock conda environment |
| Poses cluster at one site | Low --samples_per_complex | Increase to 40–100 for better site coverage |
| Protein missing residues | Incomplete crystal structure | Use MODELLER or Swiss-Model to fill gaps before docking |
tools
Fast short-read DNA aligner for WGS/WES/ChIP-seq. 2× faster BWA-MEM successor; outputs SAM/BAM with read group headers for GATK. Primary plus supplementary records for chimeric reads. Use STAR for RNA-seq splice-aware alignment; Bowtie2 is a comparable alternative.
tools
smina molecular docking CLI. AutoDock Vina fork with customizable scoring functions, native SDF/MOL2/PDB ligand input, autoboxing, local energy minimization, and per-atom score breakdowns. Pipeline: receptor PDBQT prep -> ligand prep (RDKit/OpenBabel) -> dock via autobox or explicit grid -> rescore/minimize with custom scoring -> rank poses by affinity. Choose smina over Vina when you need custom scoring terms (--custom_scoring), local optimization of an existing pose (--local_only), per-atom contributions (--atom_term_data), or SDF/MOL2 ligands without manual PDBQT conversion. For unknown binding sites use diffdock-blind-docking; for the Python-bindings/Vinardo workflow use autodock-vina-docking.
development
mdtraj molecular dynamics trajectory analysis (Python). Reads DCD/XTC/TRR/NetCDF/H5/PDB topologies and trajectories; computes RMSD vs time, radius of gyration, per-residue RMSF, residue-residue contact frequency maps, phi/psi torsions for Ramachandran plots (general + Gly/Pro), and 8-state DSSP secondary structure. Modules: trajectory I/O, geometry (distances/angles/dihedrals), structural analysis (RMSD/Rg/RMSF/SASA), contacts, hydrogen bonds, secondary structure (DSSP), NMR observables. For broader atom-selection grammar use mdanalysis-trajectory; for running MD simulations use OpenMM/GROMACS.
development
Programmatic PubMed access via NCBI E-utilities REST API. Covers Boolean/MeSH queries, field-tagged search, endpoints (ESearch, EFetch, ESummary, EPost, ELink), history server for batches, citation matching, systematic review strategies. Use for biomedical literature search or automated pipelines.