skills/structural-biology-drug-discovery/mdtraj-trajectory-analysis/SKILL.md
mdtraj molecular dynamics trajectory analysis (Python). Reads DCD/XTC/TRR/NetCDF/H5/PDB topologies and trajectories; computes RMSD vs time, radius of gyration, per-residue RMSF, residue-residue contact frequency maps, phi/psi torsions for Ramachandran plots (general + Gly/Pro), and 8-state DSSP secondary structure. Modules: trajectory I/O, geometry (distances/angles/dihedrals), structural analysis (RMSD/Rg/RMSF/SASA), contacts, hydrogen bonds, secondary structure (DSSP), NMR observables. For broader atom-selection grammar use mdanalysis-trajectory; for running MD simulations use OpenMM/GROMACS.
npx skillsauth add jaechang-hits/sciagent-skills mdtraj-trajectory-analysisInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
mdtraj is a dependency-light Python library for analyzing MD trajectories. Reads DCD/XTC/TRR/NetCDF/H5/AMBER/GROMACS/CHARMM/OpenMM into a Trajectory object backed by NumPy arrays, then exposes geometry, RMSD/Rg/RMSF/SASA, contacts, hydrogen bonds, torsions, and 8-state DSSP as pure-Python functions.
Units: mdtraj uses nm and ps internally. Multiply distances by 10 for Å, divide time by 1000 for ns. Torsions are in radians —
np.degrees().
mdtraj, numpy, pandas, matplotlibCheck before installing — inside a pixi/conda env mdtraj is usually present:
python3 -c "import mdtraj" 2>/dev/null || conda install -c conda-forge mdtraj numpy pandas matplotlib
import mdtraj as md
traj = md.load("traj.xtc", top="topology.pdb")
ca = traj.topology.select("name CA")
traj.superpose(traj, frame=0, atom_indices=ca)
rmsd_ang = md.rmsd(traj, traj, frame=0, atom_indices=ca) * 10.0 # nm -> Å
print(f"Frames: {traj.n_frames} RMSD: {rmsd_ang.min():.2f}–{rmsd_ang.max():.2f} Å")
Load whole or streamed. Format auto-detected from extension.
import mdtraj as md
traj = md.load("rep1.xtc", top="protein.pdb")
# Stream large trajectories — avoids OOM
for chunk in md.iterload("rep1.xtc", top="protein.pdb", chunk=500):
rmsd_chunk = md.rmsd(chunk, chunk, frame=0)
# Save subset
ca = traj.topology.select("name CA")
traj.atom_slice(ca).save_dcd("ca_only.dcd")
Selecting atoms and slicing frames:
backbone = traj.topology.select("backbone")
chain_a = traj.topology.select("chainid 0")
first_ns = traj[:1000]
every_10th = traj[::10]
last_half_bb = traj[traj.n_frames // 2:].atom_slice(backbone)
md.rmsd superposes internally; RMSF you compute manually after explicit superpose.
import mdtraj as md, numpy as np
traj = md.load("rep1.xtc", top="protein.pdb")
ca = traj.topology.select("name CA")
rmsd_ang = md.rmsd(traj, traj, frame=0, atom_indices=ca) * 10.0 # Å
rg_ang = md.compute_rg(traj) * 10.0 # Å
time_ns = traj.time / 1000.0
print(f"<RMSD>={rmsd_ang.mean():.2f} Å, <Rg>={rg_ang.mean():.2f} Å")
# Per-CA RMSF — average-structure reference
ca_traj = traj.atom_slice(ca)
ca_traj.superpose(ca_traj, frame=0)
diff = ca_traj.xyz - ca_traj.xyz.mean(axis=0)
rmsf_ang = np.sqrt((diff ** 2).sum(axis=2).mean(axis=0)) * 10.0
res_ids = [a.residue.resSeq for a in ca_traj.topology.atoms]
print(f"Max RMSF: residue {res_ids[np.argmax(rmsf_ang)]} = {rmsf_ang.max():.2f} Å")
Threshold distances to get contact frequency.
import mdtraj as md, numpy as np
traj = md.load("rep1.xtc", top="protein.pdb")
distances_nm, pairs = md.compute_contacts(traj, contacts="all", scheme="closest-heavy")
# distances_nm: (n_frames, n_pairs); pairs: (n_pairs, 2) of residue indices
contact_freq = (distances_nm < 0.5).mean(axis=0) # 5 Å cutoff
n_res = traj.n_residues
freq_map = np.zeros((n_res, n_res))
for (i, j), f in zip(pairs, contact_freq):
freq_map[i, j] = freq_map[j, i] = f
print(f"Persistent contacts (>0.8): {(contact_freq > 0.8).sum()}")
phi/psi returned in radians; intersect on residue since first residue has no phi and last has no psi.
import mdtraj as md, numpy as np
traj = md.load("rep1.xtc", top="protein.pdb")
phi_ix, phi_rad = md.compute_phi(traj)
psi_ix, psi_rad = md.compute_psi(traj)
def res_of(indices): return np.array([traj.topology.atom(ix[1]).residue.index for ix in indices])
phi_res, psi_res = res_of(phi_ix), res_of(psi_ix)
common = np.intersect1d(phi_res, psi_res)
phi_deg = np.degrees(phi_rad[:, np.isin(phi_res, common)])
psi_deg = np.degrees(psi_rad[:, np.isin(psi_res, common)])
Filter to Gly / Pro residues:
res_names = [traj.topology.residue(r).name for r in common]
gly_cols = [i for i, n in enumerate(res_names) if n == "GLY"]
pro_cols = [i for i, n in enumerate(res_names) if n == "PRO"]
phi_gly, psi_gly = phi_deg[:, gly_cols].ravel(), psi_deg[:, gly_cols].ravel()
phi_pro, psi_pro = phi_deg[:, pro_cols].ravel(), psi_deg[:, pro_cols].ravel()
md.compute_dssp(traj, simplified=False) returns (n_frames, n_residues) of one-character codes:
| Code | Meaning |
|------|--------------------|
| H | alpha-helix |
| B | beta-bridge |
| E | beta-sheet |
| G | 3_10-helix |
| I | pi-helix |
| T | turn |
| S | bend |
| (space) or C | coil |
import mdtraj as md
traj = md.load("rep1.xtc", top="protein.pdb")
dssp = md.compute_dssp(traj, simplified=False)
dssp[dssp == " "] = "C"
print(f"DSSP grid: {dssp.shape} unique codes: {set(dssp.flatten())}")
import mdtraj as md
import pandas as pd
traj = md.load("rep1.xtc", top="protein.pdb")
# Baker–Hubbard: returns (n_hbonds, 3) of [donor, H, acceptor] atom indices
hbonds = md.baker_hubbard(traj, freq=0.5, periodic=False)
label = lambda a: f"{traj.topology.atom(a).residue}-{traj.topology.atom(a).name}"
hbonds_df = pd.DataFrame({"donor": [label(d) for d, _, _ in hbonds],
"acceptor": [label(a) for _, _, a in hbonds]})
# SASA in nm^2 per residue per frame
sasa_nm2 = md.shrake_rupley(traj, mode="residue")
total_ang2 = sasa_nm2.sum(axis=1) * 100.0 # nm^2 -> Å^2
print(f"Persistent H-bonds: {len(hbonds_df)}, <SASA>: {total_ang2.mean():.0f} Ų")
traj.xyz is in nm, traj.time in ps, torsions in radians. Convert before plotting:
import numpy as np
xyz_ang = traj.xyz * 10.0
time_ns = traj.time / 1000.0
phi_deg = np.degrees(phi_rad)
resSeqmd.compute_contacts(..., contacts="all") returns 0-based residue indices (contiguous across chains), not PDB resSeq. Reshape into (n_res, n_res) before heatmaps. Map back via traj.topology.residue(i).resSeq only when labeling.
simplified=False → 8 states. simplified=True → 3 states (H/E/C), which loses 3_10 (G), pi-helix (I), beta-bridge (B), bend (S), turn (T).
Goal: structural-analysis report for three MD replicates — RMSD/Rg over time, per-residue RMSF, residue contact frequency map (rep1), Ramachandran general + Gly + Pro (rep1), DSSP 8-state time series (rep1). Plots use matplotlib defaults; the consumer picks palette downstream.
import mdtraj as md, numpy as np, matplotlib.pyplot as plt
from pathlib import Path
replicas = {"rep1": "rep1.xtc", "rep2": "rep2.xtc", "rep3": "rep3.xtc"}
topology = "protein.pdb"
outdir = Path("figures"); outdir.mkdir(exist_ok=True)
trajs = {name: md.load(p, top=topology) for name, p in replicas.items()}
ca = next(iter(trajs.values())).topology.select("name CA")
# 1. RMSD vs time + Rg (Å) for all replicates
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
for name, traj in trajs.items():
traj.superpose(traj, frame=0, atom_indices=ca)
ax1.plot(traj.time / 1000.0, md.rmsd(traj, traj, frame=0, atom_indices=ca) * 10.0, lw=1, label=name)
ax2.plot(traj.time / 1000.0, md.compute_rg(traj) * 10.0, lw=1, label=name)
ax1.set(xlabel="Time (ns)", ylabel="RMSD (Å)", title="Backbone RMSD")
ax2.set(xlabel="Time (ns)", ylabel="Rg (Å)", title="Radius of gyration")
for ax in (ax1, ax2): ax.legend(frameon=False)
fig.tight_layout(); fig.savefig(outdir / "01_rmsd_rg.png"); plt.close(fig)
# 2. Per-residue RMSF (Å) across replicates
fig, ax = plt.subplots(figsize=(10, 4))
for name, traj in trajs.items():
ct = traj.atom_slice(ca); ct.superpose(ct, frame=0)
rmsf = np.sqrt(((ct.xyz - ct.xyz.mean(axis=0)) ** 2).sum(axis=2).mean(axis=0)) * 10.0
ax.plot([a.residue.resSeq for a in ct.topology.atoms], rmsf, lw=1, label=name)
ax.set(xlabel="Residue", ylabel="RMSF (Å)", title="Per-residue flexibility")
ax.legend(frameon=False); fig.tight_layout()
fig.savefig(outdir / "02_rmsf.png"); plt.close(fig)
# 3. Contact frequency map (rep1, 5 Å cutoff)
rep1 = trajs["rep1"]
dist_nm, pairs = md.compute_contacts(rep1, contacts="all", scheme="closest-heavy")
freq = (dist_nm < 0.5).mean(axis=0)
freq_map = np.zeros((rep1.n_residues, rep1.n_residues))
for (i, j), f in zip(pairs, freq):
freq_map[i, j] = freq_map[j, i] = f
np.fill_diagonal(freq_map, 1.0)
fig, ax = plt.subplots(figsize=(6, 5))
im = ax.imshow(freq_map, vmin=0, vmax=1, origin="lower")
ax.set(xlabel="Residue", ylabel="Residue", title="Contact frequency (rep1)")
fig.colorbar(im, ax=ax, label="Fraction of frames in contact")
fig.tight_layout(); fig.savefig(outdir / "03_contact_map.png"); plt.close(fig)
# 4. Ramachandran (rep1) — phi/psi 2D density
phi_ix, phi_rad = md.compute_phi(rep1)
psi_ix, psi_rad = md.compute_psi(rep1)
def res_of(ix, t): return np.array([t.topology.atom(i[1]).residue.index for i in ix])
phi_res, psi_res = res_of(phi_ix, rep1), res_of(psi_ix, rep1)
common = np.intersect1d(phi_res, psi_res)
phi_deg = np.degrees(phi_rad[:, np.isin(phi_res, common)]).ravel()
psi_deg = np.degrees(psi_rad[:, np.isin(psi_res, common)]).ravel()
H, _, _ = np.histogram2d(phi_deg, psi_deg, bins=72, range=[[-180, 180], [-180, 180]])
fig, ax = plt.subplots(figsize=(5, 5))
ax.imshow(H.T, origin="lower", extent=[-180, 180, -180, 180], aspect="equal")
ax.set(xlabel=r"$\phi$ (°)", ylabel=r"$\psi$ (°)", title="Ramachandran (rep1)")
fig.tight_layout(); fig.savefig(outdir / "04_ramachandran_all.png"); plt.close(fig)
# 5. Glycine + proline Ramachandran
res_names = [rep1.topology.residue(r).name for r in common]
for label, cols, fname in [("Glycine", [i for i, n in enumerate(res_names) if n == "GLY"], "05_rama_gly.png"),
("Proline", [i for i, n in enumerate(res_names) if n == "PRO"], "06_rama_pro.png")]:
if not cols: continue
phi = np.degrees(phi_rad[:, cols]).ravel()
psi = np.degrees(psi_rad[:, cols]).ravel()
H, _, _ = np.histogram2d(phi, psi, bins=60, range=[[-180, 180], [-180, 180]])
fig, ax = plt.subplots(figsize=(5, 5))
ax.imshow(H.T, origin="lower", extent=[-180, 180, -180, 180], aspect="equal")
ax.set(xlabel=r"$\phi$ (°)", ylabel=r"$\psi$ (°)", title=f"{label} Ramachandran (rep1)")
fig.tight_layout(); fig.savefig(outdir / fname); plt.close(fig)
# 6. DSSP 8-state time series (rep1)
dssp = md.compute_dssp(rep1, simplified=False); dssp[dssp == " "] = "C"
codes_order = ["C", "E", "B", "S", "T", "H", "I", "G"]
code_to_int = {c: i for i, c in enumerate(codes_order)}
dssp_int = np.vectorize(lambda c: code_to_int.get(c, 0))(dssp)
fig, ax = plt.subplots(figsize=(10, 5))
ax.imshow(dssp_int.T, aspect="auto", origin="lower", interpolation="nearest",
extent=[rep1.time[0] / 1000.0, rep1.time[-1] / 1000.0, 0, rep1.n_residues])
ax.set(xlabel="Time (ns)", ylabel="Residue", title="DSSP 8-state (rep1)")
fig.tight_layout(); fig.savefig(outdir / "07_dssp_8state.png"); plt.close(fig)
# Dump DSSP code mapping so downstream code can pick its own legend/palette
np.savez(outdir / "07_dssp_8state.npz", dssp=dssp, dssp_int=dssp_int, codes_order=codes_order)
print("All figures written to figures/")
Goal: persistent hydrogen bonds (≥50% of frames) and total SASA time series per replicate.
import mdtraj as md, numpy as np, pandas as pd
replicas = {"rep1": "rep1.xtc", "rep2": "rep2.xtc", "rep3": "rep3.xtc"}
top = "protein.pdb"
rows = []
for name, p in replicas.items():
t = md.load(p, top=top)
hbonds = md.baker_hubbard(t, freq=0.5, periodic=False)
sasa_ang2 = (md.shrake_rupley(t, mode="residue").sum(axis=1) * 100.0)
rows.append({"replica": name, "n_persistent_hbonds": len(hbonds),
"sasa_mean_ang2": sasa_ang2.mean(), "sasa_std_ang2": sasa_ang2.std()})
print(pd.DataFrame(rows).to_string(index=False))
| Parameter | Module | Default | Range / Options | Effect |
|-----------|--------|---------|-----------------|--------|
| chunk (md.iterload) | I/O | 100 | 100-5000 | Frames per stream chunk; trade speed vs RAM |
| atom_indices (md.rmsd) | Stability | None | array of ints | Restrict superposition + RMSD (e.g., CA only) |
| frame (md.rmsd) | Stability | 0 | 0-n_frames-1 | Reference frame |
| contacts (md.compute_contacts) | Contacts | "all" | "all", [(i,j),...] | All residue pairs or custom list |
| scheme (md.compute_contacts) | Contacts | "closest-heavy" | "ca", "closest", "closest-heavy", "sidechain", "sidechain-heavy" | Atom subset for inter-residue distance |
| cutoff (contact threshold, user-set) | Contacts | 0.5 nm | 0.4-0.8 nm | In-contact distance (5 Å for CA, 4 Å for closest-heavy) |
| simplified (md.compute_dssp) | DSSP | True | True/False | False → 8-state codes |
| freq (md.baker_hubbard) | H-bonds | 0.1 | 0.0-1.0 | Min frame fraction to report |
| mode (md.shrake_rupley) | SASA | "atom" | "atom", "residue" | Per-atom or summed per residue |
| probe_radius (md.shrake_rupley) | SASA | 0.14 nm | 0.12-0.20 nm | Solvent probe (1.4 Å standard) |
| periodic (geometry funcs) | Geometry | True | bool | Minimum-image convention if box info present |
np.degrees). #1 cause of "my RMSD is 10× too small".md.rmsd superposes internally; RMSF doesn't. Without explicit traj.superpose(...), RMSF picks up rigid-body motion.md.iterload for big trajectories — 1 µs at 1 ps/frame is millions of frames. Streaming avoids OOM.simplified=False for fine-grained DSSP — 3-state collapses 3_10 / pi-helix / bend transitions.resSeq — md.compute_contacts returns 0-based contiguous indices. Map to resSeq only for labeling.traj.atom_slice(traj.topology.select("protein")) cuts memory/time 5–20× on solvated systems.import mdtraj as md
traj = md.load("input.nc", top="topology.parm7")
traj.save_xtc("output.xtc")
traj.save_pdb("output.pdb")
print(f"Converted {traj.n_frames} frames -> output.xtc")
Ensemble flexibility input for flexibility-aware docking or homology refinement.
import mdtraj as md, numpy as np, pandas as pd
paths = {"rep1": "rep1.xtc", "rep2": "rep2.xtc", "rep3": "rep3.xtc"}
top = "protein.pdb"
rmsf_by_rep = {}
for name, p in paths.items():
t = md.load(p, top=top)
ct = t.atom_slice(t.topology.select("name CA")); ct.superpose(ct, frame=0)
rmsf_by_rep[name] = np.sqrt(((ct.xyz - ct.xyz.mean(axis=0)) ** 2).sum(axis=2).mean(axis=0)) * 10.0
df = pd.DataFrame(rmsf_by_rep)
df["mean"], df["std"] = df.mean(axis=1), df.std(axis=1)
df.to_csv("rmsf_replicates.csv", index_label="residue_index")
Curate "open" conformations for clustering or downstream analysis.
import mdtraj as md
traj = md.load("rep1.xtc", top="protein.pdb")
ca = traj.topology.select("name CA")
traj.superpose(traj, frame=0, atom_indices=ca)
rmsd_ang = md.rmsd(traj, traj, frame=0, atom_indices=ca) * 10.0
open_state = traj[rmsd_ang > 4.0]
open_state.save_dcd("open_state.dcd")
print(f"Saved {open_state.n_frames}/{traj.n_frames} frames")
import mdtraj as md, numpy as np
paths = ["rep1.xtc", "rep2.xtc", "rep3.xtc"]
helix_fracs = [(md.compute_dssp(md.load(p, top="protein.pdb"), simplified=False) == "H").mean() for p in paths]
print(f"α-helix per replicate: {[f'{h:.2f}' for h in helix_fracs]}")
print(f"Mean ± std: {np.mean(helix_fracs):.2f} ± {np.std(helix_fracs):.2f}")
| Problem | Cause | Solution |
|---------|-------|----------|
| RuntimeError: No topology | XTC/DCD/TRR carry no topology | Pass top="protein.pdb" (or .prmtop) to md.load |
| RMSD 10× too small | mdtraj returns nm | Multiply by 10.0 |
| RMSF spikes at termini | Intrinsic terminal flexibility, not a bug | Slice termini off or report explicitly |
| DSSP returns only H/E/C | simplified=True default | Pass simplified=False |
| Empty/' ' DSSP cells | Space = coil, not missing | dssp[dssp == " "] = "C" |
| MemoryError on big trajectory | Whole trajectory in RAM | Switch to md.iterload(..., chunk=500) |
| phi/psi arrays differ in residue count | First residue has no phi; last has no psi | Intersect residue index sets before pairing |
| Ramachandran off-center | Angles still in radians | Wrap with np.degrees; set range=[[-180, 180], [-180, 180]] |
| md.compute_contacts returns NaN | Some pairs lack heavy atoms for closest-heavy (e.g., glycine sidechain) | Use scheme="closest" or "ca"; drop NaN columns |
| Contact map asymmetric | Only filled upper triangle | Mirror: freq_map[j, i] = freq_map[i, j] |
| traj.time all zeros | Some writers leave time blank | Reconstruct: traj.time = np.arange(traj.n_frames) * dt_ps |
tools
Fast short-read DNA aligner for WGS/WES/ChIP-seq. 2× faster BWA-MEM successor; outputs SAM/BAM with read group headers for GATK. Primary plus supplementary records for chimeric reads. Use STAR for RNA-seq splice-aware alignment; Bowtie2 is a comparable alternative.
tools
smina molecular docking CLI. AutoDock Vina fork with customizable scoring functions, native SDF/MOL2/PDB ligand input, autoboxing, local energy minimization, and per-atom score breakdowns. Pipeline: receptor PDBQT prep -> ligand prep (RDKit/OpenBabel) -> dock via autobox or explicit grid -> rescore/minimize with custom scoring -> rank poses by affinity. Choose smina over Vina when you need custom scoring terms (--custom_scoring), local optimization of an existing pose (--local_only), per-atom contributions (--atom_term_data), or SDF/MOL2 ligands without manual PDBQT conversion. For unknown binding sites use diffdock-blind-docking; for the Python-bindings/Vinardo workflow use autodock-vina-docking.
development
Programmatic PubMed access via NCBI E-utilities REST API. Covers Boolean/MeSH queries, field-tagged search, endpoints (ESearch, EFetch, ESummary, EPost, ELink), history server for batches, citation matching, systematic review strategies. Use for biomedical literature search or automated pipelines.
testing
Structured hypothesis formulation: turn observations into testable hypotheses with predictions, propose mechanisms, design experiments. Follows the scientific method. Use scientific-brainstorming for open ideation; hypogenic for automated LLM hypothesis testing on datasets.