chemoinformatics/conformer-generation/SKILL.md
Generates 3D conformer ensembles using RDKit ETKDGv3 with knowledge-enhanced distance geometry, MMFF94/UFF force-field optimization, CREST + GFN2-xTB semi-empirical refinement, and macrocycle-aware torsion preferences. Provides explicit decision rules for single vs ensemble conformer use, RMSD pruning, energy windows, conformer count, and force-field choice. Use when preparing 3D ligands for docking, generating descriptor input for 3D QSAR, or sampling macrocycle/peptide conformational ensembles.
npx skillsauth add GPTomics/bioSkills bio-conformer-generationInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: RDKit 2024.09+, xtb 6.7+, CREST 3.0+, OpenMM 8.1+ for follow-up MD.
Before using code patterns, verify installed versions match. If versions differ:
pip show <package> then help(module.function) to check signaturesxtb --version; crest --versionIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
Generate 3D conformer ensembles for molecules from 2D structures. The choice of method depends on molecule size, flexibility, and downstream use: ETKDG / ETKDGv3 (Riniker & Landrum 2015 J Chem Inf Model 55:2562-2574; knowledge-enhanced distance geometry) is the modern default for drug-like molecules, MMFF94/UFF for fast energy minimization, CREST + GFN2-xTB for high-accuracy semi-empirical sampling of macrocycles and peptides. A single conformer is rarely sufficient: descriptor variance across the ensemble can exceed the descriptor signal, and docking pose accuracy degrades if the starting conformer is non-bioactive.
For docking pose validation, see chemoinformatics/pose-validation. For free-energy methods (which require ensemble sampling), see chemoinformatics/free-energy-calculations.
| Method | Cost / mol | Quality | Use case | Fails when | |--------|-----------|---------|----------|------------| | ETKDGv3 + MMFF94 | <1s | Good for drug-like | Default; docking input; descriptors | Macrocycles, peptides, transition metals | | ETKDGv3 + UFF | <1s | Lower-quality MMFF94 alternative | Fallback when MMFF94 fails to parameterize | Same as MMFF94 | | Omega (OpenEye) | 1s | Industry-standard commercial | Commercial pipelines | License cost | | Confab (Open Babel) | 5s | Systematic torsion search | Patent expiration | Quality limited | | RDKit ETKDGv3 + macrocycle preferences | 10-60s | Drug-like macrocycles | Macrocyclic peptides | Still limited; CREST better | | CREST + GFN2-xTB | minutes | High-accuracy semi-empirical | Macrocycles, peptides, conformer ensembles for QSAR | Computationally expensive; metal centers | | CREST + GFN-FF | seconds | GFN2 quality at FF speed | Quick screening | Limited element coverage | | GeoMol (Ganea 2021) | <0.1s GPU | ML-fast, ETKDGv3-quality | Large library 3D conformers | ML training distribution | | TorsionNet (Gogineni 2020) | <0.1s GPU | ML-fast | Drug-like | ML training distribution | | MD sampling (OpenMM) | hours | High-quality dynamic | Free energy, induced fit | Computational cost |
Decision: For drug-like molecules (<500 Da, <8 rotatable bonds), ETKDGv3 + MMFF94 with 20-100 conformers is the modern default. For macrocycles, peptides, or molecules with >12 rotatable bonds, CREST + GFN2-xTB captures the conformational diversity. For ML-scale (>1M molecules), GeoMol trades accuracy for speed.
| Scenario | Method | Conformer count | Energy window | |----------|--------|-----------------|----------------| | Single docking pose (initial 3D) | ETKDGv3 + MMFF94 | 1 | n/a | | Multi-conformer docking | ETKDGv3 + MMFF94 | 10-50 | 10 kcal/mol | | 3D QSAR descriptor input | ETKDGv3 + MMFF94 | 50-200 | 5 kcal/mol | | Pharmacophore search | ETKDGv3 + MMFF94 | 100-500 | 5 kcal/mol | | Macrocycle / peptide | CREST + GFN2-xTB | 50-200 (auto from CREST) | 5-8 kcal/mol | | FEP input | CREST + GFN2-xTB then MD relax | 1-3 representative | 3 kcal/mol | | Bioactive conformer search | ETKDGv3 + MMFF94 then dock with rescore | 100-500 | 10 kcal/mol | | Shape similarity / ROCS | ETKDGv3 + MMFF94 | 50-200 | 10 kcal/mol | | Conformer-dependent descriptors | ETKDGv3 ensemble + Boltzmann avg | 20-100 | 5 kcal/mol |
ETKDGv3 (Riniker & Landrum 2015) incorporates experimental torsion preferences into distance geometry: starts from random embeddings, refines by satisfying experimentally-derived bond, angle, and torsion preferences.
Goal: Generate an ensemble of 3D conformers from a SMILES with the modern default embedding algorithm.
Approach: Add explicit hydrogens, configure ETKDGv3 params (random seed, max attempts, random coords), and embed multiple conformers via EmbedMultipleConfs.
from rdkit import Chem
from rdkit.Chem import AllChem
def gen_conformers(smiles, n_conf=20, seed=42):
mol = Chem.MolFromSmiles(smiles)
mol = Chem.AddHs(mol)
params = AllChem.ETKDGv3()
params.randomSeed = seed
params.useRandomCoords = True
params.maxAttempts = 1000
ids = AllChem.EmbedMultipleConfs(mol, numConfs=n_conf, params=params)
return mol, list(ids)
useRandomCoords=True improves convergence for macrocycles and heavily-rotated molecules. maxAttempts=1000 handles difficult embeddings.
After embedding, minimize each conformer to a local minimum.
Goal: Reduce strain in each embedded conformer to a stable local minimum and record the resulting energies.
Approach: Build MMFF94s force-field parameters, minimize each conformer in place, and collect energies; fall back to UFF when MMFF94 cannot parameterize the molecule.
def optimize_conformers(mol, conf_ids, force_field='mmff94'):
energies = []
if force_field == 'mmff94':
mmff_props = AllChem.MMFFGetMoleculeProperties(mol, mmffVariant='MMFF94s')
for cid in conf_ids:
ff = AllChem.MMFFGetMoleculeForceField(mol, mmff_props, confId=cid)
ff.Minimize()
energies.append(ff.CalcEnergy())
else: # UFF fallback
for cid in conf_ids:
ff = AllChem.UFFGetMoleculeForceField(mol, confId=cid)
ff.Minimize()
energies.append(ff.CalcEnergy())
return energies
MMFF94 vs MMFF94s: MMFF94s is the "standard" set with simpler aromatic nitrogen handling; preferred for most drug-like.
UFF (Universal Force Field): Lower quality but handles any element including transition metals. Use as fallback when MMFF94 cannot parameterize (uncommon elements, charged species).
Remove near-duplicate conformers within a chosen RMSD cutoff to keep the ensemble diverse:
import numpy as np
def prune_conformers_rmsd(mol, conf_ids, rmsd_cutoff=0.5):
n = len(conf_ids)
keep = []
for i, cid in enumerate(conf_ids):
is_unique = True
for kept_cid in keep:
rmsd = AllChem.GetBestRMS(mol, mol, cid, kept_cid)
if rmsd < rmsd_cutoff:
is_unique = False
break
if is_unique:
keep.append(cid)
return keep
Typical RMSD cutoff (Source / Rationale):
| Cutoff | Use case | Source | |--------|----------|--------| | 0.5 Å | Drug-like ensemble for descriptors / docking | Empirical: below this conformers represent same minimum (Hawkins 2007) | | 1.0 Å | Drug-like ensemble for pharmacophore | Standard ROCS / pharmacophore practice | | 1.5-2.0 Å | Macrocycles / peptides | Higher conformational freedom; Tan 2018 macrocycle benchmarks | | 2.0+ Å | Cluster-centroid representative ensembles | Coarse representative sampling |
Remove conformers above an energy cutoff (high-energy conformers are unlikely to be bioactive):
def filter_by_energy(mol, conf_ids, energies, window_kcal=10.0):
min_e = min(energies)
keep = []
for cid, e in zip(conf_ids, energies):
if e - min_e <= window_kcal:
keep.append(cid)
return keep
Window choice:
Macrocycles (>=12 atom rings) have distinct conformational issues: ETKDGv3 default knowledge base under-samples macrocycle torsions. Use macrocycle-specific torsion preferences:
from rdkit.Chem import AllChem
def macrocycle_conformers(smiles, n_conf=200, seed=42):
mol = Chem.MolFromSmiles(smiles)
mol = Chem.AddHs(mol)
params = AllChem.ETKDGv3()
params.randomSeed = seed
params.useRandomCoords = True
params.useMacrocycleTorsions = True
params.useSmallRingTorsions = True
params.maxAttempts = 5000
ids = AllChem.EmbedMultipleConfs(mol, numConfs=n_conf, params=params)
return mol, list(ids)
For pharmaceutical macrocycles (cyclosporine, paclitaxel, large peptides), CREST + GFN2-xTB is the gold standard.
CREST (Grimme 2024) performs iterative meta-dynamics + GFN2-xTB optimization for conformer sampling.
Goal: Sample high-quality conformer ensembles for macrocycles, peptides, or molecules where ETKDGv3 + MMFF94 is inadequate.
Approach: Start from an RDKit-generated MMFF94-relaxed conformer, write to XYZ, and run CREST with GFN2-xTB driver to perform iterative meta-dynamics + reoptimization.
xtb mol.xyz --opt extreme
crest opt.xyz --gfn2 --T 12 -ewin 6
--gfn2: use GFN2-xTB (most accurate of GFN family for drug-like molecules).
--gfn-ff: use GFN-FF (faster, less accurate).
-ewin 6: 6 kcal/mol energy window above global min.
-T 12: use 12 CPU threads.
Output: crest_conformers.xyz with sampled ensemble.
Workflow: Start from RDKit ETKDGv3 + MMFF94 (cheap initial structure) -> save as XYZ -> CREST refinement.
from rdkit import Chem
from rdkit.Chem import AllChem
import subprocess
def crest_workflow(smiles, out_dir='crest_out'):
mol = Chem.MolFromSmiles(smiles)
mol = Chem.AddHs(mol)
AllChem.EmbedMolecule(mol, AllChem.ETKDGv3())
AllChem.MMFFOptimizeMolecule(mol)
xyz = Chem.MolToXYZBlock(mol)
with open(f'{out_dir}/input.xyz', 'w') as f:
f.write(xyz)
subprocess.run(['crest', f'{out_dir}/input.xyz', '--gfn2', '-T', '12'],
cwd=out_dir, check=True)
return f'{out_dir}/crest_conformers.xyz'
For ensemble descriptors (3D shape, dipole moment, polar surface area in 3D), Boltzmann-weight by energy:
import numpy as np
def boltzmann_weights(energies, T=300.0):
energies = np.array(energies)
kt = 0.001987 * T # kcal/mol at 300K
rel = energies - energies.min()
w = np.exp(-rel / kt)
return w / w.sum()
def boltzmann_average(values, energies, T=300.0):
w = boltzmann_weights(energies, T)
return float(np.sum(np.array(values) * w))
For Boltzmann averaging, energies should be MMFF94 or higher quality. UFF energies are unreliable for Boltzmann weighting.
For very large libraries (>1M compounds), classical methods become bottlenecks. ML-based methods generate conformers in <0.1s/mol on GPU:
# Pseudo-code for GeoMol-style ML conformer generation
# (Requires pre-trained model + dependencies)
# from geomol import generate_conformers
# conformers = generate_conformers(smiles, n_conformers=10)
Trade-off: ML methods (GeoMol, TorsionNet) match ETKDGv3 quality on drug-like molecules but extrapolate poorly outside training distribution (macrocycles, organometallics).
Trigger: Macrocycle, highly constrained polycyclic, or sterically crowded molecule.
Mechanism: Distance geometry cannot find consistent 3D structure within max attempts.
Symptom: EmbedMolecule returns -1; EmbedMultipleConfs returns empty list.
Fix: Set useRandomCoords=True, increase maxAttempts to 5000+; for macrocycles, set useMacrocycleTorsions=True. As fallback, use CREST.
Trigger: Molecule contains element not parameterized (transition metals, certain S+ species).
Mechanism: MMFF94 only covers H, C, N, O, F, Si, P, S, Cl, Br, I + select cations.
Symptom: MMFFGetMoleculeProperties returns None; optimization silently no-ops.
Fix: Fall back to UFF; or for metals, use GFN2-xTB.
Trigger: n_conf=10 for a flexible molecule (>5 rotatable bonds).
Mechanism: 10 conformers insufficient to sample conformational space; many minima missed.
Symptom: RMSD distribution narrow; descriptor variance underestimated.
Fix: Use n_conf = max(10, 5 * NumRotatableBonds + 10) heuristic (Hawkins 2017).
Trigger: Calculating 3D descriptors from a single conformer.
Mechanism: 3D descriptor variance across conformers can be 50%+ of mean.
Symptom: Same molecule produces different 3D descriptors on rerun.
Fix: Always compute descriptor over ensemble; report mean ± std, or Boltzmann-weighted mean.
Trigger: Cyclosporin or large peptide.
Mechanism: CREST metadynamics scales poorly with rotational complexity.
Symptom: Hours of CPU time per molecule; incomplete sampling.
Fix: Use --gfn-ff for faster initial sampling; reduce metadynamics time --mdtime 5 or skip metadyn with --noopt.
Trigger: Comparing conformer energies between GFN2-xTB and DFT.
Mechanism: GFN2-xTB is parameterized for energies; relative conformer ordering can differ from DFT by 1-2 kcal/mol.
Symptom: "Wrong" conformer reported as global minimum vs DFT reference.
Fix: For high-stakes work, re-rank top GFN2-xTB conformers with DFT single-points (e.g., r2SCAN-3c).
| Use case | ETKDGv3 | CREST | |----------|---------|-------| | Drug-like, <500 Da, <8 RotBonds | Sufficient | Overkill | | 8-12 RotBonds | OK with n_conf>=100 | Better at expense of cost | | Macrocycle, peptide, >12 RotBonds | Inadequate | Required | | Boltzmann-weighted descriptors | OK but energies less accurate | Better | | FEP input | Possible | Preferred (after MMFF cleanup) |
For ETKDGv3 ensembles, run CREST on a subset for benchmarking; if RMSD < 1A across methods, ETKDGv3 is adequate.
| Symptom | Cause | Fix |
|---------|-------|-----|
| EmbedMolecule returns -1 | Embed failed | Set useRandomCoords=True; raise maxAttempts |
| MMFFOptimize no-op | MMFF parameters missing | Use UFF fallback |
| All conformers identical | Stiff molecule | OK; molecule is rigid |
| Conformers physically wrong | Stereochemistry lost | Re-add explicit stereo before embedding |
| 3D descriptors differ per run | Random seed not set | params.randomSeed = 42 |
| CREST out-of-memory | Too many conformers in search | Reduce --T threads; raise --ewin window |
| Macrocycle ring inverted | Default torsion preferences wrong | Set useMacrocycleTorsions=True |
| AddHs not called | Implicit H not embedded | mol = Chem.AddHs(mol) before EmbedMolecule |
tools
--- name: bio-phasing-imputation-foundations description: Frames the phasing/imputation pipeline before any tool runs: phasing and imputation are one Li-Stephens copying HMM (recombination is the transition, mutation the emission, the genetic map and Ne set the rates), imputation's honest output is a dosage with a self-estimated quality (INFO/R2/DR2) not a hard genotype, and the stages are ordered and each fails silently (QC, align build and strand to the panel, phase, impute per chromosome, fil
tools
Chooses the enrichment generation before any tool runs, mapping the input shape to a method class - a pre-selected gene list plus a background to over-representation analysis (ORA, hypergeometric), a ranked statistic for all genes to gene set enrichment (GSEA), a signed signaling topology to pathway-topology (SPIA) - then making the null explicit (competitive vs self-contained, gene vs subject sampling) and running a trustworthiness checklist (testable-gene universe, FDR, redundancy collapse, leading-edge check, version reporting). Covers why every clusterProfiler GSEA is the inter-gene-correlation-uncorrected competitive null, why the background not the gene list decides ORA significance, and why no method is universally best. Use when deciding ORA vs GSEA vs topology, which gene-set DB, whether a result is trustworthy, or which null a tool computes. For ORA see go-enrichment, GSEA see gsea, databases kegg-pathways/reactome-pathways/wikipathways; the ranking comes from differential-expression/de-results.
testing
End-to-end GWAS workflow from VCF to association results. Covers PLINK QC, population structure correction, and association testing for case-control or quantitative traits. Use when running genome-wide association studies.
development
Orchestrates the full path from differential expression results to redundancy-collapsed functional enrichment: choose ORA vs GSEA, convert gene IDs per method, run enrichGO/enrichKEGG/enrichPathway/enrichWP or gseGO/gseKEGG (clusterProfiler, ReactomePA, rWikiPathways), and visualize. Routes the ORA-vs-GSEA generation fork and the null/universe/reproducibility theory to pathway-analysis/enrichment-foundations. Use when a DESeq2/edgeR/limma result must become enriched GO terms, KEGG/Reactome/WikiPathways pathways, or a GSEA leading edge; when deciding whether a ranking exists for all genes (GSEA, named decreasing vector) or only a pre-selected list (ORA plus a defensible background universe); or when assembling DE-to-pathway end to end. The DE list and ranking statistic come from differential-expression/de-results; per-method nuance lives in the pathway-analysis skills.