skills/structural-biology-drug-discovery/rdkit-cheminformatics/SKILL.md
Cheminformatics toolkit for molecular analysis and virtual screening: SMILES/SDF parsing, descriptors (MW, LogP, TPSA), fingerprints (Morgan/ECFP, MACCS), Tanimoto similarity, SMARTS substructure filtering, Lipinski drug-likeness, reaction enumeration, 2D/3D coordinates. For simpler API use datamol; use RDKit for fine-grained sanitization, custom fingerprints, or SMARTS/reaction control.
npx skillsauth add jaechang-hits/scicraft rdkit-cheminformaticsInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
RDKit is the standard open-source cheminformatics library for Python, providing comprehensive APIs for molecular parsing, descriptor calculation, fingerprinting, substructure searching, and chemical reactions. This skill walks through a complete compound library profiling and virtual screening workflow — from loading molecules through drug-likeness filtering, similarity screening, and result visualization.
datamol-cheminformatics instead for a higher-level RDKit wrapper with batching and error handling; use openbabel instead for multi-format conversion (MOL2, XYZ, PDB)rdkit-pypi (or rdkit via conda), pandas, matplotlib, numpy# Option 1: pip (lightweight)
pip install rdkit-pypi pandas matplotlib numpy
# Option 2: conda (full features including cartridge)
conda install -c conda-forge rdkit pandas matplotlib numpy
Read molecular structures from SMILES or SDF and validate parsing.
from rdkit import Chem
import pandas as pd
# --- From SMILES list ---
smiles_list = [
"CC(=O)Oc1ccccc1C(=O)O", # Aspirin
"CC12CCC3C(C1CCC2O)CCC4=CC(=O)CCC34C", # Testosterone
"c1ccc2[nH]c(-c3ccccn3)nc2c1", # Benzimidazole derivative
"CC(C)Cc1ccc(C(C)C(=O)O)cc1", # Ibuprofen
"INVALID_SMILES", # Will fail
]
mols = []
failed = []
for smi in smiles_list:
mol = Chem.MolFromSmiles(smi)
if mol is not None:
mol.SetProp("_SMILES", smi)
mols.append(mol)
else:
failed.append(smi)
print(f"Successfully parsed: {len(mols)}/{len(smiles_list)}")
print(f"Failed: {failed}")
# --- From SDF file ---
# suppl = Chem.SDMolSupplier("library.sdf")
# mols = [mol for mol in suppl if mol is not None]
# print(f"Loaded {len(mols)} molecules from SDF")
Canonicalize SMILES and remove duplicates to ensure a clean dataset.
from rdkit.Chem.MolStandardize import rdMolStandardize
# Standardize: neutralize charges, remove fragments, canonicalize
uncharger = rdMolStandardize.Uncharger()
chooser = rdMolStandardize.LargestFragmentChooser()
standardized = []
seen_smiles = set()
for mol in mols:
# Keep largest fragment (remove salts/counterions)
mol = chooser.choose(mol)
# Neutralize charges
mol = uncharger.uncharge(mol)
# Canonical SMILES for deduplication
canon_smi = Chem.MolToSmiles(mol)
if canon_smi not in seen_smiles:
seen_smiles.add(canon_smi)
mol.SetProp("canonical_smiles", canon_smi)
standardized.append(mol)
print(f"After standardization: {len(standardized)} unique molecules")
print(f"Removed {len(mols) - len(standardized)} duplicates/salts")
Compute physicochemical properties for each molecule.
from rdkit.Chem import Descriptors
records = []
for mol in standardized:
desc = {
"SMILES": Chem.MolToSmiles(mol),
"MW": round(Descriptors.MolWt(mol), 2),
"LogP": round(Descriptors.MolLogP(mol), 2),
"TPSA": round(Descriptors.TPSA(mol), 2),
"HBD": Descriptors.NumHDonors(mol),
"HBA": Descriptors.NumHAcceptors(mol),
"RotBonds": Descriptors.NumRotatableBonds(mol),
"AromaticRings": Descriptors.NumAromaticRings(mol),
"HeavyAtoms": mol.GetNumHeavyAtoms(),
"RingCount": Descriptors.RingCount(mol),
}
records.append(desc)
df = pd.DataFrame(records)
print(df.to_string(index=False))
print(f"\nDescriptor summary:\n{df.describe().round(2)}")
Filter compounds using Lipinski's Rule of Five and Veber criteria.
def lipinski_filter(row):
"""Lipinski Ro5: MW<=500, LogP<=5, HBD<=5, HBA<=10"""
return (row["MW"] <= 500 and row["LogP"] <= 5 and
row["HBD"] <= 5 and row["HBA"] <= 10)
def veber_filter(row):
"""Veber: RotBonds<=10, TPSA<=140"""
return row["RotBonds"] <= 10 and row["TPSA"] <= 140
df["Lipinski"] = df.apply(lipinski_filter, axis=1)
df["Veber"] = df.apply(veber_filter, axis=1)
df["DrugLike"] = df["Lipinski"] & df["Veber"]
print(f"Lipinski pass: {df['Lipinski'].sum()}/{len(df)}")
print(f"Veber pass: {df['Veber'].sum()}/{len(df)}")
print(f"Drug-like: {df['DrugLike'].sum()}/{len(df)}")
drug_like_mols = [standardized[i] for i in df[df["DrugLike"]].index]
print(f"\n{len(drug_like_mols)} drug-like compounds retained")
Compute Morgan fingerprints and screen against a reference compound.
from rdkit.Chem import AllChem
from rdkit import DataStructs
# Reference compound (e.g., known active)
ref_smi = "CC(=O)Oc1ccccc1C(=O)O" # Aspirin
ref_mol = Chem.MolFromSmiles(ref_smi)
ref_fp = AllChem.GetMorganFingerprintAsBitVect(ref_mol, radius=2, nBits=2048)
# Screen library
results = []
for mol in drug_like_mols:
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)
tanimoto = DataStructs.TanimotoSimilarity(ref_fp, fp)
results.append({
"SMILES": Chem.MolToSmiles(mol),
"Tanimoto": round(tanimoto, 3),
})
sim_df = pd.DataFrame(results).sort_values("Tanimoto", ascending=False)
print("Similarity ranking:")
print(sim_df.to_string(index=False))
# Filter by threshold
threshold = 0.3
hits = sim_df[sim_df["Tanimoto"] >= threshold]
print(f"\n{len(hits)} compounds with Tanimoto >= {threshold}")
Filter compounds containing specific functional groups.
# Define SMARTS patterns for functional groups of interest
patterns = {
"Carboxylic acid": "[CX3](=O)[OX2H1]",
"Amide": "[CX3](=[OX1])[NX3]",
"Aromatic ring": "c1ccccc1",
"Hydroxyl": "[OX2H]",
"Ester": "[CX3](=O)[OX2][C]",
}
print("Substructure matches:")
for name, smarts in patterns.items():
query = Chem.MolFromSmarts(smarts)
match_count = sum(1 for mol in drug_like_mols if mol.HasSubstructMatch(query))
print(f" {name}: {match_count}/{len(drug_like_mols)} compounds")
# Get specific matches with atom indices
query = Chem.MolFromSmarts("[CX3](=O)[OX2H1]") # Carboxylic acid
for mol in drug_like_mols:
matches = mol.GetSubstructMatches(query)
if matches:
smi = Chem.MolToSmiles(mol)
print(f"\n{smi}: {len(matches)} carboxylic acid group(s)")
for match in matches:
print(f" Atom indices: {match}")
Generate publication-quality molecular depictions.
from rdkit.Chem import Draw
from rdkit.Chem.Draw import rdMolDraw2D
# Grid image of top hits
legends = [f"Tan={row['Tanimoto']}" for _, row in sim_df.head(4).iterrows()]
top_mols = [Chem.MolFromSmiles(smi) for smi in sim_df.head(4)["SMILES"]]
img = Draw.MolsToGridImage(
top_mols,
molsPerRow=2,
subImgSize=(300, 300),
legends=legends,
)
img.save("top_hits_grid.png")
print("Saved top_hits_grid.png")
# Highlight substructure in a molecule
mol = top_mols[0]
query = Chem.MolFromSmarts("[CX3](=O)[OX2H1]")
match = mol.GetSubstructMatch(query)
if match:
highlight_img = Draw.MolToImage(mol, size=(400, 400), highlightAtoms=match)
highlight_img.save("substructure_highlight.png")
print("Saved substructure_highlight.png")
Save the profiling results and filtered compounds.
import os
os.makedirs("results", exist_ok=True)
# Save descriptor table
df.to_csv("results/descriptors.csv", index=False)
print(f"Saved descriptors for {len(df)} compounds to results/descriptors.csv")
# Save drug-like compounds as SDF
writer = Chem.SDWriter("results/drug_like_compounds.sdf")
for i, mol in enumerate(drug_like_mols):
# Attach descriptors as SDF properties
row = df[df["DrugLike"]].iloc[i]
mol.SetProp("MW", str(row["MW"]))
mol.SetProp("LogP", str(row["LogP"]))
mol.SetProp("TPSA", str(row["TPSA"]))
writer.write(mol)
writer.close()
print(f"Saved {len(drug_like_mols)} drug-like compounds to results/drug_like_compounds.sdf")
# Save similarity results
sim_df.to_csv("results/similarity_results.csv", index=False)
print(f"Saved similarity rankings to results/similarity_results.csv")
| Parameter | Default | Range / Options | Effect |
|-----------|---------|-----------------|--------|
| MolFromSmiles(sanitize=) | True | True, False | Automatic validation and aromaticity perception on parsing |
| Morgan radius | 2 | 1-3 | Fingerprint radius; 2 ≈ ECFP4, 3 ≈ ECFP6 |
| Morgan nBits | 2048 | 1024-4096 | Fingerprint bit length; higher = fewer collisions |
| Tanimoto threshold | 0.7 | 0.3-0.9 | Similarity cutoff; lower = more permissive |
| Lipinski MW cutoff | 500 | 300-600 | Max molecular weight for drug-likeness |
| Lipinski LogP cutoff | 5 | 3-6 | Max lipophilicity |
| Veber RotBonds cutoff | 10 | 7-15 | Max rotatable bonds for oral bioavailability |
| Veber TPSA cutoff | 140 | 120-160 | Max polar surface area (Ų) |
| EmbedMolecule(randomSeed=) | None | Any integer | Seed for reproducible 3D conformer generation |
| Butina distThresh | 0.3 | 0.2-0.5 | Distance cutoff for Butina clustering |
When to use: select a diverse subset from a large compound library.
from rdkit.ML.Cluster import Butina
from rdkit.Chem import AllChem
from rdkit import DataStructs, Chem
# Generate fingerprints
fps = [AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
for mol in standardized]
# Build distance matrix (lower triangle)
dists = []
for i in range(1, len(fps)):
sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
dists.extend([1 - s for s in sims])
# Cluster
clusters = Butina.ClusterData(dists, len(fps), distThresh=0.3, isDistData=True)
print(f"{len(clusters)} clusters from {len(fps)} compounds")
# Pick centroid from each cluster (first element = centroid)
diverse_indices = [c[0] for c in clusters]
diverse_mols = [standardized[i] for i in diverse_indices]
print(f"Selected {len(diverse_mols)} diverse representatives")
When to use: generate a combinatorial library from building blocks via reaction SMARTS.
from rdkit.Chem import AllChem, Chem
# Amide coupling: carboxylic acid + amine → amide
rxn = AllChem.ReactionFromSmarts(
"[C:1](=[O:2])[OH].[N:3]([H])([H])[C:4]>>[C:1](=[O:2])[N:3][C:4]"
)
acids = [Chem.MolFromSmiles(s) for s in ["OC(=O)c1ccccc1", "OC(=O)CC"]]
amines = [Chem.MolFromSmiles(s) for s in ["NCC", "NC1CCCCC1"]]
products = []
for acid in acids:
for amine in amines:
ps = rxn.RunReactants((acid, amine))
for product_set in ps:
for prod in product_set:
Chem.SanitizeMol(prod)
products.append(Chem.MolToSmiles(prod))
print(f"Generated {len(products)} products:")
for p in products:
print(f" {p}")
When to use: prepare molecules for docking or 3D pharmacophore analysis.
from rdkit import Chem
from rdkit.Chem import AllChem
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")
mol = Chem.AddHs(mol) # Required for 3D embedding
# Generate multiple conformers
params = AllChem.ETKDGv3()
params.randomSeed = 42
params.numThreads = 0 # Use all available cores
conf_ids = AllChem.EmbedMultipleConfs(mol, numConfs=10, params=params)
print(f"Generated {len(conf_ids)} conformers")
# Optimize with MMFF94 force field
energies = []
for conf_id in conf_ids:
result = AllChem.MMFFOptimizeMolecule(mol, confId=conf_id)
ff = AllChem.MMFFGetMoleculeForceField(mol, AllChem.MMFFGetMoleculeProperties(mol), confId=conf_id)
energy = ff.CalcEnergy()
energies.append((conf_id, energy))
print(f" Conformer {conf_id}: {energy:.2f} kcal/mol (converged={result == 0})")
# Get lowest energy conformer
best_id = min(energies, key=lambda x: x[1])[0]
print(f"\nBest conformer: {best_id} ({min(e for _, e in energies):.2f} kcal/mol)")
# Save to SDF
writer = Chem.SDWriter("conformers.sdf")
for conf_id, energy in energies:
mol.SetProp("Energy", f"{energy:.2f}")
writer.write(mol, confId=conf_id)
writer.close()
When to use: debug SMARTS matches, annotate atom positions for reports.
from rdkit import Chem
from rdkit.Chem.Draw import rdMolDraw2D
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")
AllChem.Compute2DCoords(mol)
# Custom drawer with atom indices and stereo annotations
drawer = rdMolDraw2D.MolDraw2DCairo(500, 400)
opts = drawer.drawOptions()
opts.addAtomIndices = True
opts.addStereoAnnotation = True
opts.bondLineWidth = 2.0
drawer.DrawMolecule(mol)
drawer.FinishDrawing()
with open("annotated_molecule.png", "wb") as f:
f.write(drawer.GetDrawingText())
print("Saved annotated_molecule.png with atom indices")
results/descriptors.csv — Tabular descriptors (SMILES, MW, LogP, TPSA, HBD, HBA, RotBonds, Lipinski, Veber, DrugLike)results/drug_like_compounds.sdf — Filtered compounds in SDF format with attached propertiesresults/similarity_results.csv — Tanimoto similarity rankings against reference compoundtop_hits_grid.png — 2D grid image of top similar compoundssubstructure_highlight.png — Molecule image with highlighted functional groupconformers.sdf — 3D conformer ensemble with MMFF energiesannotated_molecule.png — Atom-indexed 2D depiction| Problem | Cause | Solution |
|---------|-------|----------|
| MolFromSmiles returns None | Invalid SMILES string or valence error | Check SMILES validity; use Chem.MolFromSmiles(smi, sanitize=False) then DetectChemistryProblems() to diagnose |
| Kekulization error | Invalid aromatic ring system | Check for non-standard aromaticity; try Chem.SanitizeMol(mol, sanitizeOps=Chem.SANITIZE_ALL ^ Chem.SANITIZE_KEKULIZE) |
| EmbedMolecule returns -1 | 3D embedding failed (ring strain, steric clash) | Use AllChem.EmbedMolecule(mol, maxAttempts=50, useRandomCoords=True) |
| MMFF has null force field | Missing MMFF parameters for atom types | Switch to UFF: AllChem.UFFOptimizeMolecule(mol) |
| Wrong descriptor values | Missing explicit hydrogens | Call Chem.AddHs(mol) before descriptor calculation for H-dependent properties |
| ForwardSDMolSupplier empty | File not found or wrong format | Verify path; use Chem.SDMolSupplier(path, sanitize=False) to skip problematic molecules |
| Slow fingerprint computation on large library | Sequential processing | Use AllChem.GetMorganFingerprintAsBitVect with pre-allocated arrays; consider rdkit.Chem.MultithreadedSDMolSupplier |
| SMARTS pattern no matches | Incorrect SMARTS syntax or aromaticity mismatch | Test pattern with simple SMILES first; use [#6] instead of C for any carbon |
| MemoryError on large SDF | Loading entire file into memory | Use ForwardSDMolSupplier for streaming; process in batches |
| Inconsistent canonical SMILES | Different RDKit versions | Pin RDKit version; use Chem.MolToSmiles(mol, canonical=True) explicitly |
This skill includes reference files in the references/ subdirectory:
references/api_reference.md — Key RDKit modules and function lookup organized by capability (I/O, descriptors, fingerprints, drawing, reactions)references/descriptors_guide.md — Complete list of 200+ molecular descriptors with names, descriptions, and typical rangesreferences/smarts_patterns.md — Common SMARTS patterns for functional group detection, organized by chemical classtools
Fast short-read DNA aligner for WGS/WES/ChIP-seq. 2× faster BWA-MEM successor; outputs SAM/BAM with read group headers for GATK. Primary plus supplementary records for chimeric reads. Use STAR for RNA-seq splice-aware alignment; Bowtie2 is a comparable alternative.
tools
smina molecular docking CLI. AutoDock Vina fork with customizable scoring functions, native SDF/MOL2/PDB ligand input, autoboxing, local energy minimization, and per-atom score breakdowns. Pipeline: receptor PDBQT prep -> ligand prep (RDKit/OpenBabel) -> dock via autobox or explicit grid -> rescore/minimize with custom scoring -> rank poses by affinity. Choose smina over Vina when you need custom scoring terms (--custom_scoring), local optimization of an existing pose (--local_only), per-atom contributions (--atom_term_data), or SDF/MOL2 ligands without manual PDBQT conversion. For unknown binding sites use diffdock-blind-docking; for the Python-bindings/Vinardo workflow use autodock-vina-docking.
development
mdtraj molecular dynamics trajectory analysis (Python). Reads DCD/XTC/TRR/NetCDF/H5/PDB topologies and trajectories; computes RMSD vs time, radius of gyration, per-residue RMSF, residue-residue contact frequency maps, phi/psi torsions for Ramachandran plots (general + Gly/Pro), and 8-state DSSP secondary structure. Modules: trajectory I/O, geometry (distances/angles/dihedrals), structural analysis (RMSD/Rg/RMSF/SASA), contacts, hydrogen bonds, secondary structure (DSSP), NMR observables. For broader atom-selection grammar use mdanalysis-trajectory; for running MD simulations use OpenMM/GROMACS.
development
Programmatic PubMed access via NCBI E-utilities REST API. Covers Boolean/MeSH queries, field-tagged search, endpoints (ESearch, EFetch, ESummary, EPost, ELink), history server for batches, citation matching, systematic review strategies. Use for biomedical literature search or automated pipelines.