chemoinformatics/molecular-standardization/SKILL.md
Standardizes molecular structures using ChEMBL chembl_structure_pipeline and RDKit rdMolStandardize covering sanitization, salt/solvent stripping, neutralization, tautomer canonicalization, stereochemistry standardization, mixture handling, and isotope normalization. Explicitly compares ChEMBL pipeline, canSARchem, and PubChem standardization choices. Use when preparing libraries for QSAR training, joining datasets across sources, deduplicating compound collections, or building canonical compound registries.
Reference examples tested with: RDKit 2024.09+, chembl_structure_pipeline 1.2+, MolVS 0.1.1 (legacy reference only -- rdMolStandardize is current).
Before using code patterns, verify installed versions match. If versions differ:
- pip show <package>, then help(module.function) to check signatures

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
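The version check described above can be sketched with the standard library alone. The helper name `installed_version` is hypothetical, and the distribution names in the loop are assumptions about how the packages were installed (conda environments may register them differently).

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string for a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution names are assumptions; adjust to your environment.
    for name in ("rdkit", "chembl_structure_pipeline", "molvs"):
        print(name, installed_version(name))
```

A `None` result means the package is not visible to the current interpreter, which is worth ruling out before debugging an `ImportError`.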
Convert raw molecular structures into a single canonical form for ML training data, deduplication, registry, and cross-database joining. Standardization is the single most underrated upstream step: skipping it causes silent ML data leakage (training and test compounds with different tautomers count as separate), bogus QSAR predictions, and database join misses. The ChEMBL pipeline (Bento 2020) and canSARchem (Ravi 2022) are the two industry references; canSARchem extends ChEMBL with canonical-tautomer-before-parent extraction. RDKit's rdMolStandardize implements ChEMBL-equivalent logic in C++ (the older MolVS Python implementation was deprecated Q1 2024).
For format-level I/O and aromaticity perception, see chemoinformatics/molecular-io. For descriptor calculation after standardization, see chemoinformatics/molecular-descriptors.
| Stage | RDKit Tool | Operation | Common errors caught |
|-------|-----------|-----------|----------------------|
| 1. Sanitization | Chem.SanitizeMol | Kekulize, assign aromaticity, check valences | Wrong valence on N/O |
| 2. Salt stripping | rdMolStandardize.FragmentRemover or LargestFragmentChooser | Remove counterions | Cl-, Na+, K+, OH- |
| 3. Mixture choice | LargestFragmentChooser | Pick parent fragment | Co-crystals, hydrates |
| 4. Charge neutralization | Uncharger | Neutralize while preserving net charge | Permanent charges preserved (quaternary N+) |
| 5. Tautomer canonicalization | TautomerEnumerator.Canonicalize | Pick canonical tautomer | Keto/enol; amide/imidate |
| 6. Stereo standardization | Chem.AssignStereochemistry | Consistent stereo descriptors | Lost wedges, ambiguous R/S |
| 7. Isotope normalization | manual (atom.SetIsotope(0)) or MolToSmiles(isomericSmiles=False), which also drops stereo | Remove 13C, 2H labels | Tracer studies |
| 8. Output canonicalization | Chem.MolToSmiles(canonical=True) | Canonical SMILES + InChIKey | Round-trip stability |
| Pipeline | Origin | Tautomer canonicalization | Salt definition | Use case |
|----------|--------|---------------------------|-----------------|----------|
| ChEMBL pipeline | EBI ChEMBL | Pre-rdMolStandardize legacy; now uses rdMolStandardize | ChEMBL salt list (extensive) | Drug-like compounds, FDA approvals |
| canSARchem | ICR Cancer Research UK | Canonical tautomer BEFORE parent extraction | Extended salt list | Cancer drug discovery |
| PubChem (OpenEye) | NIH NCBI | OpenEye QUACPAC tautomer | PubChem salt list | Bioassay data, large-scale |
| RDKit rdMolStandardize default | Greg Landrum | RDKit TautomerEnumerator | RDKit default | General purpose, open source |
Key difference (canSARchem vs ChEMBL): canSARchem canonicalizes the tautomer before parent extraction.
For 95% of drug-like molecules the two pipelines produce identical results. For tautomer-ambiguous molecules (amide/imidate, keto/enol, lactam/lactim), the order matters; canSARchem produces more stable canonical forms.
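To see whether the order of operations matters for a given input, the two orderings can be compared directly. This is a minimal sketch that uses RDKit primitives as stand-ins; it is not the actual ChEMBL or canSARchem implementation, and the function names are hypothetical.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize


def parent_then_tautomer(smi):
    """Pick the largest (organic) fragment first, then canonicalize its tautomer."""
    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        return None
    mol = rdMolStandardize.LargestFragmentChooser(preferOrganic=True).choose(mol)
    mol = rdMolStandardize.TautomerEnumerator().Canonicalize(mol)
    return Chem.MolToSmiles(mol)


def tautomer_then_parent(smi):
    """Canonicalize the tautomer on the full input, then pick the parent fragment."""
    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        return None
    mol = rdMolStandardize.TautomerEnumerator().Canonicalize(mol)
    mol = rdMolStandardize.LargestFragmentChooser(preferOrganic=True).choose(mol)
    return Chem.MolToSmiles(mol)


def order_matters(smi):
    """True when the two orderings give different canonical SMILES."""
    return parent_then_tautomer(smi) != tautomer_then_parent(smi)
```

Running `order_matters` over a library is a cheap way to flag the subset of compounds that deserve manual review before choosing one ordering.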
ChEMBL's standardization is the most widely used reference. The Python package chembl_structure_pipeline exposes the validated pipeline.
Goal: Apply the industry-reference ChEMBL standardization pipeline to a SMILES.
Approach: Parse SMILES with RDKit, run standardize_mol (normalize functional groups + uncharge + structure cleanup), then get_parent_mol (strip salts/counter-ions), and emit canonical SMILES.
```python
from chembl_structure_pipeline import standardize_mol, get_parent_mol
from rdkit import Chem


def chembl_pipeline(smi):
    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        return None, 'parse_failure'
    # standardize_mol returns a single Mol, not a tuple
    standardized = standardize_mol(mol)
    # get_parent_mol returns (parent_mol, exclude_flag)
    parent, _ = get_parent_mol(standardized)
    return Chem.MolToSmiles(parent), 'ok'
```
standardize_mol: normalize functional groups + uncharge + structure cleanup; returns a single Mol. As of chembl_structure_pipeline 1.2 it does not canonicalize tautomers, so run rdMolStandardize.TautomerEnumerator separately when a canonical tautomer is required.
get_parent_mol: strip salts/counter-ions; choose largest fragment. Returns a (parent Mol, exclude flag) tuple.
Output: canonical SMILES of the parent (free acid/free base, neutral form).
For more granular control or non-ChEMBL workflows.
Goal: Execute each standardization step explicitly to control salt stripping, charge handling, tautomer canonicalization, and isotope normalization.
Approach: Run the 8-stage pipeline (sanitize, largest fragment, normalize, uncharge, tautomer canonicalize, isotope strip, stereo standardize, canonical SMILES) sequentially with rdMolStandardize primitives.
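```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize


def full_standardize(smi, keep_isotopes=False):
    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        return None
    # Sanitize (MolFromSmiles already sanitizes; kept explicit for clarity)
    Chem.SanitizeMol(mol)
    # Keep the largest fragment, preferring organic over inorganic
    largest = rdMolStandardize.LargestFragmentChooser(preferOrganic=True)
    mol = largest.choose(mol)
    # Normalize functional-group representations
    normalizer = rdMolStandardize.Normalizer()
    mol = normalizer.normalize(mol)
    # Neutralize charges; the keyword is canonicalOrder in current RDKit
    uncharger = rdMolStandardize.Uncharger(canonicalOrder=True)
    mol = uncharger.uncharge(mol)
    # Pick the canonical tautomer
    enumerator = rdMolStandardize.TautomerEnumerator()
    mol = enumerator.Canonicalize(mol)
    # Strip isotope labels unless the caller wants them kept
    if not keep_isotopes:
        for atom in mol.GetAtoms():
            atom.SetIsotope(0)
    # Recompute stereo descriptors and drop invalid stereo flags
    Chem.AssignStereochemistry(mol, cleanIt=True, force=True)
    return Chem.MolToSmiles(mol)
```
canonicalOrder=True (the keyword name in current RDKit; confirm with help(rdMolStandardize.Uncharger)) ensures the uncharger produces the same result regardless of atom ordering in input -- critical for stable canonical output.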
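One detail of the isotope step is easy to miss: deuterium written as `[2H]` is parsed as an explicit atom, so resetting its isotope leaves an explicit hydrogen in the graph. The sketch below, with a hypothetical helper name, adds a `Chem.RemoveHs` call to fold those hydrogens back into implicit counts. Treat it as an illustration of the idea rather than part of any reference pipeline.

```python
from rdkit import Chem


def strip_isotopes(smi):
    """Return canonical SMILES with all isotope labels removed, or None on parse failure."""
    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        return None
    for atom in mol.GetAtoms():
        atom.SetIsotope(0)
    # Labelled hydrogens (2H, 3H) survive parsing as explicit atoms; once the
    # label is cleared they are ordinary explicit H and RemoveHs can drop them.
    mol = Chem.RemoveHs(mol)
    return Chem.MolToSmiles(mol)
```

For tracer studies, skip this step entirely so that labelled and unlabelled compounds stay distinct.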
| Salt form | Action | Example |
|-----------|--------|---------|
| Mono-salt | Strip counter-ion | [Na+].CC(=O)[O-] -> CC(=O)O |
| Di-salt | Strip both | [Na+].[Na+].CC(=O)[O-].CC(=O)[O-] -> CC(=O)O |
| Mixed salt | Largest organic fragment | CCO.CC(=O)O -> CCO (or CC(=O)O depending on rule) |
| Co-crystal | Hardest case | CC(=O)O.CCOC(C)=O -- both organic; default returns largest |
| Hydrate | Strip waters | CC(=O)O.O -> CC(=O)O |
| Solvate | Strip solvents | CC(=O)O.CO -> CC(=O)O |
| Quaternary ammonium | Preserve charge | [N+](C)(C)(C)C (permanent charge; do NOT neutralize) |
LargestFragmentChooser(preferOrganic=True) prefers organic fragments over inorganic counter-ions even if smaller; for co-crystals, default rule picks largest organic fragment.
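The first rows of the table can be reproduced with a few lines of RDKit. This is a minimal sketch with a hypothetical function name; it combines `LargestFragmentChooser` with a default `Uncharger` and does not use the ChEMBL salt list.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize


def strip_to_parent(smi):
    """Keep the largest organic fragment and neutralize it; None on parse failure."""
    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        return None
    chooser = rdMolStandardize.LargestFragmentChooser(preferOrganic=True)
    parent = chooser.choose(mol)
    # Default Uncharger settings; permanent charges are left alone
    parent = rdMolStandardize.Uncharger().uncharge(parent)
    return Chem.MolToSmiles(parent)


if __name__ == "__main__":
    for smi in ("[Na+].CC(=O)[O-]", "CC(=O)O.O", "CC(=O)O.CO"):
        print(smi, "->", strip_to_parent(smi))
```

Co-crystals remain ambiguous under this rule because both fragments are organic, so inspect those cases by hand.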
Tautomer canonicalization is the most controversial standardization step. There is no universally-correct canonical tautomer for many drug-like molecules.
| Tautomer pair | Default canonical | Issue |
|---------------|-------------------|-------|
| Keto/enol | Keto preferred | Most kinase ATP-mimetic enols destabilize on canonicalization |
| Lactam/lactim | Lactam preferred | Some natural products (rifampin) are inherently lactim |
| Amidine/iminol | Amidine preferred | Some bioactive amidines convert |
| Phenol/keto (e.g., naphthol/naphthalenone) | Phenol preferred | Some quinone-form pharmaceuticals reverted |
| 2H-pyrazole / 1H-pyrazole | 1H-pyrazole | Both equally stable in vivo |
Practical rules:
... epik from Schrödinger or Open Babel pkBABEL)
```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize


def canonical_tautomer(smi):
    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        return None  # unparseable input
    enumerator = rdMolStandardize.TautomerEnumerator()
    canon = enumerator.Canonicalize(mol)
    return Chem.MolToSmiles(canon)
```
```python
from rdkit import Chem


def standardize_stereo(mol, remove_undefined=False):
    # Recompute stereo descriptors and drop invalid stereo flags
    Chem.AssignStereochemistry(mol, cleanIt=True, force=True)
    if remove_undefined:
        # Note: RemoveStereochemistry strips ALL stereo, defined or not
        Chem.RemoveStereochemistry(mol)
    return mol
```
Cases:
- `@` / `\` / `/` -> preserved

For ML, often drop stereo entirely (Chem.RemoveStereochemistry(mol)) since most QSAR endpoints are not stereo-specific. For docking and FEP, preserve stereo always.
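The effect of dropping stereo can be checked on a pair of enantiomers. A minimal sketch, with a hypothetical helper name, that returns both the stereo-aware and the flattened canonical SMILES:

```python
from rdkit import Chem


def smiles_with_and_without_stereo(smi):
    """Return (stereo SMILES, flattened SMILES), or None on parse failure."""
    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        return None
    with_stereo = Chem.MolToSmiles(mol)
    flat = Chem.Mol(mol)  # copy, so the input molecule keeps its stereo
    Chem.RemoveStereochemistry(flat)
    return with_stereo, Chem.MolToSmiles(flat)


if __name__ == "__main__":
    # L- and D-alanine: distinct with stereo, identical once flattened
    print(smiles_with_and_without_stereo("C[C@H](N)C(=O)O"))
    print(smiles_with_and_without_stereo("C[C@@H](N)C(=O)O"))
```

Once flattened, enantiomers collapse to one record, so any activity difference between them is averaged away during deduplication.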
Goal: Build a standardized + deduplicated training set with replicate-averaged activity for QSAR or ADMET model training.
Approach: Standardize every SMILES through the ChEMBL pipeline, compute InChIKey as canonical identity, group by InChIKey, and mean-aggregate activities; report replicate count for confidence weighting.
```python
import pandas as pd
from rdkit import Chem
from chembl_structure_pipeline import standardize_mol, get_parent_mol


def prepare_qsar_data(df, smiles_col='smiles', activity_col='pIC50'):
    standardized = []
    for i, row in df.iterrows():
        mol = Chem.MolFromSmiles(row[smiles_col])
        if mol is None:
            continue  # skip unparseable SMILES
        try:
            mol = standardize_mol(mol)    # returns a single Mol
            mol, _ = get_parent_mol(mol)  # returns (parent, exclude_flag)
            standardized.append({
                'smiles': Chem.MolToSmiles(mol),
                'inchikey': Chem.MolToInchiKey(mol),
                'activity': row[activity_col],
            })
        except Exception:
            continue  # skip molecules the pipeline cannot process
    df_std = pd.DataFrame(standardized)
    # One row per InChIKey: mean activity plus replicate count
    df_std = df_std.groupby('inchikey').agg(
        smiles=('smiles', 'first'),
        activity=('activity', 'mean'),
        n_replicates=('activity', 'count'),
    ).reset_index()
    return df_std
```
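Deduplication by InChIKey collapses many tautomer-equivalent compounds, because standard InChI normalizes mobile-hydrogen tautomers (though not every tautomer class). Replicate count signals measurement reliability.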
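The grouping step can be tried in isolation, without the ChEMBL pipeline, on a handful of records. This is a minimal sketch that assumes the SMILES are already standardized; the function name and the `(smiles, activity)` input shape are hypothetical.

```python
import pandas as pd
from rdkit import Chem


def aggregate_by_inchikey(records):
    """Collapse (smiles, activity) pairs to one row per InChIKey."""
    rows = []
    for smi, activity in records:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue  # skip unparseable SMILES
        rows.append({"inchikey": Chem.MolToInchiKey(mol), "activity": activity})
    # Explicit columns keep groupby working even when rows is empty
    df = pd.DataFrame(rows, columns=["inchikey", "activity"])
    return df.groupby("inchikey").agg(
        activity=("activity", "mean"),
        n_replicates=("activity", "count"),
    ).reset_index()


if __name__ == "__main__":
    # "CCO" and "OCC" are the same molecule written two ways
    print(aggregate_by_inchikey([("CCO", 6.0), ("OCC", 7.0), ("c1ccccc1", 5.0)]))
```

A mean is only one choice of aggregate; a median, or dropping compounds whose replicates disagree widely, may suit noisy assay data better.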
Trigger: Molecule is genuinely an inorganic salt (e.g., NaCl, K2SO4).
Mechanism: get_parent_mol chooses largest organic; falls back to largest fragment for fully inorganic.
Symptom: Returns the salt itself (not a drug).
Fix: Pre-filter to compounds with ≥1 carbon atom.
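The carbon pre-filter is a one-liner over the atom list. A minimal sketch with a hypothetical function name:

```python
from rdkit import Chem


def has_carbon(smi):
    """True if the SMILES parses and contains at least one carbon atom."""
    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        return False
    return any(atom.GetAtomicNum() == 6 for atom in mol.GetAtoms())


if __name__ == "__main__":
    smiles = ["[Na+].[Cl-]", "[K+].[K+].[O-]S(=O)(=O)[O-]", "CC(=O)O"]
    print([s for s in smiles if has_carbon(s)])
```

Apply it before standardization so that purely inorganic records never reach the parent-extraction step.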
Trigger: Quaternary ammonium (permanent positive) or sulfonate at physiological pH.
Mechanism: Default uncharger attempts to neutralize without distinguishing permanent vs pH-dependent charges.
Symptom: Permanently charged ligands neutralized; structure incorrect for downstream docking.
Fix: Use Uncharger(canonicalOrder=True, force=False); manually inspect borderline cases.
Trigger: Molecule with many tautomerizable groups (polyhydroxylated heterocycle).
Mechanism: TautomerEnumerator.Enumerate generates all possible tautomers; can produce thousands.
Symptom: OOM or hour-long compute on single molecule.
Fix: Use Canonicalize (returns single canonical) instead of Enumerate; for Enumerate, cap the search with SetMaxTransforms and SetMaxTautomers.
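When enumeration is genuinely needed, the caps can be set on the enumerator before calling `Enumerate`. A minimal sketch; the function name and the default limits are hypothetical and should be tuned to the library at hand.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize


def enumerate_capped(smi, max_tautomers=50, max_transforms=200):
    """Return a sorted list of unique tautomer SMILES under explicit caps."""
    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        return []
    enumerator = rdMolStandardize.TautomerEnumerator()
    enumerator.SetMaxTautomers(max_tautomers)    # cap on tautomers generated
    enumerator.SetMaxTransforms(max_transforms)  # cap on transform applications
    return sorted({Chem.MolToSmiles(t) for t in enumerator.Enumerate(mol)})


if __name__ == "__main__":
    # 2-hydroxypyridine / 2-pyridone
    print(enumerate_capped("Oc1ccccn1", max_tautomers=5))
```

With tight caps the enumeration may be incomplete, so treat the result as a sample rather than the full tautomer set.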
Trigger: Code still using legacy from molvs import Standardizer.
Mechanism: RDKit MolStandardize Python implementation removed Q1 2024.
Symptom: ImportError or AttributeError on newer RDKit.
Fix: Migrate to from rdkit.Chem.MolStandardize import rdMolStandardize; methods renamed (e.g., standardize -> Cleanup).
Trigger: Compound canonicalized to different tautomer per run.
Mechanism: RDKit tautomer canonicalization depends on atom ordering for very symmetric molecules.
Symptom: Re-running standardization yields different InChIKey.
Fix: Set canonicalOrder=True in Uncharger; sort atoms via canonical SMILES first.
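Atom-order sensitivity can be tested empirically by shuffling the atom indices and canonicalizing each variant. A minimal sketch with a hypothetical function name; more than one SMILES in the returned set means the canonical form depends on input ordering for that molecule.

```python
import random

from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize


def canonical_forms_under_shuffling(smi, n_trials=5, seed=0):
    """Canonicalize tautomers of randomly renumbered copies; return the distinct SMILES."""
    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        return set()
    rng = random.Random(seed)  # seeded so the check itself is reproducible
    enumerator = rdMolStandardize.TautomerEnumerator()
    forms = set()
    for _ in range(n_trials):
        order = list(range(mol.GetNumAtoms()))
        rng.shuffle(order)
        shuffled = Chem.RenumberAtoms(mol, order)
        forms.add(Chem.MolToSmiles(enumerator.Canonicalize(shuffled)))
    return forms
```

Running this over a sample of the library before registration gives an early warning about compounds whose InChIKey may drift between runs.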
| Symptom | Cause | Fix |
|---------|-------|-----|
| ImportError on MolStandardize | Python MolStandardize deprecated | Use from rdkit.Chem.MolStandardize import rdMolStandardize |
| standardize_mol returns None | Sanitize failure on input | Try Chem.MolFromSmiles(smi, sanitize=False) first |
| Stripped wrong fragment | LargestFragmentChooser ambiguity | Manually inspect; consider custom logic |
| Tautomer differs between runs | Atom-order-dependent | Set canonicalOrder=True in Uncharger; sort atoms |
| Charge lost on quaternary N | Aggressive neutralization | Use force=False |
| InChIKey collisions across "different" mols | Same canonical InChI but different stereo / tautomer | Compare the full 27-character InChIKey, not only its 14-character connectivity block (or use the full InChI) |
| Pipeline slow on large library | Per-mol Python overhead | Process in chunks and parallelize across worker processes |