chemoinformatics/scaffold-analysis/SKILL.md
Analyzes chemical libraries by scaffold using Bemis-Murcko scaffolds, generic frameworks, cyclic skeletons, matched molecular pair (MMP) analysis via mmpdb, R-group decomposition, Free-Wilson analysis, scaffold hopping, and chemotype-aware ML train/test splits. Use when identifying chemotype clusters in a library, deriving SAR transformation rules, decomposing series into R-groups, performing scaffold-balanced QSAR splits, or planning analog campaigns.
Reference examples tested with: RDKit 2024.09+, mmpdb 3.1+, scikit-learn 1.4+, datamol 0.12+.
Before using code patterns, verify installed versions match. If versions differ:
- pip show <package>, then help(module.function) to check signatures

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
Analyze chemical libraries by their underlying scaffolds. Bemis-Murcko (1996) is the canonical scaffold decomposition: ring systems + linkers, with all R-groups stripped. The generic framework and the cyclic skeleton are progressively more abstract views of the same scaffold. Scaffold analysis underpins QSAR train/test splits (preventing data leakage), library diversity assessment, chemotype clustering, R-group decomposition for SAR modeling, and matched molecular pair analysis (MMPA). The choice of scaffold representation determines whether two compounds count as "the same series" -- a critical decision for medicinal chemistry workflows.
For reaction-based enumeration and Free-Wilson, see chemoinformatics/reaction-enumeration. For scaffold-hopping via fingerprints, see chemoinformatics/similarity-searching. For 3D shape-based scaffold hopping, see chemoinformatics/shape-similarity.
| Representation | Origin | Definition | Use case | Fails when |
|----------------|--------|------------|----------|------------|
| Bemis-Murcko scaffold | Bemis & Murcko 1996 | Ring systems + linkers, R-groups stripped | Default chemotype identifier | Linear molecules (no rings) -> empty scaffold |
| Generic framework | Bemis & Murcko 1996 | Bemis-Murcko with all atoms set to C, all bonds single | Topology comparison | Loses heteroatom info |
| Cyclic skeleton (CSK) | RDKit | Ring atoms only, all C, all single | Pure ring-topology view | Loses linker info |
| Murcko atom set | RDKit mol.GetSubstructMatch(scaffold) on the GetScaffoldForMol result (GetScaffoldForMol takes only the molecule) | Atom indices | Programmatic operations | Not a SMILES |
| Sphynx fingerprints | Maggiora 2020 | Scaffold + connection signature | Cross-target scaffold hopping | Specialty use |
```python
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

def all_scaffold_views(smi):
    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        return None  # unparseable SMILES
    # Bemis-Murcko scaffold: ring systems + linkers
    bm = MurckoScaffold.GetScaffoldForMol(mol)
    bm_smi = Chem.MolToSmiles(bm)
    # Generic framework: all atoms -> C, all bonds -> single
    generic = MurckoScaffold.MakeScaffoldGeneric(bm)
    generic_smi = Chem.MolToSmiles(generic)
    return {
        'bemis_murcko': bm_smi,
        'generic_framework': generic_smi,
    }
```
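Example: Cc1ccc(C(=O)NCC2CCCC2)cc1 -> Bemis-Murcko c1ccc(C(=O)NCC2CCCC2)cc1; generic C1CCC(C(C)CCC2CCCC2)CC1 (the amide N and O both become C). RDKit may write these SMILES in a different but equivalent atom order.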
Goal: Group compounds by shared Bemis-Murcko scaffold.
Approach: Compute scaffold for each compound; group by scaffold SMILES.
```python
from collections import defaultdict

from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_clusters(smiles_list):
    clusters = defaultdict(list)
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue  # skip unparseable SMILES
        scaffold = MurckoScaffold.GetScaffoldForMol(mol)
        scaffold_smi = Chem.MolToSmiles(scaffold)  # canonical SMILES is the cluster key
        clusters[scaffold_smi].append(smi)
    return clusters
```
Output: dict {scaffold_smiles: [compound_smiles, ...]}. Cluster sizes inform library diversity.
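To make "cluster sizes inform library diversity" concrete, the sketch below summarizes a scaffold-to-compounds mapping. The function name `cluster_size_summary`, the choice of metrics (singleton fraction, top-k coverage), and the toy data are illustrative assumptions, not part of the original workflow.

```python
from collections import Counter

def cluster_size_summary(clusters, top_k=3):
    """Summarize a {scaffold_smiles: [compound_smiles, ...]} mapping."""
    sizes = Counter({scaff: len(members) for scaff, members in clusters.items()})
    n_compounds = sum(sizes.values())
    n_singletons = sum(1 for n in sizes.values() if n == 1)
    top = sizes.most_common(top_k)
    return {
        'n_scaffolds': len(sizes),
        'n_compounds': n_compounds,
        # Share of scaffolds represented by a single compound
        'singleton_fraction': n_singletons / len(sizes) if sizes else 0.0,
        # Share of compounds covered by the top_k largest scaffolds
        'top_k_coverage': sum(n for _, n in top) / n_compounds if n_compounds else 0.0,
    }

# Hypothetical input shaped like the output of scaffold_clusters
toy_clusters = {
    'c1ccccc1': ['Cc1ccccc1', 'CCc1ccccc1', 'Fc1ccccc1'],
    'c1ccncc1': ['Cc1ccncc1'],
    'C1CCCCC1': ['CC1CCCCC1'],
}
summary = cluster_size_summary(toy_clusters, top_k=1)
```

A high singleton fraction suggests a diverse, screening-style library; high top-k coverage suggests a few dominant series.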
For QSAR / ML, a random train/test split causes data leakage: compounds from the same chemotype (analogs in the same series) end up in both train and test. A Bemis-Murcko split puts each scaffold entirely in train or entirely in test, never both.
```python
from collections import defaultdict

from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_split(df, smiles_col='smiles', train_frac=0.8, seed=42):
    # `seed` is kept for interface compatibility; the greedy assignment is deterministic.
    scaffolds = defaultdict(list)
    for i, row in df.iterrows():
        mol = Chem.MolFromSmiles(row[smiles_col])
        if mol is None:
            continue  # skip unparseable SMILES
        scaff = Chem.MolToSmiles(MurckoScaffold.GetScaffoldForMol(mol))
        scaffolds[scaff].append(i)  # i is an index label, not a row position

    # Largest scaffold groups first, so big series fill the training set.
    scaffold_sets = sorted(scaffolds.values(), key=len, reverse=True)
    n_total = sum(len(s) for s in scaffold_sets)
    n_train = int(n_total * train_frac)

    train_idx, test_idx = [], []
    for scaff_set in scaffold_sets:
        if len(train_idx) + len(scaff_set) <= n_train:
            train_idx.extend(scaff_set)
        else:
            test_idx.extend(scaff_set)
    # .loc, not .iloc: the collected values are index labels from iterrows().
    return df.loc[train_idx], df.loc[test_idx]
```
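A quick sanity check after any scaffold split is to confirm that no scaffold appears on both sides. The helper below is a minimal sketch; its name and the id-to-scaffold mapping are hypothetical, and it assumes scaffold SMILES have already been computed for each compound.

```python
def assert_no_scaffold_leakage(scaffold_by_id, train_ids, test_ids):
    """Raise if any scaffold appears on both sides of a split."""
    train_scaffolds = {scaffold_by_id[i] for i in train_ids}
    test_scaffolds = {scaffold_by_id[i] for i in test_ids}
    shared = train_scaffolds & test_scaffolds
    if shared:
        raise ValueError(f'scaffolds in both train and test: {sorted(shared)}')
    # Number of distinct scaffolds on each side
    return len(train_scaffolds), len(test_scaffolds)

# Hypothetical mapping: compound id -> scaffold SMILES
scaffold_by_id = {0: 'c1ccccc1', 1: 'c1ccccc1', 2: 'c1ccncc1', 3: 'C1CCCCC1'}
counts = assert_no_scaffold_leakage(scaffold_by_id, train_ids=[0, 1, 2], test_ids=[3])
```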
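Effect on benchmark metrics: as an illustration, a model scoring AUC 0.95 on a random split typically drops to about 0.75-0.85 on a Bemis-Murcko split. The gap shows how much of the random-split score came from memorizing series rather than generalizing to new chemotypes.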
Caveat: the Bemis-Murcko split is only one kind of scaffold split; for production ML, also consider a time split (newer compounds in test) or an activity-cliff-balanced split.
Class-imbalanced datasets: For binary outcomes (e.g. hERG blocker, AMES mutagen) with class imbalance, scaffold-only assignment can yield test sets with a skewed class distribution and unreliable metrics. Use a stratified scaffold split: group compounds by scaffold, then assign whole groups so that class balance is preserved in both train and test. Note that chemprop's scaffold_balanced split balances scaffold-group sizes across the splits rather than class labels, so check the class ratio of each split after using it. For custom workflows, sklearn.model_selection.StratifiedGroupKFold with the scaffold as the group key keeps scaffolds intact while approximately preserving class proportions per fold.
Goal: Given a defined scaffold and a set of analog compounds, extract the R-group at each numbered attachment point into a tabular SAR matrix.
```python
from rdkit import Chem
from rdkit.Chem import rdRGroupDecomposition as rgd

def decompose_series(compounds, scaffold_smiles_with_R):
    scaffold = Chem.MolFromSmiles(scaffold_smiles_with_R)
    mols = [Chem.MolFromSmiles(s) for s in compounds]
    # The second return value lists indices of molecules that did not match the core.
    decomp, _ = rgd.RGroupDecompose([scaffold], mols, asSmiles=True)
    return decomp

# Labelled attachment points [*:1] and [*:2] fix which position becomes R1 and R2.
scaffold = 'c1ccc(C(=O)N[*:1])cc1-[*:2]'
compounds = ['c1ccc(C(=O)NCC)cc1F', 'c1ccc(C(=O)NCCC)cc1Cl']
table = decompose_series(compounds, scaffold)
```
Output: list of {'Core': scaffold, 'R1': r1_smiles, 'R2': r2_smiles} dicts. Used for Free-Wilson analysis (see reaction-enumeration skill).
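As a bridge to Free-Wilson analysis, the R-group rows can be turned into a binary design matrix with one column per (position, substituent) pair. The sketch below is illustrative only: `one_hot_rgroups` is a hypothetical helper, and the toy rows merely imitate the shape of the output described above rather than reproducing actual RDKit output.

```python
def one_hot_rgroups(rows, positions=('R1', 'R2')):
    """Encode R-group rows as a binary Free-Wilson-style design matrix."""
    # One feature per observed (position, substituent) combination
    features = sorted({f'{pos}={row[pos]}' for row in rows for pos in positions})
    column = {name: j for j, name in enumerate(features)}
    matrix = []
    for row in rows:
        encoded = [0] * len(features)
        for pos in positions:
            encoded[column[f'{pos}={row[pos]}']] = 1
        matrix.append(encoded)
    return features, matrix

# Hypothetical rows in the shape of the decomposition output
rows = [
    {'Core': 'core', 'R1': 'CC[*:1]', 'R2': 'F[*:2]'},
    {'Core': 'core', 'R1': 'CCC[*:1]', 'R2': 'Cl[*:2]'},
    {'Core': 'core', 'R1': 'CC[*:1]', 'R2': 'Cl[*:2]'},
]
features, matrix = one_hot_rgroups(rows)
```

Each row of `matrix` has exactly one 1 per position, which is the input form a linear Free-Wilson fit expects.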
Goal: Mine a SAR dataset for substructure transformations and their associated activity changes.
Approach: Fragment all compounds into core + variable side; index pairs differing by one transformation; report delta(activity) per transformation.
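```bash
# Fragment the structures, then index the fragments into matched pairs
mmpdb fragment data.smi -o data.fragments
mmpdb index data.fragments -o data.mmpdb
# Property values (here pIC50) must already be loaded into the database,
# e.g. with `mmpdb loadprops`, before `transform --property` can report deltas.
mmpdb transform --smiles 'COc1ccccc1' --property pIC50 data.mmpdb
```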
Output: ranked transformations with delta(pIC50), N pairs, confidence.
Confidence interpretation:
Classical MMPA: "Me -> F always +0.5 log units." Context-based MMPA: "Me -> F adjacent to amide is +0.5; Me -> F adjacent to ester is -0.1."
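Awale et al. 2024 reported that context-conditioned transformations have 60% higher predictive accuracy. mmpdb captures context through environment fingerprints around each attachment point, and rules can be restricted to a minimum environment radius (check mmpdb transform --help for the exact option in your installed version). Arbitrary, user-defined contexts require custom analysis.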
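The difference between classical and context-based MMPA comes down to the grouping key used when averaging activity deltas. The sketch below uses hypothetical pair data chosen to mirror the Me -> F example above; the `mean_delta` helper and the context labels are assumptions for illustration and are unrelated to mmpdb's internal representation.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical matched pairs: (transformation, context, delta_pIC50)
pairs = [
    ('Me>>F', 'amide', 0.6),
    ('Me>>F', 'amide', 0.4),
    ('Me>>F', 'ester', -0.2),
    ('Me>>F', 'ester', 0.0),
]

def mean_delta(pairs, by_context):
    """Average delta per transformation, optionally split by local context."""
    grouped = defaultdict(list)
    for transformation, context, delta in pairs:
        key = (transformation, context) if by_context else transformation
        grouped[key].append(delta)
    return {key: mean(deltas) for key, deltas in grouped.items()}

classical = mean_delta(pairs, by_context=False)   # one pooled value per rule
contextual = mean_delta(pairs, by_context=True)   # one value per (rule, context)
```

Pooling hides the opposite trends in the two contexts; splitting by context recovers them at the cost of fewer pairs per estimate.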
Goal: Find compounds with different scaffold but similar 3D shape / pharmacophore / activity.
| Method | Approach | Tools |
|--------|----------|-------|
| 2D similarity with FCFP4 | Functional-class fingerprint Tanimoto | similarity-searching skill |
| 3D shape (ROCS) | Tanimoto on shape + color volumes | shape-similarity skill |
| Pharmacophore | Common pharmacophore features | pharmacophore-modeling skill |
| Maximum Common Substructure (MCS) | Largest shared substructure | similarity-searching skill (rdFMCS) |
| Deep scaffold hopping | Multi-modal transformer NN | DeepScaffoldHop (Devereux 2024) |
For systematic scaffold-hop discovery, combine:
Goal: Identify "analog series" within a library -- compounds sharing a scaffold + co-varying R-groups.
```python
def detect_series(smiles_list, min_size=3):
    # Reuses scaffold_clusters; keeps only scaffolds with at least min_size members
    clusters = scaffold_clusters(smiles_list)
    series = {scaff: cmpds for scaff, cmpds in clusters.items()
              if len(cmpds) >= min_size}
    return series
```
As a rough guide, a 10k-compound library yields on the order of 100-500 series of size >= 3, depending on how the library was assembled. Series are the units for SAR modeling.
Trigger: Compound has no rings (e.g., fatty acid, simple amine).
Mechanism: Bemis-Murcko strips R-groups; no rings = nothing remains.
Symptom: Scaffold is empty string; molecules cluster together as "no scaffold".
Fix: For linear-rich libraries, augment with linear chain length / functional group features.
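Before any augmentation, it helps to pull the empty-scaffold bucket out so acyclic compounds are not treated as one chemotype. A minimal sketch, assuming the clusters mapping uses the empty string as the key for acyclic molecules, as described in the symptom above; the helper name is hypothetical.

```python
def split_acyclic(clusters):
    """Separate compounds with an empty scaffold from ring-containing clusters."""
    ring_clusters = {scaff: members for scaff, members in clusters.items() if scaff}
    acyclic = list(clusters.get('', []))
    return ring_clusters, acyclic

# Hypothetical clusters: the '' key collects molecules without rings
toy = {'': ['CCCCCCCC(=O)O', 'CCN'], 'c1ccccc1': ['Cc1ccccc1']}
ring_clusters, acyclic = split_acyclic(toy)
```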
Trigger: Compound has spiro or bridged ring system.
Mechanism: All ring atoms included; result is the entire ring system without R-groups.
Symptom: Apparently different drugs share a "scaffold" because of common spiro center.
Fix: Validate visually; use generic framework for topology-only comparison.
Trigger: Distinguishing pyridine vs benzene scaffolds.
Mechanism: MakeScaffoldGeneric sets all atoms to C.
Symptom: Pyridine and benzene scaffolds reported as identical.
Fix: Use Bemis-Murcko (heteroatoms preserved); generic framework for topology only.
Trigger: Library has many singletons + few large scaffolds.
Mechanism: Large scaffolds dominate; greedy assignment puts them in train.
Symptom: Test set is mostly singleton scaffolds; metrics misleading.
Fix: Use stratified scaffold split (balance test classes); or scaffold-balanced cross-validation.
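To detect this failure mode, compare the singleton-scaffold fraction and the positive rate of each side of the split. The sketch below is an assumption-laden illustration: `split_report` is a hypothetical helper that expects precomputed id-to-scaffold and id-to-label mappings.

```python
from collections import Counter

def split_report(scaffold_by_id, label_by_id, ids):
    """Singleton-scaffold fraction and positive rate for one side of a split."""
    if not ids:
        raise ValueError('ids must not be empty')
    # Scaffold sizes are counted over the whole library, not just this side
    library_sizes = Counter(scaffold_by_id.values())
    n = len(ids)
    singletons = sum(1 for i in ids if library_sizes[scaffold_by_id[i]] == 1)
    positives = sum(label_by_id[i] for i in ids)
    return {'n': n, 'singleton_fraction': singletons / n, 'positive_rate': positives / n}

# Hypothetical library: one 3-member scaffold and two singletons
scaffold_by_id = {0: 'A', 1: 'A', 2: 'A', 3: 'B', 4: 'C'}
label_by_id = {0: 1, 1: 1, 2: 0, 3: 0, 4: 0}
train_report = split_report(scaffold_by_id, label_by_id, [0, 1, 2])
test_report = split_report(scaffold_by_id, label_by_id, [3, 4])
```

A test side made up almost entirely of singletons, or with a positive rate far from the training side, signals that the reported metrics will be hard to interpret.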
Trigger: Transformation rare in dataset.
Mechanism: Need enough pairs to estimate delta(activity).
Symptom: Transformation reports N=2 with very large delta.
Fix: Filter N >= 10; supplement sparse transformations with additional pairs from external sources (e.g. ChEMBL bioactivity data, Enamine catalogs).
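The pair-count filter can be sketched as below, reporting a standard error alongside the mean so that small-N rules are visibly uncertain. The function name, the input layout (rule mapped to a list of deltas), and the toy values are assumptions for illustration and do not reflect mmpdb's output format.

```python
from math import sqrt
from statistics import mean, stdev

def reliable_transformations(deltas_by_rule, min_pairs=10):
    """Keep rules with at least min_pairs observations; report mean and standard error."""
    threshold = max(min_pairs, 2)  # stdev needs at least two values
    kept = {}
    for rule, deltas in deltas_by_rule.items():
        n = len(deltas)
        if n < threshold:
            continue
        kept[rule] = {'n': n, 'mean': mean(deltas), 'sem': stdev(deltas) / sqrt(n)}
    return kept

# Hypothetical deltas: one well-supported rule, one with only two pairs
toy = {
    'Me>>F': [0.5, 0.4, 0.6, 0.5, 0.3, 0.7, 0.5, 0.4, 0.6, 0.5],
    'H>>CF3': [2.1, 1.9],
}
kept = reliable_transformations(toy)
```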
Trigger: Multiple positions in scaffold could match same R-group.
Mechanism: RGroupDecompose returns first match; not necessarily the "intended" one.
Symptom: R1/R2 columns mixed up.
Fix: Specify scaffold with explicit [*:1] and [*:2] placeholders at desired positions.
| Concept | Definition A | Definition B | Pick which |
|---------|--------------|--------------|------------|
| Bemis-Murcko scaffold | Atoms in rings + linkers | Same | RDKit default |
| Generic framework | All C, all single bonds | All C, original bonds | RDKit MakeScaffoldGeneric gives A; B needs custom code that resets atoms but keeps bond orders |
| Cyclic skeleton | Only ring atoms | Only ring atoms, generic | RDKit specific |
| "Series" | Same Bemis-Murcko | Tanimoto > 0.8 + same MW | Bemis-Murcko for SAR; Tanimoto for screening |
For ML splits: Bemis-Murcko. For library diversity: Bemis-Murcko + cluster size. For series detection: Bemis-Murcko + R-group decomposition.
| Symptom | Cause | Fix |
|---------|-------|-----|
| GetScaffoldForMol returns mol with extra atoms | RDKit keeps atoms double-bonded to ring or linker atoms (e.g. carbonyl O) | Apply MakeScaffoldGeneric, then GetScaffoldForMol again, if a bare framework is needed |
| Singleton scaffolds dominate library | Inconsistent standardization (e.g. tautomers, charge states) | Check for tautomer-induced scaffold variation; canonicalize first |
| R-group decomposition empty | Mol doesn't match scaffold | Use FMCS to find actual shared core |
| mmpdb missing transformations | Cores too restrictive | Try smaller core requirement |
| Scaffold split gives all to train | Few scaffolds; large clusters | Add singleton-spread strategy; use Murcko-and-Linker variant |
| Generic framework same for different drugs | Stripped heteroatom info | Use Bemis-Murcko (preserves heteroatoms) |
| MakeScaffoldGeneric error | RDKit version issue | RDKit 2024.09+ uses Chem.Scaffolds.MurckoScaffold |