skills/tabular/rdkit-molecular-descriptors/SKILL.md
Computes all numeric RDKit molecular descriptors from SMILES strings, filtering out NaN, constant, and infinite values to produce a clean feature matrix.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):
npx skillsauth add wenmin-wu/ds-skills tabular-rdkit-molecular-descriptors
RDKit provides 200+ molecular descriptors: physicochemical properties (logP, molecular weight, TPSA), topological indices (Balaban J, Bertz CT, Kier kappa), and structural counts (aromatic rings, H-bond donors/acceptors, functional-group fragments). Computing all of them from SMILES gives a rich feature set for tabular ML without hand-crafted, domain-specific feature engineering. The key is robust filtering: some descriptors return NaN or infinity for certain molecules, and some are constant across a given dataset. After cleanup, this typically leaves 150-180 usable numeric features, though the exact count depends on the RDKit version and the dataset.
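To get a feel for what the descriptor list contains, here is a minimal sketch, assuming RDKit is installed. Ethanol (`CCO`) is an arbitrary molecule chosen for illustration, and the five descriptors shown are just a sample of what `Descriptors.descList` exposes.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

# Descriptors.descList is a list of (name, function) pairs; its length
# depends on the installed RDKit version.
n_descriptors = len(Descriptors.descList)
print(f"{n_descriptors} descriptors available")

# Ethanol is an arbitrary example molecule, not taken from any dataset.
mol = Chem.MolFromSmiles("CCO")

# Individual descriptors can also be called directly by name.
examples = {
    "MolWt": Descriptors.MolWt(mol),
    "MolLogP": Descriptors.MolLogP(mol),
    "TPSA": Descriptors.TPSA(mol),
    "NumHDonors": Descriptors.NumHDonors(mol),
    "NumHAcceptors": Descriptors.NumHAcceptors(mol),
}
for name, value in examples.items():
    print(f"{name}: {value}")
```

Printing the names in `Descriptors.descList` is a quick way to check which descriptors a given RDKit version actually ships before relying on any of them downstream.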
from rdkit import Chem
from rdkit.Chem import Descriptors
import numpy as np
import pandas as pd


def compute_descriptors(smiles):
    """Compute all RDKit descriptors for a SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        # Unparseable SMILES: return a row of NaN so the matrix keeps its shape
        return [np.nan] * len(Descriptors.descList)
    return [func(mol) for _name, func in Descriptors.descList]


# `df` is assumed to be an existing DataFrame with a 'SMILES' column
desc_names = [name for name, _ in Descriptors.descList]
desc_matrix = [compute_descriptors(s) for s in df['SMILES']]
# Reuse df's index so the feature rows stay aligned with the source rows
features = pd.DataFrame(desc_matrix, columns=desc_names, index=df.index)

# Clean up
features = features.replace([np.inf, -np.inf], np.nan)  # treat +/-inf as missing
# Keep only columns with at least 90% non-NaN values (drops >10% NaN cols)
features = features.dropna(axis=1, thresh=int(0.9 * len(features)))
features = features.loc[:, features.nunique() > 1]  # drop constant cols
features = features.fillna(features.median())  # impute the rest with column medians
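The effect of each cleanup step is easier to see on a tiny hand-made table. The sketch below wraps the same four steps in a helper and runs it on toy data with one column per failure mode. The helper name `clean_features`, its `min_non_nan_frac` parameter, and the toy columns are all hypothetical, introduced only for this illustration; no RDKit is needed to run it.

```python
import numpy as np
import pandas as pd


def clean_features(features, min_non_nan_frac=0.9):
    """Apply the four cleanup steps to any numeric feature table (illustrative helper)."""
    out = features.replace([np.inf, -np.inf], np.nan)       # inf -> NaN
    out = out.dropna(axis=1, thresh=int(min_non_nan_frac * len(out)))  # sparse cols
    out = out.loc[:, out.nunique() > 1]                      # constant cols
    return out.fillna(out.median())                          # median imputation


# Hypothetical toy data: 10 rows, one column per failure mode.
toy = pd.DataFrame({
    "ok": np.arange(10, dtype=float),
    "has_inf": [np.inf] + [float(i) for i in range(1, 10)],
    "mostly_nan": [1.0, 2.0] + [np.nan] * 8,
    "constant": [3.0] * 10,
})

cleaned = clean_features(toy)
print(list(cleaned.columns))
print(cleaned["has_inf"].iloc[0])
```

Here `mostly_nan` is dropped for having too few non-NaN values, `constant` is dropped for having a single unique value, and the infinite entry in `has_inf` is replaced by that column's median. Note that a row produced from an invalid SMILES is entirely NaN before imputation, so after `fillna` it silently becomes a row of medians; depending on the task, it may be preferable to drop such rows first.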