skills/tabular/morgan-fingerprint-features/SKILL.md
Converts molecular SMILES strings to fixed-length Morgan fingerprint bit vectors using RDKit for use as tabular ML features.
Morgan fingerprints, also known as Extended-Connectivity Fingerprints (ECFP), encode the structural neighborhood of each atom in a molecule as a fixed-length binary vector. Each bit records whether a circular substructure of up to radius R bonds around some atom is present. Because substructure identifiers are hashed and folded into the vector, distinct substructures can occasionally share a bit. This converts variable-length SMILES strings into fixed-size feature vectors suitable for any tabular ML model (LightGBM, XGBoost, random forest). Morgan fingerprints are among the most widely used molecular representations in cheminformatics: they capture functional groups, ring systems, and local topology in a compact, hashed form.
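To make the idea of circular substructures concrete, here is a toy sketch in plain Python. It is not RDKit's algorithm: the function names, the SHA-256 based hash, and the choice of atom invariants (element and degree only) are assumptions made for illustration, and real ECFP uses richer invariants and removes duplicate substructures. It only shows the general shape of the technique: identify each atom, repeatedly widen its neighborhood, then fold the identifiers into a fixed number of bits.

```python
import hashlib


def _stable_hash(*parts):
    """Deterministic 32-bit hash (hypothetical stand-in for ECFP hashing)."""
    digest = hashlib.sha256(repr(parts).encode()).digest()
    return int.from_bytes(digest[:4], "big")


def toy_circular_fingerprint(elements, bonds, radius=2, n_bits=64):
    """Toy ECFP-style bit vector for a molecular graph.

    elements: list of element symbols, one per atom
    bonds:    list of (i, j) atom-index pairs
    """
    neighbors = {i: [] for i in range(len(elements))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    # Radius 0: identify each atom by its element and its degree.
    ids = [_stable_hash(elements[i], len(neighbors[i]))
           for i in range(len(elements))]
    seen = set(ids)

    # Each round widens every atom's neighborhood by one bond.
    for layer in range(1, radius + 1):
        ids = [
            _stable_hash(layer, ids[i],
                         tuple(sorted(ids[j] for j in neighbors[i])))
            for i in range(len(elements))
        ]
        seen.update(ids)

    # Fold the identifiers into a fixed-length bit vector.
    bits = [0] * n_bits
    for identifier in seen:
        bits[identifier % n_bits] = 1
    return bits


# Heavy-atom graphs only: ethanol (C-C-O) and propane (C-C-C)
ethanol = toy_circular_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
propane = toy_circular_fingerprint(["C", "C", "C"], [(0, 1), (1, 2)])
print(sum(ethanol), sum(propane))
```

Both molecules produce a vector of the same length regardless of their size, which is the property that makes fingerprints usable as tabular features. The modulo step is also where bit collisions come from: a smaller `n_bits` makes it more likely that two different neighborhoods land on the same position.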
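import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

def smiles_to_morgan(smiles, radius=2, n_bits=1024):
    """Convert a SMILES string to a Morgan fingerprint bit vector."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        # Unparseable SMILES: return an all-zero row so array shapes stay aligned
        return np.zeros(n_bits, dtype=np.uint8)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    # Same dtype as the fallback row, so vstack does not upcast to float
    return np.array(fp, dtype=np.uint8)

# Vectorize a dataset; df is assumed to be a pandas DataFrame with a 'SMILES' column
X = np.vstack([smiles_to_morgan(s) for s in df['SMILES']])
# X shape: (n_samples, 1024), ready for LightGBM/XGBoost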
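Recent RDKit releases mark `GetMorganFingerprintAsBitVect` as deprecated in favor of the `rdFingerprintGenerator` module. The sketch below shows one way the same features might be built with that API. It assumes an RDKit version that ships `rdFingerprintGenerator`. The helper name `smiles_to_morgan_gen` and the sample SMILES list are illustrative and not part of the original skill.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator


def smiles_to_morgan_gen(smiles_list, radius=2, n_bits=1024):
    """Build an (n_samples, n_bits) matrix of Morgan bits (hypothetical helper)."""
    # Create the generator once and reuse it for every molecule.
    gen = rdFingerprintGenerator.GetMorganGenerator(radius=radius, fpSize=n_bits)
    rows = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            # Keep a placeholder row so X stays aligned with the input order.
            rows.append(np.zeros(n_bits, dtype=np.uint8))
        else:
            rows.append(gen.GetFingerprintAsNumPy(mol).astype(np.uint8))
    return np.vstack(rows)


# Two valid molecules and one string that RDKit cannot parse
X = smiles_to_morgan_gen(["CCO", "c1ccccc1", "not_a_smiles"])
print(X.shape)
```

All-zero rows for unparseable inputs keep the matrix aligned with the source table, but they are indistinguishable from a molecule that happens to set no bits, so it may be worth tracking invalid rows separately before training.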
GetMorganFingerprint returns substructure counts; GetMorganFingerprintAsBitVect returns binary presence/absence bits. Binary is usually sufficient.
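The difference between counts and bits is easiest to see on a molecule that repeats one environment many times. The following is a minimal sketch that uses the `rdFingerprintGenerator` API rather than the two functions named above, and it assumes an RDKit version that provides that module. Octane is an arbitrary example molecule.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator

# Octane repeats the same CH2 environment several times along the chain
mol = Chem.MolFromSmiles("CCCCCCCC")
gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=1024)

bits = gen.GetFingerprintAsNumPy(mol)         # presence/absence per position
counts = gen.GetCountFingerprintAsNumPy(mol)  # number of occurrences per position

print("positions set:", int(bits.sum()))
print("largest count:", int(counts.max()))
```

The bit vector only records that a substructure occurs, while the count vector also records how often. Counts may help when the target depends on how many times a group appears, for example chain length, at the cost of features that are no longer strictly binary.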