skills/tabular/smiles-randomization-augmentation/SKILL.md
Augments molecular datasets by generating multiple randomized SMILES strings for the same molecule, exploiting SMILES non-uniqueness to multiply training samples.
npx skillsauth add wenmin-wu/ds-skills tabular-smiles-randomization-augmentation
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
A single molecule can be written as many valid SMILES strings depending on the atom traversal order: CCO, OCC, and C(O)C all represent ethanol. For sequence-based models (LSTM, Transformer) that process SMILES character by character, each randomized SMILES is a distinct training example with the same label. Generating N random variants per molecule therefore multiplies the dataset size roughly N-fold without changing the chemistry. RDKit's Chem.MolToSmiles(mol, doRandom=True) generates these variants. Augmentation factors of 3-10x are typical and tend to improve generalization on small molecular datasets.
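To make the non-uniqueness concrete, the following minimal sketch checks that the three ethanol spellings mentioned above collapse to one canonical string, and then samples a handful of random spellings. The variable names are illustrative only, and the set of random strings you see will vary from run to run.

```python
from rdkit import Chem

# Three hand-written SMILES for ethanol, taken from the text above.
variants = ["CCO", "OCC", "C(O)C"]

# Canonicalization maps every valid spelling of a molecule to one string.
canonical = {Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in variants}
print(canonical)

# doRandom=True asks RDKit for a randomized atom traversal instead.
mol = Chem.MolFromSmiles("CCO")
random_variants = {Chem.MolToSmiles(mol, doRandom=True) for _ in range(20)}
print(random_variants)
```

A molecule as small as ethanol has only a few distinct spellings, so repeated sampling returns the same strings often; larger molecules yield far more variety.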
from rdkit import Chem
import numpy as np

def augment_smiles(smiles_list, labels, n_augments=3):
    """Generate randomized SMILES variants for data augmentation."""
    aug_smiles, aug_labels = [], []
    for smi, label in zip(smiles_list, labels):
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue  # skip SMILES that RDKit cannot parse
        # Keep the canonical form once, then add n_augments random variants.
        aug_smiles.append(Chem.MolToSmiles(mol, canonical=True))
        aug_labels.append(label)
        for _ in range(n_augments):
            rand_smi = Chem.MolToSmiles(mol, doRandom=True)
            aug_smiles.append(rand_smi)
            aug_labels.append(label)
    return aug_smiles, np.array(aug_labels)

# n_augments=3 keeps the canonical form and adds 3 random variants per molecule:
# 1000 parseable molecules → 4000 training examples
train_smi, train_y = augment_smiles(df['SMILES'], df['target'], n_augments=3)
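Random sampling can return the same string more than once, especially for small molecules, so the augmented set may contain exact duplicates. If that matters for your model, one possible variation is to keep only distinct variants per molecule. The sketch below is an assumption layered on top of augment_smiles, not part of the original skill; the helper name unique_random_smiles and the max_tries cap are hypothetical.

```python
from rdkit import Chem

def unique_random_smiles(smi, n_variants=3, max_tries=50):
    """Return the canonical SMILES plus up to n_variants distinct random ones.

    Hypothetical helper: max_tries bounds the sampling loop because small
    molecules may not have n_variants different spellings.
    """
    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        return []  # unparseable input yields no examples
    seen = [Chem.MolToSmiles(mol)]  # canonical form always comes first
    for _ in range(max_tries):
        if len(seen) > n_variants:
            break
        cand = Chem.MolToSmiles(mol, doRandom=True)
        if cand not in seen:
            seen.append(cand)
    return seen

# Example: phenol, written here in one arbitrary spelling.
print(unique_random_smiles("c1ccccc1O", n_variants=3))
```

The returned list can be shorter than n_variants + 1 when the molecule has few distinct spellings, so downstream code should not assume a fixed number of examples per molecule.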