Molfeat — Molecular Featurization Hub

Overview

Molfeat is a Python library for molecular featurization that unifies 100+ pretrained embeddings and hand-crafted featurizers under a scikit-learn-compatible API. It converts SMILES strings into numerical representations (fingerprints, descriptors, deep learning embeddings) for QSAR modeling, virtual screening, similarity searching, and chemical space analysis.

When to Use

Building QSAR/QSPR models requiring molecular features as input
Virtual screening — ranking compound libraries by predicted activity
Similarity searching against molecular databases
Chemical space analysis — clustering, visualization, dimensionality reduction
Deep learning on molecules using pretrained embeddings (ChemBERTa, GIN)
Featurization pipelines integrating with scikit-learn or PyTorch
Comparing multiple molecular representations for benchmarking
For molecular manipulation and filtering, use datamol-cheminformatics instead; for substructure-based molecular operations, use rdkit-cheminformatics

Prerequisites

uv pip install molfeat

# Optional extras for specific featurizer types
uv pip install "molfeat[transformer]"   # ChemBERTa, ChemGPT, MolT5
uv pip install "molfeat[dgl]"           # GIN graph neural networks
uv pip install "molfeat[graphormer]"    # Graphormer models
uv pip install "molfeat[fcd]"           # FCD descriptors
uv pip install "molfeat[map4]"          # MAP4 fingerprints
uv pip install "molfeat[all]"           # All dependencies

Quick Start

from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer

smiles = ["CCO", "CC(=O)O", "c1ccccc1", "CC(C)O"]

# Create fingerprint calculator + transformer
calc = FPCalculator("ecfp", radius=3, fpSize=2048)
transformer = MoleculeTransformer(calc, n_jobs=-1)

# Featurize batch in parallel
features = transformer(smiles)
print(f"Shape: {features.shape}")  # (4, 2048)

# Save configuration for reproducibility
transformer.to_state_yaml_file("featurizer_config.yml")

Key Concepts

Architecture: Calculator → Transformer → Store

Molfeat organizes featurization into three layers:

| Layer | Class | Purpose | Use When |
|-------|-------|---------|----------|
| Calculator | molfeat.calc.* | Single molecule → feature vector | Custom loops, single molecules |
| Transformer | molfeat.trans.MoleculeTransformer | Batch processing with parallelization | Datasets, scikit-learn pipelines |
| Store | molfeat.store.ModelStore | Discovery and loading of pretrained models | Finding available featurizers |

Calculators are callable: calc("CCO") returns a numpy array. Transformers wrap calculators for batch processing: transformer(smiles_list) returns a 2D array. Pretrained transformers (PretrainedMolTransformer) add batched GPU inference and caching.
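The layering can be illustrated with a toy sketch in plain Python. The names `ToyCalculator` and `ToyTransformer` are hypothetical, and the character-hashing "fingerprint" is a stand-in with no chemical meaning. The sketch only shows how a per-molecule callable is wrapped for batch use; it is not how molfeat is implemented.

```python
import zlib
from typing import Callable, List


class ToyCalculator:
    """Hypothetical calculator: one SMILES string -> one fixed-length vector."""

    def __init__(self, length: int = 16):
        self.length = length

    def __call__(self, smiles: str) -> List[int]:
        vec = [0] * self.length
        # Hash each (position, character) pair to a bit index.
        # This is a placeholder, not a real fingerprint.
        for pos, char in enumerate(smiles):
            idx = zlib.crc32(f"{pos}:{char}".encode()) % self.length
            vec[idx] = 1
        return vec


class ToyTransformer:
    """Hypothetical transformer: applies a calculator to a batch of SMILES."""

    def __init__(self, calc: Callable[[str], List[int]]):
        self.calc = calc

    def __call__(self, smiles_list: List[str]) -> List[List[int]]:
        return [self.calc(smi) for smi in smiles_list]


calc = ToyCalculator(length=16)
batch = ToyTransformer(calc)(["CCO", "CC(=O)O", "c1ccccc1"])
print(len(batch), len(batch[0]))  # 3 rows, 16 columns
```

Keeping the per-molecule logic in the calculator means the batch layer can add parallelism or error handling without touching the featurization itself.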

Featurizer Selection Guide

| Task | Recommended | Dimensions | Speed |
|------|-------------|------------|-------|
| General QSAR | ecfp (radius=3) | 2048 | Fast |
| Scaffold similarity | maccs | 167 | Very fast |
| Large-scale screening | map4 | 1024 | Fast |
| Interpretable models | desc2D (RDKitDescriptors2D) | 200+ | Fast |
| Comprehensive descriptors | mordred | 1800+ | Medium |
| Transfer learning | ChemBERTa-77M-MLM | 768 | Slow* |
| Graph-based DL | gin_supervised_masking | Variable | Slow* |
| Pharmacophore | fcfp or cats2D | 2048 / 21 | Fast |
| 3D shape | usr / usrcat | 12 / 60 | Fast |

*The first run is slow because the model weights are downloaded; later runs use the cached weights.

State Persistence

Save and reload exact featurizer configuration for reproducibility:

# Save
transformer.to_state_yaml_file("config.yml")
transformer.to_state_json_file("config.json")

# Reload
loaded = MoleculeTransformer.from_state_yaml_file("config.yml")
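As a rough illustration of why a saved state helps reproducibility, the sketch below round-trips a plain dictionary of featurizer settings through JSON using only the standard library. The dictionary layout and the `save_state`/`load_state` helpers are assumptions made for this example; molfeat's own state files may be structured differently.

```python
import json
import os
import tempfile


def save_state(state: dict, path: str) -> None:
    """Write settings to disk in a stable, human-readable form."""
    with open(path, "w") as fh:
        json.dump(state, fh, indent=2, sort_keys=True)


def load_state(path: str) -> dict:
    """Read settings back from disk."""
    with open(path) as fh:
        return json.load(fh)


# Hypothetical settings mirroring the parameters used in this guide.
state = {"method": "ecfp", "radius": 3, "fpSize": 2048, "n_jobs": -1}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "config.json")
    save_state(state, path)
    restored = load_state(path)

print(restored == state)  # True
```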

Core API

1. Fingerprint Calculators

from molfeat.calc import FPCalculator

# ECFP — most popular, general-purpose
ecfp = FPCalculator("ecfp", radius=3, fpSize=2048)
fp = ecfp("CCO")
print(f"ECFP shape: {fp.shape}")  # (2048,)

# MACCS keys — 167-bit structural keys, fast scaffold similarity
maccs = FPCalculator("maccs")
fp = maccs("c1ccccc1")
print(f"MACCS shape: {fp.shape}")  # (167,)

# Count-based fingerprints (non-binary)
ecfp_count = FPCalculator("ecfp-count", radius=3, fpSize=2048)

# MAP4 — MinHashed atom-pair, efficient for large databases
map4 = FPCalculator("map4")
print(f"MAP4 shape: {map4('CCO').shape}")  # (1024,)

Available fingerprint types: ecfp, fcfp, maccs, rdkit, avalon, pattern, layered, atompair, topological, map4, secfp, erg, estate (and count variants with -count suffix).
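The difference between a binary fingerprint and its count variant can be sketched with a simple folding function. The feature identifiers below are made-up integers, and `fold_features` is a hypothetical helper rather than molfeat's implementation.

```python
from typing import Iterable, List


def fold_features(feature_ids: Iterable[int], size: int, counting: bool = False) -> List[int]:
    """Fold integer feature ids into a fixed-size vector.

    Binary mode records presence; counting mode records how often
    each folded position was hit.
    """
    vec = [0] * size
    for fid in feature_ids:
        idx = fid % size
        if counting:
            vec[idx] += 1
        else:
            vec[idx] = 1
    return vec


features = [3, 11, 11, 20]  # made-up ids; 11 occurs twice
binary = fold_features(features, size=8)
counts = fold_features(features, size=8, counting=True)
print(binary)  # [0, 0, 0, 1, 1, 0, 0, 0]
print(counts)  # [0, 0, 0, 3, 1, 0, 0, 0]
```

Note that ids 3 and 11 land on the same position after folding. The binary vector cannot distinguish this collision from a single hit, while the count vector at least keeps the total.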

2. Descriptor Calculators

from molfeat.calc import RDKitDescriptors2D, MordredDescriptors

# RDKit 2D — 200+ named properties (MW, logP, TPSA, etc.)
desc2d = RDKitDescriptors2D()
descriptors = desc2d("CCO")
print(f"2D descriptors: {len(descriptors)}")  # 200+
print(f"Feature names: {desc2d.columns[:5]}")

# Mordred — 1800+ comprehensive descriptors
mordred = MordredDescriptors()
descriptors = mordred("c1ccccc1O")
print(f"Mordred descriptors: {len(descriptors)}")  # 1800+

3. Pharmacophore & Shape Calculators

from molfeat.calc import CATSCalculator, USRDescriptors

# CATS — pharmacophore point pair distributions
cats = CATSCalculator(mode="2D", scale="raw")
descriptors = cats("CC(C)Cc1ccc(C)cc1C")
print(f"CATS shape: {descriptors.shape}")  # (21,)

# USR — ultrafast shape recognition
usr = USRDescriptors()
shape = usr("CC(=O)Oc1ccccc1C(=O)O")
print(f"USR shape: {shape.shape}")  # (12,)

4. Batch Processing with Transformers

from molfeat.trans import MoleculeTransformer, FeatConcat
from molfeat.calc import FPCalculator

smiles = ["CCO", "CC(=O)O", "c1ccccc1", "CC(C)O", "CCCC"]

# Parallel batch processing
transformer = MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)
features = transformer(smiles)
print(f"Batch shape: {features.shape}")  # (5, 2048)

# Concatenate multiple featurizers
concat = FeatConcat([
    FPCalculator("maccs"),      # 167 dims
    FPCalculator("ecfp")        # 2048 dims
])
combo_transformer = MoleculeTransformer(concat, n_jobs=-1)
combo_features = combo_transformer(smiles)
print(f"Combined shape: {combo_features.shape}")  # (5, 2215)

# Error-tolerant processing
safe_transformer = MoleculeTransformer(
    FPCalculator("ecfp"), n_jobs=-1,
    ignore_errors=True, verbose=True
)
features = safe_transformer(["CCO", "invalid", "c1ccccc1"])
# Returns None for failed molecules
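When failed molecules come back as `None`, the labels need to be filtered with the same mask so that rows stay aligned. A minimal sketch, assuming the featurizer output is a sequence with `None` in place of failed entries (as described above); `drop_failed` is a hypothetical helper and the data are toy values.

```python
from typing import List, Optional, Sequence, Tuple


def drop_failed(
    features: Sequence[Optional[Sequence[float]]],
    labels: Sequence[float],
) -> Tuple[List[Sequence[float]], List[float], List[int]]:
    """Remove rows whose features are None and keep labels aligned."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    kept = [i for i, row in enumerate(features) if row is not None]
    return [features[i] for i in kept], [labels[i] for i in kept], kept


# Toy data: the second molecule failed featurization.
features = [[1, 0, 1], None, [0, 1, 1]]
labels = [0.5, 1.2, 2.0]
clean_X, clean_y, kept = drop_failed(features, labels)
print(kept)     # [0, 2]
print(clean_y)  # [0.5, 2.0]
```

Returning the kept indices makes it possible to map predictions back to the original input order later.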

5. Pretrained Model Embeddings

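from molfeat.trans.pretrained import PretrainedDGLTransformer
from molfeat.trans.pretrained.hf_transformers import PretrainedHFTransformer

# PretrainedMolTransformer is the shared base class; models are loaded through
# backend-specific subclasses. Confirm the exact call for a model with
# card.usage() (see the ModelStore section).

# ChemBERTa: RoBERTa trained on 77M PubChem compounds
chemberta = PretrainedHFTransformer(kind="ChemBERTa-77M-MLM", notation="smiles", dtype=float)
embeddings = chemberta(["CCO", "CC(=O)O", "c1ccccc1"])
print(f"ChemBERTa shape: {embeddings.shape}")  # (3, embedding_dim)

# GIN: graph neural network pretrained on ChEMBL (registered with underscores)
gin = PretrainedDGLTransformer(kind="gin_supervised_masking", dtype=float)
graph_emb = gin(["CCO", "CC(=O)O"])
print(f"GIN shape: {graph_emb.shape}")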

6. ModelStore — Discovering Featurizers

from molfeat.store.modelstore import ModelStore

store = ModelStore()
print(f"Total available: {len(store.available_models)}")

# Search for specific model
results = store.search(name="ChemBERTa")
for model in results:
    print(f"  {model.name}: {model.description}")

# View usage and load
card = store.search(name="ChemBERTa-77M-MLM")[0]
card.usage()
transformer = store.load("ChemBERTa-77M-MLM")

Common Workflows

Workflow 1: QSAR Model Building

from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Featurize molecules
transformer = MoleculeTransformer(FPCalculator("ecfp", radius=3), n_jobs=-1)
X = transformer(smiles_train)
print(f"Features shape: {X.shape}")

# Train and evaluate
model = RandomForestRegressor(n_estimators=100)
scores = cross_val_score(model, X, y_train, cv=5, scoring='r2')
print(f"R² = {scores.mean():.3f} ± {scores.std():.3f}")

# Save for deployment
transformer.to_state_yaml_file("production_featurizer.yml")

Workflow 2: Virtual Screening Pipeline

from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer
from sklearn.ensemble import RandomForestClassifier

# train_smiles, train_labels, and screening_smiles are supplied by the user

# Step 1: Featurize known actives/inactives
transformer = MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)
X_train = transformer(train_smiles)

# Step 2: Train classifier
clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
clf.fit(X_train, train_labels)

# Step 3: Screen library (e.g., 1M compounds)
X_screen = transformer(screening_smiles)
predictions = clf.predict_proba(X_screen)[:, 1]  # probability of the active class

# Step 4: Rank and select top hits
top_indices = predictions.argsort()[::-1][:1000]
top_hits = [screening_smiles[i] for i in top_indices]
print(f"Top 1000 hits selected from {len(screening_smiles)} compounds")
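For very large libraries, sorting every score is not strictly necessary when only the top hits are needed. The sketch below uses the standard library's `heapq.nlargest` to pick them directly. The SMILES and scores are made up, and `top_k_hits` is a hypothetical helper.

```python
import heapq
from typing import List, Sequence, Tuple


def top_k_hits(smiles: Sequence[str], scores: Sequence[float], k: int) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (smiles, score) pairs, best first."""
    return heapq.nlargest(k, zip(smiles, scores), key=lambda pair: pair[1])


library = ["CCO", "CCN", "c1ccccc1", "CC(=O)O"]
scores = [0.10, 0.85, 0.40, 0.72]  # made-up predicted probabilities
hits = top_k_hits(library, scores, k=2)
print(hits)  # [('CCN', 0.85), ('CC(=O)O', 0.72)]
```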

Workflow 3: Featurizer Benchmarking

from molfeat.calc import FPCalculator, RDKitDescriptors2D
from molfeat.trans import MoleculeTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

featurizers = {
    'ECFP': FPCalculator("ecfp"),
    'MACCS': FPCalculator("maccs"),
    'Descriptors': RDKitDescriptors2D(),
}

for name, calc in featurizers.items():
    transformer = MoleculeTransformer(calc, n_jobs=-1)
    X_train = transformer(smiles_train)
    X_test = transformer(smiles_test)
    clf = RandomForestClassifier(n_estimators=100)
    clf.fit(X_train, y_train)
    auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
    print(f"{name}: AUC = {auc:.3f}")

Common Recipes

Recipe: Scikit-learn Pipeline Integration

from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline([
    ('featurizer', MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)),
    ('classifier', RandomForestClassifier(n_estimators=100))
])
pipeline.fit(smiles_train, y_train)
predictions = pipeline.predict(smiles_test)

Recipe: Similarity Search

from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer
from sklearn.metrics.pairwise import cosine_similarity

calc = FPCalculator("ecfp")
query_fp = calc("CC(=O)Oc1ccccc1C(=O)O").reshape(1, -1)  # Aspirin

transformer = MoleculeTransformer(calc, n_jobs=-1)
db_fps = transformer(database_smiles)

similarities = cosine_similarity(query_fp, db_fps)[0]
top_k = similarities.argsort()[-10:][::-1]
for i in top_k:
    print(f"  {database_smiles[i]}: {similarities[i]:.3f}")
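Cosine similarity works on any numeric vector, but for binary fingerprints the Tanimoto (Jaccard) coefficient is the more common choice in cheminformatics. A minimal pure-Python sketch, with toy 8-bit vectors and made-up molecule names standing in for real fingerprints:

```python
from typing import Sequence


def tanimoto(a: Sequence[int], b: Sequence[int]) -> float:
    """Tanimoto coefficient: shared on-bits divided by on-bits in either vector."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


query = [1, 1, 0, 0, 1, 0, 0, 0]
database = {
    "mol_a": [1, 1, 0, 0, 1, 0, 0, 0],  # identical to the query
    "mol_b": [1, 0, 0, 0, 1, 1, 0, 0],  # shares two of four on-bits
    "mol_c": [0, 0, 1, 1, 0, 0, 0, 0],  # no overlap
}
ranked = sorted(database, key=lambda name: tanimoto(query, database[name]), reverse=True)
print(ranked)  # ['mol_a', 'mol_b', 'mol_c']
```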

Recipe: Chunk Processing for Large Datasets

import numpy as np

def featurize_chunks(smiles_list, transformer, chunk_size=10000):
    all_features = []
    for i in range(0, len(smiles_list), chunk_size):
        chunk = smiles_list[i:i+chunk_size]
        features = transformer(chunk)
        all_features.append(features)
        print(f"Processed {min(i+chunk_size, len(smiles_list))}/{len(smiles_list)}")
    return np.vstack(all_features)
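If the SMILES come from a file or generator rather than an in-memory list, slicing by index is not available. The sketch below shows one way to chunk any iterable lazily with `itertools.islice`; `iter_chunks` is a hypothetical helper, and each yielded chunk could then be passed to a transformer.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def iter_chunks(items: Iterable[str], chunk_size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most chunk_size items from any iterable."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


chunks = list(iter_chunks(["CCO", "CCN", "CCC", "CCCl", "CCBr"], chunk_size=2))
print(chunks)  # [['CCO', 'CCN'], ['CCC', 'CCCl'], ['CCBr']]
```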

Key Parameters

| Parameter | Module | Default | Description |
|-----------|--------|---------|-------------|
| method | FPCalculator | (required) | Fingerprint type: ecfp, maccs, map4, etc. |
| radius | FPCalculator | 3 | Circular fingerprint radius |
| fpSize | FPCalculator | 2048 | Fingerprint bit length |
| counting | FPCalculator | False | Count vector instead of binary |
| n_jobs | MoleculeTransformer | 1 | Parallel workers (-1 = all cores) |
| ignore_errors | MoleculeTransformer | False | Skip invalid molecules (returns None) |
| verbose | MoleculeTransformer | False | Log processing details |
| dtype | MoleculeTransformer | float64 | Output type (float32 for memory) |
| mode | CATSCalculator | "2D" | Distance calculation mode |
| scale | CATSCalculator | "raw" | Scaling: raw, num, count |

Best Practices

Use n_jobs=-1 for parallel processing on all CPU cores — significant speedup for batch featurization
Start with ECFP for initial baselines — best general-purpose fingerprint before trying deep learning
Use ignore_errors=True for large datasets — invalid SMILES won't crash the pipeline
Save configurations with to_state_yaml_file() for reproducibility — recreate exact featurizer later
Use float32 when memory matters: MoleculeTransformer(calc, dtype=np.float32)
Cache pretrained embeddings — first ChemBERTa/GIN inference is slow, subsequent runs use cache
Process in chunks for datasets >100K — prevents memory exhaustion (see Recipes)
Combine fingerprints with FeatConcat to capture complementary molecular information
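The memory advice above follows from simple arithmetic: a dense feature matrix needs rows × columns × bytes per value. The sketch below works this out for an assumed one million molecules with 2048-dimensional features; `dense_matrix_gib` is a hypothetical helper and the dataset size is only an example.

```python
def dense_matrix_gib(n_molecules: int, n_features: int, bytes_per_value: int) -> float:
    """Size of a dense feature matrix in GiB (2**30 bytes)."""
    return n_molecules * n_features * bytes_per_value / 2**30


n, d = 1_000_000, 2048  # assumed dataset size and fingerprint length
print(f"float64: {dense_matrix_gib(n, d, 8):.2f} GiB")  # float64: 15.26 GiB
print(f"float32: {dense_matrix_gib(n, d, 4):.2f} GiB")  # float32: 7.63 GiB
```

Halving the element size halves the footprint, and processing in chunks bounds the peak at one chunk's worth of rows rather than the whole matrix.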

Troubleshooting

| Problem | Cause | Solution |
|---------|-------|----------|
| ValueError: unsupported featurizer | Unknown method name | Check FPCalculator supported types or use ModelStore.search() |
| ImportError for pretrained model | Missing optional dependency | Install extras: pip install "molfeat[transformer]" or "molfeat[dgl]" |
| None in output array | Invalid SMILES with ignore_errors=True | Filter results: [f for f in features if f is not None] |
| Memory error on large dataset | Too many molecules at once | Process in chunks of 10K-50K (see Recipes) |
| Slow pretrained model inference | First run downloads model weights | Normal; subsequent runs use the cache |
| Shape mismatch in pipeline | Mixed valid/invalid molecules | Ensure ignore_errors=True and filter None before ML model |
| Reproducibility issues | Different molfeat versions | Pin version and save config: transformer.to_state_yaml_file() |

Related Skills

datamol-cheminformatics — High-level molecular manipulation (standardization, I/O, conformers)
rdkit-cheminformatics — Low-level cheminformatics (substructure, reactions, 3D)
scikit-learn — ML models consuming molfeat features

References

Official documentation: https://molfeat-docs.datamol.io/
GitHub repository: https://github.com/datamol-io/molfeat
PyPI package: https://pypi.org/project/molfeat/
Tutorial: https://portal.valencelabs.com/datamol/post/types-of-featurizers-b1e8HHrbFMkbun6

Bundled Resources

Main SKILL.md + 2 reference files. Original total: 1,273 lines (SKILL.md 510 + api_reference.md 429 + available_featurizers.md 334). Scripts: none. Examples: 724 lines (examples.md).

references/available_featurizers.md: Complete catalog of all 100+ featurizers organized by category — transformer models, GNNs, descriptors, fingerprints, pharmacophore, shape, scaffold, graph featurizers. Includes dimensions, dependencies, and selection guidance per category. Purely lookup-oriented content preserved as reference.

references/api_reference.md: Detailed API reference for molfeat.calc, molfeat.trans, and molfeat.store modules. Covers SerializableCalculator base class, all calculator subclasses with parameters, MoleculeTransformer methods, PretrainedMolTransformer, FeatConcat, ModelStore/ModelCard API, data type control, and PyTorch integration patterns.

Original file disposition:

SKILL.md (510 lines) → Core API modules 1-6, Key Concepts (architecture, selection guide), Quick Start, Workflows 1-3. "Choosing the Right Featurizer" → Key Concepts selection guide table. "Advanced Features" (custom preprocessing, batch processing, caching) → Recipes + Best Practices. "Common Featurizers Reference" table → Key Concepts selection guide. "Performance Tips" → Best Practices. Per-use-case disposition: QSAR Modeling → Workflow 1, Virtual Screening → Workflow 2, Similarity Search → Recipe, Chemical Space → When to Use bullet, scikit-learn Pipeline → Recipe, Featurizer Comparison → Workflow 3
references/api_reference.md (429 lines) → Migrated to new references/api_reference.md. Core patterns (FPCalculator, MoleculeTransformer, basic ModelStore) relocated to SKILL.md Core API modules 1-6. Detailed class methods, SerializableCalculator base class, PrecomputedMolTransformer, and PyTorch integration retained in reference
references/available_featurizers.md (334 lines) → Migrated to new references/available_featurizers.md. Top-level summary → Key Concepts selection guide table. Full categorized catalog retained in reference
references/examples.md (724 lines) → Fully consolidated inline: installation → Prerequisites; calculator examples → Core API 1-3; transformer examples → Core API 4; pretrained examples → Core API 5; ML integration → Workflows 1-3 + Recipes; advanced patterns (custom preprocessing, caching, chunk processing) → Recipes + Best Practices; troubleshooting → Troubleshooting table. No separate reference file needed — all content absorbed into SKILL.md sections

Retention: ~490 lines (SKILL.md) + ~170 lines (available_featurizers) + ~190 lines (api_reference) = ~850 / 1,273 original (excl. examples.md treated as consolidated) = ~67%. Including examples.md in denominator: ~850 / 1,997 = ~43%.

