Molfeat

Molfeat - Molecular Featurization and Molecular Machine Learning Toolkit

Overview

Molfeat is a unified Python library for molecular featurization that provides 100+ featurizers, ranging from handcrafted feature extractors to pretrained embedding models. It converts SMILES strings or RDKit molecule objects into numerical representations that machine learning models can consume, supporting QSAR modeling, virtual screening, and deep learning applications.

Use Cases

1. Drug Discovery and QSAR Modeling

Build quantitative structure–activity/property relationship (QSAR/QSPR) models to predict molecular properties and bioactivity. Molfeat supports classic fingerprints such as ECFP and MACCS as well as RDKit 2D and Mordred descriptors, and its featurizers plug into scikit-learn for property prediction workflows.

2. Large-Scale Virtual Screening

Perform parallel featurization and activity prediction on libraries of millions of compounds. Leverage multi-core parallel processing and built-in caching to quickly compute fingerprints and run similarity searches, supporting lead discovery and scaffold-hopping analysis.
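The similarity search mentioned above is commonly scored with the Tanimoto coefficient over binary fingerprints. The following is a minimal sketch in plain Python, not Molfeat's own implementation: the helper names tanimoto and top_k_similar are hypothetical, and the small sets of on-bit indices are toy data standing in for real fingerprints such as ECFP.

```python
from typing import Dict, List, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def top_k_similar(
    query: Set[int], library: Dict[str, Set[int]], k: int = 3
) -> List[Tuple[str, float]]:
    """Rank library entries by Tanimoto similarity to the query and keep the best k."""
    scored = [(name, tanimoto(query, bits)) for name, bits in library.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy on-bit sets (made up for illustration, not real fingerprints).
library = {
    "mol_a": {1, 4, 9, 16},
    "mol_b": {1, 4, 9, 25},
    "mol_c": {2, 3, 5},
}
query = {1, 4, 9, 16}

hits = top_k_similar(query, library, k=2)
for name, score in hits:
    print(f"{name}: {score:.2f}")
```

For libraries of millions of compounds, the same scoring is usually vectorized over packed bit arrays rather than Python sets; the set-based version is used here only because it makes the formula easy to read.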

3. Molecular Deep Learning

Generate molecular embeddings with pretrained models such as ChemBERTa, ChemGPT, and GIN for use in GNN training, transfer learning, and chemical space analysis. Both Transformer language models and graph neural networks are supported, which suits tasks that need richer molecular representations than fixed fingerprints.

Core Features

1. Unified Featurization Interface

Provides a three-layer API (Calculator, Transformer, and PretrainedTransformer) that spans single-molecule feature calculation through batch parallel processing. Accepts SMILES strings and RDKit molecule objects, and handles invalid molecules with built-in error recovery.

2. 100+ Built-in Featurizers

Includes molecular fingerprints (ECFP, MACCS, MAP4, etc.), molecular descriptors (RDKit 2D; Mordred, with 1,800+ descriptors), pretrained models (ChemBERTa, ChemGPT, GIN, Graphormer), and pharmacophore and shape descriptors. All available models can be discovered and loaded via ModelStore.

3. scikit-learn Compatibility and Production Deployment

Fully compatible with scikit-learn Pipeline. Featurizer configurations can be saved to and loaded from files to ensure reproducibility. Parallel processing, batching, and caching mechanisms help optimize performance on large-scale data.
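To illustrate what Pipeline compatibility means in practice, here is a minimal sketch of a featurization step placed in front of a regressor. The ToyFeaturizer class is a hypothetical stand-in written for this example and is not part of Molfeat; it exists only so the snippet runs with scikit-learn and NumPy alone. The target values are made up.

```python
import zlib

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline


class ToyFeaturizer(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for a molecular featurizer (NOT a Molfeat class).

    Hashes the character bigrams of each SMILES string into a fixed-length
    binary vector, purely so the pipeline below has numeric input.
    """

    def __init__(self, length=64):
        self.length = length

    def fit(self, X, y=None):
        return self  # stateless: nothing is learned from the data

    def transform(self, X):
        out = np.zeros((len(X), self.length), dtype=np.float32)
        for row, smiles in enumerate(X):
            for i in range(len(smiles) - 1):
                bucket = zlib.crc32(smiles[i:i + 2].encode()) % self.length
                out[row, bucket] = 1.0
        return out


smiles = ["CCO", "CCN", "CCC", "c1ccccc1", "CC(=O)O", "CCCl"]
y = np.array([0.1, 0.3, 0.5, 1.2, 0.2, 0.9])  # made-up target values

# Featurization and modeling live in one object, so the raw SMILES strings
# are the pipeline's input at both fit and predict time.
pipe = Pipeline([
    ("featurize", ToyFeaturizer(length=64)),
    ("model", RandomForestRegressor(n_estimators=20, random_state=0)),
])
pipe.fit(smiles, y)
predictions = pipe.predict(["CCO", "c1ccccc1"])
print(predictions.shape)
```

In a real workflow, the assumption is that a Molfeat transformer (for example a MoleculeTransformer configured for ECFP) would take the place of ToyFeaturizer in the "featurize" step; check the Molfeat documentation for the exact constructor arguments.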

Frequently Asked Questions

What is Molfeat? Who is it for?

Molfeat is a Python library for molecular featurization, aimed at computational chemists, drug discovery scientists, and AI researchers. It unifies 100+ molecular featurization methods, including classic fingerprints, descriptors, and pretrained deep learning models, to convert chemical structures (given as SMILES) into feature vectors for machine learning.

How do I choose the right molecular featurizer?

For traditional machine learning (random forest, XGBoost), start with ECFP fingerprints; for interpretability, use RDKit 2D descriptors or Mordred; for deep learning tasks, use pretrained models like ChemBERTa or GIN. For virtual screening, ECFP or MAP4 are recommended; for similarity search, use ECFP or MACCS.

How does Molfeat handle large-scale compound data?

Use MoleculeTransformer with n_jobs=-1 to enable multi-core parallel processing. For very large datasets (more than 100k molecules), process the data in chunks with a helper such as featurize_in_chunks so that only one chunk of features is held in memory at a time. Pretrained models support caching, so embeddings computed on the first run can be reused afterwards.
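Chunked processing can be sketched as a small generator. The version below is an illustration under stated assumptions: the signature of featurize_in_chunks shown here is hypothetical and may differ from the function the text refers to, and the featurize argument is any callable that maps a list of SMILES strings to a list of feature rows. A toy callable is used so the snippet runs without extra dependencies.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

FeatureRow = List[float]


def featurize_in_chunks(
    smiles: Sequence[str],
    featurize: Callable[[Sequence[str]], List[FeatureRow]],
    chunk_size: int = 1000,
) -> Iterator[Tuple[int, List[FeatureRow]]]:
    """Yield (start_index, features) one chunk at a time.

    Hypothetical sketch: only one chunk of features exists in memory at once,
    so the caller can write each chunk to disk before requesting the next.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    for start in range(0, len(smiles), chunk_size):
        chunk = smiles[start:start + chunk_size]
        yield start, featurize(chunk)


def toy_featurize(batch: Sequence[str]) -> List[FeatureRow]:
    """Toy featurizer (illustration only): string length and count of 'C'."""
    return [[float(len(s)), float(s.count("C"))] for s in batch]


molecules = ["CCO", "c1ccccc1", "CC(=O)O", "CCN", "C"]

chunk_sizes = []
all_features = []
for start, feats in featurize_in_chunks(molecules, toy_featurize, chunk_size=2):
    chunk_sizes.append(len(feats))
    all_features.extend(feats)  # a real run would persist each chunk instead

print(chunk_sizes)
```

Yielding chunks instead of returning one large array is the design choice that bounds memory: the caller decides whether to accumulate, stream to disk, or score each chunk immediately.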
