datamol

Pythonic wrapper around RDKit with a simplified interface and sensible defaults. Preferred for standard drug discovery workflows, including SMILES parsing, standardization, descriptors, fingerprints, clustering, 3D conformers, and parallel processing. Returns native rdkit.Chem.Mol objects. For advanced control or custom parameters, use RDKit directly.

Install

Download and extract to your skills directory

Copy this command and send it to OpenClaw for auto-install:

Download and install this skill https://openskills.cc/api/download?slug=k-dense-ai-scientific-skills-datamol&locale=en&source=copy

Datamol - Molecular Cheminformatics Processing Capabilities

Overview

Datamol is a Python library for drug discovery and cheminformatics that provides a simplified interface to RDKit. It reads chemical structure files, computes molecular descriptors, generates fingerprints, performs molecular clustering and 3D conformer analysis, and supports parallel processing and cloud storage, which makes it well suited to computational chemistry and drug development workflows.

Use Cases

Virtual screening and library filtering
Suited to the early stages of drug discovery, where large compound libraries must be screened quickly. It can read chemical files in SDF, SMILES, and other formats, batch-compute molecular fingerprints and descriptors, filter for drug-like molecules using Lipinski's rule of five, or find structures similar to a lead compound through similarity searches.

SAR analysis and visualization
Useful for medicinal chemists analyzing structure–activity relationships. By extracting Murcko scaffolds, clustering similar molecules, and visualizing aligned structures, it helps identify functional groups and substituents critical for activity and optimize lead compounds.

Machine learning feature engineering
Generates molecular features for AI-assisted drug design. It can batch-compute hundreds of molecular descriptors or ECFP fingerprints as inputs for machine learning models to predict activity, toxicity, or ADMET properties.

Core Features

Molecule file I/O and format conversion
Supports reading and writing multiple chemical file formats such as SDF, SMILES, CSV, and Excel, and can directly read data from cloud storage (e.g., S3). It can automatically parse and standardize molecular structures and convert to formats like SMILES, InChI, and SELFIES, making it easier to handle external data.

Molecular descriptors and fingerprint calculation
Computes key descriptors in a single call, such as molecular weight, LogP, number of hydrogen bond donors and acceptors, TPSA, and number of aromatic rings, with support for parallel processing of large libraries. Generates fingerprints such as ECFP and MACCS for similarity searches and machine learning.

Molecular clustering and diversity selection
Uses the Butina clustering algorithm to group compounds by structural similarity, or performs diversity sampling to select representative molecules. Supports scaffold-based splitting so that scaffolds do not overlap between training and test sets, which gives a more realistic estimate of how well machine learning models generalize.

Frequently Asked Questions

What is the difference between Datamol and RDKit? Which should I choose?

Datamol is a Python wrapper around RDKit that provides a cleaner, easier-to-use API and sensible default parameters. It returns native RDKit molecule objects, so it is fully compatible with the RDKit ecosystem while keeping common operations concise. If you mainly need to load molecular data, run batch computations, and visualize results, Datamol is usually the more convenient choice. If you require advanced customization or lower-level RDKit functionality that Datamol does not expose, use RDKit directly.

How do I read and process SDF files with Datamol?

Use dm.read_sdf("compounds.sdf") to read an SDF file. By default it returns a list of molecule objects; pass as_df=True to get a DataFrame instead. It is recommended to first filter out molecules that failed parsing, then standardize the remaining structures with dm.standardize_mol(), and finally batch-compute properties with dm.descriptors.batch_compute_many_descriptors().

How large a compound library can Datamol handle?

Datamol supports parallel processing (n_jobs=-1). Per-molecule work such as fingerprint and descriptor calculation parallelizes well, can scale to millions of molecules, and supports progress bar display. Clustering is more limited because Butina clustering needs pairwise distances between molecules: it works efficiently for thousands to tens of thousands of molecules, and for more than 10,000 molecules diversity selection rather than full clustering is recommended.