pytdc

Therapeutics Data Commons. AI-ready drug discovery datasets (ADME, toxicity, DTI), benchmarks, scaffold splits, and molecular oracles for therapeutic ML and pharmacological prediction.

Install

Download and extract to your skills directory

Copy command and send to OpenClaw for auto-install:

Download and install this skill https://openskills.cc/api/download?slug=k-dense-ai-scientific-skills-pytdc&locale=en&source=copy

PyTDC Drug Discovery Datasets Skill

Skill Overview


PyTDC (Therapeutics Data Commons) is an open-source platform that provides AI-ready drug discovery datasets and standardized benchmarks for therapeutic machine learning, covering the full drug pipeline from absorption, distribution, metabolism, and excretion (ADME) to toxicity prediction, drug-target interactions, and molecular generation.

Suitable Scenarios

1. Drug Discovery and Machine Learning Research


When you need preprocessed datasets with standardized evaluation metrics to train and test drug discovery AI models, PyTDC offers over 100 curated datasets covering three main categories: single-instance prediction (molecular properties), multi-instance prediction (drug-target/drug-drug interactions), and generation tasks (molecular design, retrosynthesis).

2. Model Benchmarking and Evaluation


When you need fair, reproducible protocols to evaluate the performance of drug discovery models, PyTDC provides multiple benchmark groups (e.g., the ADMET group contains 22 datasets), supports professional data-splitting strategies such as scaffold split and cold split, and includes various evaluation metrics like ROC-AUC, PR-AUC, and RMSE.

3. Molecular Generation and Optimization


When you need goal-directed molecular generation or to evaluate molecular properties, PyTDC provides 17+ molecular oracles (e.g., for targets such as GSK3B, DRD2, JNK3) that score molecules against specific biological targets, for example predicted bioactivity or binding affinity, which is useful for molecular property optimization and for evaluating generative models.

Core Features

1. AI-Ready Drug Discovery Datasets


Provides curated datasets covering the entire therapeutic pipeline, including ADME (absorption/distribution/metabolism/excretion), toxicity (hERG, AMES, DILI), drug-target interactions (BindingDB includes 500k+ binding records), drug-drug interactions, high-throughput screening (HTS), quantum mechanical properties, and more. All datasets are preprocessed and ready for machine learning training, supporting multiple output formats (DataFrame, PyG graphs, DGL graphs, etc.).

2. Professional Data Splits and Benchmarks


Supports scaffold split (chemical diversity split based on molecular scaffolds), cold split (test set contains unseen drugs/targets), temporal split (time-series split), and other professional splitting strategies. Offers benchmark group functionality — for example, the ADMET group contains 22 datasets — and supports multi-seed (5 seeds) evaluation protocols to ensure fairness and reproducibility in model evaluation.

3. Molecular Generation Evaluation and Data Processing Tools


Includes 17+ molecular oracles for evaluating molecular binding affinity to specific targets, supporting goal-directed molecular generation and optimization. Provides 11 core data processing tools, including molecular format conversion (SMILES, PyG, DGL, ECFP, etc.), molecular filtering (PAINS, drug-likeness), label transformation, data balancing, negative sampling, and more.

Frequently Asked Questions

What is PyTDC and how is it different from ChEMBL?


PyTDC is the Python library for Therapeutics Data Commons. It integrates multiple drug databases, including ChEMBL, and performs unified preprocessing, adds standardized evaluation metrics, and provides professional data splits (e.g., scaffold split). Unlike the raw databases, PyTDC datasets are "AI-ready" and can be used directly for machine learning training without additional cleaning or format conversion.

How do I install PyTDC and load the first dataset?


Install with uv pip install PyTDC or pip install PyTDC. Loading data is straightforward: from tdc.single_pred import ADME; data = ADME(name='Caco2_Wang'); split = data.get_split(method='scaffold'). This returns a dictionary with keys train, valid, and test, each mapping to a pandas DataFrame.

Why is PyTDC's scaffold split important?


Scaffold split divides data based on molecular scaffold structures, ensuring the test set contains compounds with different scaffolds than the training set. This better simulates real drug discovery scenarios (predicting properties of compounds with novel chemical scaffolds), prevents models from overfitting to specific substructure patterns, and is the gold-standard split for evaluating model generalization.

Can PyTDC datasets be used for commercial purposes?


Yes. PyTDC is released under the MIT license, which allows commercial use. However, note that some original datasets integrated into PyTDC may have their own usage terms; it is recommended to check the specific dataset citations and licensing information before use.

How do I get a list of all available datasets in PyTDC?


Use from tdc.utils import retrieve_dataset_names; adme_datasets = retrieve_dataset_names('ADME') to get the names of all datasets for a specific task (e.g., ADME). You can also check the references/datasets.md file in the skill package, which contains a detailed catalog of all datasets.