Arús-Pous Josep, Blaschke Thomas, Ulander Silas, Reymond Jean-Louis, Chen Hongming, Engkvist Ola
Hit Discovery, Discovery Sciences, IMED Biotech Unit, AstraZeneca, Gothenburg, Pepparedsleden 1, 43183, Mölndal, Sweden.
Department of Chemistry and Biochemistry, University of Bern, Freiestrasse 3, 3012, Bern, Switzerland.
J Cheminform. 2019 Mar 12;11(1):20. doi: 10.1186/s13321-019-0341-z.
Recent applications of recurrent neural networks (RNNs) enable training models that sample the chemical space. In this study we train an RNN on molecular string representations (SMILES) from a subset of the enumerated database GDB-13 (975 million molecules). We show that a model trained on 1 million structures (0.1% of the database) reproduces 68.9% of the entire database when sampling 2 billion molecules. We also developed a method to assess the quality of the training process using negative log-likelihood plots. Furthermore, we use a mathematical model based on the "coupon collector problem" that compares the trained model to an upper bound, which allows us to quantify how much the model has learned. We also suggest that this method can be used as a tool to benchmark the learning capabilities of any molecular generative model architecture. Additionally, an analysis of the generated chemical space shows that, mostly due to the syntax of SMILES, complex molecules with many rings and heteroatoms are more difficult to sample.
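The coupon-collector upper bound mentioned above can be sketched as follows. This is a minimal illustration, not the authors' exact derivation: assuming an ideal generator that samples all of GDB-13 uniformly with replacement, the expected fraction of the database recovered after a given number of draws follows directly from the probability that any fixed molecule is never drawn.

```python
def expected_coverage(db_size: int, n_samples: int) -> float:
    """Expected fraction of a database of db_size distinct molecules
    recovered after n_samples uniform draws with replacement.

    Coupon collector model: P(a given molecule is never drawn)
    = (1 - 1/db_size)^n_samples, so expected coverage is the complement.
    """
    return 1.0 - (1.0 - 1.0 / db_size) ** n_samples


GDB13_SIZE = 975_000_000   # size of GDB-13 as stated in the abstract
N_SAMPLES = 2_000_000_000  # number of sampled molecules in the study

# An ideal uniform sampler sets the upper bound against which the
# trained model's 68.9% coverage can be compared.
print(f"Ideal uniform coverage: {expected_coverage(GDB13_SIZE, N_SAMPLES):.1%}")
```

Under these assumptions the ideal uniform sampler recovers roughly 87% of the database after 2 billion draws, so the gap between that bound and the model's observed coverage quantifies how much the model has left to learn.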