Arús-Pous Josep, Blaschke Thomas, Ulander Silas, Reymond Jean-Louis, Chen Hongming, Engkvist Ola
Hit Discovery, Discovery Sciences, IMED Biotech Unit, AstraZeneca, Gothenburg, Pepparedsleden 1, 43183, Mölndal, Sweden.
Department of Chemistry and Biochemistry, University of Bern, Freiestrasse 3, 3012, Bern, Switzerland.
J Cheminform. 2019 Mar 12;11(1):20. doi: 10.1186/s13321-019-0341-z.
Recent applications of recurrent neural networks (RNNs) enable training models that sample the chemical space. In this study we train an RNN on molecular string representations (SMILES) from a subset of the enumerated database GDB-13 (975 million molecules). We show that a model trained on 1 million structures (0.1% of the database) reproduces 68.9% of the entire database when sampling 2 billion molecules. We also developed a method to assess the quality of the training process using negative log-likelihood plots. Furthermore, we use a mathematical model based on the "coupon collector problem" that compares the trained model to an upper bound, which allows us to quantify how much the model has learned. We also suggest that this method can be used as a tool to benchmark the learning capabilities of any molecular generative model architecture. Additionally, an analysis of the generated chemical space shows that, mostly due to the syntax of SMILES, complex molecules with many rings and heteroatoms are more difficult to sample.
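The coupon-collector upper bound mentioned above can be sketched as follows. This is a minimal illustration, not the authors' exact derivation: assuming an ideal generator that samples all of GDB-13 uniformly with replacement, the expected fraction of the database recovered after a given number of draws follows directly from the probability that any fixed molecule is never drawn.

```python
def expected_coverage(db_size: int, n_samples: int) -> float:
    """Expected fraction of a database of db_size distinct molecules
    recovered after n_samples uniform draws with replacement.

    Coupon collector model: P(a given molecule is never drawn)
    = (1 - 1/db_size)^n_samples, so expected coverage is the complement.
    """
    return 1.0 - (1.0 - 1.0 / db_size) ** n_samples


GDB13_SIZE = 975_000_000   # size of GDB-13 as stated in the abstract
N_SAMPLES = 2_000_000_000  # number of sampled molecules in the study

# An ideal uniform sampler sets the upper bound against which the
# trained model's 68.9% coverage can be compared.
print(f"Ideal uniform coverage: {expected_coverage(GDB13_SIZE, N_SAMPLES):.1%}")
```

Under these assumptions the ideal uniform sampler recovers roughly 87% of the database after 2 billion draws, so the gap between that bound and the model's observed coverage quantifies how much the model has left to learn.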