School of Chemistry, University of Bristol, Cantock's Close, Bristol BS8 1TS, United Kingdom.
Computational Sciences, GlaxoSmithKline, Gunnels Wood Road, Stevenage, Herts SG1 2NY, United Kingdom.
J Chem Inf Model. 2020 Dec 28;60(12):5699-5713. doi: 10.1021/acs.jcim.0c00343. Epub 2020 Jul 24.
Deep learning approaches have become popular in recent years in the field of molecular design. While a variety of different methods are available, it is still a challenge to assess and compare their performance. A particularly promising approach for automated drug design is to use recurrent neural networks (RNNs) as SMILES generators and train them with the procedure known as "transfer learning". This involves first training an initial model on a large generic data set of molecules to learn the general syntax of SMILES, followed by fine-tuning on a smaller set of molecules drawn from, e.g., a lead optimization program. To create a well-performing transfer learning application that can be automated, it is important to understand how the size of the second data set affects the training process. In addition, extensive postfiltering of the molecules generated after transfer learning using similarity metrics should be avoided, as it can introduce new biases into the selection of drug candidates. Here, we present results from the application of a gated recurrent unit (GRU)-RNN to transfer learning on data sets of varying size and complexity. Analysis of the results has allowed us to provide some general guidelines for transfer learning. In particular, we show that data sets containing at least 190 molecules are needed for effective GRU-RNN-based molecular generation using transfer learning. The methods presented here should be applicable generally to the benchmarking of other deep learning methodologies for molecule generation.
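The two-stage procedure described in the abstract can be sketched in code. The following is a minimal, hypothetical PyTorch illustration, not the authors' actual implementation: a character-level GRU is first trained on a (toy) generic SMILES set to learn syntax, then fine-tuned on a (toy) small focused set, typically with a lower learning rate. The vocabulary, model sizes, data, and hyperparameters are all illustrative assumptions.

```python
import torch
import torch.nn as nn

# Token ids: padding, begin-of-sequence, end-of-sequence, then characters.
# The character set below is a toy assumption covering only the demo SMILES.
PAD, BOS, EOS = 0, 1, 2
VOCAB = {c: i + 3 for i, c in enumerate("CNO()=c1")}
N_TOKENS = 3 + len(VOCAB)

def encode(smiles: str) -> torch.Tensor:
    """Map a SMILES string to a 1 x T tensor of token ids."""
    return torch.tensor([[BOS] + [VOCAB[c] for c in smiles] + [EOS]])

class SmilesGRU(nn.Module):
    """Character-level SMILES language model with a single GRU layer."""
    def __init__(self, emb: int = 32, hid: int = 64):
        super().__init__()
        self.emb = nn.Embedding(N_TOKENS, emb, padding_idx=PAD)
        self.gru = nn.GRU(emb, hid, batch_first=True)
        self.head = nn.Linear(hid, N_TOKENS)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, _ = self.gru(self.emb(x))
        return self.head(h)  # logits over the next token at each position

def train(model: SmilesGRU, smiles: list, epochs: int, lr: float) -> float:
    """Next-token (language-model) training; returns the final loss value."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss(ignore_index=PAD)
    loss = torch.tensor(0.0)
    for _ in range(epochs):
        for s in smiles:
            seq = encode(s)
            logits = model(seq[:, :-1])  # predict token t+1 from tokens <= t
            loss = loss_fn(logits.transpose(1, 2), seq[:, 1:])
            opt.zero_grad()
            loss.backward()
            opt.step()
    return float(loss)

torch.manual_seed(0)
model = SmilesGRU()
# Stage 1: pretrain on a "large" generic set to learn general SMILES syntax.
pretrain_loss = train(model, ["CCO", "CC(=O)N", "c1ccccc1", "CCN"], epochs=5, lr=1e-3)
# Stage 2: fine-tune the same weights on a small focused set (e.g. a lead
# series), conventionally with a reduced learning rate.
finetune_loss = train(model, ["CCOC", "CCNC"], epochs=5, lr=1e-4)
```

In a real application the pretraining corpus would contain on the order of a million molecules (e.g. from ChEMBL), sampling would be done autoregressively from the fine-tuned model, and — per the abstract's finding — the fine-tuning set would need to contain at least roughly 190 molecules to be effective.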