Schoenmaker Linde, Béquignon Olivier J M, Jespers Willem, van Westen Gerard J P
Computational Drug Discovery, Drug Discovery and Safety, Leiden Academic Centre for Drug Research, Einsteinweg 55, Leiden, The Netherlands.
J Cheminform. 2023 Feb 14;15(1):22. doi: 10.1186/s13321-023-00696-x.
Generative deep learning models have emerged as a powerful approach for de novo drug design, as they aid researchers in finding new molecules with desired properties. Despite continuous improvements in the field, a subset of the outputs that sequence-based de novo generators produce cannot be progressed due to errors. Here, we propose to fix these invalid outputs post hoc. In similar tasks, transformer models from the field of natural language processing have been shown to be very effective. Therefore, a transformer model was trained to translate invalid Simplified Molecular-Input Line-Entry System (SMILES) strings into valid representations. The performance of this SMILES corrector was evaluated on four representative methods of de novo generation: a recurrent neural network (RNN), a target-directed RNN, a generative adversarial network (GAN), and a variational autoencoder (VAE). The percentage of invalid outputs from these specific generative models was found to range from 4 to 89%, with different models having different error-type distributions. Post hoc correction of SMILES was shown to increase model validity. The SMILES corrector trained with one error per input alters 60-90% of invalid generator outputs and fixes 35-80% of them. However, transformer models trained with multiple errors per input achieved higher error detection and correction performance. In this case, the best model was able to correct 60-95% of invalid generator outputs. Further analysis showed that these fixed molecules are comparable to the valid molecules from the de novo generators in terms of novelty and similarity. Additionally, the SMILES corrector can be used to expand the number of interesting new molecules within the targeted chemical space. Introducing different errors into existing molecules yields novel analogs with a uniqueness of 39% and a novelty of approximately 20%.
The results of this research demonstrate that SMILES correction is a viable post hoc extension and can enhance the search for better drug candidates.
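The training setup described above, in which errors are introduced into existing molecules so that a model can learn to translate invalid SMILES back into valid ones, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`has_balanced_syntax`, `corrupt`, `make_pairs`) are hypothetical, only one error type (a deleted parenthesis or ring-closure digit) is shown, and the syntax check is a deliberate simplification that ignores valence, aromaticity, and two-digit `%nn` ring labels. A real pipeline would presumably rely on a cheminformatics toolkit for validity checking.

```python
import random


def has_balanced_syntax(smiles: str) -> bool:
    """Simplified syntax check (an assumption, not a full SMILES validator).

    Verifies only that parentheses balance, that every single-digit ring
    closure is opened and closed, and that bracket atoms are terminated.
    """
    depth = 0
    open_rings = set()
    in_bracket = False
    for ch in smiles:
        if ch == "[":
            in_bracket = True
        elif ch == "]":
            in_bracket = False
        elif in_bracket:
            continue  # digits inside [...] are isotopes, H counts, or charges
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False  # closing parenthesis without an opening one
        elif ch.isdigit():
            open_rings ^= {ch}  # toggle: first occurrence opens, second closes
    return depth == 0 and not open_rings and not in_bracket


def corrupt(smiles: str, rng: random.Random):
    """Delete one parenthesis or digit; return None if there is none to delete."""
    positions = [i for i, ch in enumerate(smiles) if ch in "()" or ch.isdigit()]
    if not positions:
        return None
    i = rng.choice(positions)
    return smiles[:i] + smiles[i + 1:]


def make_pairs(valid_smiles, seed=0):
    """Build (invalid input, valid target) pairs with one error per input."""
    rng = random.Random(seed)
    pairs = []
    for smi in valid_smiles:
        broken = corrupt(smi, rng)
        # Keep the pair only if the corruption really broke the syntax.
        if broken is not None and not has_balanced_syntax(broken):
            pairs.append((broken, smi))
    return pairs


if __name__ == "__main__":
    # Benzene, acetic acid, ibuprofen, ethanol (ethanol has nothing to delete).
    molecules = ["c1ccccc1", "CC(=O)O", "CC(C)Cc1ccc(cc1)C(C)C(=O)O", "CCO"]
    for broken, target in make_pairs(molecules):
        print(f"{broken} -> {target}")
```

Applying `corrupt` several times to the same string would give the multiple-errors-per-input variant mentioned in the abstract; the choice of error types and their frequencies is left open here because the abstract does not specify them.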