AI Drug Discovery, Anticancer Bioscience Ltd, Chengdu, China.
Bioinformatics. 2022 Jun 27;38(13):3438-3443. doi: 10.1093/bioinformatics/btac332.
Automated molecule generation is a crucial step in in-silico drug discovery. Graph-based generation algorithms have seen significant progress in recent years. However, they are often complex to implement, hard to train, and can underperform when generating long-sequence molecules. A simple yet powerful alternative can help improve the practicality of automated drug discovery methods.
We propose a ConvNet-based sequential graph generation algorithm. The molecular graph generation problem is reformulated as a sequence of simple classification tasks. At each step, a convolutional neural network operates on the sub-graph generated at the previous step and predicts (classifies) an atom-adding or bond-adding action that extends the input sub-graph. The proposed model is pretrained by learning to sequentially reconstruct existing molecules. The pretrained model is abbreviated as SEEM (structural encoder for engineering molecules). It is then fine-tuned with reinforcement learning to generate molecules with improved properties. The fine-tuned model is named SEED (structural encoder for engineering drug-like-molecules). The proposed models demonstrate competitive performance compared with 16 state-of-the-art baselines on three benchmark datasets.
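To illustrate how graph generation can be cast as a sequence of classification tasks, the following minimal sketch decomposes a small molecule into ordered atom-adding and bond-adding actions and pairs each action with the sub-graph that precedes it. The graph representation, the action tuples, and the function names (`decompose`, `apply_action`, `training_pairs`) are assumptions made for illustration and are not taken from the SEEM or SEED code. The convolutional classifier that would be trained on these pairs is omitted.

```python
from typing import List, Tuple

# Hypothetical representations, chosen for this sketch only.
Bond = Tuple[int, int, int]          # (atom index i, atom index j, bond order)
State = Tuple[List[str], List[Bond]]  # (atoms placed so far, bonds placed so far)
Action = tuple                        # ("add_atom", symbol), ("add_bond", i, j, order) or ("stop",)


def decompose(atoms: List[str], bonds: List[Bond]) -> List[Action]:
    """Turn a molecule into an ordered list of atom/bond adding actions."""
    actions: List[Action] = []
    for idx, symbol in enumerate(atoms):
        actions.append(("add_atom", symbol))
        # Connect the new atom to atoms that were already placed.
        for i, j, order in sorted(bonds):
            if max(i, j) == idx:
                actions.append(("add_bond", min(i, j), idx, order))
    actions.append(("stop",))
    return actions


def apply_action(state: State, action: Action) -> State:
    """Return the sub-graph obtained by applying one action (no mutation)."""
    atoms, bonds = state
    if action[0] == "add_atom":
        return atoms + [action[1]], bonds
    if action[0] == "add_bond":
        return atoms, bonds + [(action[1], action[2], action[3])]
    return atoms, bonds  # "stop" leaves the graph unchanged


def training_pairs(atoms: List[str], bonds: List[Bond]) -> List[Tuple[State, Action]]:
    """Build (sub-graph, next action) pairs: one classification example per step."""
    state: State = ([], [])
    pairs = []
    for action in decompose(atoms, bonds):
        pairs.append((state, action))
        state = apply_action(state, action)
    return pairs


# Heavy-atom skeleton of ethanol: C-C-O with two single bonds.
ethanol_atoms = ["C", "C", "O"]
ethanol_bonds = [(0, 1, 1), (1, 2, 1)]
for sub_graph, label in training_pairs(ethanol_atoms, ethanol_bonds):
    print(sub_graph, "->", label)
```

In this framing, pretraining by reconstruction amounts to supervised learning on such pairs, with the sub-graph as input and the next action as the class label. Generation then repeatedly applies the predicted action until a stop action is chosen.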
Code is available at https://github.com/yuh8/SEEM and https://github.com/yuh8/SEED. The QM9 dataset is available at http://quantum-machine.org/datasets/, the ZINC250k dataset is available at https://raw.githubusercontent.com/aspuru-guzik-group/chemical_vae/master/models/zinc_properties/250k_rndm_zinc_drugs_clean_3.csv, and the ChEMBL dataset is available at https://www.ebi.ac.uk/chembl/.
Supplementary data are available at Bioinformatics online.