State Key Laboratory of Medicinal Chemical Biology, College of Pharmacy and Tianjin Key Laboratory of Molecular Drug Research, Nankai University, Haihe Education Park, 38 Tongyan Road, Tianjin 300353, China.
National Supercomputer Center in Tianjin, 10 Xinhuanxi Road, Tianjin Binhai New Area, Tianjin 300457, China.
Brief Bioinform. 2022 Sep 20;23(5). doi: 10.1093/bib/bbac359.
Natural products (NPs) and their derivatives are important resources for drug discovery. Many in silico target prediction methods have been reported, but very few of them distinguish NPs from synthetic molecules. Because NPs and synthetic molecules differ in many characteristics, it is necessary to build target prediction models specific to NPs. We therefore collected activity data for NPs and their derivatives from public databases and constructed four datasets: the NP dataset, the dataset of NPs and their first-class derivatives, the dataset of NPs and all their derivatives, and the ChEMBL26 compounds dataset. Conditions including activity thresholds and input features were explored to assess the performance of eight machine learning methods for target prediction of NPs: support vector machines (SVM), extreme gradient boosting, random forests, K-nearest neighbor, naive Bayes, feedforward neural networks (FNN), convolutional neural networks and recurrent neural networks. As a result, the dataset of NPs and all their derivatives was selected to build the best NP-specific models. Consensus models and voting models were additionally applied to improve prediction performance. Further evaluation on the external validation set demonstrated that (1) the NP-specific model performed better on target prediction of NPs than the traditional models trained on all ChEMBL26 compounds, and (2) the consensus model of FNN + SVM had the best overall performance, while the voting model can significantly improve recall and specificity.
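The abstract does not specify how the consensus and voting models combine individual predictions. As a minimal sketch only, the Python below assumes that a consensus model averages the predicted probabilities of two models (for example FNN and SVM) before thresholding, and that a voting model calls a target active when at least a chosen number of models agree. The function names, the 0.5 threshold and the toy probabilities are hypothetical and are not taken from the paper.

```python
import numpy as np


def consensus_predict(prob_a, prob_b, threshold=0.5):
    """Average two models' probabilities, then threshold (assumed scheme)."""
    mean_prob = (np.asarray(prob_a, dtype=float) + np.asarray(prob_b, dtype=float)) / 2.0
    return (mean_prob >= threshold).astype(int)


def voting_predict(prob_matrix, min_votes, threshold=0.5):
    """Predict active when at least `min_votes` models exceed the threshold.

    prob_matrix has shape (n_models, n_samples).
    """
    votes = (np.asarray(prob_matrix, dtype=float) >= threshold).sum(axis=0)
    return (votes >= min_votes).astype(int)


def recall_specificity(y_true, y_pred):
    """Return (recall, specificity) for binary labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return float(tp / (tp + fn)), float(tn / (tn + fp))


# Toy data (hypothetical): 3 active and 3 inactive compound-target pairs.
y_true = [1, 1, 1, 0, 0, 0]
fnn = [0.9, 0.4, 0.3, 0.2, 0.6, 0.1]
svm = [0.8, 0.7, 0.2, 0.3, 0.1, 0.4]
rf = [0.6, 0.3, 0.6, 0.1, 0.2, 0.7]

consensus = consensus_predict(fnn, svm)
lenient_vote = voting_predict([fnn, svm, rf], min_votes=1)
strict_vote = voting_predict([fnn, svm, rf], min_votes=2)

print("consensus   :", recall_specificity(y_true, consensus))
print("vote (>= 1) :", recall_specificity(y_true, lenient_vote))
print("vote (>= 2) :", recall_specificity(y_true, strict_vote))
```

In this toy setting, lowering `min_votes` raises recall at the cost of specificity, and raising it does the opposite. Whether the paper's voting model uses a rule of this kind is not stated in the abstract.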