Pharmaceutical Research Center, Pharmaceutical Technology Institute, Mashhad University of Medical Sciences, Mashhad, Iran.
Chem Biol Drug Des. 2021 Apr;97(4):930-943. doi: 10.1111/cbdd.13819. Epub 2021 Jan 10.
The performance of machine learning (ML) methods, including deep learning (DL), was evaluated on a diverse dataset with and without feature selection (FS). The superior performance of DL on small datasets has not been demonstrated previously, yet the datasets available for newly identified targets are usually limited in size. The study explored whether FS, hyperparameter search, and ensemble modeling can improve ML and DL performance on small datasets. QSAR classifier models were developed using K-nearest neighbors (KN), DL, random forest (RF), naïve Bayesian (NB) classification, support vector machine (SVM), and logistic regression (LR). In general, the best individual performers were DL and SVM. LR performed similarly to DL and SVM on the small subsets. Nested cross-validation made it possible to combine different feature vectors with different ML methods into an ensemble model whose performance on the datasets was similar to that of the best performers. The baseline NB model achieved an overall Matthews correlation coefficient of 0.356, which improved to around 0.66 with NB-assisted FS followed by SVM/DL classification and to around 0.63 with an ensemble model.
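The general idea of nested cross-validation scored by the Matthews correlation coefficient can be sketched as follows. This is a minimal illustration with scikit-learn, not the authors' pipeline: the synthetic data, the univariate `SelectKBest` filter (swapped in here for the NB-assisted FS described in the abstract), the parameter grid, and the fold counts are all assumptions chosen for brevity.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a small QSAR classification set
# (assumption: not the data used in the paper).
X, y = make_classification(
    n_samples=200, n_features=50, n_informative=8, random_state=0
)

# Feature selection sits inside the pipeline so it is refit on each
# training fold and never sees the held-out samples.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif)),  # assumed FS method
    ("clf", SVC()),
])

# Hypothetical search space over the number of kept features and SVM C.
param_grid = {"select__k": [5, 10, 20], "clf__C": [0.1, 1.0, 10.0]}

mcc = make_scorer(matthews_corrcoef)
inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: hyperparameter search. Outer loop: performance estimate.
search = GridSearchCV(pipe, param_grid, scoring=mcc, cv=inner)
outer_scores = cross_val_score(search, X, y, scoring=mcc, cv=outer)

print(f"nested-CV MCC: {outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")
```

Keeping the selection and tuning steps inside the inner loop means the outer-fold MCC is not inflated by information leaking from the test folds, which matters most on small sets.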