Do we need different machine learning algorithms for QSAR modeling? A comprehensive assessment of 16 machine learning algorithms on 14 QSAR data sets.

Affiliations

College of Pharmaceutical Sciences, Hangzhou Institute of Innovative Medicine, Zhejiang University, P. R. China.

Xiangya School of Pharmaceutical Sciences, Central South University, P. R. China.

Publication information

Brief Bioinform. 2021 Jul 20;22(4). doi: 10.1093/bib/bbaa321.

Abstract

Although a wide variety of machine learning (ML) algorithms have been used to learn quantitative structure-activity relationships (QSARs), no single algorithm is agreed to be the best for QSAR learning. A comprehensive understanding of the performance characteristics of popular ML algorithms used in QSAR learning is therefore highly desirable. In this study, five linear algorithms [linear function Gaussian process regression (linear-GPR), linear function support vector machine (linear-SVM), partial least squares regression (PLSR), multiple linear regression (MLR) and principal component regression (PCR)], three analogizers [radial basis function support vector machine (rbf-SVM), K-nearest neighbor (KNN) and radial basis function Gaussian process regression (rbf-GPR)], six symbolists [extreme gradient boosting (XGBoost), Cubist, random forest (RF), multivariate adaptive regression splines (MARS), gradient boosting machine (GBM), and classification and regression tree (CART)] and two connectionists [principal component analysis artificial neural network (pca-ANN) and deep neural network (DNN)] were used to build regression-based QSAR models for 14 public data sets comprising nine physicochemical properties and five toxicity endpoints. The results show that rbf-SVM, rbf-GPR, XGBoost and DNN generally perform better than the other algorithms. Overall, the algorithms can be ranked from best to worst as follows: rbf-SVM > XGBoost > rbf-GPR > Cubist > GBM > DNN > RF > pca-ANN > MARS > linear-GPR ≈ KNN > linear-SVM ≈ PLSR > CART ≈ PCR ≈ MLR. In terms of prediction accuracy and computational efficiency, SVM and XGBoost are recommended for regression learning on small data sets, and XGBoost is an excellent choice for large data sets. We then investigated the performance of ensemble models that integrate the predictions of multiple ML algorithms. The results show that ensembles of two or three algorithms from different categories can indeed improve on the predictions of the best individual ML algorithms.
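
The ensemble idea in the abstract can be illustrated with a small sketch. The abstract does not say how the predictions of the individual algorithms were integrated, so the code below assumes the simplest option, an unweighted average of the per-compound predictions from two models in different categories. The function names (`average_predictions`, `rmse`) and all numbers are hypothetical and serve only as an illustration; they are not taken from the study.

```python
from math import sqrt


def average_predictions(per_model_predictions):
    """Return the unweighted mean prediction per compound across models."""
    if not per_model_predictions:
        raise ValueError("need predictions from at least one model")
    n_compounds = len(per_model_predictions[0])
    if any(len(p) != n_compounds for p in per_model_predictions):
        raise ValueError("every model must predict the same compounds")
    n_models = len(per_model_predictions)
    return [sum(values) / n_models for values in zip(*per_model_predictions)]


def rmse(y_true, y_pred):
    """Root-mean-square error between observed and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Made-up numbers for illustration only; they are not results from the study.
observed = [1.0, 2.0, 3.0]
kernel_model_pred = [1.2, 1.8, 3.4]  # hypothetical output of an analogizer
tree_model_pred = [0.8, 2.2, 3.0]    # hypothetical output of a symbolist

# Assumption: predictions are combined by a plain average.
ensemble_pred = average_predictions([kernel_model_pred, tree_model_pred])

for name, pred in [
    ("kernel model", kernel_model_pred),
    ("tree model", tree_model_pred),
    ("ensemble", ensemble_pred),
]:
    print(f"{name}: RMSE = {rmse(observed, pred):.3f}")
```

In this toy case the two models err in opposite directions on some compounds, so averaging cancels part of the error. Whether an ensemble helps on a real QSAR data set depends on how correlated the errors of the individual models are, and has to be checked on held-out data.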
