Do we need different machine learning algorithms for QSAR modeling? A comprehensive assessment of 16 machine learning algorithms on 14 QSAR data sets.

Affiliations

College of Pharmaceutical Sciences, Hangzhou Institute of Innovative Medicine, Zhejiang University, P. R. China.

Xiangya School of Pharmaceutical Sciences, Central South University, P. R. China.

Publication information

Brief Bioinform. 2021 Jul 20;22(4). doi: 10.1093/bib/bbaa321.

DOI:10.1093/bib/bbaa321

PMID:33313673

Abstract

Although a wide variety of machine learning (ML) algorithms have been used to learn quantitative structure-activity relationships (QSARs), there is no agreed single best algorithm for QSAR learning. A comprehensive understanding of the performance characteristics of the ML algorithms popular in QSAR learning is therefore highly desirable. In this study, five linear algorithms [linear function Gaussian process regression (linear-GPR), linear function support vector machine (linear-SVM), partial least squares regression (PLSR), multiple linear regression (MLR) and principal component regression (PCR)], three analogizers [radial basis function support vector machine (rbf-SVM), K-nearest neighbor (KNN) and radial basis function Gaussian process regression (rbf-GPR)], six symbolists [extreme gradient boosting (XGBoost), Cubist, random forest (RF), multivariate adaptive regression splines (MARS), gradient boosting machine (GBM), and classification and regression tree (CART)] and two connectionists [principal component analysis artificial neural network (pca-ANN) and deep neural network (DNN)] were employed to learn regression-based QSAR models for 14 public data sets comprising nine physicochemical properties and five toxicity endpoints. The results show that rbf-SVM, rbf-GPR, XGBoost and DNN generally perform better than the other algorithms. The overall performance of the algorithms can be ranked from best to worst as follows: rbf-SVM > XGBoost > rbf-GPR > Cubist > GBM > DNN > RF > pca-ANN > MARS > linear-GPR ≈ KNN > linear-SVM ≈ PLSR > CART ≈ PCR ≈ MLR. In terms of prediction accuracy and computational efficiency, SVM and XGBoost are recommended for regression learning on small data sets, and XGBoost is an excellent choice for large data sets. We then investigated the performance of ensemble models that integrate the predictions of multiple ML algorithms. The results show that ensembles of two or three algorithms from different categories can indeed improve on the predictions of the best individual ML algorithms.
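
The ensemble idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how the predictions were integrated, so simple averaging is an assumption here, as is the use of RMSE as the score. The helper names (`rmse`, `average_ensemble`, `rank_ensembles`) are hypothetical, and the numbers are toy values invented for illustration, not results from the paper.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def rmse(y_true, y_pred):
    """Root-mean-square error between observed and predicted values."""
    return sqrt(mean((t - p) ** 2 for t, p in zip(y_true, y_pred)))


def average_ensemble(prediction_lists):
    """Average the predictions of several models, compound by compound."""
    return [mean(values) for values in zip(*prediction_lists)]


def rank_ensembles(y_true, predictions, sizes=(2, 3)):
    """Score every single model and every ensemble of the given sizes by RMSE.

    Returns (member names, RMSE) pairs sorted from lowest to highest error.
    """
    scores = {(name,): rmse(y_true, preds) for name, preds in predictions.items()}
    for size in sizes:
        for combo in combinations(sorted(predictions), size):
            blended = average_ensemble([predictions[name] for name in combo])
            scores[combo] = rmse(y_true, blended)
    return sorted(scores.items(), key=lambda item: item[1])


# Toy numbers invented for illustration only; they are not results from the paper.
y_true = [1.0, 2.0, 3.0, 4.0]
predictions = {
    "rbf-SVM": [1.2, 2.2, 3.2, 4.2],  # overestimates every value by 0.2
    "XGBoost": [0.8, 1.8, 2.8, 3.8],  # underestimates every value by 0.2
    "DNN": [1.3, 1.9, 3.3, 3.9],
}

for members, score in rank_ensembles(y_true, predictions):
    print(" + ".join(members), round(score, 3))
```

In this toy case the two models with opposite biases cancel each other when averaged, which is one plausible reason why combining algorithms from different categories could beat the best single model. Real ensembles would be scored on held-out data rather than on the data used for fitting.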

Similar articles

Comparing supervised and semi-supervised Machine Learning Models on Diagnosing Breast Cancer.

Ann Med Surg (Lond). 2021 Jan 8;62:53-64. doi: 10.1016/j.amsu.2020.12.043. eCollection 2021 Feb.

ADMET Evaluation in Drug Discovery. 18. Reliable Prediction of Chemical-Induced Urinary Tract Toxicity by Boosting Machine Learning Approaches.

Mol Pharm. 2017 Nov 6;14(11):3935-3953. doi: 10.1021/acs.molpharmaceut.7b00631. Epub 2017 Oct 27.

Optimizing neural networks for medical data sets: A case study on neonatal apnea prediction.

Artif Intell Med. 2019 Jul;98:59-76. doi: 10.1016/j.artmed.2019.07.008. Epub 2019 Jul 25.

Evaluation of QSAR Equations for Virtual Screening.

Int J Mol Sci. 2020 Oct 22;21(21):7828. doi: 10.3390/ijms21217828.

Artificial intelligence in clinical care amidst COVID-19 pandemic: A systematic review.

Comput Struct Biotechnol J. 2021;19:2833-2850. doi: 10.1016/j.csbj.2021.05.010. Epub 2021 May 7.

Improved Multiclassification of Schizophrenia Based on Xgboost and Information Fusion for Small Datasets.

Comput Math Methods Med. 2022 Jul 19;2022:1581958. doi: 10.1155/2022/1581958. eCollection 2022.

On the use of machine learning algorithms in forensic anthropology.

Leg Med (Tokyo). 2020 Nov;47:101771. doi: 10.1016/j.legalmed.2020.101771. Epub 2020 Aug 6.

Artificial intelligence to predict outcomes of head and neck radiotherapy.

Clin Transl Radiat Oncol. 2023 Jan 31;39:100590. doi: 10.1016/j.ctro.2023.100590. eCollection 2023 Mar.

A comparative evaluation of the generalised predictive ability of eight machine learning algorithms across ten clinical metabolomics data sets for binary classification.

Metabolomics. 2019 Nov 15;15(12):150. doi: 10.1007/s11306-019-1612-4.

Cited by

Machine Learning Framework for Ovarian Cancer Diagnostics Using Plasma Lipidomics and Metabolomics.

Int J Mol Sci. 2025 Jul 10;26(14):6630. doi: 10.3390/ijms26146630.

Q-GEM: Quantum Chemistry Knowledge Fusion Geometry-Enhanced Molecular Representation for Property Prediction.

Adv Sci (Weinh). 2025 Sep;12(33):e04867. doi: 10.1002/advs.202504867. Epub 2025 Jun 20.

From Patterns to Pills: How Informatics Is Shaping Medicinal Chemistry.

Pharmaceutics. 2025 May 5;17(5):612. doi: 10.3390/pharmaceutics17050612.

QSAR-Based Drug Repurposing and RNA-Seq Metabolic Networks Highlight Treatment Opportunities for Hepatocellular Carcinoma Through Pyrimidine Starvation.

Cancers (Basel). 2025 Mar 6;17(5):903. doi: 10.3390/cancers17050903.

A metabolic fingerprint of ovarian cancer: a novel diagnostic strategy employing plasma EV-based metabolomics and machine learning algorithms.

J Ovarian Res. 2025 Feb 12;18(1):26. doi: 10.1186/s13048-025-01590-w.

Barlow Twins deep neural network for advanced 1D drug-target interaction prediction.

J Cheminform. 2025 Feb 5;17(1):18. doi: 10.1186/s13321-025-00952-2.

Evaluation of Machine Learning Based QSAR Models for the Classification of Lung Surfactant Inhibitors.

Environ Health (Wash). 2024 Sep 20;2(12):912-917. doi: 10.1021/envhealth.4c00118. eCollection 2024 Dec 20.

Deepmol: an automated machine and deep learning framework for computational chemistry.

J Cheminform. 2024 Dec 5;16(1):136. doi: 10.1186/s13321-024-00937-7.

Genetic algorithm multiple linear regression and machine learning-driven QSTR modeling for the acute toxicity of sterol biosynthesis inhibitor fungicides.

Heliyon. 2024 Aug 15;10(16):e36373. doi: 10.1016/j.heliyon.2024.e36373. eCollection 2024 Aug 30.

De novo drug design through gradient-based regularized search in information-theoretically controlled latent space.

J Comput Aided Mol Des. 2024 Aug 27;38(1):32. doi: 10.1007/s10822-024-00571-3.
