Practical guidelines for the use of gradient boosting for molecular property prediction.

Authors

Boldini Davide, Grisoni Francesca, Kuhn Daniel, Friedrich Lukas, Sieber Stephan A

Affiliations

Department of Bioscience, Center for Functional Protein Assemblies (CPA), Technical University of Munich, Garching bei Munich, Germany.

Department of Biomedical Engineering, Institute for Complex Molecular Sciences, Eindhoven University of Technology, Eindhoven, The Netherlands.

Publication information

J Cheminform. 2023 Aug 28;15(1):73. doi: 10.1186/s13321-023-00743-7.

DOI: 10.1186/s13321-023-00743-7
PMID: 37641120
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10464382/

Abstract

Decision tree ensembles are among the most robust, high-performing and computationally efficient machine learning approaches for quantitative structure-activity relationship (QSAR) modeling. Among them, gradient boosting has recently garnered particular attention, for its performance in data science competitions, virtual screening campaigns, and bioactivity prediction. However, different variants of gradient boosting exist, the most popular being XGBoost, LightGBM and CatBoost. Our study provides the first comprehensive comparison of these approaches for QSAR. To this end, we trained 157,590 gradient boosting models, which were evaluated on 16 datasets and 94 endpoints, comprising 1.4 million compounds in total. Our results show that XGBoost generally achieves the best predictive performance, while LightGBM requires the least training time, especially for larger datasets. In terms of feature importance, the models surprisingly rank molecular features differently, reflecting differences in regularization techniques and decision tree structures. Thus, expert knowledge must always be employed when evaluating data-driven explanations of bioactivity. Furthermore, our results show that the relevance of each hyperparameter varies greatly across datasets and that it is crucial to optimize as many hyperparameters as possible to maximize the predictive performance. In conclusion, our study provides the first set of guidelines for cheminformatics practitioners to effectively train, optimize and evaluate gradient boosting models for virtual screening and QSAR applications.
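To make the terms in the abstract concrete, the following is a minimal pure-Python sketch of gradient boosting for regression with squared-error loss, using one-split trees (stumps). It is not XGBoost, LightGBM, or CatBoost, and it is not the authors' pipeline. The function names, the toy data (three binary features standing in for fingerprint bits), and the split-count importance measure are all assumptions made for illustration. It shows where two common hyperparameters (`n_estimators`, `learning_rate`) enter the algorithm and how a simple feature ranking can be read off a fitted ensemble.

```python
from itertools import product

# Toy gradient boosting with squared-error loss and one-split trees (stumps).
# Illustrative only: all names and data here are hypothetical.

def fit_stump(X, residuals):
    """Return the (feature, threshold, left_value, right_value) split with lowest squared error."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            lv, rv = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
            if best is None or err < best[0]:
                best = (err, j, t, lv, rv)
    if best is None:
        raise ValueError("no valid split: every feature is constant")
    return best[1:]

def predict_stump(stump, row):
    j, t, lv, rv = stump
    return lv if row[j] <= t else rv

def fit_gbm(X, y, n_estimators=50, learning_rate=0.1):
    """Fit an additive model: mean of y plus a shrunken sum of stumps."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_estimators):
        # For squared error, the negative gradient equals the residual.
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, row)
                 for p, row in zip(preds, X)]
    return base, learning_rate, stumps

def predict_gbm(model, row):
    base, lr, stumps = model
    return base + lr * sum(predict_stump(s, row) for s in stumps)

def mse(model, X, y):
    return sum((predict_gbm(model, row) - yi) ** 2 for row, yi in zip(X, y)) / len(y)

def split_counts(model, n_features):
    """Count how often each feature is used to split (one simple importance measure)."""
    counts = [0] * n_features
    for j, _, _, _ in model[2]:
        counts[j] += 1
    return counts

# Synthetic toy data: feature 0 has a strong effect, feature 1 a weak one, feature 2 none.
X = [list(bits) for bits in product([0, 1], repeat=3)]
y = [2.0 * x0 + 0.5 * x1 for x0, x1, _ in X]

model = fit_gbm(X, y, n_estimators=50, learning_rate=0.1)
counts = split_counts(model, n_features=3)
print("split counts per feature:", counts)
print("training MSE:", mse(model, X, y))
```

Split counts are only one of several importance definitions. Production libraries also offer gain-based measures and apply different regularization and tree-growth strategies, which is one plausible reason rankings can differ between implementations, as the abstract reports.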

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cf2b/10464382/f528378881d1/13321_2023_743_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cf2b/10464382/0b397d8c6e67/13321_2023_743_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cf2b/10464382/c034625212ef/13321_2023_743_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cf2b/10464382/7b6f5e1ce4f9/13321_2023_743_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cf2b/10464382/b5035cbdf4d4/13321_2023_743_Fig5_HTML.jpg
