Boldini Davide, Grisoni Francesca, Kuhn Daniel, Friedrich Lukas, Sieber Stephan A
Department of Bioscience, Center for Functional Protein Assemblies (CPA), Technical University of Munich, Garching bei Munich, Germany.
Department of Biomedical Engineering, Institute for Complex Molecular Sciences, Eindhoven University of Technology, Eindhoven, The Netherlands.
J Cheminform. 2023 Aug 28;15(1):73. doi: 10.1186/s13321-023-00743-7.
Decision tree ensembles are among the most robust, high-performing and computationally efficient machine learning approaches for quantitative structure-activity relationship (QSAR) modeling. Among them, gradient boosting has recently garnered particular attention for its performance in data science competitions, virtual screening campaigns, and bioactivity prediction. However, different variants of gradient boosting exist, the most popular being XGBoost, LightGBM and CatBoost. Our study provides the first comprehensive comparison of these approaches for QSAR. To this end, we trained 157,590 gradient boosting models, which were evaluated on 16 datasets and 94 endpoints, comprising 1.4 million compounds in total. Our results show that XGBoost generally achieves the best predictive performance, while LightGBM requires the least training time, especially for larger datasets. In terms of feature importance, the models surprisingly rank molecular features differently, reflecting differences in regularization techniques and decision tree structures. Thus, expert knowledge must always be employed when evaluating data-driven explanations of bioactivity. Furthermore, our results show that the relevance of each hyperparameter varies greatly across datasets and that it is crucial to optimize as many hyperparameters as possible to maximize the predictive performance. In conclusion, our study provides the first set of guidelines for cheminformatics practitioners to effectively train, optimize and evaluate gradient boosting models for virtual screening and QSAR applications.
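The observation that different gradient boosting implementations rank molecular features differently can be checked in practice by comparing importance vectors with a rank correlation. The following is a minimal sketch, not the procedure used in the study: it computes a Spearman correlation in plain Python, and the feature names, the labels `model_a` and `model_b`, and all importance values are made-up placeholders rather than results from the paper.

```python
from typing import List


def average_ranks(values: List[float]) -> List[float]:
    """Rank values from largest (rank 1) to smallest, averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the window over all entries tied with the current value.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(a: List[float], b: List[float]) -> float:
    """Spearman correlation: Pearson correlation of the tie-averaged ranks."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    if var_a == 0 or var_b == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / (var_a * var_b) ** 0.5


# Hypothetical importances for the same five features from two models.
# These numbers are illustrative placeholders only.
features = ["bit_12", "bit_87", "bit_301", "bit_512", "bit_940"]
model_a = [0.40, 0.25, 0.20, 0.10, 0.05]
model_b = [0.10, 0.35, 0.05, 0.30, 0.20]

agreement = spearman(model_a, model_b)
print(f"Rank agreement across {len(features)} features: {agreement:.2f}")
```

A value near 1 would indicate that the two models order the features similarly, while values near 0 or below suggest that importance-based explanations depend on the implementation and warrant expert review.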