Chen Chia-Hsiu, Tanaka Kenichi, Kotera Masaaki, Funatsu Kimito
Department of Chemical System Engineering, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-8656, Japan.
J Cheminform. 2020 Mar 30;12(1):19. doi: 10.1186/s13321-020-0417-9.
Ensemble learning improves machine learning results by combining several models, yielding better predictive performance than any single model. It also benefits and accelerates research in quantitative structure-activity relationship (QSAR) and quantitative structure-property relationship (QSPR) modeling. With the growing number of ensemble learning models such as random forest, however, the effectiveness of QSAR/QSPR will be limited by the models' inability to explain their predictions to researchers. In fact, many implementations of ensemble learning models can quantify the overall importance of each feature. For example, feature importance allows us to assess the relative importance of features and to interpret the predictions. However, different ensemble learning methods or implementations may select different features for interpretation. In this paper, we compared the predictability and interpretability of four well-established ensemble learning models (random forest, extremely randomized trees, adaptive boosting and gradient boosting) on regression and binary classification tasks. We then built blending methods that summarize the four ensemble learning methods. By summarizing the individual predictions of the different models, the blending method achieved better performance and a unified interpretation. The important features identified in two case studies, which gave valuable information about compound properties, are discussed in detail. QSPR modeling with interpretable machine learning techniques can make chemical design more efficient, confirm hypotheses and establish knowledge for better results.
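The blending idea described in the abstract can be illustrated with a minimal sketch. It assumes simple unweighted averaging, which the abstract does not specify, so the scheme actually used in the paper may differ. The helper names (`normalize`, `blend_importances`, `blend_predictions`) and all numbers are hypothetical and purely illustrative. In practice the per-model inputs would come from fitted models, for example the `feature_importances_` attribute and `predict` method of scikit-learn's tree ensembles.

```python
from statistics import mean


def normalize(scores):
    """Scale non-negative importance scores so that they sum to 1."""
    total = sum(scores)
    if total <= 0:
        raise ValueError("importance scores must have a positive sum")
    return [s / total for s in scores]


def blend_importances(per_model):
    """Average normalized importances feature by feature across models."""
    normalized = [normalize(scores) for scores in per_model.values()]
    return [mean(column) for column in zip(*normalized)]


def blend_predictions(per_model):
    """Average the predictions of all models sample by sample."""
    return [mean(column) for column in zip(*per_model.values())]


# Made-up importance scores for three hypothetical descriptors (assumption).
importances = {
    "random_forest": [0.50, 0.30, 0.20],
    "extra_trees": [0.40, 0.35, 0.25],
    "adaboost": [0.60, 0.10, 0.30],
    "gradient_boosting": [0.50, 0.20, 0.30],
}

# Made-up regression predictions for two hypothetical compounds (assumption).
predictions = {
    "random_forest": [1.0, 2.0],
    "extra_trees": [1.2, 1.8],
    "adaboost": [0.8, 2.2],
    "gradient_boosting": [1.0, 2.0],
}

blended_importance = blend_importances(importances)
blended_prediction = blend_predictions(predictions)

# Feature indices ordered from most to least important after blending.
ranking = sorted(
    range(len(blended_importance)),
    key=lambda i: blended_importance[i],
    reverse=True,
)
print(blended_importance, blended_prediction, ranking)
```

Normalizing each model's scores before averaging keeps one model's scale from dominating the blended ranking. A weighted average, for instance by validation performance, would be a natural variation on this sketch.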