School of Information Science and Technology, Northeast Normal University, Changchun, 130117, China.
Institute of Functional Material Chemistry, Faculty of Chemistry, Northeast Normal University, Changchun, 130024, China.
J Chem Inf Model. 2019 May 28;59(5):1849-1857. doi: 10.1021/acs.jcim.8b00878. Epub 2019 Apr 8.
Machine learning has shown powerful capabilities in many areas. However, machine learning models are mostly database dependent: a new model is required whenever the database changes. A universal model that accommodates the widest variety of databases is therefore highly desirable. Such universality may be achieved by ensemble learning, which integrates multiple learners to meet the demands of diverse databases. We therefore propose a general procedure for establishing learning ensembles based on noncovalent interaction (NCI) databases. In addition, accurate NCI computation is quite demanding for first-principles methods, so a competent machine learning model can be an efficient way to obtain high NCI accuracy with minimal computational resources. With these aspects in mind, multiple schemes of ensemble learning models (Bagging, Boosting, and Stacking frameworks) are explored in this study. The models are based on various low-level density functional theory (DFT) calculations for the benchmark databases S66, S22, and X40. All NCIs computed by the DFT calculations can be improved to high-level accuracy (root-mean-square error, RMSE = 0.22 kcal/mol relative to the CCSD(T)/CBS benchmark) by the established ensemble learning models. Compared with single machine learning models, ensemble models show better accuracy (the RMSE of the best model is further lowered by ∼25%), robustness, and goodness-of-fit according to evaluation parameters suggested by the OECD. Among the ensemble learning models, heterogeneous Stacking ensemble models show the most valuable application potential. The standardized procedure for constructing learning ensembles has been applied successfully to several NCI data sets and may also be applicable to other chemical databases.
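To make the idea of a heterogeneous Stacking ensemble concrete, the following is a minimal sketch using scikit-learn. It is not the authors' procedure. The data are synthetic stand-ins (no S66, S22, or X40 values are used), the three "low-level methods" with their scale factors and offsets are invented for illustration, and the choice of base learners (ridge regression, random forest, gradient boosting) is an assumption. The sketch only shows the general pattern: low-level interaction energies serve as features, a reference energy is the target, and several different learners are combined by a meta-learner.

```python
import numpy as np
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic "reference" interaction energies in kcal/mol (illustrative only).
n_complexes = 200
e_ref = rng.uniform(-20.0, -0.5, size=n_complexes)

# Three hypothetical low-level methods, each with an invented systematic
# scale error, constant offset, and random noise.
scales = np.array([0.85, 1.10, 0.95])
offsets = np.array([0.8, -0.5, 0.3])
noise = rng.normal(0.0, 0.3, size=(n_complexes, 3))
X = e_ref[:, None] * scales + offsets + noise

X_train, X_test, y_train, y_test = train_test_split(
    X, e_ref, test_size=0.25, random_state=0
)


def rmse(y_true, y_pred):
    """Root-mean-square error between two arrays."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


# Heterogeneous stacking: different learner families as base models,
# combined by a ridge meta-learner trained on cross-validated predictions.
stack = StackingRegressor(
    estimators=[
        ("ridge", Ridge(alpha=1.0)),
        ("rf", RandomForestRegressor(n_estimators=100, random_state=0)),
        ("gbr", GradientBoostingRegressor(random_state=0)),
    ],
    final_estimator=Ridge(alpha=1.0),
    cv=5,
)
stack.fit(X_train, y_train)

# Compare the uncorrected first method against the stacked correction.
rmse_raw = rmse(y_test, X_test[:, 0])
rmse_stack = rmse(y_test, stack.predict(X_test))
print(f"RMSE of uncorrected method 0: {rmse_raw:.2f} kcal/mol")
print(f"RMSE of stacked ensemble:     {rmse_stack:.2f} kcal/mol")
```

The errors printed here reflect only the synthetic noise model and say nothing about the accuracy reported in the abstract. On a real benchmark the features would be the computed low-level energies, possibly with additional descriptors, and the target would be the CCSD(T)/CBS reference values.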