B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Department of Life Science Informatics and Data Science, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 5/6, 53115, Bonn, Germany.
Sci Rep. 2023 Apr 12;13(1):5983. doi: 10.1038/s41598-023-33215-x.
The random forest (RF) and support vector machine (SVM) methods are mainstays in molecular machine learning (ML) and compound property prediction. We have explored in detail how binary classification models derived using these algorithms arrive at their predictions. To this end, approaches from explainable artificial intelligence (XAI) are applicable, such as the Shapley value concept originating from game theory, which we adapted and further extended for our analysis. In large-scale activity-based compound classification using models derived from training sets of increasing size, RF and SVM with the Tanimoto kernel produced very similar predictions that could hardly be distinguished. However, Shapley value analysis revealed that their learning characteristics systematically differed and that chemically intuitive explanations of accurate RF and SVM predictions had different origins.
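The Shapley value assigns each feature a share of a prediction by averaging its marginal contribution over all possible feature coalitions. The paper's adapted and extended variant is not reproduced here; the following is only a minimal, self-contained sketch of the underlying game-theoretic calculation for a hypothetical toy "model" scoring binary fingerprint bits (the weights and interaction term are illustrative assumptions, not taken from the study):

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating all feature coalitions.

    value_fn(S) maps a set of feature indices to the model output when
    only those features are 'present'. Exact enumeration is feasible
    only for small n_features (2**n coalitions)."""
    players = range(n_features)
    phi = [0.0] * n_features
    for i in players:
        others = [j for j in players if j != i]
        for size in range(len(others) + 1):
            for subset in combinations(others, size):
                S = frozenset(subset)
                # Shapley weight: |S|! (n - |S| - 1)! / n!
                weight = (factorial(len(S)) * factorial(n_features - len(S) - 1)
                          / factorial(n_features))
                phi[i] += weight * (value_fn(S | {i}) - value_fn(S))
    return phi

# Hypothetical scoring function over three fingerprint bits, with an
# interaction between bits 0 and 1 (illustrative weights only).
weights = [0.5, 0.3, 0.2]
def predict(present):
    score = sum(weights[j] for j in present)
    if 0 in present and 1 in present:
        score += 0.4  # interaction credit is split between bits 0 and 1
    return score

phi = shapley_values(predict, 3)
# Efficiency property: contributions sum to the full-model prediction
assert abs(sum(phi) - predict({0, 1, 2})) < 1e-9
```

For this toy function the interaction term is shared equally between bits 0 and 1, while bit 2 receives exactly its independent weight, illustrating how Shapley analysis attributes a prediction to individual fingerprint features.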