Kaneko Hiromasa
Department of Applied Chemistry, School of Science and Technology, Meiji University, Kawasaki, Japan.
Anal Sci Adv. 2022 Sep 7;3(9-10):278-287. doi: 10.1002/ansa.202200018. eCollection 2022 Oct.
In molecular design, material design, process design, and process control, it is important not only to use a dataset to construct a model with high predictive ability between explanatory features x and objective features y, but also to interpret the constructed model. One index of the importance of features in x is permutation feature importance (PFI), which can be combined with any regressor or classifier. However, PFI becomes unstable when the number of samples is low because calculating it requires dividing the dataset into training and validation data. Additionally, when x contains strongly correlated features, the PFI of these features is estimated to be low. Hence, a cross-validated PFI (CVPFI) method is proposed. CVPFI can be calculated stably, even with a small number of samples, because model construction and feature evaluation are repeated based on cross-validation. Furthermore, by considering the absolute correlation coefficients between features, feature importance can be evaluated appropriately even when x contains strongly correlated features. Case studies using numerical simulation data and actual compound data showed that CVPFI evaluates feature importance more appropriately than PFI when the number of samples is low, when linear and nonlinear relationships between x and y are mixed, when there are strong correlations between features in x, and when quantised and biased features exist in x. Python code for CVPFI is available at https://github.com/hkaneko1985/dcekit.
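The idea described in the abstract can be illustrated with a minimal sketch. It is not the author's implementation (that is in the dcekit repository) and does not reproduce its API. The function names fit_linear, r2, and cv_permutation_importance are hypothetical, and the least-squares model is only a stand-in for any regressor. The rule used for correlated features, shuffling feature j together with feature i with probability equal to their absolute correlation coefficient, is an assumption made here to show one way the correlations could enter the calculation.

```python
import numpy as np


def fit_linear(X, y):
    """Ordinary least squares with an intercept; a stand-in for any regressor."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda X_new: np.column_stack([np.ones(len(X_new)), X_new]) @ coef


def r2(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def cv_permutation_importance(X, y, fit=fit_linear, n_folds=5, n_repeats=10, seed=0):
    """Drop in out-of-fold r2 when each feature is shuffled (hypothetical sketch)."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    # Absolute correlation coefficients between all pairs of features.
    abs_corr = np.abs(np.corrcoef(X, rowvar=False))
    folds = np.array_split(rng.permutation(n_samples), n_folds)

    y_pred_base = np.empty(n_samples)
    y_pred_perm = np.empty((n_repeats, n_features, n_samples))
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n_samples), test_idx)
        # The model is rebuilt for every fold, so every sample is evaluated
        # exactly once as held-out data.
        predict = fit(X[train_idx], y[train_idx])
        y_pred_base[test_idx] = predict(X[test_idx])
        for r in range(n_repeats):
            for i in range(n_features):
                X_perm = X[test_idx].copy()
                # ASSUMPTION: feature i is always shuffled; feature j is
                # shuffled too with probability |r_ij|.
                shuffle = rng.random(n_features) < abs_corr[i]
                shuffle[i] = True
                for j in np.flatnonzero(shuffle):
                    X_perm[:, j] = rng.permutation(X_perm[:, j])
                y_pred_perm[r, i, test_idx] = predict(X_perm)

    base_score = r2(y, y_pred_base)
    perm_scores = np.array(
        [[r2(y, y_pred_perm[r, i]) for i in range(n_features)] for r in range(n_repeats)]
    )
    # Importance of feature i: mean loss of r2 over the repeated shuffles.
    return base_score - perm_scores.mean(axis=0)
```

Calling cv_permutation_importance(X, y) on a two-dimensional array X and a one-dimensional array y returns one value per feature, where a larger value means that shuffling the feature degraded the out-of-fold predictions more. Shuffling is done within each held-out fold, which is a simplification when folds are very small.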