SINTEF DIGITAL, Oslo, Norway.
Department of Mathematical Sciences, Norwegian University of Science and Technology, Trondheim, Norway.
PLoS Comput Biol. 2023 Mar 14;19(3):e1010963. doi: 10.1371/journal.pcbi.1010963. eCollection 2023 Mar.
Estimating feature importance, that is, the contribution of a feature to a single prediction or to several predictions, is an essential aspect of explaining data-based models. Besides explaining the model itself, an equally relevant question is which features are important in the underlying data-generating process. We present a Shapley-value-based framework for inferring the importance of individual features, including the uncertainty in the estimator. We build upon the recently published model-agnostic feature importance score SAGE (Shapley additive global importance) and introduce Sub-SAGE. For tree-based models, Sub-SAGE has the advantage that it can be estimated without computationally expensive resampling. We argue that, for all model types, the uncertainty in our Sub-SAGE estimator can be estimated using bootstrapping, and we demonstrate the approach for tree ensemble methods. The framework is exemplified on synthetic data as well as on large genotype data, where feature importance is inferred with respect to obesity.