Williamson Brian D, Gilbert Peter B, Simon Noah R, Carone Marco
Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center.
Department of Biostatistics, University of Washington.
J Am Stat Assoc. 2023;118(543):1645-1658. doi: 10.1080/01621459.2021.2003200. Epub 2022 Jan 5.
In many applications, it is of interest to assess the relative contribution of features (or subsets of features) toward the goal of predicting a response; in other words, to gauge the variable importance of features. Most recent work on variable importance assessment has focused on describing the importance of features within the confines of a given prediction algorithm. However, such assessment does not necessarily characterize the prediction potential of features, and may provide a misleading reflection of the intrinsic value of these features. To address this limitation, we propose a general framework for nonparametric inference on interpretable algorithm-agnostic variable importance. We define variable importance as a population-level contrast between the oracle predictiveness of all available features versus all features except those under consideration. We propose a nonparametric efficient estimation procedure that allows the construction of valid confidence intervals, even when machine learning techniques are used. We also outline a valid strategy for testing the null importance hypothesis. Through simulations, we show that our proposal has good operating characteristics, and we illustrate its use with data from a study of an antibody against HIV-1 infection.
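The definition in the abstract, a contrast between the predictiveness achievable with all features and the predictiveness achievable without the features of interest, can be illustrated with a small numerical sketch. Everything below is an assumption made for illustration and is not taken from the paper: R-squared stands in for the predictiveness measure, ordinary least squares stands in for the prediction algorithm, the data are simulated, and the function names (`fit_linear`, `r_squared`, `variable_importance`) are hypothetical. This is a simple holdout plug-in difference; it is not the efficient estimator proposed by the authors and it produces no confidence interval or test.

```python
import numpy as np


def fit_linear(X, y):
    """Least-squares fit with an intercept; returns a prediction function.

    Assumption: a linear learner is used only to keep the sketch short.
    """
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda Z: np.column_stack([np.ones(len(Z)), Z]) @ coef


def r_squared(y, pred):
    """R-squared, used here as the (assumed) predictiveness measure."""
    return 1.0 - np.mean((y - pred) ** 2) / np.var(y)


def variable_importance(X, y, drop, seed=0):
    """Holdout estimate of predictiveness(all features) minus
    predictiveness(all features except the columns listed in `drop`)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    train, test = idx[: len(y) // 2], idx[len(y) // 2:]
    keep = [j for j in range(X.shape[1]) if j not in drop]

    # Fit the full and reduced regressions on the training half.
    full = fit_linear(X[train], y[train])
    reduced = fit_linear(X[train][:, keep], y[train])

    # Evaluate both on the held-out half and take the contrast.
    v_full = r_squared(y[test], full(X[test]))
    v_reduced = r_squared(y[test], reduced(X[test][:, keep]))
    return v_full - v_reduced


# Simulated data: feature 0 is strongly predictive, feature 2 is pure noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 3))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=2000)

vim_0 = variable_importance(X, y, drop=[0])
vim_2 = variable_importance(X, y, drop=[2])
print(f"importance of feature 0: {vim_0:.3f}")
print(f"importance of feature 2: {vim_2:.3f}")
```

In this simulated setting the estimated importance of feature 0 should be large and that of feature 2 close to zero. The near-zero case corresponds to the null importance hypothesis mentioned in the abstract, which is why a dedicated testing strategy is needed rather than a naive interval around the plug-in difference.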