Williamson Brian D, Feng Jean
Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, WA.
Department of Biostatistics, University of Washington, Seattle, WA.
Proc Mach Learn Res. 2020 Jul;119:10282-10291.
The true population-level importance of a variable in a prediction task provides useful knowledge about the underlying data-generating mechanism and can help in deciding which measurements to collect in subsequent experiments. Valid statistical inference on this importance is a key component in understanding the population of interest. We present a computationally efficient procedure for estimating and obtaining valid statistical inference on the Shapley Population Variable Importance Measure (SPVIM). Although the computational complexity of the true SPVIM scales exponentially with the number of variables, we propose an estimator based on randomly sampling only Θ(n) feature subsets given n observations. We prove that our estimator converges at an asymptotically optimal rate. Moreover, by deriving the asymptotic distribution of our estimator, we construct valid confidence intervals and hypothesis tests. Our procedure has good finite-sample performance in simulations, and for an in-hospital mortality prediction task produces similar variable importance estimates when different machine learning algorithms are applied.
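The gap between exact Shapley computation and a sampling-based approximation can be illustrated with a minimal sketch. It is not the authors' SPVIM estimator: it uses plain permutation sampling as a stand-in technique, and the value function `toy_value` (a made-up "predictiveness" score for each feature subset) and all function names are hypothetical, chosen only for illustration.

```python
import itertools
import math
import random

# Hypothetical per-feature contributions to predictiveness (illustration only).
CONTRIB = [0.3, 0.2, 0.1, 0.0]


def toy_value(subset):
    """Made-up predictiveness of a feature subset: additive contributions
    plus a 0.1 bonus when features 0 and 1 are both present."""
    total = sum(CONTRIB[j] for j in subset)
    if 0 in subset and 1 in subset:
        total += 0.1
    return total


def exact_shapley(value, p):
    """Exact Shapley values; enumerates 2^(p-1) subsets for each feature."""
    phi = [0.0] * p
    for j in range(p):
        others = [k for k in range(p) if k != j]
        for size in range(p):
            weight = (math.factorial(size) * math.factorial(p - size - 1)
                      / math.factorial(p))
            for combo in itertools.combinations(others, size):
                s = frozenset(combo)
                phi[j] += weight * (value(s | {j}) - value(s))
    return phi


def sampled_shapley(value, p, n_perms, seed=0):
    """Monte Carlo approximation: average marginal gains over random
    feature orderings instead of enumerating every subset."""
    rng = random.Random(seed)
    phi = [0.0] * p
    for _ in range(n_perms):
        order = list(range(p))
        rng.shuffle(order)
        current = frozenset()
        prev = value(current)
        for j in order:
            current = current | {j}
            cur = value(current)
            phi[j] += (cur - prev) / n_perms
            prev = cur
    return phi


exact = exact_shapley(toy_value, 4)
approx = sampled_shapley(toy_value, 4, n_perms=2000)
print("exact: ", [round(x, 3) for x in exact])
print("approx:", [round(x, 3) for x in approx])
```

In this toy setting the interaction bonus is split evenly between features 0 and 1, and both the exact and the sampled values sum to the predictiveness of the full feature set. The sketch only conveys why sampling avoids exponential enumeration; it does not provide the confidence intervals or hypothesis tests described in the abstract.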