Efficient nonparametric statistical inference on population feature importance using Shapley values.

Authors

Williamson Brian D, Feng Jean

Affiliations

Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, WA.

Department of Biostatistics, University of Washington, Seattle, WA.

Publication

Proc Mach Learn Res. 2020 Jul;119:10282-10291.

Abstract

The true population-level importance of a variable in a prediction task provides useful knowledge about the underlying data-generating mechanism and can help in deciding which measurements to collect in subsequent experiments. Valid statistical inference on this importance is a key component in understanding the population of interest. We present a computationally efficient procedure for estimating and obtaining valid statistical inference on the Shapley Population Variable Importance Measure (SPVIM). Although the computational complexity of the true SPVIM scales exponentially with the number of variables, we propose an estimator based on randomly sampling only Θ(n) feature subsets given n observations. We prove that our estimator converges at an asymptotically optimal rate. Moreover, by deriving the asymptotic distribution of our estimator, we construct valid confidence intervals and hypothesis tests. Our procedure has good finite-sample performance in simulations, and for an in-hospital mortality prediction task produces similar variable importance estimates when different machine learning algorithms are applied.
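
To make the computational point concrete, the sketch below contrasts exact Shapley values, which enumerate every feature subset, with a generic Monte Carlo approximation that averages marginal contributions over random feature orderings. It is an illustration of why sampling helps and is not the SPVIM estimator from the paper: the paper's procedure samples feature subsets and estimates predictiveness from data, whereas here the value function `toy_value`, its coefficients, and the function names `exact_shapley` and `sampled_shapley` are all hypothetical and chosen only for the example.

```python
import itertools
import math
import random


def toy_value(subset):
    """Hypothetical predictiveness of a feature subset (an assumption for this
    sketch): additive contributions plus an interaction between features 0 and 1."""
    base = {0: 0.3, 1: 0.2, 2: 0.1, 3: 0.0}
    value = sum(base[k] for k in subset)
    if 0 in subset and 1 in subset:
        value += 0.1
    return value


def exact_shapley(value, p):
    """Exact Shapley values; cost grows as 2^(p-1) subsets per feature."""
    phi = []
    for j in range(p):
        others = [k for k in range(p) if k != j]
        total = 0.0
        for size in range(p):
            # Standard Shapley weight for a subset of this size.
            weight = 1.0 / (p * math.comb(p - 1, size))
            for combo in itertools.combinations(others, size):
                s = frozenset(combo)
                total += weight * (value(s | {j}) - value(s))
        phi.append(total)
    return phi


def sampled_shapley(value, p, n_draws, seed=0):
    """Monte Carlo approximation: average each feature's marginal contribution
    over n_draws random orderings of the features."""
    rng = random.Random(seed)
    phi = [0.0] * p
    for _ in range(n_draws):
        order = list(range(p))
        rng.shuffle(order)
        included = set()
        previous = value(frozenset(included))
        for j in order:
            included.add(j)
            current = value(frozenset(included))
            phi[j] += (current - previous) / n_draws
            previous = current
    return phi


if __name__ == "__main__":
    print("exact  :", exact_shapley(toy_value, 4))
    print("sampled:", sampled_shapley(toy_value, 4, n_draws=2000))
```

In both versions the values sum to the difference between the value of the full feature set and the value of the empty set, which is the efficiency property of Shapley values. The sampled version makes `n_draws * p` calls to the value function instead of a number that grows exponentially in `p`.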
