Valid and efficient inference for nonparametric variable importance in two-phase studies.

Authors

Dai Guorong, Carroll Raymond J, Chen Jinbo

Affiliations

Department of Statistics and Data Science, School of Management, Fudan University, Shanghai 200433, China.

Department of Statistics, Texas A&M University, College Station, TX 77840, United States.

Publication

Biometrics. 2025 Jul 3;81(3). doi: 10.1093/biomtc/ujaf095.

Abstract

We consider a common nonparametric regression setting, where the data consist of a response variable Y, some easily obtainable covariates $\mathbf {X}$, and a set of costly covariates $\mathbf {Z}$. Before establishing predictive models for Y, a natural question arises: Is it worthwhile to include $\mathbf {Z}$ as predictors, given the additional cost of collecting data on $\mathbf {Z}$ for both training the models and predicting Y for future individuals? Therefore, we aim to conduct preliminary investigations to infer importance of $\mathbf {Z}$ in predicting Y in the presence of $\mathbf {X}$. To achieve this goal, we propose a nonparametric variable importance measure for $\mathbf {Z}$. It is defined as a parameter that aggregates maximum potential contributions of $\mathbf {Z}$ in single or multiple predictive models, with contributions quantified by general loss functions. Considering two-phase data that provide a large number of observations for $(Y,\mathbf {X})$ with the expensive $\mathbf {Z}$ measured only in a small subsample, we develop a novel approach to infer the proposed importance measure, accommodating missingness of $\mathbf {Z}$ in the sample by substituting functions of $(Y,\mathbf {X})$ for each individual's contribution to the predictive loss of models involving $\mathbf {Z}$. Our approach attains unified and efficient inference regardless of whether $\mathbf {Z}$ makes zero or positive contribution to predicting Y, a desirable yet surprising property owing to data incompleteness. As intermediate steps of our theoretical development, we establish novel results in two relevant research areas, semi-supervised inference and two-phase nonparametric estimation. Numerical results from both simulated and real data demonstrate superior performance of our approach.
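To make the idea of "contribution of $\mathbf {Z}$ quantified by a loss function" concrete, here is a minimal sketch under stated assumptions. It is not the estimator proposed in the paper: it uses a single squared-error loss, linear working models, simulated data, and only the phase-two complete cases, so it ignores the large phase-one sample of $(Y,\mathbf {X})$ that the paper's approach exploits. The function names (`ols_fit`, `mse`, `naive_importance`) and all simulation settings are hypothetical.

```python
import random

def ols_fit(rows, y):
    """Least-squares coefficients from the normal equations (Gauss-Jordan)."""
    p = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(p)]
    for c in range(p):
        # Partial pivoting, then eliminate column c from every other row.
        piv = max(range(c, p), key=lambda k: abs(a[k][c]))
        a[c], a[piv] = a[piv], a[c]
        for k in range(p):
            if k != c:
                f = a[k][c] / a[c][c]
                a[k] = [u - f * v for u, v in zip(a[k], a[c])]
    return [a[i][p] / a[i][i] for i in range(p)]

def mse(rows, y, beta):
    """Mean squared prediction error of the linear fit `beta`."""
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yi in zip(rows, y)) / len(y)

def naive_importance(n=5000, frac=0.2, effect=1.0, seed=0):
    """Held-out loss without Z minus held-out loss with Z (complete cases only)."""
    rng = random.Random(seed)
    # Assumed data-generating model: Y = X + effect * Z + noise.
    x = [rng.gauss(0, 1) for _ in range(n)]
    z = [rng.gauss(0, 1) for _ in range(n)]
    y = [xi + effect * zi + rng.gauss(0, 1) for xi, zi in zip(x, z)]
    # Phase two: Z is treated as observed only in a simple random subsample.
    phase2 = [i for i in range(n) if rng.random() < frac]
    half = len(phase2) // 2
    train, test = phase2[:half], phase2[half:]

    def design(idx, with_z):
        return [[1.0, x[i]] + ([z[i]] if with_z else []) for i in idx]

    losses = []
    for with_z in (False, True):
        beta = ols_fit(design(train, with_z), [y[i] for i in train])
        losses.append(mse(design(test, with_z), [y[i] for i in test], beta))
    return losses[0] - losses[1]
```

In this simulation, calling `naive_importance(effect=1.0)` should give a clearly positive value, while `naive_importance(effect=0.0)` should give a value near zero. The sketch only illustrates the target quantity; valid inference at the zero-importance boundary and efficient use of the phase-one data are the contributions described in the abstract and are not reproduced here.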
