Dai Guorong, Carroll Raymond J, Chen Jinbo
Department of Statistics and Data Science, School of Management, Fudan University, Shanghai 200433, China.
Department of Statistics, Texas A&M University, College Station, TX 77840, United States.
Biometrics. 2025 Jul 3;81(3). doi: 10.1093/biomtc/ujaf095.
We consider a common nonparametric regression setting, where the data consist of a response variable $Y$, some easily obtainable covariates $\mathbf{X}$, and a set of costly covariates $\mathbf{Z}$. Before building predictive models for $Y$, a natural question arises: is it worthwhile to include $\mathbf{Z}$ as predictors, given the additional cost of collecting data on $\mathbf{Z}$ both for training the models and for predicting $Y$ for future individuals? We therefore aim to conduct a preliminary investigation that infers the importance of $\mathbf{Z}$ in predicting $Y$ in the presence of $\mathbf{X}$. To this end, we propose a nonparametric variable importance measure for $\mathbf{Z}$. It is defined as a parameter that aggregates the maximum potential contributions of $\mathbf{Z}$ in single or multiple predictive models, with contributions quantified by general loss functions. Considering two-phase data that provide a large number of observations on $(Y,\mathbf{X})$, with the expensive $\mathbf{Z}$ measured only in a small subsample, we develop a novel approach to inference on the proposed importance measure. The approach accommodates the missingness of $\mathbf{Z}$ in the sample by substituting functions of $(Y,\mathbf{X})$ for each individual's contribution to the predictive loss of models involving $\mathbf{Z}$. It attains unified and efficient inference regardless of whether $\mathbf{Z}$ makes a zero or a positive contribution to predicting $Y$, a desirable yet surprising property owing to data incompleteness. As intermediate steps in our theoretical development, we establish novel results in two relevant research areas: semi-supervised inference and two-phase nonparametric estimation. Numerical results from both simulated and real data demonstrate the superior performance of our approach.
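To make the notion of "contribution of $\mathbf{Z}$ quantified by a loss function" concrete, the following is a minimal sketch and not the estimator proposed in the paper. It assumes squared-error loss, uses ordinary least squares in place of a nonparametric fit, and uses only complete cases, so it ignores the two-phase design entirely. The function names `fit_predict` and `naive_importance` and the simulated data are hypothetical and serve only as illustration.

```python
import numpy as np

def fit_predict(train_features, train_y, test_features):
    """Least-squares fit with an intercept; return predictions on test_features.
    (Assumption: a linear model stands in for a nonparametric learner.)"""
    A = np.column_stack([np.ones(len(train_features)), train_features])
    coef, *_ = np.linalg.lstsq(A, train_y, rcond=None)
    B = np.column_stack([np.ones(len(test_features)), test_features])
    return B @ coef

def naive_importance(y, x, z, n_train):
    """Held-out squared-error loss without Z minus the loss with Z.
    Complete cases only; this is a naive plug-in, not the paper's method."""
    tr, te = slice(0, n_train), slice(n_train, None)
    xz = np.column_stack([x, z])
    loss_x = np.mean((y[te] - fit_predict(x[tr], y[tr], x[te])) ** 2)
    loss_xz = np.mean((y[te] - fit_predict(xz[tr], y[tr], xz[te])) ** 2)
    return loss_x - loss_xz

# Simulated data (hypothetical): X is cheap, Z is the costly covariate.
rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=(n, 2))
z = rng.normal(size=(n, 1))
noise = rng.normal(size=n)
y_useful = x[:, 0] + 2.0 * z[:, 0] + noise   # Z contributes to predicting Y
y_useless = x[:, 0] + noise                  # Z contributes nothing

imp_useful = naive_importance(y_useful, x, z, n_train=1000)
imp_useless = naive_importance(y_useless, x, z, n_train=1000)
print(imp_useful, imp_useless)
```

In this toy setup the loss reduction is large when $\mathbf{Z}$ enters the outcome model and close to zero when it does not. The zero-contribution case is the one in which naive plug-in estimates of this kind are typically hard to use for inference, and it is the case the abstract says the proposed approach handles in a unified way.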