Valid and efficient inference for nonparametric variable importance in two-phase studies.

Authors

Dai Guorong, Carroll Raymond J, Chen Jinbo

Affiliations

Department of Statistics and Data Science, School of Management, Fudan University, Shanghai 200433, China.

Department of Statistics, Texas A&M University, College Station, TX 77840, United States.

Publication information

Biometrics. 2025 Jul 3;81(3). doi: 10.1093/biomtc/ujaf095.

DOI:10.1093/biomtc/ujaf095

PMID:40742446

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12312401/

Abstract

We consider a common nonparametric regression setting, where the data consist of a response variable $Y$, some easily obtainable covariates $\mathbf{X}$, and a set of costly covariates $\mathbf{Z}$. Before establishing predictive models for $Y$, a natural question arises: Is it worthwhile to include $\mathbf{Z}$ as predictors, given the additional cost of collecting data on $\mathbf{Z}$ for both training the models and predicting $Y$ for future individuals? Therefore, we aim to conduct preliminary investigations to infer importance of $\mathbf{Z}$ in predicting $Y$ in the presence of $\mathbf{X}$. To achieve this goal, we propose a nonparametric variable importance measure for $\mathbf{Z}$. It is defined as a parameter that aggregates maximum potential contributions of $\mathbf{Z}$ in single or multiple predictive models, with contributions quantified by general loss functions. Considering two-phase data that provide a large number of observations for $(Y,\mathbf{X})$ with the expensive $\mathbf{Z}$ measured only in a small subsample, we develop a novel approach to infer the proposed importance measure, accommodating missingness of $\mathbf{Z}$ in the sample by substituting functions of $(Y,\mathbf{X})$ for each individual's contribution to the predictive loss of models involving $\mathbf{Z}$. Our approach attains unified and efficient inference regardless of whether $\mathbf{Z}$ makes zero or positive contribution to predicting $Y$, a desirable yet surprising property owing to data incompleteness. As intermediate steps of our theoretical development, we establish novel results in two relevant research areas, semi-supervised inference and two-phase nonparametric estimation. Numerical results from both simulated and real data demonstrate superior performance of our approach.
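To make the idea of a loss-based importance measure concrete, the following is a minimal sketch and not the authors' estimator. It assumes squared-error loss, a single linear model fitted by ordinary least squares as a stand-in for a flexible learner, and a phase-two subsample in which $\mathbf{Z}$ is observed completely at random. The function name `naive_importance` and all simulation settings are hypothetical. The sketch compares the in-sample loss of a model using $\mathbf{X}$ alone with that of a model using $(\mathbf{X},\mathbf{Z})$ on the phase-two rows only. It discards the phase-one observations of $(Y,\mathbf{X})$ and provides no valid inference, which are the limitations the abstract's approach is designed to address.

```python
import numpy as np


def naive_importance(y, x, z, observed):
    """Naive complete-case plug-in: drop in mean squared error when Z joins X.

    Illustrative only. Uses ordinary least squares and just the rows where
    Z is observed (the phase-two subsample).
    """
    def fit_predict(features, target):
        # Add an intercept column and return in-sample fitted values.
        design = np.column_stack([np.ones(len(target)), features])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        return design @ coef

    y2, x2, z2 = y[observed], x[observed], z[observed]
    loss_x = np.mean((y2 - fit_predict(x2, y2)) ** 2)
    loss_xz = np.mean((y2 - fit_predict(np.column_stack([x2, z2]), y2)) ** 2)
    return loss_x - loss_xz


# Hypothetical two-phase data: (Y, X) for everyone, Z for roughly 20% of rows.
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
z = rng.normal(size=n)
y = x + 2.0 * z + rng.normal(size=n)
observed = rng.random(n) < 0.2

print(naive_importance(y, x, z, observed))
```

In this simulated setting the population reduction in squared-error loss from adding $\mathbf{Z}$ equals the variance of $2Z$, so the printed value should be of that order. When $\mathbf{Z}$ has no effect the naive difference is close to zero, and that boundary case is the one the abstract singles out as needing a unified treatment.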
