Flexible variable selection in the presence of missing data.

Authors

Williamson Brian D, Huang Ying

Affiliations

Biostatistics Division, Kaiser Permanente Washington Health Research Institute, Seattle, USA.

Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Center, Seattle, USA.

Publication information

Int J Biostat. 2024 Feb 13;20(2):347-359. doi: 10.1515/ijb-2023-0059. eCollection 2024 Nov 1.

DOI:10.1515/ijb-2023-0059

PMID:38348882

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11323294/

Abstract

In many applications, it is of interest to identify a parsimonious set of features, or panel, from multiple candidates that achieves a desired level of performance in predicting a response. This task is often complicated in practice by missing data arising from the sampling design or other random mechanisms. Most recent work on variable selection in missing data contexts relies in some part on a finite-dimensional statistical model, e.g., a generalized or penalized linear model. In cases where this model is misspecified, the selected variables may not all be truly scientifically relevant and can result in panels with suboptimal classification performance. To address this limitation, we propose a nonparametric variable selection algorithm combined with multiple imputation to develop flexible panels in the presence of missing-at-random data. We outline strategies based on the proposed algorithm that achieve control of commonly used error rates. Through simulations, we show that our proposal has good operating characteristics and results in panels with higher classification and variable selection performance compared to several existing penalized regression approaches in cases where a generalized linear model is misspecified. Finally, we use the proposed method to develop biomarker panels for separating pancreatic cysts with differing malignancy potential in a setting where complicated missingness in the biomarkers arose due to limited specimen volumes.
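
The general idea in the abstract (impute several times, score each feature without assuming a linear model, then pool the selections across imputations) can be illustrated with a small sketch. This is not the authors' algorithm: the hot-deck imputation, the binned importance measure, the majority-vote pooling rule, and every name and threshold below are assumptions chosen only to keep the example self-contained, and missingness here is completely at random, a special case of missing at random.

```python
import random
from statistics import mean, pvariance


def simulate(n=400, p=4, miss_rate=0.3, seed=0):
    """Toy data (assumed): y depends on x0 only through |x0|; x1..x3 are noise."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [1 if abs(row[0]) > 0.67 else 0 for row in X]
    # Knock out values in features 0 and 1 completely at random.
    for row in X:
        for j in (0, 1):
            if rng.random() < miss_rate:
                row[j] = None
    return X, y


def impute_once(X, y, rng):
    """One stochastic hot-deck draw: fill each gap from observed values
    of the same feature among rows with the same outcome."""
    p = len(X[0])
    donors = {
        (j, c): [r[j] for r, yc in zip(X, y) if yc == c and r[j] is not None]
        for j in range(p)
        for c in set(y)
    }
    return [
        [v if v is not None else rng.choice(donors[(j, c)]) for j, v in enumerate(row)]
        for row, c in zip(X, y)
    ]


def binned_importance(x, y, bins=5):
    """Share of Var(y) explained by quantile bins of x (a correlation ratio).
    Model-free, so it can pick up nonlinear dependence."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    size = len(x) / bins
    groups = [
        [y[i] for i in order[round(b * size):round((b + 1) * size)]]
        for b in range(bins)
    ]
    grand = mean(y)
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups) / len(y)
    return between / pvariance(y)


def select(X, y, n_imputations=5, threshold=0.1, min_share=0.5, seed=1):
    """Keep features whose importance exceeds the threshold in at least
    min_share of the imputed datasets (an assumed pooling rule)."""
    rng = random.Random(seed)
    p = len(X[0])
    votes = [0] * p
    for _ in range(n_imputations):
        Xi = impute_once(X, y, rng)
        for j in range(p):
            if binned_importance([r[j] for r in Xi], y) > threshold:
                votes[j] += 1
    return [j for j in range(p) if votes[j] / n_imputations >= min_share]


X, y = simulate()
selected = select(X, y)
print("selected feature indices:", selected)
```

In this toy setup the outcome depends on feature 0 symmetrically, so its linear correlation with the outcome is close to zero and a purely linear screen would tend to miss it, while the binned measure should flag it. The threshold and the majority-vote rule are ad hoc and carry no formal error-rate guarantee.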
