Variable selection with multiply-imputed datasets: choosing between stacked and grouped methods.

Author information

Du Jiacong, Boss Jonathan, Han Peisong, Beesley Lauren J, Kleinsasser Michael, Goutman Stephen A, Batterman Stuart, Feldman Eva L, Mukherjee Bhramar

Affiliations

Department of Biostatistics, University of Michigan, Ann Arbor, MI.

Department of Neurology, University of Michigan, Ann Arbor, MI.

Publication information

J Comput Graph Stat. 2022;31(4):1063-1075. doi: 10.1080/10618600.2022.2035739. Epub 2022 Mar 28.

DOI:10.1080/10618600.2022.2035739

PMID:36644406

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9838615/

Abstract

Penalized regression methods are used in many biomedical applications for variable selection and simultaneous coefficient estimation. However, missing data complicates the implementation of these methods, particularly when missingness is handled using multiple imputation. Applying a variable selection algorithm on each imputed dataset will likely lead to different sets of selected predictors. This paper considers a general class of penalized objective functions which, by construction, force selection of the same variables across imputed datasets. By pooling objective functions across imputations, optimization is then performed jointly over all imputed datasets rather than separately for each dataset. We consider two objective function formulations that exist in the literature, which we will refer to as "stacked" and "grouped" objective functions. Building on existing work, we (a) derive and implement efficient cyclic coordinate descent and majorization-minimization optimization algorithms for continuous and binary outcome data, (b) incorporate adaptive shrinkage penalties, (c) compare these methods through simulation, and (d) develop an R package. Simulations demonstrate that the "stacked" approaches are more computationally efficient and have better estimation and selection properties. We apply these methods to data from the University of Michigan ALS Patients Biorepository aiming to identify the association between environmental pollutants and ALS risk. Supplementary materials are available online.
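To make the "stacked" idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a continuous outcome, a plain lasso penalty, no intercept (roughly centered data), and one common scaling of the pooled objective, (1/(2n)) * sum over all stacked rows of (1/D) * (y - x'beta)^2 + lam * ||beta||_1, which may differ from the scaling used in the paper. The function names `soft_threshold` and `stacked_lasso` are hypothetical, and the "imputation" step is a crude stand-in for a proper multiple-imputation procedure. A "grouped" formulation would instead keep a separate coefficient vector per imputed dataset and tie them together with a group penalty; it is not shown here.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def stacked_lasso(X_list, y_list, lam, n_iter=500, tol=1e-8):
    """Lasso fitted jointly on D stacked imputed datasets (illustrative sketch).

    Every row gets weight 1/D, so the stacked data count as n observations.
    A single coefficient vector is estimated, so the same variables are
    selected for all imputed datasets by construction.
    """
    D = len(X_list)
    n = X_list[0].shape[0]
    X = np.vstack(X_list)                      # (n * D, p) stacked design
    y = np.concatenate(y_list).astype(float)   # (n * D,) stacked outcome
    w = np.full(X.shape[0], 1.0 / D)           # observation weights
    v = (w[:, None] * X ** 2).sum(axis=0) / n  # weighted column scales
    beta = np.zeros(X.shape[1])
    resid = y - X @ beta
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(X.shape[1]):            # cyclic coordinate descent
            if v[j] == 0.0:
                continue
            # weighted correlation of column j with the partial residual
            z = (w * X[:, j] * resid).sum() / n + v[j] * beta[j]
            new = soft_threshold(z, lam) / v[j]
            delta = new - beta[j]
            if delta != 0.0:
                resid -= delta * X[:, j]       # keep residual in sync
                beta[j] = new
            max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta

# Toy data: two true signals, 30% of the first predictor set to missing.
rng = np.random.default_rng(0)
n, p, D = 200, 6, 5
X_full = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.5, 0.0, 0.0, 0.0, 0.0])
y = X_full @ beta_true + rng.normal(size=n)
missing = rng.random(n) < 0.3
obs = X_full[~missing, 0]

# Crude stand-in for multiple imputation (assumption for illustration only):
# draw missing values from a normal fitted to the observed values.
X_list = []
for _ in range(D):
    X_imp = X_full.copy()
    X_imp[missing, 0] = rng.normal(obs.mean(), obs.std(), size=missing.sum())
    X_list.append(X_imp)

beta_hat = stacked_lasso(X_list, [y] * D, lam=0.1)
print("selected predictors:", np.flatnonzero(beta_hat))
```

Because only one coefficient vector is optimized over the pooled data, there is nothing to reconcile across imputations afterwards; fitting a separate lasso to each element of `X_list` would instead return D possibly different selected sets.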

Similar articles

Variable selection for multiply-imputed data with application to dioxin exposure study.

Stat Med. 2013 Sep 20;32(21):3646-59. doi: 10.1002/sim.5783. Epub 2013 Mar 25.

A simple pooling method for variable selection in multiply imputed datasets outperformed complex methods.

BMC Med Res Methodol. 2022 Aug 4;22(1):214. doi: 10.1186/s12874-022-01693-8.

A comparison of model selection methods for prediction in the presence of multiply imputed data.

Biom J. 2019 Mar;61(2):343-356. doi: 10.1002/bimj.201700232. Epub 2018 Oct 23.

Covariate Selection for Multilevel Models with Missing Data.

Stat (Int Stat Inst). 2017;6(1):31-46. doi: 10.1002/sta4.133. Epub 2017 Jan 8.

Analyzing evidence-based falls prevention data with significant missing information using variable selection after multiple imputation.

J Appl Stat. 2021 Oct 7;50(3):724-743. doi: 10.1080/02664763.2021.1985090. eCollection 2023.

Model selection of generalized estimating equations with multiply imputed longitudinal data.

Biom J. 2013 Nov;55(6):899-911. doi: 10.1002/bimj.201200236. Epub 2013 Aug 23.

missForest with feature selection using binary particle swarm optimization improves the imputation accuracy of continuous data.

Genes Genomics. 2022 Jun;44(6):651-658. doi: 10.1007/s13258-022-01247-8. Epub 2022 Apr 6.

Missing data imputation, prediction, and feature selection in diagnosis of vaginal prolapse.

BMC Med Res Methodol. 2023 Nov 6;23(1):259. doi: 10.1186/s12874-023-02079-0.

Variable Selection in the Presence of Missing Data: Imputation-based Methods.

Wiley Interdiscip Rev Comput Stat. 2017 Sep-Oct;9(5). doi: 10.1002/wics.1402. Epub 2017 May 24.

Cited by

Constructing a binary prediction model with incomplete data: Variable selection to balance fairness and precision.

Psychol Methods. 2025 Aug 14. doi: 10.1037/met0000786.

Predicting obesity at adolescence from an early age in a Dutch observational cohort study: the development and internal validation of a multivariable prediction model.

BMC Pediatr. 2025 Jun 7;25(1):465. doi: 10.1186/s12887-025-05661-1.

Model for Musculoskeletal Injury Risk Factors Among US Army Basic Combat Trainees.

JAMA Netw Open. 2025 Jun 2;8(6):e2513177. doi: 10.1001/jamanetworkopen.2025.13177.

Statistical Methods for Chemical Mixtures: A Roadmap for Practitioners Using Simulation Studies and a Sample Data Analysis in the PROTECT Cohort.

Environ Health Perspect. 2025 Jun;133(6):67019. doi: 10.1289/EHP15305. Epub 2025 Jun 19.

Using Machine Learning to Identify Social Determinants of Health that Impact Discharge Disposition for Hospitalized Patients.

J Am Med Dir Assoc. 2025 May;26(5):105524. doi: 10.1016/j.jamda.2025.105524. Epub 2025 Mar 20.

Using clinical data to reclassify ESUS patients to large artery atherosclerotic or cardioembolic stroke mechanisms.

J Neurol. 2024 Dec 21;272(1):87. doi: 10.1007/s00415-024-12848-6.

Resting Heart Rate and Associations With Clinical Measures From the Project Baseline Health Study: Observational Study.

J Med Internet Res. 2024 Dec 20;26:e60493. doi: 10.2196/60493.

Predicting implementation of response to intervention in math using elastic net logistic regression.

Front Psychol. 2024 Oct 2;15:1410396. doi: 10.3389/fpsyg.2024.1410396. eCollection 2024.

Factors associated with lower quarter performance-based balance and strength tests: a cross-sectional analysis from the project baseline health study.

Front Sports Act Living. 2024 Jul 15;6:1393332. doi: 10.3389/fspor.2024.1393332. eCollection 2024.

Intention to quit or reduce e-cigarettes, cannabis, and their co-use among a school-based sample of adolescents.

Addict Behav. 2024 Oct;157:108101. doi: 10.1016/j.addbeh.2024.108101. Epub 2024 Jul 7.

References

Estimating Outcome-Exposure Associations when Exposure Biomarker Detection Limits vary Across Batches.

Epidemiology. 2019 Sep;30(5):746-755. doi: 10.1097/EDE.0000000000001052.

High plasma concentrations of organic pollutants negatively impact survival in amyotrophic lateral sclerosis.

J Neurol Neurosurg Psychiatry. 2019 Aug;90(8):907-912. doi: 10.1136/jnnp-2018-319785. Epub 2019 Feb 13.

Emerging understanding of the genotype-phenotype relationship in amyotrophic lateral sclerosis.

Handb Clin Neurol. 2018;148:603-623. doi: 10.1016/B978-0-444-64076-5.00039-9.

Diagnosis and Clinical Management of Amyotrophic Lateral Sclerosis and Other Motor Neuron Disorders.

Continuum (Minneap Minn). 2017 Oct;23(5, Peripheral Nerve and Motor Neuron Disorders):1332-1359. doi: 10.1212/CON.0000000000000535.

VARIABLE SELECTION AND PREDICTION WITH INCOMPLETE HIGH-DIMENSIONAL DATA.

Ann Appl Stat. 2016 Mar;10(1):418-450. doi: 10.1214/15-AOAS899. Epub 2016 Mar 25.

Association of Environmental Toxins With Amyotrophic Lateral Sclerosis.

JAMA Neurol. 2016 Jul 1;73(7):803-11. doi: 10.1001/jamaneurol.2016.0594.

Variable selection models based on multiple imputation with an application for predicting median effective dose and maximum effect.

J Stat Comput Simul. 2015;85(9):1902-1916. doi: 10.1080/00949655.2014.907801.

A LASSO FOR HIERARCHICAL INTERACTIONS.

Ann Stat. 2013 Jun;41(3):1111-1141. doi: 10.1214/13-AOS1096.

Amyotrophic lateral sclerosis: mechanisms and therapeutics in the epigenomic era.

Nat Rev Neurol. 2015 May;11(5):266-79. doi: 10.1038/nrneurol.2015.57. Epub 2015 Apr 21.

Group descent algorithms for nonconvex penalized linear and logistic regression models with grouped predictors.

Stat Comput. 2015 Mar;25(2):173-187. doi: 10.1007/s11222-013-9424-2.
