Du Jiacong, Boss Jonathan, Han Peisong, Beesley Lauren J, Kleinsasser Michael, Goutman Stephen A, Batterman Stuart, Feldman Eva L, Mukherjee Bhramar
Department of Biostatistics, University of Michigan, Ann Arbor, MI.
Department of Neurology, University of Michigan, Ann Arbor, MI.
J Comput Graph Stat. 2022;31(4):1063-1075. doi: 10.1080/10618600.2022.2035739. Epub 2022 Mar 28.
Penalized regression methods are used in many biomedical applications for simultaneous variable selection and coefficient estimation. However, missing data complicate the implementation of these methods, particularly when missingness is handled using multiple imputation. Applying a variable selection algorithm to each imputed dataset separately will likely lead to different sets of selected predictors. This paper considers a general class of penalized objective functions which, by construction, force selection of the same variables across imputed datasets. Because the objective functions are pooled across imputations, optimization is performed jointly over all imputed datasets rather than separately for each dataset. We consider two objective function formulations that exist in the literature, which we refer to as "stacked" and "grouped" objective functions. Building on existing work, we (a) derive and implement efficient cyclic coordinate descent and majorization-minimization optimization algorithms for continuous and binary outcome data, (b) incorporate adaptive shrinkage penalties, (c) compare these methods through simulation, and (d) develop an R package. Simulations demonstrate that the "stacked" approaches are more computationally efficient and have better estimation and selection properties. We apply these methods to data from the University of Michigan ALS Patients Biorepository, aiming to identify the association between environmental pollutants and ALS risk. Supplementary materials are available online.
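To make the "stacked" idea concrete, the following is a minimal sketch, not the paper's R package or its exact algorithm. It assumes a continuous outcome, centered data with no intercept, a plain (non-adaptive) lasso penalty, and equal observation weights of 1/D across D imputed datasets. The function names `soft_threshold` and `stacked_lasso` are hypothetical. Because a single coefficient vector is fit to all imputed datasets at once, the selected variables are the same in every imputation by construction.

```python
import numpy as np


def soft_threshold(z, gamma):
    """Soft-thresholding operator used in lasso coordinate updates."""
    return np.sign(z) * max(abs(z) - gamma, 0.0)


def stacked_lasso(X_list, y_list, lam, n_iter=200, tol=1e-8):
    """Cyclic coordinate descent for a lasso fit on stacked imputed datasets.

    Minimizes (1 / (2 n D)) * sum_d ||y_d - X_d beta||^2 + lam * ||beta||_1,
    where D is the number of imputed datasets and n the rows per dataset.
    Assumes every dataset has the same shape and columns are centered.
    """
    D = len(X_list)
    n = X_list[0].shape[0]
    X = np.vstack(X_list)            # stack the D imputed design matrices
    y = np.concatenate(y_list)       # stack the D outcome vectors
    w = 1.0 / (n * D)                # each row carries weight 1 / (n D)
    p = X.shape[1]

    beta = np.zeros(p)
    resid = y - X @ beta             # residuals, kept up to date in place
    col_norm = w * np.sum(X ** 2, axis=0)

    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            if col_norm[j] == 0.0:
                continue             # skip all-zero columns
            # Weighted correlation of column j with the partial residual.
            z = w * (X[:, j] @ resid) + col_norm[j] * beta[j]
            new_bj = soft_threshold(z, lam) / col_norm[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                resid -= delta * X[:, j]
                beta[j] = new_bj
                max_change = max(max_change, abs(delta))
        if max_change < tol:         # stop once a full sweep barely moves beta
            break
    return beta
```

A call such as `stacked_lasso([X1, X2, X3], [y1, y2, y3], lam=0.1)` returns one coefficient vector shared by all three imputed datasets. A "grouped" formulation would instead keep a separate coefficient vector per imputation and penalize each variable's coefficients jointly across imputations, which this sketch does not implement.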