Lin Jung-Yi Joyce, Hu Liangyuan, Huang Chuyue, Ji Jiayi, Lawrence Steven, Govindarajulu Usha
Department of Population Health Science and Policy, Icahn School of Medicine at Mount Sinai, 1425 Madison Ave, New York, 10029, USA.
Department of Biostatistics and Epidemiology, Rutgers University, 683 Hoes Lane West, Piscataway, 08854, USA.
BMC Med Res Methodol. 2022 May 4;22(1):132. doi: 10.1186/s12874-022-01608-7.
Prior work has shown that combining bootstrap imputation with tree-based machine learning variable selection methods can recover the performance achievable on fully observed data when covariate and outcome data are missing at random (MAR). This approach, however, is computationally expensive, especially on large-scale datasets.
We propose an inference-based method, called RR-BART, for variable selection in the presence of MAR data. RR-BART leverages the likelihood-based Bayesian machine learning technique, Bayesian additive regression trees (BART), and uses Rubin's rule to combine the estimates and variances of the variable importance measures across multiply imputed datasets. We conduct a representative simulation study to investigate the practical operating characteristics of RR-BART and compare it with the bootstrap imputation-based methods. We further demonstrate the methods via a case study of risk factors for 3-year incidence of metabolic syndrome among middle-aged women using data from the Study of Women's Health Across the Nation (SWAN).
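The pooling step named above can be illustrated with a minimal sketch of the standard form of Rubin's rule for a single predictor. It assumes each of the M imputed datasets yields one importance estimate and one variance for that predictor; the function name `pool_rubin` and the numbers are hypothetical, and the sketch covers neither the BART fit nor the selection criterion applied to the pooled values.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool M per-imputation estimates and variances with Rubin's rule.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations and matching lengths")
    q_bar = mean(estimates)          # pooled point estimate
    u_bar = mean(variances)          # average within-imputation variance
    b = variance(estimates)          # between-imputation variance (divides by m - 1)
    total = u_bar + (1 + 1 / m) * b  # total variance of the pooled estimate
    return q_bar, total


# Made-up importance scores for one predictor across M = 4 imputed datasets.
vip = [0.10, 0.12, 0.08, 0.10]
vip_var = [0.0004, 0.0004, 0.0004, 0.0004]
pooled, total_var = pool_rubin(vip, vip_var)
print(pooled, total_var)
```

The total variance exceeds the average within-imputation variance whenever the estimates differ across imputations, which is how the extra uncertainty due to missing data enters the pooled inference.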
The simulation study suggests that, even in complex conditions of nonlinearity and nonadditivity with a large percentage of missingness, RR-BART can reasonably recover both the prediction and the variable selection performance achievable on the fully observed data. RR-BART matches the best performance that the bootstrap imputation-based methods can achieve with the optimal selection threshold value. In addition, RR-BART demonstrates a substantially stronger ability to detect discrete predictors. Furthermore, RR-BART offers substantial computational savings. When implemented on the SWAN data, RR-BART adds to the literature by selecting a set of predictors that had been less commonly identified as risk factors but had substantial biological justification.
The proposed variable selection method for MAR data, RR-BART, offers both computational efficiency and good operating characteristics, and is practical for large-scale healthcare database studies.