A flexible approach for variable selection in large-scale healthcare database studies with missing covariate and outcome data.

Author information

Lin Jung-Yi Joyce, Hu Liangyuan, Huang Chuyue, Jiayi Ji, Lawrence Steven, Govindarajulu Usha

Affiliations

Department of Population Health Science and Policy, Icahn School of Medicine at Mount Sinai, 1425 Madison Ave, New York, 10029, USA.

Department of Biostatistics and Epidemiology, Rutgers University, 683 Hoes Lane West, Piscataway, 08854, USA.

Publication information

BMC Med Res Methodol. 2022 May 4;22(1):132. doi: 10.1186/s12874-022-01608-7.

DOI:10.1186/s12874-022-01608-7

PMID:35508974

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC9066834/

Abstract

BACKGROUND

Prior work has shown that, when covariate and outcome data are missing at random (MAR), combining bootstrap imputation with tree-based machine learning variable selection methods can recover the good performance achievable on fully observed data. This approach, however, is computationally expensive, especially on large-scale datasets.

METHODS

We propose an inference-based method, called RR-BART, which leverages the likelihood-based Bayesian machine learning technique, Bayesian additive regression trees, and uses Rubin's rule to combine the estimates and variances of the variable importance measures on multiply imputed datasets for variable selection in the presence of MAR data. We conduct a representative simulation study to investigate the practical operating characteristics of RR-BART, and compare it with the bootstrap imputation based methods. We further demonstrate the methods via a case study of risk factors for 3-year incidence of metabolic syndrome among middle-aged women using data from the Study of Women's Health Across the Nation (SWAN).
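The pooling step described above can be illustrated with a minimal sketch of Rubin's rules, assuming each imputed dataset yields a point estimate and a variance for one variable importance measure. The function names, the normal-approximation interval, and the `threshold` parameter are hypothetical choices for illustration; the abstract does not state the exact selection criterion RR-BART uses, and no BART model is fitted here.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool per-imputation estimates and variances with Rubin's rules.

    estimates: one point estimate of the importance measure per imputed dataset
    variances: the matching within-imputation variances
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations and matching lengths")
    q_bar = mean(estimates)        # pooled point estimate
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # sample variance across imputations (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return q_bar, total


def is_selected(estimates, variances, threshold, z=1.96):
    """Hypothetical rule: select the variable when the lower bound of a
    normal-approximation interval around the pooled estimate exceeds
    `threshold` (an assumed cutoff, not taken from the source)."""
    q_bar, total = pool_rubin(estimates, variances)
    lower = q_bar - z * math.sqrt(total)
    return lower > threshold
```

For example, `pool_rubin([0.10, 0.12, 0.14], [0.0004, 0.0004, 0.0004])` pools three imputations into a single estimate and a total variance that adds the between-imputation spread to the average within-imputation variance. Because each imputed dataset is analyzed once, the cost grows with the number of imputations rather than with a number of bootstrap replicates.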

RESULTS

The simulation study suggests that even in complex conditions of nonlinearity and nonadditivity with a large percentage of missingness, RR-BART can reasonably recover the prediction and variable selection performance achievable on the fully observed data. RR-BART matches the best performance that the bootstrap imputation-based methods can achieve with the optimal selection threshold value. In addition, RR-BART demonstrates a substantially stronger ability to detect discrete predictors. Furthermore, RR-BART offers substantial computational savings. When implemented on the SWAN data, RR-BART adds to the literature by selecting a set of predictors that had been less commonly identified as risk factors but had substantial biological justifications.
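The role of a selection threshold in the bootstrap imputation-based comparison methods can be sketched as follows. This assumes, since the abstract does not define it, that the threshold is the minimum fraction of bootstrap-imputed datasets in which a variable must be chosen; the function name `select_by_frequency`, the parameter `pi`, and the variable names are hypothetical.

```python
from collections import Counter


def select_by_frequency(selections, pi):
    """Keep variables chosen in at least a fraction `pi` of the datasets.

    selections: list of sets, one per bootstrap-imputed dataset, each holding
                the variables selected on that dataset
    pi: assumed threshold in (0, 1]
    Returns a sorted list of the retained variable names.
    """
    if not selections or not 0 < pi <= 1:
        raise ValueError("need at least one dataset and 0 < pi <= 1")
    m = len(selections)
    counts = Counter(v for chosen in selections for v in chosen)
    return sorted(v for v, c in counts.items() if c >= pi * m)


# Toy input: selections from four hypothetical bootstrap-imputed datasets.
runs = [{"age", "bmi"}, {"age"}, {"age", "bmi", "smoking"}, {"bmi"}]
print(select_by_frequency(runs, pi=0.5))
```

Under this reading, the selected set depends on `pi`, which would have to be tuned, whereas an interval-based rule on pooled estimates avoids that tuning step.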

CONCLUSION

The proposed variable selection method for MAR data, RR-BART, offers both computational efficiency and good operating characteristics and is practical for large-scale healthcare database studies.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/88da/9066834/7866d78a2c85/12874_2022_1608_Fig1_HTML.jpg
