Xu Dandan, Daniels Michael J, Winterstein Almut G
Department of Statistics, University of Florida, Gainesville, FL 32601, USA.
Departments of Integrative Biology and Statistics & Data Sciences, The University of Texas at Austin, Austin, TX 78712, USA.
Biostatistics. 2016 Jul;17(3):589-602. doi: 10.1093/biostatistics/kxw009. Epub 2016 Mar 15.
Comparative effectiveness research using electronic health records (EHR) typically requires many covariates to adjust for selection and confounding biases. Unfortunately, these covariates often have missing values. Restricting the analysis to cases with complete covariates results in considerable efficiency losses and likely bias. Here, we consider covariates that are missing at random, with a missing data mechanism that may or may not depend on the response. Standard methods for multiple imputation can either fail to capture nonlinear relationships or suffer from incompatibility and uncongeniality issues. We explore a flexible Bayesian nonparametric approach to impute the missing covariates, which involves factoring the joint distribution of the covariates with missingness into a set of sequential conditionals and applying Bayesian additive regression trees to model each of these univariate conditionals. Using data augmentation, the posterior for each conditional can be sampled simultaneously. We provide details on the computational algorithm and make comparisons to other methods, including parametric sequential imputation and two versions of multiple imputation by chained equations. We illustrate the proposed approach on EHR data from an affiliated tertiary care institution to examine factors related to hyperglycemia.
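The sequential factorization described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: an ordinary least-squares regression stands in for Bayesian additive regression trees, and a single pass of draws replaces the data-augmentation sampler. The function names (`solve`, `fit_linear`, `sequential_impute`), the column layout (fully observed columns first, then the covariates with missingness in their factorization order), and the simulated data are all assumptions made for illustration.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_linear(X, y):
    """Return least-squares coefficients (intercept first) and the residual SD."""
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    resid = [t - sum(b * v for b, v in zip(beta, r)) for r, t in zip(rows, y)]
    dof = max(len(y) - k, 1)
    return beta, (sum(e * e for e in resid) / dof) ** 0.5


def sequential_impute(data, n_complete, rng):
    """Fill None entries column by column.

    Hypothetical layout: the first n_complete columns are fully observed.
    Column j (j >= n_complete) is modeled conditional on columns 0..j-1,
    which are complete by the time column j is processed.
    """
    filled = [row[:] for row in data]  # leave the input untouched
    for j in range(n_complete, len(data[0])):
        obs = [r for r in filled if r[j] is not None]
        beta, sd = fit_linear([r[:j] for r in obs], [r[j] for r in obs])
        for r in filled:
            if r[j] is None:
                mean = beta[0] + sum(b * v for b, v in zip(beta[1:], r[:j]))
                r[j] = rng.gauss(mean, sd)  # draw from the fitted conditional
    return filled


# Simulated example: z is fully observed; x1 and x2 have missing values.
rng = random.Random(0)
rows = []
for _ in range(200):
    z = rng.gauss(0, 1)
    x1 = 1 + 2 * z + rng.gauss(0, 0.5)
    x2 = x1 - z + rng.gauss(0, 0.5)
    # Missingness depends only on the observed z, so it is missing at random.
    p_miss = 0.4 if z > 0 else 0.1
    rows.append([z,
                 None if rng.random() < p_miss else x1,
                 None if rng.random() < p_miss else x2])

completed = sequential_impute(rows, n_complete=1, rng=rng)
```

Calling `sequential_impute` repeatedly with different random draws would give several completed data sets, in the spirit of multiple imputation. Because the sketch uses a linear model with fixed fitted coefficients, it neither captures nonlinear relationships nor propagates parameter uncertainty, which are among the limitations the proposed approach is designed to address.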