Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine at the University of Pennsylvania, Philadelphia, Pennsylvania, USA.
Department of Pediatrics, Children's Hospital of Philadelphia, Philadelphia, Pennsylvania, USA.
J Am Med Inform Assoc. 2023 Jun 20;30(7):1246-1256. doi: 10.1093/jamia/ocad066.
The impacts of missing data in comparative effectiveness research (CER) using electronic health records (EHRs) may vary depending on the type and pattern of missing data. In this study, we aimed to quantify these impacts and compare the performance of different imputation methods.
We conducted an empirical (simulation) study to quantify the bias and power loss in estimating treatment effects in CER using EHR data. We considered various missing-data scenarios and used propensity scores to control for confounding. We compared the performance of multiple imputation and spline smoothing for handling missing data.
When missing data depended on the stochastic progression of disease and medical practice patterns, the spline smoothing method produced results close to those obtained when there were no missing data. Compared with multiple imputation, spline smoothing generally performed similarly or better, with smaller estimation bias and less power loss. Multiple imputation could still reduce study bias and power loss in some restrictive scenarios, eg, when missing data did not depend on the stochastic process of disease progression.
Missing data in EHRs could lead to biased estimates of treatment effects and false-negative findings in CER, even after the missing data were imputed. When using EHRs as a data resource for CER, it is important to leverage the temporal information in the disease trajectory to impute missing values, and to consider the missing rate and the effect size when choosing an imputation method.
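The value of temporal information for imputation can be illustrated with a toy sketch. It is not the authors' simulation: the trajectory model, the parameter values, and all function names are hypothetical. Linear interpolation within each patient stands in for spline smoothing, and a cross-sectional mean fill stands in for an approach that ignores the disease trajectory (it is not multiple imputation). Propensity scores and treatment effects are left out.

```python
import math
import random


def simulate_patient(rng, n_visits=8):
    """Hypothetical biomarker trajectory: patient-specific level and slope plus noise."""
    level = rng.gauss(50.0, 10.0)
    slope = rng.gauss(2.0, 1.0)
    return [level + slope * t + rng.gauss(0.0, 1.0) for t in range(n_visits)]


def mask_interior(rng, values, miss_rate=0.3):
    """Drop interior visits at random; the first and last visits stay observed."""
    last = len(values) - 1
    return [v if t in (0, last) or rng.random() > miss_rate else None
            for t, v in enumerate(values)]


def interpolate(observed):
    """Fill gaps by linear interpolation between a patient's nearest observed visits."""
    known = [t for t, v in enumerate(observed) if v is not None]
    filled = list(observed)
    for t, v in enumerate(observed):
        if v is None:
            lo = max(k for k in known if k < t)
            hi = min(k for k in known if k > t)
            w = (t - lo) / (hi - lo)
            filled[t] = (1 - w) * observed[lo] + w * observed[hi]
    return filled


def cross_sectional_mean_fill(cohort):
    """Fill gaps with the mean of the values observed at the same visit across the cohort."""
    means = []
    for t in range(len(cohort[0])):
        obs = [p[t] for p in cohort if p[t] is not None]
        means.append(sum(obs) / len(obs))
    return [[means[t] if v is None else v for t, v in enumerate(p)] for p in cohort]


def rmse(truth, masked, filled):
    """Root mean squared error, computed only over the values that were masked."""
    errs = [(f[t] - tr[t]) ** 2
            for tr, m, f in zip(truth, masked, filled)
            for t in range(len(tr)) if m[t] is None]
    return math.sqrt(sum(errs) / len(errs))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
truth = [simulate_patient(rng) for _ in range(200)]
masked = [mask_interior(rng, p) for p in truth]

rmse_traj = rmse(truth, masked, [interpolate(p) for p in masked])
rmse_mean = rmse(truth, masked, cross_sectional_mean_fill(masked))
print(f"trajectory-based fill RMSE: {rmse_traj:.2f}")
print(f"cross-sectional mean RMSE:  {rmse_mean:.2f}")
```

In this toy setting, between-patient variation is much larger than visit-level noise, so the trajectory-based fill recovers the masked values with lower error. Missingness here is completely at random, so the sketch does not reproduce the scenarios above in which missingness depends on disease progression or practice patterns.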