Ren Wenhui, Liu Zheng, Wu Yanqiu, Zhang Zhilong, Hong Shenda, Liu Huixin
Department of Clinical Epidemiology and Biostatistics, Peking University People's Hospital, Beijing, China.
National Institute of Health Data Science, Peking University, Beijing, China.
Health Data Sci. 2024 Dec 4;4:0176. doi: 10.34133/hds.0176. eCollection 2024.
Missing data in electronic health records (EHRs) present significant challenges in medical studies. Many methods have been proposed, but it remains unclear which methods for handling missing data are currently applied to EHRs and which strategy performs better in specific contexts. We searched the MEDLINE, EMBASE, and Digital Bibliography and Library Project databases for all studies referencing EHRs and missing data methods published from database inception until March 30, 2024. The characteristics of the included studies were extracted, and the performance of the various methods was compared under different missingness scenarios. After screening, 46 studies published between 2010 and 2024 were included. Three missingness mechanisms were simulated when evaluating the missing data methods: missing completely at random (29/46), missing at random (20/46), and missing not at random (21/46); because these counts sum to more than 46, some studies simulated more than one mechanism. Multiple imputation by chained equations (MICE) was the most popular statistical method, whereas generative adversarial network-based methods and k-nearest neighbor (KNN) classification were the most common deep-learning-based and traditional machine-learning-based methods, respectively. In the 26 articles that compared statistical approaches with machine learning approaches, traditional machine learning or deep learning methods generally outperformed statistical methods. Med.KNN and context-aware time-series imputation performed better for longitudinal datasets, whereas probabilistic principal component analysis and MICE-based methods performed best for cross-sectional datasets. Machine learning methods show significant promise for addressing missing data in EHRs. However, no single approach provides a universally generalizable solution. Standardized benchmarking analyses are essential to evaluate these methods across different missingness scenarios.
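As an illustration of the three missingness mechanisms named in the abstract, the following is a minimal sketch of how each might be simulated on a toy dataset before benchmarking imputation methods. It is not taken from any of the reviewed studies. The function `simulate_missingness`, the column names `age` and `lab`, the logistic masking model, and the `rate` and `slope` parameters are all hypothetical choices made for this example.

```python
import math
import random


def _standardize(values):
    """Return z-scores; a zero standard deviation is treated as 1."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / (sd or 1.0) for v in values]


def simulate_missingness(rows, mechanism, rate=0.3, slope=2.0, seed=0):
    """Return a copy of rows with 'lab' set to None under one mechanism.

    rows: list of dicts with numeric 'age' (always observed) and 'lab'.
    MCAR: every value is masked with the same probability `rate`.
    MAR:  masking probability rises with the observed 'age'.
    MNAR: masking probability rises with the 'lab' value itself.
    """
    rng = random.Random(seed)
    if mechanism == "MCAR":
        driver = [0.0] * len(rows)
    elif mechanism == "MAR":
        driver = _standardize([r["age"] for r in rows])
    elif mechanism == "MNAR":
        driver = _standardize([r["lab"] for r in rows])
    else:
        raise ValueError(f"unknown mechanism: {mechanism!r}")

    # Assumed logistic model: a record with an average driver value is
    # masked with probability `rate`; `slope` sets the dependence strength.
    intercept = math.log(rate / (1.0 - rate))
    masked_rows = []
    for row, z in zip(rows, driver):
        p_missing = 1.0 / (1.0 + math.exp(-(intercept + slope * z)))
        is_masked = rng.random() < p_missing
        masked_rows.append(
            {"age": row["age"], "lab": None if is_masked else row["lab"]}
        )
    return masked_rows


if __name__ == "__main__":
    # Synthetic cohort: 'lab' is loosely correlated with 'age'.
    data_rng = random.Random(1)
    cohort = []
    for _ in range(1000):
        age = data_rng.gauss(60, 12)
        cohort.append({"age": age, "lab": 0.05 * age + data_rng.gauss(0, 1)})

    for mech in ("MCAR", "MAR", "MNAR"):
        masked = simulate_missingness(cohort, mech)
        n_missing = sum(r["lab"] is None for r in masked)
        print(mech, "missing lab values:", n_missing)
```

Because the true values are retained in the original rows, any imputation method can then be scored against them separately for each mechanism, which is the kind of comparison across missingness scenarios that the abstract calls for.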