Lian Yi, Jiang Xiaoqian, Long Qi
Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA, USA.
McWilliams School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX, USA.
medRxiv. 2025 May 8:2025.05.08.25327224. doi: 10.1101/2025.05.08.25327224.
Electronic health records (EHRs) collected from diverse healthcare institutions offer a rich and representative data source for clinical research. Federated learning enables analysis of these distributed data without sharing sensitive patient-level information, preserving privacy. However, missing data remain a major challenge and can introduce substantial bias if not properly addressed. Very few distributed imputation methods currently exist, and they fail to account for two critical aspects of EHR data: correlation within sites and variability across sites. We aim to fill this important methodological gap.
We propose Distributed Mixed Model-based Multiple Imputation (D3MI), a novel federated imputation method designed to reduce bias in distributed EHRs. D3MI integrates the strengths of federated learning techniques, statistical learning methods for correlated data, and multilevel imputation algorithms to explicitly account for both within-site correlation and between-site heterogeneity using site-specific random effects. It preserves privacy by avoiding the sharing of raw data, and it is efficient in both communication and computation.
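The abstract does not state the imputation model itself. As a minimal illustration of how site-specific random effects can capture both features, assume a random-intercept linear mixed model for an incomplete variable; the symbols below (y_ij for the value of patient j at site i, x_ij for fully observed covariates, b_i for the site effect) are illustrative assumptions, not the authors' notation or specification:

```latex
\begin{align*}
y_{ij} &= x_{ij}^{\top}\beta + b_i + \varepsilon_{ij},
  \qquad b_i \sim N(0, \tau^2), \quad \varepsilon_{ij} \sim N(0, \sigma^2), \\
\operatorname{Corr}(y_{ij}, y_{ik} \mid x_{ij}, x_{ik}) &= \frac{\tau^2}{\tau^2 + \sigma^2}
  \qquad (j \neq k).
\end{align*}
```

Under these assumptions, with b_i and ε_ij mutually independent, two patients at the same site share b_i and are therefore correlated, while the variance τ² measures how much sites differ from one another. An imputation model that omits b_i implicitly sets τ² = 0 and treats all patients as exchangeable across sites.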
Through extensive simulation studies, we demonstrate that D3MI outperforms state-of-the-art distributed imputation methods in both accuracy and consistency. We further demonstrate the use of D3MI in a real-world EHR case study involving incomplete and clustered data from participating hospitals in the Georgia Coverdell Acute Stroke Registry.
By explicitly modeling the complex structure of distributed EHR data, D3MI addresses key limitations of existing approaches. It provides a powerful and efficient solution for handling missing data in distributed and privacy-sensitive settings and enhances the rigor and reproducibility of collaborative clinical research.