Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, USA.
Biometrics. 2023 Sep;79(3):2357-2369. doi: 10.1111/biom.13786. Epub 2022 Nov 7.
Electronic health records (EHRs) offer great promise for advancing precision medicine and, at the same time, present significant analytical challenges. In particular, patient-level data in EHRs often cannot be shared across institutions (data sources) because of government regulations and/or institutional policies. As a result, there is growing interest in distributed learning over multiple EHR databases without sharing patient-level data. To tackle these challenges, we propose a novel communication-efficient method that aggregates the optimal estimates of external sites by turning the problem into a missing data problem. In addition, we propose incorporating posterior samples from remote sites, which can provide partial information on the missing quantities and improve the efficiency of parameter estimates while satisfying differential privacy, thus reducing the risk of information leakage. The proposed approach allows for proper statistical inference without sharing raw patient-level data. We theoretically investigate the asymptotic properties of the proposed method for statistical inference as well as its differential privacy, and evaluate its performance in simulations and real data analyses in comparison with several recently developed methods.
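To make the general idea of sharing posterior samples instead of patient-level data concrete, the following is a minimal sketch under strong simplifying assumptions. It is not the authors' estimator: it assumes a toy normal-mean model with known variance and a flat prior, and it pools sites by simple precision weighting of summaries recovered from the shared draws. The function names `site_posterior_samples` and `pool_from_samples`, the site sizes, and the number of draws are all hypothetical, and the sketch makes no differential privacy claim of its own.

```python
import random
import statistics


def site_posterior_samples(data, n_draws, rng, sigma=1.0):
    """Draw from the posterior of a normal mean (flat prior, known sigma).

    Toy model assumed for illustration only; run inside each site.
    """
    n = len(data)
    post_mean = statistics.fmean(data)
    post_sd = sigma / n ** 0.5
    return [rng.gauss(post_mean, post_sd) for _ in range(n_draws)]


def pool_from_samples(sample_sets):
    """Precision-weighted pooling of per-site summaries recovered from draws."""
    means = [statistics.fmean(s) for s in sample_sets]
    precisions = [1.0 / statistics.variance(s) for s in sample_sets]
    total = sum(precisions)
    pooled_mean = sum(m * p for m, p in zip(means, precisions)) / total
    pooled_sd = total ** -0.5
    return pooled_mean, pooled_sd


rng = random.Random(0)
true_mu = 2.0                 # hypothetical true parameter
site_sizes = [200, 500, 300]  # hypothetical site sample sizes

# Raw data never leaves a site; only posterior draws are shared.
site_draws = []
for n in site_sizes:
    raw = [rng.gauss(true_mu, 1.0) for _ in range(n)]
    site_draws.append(site_posterior_samples(raw, n_draws=2000, rng=rng))

pooled_mean, pooled_sd = pool_from_samples(site_draws)
print(f"pooled mean = {pooled_mean:.3f}, pooled sd = {pooled_sd:.3f}")
```

In this toy setting the pooled standard deviation should be close to that of an analysis of all 1,000 observations together, which illustrates why shared posterior draws can carry much of the information in the raw data. The method described in the abstract additionally treats the unshared quantities as missing data, which this sketch does not attempt.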