Emory, Atlanta, GA, USA.
University of Texas Health Science Center, Houston, TX, USA.
AMIA Annu Symp Proc. 2021 Jan 25;2020:348-357. eCollection 2020.
Distributed health data networks that use information from multiple sources have drawn substantial interest in recent years. However, missing data are prevalent in such networks and present significant analytical challenges. The current state-of-the-art methods for handling missing data require pooling data into a central repository before analysis, which may not be possible in a distributed health data network. In this paper, we propose a privacy-preserving distributed analysis framework for handling missing data when data are vertically partitioned. In this framework, each institution with a particular data source utilizes the local private data to calculate necessary intermediate aggregated statistics, which are then shared to build a global model for handling missing data. To evaluate our proposed methods, we conduct simulation studies that clearly demonstrate that the proposed privacy-preserving methods perform as well as the methods using the pooled data and outperform several naive methods. We further illustrate the proposed methods through the analysis of a real dataset. The proposed framework for handling vertically partitioned incomplete data is substantially more privacy-preserving than methods that require pooling of the data, since no individual-level data are shared, which can lower hurdles for collaboration across multiple institutions and build stronger public trust.
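To make the idea of fitting a missing-data model from shared aggregates concrete, the following is a minimal sketch and not the authors' protocol. It assumes two hypothetical sites holding different columns for the same five patients, a linear imputation model without intercept, and that the site holding the outcome can tell the other which rows are complete. The function `cross_sum` is a plain stand-in for whatever secure computation a real deployment would use for statistics that span both sites; all names and data are invented for illustration.

```python
# Hypothetical vertically partitioned data: same patients, different columns.
# Site A holds x1 and the partly missing outcome y; site B holds x2.
site_a = {"x1": [1.0, 2.0, 3.0, 4.0, 5.0], "y": [8.0, 7.0, None, 20.0, 13.0]}
site_b = {"x2": [2.0, 1.0, 4.0, 4.0, 1.0]}


def local_sum(u, v, rows):
    """Sum of products over the given rows, computed inside one site."""
    return sum(u[i] * v[i] for i in rows)


def cross_sum(u, v, rows):
    """Sum of products of columns held at different sites.

    ASSUMPTION: in practice this would be a secure protocol; the plain
    computation here only stands in for its result.
    """
    return sum(u[i] * v[i] for i in rows)


def fit_from_aggregates(a, b):
    """Fit y ~ b1*x1 + b2*x2 on complete cases using only aggregate sums."""
    # ASSUMPTION: site A shares which rows have an observed outcome.
    rows = [i for i, y in enumerate(a["y"]) if y is not None]
    y = [0.0 if v is None else v for v in a["y"]]  # placeholders never used

    s11 = local_sum(a["x1"], a["x1"], rows)  # site A only
    s1y = local_sum(a["x1"], y, rows)        # site A only
    s22 = local_sum(b["x2"], b["x2"], rows)  # site B only
    s12 = cross_sum(a["x1"], b["x2"], rows)  # spans both sites
    s2y = cross_sum(b["x2"], y, rows)        # spans both sites

    # Solve the 2x2 normal equations with Cramer's rule.
    det = s11 * s22 - s12 * s12
    b1 = (s1y * s22 - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2


b1, b2 = fit_from_aggregates(site_a, site_b)
print(b1, b2)
```

The toy outcome was generated as y = 2*x1 + 3*x2, so the coefficients recovered from the five aggregate sums match what a fit on the pooled complete cases would give. Imputing the missing outcome itself is left out of the sketch.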