Vanderbilt University, 2525 West End Avenue, Nashville, TN 37240, United States.
J Biomed Inform. 2022 Jan;125:103977. doi: 10.1016/j.jbi.2021.103977. Epub 2021 Dec 14.
Synthetic data generation has emerged as a promising method to protect patient privacy while sharing individual-level health data. Intuitively, sharing synthetic data should reduce disclosure risks because no explicit linkage is retained between the synthetic records and the real data upon which they are based. However, the risks associated with synthetic data are still evolving, and what seems protected today may not be protected tomorrow. In this paper, we show that membership inference attacks, whereby an adversary infers whether the data from certain target individuals (known to the adversary a priori) were relied upon by the synthetic data generation process, can be substantially enhanced through state-of-the-art machine learning frameworks, which calls into question the protective nature of existing synthetic data generators. Specifically, we formulate the membership inference problem from the perspective of the data holder, who aims to perform a disclosure risk assessment prior to sharing any health data. To support such an assessment, we introduce a framework for effective membership inference against synthetic health data that requires no specific assumptions about the generative model or a well-defined data structure, leveraging the principles of contrastive representation learning. To illustrate the potential for such an attack, we conducted experiments against synthesis approaches using two datasets derived from several health data resources (Vanderbilt University Medical Center, the All of Us Research Program) to determine the upper bound of risk brought by an adversary who invokes an optimal strategy. The results indicate that partially synthetic data are vulnerable to membership inference at a very high rate. By contrast, fully synthetic data are only marginally susceptible and, in most cases, could be deemed sufficiently protected from membership inference.
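To make the attack setting concrete, the following is a minimal sketch, in Python with PyTorch, of a distance-based membership inference attack that uses contrastively learned embeddings. It is not the paper's actual framework: the encoder architecture, the noise-based augmentation, the NT-Xent training objective, the nearest-neighbor similarity score, the median threshold, and the helper names (train_encoder, membership_scores) are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def augment(x, noise_scale=0.05):
    """Form a positive pair by adding small Gaussian feature noise (assumption)."""
    return x + noise_scale * torch.randn_like(x)

class Encoder(nn.Module):
    """Small MLP mapping a tabular record to a unit-norm embedding."""
    def __init__(self, n_features, dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=1)

def nt_xent(z1, z2, tau=0.5):
    """NT-Xent contrastive loss: each record's augmented view is its positive,
    every other record in the batch serves as a negative."""
    n = z1.size(0)
    z = torch.cat([z1, z2], dim=0)           # (2n, dim)
    sim = (z @ z.t()) / tau                   # cosine similarities (unit norm)
    sim = sim.masked_fill(torch.eye(2 * n, dtype=torch.bool), -1e9)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)])
    return F.cross_entropy(sim, targets)

def train_encoder(records, epochs=100):
    """Contrastively train the encoder on the released synthetic records."""
    enc = Encoder(records.size(1))
    opt = torch.optim.Adam(enc.parameters(), lr=1e-3)
    for _ in range(epochs):
        loss = nt_xent(enc(augment(records)), enc(augment(records)))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return enc

def membership_scores(enc, synthetic, targets):
    """Score each target by cosine similarity to its nearest synthetic record;
    a high score is treated as evidence of membership."""
    with torch.no_grad():
        zs, zt = enc(synthetic), enc(targets)
    return (zt @ zs.t()).max(dim=1).values

if __name__ == "__main__":
    torch.manual_seed(0)
    synthetic = torch.randn(500, 10)  # stand-in for a released synthetic dataset
    targets = torch.randn(20, 10)     # adversary's a-priori known target records
    enc = train_encoder(synthetic)
    scores = membership_scores(enc, synthetic, targets)
    print(scores > scores.median())   # median threshold is an assumption
```

In this sketch, the adversary holds records for known target individuals and scores each target by its maximum cosine similarity to the released synthetic records in the learned embedding space; unusually high similarity is taken as evidence that the target's record influenced the generation process.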