Membership inference attacks against synthetic health data.

Affiliations

Vanderbilt University, 2525 West End Avenue, Nashville, TN 37240, United States.

Publication information

J Biomed Inform. 2022 Jan;125:103977. doi: 10.1016/j.jbi.2021.103977. Epub 2021 Dec 14.

Abstract

Synthetic data generation has emerged as a promising method to protect patient privacy while sharing individual-level health data. Intuitively, sharing synthetic data should reduce disclosure risks because no explicit linkage is retained between the synthetic records and the real data upon which it is based. However, the risks associated with synthetic data are still evolving, and what seems protected today may not be tomorrow. In this paper, we show that membership inference attacks, whereby an adversary infers if the data from certain target individuals (known to the adversary a priori) were relied upon by the synthetic data generation process, can be substantially enhanced through state-of-the-art machine learning frameworks, which calls into question the protective nature of existing synthetic data generators. Specifically, we formulate the membership inference problem from the perspective of the data holder, who aims to perform a disclosure risk assessment prior to sharing any health data. To support such an assessment, we introduce a framework for effective membership inference against synthetic health data without specific assumptions about the generative model or a well-defined data structure, leveraging the principles of contrastive representation learning. To illustrate the potential for such an attack, we conducted experiments against synthesis approaches using two datasets derived from several health data resources (Vanderbilt University Medical Center, the All of Us Research Program) to determine the upper bound of risk brought by an adversary who invokes an optimal strategy. The results indicate that partially synthetic data are vulnerable to membership inference at a very high rate. By contrast, fully synthetic data are only marginally susceptible and, in most cases, could be deemed sufficiently protected from membership inference.
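
As an illustrative sketch only (the abstract does not describe the authors' contrastive representation learning framework in enough detail to reproduce it), the general attack setting can be mimicked with a simple distance-based score: the adversary measures how close each target record it already holds lies to the released synthetic records and treats unusually close matches as evidence of membership. Everything below, including the names and the decision rule, is a hypothetical stand-in rather than the paper's method.

```python
# Illustrative sketch only: a naive distance-based membership inference score
# against a released synthetic dataset. This is NOT the paper's contrastive
# representation learning framework; it only mimics the attack setting the
# abstract describes. All names, data, and the decision rule are hypothetical.
import numpy as np
from sklearn.neighbors import NearestNeighbors


def membership_scores(synthetic: np.ndarray, targets: np.ndarray) -> np.ndarray:
    """Score each target record; a smaller distance to its nearest synthetic
    record is treated as weak evidence that the target contributed to the
    data used to train the generator."""
    nn = NearestNeighbors(n_neighbors=1).fit(synthetic)
    distances, _ = nn.kneighbors(targets)
    return -distances.ravel()  # higher score = claimed more likely a member


# Toy usage with random vectors standing in for preprocessed health records.
rng = np.random.default_rng(0)
synthetic_release = rng.normal(size=(1000, 20))  # released synthetic dataset
target_records = rng.normal(size=(10, 20))       # records the adversary holds a priori
scores = membership_scores(synthetic_release, target_records)
predicted_members = scores > np.median(scores)   # hypothetical decision threshold
```

A score of this general kind is also what a data holder, as framed in the abstract, could compute internally to estimate disclosure risk before releasing the synthetic data.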

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3589/8766950/6605fbf0877c/nihms-1765731-f0001.jpg

Similar articles

1. Membership inference attacks against synthetic health data. J Biomed Inform. 2022 Jan;125:103977. doi: 10.1016/j.jbi.2021.103977. Epub 2021 Dec 14.
2. Validating a membership disclosure metric for synthetic health data. JAMIA Open. 2022 Oct 11;5(4):ooac083. doi: 10.1093/jamiaopen/ooac083. eCollection 2022 Dec.
3. Tunable Privacy Risk Evaluation of Generative Adversarial Networks. Stud Health Technol Inform. 2024 Aug 22;316:1233-1237. doi: 10.3233/SHTI240634.
9. Sharing Time-to-Event Data with Privacy Protection. Proc (IEEE Int Conf Healthc Inform). 2022 Jun;2022. doi: 10.1109/ichi54592.2022.00014. Epub 2022 Sep 8.

Cited by

8. Privacy-Enhancing Technologies in Biomedical Data Science. Annu Rev Biomed Data Sci. 2024 Aug;7(1):317-343. doi: 10.1146/annurev-biodatasci-120423-120107.
