Evaluating Identity Disclosure Risk in Fully Synthetic Health Data: Model Development and Validation.

Author information

El Emam Khaled, Mosquera Lucy, Bass Jason

Affiliations

School of Epidemiology and Public Health, Faculty of Medicine, University of Ottawa, Ottawa, ON, Canada.

Children's Hospital of Eastern Ontario Research Institute, Ottawa, ON, Canada.

Publication information

J Med Internet Res. 2020 Nov 16;22(11):e23139. doi: 10.2196/23139.

DOI:10.2196/23139

PMID:33196453

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7704280/

Abstract

BACKGROUND

There has been growing interest in data synthesis for enabling the sharing of data for secondary analysis; however, there is a need for a comprehensive privacy risk model for fully synthetic data: If the generative models have been overfit, then it is possible to identify individuals from synthetic data and learn something new about them.

OBJECTIVE

The purpose of this study is to develop and apply a methodology for evaluating the identity disclosure risks of fully synthetic data.

METHODS

A full risk model is presented, which evaluates both identity disclosure and the ability of an adversary to learn something new if there is a match between a synthetic record and a real person. We term this "meaningful identity disclosure risk." The model is applied to samples from the Washington State Hospital discharge database (2007) and the Canadian COVID-19 cases database. Both datasets were synthesized using a sequential decision tree process commonly used to synthesize health and social science data.
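The two-part idea described above (a synthetic record must match a real person, and the match must teach the adversary something) can be illustrated with a toy calculation. The sketch below is not the estimator from the paper, whose formula the abstract does not give. It assumes a simplified rule: a synthetic record contributes risk only if it matches at least one real record on the quasi-identifiers and its sensitive values are actually held by someone in that matching group, in which case it contributes 1 divided by the group size. The function name, field names, and example records are all hypothetical.

```python
from collections import Counter


def toy_meaningful_risk(real, synthetic, quasi_ids, sensitive):
    """Toy average risk over synthetic records (illustrative, not the paper's estimator)."""

    def qi_key(record):
        return tuple(record[q] for q in quasi_ids)

    def sens_key(record):
        return tuple(record[a] for a in sensitive)

    # How many real people share each quasi-identifier combination.
    class_sizes = Counter(qi_key(r) for r in real)
    # Sensitive value combinations actually present in each real group.
    class_values = {}
    for r in real:
        class_values.setdefault(qi_key(r), set()).add(sens_key(r))

    total = 0.0
    for s in synthetic:
        size = class_sizes.get(qi_key(s), 0)
        if size == 0:
            continue  # no real person matches, so no identity disclosure
        if sens_key(s) not in class_values[qi_key(s)]:
            continue  # a match exists, but nothing true would be learned
        total += 1.0 / size  # chance the match points at one specific person
    return total / len(synthetic) if synthetic else 0.0


# Hypothetical records, invented for illustration only.
real = [
    {"age": 30, "sex": "F", "dx": "A"},
    {"age": 30, "sex": "F", "dx": "B"},
    {"age": 40, "sex": "M", "dx": "C"},
]
synthetic = [
    {"age": 30, "sex": "F", "dx": "A"},  # matches a group of 2, dx is real
    {"age": 40, "sex": "M", "dx": "D"},  # matches a person, but dx is wrong
    {"age": 50, "sex": "M", "dx": "C"},  # matches nobody
]
risk = toy_meaningful_risk(real, synthetic, ["age", "sex"], ["dx"])
print(round(risk, 4))
```

In this toy example only the first synthetic record contributes (1/2), so the average over three records is roughly 0.167. The second record shows why a match alone is not counted: the adversary would identify a person but learn something false about them.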

RESULTS

The meaningful identity disclosure risk for both of these synthesized samples was below the commonly used 0.09 risk threshold (0.0198 and 0.0086, respectively), and 4 times and 5 times lower than the risk values for the original datasets, respectively.
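The threshold comparison can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the figures reported above (the 0.09 threshold and the two synthetic-data risk values). The dataset labels, the function name, and the "headroom" ratio (threshold divided by risk) are assumptions introduced here for illustration and are not quantities reported in the abstract.

```python
THRESHOLD = 0.09  # commonly used risk threshold cited in the abstract

# Meaningful identity disclosure risk of each synthesized sample, as reported.
REPORTED_RISKS = {
    "Washington State hospital discharge (2007)": 0.0198,
    "Canadian COVID-19 cases": 0.0086,
}


def compare_to_threshold(risks, threshold=THRESHOLD):
    """Return, per dataset, whether the risk is below the threshold and by what factor."""
    return {
        name: {"below": risk < threshold, "headroom": threshold / risk}
        for name, risk in risks.items()
    }


summary = compare_to_threshold(REPORTED_RISKS)
for name, row in summary.items():
    print(f"{name}: below={row['below']}, threshold/risk={row['headroom']:.1f}")
```

Both samples fall below the threshold, with the COVID-19 sample leaving the larger margin.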

CONCLUSIONS

We have presented a comprehensive identity disclosure risk model for fully synthetic data. The results for this synthesis method on 2 datasets demonstrate that synthesis can reduce meaningful identity disclosure risks considerably. The risk model can be applied in the future to evaluate the privacy of fully synthetic data.

Figure 1 (image): https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9ce2/7704280/a2914b4cf4d9/jmir_v22i11e23139_fig1.jpg
