De-identification is not enough: a comparison between de-identified and synthetic clinical notes.

Affiliations

Department of Computer Science, University of Manitoba, Winnipeg, R3T 5V6, Canada.

McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, 77030, USA.

Publication information

Sci Rep. 2024 Nov 29;14(1):29669. doi: 10.1038/s41598-024-81170-y.

Abstract

For sharing privacy-sensitive data, de-identification is commonly regarded as adequate for safeguarding privacy. Synthetic data is also being considered as a privacy-preserving alternative. Recent successes with generative models for numerical and tabular data, together with the breakthroughs in large generative language models, raise the question of whether synthetically generated clinical notes could be a viable alternative to real notes for research purposes. In this work, we (i) demonstrated that de-identification of real clinical notes does not protect records against a membership inference attack, (ii) proposed a novel approach to generating synthetic clinical notes using current state-of-the-art large language models, (iii) evaluated the performance of the synthetically generated notes in a clinical domain task, and (iv) proposed a way to mount a membership inference attack when the target model is trained with synthetic data. We observed that when synthetically generated notes closely match the performance of real data, they also exhibit privacy concerns similar to those of the real data. Whether other approaches to generating synthetic clinical notes could offer better trade-offs and become a better alternative to sensitive real notes warrants further investigation.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0d5f/11607336/b19f206f97fc/41598_2024_81170_Fig1_HTML.jpg
