
On the Fidelity-Privacy Tradeoff of Synthetic Cancer Registry Data.

Affiliations

Information Systems and Business Administration, Johannes Gutenberg University Mainz, Germany.

Institute for Digital Health Data Rhineland-Palatinate, Germany.

Publication information

Stud Health Technol Inform. 2024 Aug 22;316:621-625. doi: 10.3233/SHTI240490.

Abstract

The sharing of personal health data is highly regulated due to privacy and security concerns. An alternative to sharing personal data is to share synthetic data, because ideally it should be impossible to reconstruct real personal data from synthetic data, which is called privacy. At the same time, the structure of the synthetic data should be as similar as possible to the structure of the real data to ensure that conclusions drawn from the synthetic data are also valid for the real data, which is called fidelity. Typically, there is a tradeoff between fidelity and privacy for synthetic health data. We study the fidelity and privacy of cancer data synthesized using generative machine learning approaches. To generate synthetic cancer data, we use variational autoencoders (VAEs), generative adversarial networks (GANs), and denoising diffusion probabilistic models (DDPMs). The tabular cancer registry data studied have nine categorical variables from breast cancer patients. We find that DDPMs generate synthetic cancer data with higher fidelity; that is, the structure of the synthetic data is more similar to the real cancer data than the data generated by VAEs and GANs. At the same time, synthetic cancer data from DDPMs pose a greater privacy risk because the data are more likely to reveal information from real patients than synthetic data from VAEs and GANs.
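
The abstract does not say which metrics were used to quantify fidelity and privacy, so the following is only an illustrative sketch under assumed definitions, not the authors' method. It scores fidelity as the mean agreement of per-column category frequencies (one minus the total variation distance) and approximates privacy risk by the share of synthetic records that exactly duplicate a real record, plus the mean Hamming distance to the closest real record. All function names and the toy three-column records are hypothetical.

```python
from collections import Counter


def marginal_tvd(real, synth, col):
    """Total variation distance between the real and synthetic marginals of one column."""
    p = Counter(row[col] for row in real)
    q = Counter(row[col] for row in synth)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p[c] / len(real) - q[c] / len(synth)) for c in categories)


def fidelity(real, synth):
    """Assumed fidelity score: mean of (1 - TVD) over columns; 1.0 means identical marginals."""
    n_cols = len(real[0])
    return sum(1.0 - marginal_tvd(real, synth, j) for j in range(n_cols)) / n_cols


def hamming(a, b):
    """Number of columns in which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def exact_match_rate(real, synth):
    """Assumed privacy-risk proxy: share of synthetic records identical to some real record."""
    real_set = set(real)
    return sum(row in real_set for row in synth) / len(synth)


def mean_distance_to_closest(real, synth):
    """Mean Hamming distance from each synthetic record to its nearest real record."""
    return sum(min(hamming(s, r) for r in real) for s in synth) / len(synth)


# Hypothetical toy records (tumour stage, nodal status, receptor status).
real = [("T1", "N0", "pos"), ("T2", "N1", "neg"), ("T1", "N1", "pos"), ("T3", "N0", "neg")]
copied = list(real)  # a generator that memorises its training data
noisy = [("T1", "N0", "neg"), ("T2", "N0", "neg"), ("T1", "N1", "neg"), ("T2", "N1", "pos")]

for name, synth in [("copied", copied), ("noisy", noisy)]:
    print(name, fidelity(real, synth), exact_match_rate(real, synth),
          mean_distance_to_closest(real, synth))
```

In this toy setting the memorising generator reaches perfect fidelity but also reproduces every real record, while the noisier one loses some fidelity and reproduces none, which mirrors the direction of the tradeoff the abstract describes.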
