University of Cambridge, Cambridge, CB2 1TN, UK.
University College London, London, WC1E 6BT, UK.
Sci Rep. 2024 Oct 27;14(1):25676. doi: 10.1038/s41598-024-72894-y.
Synthetic data promise privacy-preserving data sharing for healthcare research and development. Compared with other privacy-enhancing approaches, such as federated learning, analyses performed on synthetic data can be applied downstream without modification, such that synthetic data can act in place of real data for a wide range of use cases. However, the role that synthetic data might play in all aspects of clinical model development remains unknown. In this work, we used state-of-the-art generators explicitly designed for privacy preservation to create a synthetic version of ever-smokers in the UK Biobank before building prognostic models for lung cancer under several data release assumptions. We demonstrate that synthetic data can be effectively used throughout the medical prognostic modeling pipeline even without eventual access to the real data. Furthermore, we show the implications of different data release approaches on how synthetic biobank data could be deployed within the healthcare system.