Arora Anmol, Wagner Siegfried Karl, Carpenter Robin, Jena Rajesh, Keane Pearse A
School of Clinical Medicine, University of Cambridge, Cambridge, UK.
NIHR Biomedical Research Centre, Moorfields Eye Hospital NHS Foundation Trust, London, UK; Institute of Ophthalmology, University College London, London, UK.
Lancet Digit Health. 2025 Feb;7(2):e157-e160. doi: 10.1016/S2589-7500(24)00196-1. Epub 2024 Nov 26.
Synthetic data, generated by artificial intelligence technologies such as generative adversarial networks and latent diffusion models, preserve the aggregate patterns and relationships present in the real data on which those technologies were trained without exposing individual identities, thereby mitigating re-identification risks. This approach has been gaining traction in biomedical research because it preserves privacy and enables dataset sharing between organisations. Although synthetic data are already widely used in other domains, such as finance and high-energy physics, their use in medical research raises novel issues. Using synthetic data to preserve the privacy of the data used to train models requires that the synthetic data remain high-fidelity with respect to the original data, to preserve utility, yet sufficiently different from them to protect against adversarial or accidental re-identification. Standards for synthetic data generation, and consensus standards for its evaluation, are needed. As synthetic data applications expand, ongoing legal and ethical evaluation is crucial to ensure that synthetic data remain a secure and effective tool for advancing medical research without compromising individual privacy.
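One simple way to probe the tension the abstract describes, between fidelity and re-identification risk, is a nearest-neighbour memorisation check: if any synthetic record lies at (near-)zero distance from a real record, the generator may have copied, and could leak, an individual's data. The sketch below is purely illustrative (toy two-feature records, plain Euclidean distance); it is not a method proposed in the article, and real evaluations use richer metrics.

```python
import math
import random

def nearest_real_distance(synthetic, real):
    """For each synthetic record, return the Euclidean distance to its
    closest real record. Distances at or near zero suggest the generator
    may have memorised (and could leak) individual real records."""
    return [min(math.dist(s, r) for r in real) for s in synthetic]

# Toy illustration with hypothetical 2-feature numeric records.
random.seed(0)
real = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(100)]

# A well-behaved synthetic sample: similar distribution, no exact copies.
synthetic = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(50)]

# A leaky sample: contains a verbatim copy of one real record.
leaky = synthetic[:-1] + [real[0]]

print(min(nearest_real_distance(synthetic, real)))  # strictly positive
print(min(nearest_real_distance(leaky, real)))      # 0.0: memorised record flagged
```

In practice such distance-based checks are only one side of the evaluation: they bound disclosure risk but say nothing about utility, which is typically assessed separately (e.g. by training the same model on real and synthetic data and comparing performance).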