Rajotte Jean-Francois, Bergen Robert, Buckeridge David L, El Emam Khaled, Ng Raymond, Strome Elissa
Data Science Institute, University of British Columbia, Vancouver, BC, Canada.
McGill University and McGill University Health Centre, Montreal, QC, Canada.
iScience. 2022 Oct 13;25(11):105331. doi: 10.1016/j.isci.2022.105331. eCollection 2022 Nov 18.
Synthetic data generation (SDG) is the process of using machine learning methods to train a model that captures the patterns in a real dataset. New, synthetic data can then be generated from that trained model. The synthetic data does not have a one-to-one mapping to the original data or to real patients, and therefore has the potential to preserve privacy. There is growing interest in the application of synthetic data across the health and life sciences, but fully realizing its benefits requires further education, research, and policy innovation. This article summarizes the opportunities and challenges of SDG for health data, and provides directions for how this technology can be leveraged to accelerate data access for secondary purposes.
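The two steps described above (fit a model to real data, then sample new records from it) can be illustrated with a minimal sketch. It assumes a multivariate Gaussian as a stand-in for the generative model and uses a simulated two-column table in place of real patient data; the function names, column meanings, and parameter values are hypothetical and are not taken from the article, which does not prescribe a particular method.

```python
import numpy as np


def fit_gaussian(real):
    """Step 1: learn the patterns (mean vector and covariance) of the real records."""
    mean = real.mean(axis=0)
    cov = np.cov(real, rowvar=False)
    return mean, cov


def sample_synthetic(mean, cov, n, rng):
    """Step 2: draw n new records from the fitted model."""
    return rng.multivariate_normal(mean, cov, size=n)


rng = np.random.default_rng(0)

# Hypothetical stand-in for a real dataset: 500 records with two
# correlated numeric features (values chosen only for illustration).
real = rng.multivariate_normal(
    mean=[50.0, 120.0],
    cov=[[100.0, 60.0], [60.0, 225.0]],
    size=500,
)

mean, cov = fit_gaussian(real)
synthetic = sample_synthetic(mean, cov, 500, rng)

# The synthetic rows are fresh draws from the model, not copies of real rows,
# yet aggregate structure such as the correlation is approximately retained.
real_corr = np.corrcoef(real, rowvar=False)[0, 1]
synthetic_corr = np.corrcoef(synthetic, rowvar=False)[0, 1]
print(f"real correlation: {real_corr:.2f}, synthetic correlation: {synthetic_corr:.2f}")
```

In this sketch no synthetic row corresponds to a specific real row, which is the sense in which the mapping is not one-to-one. That property alone is only a potential source of privacy protection, consistent with the hedged wording of the abstract.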