Pezoulas Vasileios C, Zaridis Dimitrios I, Mylona Eugenia, Androutsos Christos, Apostolidis Kosmas, Tachos Nikolaos S, Fotiadis Dimitrios I
Unit of Medical Technology and Intelligent Information Systems, Dept. of Materials Science and Engineering, University of Ioannina, Ioannina GR45110, Greece.
Biomedical Research Institute - FORTH, University of Ioannina, Ioannina GR45110, Greece.
Comput Struct Biotechnol J. 2024 Jul 9;23:2892-2910. doi: 10.1016/j.csbj.2024.07.005. eCollection 2024 Dec.
Synthetic data generation has emerged as a promising solution to the challenges posed by data scarcity and privacy concerns, and to the need for training artificial intelligence (AI) algorithms on unbiased data with sufficient sample size and statistical power. Our review explores the application and efficacy of synthetic data methods in healthcare, considering the diversity of medical data. To this end, we systematically searched the PubMed and Scopus databases, focusing on tabular, imaging, radiomics, time-series, and omics data. Studies involving multi-modal synthetic data generation were also explored. The type of method used for synthetic data generation was identified in each study and categorized as statistical, probabilistic, machine learning, or deep learning. Emphasis was given to the programming languages used to implement each method. Our evaluation revealed that the majority of the studies use synthetic data generators to: (i) reduce the cost and time required for clinical trials for rare diseases and conditions, (ii) enhance the predictive power of AI models in personalized medicine, (iii) ensure the delivery of fair treatment recommendations across diverse patient populations, and (iv) enable researchers to access high-quality, representative multi-modal datasets without exposing sensitive patient information, among others. We underline the wide use of deep learning-based synthetic data generators, found in 72.6% of the included studies, with 75.3% of the generators implemented in Python. Thorough documentation of open-source repositories is finally provided to accelerate research in the field.