Zhang Yili, Dong Jia Li, Xue Bai, Xiong Yanbao, Gupta Samir, Segbroeck Maarten Van, Shara Nawar, McGarvey Peter
Innovation Center for Biomedical Informatics, Georgetown University, Washington, DC.
Department of Computer Science, Yale University, New Haven, CT.
AMIA Annu Symp Proc. 2025 May 22;2024:1313-1322. eCollection 2024.
Privacy and security restrictions on medical data pose challenges to collaborative research, making synthetic data an increasingly attractive solution. Recent advances in generative AI, such as generative adversarial network (GAN) models, have improved synthetic data generation. This study investigates the use of synthetic data in clustering models for opioid misuse analysis. We generated a synthetic dataset that replicates real-world data from 2017 to 2019, including demographics and diagnosis codes, so that comprehensive analysis can proceed while patient privacy is maintained. We developed unsupervised clustering models to identify opioid misuse patterns and assessed the effectiveness of synthetic data across four scenarios: training on the real dataset and testing on the real dataset (TRTR), training on the real dataset and testing on the synthetic dataset (TRTS), training on the synthetic dataset and testing on the real dataset (TSTR), and training on the synthetic dataset and testing on the synthetic dataset (TSTS). Results show that synthetic data used as a training set can replicate the distributions and clustering characteristics of the real data, offering significant potential for collaborative model development and optimization without exposing private data or introducing security risks.
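The four-scenario comparison can be sketched as follows. This is a minimal illustration, not the study's implementation: the abstract does not name the clustering algorithm or the evaluation metric, so the sketch assumes plain k-means and scores each scenario by the mean squared distance from test points to their nearest fitted centroid. The helpers `fit_kmeans`, `mean_sq_error`, `evaluate_scenarios`, and `make_blobs` are hypothetical, and the toy two-dimensional blobs merely stand in for the real and synthetic patient datasets.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fit_kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means (an assumed stand-in); returns k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[idx].append(p)
        # Recompute each centroid; keep the old one if its cluster is empty.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centroids.append(
                    tuple(sum(m[d] for m in members) / len(members)
                          for d in range(dim))
                )
            else:
                new_centroids.append(centroids[i])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids

def mean_sq_error(points, centroids):
    """Mean squared distance from each point to its nearest centroid."""
    return sum(min(sq_dist(p, c) for c in centroids) for p in points) / len(points)

def evaluate_scenarios(real_train, real_test, syn_train, syn_test, k):
    """Fit on each training source, score on each test source."""
    train_sets = {"R": real_train, "S": syn_train}
    test_sets = {"R": real_test, "S": syn_test}
    results = {}
    for train_key, train in train_sets.items():
        centroids = fit_kmeans(train, k)
        for test_key, test in test_sets.items():
            # Builds the labels TRTR, TRTS, TSTR, TSTS.
            results[f"T{train_key}T{test_key}"] = mean_sq_error(test, centroids)
    return results

def make_blobs(n, seed):
    """Toy data only: two Gaussian blobs, alternating between the centers."""
    rng = random.Random(seed)
    centers = [(0.0, 0.0), (5.0, 5.0)]
    return [
        (rng.gauss(centers[i % 2][0], 0.5), rng.gauss(centers[i % 2][1], 0.5))
        for i in range(n)
    ]

# Placeholder "real" and "synthetic" datasets drawn from the same toy distribution.
real = make_blobs(200, seed=1)
synthetic = make_blobs(200, seed=2)

results = evaluate_scenarios(real[:100], real[100:],
                             synthetic[:100], synthetic[100:], k=2)
for name in ("TRTR", "TRTS", "TSTR", "TSTS"):
    print(name, round(results[name], 3))
```

Under this assumed metric, a TSTR score close to the TRTR score would suggest that a model fitted on synthetic data captures the cluster structure of the real data about as well as one fitted on the real data itself.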