Xu JiaLe, Hua Qing, Jia XiaoHong, Zheng YuHang, Hu Qiao, Bai BaoYan, Miao Juan, Zhu LiSha, Zhang MeiXiang, Tao RuoLin, Li YuHeng, Luo Ting, Xie Jun, Zheng XueBin, Gu PengChen, Xing FengYuan, He Chuan, Song YanYan, Dong YiJie, Xia ShuJun, Zhou JianQiao
Department of Ultrasound, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, 200025 Shanghai, China.
College of Health Science and Technology, Shanghai Jiao Tong University School of Medicine, 200025 Shanghai, China.
Research (Wash D C). 2024 Dec 3;7:0532. doi: 10.34133/research.0532. eCollection 2024.
The vast potential of medical big data to enhance healthcare outcomes remains underutilized due to privacy concerns, which restrict cross-center data sharing and the construction of diverse, large-scale datasets. To address this challenge, we developed a deep generative model aimed at synthesizing medical data to overcome data sharing barriers, with a focus on breast ultrasound (US) image synthesis. Specifically, we introduce CoLDiT, a conditional latent diffusion model with a transformer backbone, to generate US images of breast lesions across various Breast Imaging Reporting and Data System (BI-RADS) categories. Using a training dataset of 9,705 US images from 5,243 patients across 202 hospitals with diverse US systems, CoLDiT generated breast US images without duplicating private information, as confirmed through nearest-neighbor analysis. Blinded reader studies further validated the realism of these images, with area under the receiver operating characteristic curve (AUC) scores ranging from 0.53 to 0.77. Additionally, synthetic breast US images effectively augmented the training dataset for BI-RADS classification, achieving performance comparable to that using an equal-sized training set comprising solely real images (P = 0.81 for AUC). Our findings suggest that synthetic data, such as CoLDiT-generated images, offer a viable, privacy-preserving solution to facilitate secure medical data sharing and advance the utilization of medical big data.
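The nearest-neighbor analysis mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not describe how the analysis was carried out, so everything here is an assumption for illustration: images are represented as small feature vectors, similarity is Euclidean distance, and the function names and the threshold are hypothetical rather than taken from the study.

```python
import math

def nearest_neighbor_distances(synthetic, training):
    """For each synthetic vector, return (index, distance) of its closest training vector.

    Assumption: each image has already been reduced to a numeric feature vector.
    """
    results = []
    for s in synthetic:
        best_idx, best_dist = -1, math.inf
        for i, t in enumerate(training):
            d = math.dist(s, t)  # Euclidean distance (Python 3.8+)
            if d < best_dist:
                best_idx, best_dist = i, d
        results.append((best_idx, best_dist))
    return results

def flag_near_duplicates(synthetic, training, threshold):
    """Return indices of synthetic vectors closer than `threshold` to any training vector.

    The threshold is a hypothetical tuning parameter, not a value from the paper.
    """
    pairs = nearest_neighbor_distances(synthetic, training)
    return [j for j, (_, d) in enumerate(pairs) if d < threshold]

# Toy data: the third synthetic vector is an exact copy of a training vector.
training = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
synthetic = [[0.9, 1.1], [2.5, 2.5], [4.0, 0.0]]

print(nearest_neighbor_distances(synthetic, training))
print(flag_near_duplicates(synthetic, training, threshold=0.05))
```

In this toy setup, a synthetic sample whose nearest training sample lies at near-zero distance would be treated as a possible copy of private data; samples whose nearest neighbors are clearly distinct would not be flagged.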