Deshpande Rucha, Kelkar Varun A, Gotsis Dimitrios, Kc Prabhat, Zeng Rongping, Myers Kyle J, Brooks Frank J, Anastasio Mark A
Dept. of Biomedical Engineering, Washington University in St. Louis, St. Louis, Missouri, USA.
Dept. of Electrical and Computer Engineering, University of Illinois Urbana-Champaign, Urbana, Illinois, USA.
Med Phys. 2025 Jan;52(1):4-20. doi: 10.1002/mp.17473. Epub 2024 Oct 24.
The findings of the 2023 AAPM Grand Challenge on Deep Generative Modeling for Learning Medical Image Statistics are reported in this Special Report.
The goal of this challenge was to promote the development of deep generative models (DGMs) for medical imaging and to emphasize the need for their domain-relevant assessment via the analysis of relevant image statistics.
As part of this Grand Challenge, a common training dataset and an evaluation procedure were developed for benchmarking deep generative models for medical image synthesis. To create the training dataset, an established 3D virtual breast phantom was adapted. The resulting dataset comprised about 108 000 images of size 512 × 512. For the evaluation of submissions to the Challenge, an ensemble of 10 000 DGM-generated images from each submission was employed. The evaluation procedure consisted of two stages. In the first stage, a preliminary check for memorization and image quality (via the Fréchet Inception Distance [FID]) was performed. Submissions that passed the first stage were then evaluated for the reproducibility of image statistics corresponding to several feature families, including texture, morphology, image moments, fractal statistics, and skeleton statistics. A summary measure in this feature space was employed to rank the submissions. Additional analyses were performed to assess DGM performance with respect to individual feature families and the four classes in the training data, and to identify various artifacts.
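The first-stage image-quality check relies on the Fréchet Inception Distance, which is the Fréchet (2-Wasserstein) distance between Gaussian fits to two ensembles of deep image features. As a minimal sketch only (not the Challenge's actual evaluation code), the distance can be computed from the feature means and covariances as follows; the function name and the use of NumPy/SciPy here are illustrative assumptions:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real, feats_gen):
    """Fréchet distance between two feature ensembles (rows = images).

    FID is this distance applied to Inception-v3 features of real vs.
    generated images; here the feature extractor is assumed external.
    """
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma1 = np.cov(feats_real, rowvar=False)
    sigma2 = np.cov(feats_gen, rowvar=False)
    # Matrix square root of the covariance product; small imaginary
    # components can arise from numerical error and are discarded.
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu1 - mu2
    return diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean)
```

Identical ensembles yield a distance of zero, while a pure mean shift contributes the squared norm of that shift, which is one way the measure conflates fidelity and diversity and motivates the Challenge's additional feature-family analyses.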
Fifty-eight submissions from 12 unique users were received for this Challenge. Out of these 12 submissions, 9 passed the first stage of evaluation and were eligible for ranking. The top-ranked submission employed a conditional latent diffusion model, whereas the joint runners-up employed a generative adversarial network, followed by another network for image superresolution. In general, we observed that the overall ranking of the top 9 submissions according to our evaluation method (i) did not match the FID-based ranking, and (ii) differed with respect to individual feature families. Another important finding from our additional analyses was that different DGMs demonstrated similar kinds of artifacts.
This Grand Challenge highlighted the need for domain-specific evaluation to further DGM design as well as deployment. It also demonstrated that the specification of a DGM may differ depending on its intended use.