Evaluating Text-to-Image Generated Photorealistic Images of Human Anatomy.
Authors
Muhr Paula, Pan Yating, Tumescheit Charlotte, Kübler Ann-Kathrin, Parmaksiz Hatice Kübra, Chen Cheng, Bolaños Orozco Pablo Sebastián, Lienkamp Soeren S, Hastings Janna
Affiliations
Faculty of Medicine, Institute for Implementation Science in Health Care, University of Zurich, Zurich, CHE.
Digital Society Initiative, University of Zurich, Zurich, CHE.
Publication information
Cureus. 2024 Nov 21;16(11):e74193. doi: 10.7759/cureus.74193. eCollection 2024 Nov.
BACKGROUND
Generative artificial intelligence (AI) models that can produce photorealistic images from text descriptions have many applications in medicine, including medical education and the generation of synthetic data. However, it can be challenging to evaluate their heterogeneous outputs and to compare different models. There is a need for a systematic approach enabling image and model comparisons.
METHOD
To address this gap, we developed an error classification system for annotating errors in AI-generated photorealistic images of humans and applied our method to a corpus of 240 images generated with three different models (DALL-E 3, Stable Diffusion XL, and Stable Cascade) using 10 prompts with eight images per prompt.
RESULTS
The error classification system identifies five error types with three severity levels across five anatomical regions and specifies an associated quantitative scoring method based on the aggregated proportion of errors per expected count of anatomical components in the generated image. We assessed inter-rater agreement by double-annotating 25% of the images and calculating Krippendorff's alpha, and we compared results across the three models and 10 prompts quantitatively using a cumulative score per image. The error classification system, accompanying training manual, generated image collection, annotations, and all associated scripts are available from our GitHub repository at https://github.com/hastingslab-org/ai-human-images. Inter-rater agreement was relatively poor, reflecting the subjectivity of the error classification task. Model comparisons revealed that DALL-E 3 performed consistently better than Stable Diffusion; however, the latter generated images reflecting greater diversity in personal attributes. Images with groups of people were more challenging for all the models than images of individuals or pairs, and some prompts were challenging for all models.
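The cumulative per-image score described above can be sketched as follows. This is a minimal illustration only, assuming a hypothetical annotation record format and hypothetical severity weights; the authors' actual taxonomy, weights, and scripts are in their GitHub repository and are not reproduced here.

```python
# Hypothetical severity weights for the three severity levels (an assumption,
# not the paper's exact weighting scheme).
SEVERITY_WEIGHTS = {"mild": 1, "moderate": 2, "severe": 3}

def cumulative_score(annotations, expected_components):
    """Aggregate severity-weighted errors, normalised by the expected number
    of anatomical components for the image (higher = more error-laden)."""
    total = sum(SEVERITY_WEIGHTS[a["severity"]] for a in annotations)
    return total / expected_components

# Usage: an image with 10 expected anatomical components and two
# annotated errors (region and error-type labels are illustrative).
errors = [
    {"region": "hands", "type": "extra_part", "severity": "severe"},
    {"region": "face", "type": "distortion", "severity": "mild"},
]
print(cumulative_score(errors, expected_components=10))  # 0.4
```

Normalising by the expected component count makes scores comparable across prompts depicting individuals, pairs, and groups, since images with more people offer more opportunities for error.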
CONCLUSION
Our method enables systematic comparison of AI-generated photorealistic images of humans; our results can serve to catalyse improvements in these models for medical applications.