
AI-generated dermatologic images show deficient skin tone diversity and poor diagnostic accuracy: An experimental study.

Author Information

Joerg Lucie, Kabakova Margaret, Wang Jennifer Y, Austin Evan, Cohen Marc, Kurtti Alana, Jagdeo Jared

Affiliations

Albany Medical College, Albany, New York, USA.

Department of Dermatology, State University of New York, Downstate Health Sciences University, Brooklyn, New York, USA.

Publication Information

J Eur Acad Dermatol Venereol. 2025 Jul 16. doi: 10.1111/jdv.20849.

Abstract

BACKGROUND

Generative AI models are increasingly used in dermatology, yet biases in training datasets may reduce diagnostic accuracy and perpetuate ethnic health disparities.

OBJECTIVES

To evaluate two key AI outputs: (1) skin tone representation and (2) diagnostic accuracy of generated dermatologic conditions.

METHODS

Using the standard prompt 'Generate a photo of a person with [skin condition],' this cross-sectional study investigated skin tone diversity and diagnostic accuracy across four leading AI models (Adobe Firefly, ChatGPT-4o, Midjourney and Stable Diffusion) for the 20 most common skin conditions. All images (n = 4000) were evaluated for skin tone representation from June to July 2024. Two independent raters used the Fitzpatrick scale to assess skin tone, and the resulting distributions were compared to U.S. Census demographics using χ² tests. Two blinded dermatology residents evaluated a randomized 200-image subset for diagnostic accuracy. An inter-rater kappa statistic was calculated to assess rater agreement.
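
To illustrate the statistics named above, the sketch below (Python, using scipy and scikit-learn) shows how a χ² goodness-of-fit test against census-derived expected counts and a Cohen's kappa for inter-rater agreement could be computed. The census proportion, per-model counts and rater labels are hypothetical placeholders, not values from the study, and this is not the authors' analysis code.

```python
# Minimal sketch, not the authors' analysis code. The census proportion,
# the observed counts and the rater labels are hypothetical placeholders.
from scipy.stats import chisquare
from sklearn.metrics import cohen_kappa_score

n_images = 1000                  # per-model image count (n = 4000 across four models)
census_dark_prop = 0.40          # assumed census share of darker skin tones (placeholder)

observed = [62, 938]             # [dark, light] counts for one hypothetical model
expected = [census_dark_prop * n_images, (1 - census_dark_prop) * n_images]

# One-degree-of-freedom goodness-of-fit test against census-derived expectations.
chi2_stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi2(1) = {chi2_stat:.3f}, p = {p_value:.3g}")

# Cohen's kappa for agreement between two raters assigning Fitzpatrick types.
rater_a = ["II", "III", "V", "II", "IV", "VI"]
rater_b = ["II", "III", "IV", "II", "IV", "VI"]
print(f"kappa = {cohen_kappa_score(rater_a, rater_b):.2f}")
```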

RESULTS

Across all generated images, 89.8% depicted light skin and 10.2% depicted dark skin. Adobe Firefly demonstrated the highest alignment with U.S. demographic data, with a non-significant chi-square result (38.1% dark skin, χ²(1) = 0.320, p = 0.572), indicating no meaningful difference between its generated skin tone diversity and census demographics. ChatGPT-4o, Midjourney and Stable Diffusion significantly underrepresented dark skin (Fitzpatrick type > IV), with 6.0%, 3.9% and 8.7% dark skin, respectively (all p < 0.001). Across all platforms, only 15% of images were identifiable by raters as the intended condition. Adobe Firefly had the lowest accuracy (0.94%), while ChatGPT-4o, Midjourney and Stable Diffusion demonstrated higher but still suboptimal accuracy (22%, 12.2% and 22.5%, respectively).
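
As a quick arithmetic check of the reported statistic (assuming, as stated, one degree of freedom), the chi-square survival function recovers the published p-value. This brief sketch uses scipy and is not part of the study.

```python
# Sanity check: p-value implied by the reported chi2(1) = 0.320 statistic.
from scipy.stats import chi2

print(round(chi2.sf(0.320, df=1), 3))  # 0.572, matching the reported p = 0.572
```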

CONCLUSIONS

The study highlights substantial deficiencies in the diversity and accuracy of AI-generated dermatological images. AI programs may exacerbate cognitive bias and health inequity, suggesting the need for ethical AI guidelines and diverse datasets to improve disease diagnosis and dermatologic care.

