
Comparative Analysis of Large Language Models in Dermatological Diagnosis: An Evaluation of Diagnostic Accuracy.

Author Information

Tekchandani Niharika, Mukherjee Anurup, Poonthottam Nandakumar, Boussios Stergios

Affiliations

Medicine, Medway NHS Foundation Trust, Kent, GBR.

Digital Health/Internal Medicine, Kent and Medway Medical School/Maidstone and Tunbridge Wells NHS Trust, Kent, GBR.

Publication Information

Cureus. 2025 Sep 11;17(9):e92089. doi: 10.7759/cureus.92089. eCollection 2025 Sep.

Abstract

BACKGROUND

The diagnostic process in dermatology often hinges on visual recognition and clinical pattern matching, making it an attractive field for the application of artificial intelligence (AI). Large language models (LLMs) like ChatGPT-4o, Claude 3.7 Sonnet, and Gemini 2.0 Flash offer new possibilities for augmenting diagnostic reasoning, particularly in rare or diagnostically challenging cases. This study evaluates and compares the diagnostic capabilities of these LLMs based solely on clinical presentations extracted from rare dermatological case reports.

METHODOLOGY

Fifteen published case reports of rare dermatological conditions were retrospectively selected. Key clinical features, excluding laboratory or histopathological findings, were input into each of the three LLMs using standardized prompts. Each model produced a most probable diagnosis and a list of differential diagnoses. The outputs were evaluated for top-match accuracy and whether the correct diagnosis was included in the differential list. Performance was analyzed descriptively, with visual aids (heatmaps, bar charts) illustrating comparative outcomes.

RESULTS

ChatGPT-4o and Claude 3.7 Sonnet each correctly identified the top diagnosis in 10 (66.7%) out of 15 cases, compared to 8 (53.3%) out of 15 for Gemini 2.0 Flash. When differential-only matches were included, both ChatGPT-4o and Claude 3.7 achieved a total coverage of 86.7%, while Gemini 2.0 reached 60.0%. Notably, all models failed to identify certain diagnoses, including blastic plasmacytoid dendritic cell neoplasm and amelanotic melanoma, underscoring the potential risks associated with plausible but incorrect outputs.
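The two reported metrics, top-match accuracy and total differential coverage, can be reproduced from per-case outcomes. In the sketch below, the per-case labels are hypothetical reconstructions from the reported counts (the study's actual case-level data are not given in the abstract); only the totals match the published percentages.

```python
def diagnostic_metrics(outcomes):
    """Return (top-match accuracy %, total coverage %) for a list of
    per-case outcomes: 'top' (correct top diagnosis), 'differential'
    (correct diagnosis appears only in the differential list), 'miss'."""
    n = len(outcomes)
    top = sum(1 for o in outcomes if o == "top")
    covered = sum(1 for o in outcomes if o in ("top", "differential"))
    return round(100 * top / n, 1), round(100 * covered / n, 1)

# Hypothetical per-case labels reconstructed from the reported totals
# (15 cases per model).
chatgpt_4o = ["top"] * 10 + ["differential"] * 3 + ["miss"] * 2
gemini_20  = ["top"] * 8 + ["differential"] * 1 + ["miss"] * 6

print(diagnostic_metrics(chatgpt_4o))  # (66.7, 86.7)
print(diagnostic_metrics(gemini_20))   # (53.3, 60.0)
```

Claude 3.7 Sonnet's figures (66.7% top-match, 86.7% coverage) follow the same arithmetic as ChatGPT-4o's.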

CONCLUSIONS

This study demonstrates that ChatGPT-4o and Claude 3.7 Sonnet show promising diagnostic potential in rare dermatologic cases, outperforming Gemini 2.0 Flash in both accuracy and diagnostic breadth. While LLMs may assist in clinical reasoning, particularly in settings with limited dermatology expertise, they should be used as adjunctive tools, not substitutes, for clinician judgment. Further refinement, validation, and integration into clinical workflows are warranted.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2d20/12425564/d2acf5f7b778/cureus-0017-00000092089-i01.jpg
