Comparative Analysis of Large Language Models in Dermatological Diagnosis: An Evaluation of Diagnostic Accuracy.

Authors

Tekchandani Niharika, Mukherjee Anurup, Poonthottam Nandakumar, Boussios Stergios

Affiliations

Medicine, Medway NHS Foundation Trust, Kent, GBR.

Digital Health/Internal Medicine, Kent and Medway Medical School/Maidstone and Tunbridge Wells NHS Trust, Kent, GBR.

Publication details

Cureus. 2025 Sep 11;17(9):e92089. doi: 10.7759/cureus.92089. eCollection 2025 Sep.

DOI:10.7759/cureus.92089
PMID:40949075
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12425564/
Abstract

BACKGROUND

The diagnostic process in dermatology often hinges on visual recognition and clinical pattern matching, making it an attractive field for the application of artificial intelligence (AI). Large language models (LLMs) like ChatGPT-4o, Claude 3.7 Sonnet, and Gemini 2.0 Flash offer new possibilities for augmenting diagnostic reasoning, particularly in rare or diagnostically challenging cases. This study evaluates and compares the diagnostic capabilities of these LLMs based solely on clinical presentations extracted from rare dermatological case reports.

METHODOLOGY

Fifteen published case reports of rare dermatological conditions were retrospectively selected. Key clinical features, excluding laboratory or histopathological findings, were input into each of the three LLMs using standardized prompts. Each model produced a most probable diagnosis and a list of differential diagnoses. The outputs were evaluated for top-match accuracy and whether the correct diagnosis was included in the differential list. Performance was analyzed descriptively, with visual aids (heatmaps, bar charts) illustrating comparative outcomes.
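The two evaluation criteria described above (top-match accuracy and inclusion in the differential list) can be sketched as a small scoring function. This is only an illustration under stated assumptions: the abstract does not say how matches were adjudicated, so the case-insensitive exact string comparison, the `Outcome` categories, and the `score_case` name are all hypothetical and stand in for what was probably a manual clinical judgment.

```python
from enum import Enum


class Outcome(Enum):
    """Hypothetical outcome categories for one model on one case."""
    TOP_MATCH = "top match"
    DIFFERENTIAL_ONLY = "differential only"
    MISS = "miss"


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not matter."""
    return " ".join(name.lower().split())


def score_case(reference: str, top: str, differentials: list[str]) -> Outcome:
    """Classify a model's output against the reference diagnosis.

    Assumption: exact string equality after normalization. A real evaluation
    would need synonym handling or human review.
    """
    ref = normalize(reference)
    if normalize(top) == ref:
        return Outcome.TOP_MATCH
    if any(normalize(d) == ref for d in differentials):
        return Outcome.DIFFERENTIAL_ONLY
    return Outcome.MISS
```

A case scored as `DIFFERENTIAL_ONLY` counts toward total coverage but not toward top-match accuracy, which is how the two reported percentages differ.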

RESULTS

ChatGPT-4o and Claude 3.7 Sonnet each correctly identified the top diagnosis in 10 (66.7%) out of 15 cases, compared to 8 (53.3%) out of 15 for Gemini 2.0 Flash. When differential-only matches were included, both ChatGPT-4o and Claude 3.7 achieved a total coverage of 86.7%, while Gemini 2.0 reached 60.0%. Notably, all models failed to identify certain diagnoses, including blastic plasmacytoid dendritic cell neoplasm and amelanotic melanoma, underscoring the potential risks associated with plausible but incorrect outputs.
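As a quick arithmetic check on these figures, the short sketch below recomputes the percentages from case counts. The top-match counts (10, 10, 8 of 15) come from the abstract; the covered-case counts (13, 13, 9) are not stated and are inferred here as the only whole numbers of 15 cases that give 86.7% and 60.0%. The function and variable names are illustrative.

```python
TOTAL_CASES = 15

# Reported in the abstract: cases where the top diagnosis was correct.
TOP_MATCH = {"ChatGPT-4o": 10, "Claude 3.7 Sonnet": 10, "Gemini 2.0 Flash": 8}

# Inferred (not stated in the abstract): cases covered by the top diagnosis
# or the differential list, back-calculated from 86.7% and 60.0% of 15.
COVERED = {"ChatGPT-4o": 13, "Claude 3.7 Sonnet": 13, "Gemini 2.0 Flash": 9}


def pct(count: int, total: int = TOTAL_CASES) -> float:
    """Percentage rounded to one decimal place, as reported in the abstract."""
    return round(100 * count / total, 1)


def summarize() -> dict[str, dict[str, float]]:
    """Return top-match accuracy, differential-only count, and total coverage per model."""
    rows = {}
    for model, top in TOP_MATCH.items():
        covered = COVERED[model]
        rows[model] = {
            "top_match_pct": pct(top),
            "differential_only": covered - top,
            "coverage_pct": pct(covered),
        }
    return rows


if __name__ == "__main__":
    for model, row in summarize().items():
        print(model, row)
```

Under these inferred counts, ChatGPT-4o and Claude 3.7 Sonnet each gain three cases from differential-only matches, and Gemini 2.0 Flash gains one.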

CONCLUSIONS

This study demonstrates that ChatGPT-4o and Claude 3.7 Sonnet show promising diagnostic potential in rare dermatologic cases, outperforming Gemini 2.0 Flash in both accuracy and diagnostic breadth. While LLMs may assist in clinical reasoning, particularly in settings with limited dermatology expertise, they should be used as adjunctive tools, not substitutes, for clinician judgment. Further refinement, validation, and integration into clinical workflows are warranted.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2d20/12425564/cec9127a6bb2/cureus-0017-00000092089-i02.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2d20/12425564/d2acf5f7b778/cureus-0017-00000092089-i01.jpg
