Claude 3 Opus and ChatGPT With GPT-4 in Dermoscopic Image Analysis for Melanoma Diagnosis: Comparative Performance Analysis.

Authors

Liu Xu, Duan Chaoli, Kim Min-Kyu, Zhang Lu, Jee Eunjin, Maharjan Beenu, Huang Yuwei, Du Dan, Jiang Xian

Affiliation

Department of Dermatology, West China Hospital, Sichuan University, Chengdu, China.

Publication

JMIR Med Inform. 2024 Aug 6;12:e59273. doi: 10.2196/59273.

DOI:10.2196/59273
PMID:39106482
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11336503/
Abstract

BACKGROUND

Recent advancements in artificial intelligence (AI) and large language models (LLMs) have shown potential in medical fields, including dermatology. With the introduction of image analysis capabilities in LLMs, their application in dermatological diagnostics has garnered significant interest. These capabilities are enabled by the integration of computer vision techniques into the underlying architecture of LLMs.

OBJECTIVE

This study aimed to compare the diagnostic performance of Claude 3 Opus and ChatGPT with GPT-4 in analyzing dermoscopic images for melanoma detection, providing insights into their strengths and limitations.

METHODS

We randomly selected 100 histopathology-confirmed dermoscopic images (50 malignant, 50 benign) from the International Skin Imaging Collaboration (ISIC) archive using a computer-generated randomization process. The ISIC archive was chosen due to its comprehensive and well-annotated collection of dermoscopic images, ensuring a diverse and representative sample. Images were included if they were dermoscopic images of melanocytic lesions with histopathologically confirmed diagnoses. Each model was given the same prompt, instructing it to provide the top 3 differential diagnoses for each image, ranked by likelihood. Primary diagnosis accuracy, accuracy of the top 3 differential diagnoses, and malignancy discrimination ability were assessed. The McNemar test was chosen to compare the diagnostic performance of the 2 models, as it is suitable for analyzing paired nominal data.
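The paired comparison described in the methods can be illustrated with a minimal sketch of an exact (binomial) McNemar test. The abstract does not say whether the authors used the exact or the chi-square form of the test, so the exact form is an assumption here. The function names and the six-image example data are hypothetical and are not taken from the study.

```python
from math import comb


def discordant_counts(correct_a, correct_b):
    """Count discordant pairs from two equal-length lists of booleans.

    Returns (b, c), where b is the number of images only model A classified
    correctly and c is the number only model B classified correctly.
    """
    b = sum(1 for a, other in zip(correct_a, correct_b) if a and not other)
    c = sum(1 for a, other in zip(correct_a, correct_b) if not a and other)
    return b, c


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Under the null hypothesis each discordant pair favours either model
    # with probability 0.5, so the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical per-image correctness for six images (illustration only).
model_a = [True, True, False, True, False, True]
model_b = [True, False, False, False, True, False]

b, c = discordant_counts(model_a, model_b)
p_value = mcnemar_exact(b, c)
print(f"b={b}, c={c}, p={p_value:.3f}")
```

Only the discordant pairs enter the test; images that both models classified correctly, or both incorrectly, carry no information about which model performs better.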

RESULTS

In the primary diagnosis, Claude 3 Opus achieved 54.9% sensitivity (95% CI 44.08%-65.37%), 57.14% specificity (95% CI 46.31%-67.46%), and 56% accuracy (95% CI 46.22%-65.42%), while ChatGPT demonstrated 56.86% sensitivity (95% CI 45.99%-67.21%), 38.78% specificity (95% CI 28.77%-49.59%), and 48% accuracy (95% CI 38.37%-57.75%). The McNemar test showed no significant difference between the 2 models (P=.17). For the top 3 differential diagnoses, Claude 3 Opus and ChatGPT included the correct diagnosis in 76% (95% CI 66.33%-83.77%) and 78% (95% CI 68.46%-85.45%) of cases, respectively. The McNemar test showed no significant difference (P=.56). In malignancy discrimination, Claude 3 Opus outperformed ChatGPT with 47.06% sensitivity, 81.63% specificity, and 64% accuracy, compared to 45.1%, 42.86%, and 44%, respectively. The McNemar test showed a significant difference (P<.001). Claude 3 Opus had an odds ratio of 3.951 (95% CI 1.685-9.263) in discriminating malignancy, while ChatGPT-4 had an odds ratio of 0.616 (95% CI 0.297-1.278).
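As a rough check on how these figures relate to each other, the sketch below recomputes sensitivity, specificity, accuracy, and the odds ratio for malignancy discrimination from 2x2 counts. The counts are not given in the abstract. They are back-calculated from the reported percentages and appear to be consistent with denominators of 51 malignant and 49 benign images, so they should be treated as an inference. The `Confusion` class is a hypothetical helper. The confidence intervals are not reproduced, since the abstract does not state how they were calculated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """2x2 table for a binary malignant-vs-benign call."""
    tp: int  # malignant lesions called malignant
    fn: int  # malignant lesions called benign
    fp: int  # benign lesions called malignant
    tn: int  # benign lesions called benign

    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)

    def accuracy(self) -> float:
        return (self.tp + self.tn) / (self.tp + self.fn + self.fp + self.tn)

    def odds_ratio(self) -> float:
        # Cross-product ratio of the 2x2 table.
        return (self.tp * self.tn) / (self.fn * self.fp)


# ASSUMPTION: counts inferred from the reported percentages, not study data.
claude = Confusion(tp=24, fn=27, fp=9, tn=40)
chatgpt = Confusion(tp=23, fn=28, fp=28, tn=21)

for name, table in (("Claude 3 Opus", claude), ("ChatGPT", chatgpt)):
    print(
        f"{name}: sensitivity {100 * table.sensitivity():.2f}%, "
        f"specificity {100 * table.specificity():.2f}%, "
        f"accuracy {100 * table.accuracy():.0f}%, "
        f"odds ratio {table.odds_ratio():.3f}"
    )
```

An odds ratio above 1 means that a malignant call from the model is associated with true malignancy. A value near or below 1, as reported for ChatGPT, indicates no useful discrimination.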

CONCLUSIONS

Our study highlights the potential of LLMs in assisting dermatologists but also reveals their limitations. Both models made errors in diagnosing melanoma and benign lesions. These findings underscore the need for developing robust, transparent, and clinically validated AI models through collaborative efforts between AI researchers, dermatologists, and other health care professionals. While AI can provide valuable insights, it cannot yet replace the expertise of trained clinicians.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e5a3/11336503/aead905830da/medinform_v12i1e59273_fig1.jpg
