

Claude 3 Opus and ChatGPT With GPT-4 in Dermoscopic Image Analysis for Melanoma Diagnosis: Comparative Performance Analysis.

Authors

Liu Xu, Duan Chaoli, Kim Min-Kyu, Zhang Lu, Jee Eunjin, Maharjan Beenu, Huang Yuwei, Du Dan, Jiang Xian

Affiliation

Department of Dermatology, West China Hospital, Sichuan University, Chengdu, China.

Publication

JMIR Med Inform. 2024 Aug 6;12:e59273. doi: 10.2196/59273.

Abstract

BACKGROUND

Recent advancements in artificial intelligence (AI) and large language models (LLMs) have shown potential in medical fields, including dermatology. With the introduction of image analysis capabilities in LLMs, their application in dermatological diagnostics has garnered significant interest. These capabilities are enabled by the integration of computer vision techniques into the underlying architecture of LLMs.

OBJECTIVE

This study aimed to compare the diagnostic performance of Claude 3 Opus and ChatGPT with GPT-4 in analyzing dermoscopic images for melanoma detection, providing insights into their strengths and limitations.

METHODS

We randomly selected 100 histopathology-confirmed dermoscopic images (50 malignant, 50 benign) from the International Skin Imaging Collaboration (ISIC) archive using a computer-generated randomization process. The ISIC archive was chosen due to its comprehensive and well-annotated collection of dermoscopic images, ensuring a diverse and representative sample. Images were included if they were dermoscopic images of melanocytic lesions with histopathologically confirmed diagnoses. Each model was given the same prompt, instructing it to provide the top 3 differential diagnoses for each image, ranked by likelihood. Primary diagnosis accuracy, accuracy of the top 3 differential diagnoses, and malignancy discrimination ability were assessed. The McNemar test was chosen to compare the diagnostic performance of the 2 models, as it is suitable for analyzing paired nominal data.
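The paired comparison described above can be sketched in code. The exact binomial form of the McNemar test shown here is one common variant for paired nominal data; the abstract does not state whether the authors used the exact or the chi-square version, and the cell counts in the usage line are illustrative, not the study's data.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant cells of a
    paired 2x2 table: b = images model A classified correctly and model B
    did not, c = the reverse. Concordant cells do not enter the test."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: the models agree on every image
    k = min(b, c)
    # Exact binomial test of the smaller discordant count against
    # Binomial(n, 0.5), doubled for a two-sided p-value and capped at 1.
    p = 2.0 * sum(comb(n, i) * 0.5 ** n for i in range(k + 1))
    return min(p, 1.0)


# Illustrative counts only (the per-image agreement table is not reported
# in the abstract):
print(round(mcnemar_exact(10, 4), 3))  # 0.18
```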

RESULTS

In the primary diagnosis, Claude 3 Opus achieved 54.9% sensitivity (95% CI 44.08%-65.37%), 57.14% specificity (95% CI 46.31%-67.46%), and 56% accuracy (95% CI 46.22%-65.42%), while ChatGPT demonstrated 56.86% sensitivity (95% CI 45.99%-67.21%), 38.78% specificity (95% CI 28.77%-49.59%), and 48% accuracy (95% CI 38.37%-57.75%). The McNemar test showed no significant difference between the 2 models (P=.17). For the top 3 differential diagnoses, Claude 3 Opus and ChatGPT included the correct diagnosis in 76% (95% CI 66.33%-83.77%) and 78% (95% CI 68.46%-85.45%) of cases, respectively. The McNemar test showed no significant difference (P=.56). In malignancy discrimination, Claude 3 Opus outperformed ChatGPT with 47.06% sensitivity, 81.63% specificity, and 64% accuracy, compared to 45.1%, 42.86%, and 44%, respectively. The McNemar test showed a significant difference (P<.001). Claude 3 Opus had an odds ratio of 3.951 (95% CI 1.685-9.263) in discriminating malignancy, while ChatGPT-4 had an odds ratio of 0.616 (95% CI 0.297-1.278).
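All of the headline metrics in the malignancy-discrimination comparison derive from a single 2x2 confusion matrix per model. A minimal sketch, using hypothetical cell counts (the abstract reports only percentages, not the underlying table):

```python
def diagnostic_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Standard diagnostic metrics from a 2x2 confusion matrix.
    tp/fn: malignant lesions called malignant/benign;
    tn/fp: benign lesions called benign/malignant."""
    return {
        "sensitivity": tp / (tp + fn),            # true positive rate
        "specificity": tn / (tn + fp),            # true negative rate
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "odds_ratio": (tp * tn) / (fp * fn),      # cross-product ratio
    }


# Hypothetical counts chosen only to illustrate the arithmetic:
m = diagnostic_metrics(tp=24, fn=27, tn=40, fp=9)
print({k: round(v, 4) for k, v in m.items()})
# sensitivity 0.4706, specificity 0.8163, accuracy 0.64, odds ratio 3.9506
```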

CONCLUSIONS

Our study highlights the potential of LLMs in assisting dermatologists but also reveals their limitations. Both models made errors in diagnosing melanoma and benign lesions. These findings underscore the need for developing robust, transparent, and clinically validated AI models through collaborative efforts between AI researchers, dermatologists, and other health care professionals. While AI can provide valuable insights, it cannot yet replace the expertise of trained clinicians.


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e5a3/11336503/aead905830da/medinform_v12i1e59273_fig1.jpg
