Evaluation of Advanced Artificial Intelligence Algorithms' Diagnostic Efficacy in Acute Ischemic Stroke: A Comparative Analysis of ChatGPT-4o and Claude 3.5 Sonnet Models.

Authors

Koyun Mustafa, Taskent Ismail

Affiliations

Department of Radiology, Kastamonu Training and Research Hospital, Kastamonu 37150, Turkey.

Department of Radiology, Kastamonu University, Kastamonu 37150, Turkey.

Publication information

J Clin Med. 2025 Jan 17;14(2):571. doi: 10.3390/jcm14020571.

DOI:10.3390/jcm14020571

PMID:39860577

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11765597/

Abstract

Acute ischemic stroke (AIS) is a leading cause of mortality and disability worldwide, and early, accurate diagnosis is critical for timely intervention and improved patient outcomes. This retrospective study assessed the diagnostic performance of two advanced artificial intelligence (AI) models, Chat Generative Pre-trained Transformer (ChatGPT-4o) and Claude 3.5 Sonnet, in identifying AIS from diffusion-weighted imaging (DWI). The DWI images of 110 cases (AIS group: n = 55, healthy controls: n = 55) were provided to the AI models via standardized prompts. The models' responses were compared with radiologists' gold-standard evaluations, and performance metrics such as sensitivity, specificity, and diagnostic accuracy were calculated. Both models exhibited high sensitivity for AIS detection (ChatGPT-4o: 100%, Claude 3.5 Sonnet: 94.5%). However, ChatGPT-4o demonstrated a significantly lower specificity (3.6%) than Claude 3.5 Sonnet (74.5%). Agreement with radiologists was poor for ChatGPT-4o (κ = 0.036; 95% CI: -0.013, 0.085) but good for Claude 3.5 Sonnet (κ = 0.691; 95% CI: 0.558, 0.824). In AIS hemispheric localization accuracy, Claude 3.5 Sonnet (67.2%) outperformed ChatGPT-4o (32.7%). Similarly, for specific AIS localization, Claude 3.5 Sonnet (30.9%) was more accurate than ChatGPT-4o (7.3%), and these differences were statistically significant (p < 0.05). This study highlights the superior diagnostic performance of Claude 3.5 Sonnet compared with ChatGPT-4o in identifying AIS from DWI. Despite Claude 3.5 Sonnet's advantage, both models demonstrated notable limitations in accuracy, emphasizing the need for further development before achieving full clinical applicability. These findings underline the potential of AI tools in radiological diagnostics while acknowledging their current limitations.
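The sensitivity, specificity, and κ values reported above can be reproduced from a 2x2 table of model calls against the radiologists' reference standard. The sketch below is illustrative only: the abstract does not give raw counts, so the cell counts are an assumption, back-calculated as the integers that match the reported percentages with 55 cases per group. The class and variable names (`ConfusionMatrix`, `MODELS`) are hypothetical, and this is not the authors' analysis code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    """2x2 table of model calls against the radiologists' reference standard."""

    tp: int  # AIS cases the model called AIS
    fn: int  # AIS cases the model called normal
    tn: int  # controls the model called normal
    fp: int  # controls the model called AIS

    @property
    def n(self) -> int:
        return self.tp + self.fn + self.tn + self.fp

    def sensitivity(self) -> float:
        # Fraction of true AIS cases that were detected.
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        # Fraction of healthy controls correctly called normal.
        return self.tn / (self.tn + self.fp)

    def cohen_kappa(self) -> float:
        # Observed agreement between model and radiologists.
        p_observed = (self.tp + self.tn) / self.n
        # Agreement expected by chance, from the row and column marginals.
        p_both_pos = ((self.tp + self.fp) / self.n) * ((self.tp + self.fn) / self.n)
        p_both_neg = ((self.tn + self.fn) / self.n) * ((self.tn + self.fp) / self.n)
        p_expected = p_both_pos + p_both_neg
        return (p_observed - p_expected) / (1 - p_expected)


# ASSUMPTION: counts inferred from the reported percentages (n = 55 per group);
# they are not stated in the abstract.
MODELS = {
    "ChatGPT-4o": ConfusionMatrix(tp=55, fn=0, tn=2, fp=53),
    "Claude 3.5 Sonnet": ConfusionMatrix(tp=52, fn=3, tn=41, fp=14),
}

if __name__ == "__main__":
    for name, cm in MODELS.items():
        print(
            f"{name}: sensitivity={cm.sensitivity():.1%}, "
            f"specificity={cm.specificity():.1%}, kappa={cm.cohen_kappa():.3f}"
        )
```

Under these assumed counts, the computed κ values round to the reported 0.036 and 0.691. The table also suggests why ChatGPT-4o's agreement is near zero despite 100% sensitivity: it appears to have labelled almost every scan as AIS, so its agreement with the radiologists barely exceeds chance.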

Similar articles

Evaluation of Advanced Artificial Intelligence Algorithms' Diagnostic Efficacy in Acute Ischemic Stroke: A Comparative Analysis of ChatGPT-4o and Claude 3.5 Sonnet Models.

J Clin Med. 2025 Jan 17;14(2):571. doi: 10.3390/jcm14020571.

Evaluating text and visual diagnostic capabilities of large language models on questions related to the Breast Imaging Reporting and Data System Atlas 5th edition.

Diagn Interv Radiol. 2025 Mar 3;31(2):111-129. doi: 10.4274/dir.2024.242876. Epub 2024 Sep 9.

From open-ended to multiple-choice: evaluating diagnostic performance and consistency of ChatGPT, Google Gemini and Claude AI.

Wiad Lek. 2024;77(10):1852-1856. doi: 10.36740/WLek/195125.

Information from digital and human sources: A comparison of chatbot and clinician responses to orthodontic questions.

Am J Orthod Dentofacial Orthop. 2025 May 6. doi: 10.1016/j.ajodo.2025.04.008.

Assessing the feasibility of ChatGPT-4o and Claude 3-Opus in thyroid nodule classification based on ultrasound images.

Endocrine. 2025 Mar;87(3):1041-1049. doi: 10.1007/s12020-024-04066-x. Epub 2024 Oct 11.

Comparative Analysis of ChatGPT-4o and Gemini Advanced Performance on Diagnostic Radiology In-Training Exams.

Cureus. 2025 Mar 20;17(3):e80874. doi: 10.7759/cureus.80874. eCollection 2025 Mar.

Diagnostic performances of Claude 3 Opus and Claude 3.5 Sonnet from patient history and key images in Radiology's "Diagnosis Please" cases.

Jpn J Radiol. 2024 Dec;42(12):1399-1402. doi: 10.1007/s11604-024-01634-z. Epub 2024 Aug 3.

Assessing AI efficacy in medical knowledge tests: A study using Taiwan's internal medicine exam questions from 2020 to 2023.

Digit Health. 2024 Oct 18;10:20552076241291404. doi: 10.1177/20552076241291404. eCollection 2024 Jan-Dec.

Diagnostic performance of multimodal large language models in radiological quiz cases: the effects of prompt engineering and input conditions.

Ultrasonography. 2025 May;44(3):220-231. doi: 10.14366/usg.25012. Epub 2025 Mar 11.

Detection of Intracranial Hemorrhage from Computed Tomography Images: Diagnostic Role and Efficacy of ChatGPT-4o.

Diagnostics (Basel). 2025 Jan 9;15(2):143. doi: 10.3390/diagnostics15020143.

Cited by

Limitations of broadly trained LLMs in interpreting orthopedic Walch glenoid classifications.

Front Artif Intell. 2025 Aug 28;8:1644093. doi: 10.3389/frai.2025.1644093. eCollection 2025.

Performance of Large Language Models in Recognizing Brain MRI Sequences: A Comparative Analysis of ChatGPT-4o, Claude 4 Opus, and Gemini 2.5 Pro.

Diagnostics (Basel). 2025 Jul 30;15(15):1919. doi: 10.3390/diagnostics15151919.

Chatbots in Radiology: Current Applications, Limitations and Future Directions of ChatGPT in Medical Imaging.

Diagnostics (Basel). 2025 Jun 26;15(13):1635. doi: 10.3390/diagnostics15131635.

AI in Medical Imaging and Image Processing.

J Clin Med. 2025 Jun 11;14(12):4153. doi: 10.3390/jcm14124153.

The Impact of Language Variability on Artificial Intelligence Performance in Regenerative Endodontics.

Healthcare (Basel). 2025 May 20;13(10):1190. doi: 10.3390/healthcare13101190.

Management of Burns: Multi-Center Assessment Comparing AI Models and Experienced Plastic Surgeons.

J Clin Med. 2025 Apr 29;14(9):3078. doi: 10.3390/jcm14093078.
