Assessing the feasibility of ChatGPT-4o and Claude 3-Opus in thyroid nodule classification based on ultrasound images.

Author information

Chen Ziman, Chambara Nonhlanhla, Wu Chaoqun, Lo Xina, Liu Shirley Yuk Wah, Gunda Simon Takadiyi, Han Xinyang, Qu Jingguo, Chen Fei, Ying Michael Tin Cheung

Affiliations

Department of Health Technology and Informatics, The Hong Kong Polytechnic University, Kowloon, Hong Kong, China.

School of Healthcare Sciences, Cardiff University, Cardiff, UK.

Publication information

Endocrine. 2025 Mar;87(3):1041-1049. doi: 10.1007/s12020-024-04066-x. Epub 2024 Oct 11.

DOI:10.1007/s12020-024-04066-x

PMID:39394537

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11845565/

Abstract

PURPOSE

Large language models (LLMs) are pivotal in artificial intelligence, demonstrating advanced capabilities in natural language understanding and multimodal interactions, with significant potential in medical applications. This study explores the feasibility and efficacy of LLMs, specifically ChatGPT-4o and Claude 3-Opus, in classifying thyroid nodules using ultrasound images.

METHODS

This study included 112 patients with a total of 116 thyroid nodules, comprising 75 benign and 41 malignant cases. Ultrasound images of these nodules were analyzed using ChatGPT-4o and Claude 3-Opus to diagnose the benign or malignant nature of the nodules. An independent evaluation by a junior radiologist was also conducted. Diagnostic performance was assessed using Cohen's Kappa and receiver operating characteristic (ROC) curve analysis, referencing pathological diagnoses.
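As a minimal sketch of the two metrics named in the methods, the Python below computes Cohen's Kappa and a single-threshold AUC from a 2x2 table of reader calls against pathology. The counts are hypothetical, since the abstract does not report a confusion matrix; only the class totals (41 malignant, 75 benign) come from the text. It also assumes each reader gives a binary benign/malignant call, in which case the ROC curve has one operating point and the AUC equals the mean of sensitivity and specificity. The function names are illustrative, not the authors' code.

```python
def cohen_kappa(tp, fp, fn, tn):
    """Cohen's Kappa for a 2x2 table of reader calls vs. pathology."""
    n = tp + fp + fn + tn
    # Fraction of nodules where the reader and pathology agree.
    observed = (tp + tn) / n
    # Agreement expected by chance, computed from the marginal totals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    return (observed - expected) / (1 - expected)


def binary_auc(tp, fp, fn, tn):
    """AUC of a binary call: the ROC curve has a single operating point."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Hypothetical reader calls (NOT from the study); totals match the cohort.
tp, fn = 30, 11  # 41 malignant nodules
tn, fp = 60, 15  # 75 benign nodules

print(f"kappa = {cohen_kappa(tp, fp, fn, tn):.3f}")
print(f"AUC   = {binary_auc(tp, fp, fn, tn):.3f}")
```

Kappa corrects raw agreement for the agreement expected by chance, so a reader who labels most nodules malignant can have fair raw accuracy on malignant cases and still score near zero.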

RESULTS

ChatGPT-4o demonstrated poor agreement with pathological results (Kappa = 0.116), while Claude 3-Opus showed even lower agreement (Kappa = 0.034). The junior radiologist exhibited moderate agreement (Kappa = 0.450). ChatGPT-4o achieved an area under the ROC curve (AUC) of 57.0% (95% CI: 48.6-65.5%), slightly outperforming Claude 3-Opus (AUC of 52.0%, 95% CI: 43.2-60.9%). In contrast, the junior radiologist achieved a significantly higher AUC of 72.4% (95% CI: 63.7-81.1%). The unnecessary biopsy rates were 41.4% for ChatGPT-4o, 43.1% for Claude 3-Opus, and 12.1% for the junior radiologist.
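The abstract does not define the unnecessary biopsy rate. One plausible reading, assumed here, is the number of benign nodules called malignant divided by all nodules assessed. The sketch below checks which denominators are compatible with the reported percentages by searching for whole-number counts that round to each figure. The helper names are hypothetical, and a match suggests the denominator but does not confirm it.

```python
def unnecessary_biopsy_rate(benign_called_malignant, total_nodules):
    """Assumed definition: benign nodules flagged malignant / all nodules."""
    return benign_called_malignant / total_nodules


def counts_matching(reported_pct, denominator):
    """Whole-number counts whose percentage rounds to the reported figure."""
    return [k for k in range(denominator + 1)
            if round(100 * k / denominator, 1) == reported_pct]


TOTAL_NODULES = 116   # from the abstract
BENIGN_NODULES = 75   # from the abstract

# Percentages reported in the abstract.
reported = {"ChatGPT-4o": 41.4, "Claude 3-Opus": 43.1, "Junior radiologist": 12.1}

for reader, pct in reported.items():
    over_all = counts_matching(pct, TOTAL_NODULES)
    over_benign = counts_matching(pct, BENIGN_NODULES)
    print(f"{reader}: over all nodules {over_all}, over benign only {over_benign}")
```

Under this check each reported rate corresponds to exactly one count over all 116 nodules and to no count over the 75 benign nodules alone, which is consistent with the assumed definition.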

CONCLUSION

While LLMs such as ChatGPT-4o and Claude 3-Opus show promise for future applications in medical imaging, their current use in clinical diagnostics should be approached cautiously due to their limited accuracy.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a776/11845565/141cb98980c3/12020_2024_4066_Fig1_HTML.jpg

Similar articles

Evaluating text and visual diagnostic capabilities of large language models on questions related to the Breast Imaging Reporting and Data System Atlas 5th edition.

Diagn Interv Radiol. 2025 Mar 3;31(2):111-129. doi: 10.4274/dir.2024.242876. Epub 2024 Sep 9.

Diagnostic performances of GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro in "Diagnosis Please" cases.

Jpn J Radiol. 2024 Nov;42(11):1231-1235. doi: 10.1007/s11604-024-01619-y. Epub 2024 Jul 1.

Claude 3 Opus and ChatGPT With GPT-4 in Dermoscopic Image Analysis for Melanoma Diagnosis: Comparative Performance Analysis.

JMIR Med Inform. 2024 Aug 6;12:e59273. doi: 10.2196/59273.

Evaluation of Advanced Artificial Intelligence Algorithms' Diagnostic Efficacy in Acute Ischemic Stroke: A Comparative Analysis of ChatGPT-4o and Claude 3.5 Sonnet Models.

J Clin Med. 2025 Jan 17;14(2):571. doi: 10.3390/jcm14020571.

Diagnostic Performance of GPT-4o and Claude 3 Opus in Determining Causes of Death From Medical Histories and Postmortem CT Findings.

Cureus. 2024 Aug 20;16(8):e67306. doi: 10.7759/cureus.67306. eCollection 2024 Aug.

Collaborative Enhancement of Consistency and Accuracy in US Diagnosis of Thyroid Nodules Using Large Language Models.

Radiology. 2024 Mar;310(3):e232255. doi: 10.1148/radiol.232255.

Evaluating Large Language Models in Dental Anesthesiology: A Comparative Analysis of ChatGPT-4, Claude 3 Opus, and Gemini 1.0 on the Japanese Dental Society of Anesthesiology Board Certification Exam.

Cureus. 2024 Sep 27;16(9):e70302. doi: 10.7759/cureus.70302. eCollection 2024 Sep.

Textual Proficiency and Visual Deficiency: A Comparative Study of Large Language Models and Radiologists in MRI Artifact Detection and Correction.

Acad Radiol. 2025 May;32(5):2411-2421. doi: 10.1016/j.acra.2025.01.004. Epub 2025 Feb 11.

Performance of Multimodal Large Language Models in Japanese Diagnostic Radiology Board Examinations (2021-2023).

Acad Radiol. 2025 May;32(5):2394-2401. doi: 10.1016/j.acra.2024.10.035. Epub 2024 Nov 8.

Cited by

Evaluation of deepseek, gemini, ChatGPT-4o, and perplexity in responding to salivary gland cancer.

BMC Oral Health. 2025 Aug 23;25(1):1358. doi: 10.1186/s12903-025-06726-4.

Foundation models for radiology-the position of the AI for Health Imaging (AI4HI) network.

Insights Imaging. 2025 Aug 6;16(1):168. doi: 10.1186/s13244-025-02056-9.

Exploring the Potential of ChatGPT-4o in Thyroid Nodule Diagnosis Using Multi-Modality Ultrasound Imaging: Dual- vs. Triple-Modality Approaches.

Cancers (Basel). 2025 Jun 20;17(13):2068. doi: 10.3390/cancers17132068.

Can AI-Based ChatGPT Models Accurately Analyze Hand-Wrist Radiographs? A Comparative Study.

Diagnostics (Basel). 2025 Jun 14;15(12):1513. doi: 10.3390/diagnostics15121513.

Fine-tuning LLM hyperparameters to align semantic and physiological contexts of aging-related pathways.

Mol Divers. 2025 Jun 6. doi: 10.1007/s11030-025-11226-2.

Performance of GPT-4 Turbo and GPT-4o in Korean Society of Radiology In-Training Examinations.

Korean J Radiol. 2025 Jun;26(6):524-531. doi: 10.3348/kjr.2024.1096. Epub 2025 Apr 17.

Comparative analysis of ChatGPT-4o mini, ChatGPT-4o and Gemini Advanced in the treatment of postmenopausal osteoporosis.

BMC Musculoskelet Disord. 2025 Apr 16;26(1):369. doi: 10.1186/s12891-025-08601-3.
