Assessing the feasibility of ChatGPT-4o and Claude 3-Opus in thyroid nodule classification based on ultrasound images.

Authors

Chen Ziman, Chambara Nonhlanhla, Wu Chaoqun, Lo Xina, Liu Shirley Yuk Wah, Gunda Simon Takadiyi, Han Xinyang, Qu Jingguo, Chen Fei, Ying Michael Tin Cheung

Affiliations

Department of Health Technology and Informatics, The Hong Kong Polytechnic University, Kowloon, Hong Kong, China.

School of Healthcare Sciences, Cardiff University, Cardiff, UK.

Publication details

Endocrine. 2025 Mar;87(3):1041-1049. doi: 10.1007/s12020-024-04066-x. Epub 2024 Oct 11.

Abstract

PURPOSE

Large language models (LLMs) are pivotal in artificial intelligence, demonstrating advanced capabilities in natural language understanding and multimodal interactions, with significant potential in medical applications. This study explores the feasibility and efficacy of LLMs, specifically ChatGPT-4o and Claude 3-Opus, in classifying thyroid nodules using ultrasound images.

METHODS

This study included 112 patients with a total of 116 thyroid nodules, comprising 75 benign and 41 malignant cases. Ultrasound images of these nodules were analyzed using ChatGPT-4o and Claude 3-Opus to diagnose the benign or malignant nature of the nodules. An independent evaluation by a junior radiologist was also conducted. Diagnostic performance was assessed using Cohen's Kappa and receiver operating characteristic (ROC) curve analysis, referencing pathological diagnoses.
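To make the two metrics concrete, here is a minimal sketch of how Cohen's Kappa and an AUC can be computed from a 2x2 table of reader calls against pathology. The abstract does not describe the authors' software or how the ROC curves were built, so the function names, the use of hard benign/malignant calls, and the counts below are all assumptions chosen for illustration; the counts are not the study's data.

```python
def cohen_kappa(tp, fp, fn, tn):
    """Cohen's kappa for a 2x2 table of reader calls vs. reference labels."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Chance agreement, from the marginal totals of reader and reference.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    # Note: undefined (division by zero) if expected agreement equals 1.
    return (observed - expected) / (1 - expected)


def binary_auc(tp, fp, fn, tn):
    """AUC for a reader that gives only a hard benign/malignant call.

    With a single operating point the ROC curve is two line segments,
    and the area equals the mean of sensitivity and specificity.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Hypothetical counts for illustration only; NOT taken from the study.
tp, fp, fn, tn = 30, 20, 10, 40
kappa = cohen_kappa(tp, fp, fn, tn)
auc = binary_auc(tp, fp, fn, tn)
print(f"kappa = {kappa:.3f}, AUC = {auc:.3f}")  # kappa = 0.400, AUC = 0.708
```

Kappa corrects raw agreement for the agreement expected by chance, which is why a reader can agree with pathology on many nodules and still score near zero.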

RESULTS

ChatGPT-4o demonstrated poor agreement with pathological results (Kappa = 0.116), while Claude 3-Opus showed even lower agreement (Kappa = 0.034). The junior radiologist exhibited moderate agreement (Kappa = 0.450). ChatGPT-4o achieved an area under the ROC curve (AUC) of 57.0% (95% CI: 48.6-65.5%), slightly outperforming Claude 3-Opus (AUC of 52.0%, 95% CI: 43.2-60.9%). In contrast, the junior radiologist achieved a significantly higher AUC of 72.4% (95% CI: 63.7-81.1%). The unnecessary biopsy rates were 41.4% for ChatGPT-4o, 43.1% for Claude 3-Opus, and 12.1% for the junior radiologist.
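The abstract does not define the unnecessary biopsy rate. One plausible reading, which is an assumption here and not something the source states, is the number of benign nodules that would have been sent for biopsy divided by all 116 nodules. Under that assumed denominator, the sketch below converts the reported percentages into approximate nodule counts and recomputes the rates as a rough consistency check.

```python
TOTAL_NODULES = 116  # total nodules, from the Methods

# Unnecessary biopsy rates reported in the Results, as fractions.
reported_rates = {
    "ChatGPT-4o": 0.414,
    "Claude 3-Opus": 0.431,
    "Junior radiologist": 0.121,
}

implied_counts = {}
for reader, rate in reported_rates.items():
    # Assumption: the denominator is all 116 nodules (not stated in the abstract).
    count = round(rate * TOTAL_NODULES)
    implied_counts[reader] = count
    recomputed = count / TOTAL_NODULES
    print(f"{reader}: about {count} nodules ({recomputed:.1%} of {TOTAL_NODULES})")
```

Under this reading the counts come out at about 48, 50, and 14 nodules, and each recomputed rate rounds back to the reported percentage. That is consistent with the assumed definition but does not confirm it.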

CONCLUSION

While LLMs such as ChatGPT-4o and Claude 3-Opus show promise for future applications in medical imaging, their current use in clinical diagnostics should be approached cautiously due to their limited accuracy.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a776/11845565/141cb98980c3/12020_2024_4066_Fig1_HTML.jpg
