Chen Ziman, Chambara Nonhlanhla, Wu Chaoqun, Lo Xina, Liu Shirley Yuk Wah, Gunda Simon Takadiyi, Han Xinyang, Qu Jingguo, Chen Fei, Ying Michael Tin Cheung
Department of Health Technology and Informatics, The Hong Kong Polytechnic University, Kowloon, Hong Kong, China.
School of Healthcare Sciences, Cardiff University, Cardiff, UK.
Endocrine. 2025 Mar;87(3):1041-1049. doi: 10.1007/s12020-024-04066-x. Epub 2024 Oct 11.
Large language models (LLMs) are pivotal in artificial intelligence, demonstrating advanced capabilities in natural language understanding and multimodal interactions, with significant potential in medical applications. This study explores the feasibility and efficacy of LLMs, specifically ChatGPT-4o and Claude 3-Opus, in classifying thyroid nodules using ultrasound images.
This study included 112 patients with a total of 116 thyroid nodules, of which 75 were benign and 41 malignant. ChatGPT-4o and Claude 3-Opus each analyzed the ultrasound images of these nodules and classified every nodule as benign or malignant. A junior radiologist independently evaluated the same images. Diagnostic performance was assessed against the pathological diagnoses using Cohen's Kappa and receiver operating characteristic (ROC) curve analysis.
ChatGPT-4o demonstrated poor agreement with pathological results (Kappa = 0.116), while Claude 3-Opus showed even lower agreement (Kappa = 0.034). The junior radiologist exhibited moderate agreement (Kappa = 0.450). ChatGPT-4o achieved an area under the ROC curve (AUC) of 57.0% (95% CI: 48.6-65.5%), slightly outperforming Claude 3-Opus (AUC of 52.0%, 95% CI: 43.2-60.9%). In contrast, the junior radiologist achieved a significantly higher AUC of 72.4% (95% CI: 63.7-81.1%). The unnecessary biopsy rates were 41.4% for ChatGPT-4o, 43.1% for Claude 3-Opus, and 12.1% for the junior radiologist.
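To make the reported metrics concrete, the following minimal Python sketch computes Cohen's Kappa, a single-threshold AUC, and an unnecessary biopsy rate from a 2x2 confusion matrix. It rests on several assumptions that the abstract does not confirm: that the AUC of a binary benign/malignant call equals the mean of sensitivity and specificity, that the unnecessary biopsy rate is the number of benign nodules called malignant divided by all nodules, and that the counts used (26, 14, 15, 61) describe the junior radiologist. Those counts are not reported in the abstract; they were back-calculated here purely for illustration because they happen to reproduce the published figures. The function names are likewise hypothetical.

```python
def cohen_kappa(tp, fp, fn, tn):
    """Cohen's Kappa between a binary reader and the pathological reference."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Chance agreement from the marginal totals of reader and reference.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    return (observed - expected) / (1 - expected)


def binary_auc(tp, fp, fn, tn):
    """Area under a one-point ROC curve: mean of sensitivity and specificity."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


def unnecessary_biopsy_rate(tp, fp, fn, tn):
    """ASSUMED definition: benign nodules called malignant, over all nodules."""
    return fp / (tp + fp + fn + tn)


# Illustrative counts only (NOT reported in the abstract): 41 malignant and
# 75 benign nodules, split so that the results match the published values.
counts = dict(tp=26, fp=14, fn=15, tn=61)

print(round(cohen_kappa(**counts), 3))              # 0.45
print(round(binary_auc(**counts), 3))               # 0.724
print(round(unnecessary_biopsy_rate(**counts), 3))  # 0.121
```

Under these assumptions the sketch returns a Kappa of 0.450, an AUC of 72.4%, and an unnecessary biopsy rate of 12.1%, matching the values reported for the junior radiologist. The match suggests, but does not prove, that the study used these definitions.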
While LLMs such as ChatGPT-4o and Claude 3-Opus show promise for future applications in medical imaging, their current use in clinical diagnostics should be approached cautiously due to their limited accuracy.