
The Diagnostic Performance of Large Language Models and General Radiologists in Thoracic Radiology Cases: A Comparative Study.

Author Information

Gunes Yasin Celal, Cesur Turay

Affiliations

Department of Radiology, Kirikkale Yuksek Ihtisas Hospital, Kirikkale.

Department of Radiology, Mamak State Hospital, Ankara, Türkiye.

Publication Information

J Thorac Imaging. 2025 May 1;40(3):e0805. doi: 10.1097/RTI.0000000000000805.

Abstract

PURPOSE

To investigate and compare the diagnostic performance of 10 different large language models (LLMs) and 2 board-certified general radiologists in thoracic radiology cases published by the Society of Thoracic Radiology.

MATERIALS AND METHODS

We collected 124 publicly available "Case of the Month" cases published on the Society of Thoracic Radiology website between March 2012 and December 2023. The medical history and imaging findings of each case were entered into the LLMs as text for diagnosis and differential diagnosis, while the radiologists independently provided their assessments from visual review of the images. Cases were categorized anatomically (parenchyma, airways, mediastinum-pleura-chest wall, and vascular) and further classified as specific or nonspecific for radiologic diagnosis. Diagnostic accuracy and differential diagnosis scores (DDxScore) were analyzed using the χ2, Kruskal-Wallis, Wilcoxon, McNemar, and Mann-Whitney U tests.
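To make the statistical workflow concrete, below is a minimal sketch of how two of the tests named above could be run in Python with SciPy and statsmodels. This is not the authors' analysis code, and all per-case data are hypothetical placeholders, not study results.

```python
# Minimal sketch of two of the tests named above (McNemar, Mann-Whitney U).
# All per-case data here are hypothetical placeholders, not study results.
import numpy as np
from scipy.stats import mannwhitneyu
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(42)

# Hypothetical per-case correctness (1 = correct top diagnosis) for one LLM
# and one radiologist over the same 124 cases.
llm_correct = rng.integers(0, 2, size=124)
radiologist_correct = rng.integers(0, 2, size=124)

# McNemar's test handles paired binary outcomes: tabulate the 2x2 table of
# (LLM correct?, radiologist correct?) counts across the shared cases.
table = np.zeros((2, 2), dtype=int)
for llm, rad in zip(llm_correct, radiologist_correct):
    table[llm, rad] += 1
print(mcnemar(table, exact=True))  # statistic and p-value

# Mann-Whitney U compares ordinal scores between two independent groups,
# e.g. a reader's DDxScore in "specific" vs "nonspecific" cases.
ddx_specific = rng.integers(0, 4, size=60)      # hypothetical ordinal scores
ddx_nonspecific = rng.integers(0, 4, size=64)
print(mannwhitneyu(ddx_specific, ddx_nonspecific, alternative="two-sided"))
```

McNemar's test fits the accuracy comparison because every case is assessed by both readers (paired binary data), whereas the Mann-Whitney U test treats the specific and nonspecific case subsets as independent groups of ordinal scores.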

RESULTS

Among the 124 cases, Claude 3 Opus showed the highest diagnostic accuracy (70.29%), followed by ChatGPT 4 and Google Gemini 1.5 Pro (59.75% each), Meta Llama 3 70b (57.3%), and ChatGPT 3.5 (53.2%), all outperforming the radiologists (52.4% and 41.1%) and the other LLMs (P<0.05). The DDxScore of Claude 3 Opus was significantly better than that of the other LLMs and the radiologists, except ChatGPT 3.5 (P<0.05). All LLMs and radiologists showed greater accuracy in specific cases (P<0.05), although DDxScore did not differ by case specificity for Perplexity and Google Bard (P>0.05). There were no significant differences between the LLMs and radiologists in diagnostic accuracy within the anatomic subgroups (P>0.05), except for Meta Llama 3 70b in the vascular cases (P=0.040).

CONCLUSIONS

Claude 3 Opus outperformed the other LLMs and the radiologists in text-based thoracic radiology cases. LLMs hold great promise for clinical decision support systems under proper medical supervision.

