Gunes Yasin Celal, Cesur Turay
Department of Radiology, Kirikkale Yuksek Ihtisas Hospital, Kirikkale, Türkiye.
Department of Radiology, Mamak State Hospital, Ankara, Türkiye.
J Thorac Imaging. 2025 May 1;40(3):e0805. doi: 10.1097/RTI.0000000000000805.
To investigate and compare the diagnostic performance of 10 different large language models (LLMs) and 2 board-certified general radiologists in thoracic radiology cases published by The Society of Thoracic Radiology.
We collected 124 publicly available "Case of the Month" cases from the Society of Thoracic Radiology website, published between March 2012 and December 2023. The medical history and imaging findings of each case were input into the LLMs for diagnosis and differential diagnosis, while the radiologists independently provided their assessments from visual review of the images. Cases were categorized anatomically (parenchyma, airways, mediastinum-pleura-chest wall, and vascular) and further classified as specific or nonspecific for radiologic diagnosis. Diagnostic accuracy and differential diagnosis scores (DDxScore) were analyzed using the χ2, Kruskal-Wallis, Wilcoxon, McNemar, and Mann-Whitney U tests.
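For readers who want to see how such comparisons could be run in practice, the sketch below applies the listed tests in Python with SciPy and statsmodels. It is a minimal illustration under stated assumptions, not the authors' analysis code: all arrays are randomly generated placeholders, the 0-3 DDxScore range is assumed, and the variable names (llm_correct, rad_correct, ddx_a, etc.) are hypothetical.

```python
# Illustrative sketch of the reported statistical workflow (SciPy/statsmodels).
# All data below are random placeholders, NOT study results.
import numpy as np
from scipy.stats import chi2_contingency, kruskal, mannwhitneyu, wilcoxon
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
n_cases = 124  # number of "Case of the Month" cases in the study

# Hypothetical per-case correctness (1 = correct) for one LLM and one radiologist
llm_correct = rng.integers(0, 2, n_cases)
rad_correct = rng.integers(0, 2, n_cases)

# Chi-square: is accuracy associated with case type (specific vs. nonspecific)?
specific = rng.integers(0, 2, n_cases)
table = [[int(np.sum((llm_correct == c) & (specific == s))) for c in (1, 0)]
         for s in (1, 0)]
chi2, p_chi2, _, _ = chi2_contingency(table)
print(f"chi-square p={p_chi2:.3f}")

# McNemar: paired comparison of two readers' accuracy on the same cases
paired = [[int(np.sum((llm_correct == a) & (rad_correct == b))) for b in (1, 0)]
          for a in (1, 0)]
print(f"McNemar p={mcnemar(paired, exact=True).pvalue:.3f}")

# DDxScore (ordinal, assumed 0-3 here): Kruskal-Wallis across several readers,
# Wilcoxon for paired and Mann-Whitney U for unpaired two-reader comparisons
ddx_a = rng.integers(0, 4, n_cases)
ddx_b = rng.integers(0, 4, n_cases)
ddx_c = rng.integers(0, 4, n_cases)
print(f"Kruskal-Wallis p={kruskal(ddx_a, ddx_b, ddx_c).pvalue:.3f}")
print(f"Wilcoxon p={wilcoxon(ddx_a, ddx_b).pvalue:.3f}")
print(f"Mann-Whitney U p={mannwhitneyu(ddx_a, ddx_b).pvalue:.3f}")
```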
Among the 124 cases, Claude 3 Opus achieved the highest diagnostic accuracy (70.29%), followed by ChatGPT 4 and Google Gemini 1.5 Pro (59.75% each), Meta Llama 3 70b (57.3%), and ChatGPT 3.5 (53.2%), outperforming both radiologists (52.4% and 41.1%) and the remaining LLMs (P<0.05). The DDxScore of Claude 3 Opus was significantly better than that of the other LLMs and the radiologists, except ChatGPT 3.5 (P<0.05). All LLMs and radiologists were more accurate in specific cases (P<0.05), with no specificity-related difference in DDxScore for Perplexity and Google Bard (P>0.05). Diagnostic accuracy did not differ significantly between the LLMs and the radiologists within anatomic subgroups (P>0.05), except for Meta Llama 3 70b in vascular cases (P=0.040).
Claude 3 Opus outperformed the other LLMs and the radiologists on text-based thoracic radiology cases. LLMs hold great promise for clinical decision support systems under appropriate medical supervision.