Sorin Vera, Klang Eyal, Sobeh Tamer, Konen Eli, Shrot Shai, Livne Adva, Weissbuch Yulian, Hoffmann Chen, Barash Yiftach
Department of Diagnostic Imaging, Chaim Sheba Medical Center, Ramat Gan, Israel.
The Faculty of Medicine, Tel-Aviv University, Tel Aviv-Yafo, Israel.
Quant Imaging Med Surg. 2024 Oct 1;14(10):7551-7560. doi: 10.21037/qims-24-200. Epub 2024 Sep 23.
Differential diagnosis in radiology relies on the accurate identification of imaging patterns. The use of large language models (LLMs) in radiology holds promise, with many potential applications that may enhance the efficiency of radiologists' workflow. This study aimed to evaluate the efficacy of generative pre-trained transformer (GPT)-4, an LLM, in providing differential diagnoses in neuroradiology, comparing its performance with that of board-certified neuroradiologists.
Sixty neuroradiology reports with variable diagnoses were entered into GPT-4, which was tasked with generating a top-3 differential diagnosis for each case. The results were compared to the true diagnoses and to the differential diagnoses provided by three blinded neuroradiologists. Diagnostic accuracy and agreement between readers were assessed.
Of the 60 patients (mean age 47.8 years, 65% female), GPT-4 correctly included the true diagnosis in its differential in 61.7% (37/60) of cases, while the neuroradiologists' accuracy ranged from 63.3% (38/60) to 73.3% (44/60). Agreement between GPT-4 and the neuroradiologists, and among the neuroradiologists, was fair to moderate [Cohen's kappa (kw) 0.34-0.44 and kw 0.39-0.54, respectively].
GPT-4 shows potential as a support tool for differential diagnosis in neuroradiology, though it was outperformed by human experts. Radiologists should remain mindful of the limitations of LLMs, while harnessing their potential to enhance educational and clinical work.
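The accuracy percentages reported above follow directly from the stated counts, and the agreement statistic can be illustrated with a short sketch. The code below is not the study's analysis code. The function names `accuracy_pct` and `cohen_kappa` and the two example rating vectors are hypothetical, and the sketch computes the plain (unweighted) form of Cohen's kappa, whereas the abstract labels its statistic "kw" and does not state which variant or software was used.

```python
from collections import Counter


def accuracy_pct(correct, total):
    """Percentage of cases in which the true diagnosis appeared in the differential."""
    return round(100.0 * correct / total, 1)


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same cases.

    Assumption: this is the plain two-rater form, shown for illustration only.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of cases")
    n = len(rater_a)
    # Observed agreement: fraction of cases where the two raters match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal frequencies per category.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical category
    return (observed - expected) / (1.0 - expected)


# Counts taken from the abstract.
gpt4_accuracy = accuracy_pct(37, 60)
reader_low = accuracy_pct(38, 60)
reader_high = accuracy_pct(44, 60)

# Hypothetical per-case outcomes (1 = true diagnosis in top 3, 0 = missed).
# These vectors are made up for the example and are NOT study data.
example_reader_a = [1, 1, 0, 0, 1, 0, 1, 1, 0, 1]
example_reader_b = [1, 0, 0, 0, 1, 1, 1, 1, 0, 1]
example_kappa = cohen_kappa(example_reader_a, example_reader_b)
```

With the counts from the abstract, `accuracy_pct` reproduces the reported 61.7%, 63.3%, and 73.3%. The kappa value for the made-up vectors only illustrates the calculation and says nothing about the study's readers.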