Nakaura Takeshi, Uetani Hiroyuki, Yoshida Naofumi, Kobayashi Naoki, Nagayama Yasunori, Kidoh Masafumi, Kuroda Jun-Ichiro, Mukasa Akitake, Hirai Toshinori
Department of Diagnostic Radiology, Graduate School of Medical Sciences, Kumamoto University, Kumamoto, Japan.
Department of Neurosurgery, Graduate School of Medical Sciences, Kumamoto University, Kumamoto, Japan.
Eur Radiol. 2025 Aug 22. doi: 10.1007/s00330-025-11924-3.
This study aimed to evaluate the potential of large language models (LLMs) to differentiate intra-axial primary brain tumors using structured magnetic resonance imaging (MRI) reports and to compare their performance with that of radiologists.
Structured reports of preoperative MRI findings from 137 surgically confirmed intra-axial primary brain tumors, including Glioblastoma (n = 77), Central Nervous System (CNS) Lymphoma (n = 22), Astrocytoma (n = 9), Oligodendroglioma (n = 9), and others (n = 20), were analyzed by multiple LLMs: GPT-4, Claude-3-Opus, Claude-3-Sonnet, GPT-3.5, Llama-2-70B, Qwen1.5-72B, and Gemini-Pro-1.0. Each model provided its top 5 differential diagnoses based on the preoperative MRI findings, and the models' top 1, 3, and 5 accuracies were compared with those of board-certified neuroradiologists interpreting the actual preoperative MRI images.
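As an illustration of the top 1, 3, and 5 accuracy metric described above, the following is a minimal sketch, not the authors' code. The function name `top_k_accuracy` and the four toy cases are hypothetical and do not come from the study data; the sketch assumes a case counts as correct when the surgically confirmed diagnosis appears anywhere in the first k entries of the ranked differential list.

```python
from typing import Sequence


def top_k_accuracy(ranked: Sequence[Sequence[str]], truth: Sequence[str], k: int) -> float:
    """Fraction of cases whose confirmed diagnosis is among the first k ranked guesses."""
    if len(ranked) != len(truth):
        raise ValueError("ranked and truth must have the same number of cases")
    if not truth:
        raise ValueError("at least one case is required")
    hits = sum(1 for guesses, label in zip(ranked, truth) if label in guesses[:k])
    return hits / len(truth)


# Hypothetical toy cases (NOT study data): one ranked differential list per case.
ranked = [
    ["Glioblastoma", "CNS Lymphoma", "Astrocytoma", "Oligodendroglioma", "Other"],
    ["Glioblastoma", "Astrocytoma", "CNS Lymphoma", "Oligodendroglioma", "Other"],
    ["Glioblastoma", "Astrocytoma", "CNS Lymphoma", "Other", "Oligodendroglioma"],
    ["Glioblastoma", "CNS Lymphoma", "Oligodendroglioma", "Other"],
]
# Hypothetical confirmed diagnoses, in the same case order.
truth = ["Glioblastoma", "CNS Lymphoma", "Oligodendroglioma", "Astrocytoma"]

for k in (1, 3, 5):
    print(f"top-{k} accuracy: {top_k_accuracy(ranked, truth, k):.2f}")
```

Because a hit within the first k entries is also a hit within any longer prefix, top 5 accuracy can never be lower than top 3, which in turn can never be lower than top 1.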
Radiologists achieved top 1, 3, and 5 accuracies of 85.4%, 94.9%, and 94.9%, respectively. Among the LLMs, GPT-4 performed best with top 1, 3, and 5 accuracies of 65.7%, 84.7%, and 90.5%, respectively. Notably, GPT-4's top 3 accuracy of 84.7% approached the radiologists' top 1 accuracy of 85.4%. Other LLMs showed varying performance levels, with average accuracies ranging from 62.3% to 75.9%. LLMs demonstrated high accuracy for Glioblastoma but struggled with CNS Lymphoma and other less common tumors, particularly in top 1 accuracy.
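To make the percentages above more concrete, the sketch below converts them into implied case counts and percentage-point gaps. It assumes that every overall accuracy was computed over all 137 cases, which the abstract does not state explicitly; the counts are therefore back-calculated from rounded figures rather than reported values, and the names `N_CASES`, `reported`, and `implied_count` are hypothetical.

```python
N_CASES = 137  # total cases in the abstract; assumed to be the denominator throughout

# Reported overall accuracies in percent (top 1, top 3, top 5), taken from the abstract.
reported = {
    "Radiologists": (85.4, 94.9, 94.9),
    "GPT-4": (65.7, 84.7, 90.5),
}


def implied_count(percent: float, n: int = N_CASES) -> int:
    """Nearest whole number of correct cases consistent with a rounded percentage."""
    return round(percent / 100 * n)


# Implied number of correct cases per reader and per k.
counts = {name: tuple(implied_count(p) for p in pcts) for name, pcts in reported.items()}

# Gap between radiologists and GPT-4 in percentage points for each k.
gaps = tuple(
    round(r - g, 1) for r, g in zip(reported["Radiologists"], reported["GPT-4"])
)

for name, c in counts.items():
    print(name, "implied correct cases (top 1, 3, 5):", c)
print("Radiologist minus GPT-4 gap in percentage points (top 1, 3, 5):", gaps)
```

Under that assumption, the gap narrows from 19.7 percentage points at top 1 to 4.4 at top 5.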
LLMs show promise as assistive tools for differentiating intra-axial primary brain tumors using structured MRI reports. However, a significant gap remains between their performance and that of board-certified neuroradiologists interpreting actual images. The choice of LLM and tumor type significantly influences the results.
Question: How do large language models (LLMs) perform when differentiating complex intra-axial primary brain tumors from structured MRI reports, compared with radiologists interpreting images?
Findings: Radiologists outperformed all tested LLMs in diagnostic accuracy. The best model, GPT-4, showed promise but lagged considerably behind radiologists, particularly for less common tumors.
Clinical relevance: LLMs show potential as assistive tools for generating differential diagnoses from structured MRI reports, particularly for non-specialists, but they cannot currently replace the nuanced diagnostic expertise of a board-certified radiologist interpreting the primary image data.