Gunes Yasin Celal, Cesur Turay, Camur Eren, Cifci Bilal Egemen, Kaya Turan, Colakoglu Mehmet Numan, Koc Ural, Okten Rıza Sarper
Department of Radiology, Kirikkale Yuksek Ihtisas Hospital, Kirikkale, Turkey (Y.C.G.).
Department of Radiology, Mamak State Hospital, Ankara, Turkey (T.C.).
Acad Radiol. 2025 May;32(5):2411-2421. doi: 10.1016/j.acra.2025.01.004. Epub 2025 Feb 11.
To assess the performance of large language models (LLMs), compared with radiologists, in detecting and correcting MRI artifacts using text-based and visual questions.
This cross-sectional observational study included three phases. Phase 1 involved six LLMs (ChatGPT o1-preview, ChatGPT-4o, ChatGPT-4V, Google Gemini 1.5 Pro, Claude 3.5 Sonnet, Claude 3 Opus) and five radiologists (two residents, two junior radiologists, one senior radiologist) answering 42 text-based questions on MRI artifacts. In Phase 2, the same radiologists and five multimodal LLMs evaluated 100 MRI images, each containing a single artifact. Phase 3 repeated the identical tasks 1.5 months later to assess temporal consistency. Responses were graded on 4-point Likert scales for a "Management Score" (text-based) and a "Correction Score" (visual). McNemar's test compared response accuracy, and the Wilcoxon signed-rank test assessed score differences.
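The paired comparison described above can be sketched in a few lines. This is a minimal illustration, not the study's analysis: the discordant-pair counts below are hypothetical, and only McNemar's chi-square test (with continuity correction) is shown, using the closed-form survival function of a chi-square distribution with one degree of freedom.

```python
import math

# Hypothetical sketch of McNemar's test for paired accuracy on the same
# image set. Counts are illustrative, not from the study:
#   b = images the radiologist classified correctly but the LLM missed
#   c = images the LLM classified correctly but the radiologist missed
def mcnemar(b: int, c: int) -> tuple[float, float]:
    """Continuity-corrected McNemar statistic and two-sided p-value."""
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-square with 1 df: P(X > stat) = erfc(sqrt(stat/2))
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p

stat, p = mcnemar(b=74, c=2)  # hypothetical discordant pairs
print(f"chi2 = {stat:.2f}, p = {p:.2e}")
```

With such a lopsided discordant-pair split, the statistic is large and the p-value is far below 0.05, mirroring the direction of the reported visual-task gap. In practice the Likert-score comparison would use a signed-rank procedure (e.g. `scipy.stats.wilcoxon`) rather than this closed form.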
LLMs outperformed radiologists in text-based tasks, with ChatGPT o1-preview scoring the highest (3.71±0.60 in Round 1; 3.76±0.84 in Round 2) (p<0.05). In visual tasks, radiologists performed significantly better, with the Senior Radiologist achieving 92% and 94% accuracy in Rounds 1 and 2, respectively (p<0.05). The top-performing LLM (ChatGPT-4o) achieved only 20% and 18% accuracy. Correction Scores mirrored this difference, with radiologists consistently scoring higher than LLMs (p<0.05).
LLMs excel in text-based tasks but have notable limitations in visual artifact interpretation, making them unsuitable for independent diagnostics. They are promising as educational tools or adjuncts in "human-in-the-loop" systems, with multimodal AI improvements necessary to bridge these gaps.