Reiner Lara Noelle, Chelbi Moudather, Fetscher Leonard, Stöckel Juliane C, Csapó-Schmidt Christoph, Guseynova Shakhnaz, Al Mohamad Fares, Bressem Keno Kyrill, Nawabi Jawed, Siebert Eberhard, Wattjes Mike P, Scheel Michael, Meddeb Aymen
Department of Neuroradiology, Charité-Universitätsmedizin Berlin, Augustenburger Platz 1, 13353, Berlin, Germany.
Department of Radiology, Technical University Munich, Klinikum Rechts Der Isar, Ismaninger Str. 22, 81675, Munich, Germany.
Radiol Med. 2025 Jul 11. doi: 10.1007/s11547-025-02040-9.
This study investigates the automation of MRI protocoling, a routine task in radiology, using large language models (LLMs), comparing an open-source model (Llama 3.1 405B) and a proprietary model (GPT-4o), each with and without retrieval-augmented generation (RAG), a method for incorporating domain-specific knowledge.
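The abstract describes RAG only at a high level. As a rough illustration of the idea, the sketch below retrieves the institution-specific guideline snippets most similar to a clinical question and prepends them to the model prompt. It is a minimal sketch under stated assumptions: bag-of-words cosine similarity stands in for a real embedding model, and the guideline snippets and prompt wording are invented examples, not the study's actual materials.

```python
import math
from collections import Counter

def tokenize(text):
    # Lowercase word tokens; a crude stand-in for a real embedding model.
    return "".join(c if c.isalnum() else " " for c in text.lower()).split()

def cosine(a, b):
    # Cosine similarity between two bag-of-words vectors.
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, guideline_chunks, k=2):
    # Rank guideline snippets by similarity to the clinical question.
    q = tokenize(query)
    ranked = sorted(guideline_chunks,
                    key=lambda c: cosine(q, tokenize(c)), reverse=True)
    return ranked[:k]

def build_prompt(clinical_question, guideline_chunks):
    # Prepend the retrieved institution-specific context to the LLM prompt.
    context = "\n".join(retrieve(clinical_question, guideline_chunks))
    return (f"Institutional protocol guidelines:\n{context}\n\n"
            f"Clinical question: {clinical_question}\n"
            "Assign the MRI protocol and state whether contrast medium is required.")

# Hypothetical guideline snippets for illustration only.
guidelines = [
    "Suspected stroke: axial DWI, FLAIR, SWI, TOF angiography; no contrast.",
    "Tumor follow-up: T1, T2, FLAIR, post-contrast T1; contrast required.",
    "Headache workup: T2, FLAIR; contrast only if red flags present.",
]
print(build_prompt("Follow-up of known glioblastoma", guidelines))
```

With a real system, the retrieval step would typically use dense embeddings over the protocol guideline document, but the prompt-assembly pattern is the same.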
This retrospective study included MRI studies conducted between January and December 2023, along with institution-specific protocol assignment guidelines. Clinical questions were extracted, and a neuroradiologist established the gold standard protocol. LLMs were tasked with assigning MRI protocols and contrast medium administration with and without RAG. The results were compared to protocols selected by four radiologists. Token-based symmetric accuracy, the Wilcoxon signed-rank test, and the McNemar test were used for evaluation.
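The abstract names token-based symmetric accuracy and the McNemar test without giving formulas. The sketch below shows one plausible reading of each; the exact metric definition is an assumption, since the paper's formula is not stated here. Symmetric token accuracy is rendered as the Dice overlap of predicted and reference sequence tokens (penalizing missing and superfluous sequences alike), and the McNemar test is the standard exact two-sided version on discordant pairs.

```python
import math

def symmetric_token_accuracy(predicted, reference):
    # Assumed reading of "token-based symmetric accuracy": Dice overlap of
    # the two token sets, symmetric in predicted vs. reference protocol.
    p, r = set(predicted), set(reference)
    if not p and not r:
        return 1.0
    return 2 * len(p & r) / (len(p) + len(r))

def mcnemar_exact(b, c):
    # Exact two-sided McNemar test on discordant pairs:
    # b = cases only rater/model A got right, c = cases only B got right.
    n = b + c
    if n == 0:
        return 1.0
    p = 2 * sum(math.comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(p, 1.0)  # cap at 1 when b == c

# Hypothetical protocol for a single report (illustration only).
pred = ["T1", "T2", "FLAIR", "DWI"]
gold = ["T1", "T2", "FLAIR", "SWI", "DWI"]
print(symmetric_token_accuracy(pred, gold))  # 8/9, about 0.889
print(mcnemar_exact(3, 12))                  # about 0.035
```

Averaging the per-report score over all 100 reports would yield an accuracy figure comparable in spirit to the percentages reported below.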
Data from 100 neuroradiology reports (mean age 54.2 ± 18.41 years; 50% women) were included. RAG integration significantly improved accuracy in sequence and contrast media prediction for Llama 3.1 (sequences: 38% vs. 70%, P < .001; contrast media: 77% vs. 94%, P < .001) and GPT-4o (sequences: 43% vs. 81%, P < .001; contrast media: 79% vs. 92%, P = .006). GPT-4o outperformed Llama 3.1 in MRI sequence prediction (81% vs. 70%, P < .001), with accuracy comparable to the radiologists (81% ± 0.21, P = .43). Both models equaled the radiologists in predicting contrast media administration (Llama 3.1 RAG: 94% vs. 91% ± 0.2, P = .37; GPT-4o RAG: 92% vs. 91% ± 0.24, P = .48).
Large language models show great potential as decision-support tools for MRI protocoling, with performance similar to radiologists. RAG enhances the ability of LLMs to provide accurate, institution-specific protocol recommendations.