Automated MRI protocoling in neuroradiology in the era of large language models.

Author information

Reiner Lara Noelle, Chelbi Moudather, Fetscher Leonard, Stöckel Juliane C, Csapó-Schmidt Christoph, Guseynova Shakhnaz, Al Mohamad Fares, Bressem Keno Kyrill, Nawabi Jawed, Siebert Eberhard, Wattjes Mike P, Scheel Michael, Meddeb Aymen

Affiliations

Department of Neuroradiology, Charité-Universitätsmedizin Berlin, Augustenburger Platz 1, 13353, Berlin, Germany.

Department of Radiology, Technical University Munich, Klinikum Rechts Der Isar, Ismaninger Str. 22, 81675, Munich, Germany.

Publication information

Radiol Med. 2025 Jul 11. doi: 10.1007/s11547-025-02040-9.

Abstract

PURPOSE

This study investigates the automation of MRI protocoling, a routine task in radiology, using large language models (LLMs). It compares an open-source model (Llama 3.1 405B) and a proprietary model (GPT-4o), each with and without retrieval-augmented generation (RAG), a method for incorporating domain-specific knowledge.
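As a rough illustration of what RAG means in this setting, the sketch below retrieves the guideline entry that best matches a clinical question and places it in the prompt. It is not the study's pipeline: the guideline entries are invented placeholders, the keyword-overlap ranking stands in for whatever retriever the authors used, and the names `GUIDELINES`, `retrieve`, and `build_prompt` are hypothetical.

```python
import re

# Invented placeholder entries, NOT the institution's real protocol guidelines.
GUIDELINES = [
    "Suspected multiple sclerosis: placeholder protocol A, contrast: yes",
    "Suspected acute stroke: placeholder protocol B, contrast: no",
    "Suspected brain tumor: placeholder protocol C, contrast: yes",
]


def tokenize(text):
    """Lowercase a string and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(question, entries, top_k=1):
    """Rank entries by keyword overlap with the question (assumed retriever)."""
    q = tokenize(question)
    ranked = sorted(entries, key=lambda e: len(q & tokenize(e)), reverse=True)
    return ranked[:top_k]


def build_prompt(question, entries, top_k=1):
    """Prepend the retrieved guideline text to the clinical question."""
    context = "\n".join(retrieve(question, entries, top_k))
    return (
        "Use the guideline excerpts to assign an MRI protocol.\n"
        f"Guidelines:\n{context}\n"
        f"Clinical question: {question}\n"
    )


print(build_prompt("Rule out acute stroke", GUIDELINES))
```

Without the retrieval step the model sees only the clinical question, which corresponds to the "without RAG" condition compared in the study.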

MATERIAL AND METHODS

This retrospective study included MRI studies conducted between January and December 2023, along with institution-specific protocol assignment guidelines. Clinical questions were extracted, and a neuroradiologist established the gold standard protocol. LLMs were tasked with assigning MRI protocols and contrast medium administration with and without RAG. The results were compared to protocols selected by four radiologists. Token-based symmetric accuracy, the Wilcoxon signed-rank test, and the McNemar test were used for evaluation.
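The McNemar test mentioned above compares two paired binary outcomes, for example whether a model's contrast decision matched the gold standard with and without RAG on the same cases. A minimal sketch of the exact two-sided version follows. The abstract does not say which variant the authors used, and the function name `mcnemar_exact` and the toy data are assumptions for illustration only.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Exact two-sided McNemar test on paired correct/incorrect outcomes.

    Returns (b, c, p), where b counts cases only condition A got right
    and c counts cases only condition B got right.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("paired inputs must have equal length")
    # Only discordant pairs carry information about a difference.
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = b + c
    if n == 0:
        return b, c, 1.0
    # Under the null, the smaller count is Binomial(n, 0.5);
    # double the lower tail and cap at 1 for a two-sided p-value.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy data (invented): 1 = matched the gold standard, 0 = did not.
without_rag = [1, 1, 0, 0, 0, 0, 0, 1, 0, 1]
with_rag = [1, 1, 1, 1, 1, 1, 1, 0, 0, 1]
print(mcnemar_exact(without_rag, with_rag))
```

Cases where both conditions agree do not enter the statistic, which is why the test suits before-and-after comparisons on the same set of reports.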

RESULTS

Data from 100 neuroradiology reports (mean age = 54.2 years ± 18.41, women 50%) were included. RAG integration significantly improved accuracy in sequence and contrast media prediction for Llama 3.1 (Sequences: 38% vs. 70%, P < .001, Contrast Media: 77% vs. 94%, P < .001) and GPT-4o (Sequences: 43% vs. 81%, P < .001, Contrast Media: 79% vs. 92%, P = .006). GPT-4o outperformed Llama 3.1 in MRI sequence prediction (81% vs. 70%, P < .001), with accuracy comparable to that of the radiologists (81% ± 0.21, P = .43). Both models equaled radiologists in predicting contrast media administration (Llama 3.1 RAG: 94% vs. 91% ± 0.2, P = .37, GPT-4o RAG: 92% vs. 91% ± 0.24, P = .48).

CONCLUSION

Large language models show great potential as decision-support tools for MRI protocoling, with performance similar to radiologists. RAG enhances the ability of LLMs to provide accurate, institution-specific protocol recommendations.
