Evaluating large language model-generated brain MRI protocols: performance of GPT4o, o3-mini, DeepSeek-R1 and Qwen2.5-72B.

Author information

Kim Su Hwan, Schramm Severin, Schmitzer Lena, Serguen Kerem, Ziegelmayer Sebastian, Busch Felix, Komenda Alexander, Makowski Marcus R, Adams Lisa C, Bressem Keno K, Zimmer Claus, Kirschke Jan, Wiestler Benedikt, Hedderich Dennis, Finck Tom, Bodden Jannis

Affiliations

Institute of Diagnostic and Interventional Radiology, TUM University Hospital, School of Medicine and Health, Technical University of Munich, Munich, Germany.

Institute of Diagnostic and Interventional Neuroradiology, TUM University Hospital, School of Medicine and Health, Technical University of Munich, Munich, Germany.

Publication information

Eur Radiol. 2025 Sep 3. doi: 10.1007/s00330-025-11989-0.

Abstract

OBJECTIVES

To evaluate the potential of LLMs to generate sequence-level brain MRI protocols.

MATERIALS AND METHODS

This retrospective study used a dataset of 150 brain MRI cases derived from local imaging request forms. Reference protocols were established by two neuroradiologists. GPT-4o, o3-mini, DeepSeek-R1 and Qwen2.5-72B were used to generate brain MRI protocols based on the case descriptions. Protocol generation was conducted (1) with additional in-context learning involving local standard protocols (enhanced) and (2) without additional information (base). Additionally, two radiology residents independently defined MRI protocols. The sum of redundant and missing sequences (accuracy index) was defined as the performance metric, so lower values indicate better performance. Accuracy indices were compared between groups using paired t-tests.
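The accuracy index and the paired comparison described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes each protocol is a list of sequence names that can be compared as a set, and the function names (`accuracy_index`, `paired_t_statistic`) and the sequence names in the example case are hypothetical. The abstract does not say how p-values were adjusted, so only the unadjusted t statistic is shown.

```python
from math import sqrt
from statistics import mean, stdev


def accuracy_index(generated, reference):
    """Count redundant plus missing sequences; 0 means an exact match."""
    generated_set, reference_set = set(generated), set(reference)
    redundant = generated_set - reference_set  # proposed but not in the reference
    missing = reference_set - generated_set    # in the reference but not proposed
    return len(redundant) + len(missing)


def paired_t_statistic(scores_a, scores_b):
    """t statistic for paired samples: mean difference over its standard error."""
    differences = [a - b for a, b in zip(scores_a, scores_b)]
    standard_error = stdev(differences) / sqrt(len(differences))
    return mean(differences) / standard_error


# Hypothetical case: the sequence names are illustrative only.
reference = ["T1", "T2", "FLAIR", "DWI", "T1 post-contrast"]
generated = ["T1", "T2", "FLAIR", "SWI"]
example_index = accuracy_index(generated, reference)  # 1 redundant + 2 missing
```

In a setup like this, each model and each resident would get one accuracy index per case, and two raters' per-case indices would be passed to `paired_t_statistic` in the same case order.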

RESULTS

The two neuroradiologists achieved substantial inter-rater agreement (Cohen's κ = 0.74). o3-mini achieved the best performance (base: 2.65 ± 1.61; enhanced: 1.94 ± 1.25), followed by GPT-4o (base: 3.11 ± 1.83; enhanced: 2.23 ± 1.48), DeepSeek-R1 (base: 3.42 ± 1.84; enhanced: 2.37 ± 1.42) and Qwen2.5-72B (base: 5.95 ± 2.78; enhanced: 2.75 ± 1.54). o3-mini consistently outperformed the other models by a significant margin. All four models showed highly significant performance improvements under the enhanced condition (adj. p < 0.001 for all models). The highest-performing LLM (o3-mini [enhanced]) yielded an accuracy index comparable to that of the residents (o3-mini [enhanced]: 1.94 ± 1.25, resident 1: 1.77 ± 1.29, resident 2: 1.77 ± 1.28).
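To make the size of the base-to-enhanced improvement easier to compare across models, the reported means can be subtracted directly. The sketch below uses only the mean accuracy indices quoted in the abstract; the variable and function names are illustrative, and a difference of means is descriptive arithmetic, not a substitute for the paired t-tests the authors ran.

```python
# Mean accuracy indices (base, enhanced) as reported in the abstract.
reported_means = {
    "o3-mini": (2.65, 1.94),
    "GPT-4o": (3.11, 2.23),
    "DeepSeek-R1": (3.42, 2.37),
    "Qwen2.5-72B": (5.95, 2.75),
}


def absolute_reduction(base, enhanced):
    """Drop in mean accuracy index from base to enhanced (larger = bigger gain)."""
    return round(base - enhanced, 2)


reductions = {
    model: absolute_reduction(base, enhanced)
    for model, (base, enhanced) in reported_means.items()
}

# Model whose mean accuracy index fell the most under the enhanced condition.
largest_gain = max(reductions, key=reductions.get)
```

On these figures the model with the weakest base performance, Qwen2.5-72B, shows the largest absolute reduction, while o3-mini keeps the lowest mean index in both conditions.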

CONCLUSION

Our findings demonstrate the promising potential of LLMs in automating brain MRI protocoling, especially when augmented through in-context learning. o3-mini exhibited superior performance, followed by GPT-4o.

KEY POINTS

Question: Brain MRI protocoling is a time-consuming, non-interpretative task, exacerbating radiologist workload.

Findings: o3-mini demonstrated superior brain MRI protocoling performance. All models showed notable improvements when augmented with local standard protocols.

Clinical relevance: MRI protocoling is a time-intensive, non-interpretative task that adds to radiologist workload; large language models offer potential for (semi-)automation of this process.
