Intra-axial primary brain tumor differentiation: comparing large language models on structured MRI reports vs. radiologists on images.

Authors

Nakaura Takeshi, Uetani Hiroyuki, Yoshida Naofumi, Kobayashi Naoki, Nagayama Yasunori, Kidoh Masafumi, Kuroda Jun-Ichiro, Mukasa Akitake, Hirai Toshinori

Affiliations

Department of Diagnostic Radiology, Graduate School of Medical Sciences, Kumamoto University, Kumamoto, Japan.

Department of Neurosurgery, Graduate School of Medical Sciences, Kumamoto University, Kumamoto, Japan.

Publication

Eur Radiol. 2025 Aug 22. doi: 10.1007/s00330-025-11924-3.

DOI: 10.1007/s00330-025-11924-3
PMID: 40847080

Abstract

OBJECTIVE

This study aimed to evaluate the potential of large language models (LLMs) to differentiate intra-axial primary brain tumors from structured magnetic resonance imaging (MRI) reports, and to compare their performance with that of radiologists.

MATERIALS AND METHODS

Structured reports of preoperative MRI findings from 137 surgically confirmed intra-axial primary brain tumors, including Glioblastoma (n = 77), Central Nervous System (CNS) Lymphoma (n = 22), Astrocytoma (n = 9), Oligodendroglioma (n = 9), and others (n = 20), were analyzed by multiple LLMs, including GPT-4, Claude-3-Opus, Claude-3-Sonnet, GPT-3.5, Llama-2-70B, Qwen1.5-72B, and Gemini-Pro-1.0. The models provided the top 5 differential diagnoses based on the preoperative MRI findings, and their top 1, 3, and 5 accuracies were compared with board-certified neuroradiologists' interpretations of the actual preoperative MRI images.
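
The top 1, 3, and 5 accuracy metric described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `top_k_accuracy`, the toy cases, and the exact string matching of diagnoses are all assumptions, since the abstract does not state how a model's answer was judged to match the surgical diagnosis.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_predictions: Sequence[Sequence[str]],
    truths: Sequence[str],
    k: int,
) -> float:
    """Fraction of cases whose confirmed diagnosis is among the first k ranked guesses."""
    if len(ranked_predictions) != len(truths):
        raise ValueError("Each case needs exactly one ranked list and one confirmed diagnosis.")
    if not truths:
        raise ValueError("At least one case is required.")
    # Assumption: a hit is an exact string match; a real study would need a
    # clinically reviewed matching rule.
    hits = sum(truth in preds[:k] for preds, truth in zip(ranked_predictions, truths))
    return hits / len(truths)


# Hypothetical toy data (NOT from the study): four cases, five ranked guesses each.
truths = ["Glioblastoma", "CNS Lymphoma", "Astrocytoma", "Oligodendroglioma"]
ranked = [
    ["Glioblastoma", "Metastasis", "CNS Lymphoma", "Astrocytoma", "Abscess"],
    ["Glioblastoma", "CNS Lymphoma", "Metastasis", "Astrocytoma", "Abscess"],
    ["Glioblastoma", "Oligodendroglioma", "Metastasis", "Astrocytoma", "Abscess"],
    ["Glioblastoma", "Astrocytoma", "Metastasis", "CNS Lymphoma", "Ependymoma"],
]

for k in (1, 3, 5):
    print(f"top-{k} accuracy: {top_k_accuracy(ranked, truths, k):.2f}")
```

Because each top-k list contains the shorter lists before it, accuracy can only stay the same or rise as k grows, which is the pattern the reported figures follow.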

RESULTS

Radiologists achieved top 1, 3, and 5 accuracies of 85.4%, 94.9%, and 94.9%, respectively. Among the LLMs, GPT-4 performed best with top 1, 3, and 5 accuracies of 65.7%, 84.7%, and 90.5%, respectively. Notably, GPT-4's top 3 accuracy of 84.7% approached the radiologists' top 1 accuracy of 85.4%. Other LLMs showed varying performance levels, with average accuracies ranging from 62.3% to 75.9%. LLMs demonstrated high accuracy for Glioblastoma but struggled with CNS Lymphoma and other less common tumors, particularly in top 1 accuracy.
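
To put these percentages in terms of cases, the sketch below back-calculates the approximate number of correct diagnoses out of 137. The counts are an inference from the rounded percentages, not figures reported in the abstract, and the names `implied_count` and `reported` are illustrative only.

```python
N_CASES = 137  # total tumors, as stated in the methods

# Percent accuracies copied from the results paragraph, keyed by k.
reported = {
    "radiologists": {1: 85.4, 3: 94.9, 5: 94.9},
    "GPT-4": {1: 65.7, 3: 84.7, 5: 90.5},
}


def implied_count(percent: float, n: int = N_CASES) -> int:
    """Nearest whole number of correct cases consistent with a rounded percentage."""
    return round(percent / 100 * n)


# Back-calculated counts: an assumption, since the abstract gives percentages only.
implied = {
    reader: {k: implied_count(p) for k, p in accs.items()}
    for reader, accs in reported.items()
}

# Gaps in percentage points, rounded to avoid floating-point noise.
top1_gap = round(reported["radiologists"][1] - reported["GPT-4"][1], 1)
top3_vs_top1_gap = round(reported["radiologists"][1] - reported["GPT-4"][3], 1)

for reader, counts in implied.items():
    print(reader, counts)
print("top-1 gap (percentage points):", top1_gap)
print("radiologist top-1 minus GPT-4 top-3 (percentage points):", top3_vs_top1_gap)
```

Under this reading, GPT-4's top 3 list would contain the correct diagnosis in roughly one case fewer than the radiologists' first choice, while the top 1 gap is close to 20 percentage points.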

CONCLUSION

LLMs show promise as assistive tools for differentiating intra-axial primary brain tumors using structured MRI reports. However, a significant gap remains between their performance and that of board-certified neuroradiologists interpreting actual images. The choice of LLM and tumor type significantly influences the results.

KEY POINTS

Question: How do large language models (LLMs) perform when differentiating complex intra-axial primary brain tumors from structured MRI reports, compared with radiologists interpreting images?

Findings: Radiologists outperformed all tested LLMs in diagnostic accuracy. The best model, GPT-4, showed promise but lagged considerably behind radiologists, particularly for less common tumors.

Clinical relevance: LLMs show potential as assistive tools for generating differential diagnoses from structured MRI reports, particularly for non-specialists, but they cannot currently replace the nuanced diagnostic expertise of a board-certified radiologist interpreting the primary image data.
