Do LLMs Have 'the Eye' for MRI? Evaluating GPT-4o, Grok, and Gemini on Brain MRI Performance: First Evaluation of Grok in Medical Imaging and a Comparative Analysis.

Authors

Sozer Alperen, Sahin Mustafa Caglar, Sozer Batuhan, Erol Gokberk, Tufek Ozan Yavuz, Nernekli Kerem, Demirtas Zuhal, Celtikci Emrah

Affiliations

Department of Neurosurgery, Sincan Training and Research Hospital, Ankara 06949, Turkey.

Department of Neurosurgery, Kulu State Hospital, Konya 42780, Turkey.

Publication information

Diagnostics (Basel). 2025 May 24;15(11):1320. doi: 10.3390/diagnostics15111320.

Abstract

Background: Large language models (LLMs) are revolutionizing the world and the field of medicine while constantly improving themselves. With recent advancements in image interpretation, evaluating the reasoning capabilities of these models and benchmarking their performance on brain MRI tasks has become crucial, as they may be utilized, albeit off-label, for patient care by both neurosurgeons and non-neurosurgeons. Methods: ChatGPT-4o, Grok, and Gemini were presented with 35,711 slices of brain MRI, including various pathologies and normal MRIs. The models were asked to identify the MRI sequence and determine the presence of pathology. Their individual performances were measured and compared with one another. Results: GPT-4o refused to answer 28.02% of the slices despite three attempts, whereas Grok and Gemini provided responses on the first attempt for every slice. Gemini achieved 74.54% pathology prediction and 46.38% sequence prediction accuracy. GPT-4o achieved 74.33% pathology prediction and 85.98% sequence prediction accuracy for the questions that it answered (53.50% and 61.67% in total, respectively). Grok achieved 65.64% pathology prediction and 66.23% sequence prediction accuracy. Conclusions: The image interpretation capabilities of the investigated LLMs are limited for now and require further refinement before they can compete with specifically trained and fine-tuned dedicated applications. Among them, Gemini outperforms the others in pathology prediction, while Grok outperforms the others in sequence prediction. These limitations should be kept in mind if use during patient care is planned.
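The "in total" figures for GPT-4o can be read as accuracy over all slices, with every refused slice counted as a wrong answer. The abstract does not state the formula, so the following is only a minimal sketch under that assumption; the function name `overall_accuracy` is illustrative and not taken from the paper.

```python
def overall_accuracy(answered_accuracy_pct: float, refusal_rate_pct: float) -> float:
    """Accuracy over all slices, assuming each refused slice counts as incorrect."""
    answered_fraction = 1.0 - refusal_rate_pct / 100.0
    return answered_accuracy_pct * answered_fraction


# Figures reported in the abstract for GPT-4o.
REFUSAL_RATE_PCT = 28.02
pathology_total = overall_accuracy(74.33, REFUSAL_RATE_PCT)
sequence_total = overall_accuracy(85.98, REFUSAL_RATE_PCT)

print(f"pathology, all slices: {pathology_total:.2f}%")
print(f"sequence, all slices:  {sequence_total:.2f}%")
```

Under this assumption the pathology figure reproduces the reported 53.50%. The sequence figure comes out near 61.89% rather than the reported 61.67%, so the denominators used for the two tasks may differ slightly; the abstract does not say.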

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/af5f/12154409/24e964b61d5c/diagnostics-15-01320-g0A1.jpg
