A Comparative Evaluation of Large Language Model Utility in Neuroimaging Clinical Decision Support.

Authors

Miller Luke, Kamel Peter, Patel Jigar, Agrawal Jay, Zhan Min, Bumbarger Nathan, Wang Kenneth

Affiliations

Department of Radiology, University of Maryland Medical Center, Baltimore, MD, USA.

Department of Radiology, Baltimore VA Medical Center, Baltimore, MD, USA.

Publication information

J Imaging Inform Med. 2024 Nov 7. doi: 10.1007/s10278-024-01161-3.

DOI: 10.1007/s10278-024-01161-3
PMID: 39508992
Abstract

Imaging utilization has increased dramatically in recent years, and at least some of these studies are not appropriate for the clinical scenario. The development of large language models (LLMs) may address this issue by providing a more accessible reference resource for ordering providers, but their relative performance is currently understudied. This study evaluates and compares the relative appropriateness and usefulness of imaging recommendations generated by eight publicly available models in response to neuroradiology clinical scenarios.

Twenty-four common neuroradiology clinical scenarios that often yield suboptimal imaging utilization were selected. Questions were crafted to assess the ability of LLMs to provide accurate and actionable advice. The LLMs were assessed in August 2023 using natural-language queries of one to two sentences requesting advice about optimal image ordering given certain clinical parameters. Eight of the most well-known LLMs were chosen for evaluation: ChatGPT, GPT-4, Bard (Versions 1 and 2), Bing Chat, Llama 2, Perplexity, and Claude. Three fellowship-trained neuroradiologists graded the models on whether their advice was "optimal" or "not optimal" according to the ACR Appropriateness Criteria or the New Orleans Head CT Criteria. The raters also ranked the models based on the appropriateness, helpfulness, concision, and source citations in their responses.

The number of scenarios in which each model delivered an "optimal" recommendation was as follows: ChatGPT (20/24), GPT-4 (23/24), Bard 1 (13/24), Bard 2 (14/24), Bing Chat (14/24), Llama 2 (5/24), Perplexity (19/24), and Claude (19/24). The median ranks of the LLMs were as follows: ChatGPT (3), GPT-4 (1.5), Bard 1 (4.5), Bard 2 (5), Bing Chat (6), Llama 2 (7.5), Perplexity (4), and Claude (3). Characteristic errors are described and discussed. GPT-4, ChatGPT, and Claude generally outperformed Bard, Bing Chat, and Llama 2.

This study evaluates the performance of a greater variety of publicly available LLMs in settings that more closely mimic real-world use cases and discusses the practical challenges of doing so. This is the first study to evaluate and compare a wide range of publicly available LLMs to determine the appropriateness of their neuroradiology imaging recommendations.
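
The per-model counts reported in the abstract are easier to compare as percentages of the 24 scenarios. The following is a minimal sketch, not the authors' analysis: the counts are copied from the abstract, while the function names (`optimal_rates`, `ranked`) and the one-decimal rounding are illustrative assumptions.

```python
# Counts of "optimal" recommendations out of 24 scenarios,
# copied from the abstract. Everything else here is illustrative.
OPTIMAL_COUNTS = {
    "ChatGPT": 20,
    "GPT-4": 23,
    "Bard 1": 13,
    "Bard 2": 14,
    "Bing Chat": 14,
    "Llama 2": 5,
    "Perplexity": 19,
    "Claude": 19,
}
TOTAL_SCENARIOS = 24


def optimal_rates(counts, total):
    """Return {model: percent of scenarios graded optimal}, rounded to 0.1."""
    return {model: round(100 * n / total, 1) for model, n in counts.items()}


def ranked(counts, total):
    """Return (model, percent) pairs sorted from highest to lowest rate."""
    rates = optimal_rates(counts, total)
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for model, pct in ranked(OPTIMAL_COUNTS, TOTAL_SCENARIOS):
        print(f"{model:<10} {pct:5.1f}%")
```

Note that this ordering reflects only the share of "optimal" recommendations. The median ranks in the abstract also account for helpfulness, concision, and source citations, so the two orderings need not agree.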
