A Comparative Evaluation of Large Language Model Utility in Neuroimaging Clinical Decision Support.

Authors

Miller Luke, Kamel Peter, Patel Jigar, Agrawal Jay, Zhan Min, Bumbarger Nathan, Wang Kenneth

Affiliations

Department of Radiology, University of Maryland Medical Center, Baltimore, MD, USA.

Department of Radiology, Baltimore VA Medical Center, Baltimore, MD, USA.

Publication information

J Imaging Inform Med. 2024 Nov 7. doi: 10.1007/s10278-024-01161-3.

DOI:10.1007/s10278-024-01161-3

PMID:39508992

Abstract

Imaging utilization has increased dramatically in recent years, and at least some of these studies are not appropriate for the clinical scenario. Large language models (LLMs) may address this issue by giving ordering providers a more accessible reference resource, but their relative performance is currently understudied. This study evaluates and compares the appropriateness and usefulness of imaging recommendations generated by eight publicly available models in response to neuroradiology clinical scenarios.

Twenty-four common neuroradiology clinical scenarios that often yield suboptimal imaging utilization were selected. Questions were crafted to assess the ability of LLMs to provide accurate and actionable advice. The LLMs were assessed in August 2023 using natural-language queries of one to two sentences requesting advice about optimal image ordering given certain clinical parameters. Eight of the most well-known LLMs were chosen for evaluation: ChatGPT, GPT-4, Bard (Versions 1 and 2), Bing Chat, Llama 2, Perplexity, and Claude. Three fellowship-trained neuroradiologists graded each model's advice as "optimal" or "not optimal" according to the ACR Appropriateness Criteria or the New Orleans Head CT Criteria. The raters also ranked the models based on the appropriateness, helpfulness, concision, and source citations of their responses.

The models varied in their ability to deliver an "optimal" recommendation for these scenarios as follows: ChatGPT (20/24), GPT-4 (23/24), Bard 1 (13/24), Bard 2 (14/24), Bing Chat (14/24), Llama 2 (5/24), Perplexity (19/24), and Claude (19/24). The median ranks of the LLMs were as follows: ChatGPT (3), GPT-4 (1.5), Bard 1 (4.5), Bard 2 (5), Bing Chat (6), Llama 2 (7.5), Perplexity (4), and Claude (3). Characteristic errors are described and discussed. GPT-4, ChatGPT, and Claude generally outperformed Bard, Bing Chat, and Llama 2.

This study evaluates a wider variety of publicly available LLMs in settings that more closely mimic real-world use cases, and discusses the practical challenges of doing so. It is the first study to evaluate and compare a wide range of publicly available LLMs to determine the appropriateness of their neuroradiology imaging recommendations.
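The reported counts can be easier to compare as proportions of the 24 scenarios. The following is a minimal sketch, not part of the study: the counts are taken from the abstract, while the names `OPTIMAL_COUNTS`, `TOTAL_SCENARIOS`, `optimal_rates`, and `sorted_by_rate` are hypothetical and chosen only for illustration.

```python
# Counts of "optimal" recommendations out of 24 scenarios, as reported in the abstract.
# All identifiers below are illustrative assumptions, not the authors' code.
OPTIMAL_COUNTS = {
    "ChatGPT": 20,
    "GPT-4": 23,
    "Bard 1": 13,
    "Bard 2": 14,
    "Bing Chat": 14,
    "Llama 2": 5,
    "Perplexity": 19,
    "Claude": 19,
}
TOTAL_SCENARIOS = 24


def optimal_rates(counts, total):
    """Return the fraction of scenarios rated "optimal" for each model."""
    return {model: n / total for model, n in counts.items()}


def sorted_by_rate(counts, total):
    """Return (model, rate) pairs ordered from highest to lowest rate."""
    rates = optimal_rates(counts, total)
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for model, rate in sorted_by_rate(OPTIMAL_COUNTS, TOTAL_SCENARIOS):
        print(f"{model:<11} {rate:6.1%}")
```

This ordering reflects only the binary "optimal" grading; the median ranks in the abstract also account for helpfulness, concision, and source citations, so the two orderings need not agree.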

