Zakka Cyril, Shad Rohan, Chaurasia Akash, Dalal Alex R, Kim Jennifer L, Moor Michael, Fong Robyn, Phillips Curran, Alexander Kevin, Ashley Euan, Boyd Jack, Boyd Kathleen, Hirsch Karen, Langlotz Curt, Lee Rita, Melia Joanna, Nelson Joanna, Sallam Karim, Tullis Stacey, Vogelsong Melissa Ann, Cunningham John Patrick, Hiesinger William
Department of Cardiothoracic Surgery, Stanford Medicine, Stanford, CA.
Division of Cardiovascular Surgery, Penn Medicine, Philadelphia.
NEJM AI. 2024 Feb;1(2). doi: 10.1056/aioa2300068. Epub 2024 Jan 25.
Large language models (LLMs) have recently shown impressive zero-shot capabilities, whereby they can use auxiliary data, without the availability of task-specific training examples, to complete a variety of natural language tasks, such as summarization, dialogue generation, and question answering. However, despite many promising applications of LLMs in clinical medicine, adoption of these models has been limited by their tendency to generate incorrect and sometimes even harmful statements.
We tasked a panel of eight board-certified clinicians and two health care practitioners with evaluating Almanac, an LLM framework augmented with retrieval capabilities from curated medical resources for medical guideline and treatment recommendations. The panel compared responses from Almanac and standard LLMs (ChatGPT-4, Bing, and Bard) on a novel data set of 314 clinical questions spanning nine medical specialties.
Almanac showed a significant improvement in performance compared with the standard LLMs across axes of factuality, completeness, user preference, and adversarial safety.
Our results show the potential for LLMs with access to domain-specific corpora to be effective in clinical decision-making. The findings also underscore the importance of carefully testing LLMs before deployment to mitigate their shortcomings. (Funded by the National Institutes of Health, National Heart, Lung, and Blood Institute.)