
Large language models encode clinical knowledge.

Affiliations

Google Research, Mountain View, CA, USA.

National Library of Medicine, Bethesda, MD, USA.

Publication information

Nature. 2023 Aug;620(7972):172-180. doi: 10.1038/s41586-023-06291-2. Epub 2023 Jul 12.

Abstract

Large language models (LLMs) have demonstrated impressive capabilities, but the bar for clinical applications is high. Attempts to assess the clinical knowledge of models typically rely on automated evaluations based on limited benchmarks. Here, to address these limitations, we present MultiMedQA, a benchmark combining six existing medical question answering datasets spanning professional medicine, research and consumer queries and a new dataset of medical questions searched online, HealthSearchQA. We propose a human evaluation framework for model answers along multiple axes including factuality, comprehension, reasoning, possible harm and bias. In addition, we evaluate Pathways Language Model (PaLM, a 540-billion parameter LLM) and its instruction-tuned variant, Flan-PaLM on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA, MedMCQA, PubMedQA and Measuring Massive Multitask Language Understanding (MMLU) clinical topics), including 67.6% accuracy on MedQA (US Medical Licensing Exam-style questions), surpassing the prior state of the art by more than 17%. However, human evaluation reveals key gaps. To resolve this, we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, knowledge recall and reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal limitations of today's models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLMs for clinical applications.
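The abstract describes instruction prompt tuning as a parameter-efficient way to align a frozen LLM to a new domain: only a small set of learned "soft prompt" vectors is trained, while all model weights stay fixed. A minimal NumPy sketch of the parameter-efficiency idea (all names, sizes, and the toy "frozen model" are hypothetical, not the paper's implementation):

```python
import numpy as np

D_MODEL = 64          # hidden size of the toy frozen model
N_SOFT_TOKENS = 5     # length of the learned soft prompt
VOCAB = 100           # toy vocabulary size

rng = np.random.default_rng(0)

# Frozen model weights: an embedding table standing in for the LLM.
frozen_embeddings = rng.normal(size=(VOCAB, D_MODEL))

# The only trainable parameters: soft prompt vectors prepended to the input.
soft_prompt = rng.normal(size=(N_SOFT_TOKENS, D_MODEL))

def encode(token_ids):
    """Prepend the learned soft prompt to the frozen token embeddings."""
    token_embs = frozen_embeddings[token_ids]           # (seq, d_model)
    return np.concatenate([soft_prompt, token_embs])    # (soft + seq, d_model)

seq = encode([3, 14, 15])
trainable = soft_prompt.size
frozen = frozen_embeddings.size
print(seq.shape)          # (8, 64): 5 soft tokens + 3 input tokens
print(trainable, frozen)  # 320 trainable vs 6400 frozen parameters
```

During training, gradients would flow only into `soft_prompt` (here 320 values) while the far larger frozen weights never change, which is what makes the approach feasible with only a few clinician-written exemplars.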


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5d3d/10396962/8d80c68f21fd/41586_2023_6291_Fig1_HTML.jpg
