Almanac - Retrieval-Augmented Language Models for Clinical Medicine.

Authors

Zakka Cyril, Shad Rohan, Chaurasia Akash, Dalal Alex R, Kim Jennifer L, Moor Michael, Fong Robyn, Phillips Curran, Alexander Kevin, Ashley Euan, Boyd Jack, Boyd Kathleen, Hirsch Karen, Langlotz Curt, Lee Rita, Melia Joanna, Nelson Joanna, Sallam Karim, Tullis Stacey, Vogelsong Melissa Ann, Cunningham John Patrick, Hiesinger William

Affiliations

Department of Cardiothoracic Surgery, Stanford Medicine, Stanford, CA.

Division of Cardiovascular Surgery, Penn Medicine, Philadelphia.

Publication information

NEJM AI. 2024 Feb;1(2). doi: 10.1056/aioa2300068. Epub 2024 Jan 25.

DOI:10.1056/aioa2300068

PMID:38343631

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10857783/

Abstract

BACKGROUND

Large language models (LLMs) have recently shown impressive zero-shot capabilities, whereby they can use auxiliary data, without the availability of task-specific training examples, to complete a variety of natural language tasks, such as summarization, dialogue generation, and question answering. However, despite many promising applications of LLMs in clinical medicine, adoption of these models has been limited by their tendency to generate incorrect and sometimes even harmful statements.

METHODS

We tasked a panel of eight board-certified clinicians and two health care practitioners with evaluating Almanac, an LLM framework augmented with retrieval capabilities from curated medical resources for medical guideline and treatment recommendations. The panel compared responses from Almanac and standard LLMs (ChatGPT-4, Bing, and Bard) on a novel data set of 314 clinical questions spanning nine medical specialties.

RESULTS

Almanac showed a significant improvement in performance compared with the standard LLMs across axes of factuality, completeness, user preference, and adversarial safety.

CONCLUSIONS

Our results show the potential for LLMs with access to domain-specific corpora to be effective in clinical decision-making. The findings also underscore the importance of carefully testing LLMs before deployment to mitigate their shortcomings. (Funded by the National Institutes of Health, National Heart, Lung, and Blood Institute.)

Similar articles

Almanac: Retrieval-Augmented Language Models for Clinical Medicine.

Res Sq. 2023 May 2:rs.3.rs-2883198. doi: 10.21203/rs.3.rs-2883198/v1.

Large Language Models in Hematology Case Solving: A Comparative Study of ChatGPT-3.5, Google Bard, and Microsoft Bing.

Cureus. 2023 Aug 21;15(8):e43861. doi: 10.7759/cureus.43861. eCollection 2023 Aug.

Performance of Large Language Models (ChatGPT, Bing Search, and Google Bard) in Solving Case Vignettes in Physiology.

Cureus. 2023 Aug 4;15(8):e42972. doi: 10.7759/cureus.42972. eCollection 2023 Aug.

Evaluation of the Performance of Generative AI Large Language Models ChatGPT, Google Bard, and Microsoft Bing Chat in Supporting Evidence-Based Dentistry: Comparative Mixed Methods Study.

J Med Internet Res. 2023 Dec 28;25:e51580. doi: 10.2196/51580.

Evidence-based potential of generative artificial intelligence large language models in orthodontics: a comparative study of ChatGPT, Google Bard, and Microsoft Bing.

Eur J Orthod. 2024 Apr 13. doi: 10.1093/ejo/cjae017.

Accuracy of large language models in answering ophthalmology board-style questions: A meta-analysis.

Asia Pac J Ophthalmol (Phila). 2024 Sep-Oct;13(5):100106. doi: 10.1016/j.apjo.2024.100106. Epub 2024 Oct 5.

Evaluation of Large language model performance on the Multi-Specialty Recruitment Assessment (MSRA) exam.

Comput Biol Med. 2024 Jan;168:107794. doi: 10.1016/j.compbiomed.2023.107794. Epub 2023 Nov 30.

Quality of Answers of Generative Large Language Models vs Peer Patients for Interpreting Lab Test Results for Lay Patients: Evaluation Study.

ArXiv. 2024 Jan 23:arXiv:2402.01693v1.

Benchmarking large language models' performances for myopia care: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Google Bard.

EBioMedicine. 2023 Sep;95:104770. doi: 10.1016/j.ebiom.2023.104770. Epub 2023 Aug 23.

Cited by

From large language models to multimodal AI: a scoping review on the potential of generative AI in medicine.

Biomed Eng Lett. 2025 Aug 22;15(5):845-863. doi: 10.1007/s13534-025-00497-1. eCollection 2025 Sep.

Development and evaluation of a lightweight large language model chatbot for medication enquiry.

PLOS Digit Health. 2025 Sep 4;4(9):e0000961. doi: 10.1371/journal.pdig.0000961. eCollection 2025 Sep.

Evaluating Retrieval Augmented Generation-enhanced Large Language Models for Question Answering On German Neurovascular Guidelines.

Clin Neuroradiol. 2025 Sep 2. doi: 10.1007/s00062-025-01562-z.

Enhancing Clinical Decision Support with Adaptive Iterative Self-Query Retrieval for Retrieval-Augmented Large Language Models.

Bioengineering (Basel). 2025 Aug 21;12(8):895. doi: 10.3390/bioengineering12080895.

Implementing a context-augmented large language model to guide precision cancer medicine.

medRxiv. 2025 Jul 24:2025.05.09.25327312. doi: 10.1101/2025.05.09.25327312.

The assessment of ChatGPT-4's performance compared to expert's consensus on chronic lateral ankle instability.

J Exp Orthop. 2025 Aug 5;12(3):e70393. doi: 10.1002/jeo2.70393. eCollection 2025 Jul.

The TRIPOD-LLM reporting guideline for studies using large language models: a Korean translation.

Ewha Med J. 2025 Jul;48(3):e49. doi: 10.12771/emj.2025.00661. Epub 2025 Jul 31.

Adaptive RAG-Assisted MRI Platform (ARAMP) for Brain Metastasis Detection and Reporting: A Retrospective Evaluation Using Post-Contrast T1-Weighted Imaging.

Bioengineering (Basel). 2025 Jun 26;12(7):698. doi: 10.3390/bioengineering12070698.

A scoping review of natural language processing in addressing medically inaccurate information: Errors, misinformation, and hallucination.

J Biomed Inform. 2025 Jul 22:104866. doi: 10.1016/j.jbi.2025.104866.

Accuracy of ChatGPT-3.5, ChatGPT-4o, Copilot, Gemini, Claude, and Perplexity in advising on lumbosacral radicular pain against clinical practice guidelines: cross-sectional study.

Front Digit Health. 2025 Jun 27;7:1574287. doi: 10.3389/fdgth.2025.1574287. eCollection 2025.
