Retrieval augmented generation for 10 large language models and its generalizability in assessing medical fitness.

Authors

Ke Yu He, Jin Liyuan, Elangovan Kabilan, Abdullah Hairil Rizal, Liu Nan, Sia Alex Tiong Heng, Soh Chai Rick, Tung Joshua Yi Min, Ong Jasmine Chiat Ling, Kuo Chang-Fu, Wu Shao-Chun, Kovacheva Vesela P, Ting Daniel Shu Wei

Affiliations

Department of Anesthesiology, Singapore General Hospital, Singapore, Singapore.

Data Science and Artificial Intelligence Lab, Singapore General Hospital, Singapore, Singapore.

Publication information

NPJ Digit Med. 2025 Apr 5;8(1):187. doi: 10.1038/s41746-025-01519-z.


DOI:10.1038/s41746-025-01519-z
PMID:40185842
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11971376/
Abstract

Large Language Models (LLMs) hold promise for medical applications but often lack domain-specific expertise. Retrieval Augmented Generation (RAG) enables customization by integrating specialized knowledge. This study assessed the accuracy, consistency, and safety of LLM-RAG models in determining surgical fitness and delivering preoperative instructions using 35 local and 23 international guidelines. Ten LLMs (e.g., GPT3.5, GPT4, GPT4o, Gemini, Llama2, Llama3, and Claude) were tested across 14 clinical scenarios. A total of 3234 responses were generated and compared to 448 human-generated answers. The GPT4 LLM-RAG model with international guidelines generated answers within 20 s and achieved the highest accuracy, which was significantly better than human-generated responses (96.4% vs. 86.6%, p = 0.016). Additionally, the model produced no hallucinations and gave more consistent output than humans. This study underscores the potential of GPT4-based LLM-RAG models to deliver highly accurate, efficient, and consistent preoperative assessments.
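
As a rough illustration of the RAG idea the abstract describes, the sketch below retrieves the guideline excerpts most similar to a question and places them in a prompt. It is a minimal toy under stated assumptions, not the study's pipeline: the function names, the bag-of-words cosine scoring, the prompt wording, and the placeholder excerpts are all hypothetical, and the final call to a language model is left out.

```python
import math
import re
from collections import Counter

# Hypothetical placeholder excerpts; these are NOT real guideline text.
GUIDELINE_CHUNKS = [
    "PLACEHOLDER: fasting instructions before elective surgery",
    "PLACEHOLDER: anticoagulant management before surgery",
    "PLACEHOLDER: cafeteria opening hours",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query (assumed scoring)."""
    q = Counter(tokenize(query))
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine(q, Counter(tokenize(chunk))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, chunks, k=2):
    """Assemble a prompt that grounds the question in retrieved excerpts."""
    context = "\n".join(f"- {chunk}" for chunk in retrieve(query, chunks, k))
    return (
        "Answer using only the guideline excerpts below.\n"
        f"{context}\n\n"
        f"Question: {query}"
    )


if __name__ == "__main__":
    question = "What fasting instructions apply before elective surgery?"
    print(build_prompt(question, GUIDELINE_CHUNKS))
```

In a real system the word-count similarity would typically be replaced by an embedding model and a vector index, and the assembled prompt would be sent to the chosen LLM; those parts are omitted here because the abstract does not specify them.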

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/41f3/11971376/710d41364200/41746_2025_1519_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/41f3/11971376/195110d25f51/41746_2025_1519_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/41f3/11971376/b4adb02ba946/41746_2025_1519_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/41f3/11971376/d51ec5e84ac6/41746_2025_1519_Fig3_HTML.jpg
