

Enhancing Large Language Models for Improved Accuracy and Safety in Medical Question Answering: Comparative Study.

Authors

Wang Dingqiao, Ye Jinguo, Li Jingni, Liang Jiangbo, Zhang Qikai, Hu Qiuling, Pan Caineng, Wang Dongliang, Liu Zhong, Shi Wen, Guo Mengxiang, Li Fei, Du Wei, Zheng Ying-Feng

Affiliations

State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, Guangzhou, China.

Department of Ophthalmology, Eighth Affiliated Hospital of Sun Yat-sen University, Shenzhen, Guangdong, China.

Publication

JMIR Med Educ. 2025 Dec 2;11:e70190. doi: 10.2196/70190.

DOI: 10.2196/70190
PMID: 41329953
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12709156/
Abstract

BACKGROUND

Large language models (LLMs) offer the potential to improve virtual patient-physician communication and reduce health care professionals' workload. However, limitations in accuracy, outdated knowledge, and safety issues restrict their effective use in real clinical settings. Addressing these challenges is crucial for making LLMs a reliable health care tool.

OBJECTIVE

This study aimed to evaluate the efficacy of Med-RISE, an information retrieval and augmentation tool, in comparison with baseline LLMs, focusing on enhancing accuracy and safety in medical question answering across diverse clinical domains.

METHODS

This comparative study introduces Med-RISE, an enhanced version of a retrieval-augmented generation framework specifically designed to improve question-answering performance across wide-ranging medical domains and diverse disciplines. Med-RISE consists of 4 key steps: query rewriting, information retrieval (providing local and real-time retrieval), summarization, and execution (a fact and safety filter before output). This study integrated Med-RISE with 4 LLMs (GPT-3.5, GPT-4, Vicuna-13B, and ChatGLM-6B) and assessed their performance on 4 multiple-choice medical question datasets: MedQA (US Medical Licensing Examination), PubMedQA (original and revised versions), MedMCQA, and EYE500. Primary outcome measures included answer accuracy and hallucination rates, with hallucinations categorized into factuality (inaccurate information) or faithfulness (inconsistency with instructions) types. This study was conducted between March 2024 and August 2024.
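The four Med-RISE steps can be pictured as a simple pipeline. The sketch below is purely illustrative: all function names and the stubbed retrieval/filter logic are assumptions for exposition, not the authors' implementation, which this abstract does not detail.

```python
# Illustrative sketch of a Med-RISE-style four-step pipeline.
# All names and logic here are hypothetical stand-ins for the
# components the abstract describes.

def rewrite_query(question: str) -> str:
    # Step 1: query rewriting — normalize the question for retrieval.
    return question.strip().rstrip("?").lower()

def retrieve(query: str, local_index: dict) -> list:
    # Step 2: information retrieval — the paper combines local and
    # real-time sources; here only a toy local keyword lookup is shown.
    return [doc for key, doc in local_index.items() if key in query]

def summarize(passages: list) -> str:
    # Step 3: summarization — condense retrieved evidence into context.
    return " ".join(passages)

def execute(context: str, question: str) -> str:
    # Step 4: execution — a fact/safety filter before output; stubbed
    # here as refusing to answer without supporting evidence.
    if not context:
        return "Insufficient evidence to answer safely."
    return f"Q: {question}\nEvidence: {context}"

def med_rise_answer(question: str, local_index: dict) -> str:
    query = rewrite_query(question)
    passages = retrieve(query, local_index)
    context = summarize(passages)
    return execute(context, question)
```

The design point the abstract emphasizes is the final execution step: generation is gated behind a fact-and-safety check rather than returned directly.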

RESULTS

The integration of Med-RISE with each LLM led to a substantial increase in accuracy, with improvements ranging from 9.8% to 16.3% (mean 13%, SD 2.3%) across the 4 datasets. The enhanced accuracy rates were 16.3%, 12.9%, 13%, and 9.8% for GPT-3.5, GPT-4, Vicuna-13B, and ChatGLM-6B, respectively. In addition, Med-RISE effectively reduced hallucinations, with reductions ranging from 11.8% to 18% (mean 15.1%, SD 2.8%), factuality hallucinations decreasing by 13.5%, and faithfulness hallucinations decreasing by 5.8%. The hallucination rate reductions were 17.7%, 12.8%, 18%, and 11.8% for GPT-3.5, GPT-4, Vicuna-13B, and ChatGLM-6B, respectively.
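The reported means and SDs follow from the four per-model figures; they match the population standard deviation of those values, as a quick check shows:

```python
# Verifying the abstract's summary statistics from the per-model figures
# (order: GPT-3.5, GPT-4, Vicuna-13B, ChatGLM-6B).
from statistics import mean, pstdev

accuracy_gains = [16.3, 12.9, 13.0, 9.8]
hallucination_drops = [17.7, 12.8, 18.0, 11.8]

acc_mean, acc_sd = round(mean(accuracy_gains), 1), round(pstdev(accuracy_gains), 1)
hal_mean, hal_sd = round(mean(hallucination_drops), 1), round(pstdev(hallucination_drops), 1)
print(acc_mean, acc_sd)   # → 13.0 2.3
print(hal_mean, hal_sd)   # → 15.1 2.8
```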

CONCLUSIONS

The Med-RISE framework significantly improves the accuracy and reduces the hallucinations of LLMs in medical question answering across benchmark datasets. By providing local and real-time information retrieval and fact and safety filtering, Med-RISE enhances the reliability and interpretability of LLMs in the medical domain, offering a promising tool for clinical practice and decision support.


