Advancing Question-Answering in Ophthalmology With Retrieval-Augmented Generation: Benchmarking Open-Source and Proprietary Large Language Models.

Authors

Nguyen Quang, Nguyen Duy-Anh, Dang Khang, Liu Siyin, Wang Sophia Y, Woof William A, Thomas Peter B M, Patel Praveen J, Balaskas Konstantinos, Thygesen Johan H, Wu Honghan, Pontikos Nikolas

Affiliations

UCL Institute of Ophthalmology, London, UK.

UCL Institute of Health Informatics, London, UK.

Publication information

Transl Vis Sci Technol. 2025 Sep 2;14(9):18. doi: 10.1167/tvst.14.9.18.

DOI:10.1167/tvst.14.9.18
PMID:40938068
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12439504/
Abstract

PURPOSE

This study evaluated Retrieval-Augmented Generation (RAG), which combines information retrieval with text generation, by benchmarking the performance of open-source and proprietary generative large language models (LLMs) on question answering in ophthalmology.

METHODS

Our dataset comprised 260 multiple-choice questions drawn from two question banks designed to assess ophthalmic knowledge: the American Academy of Ophthalmology's (AAO) Basic and Clinical Science Course (BCSC) Self-Assessment program and OphthoQuestions. Our RAG pipeline retrieved documents from the BCSC companion textbook using ChromaDB, then reranked them with Cohere to refine the context provided to the LLMs. Generative Pretrained Transformer (GPT)-4-turbo and three open-source models (Llama-3-70B, Gemma-2-27B, and Mixtral-8x7B) were benchmarked under three settings: zero-shot, zero-shot with Chain-of-Thought (zero-shot-CoT), and RAG. Model performance was evaluated as accuracy on the two datasets. Quantization was applied to improve the efficiency of the open-source models, and the effect of quantization level was also measured.
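The retrieve-then-rerank flow and the three prompting settings described above can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the bag-of-words `retrieve` function stands in for a ChromaDB vector query, the term-coverage `rerank` function stands in for the Cohere reranker, and the passages, prompt wording, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, passages, k=3):
    """Stage 1 (toy stand-in for a vector-store query): top-k by cosine."""
    q = Counter(tokenize(query))
    ranked = sorted(
        passages, key=lambda p: cosine(q, Counter(tokenize(p))), reverse=True
    )
    return ranked[:k]


def rerank(query, candidates, top_n=1):
    """Stage 2 (toy stand-in for a reranker): fraction of query terms covered."""
    q_terms = set(tokenize(query))

    def coverage(passage):
        if not q_terms:
            return 0.0
        return len(q_terms & set(tokenize(passage))) / len(q_terms)

    return sorted(candidates, key=coverage, reverse=True)[:top_n]


def build_prompt(question, options, mode, context=None):
    """Build a prompt for 'zero-shot', 'zero-shot-cot', or 'rag' (wording assumed)."""
    lines = []
    if mode == "rag":
        lines.append("Context:")
        lines.extend(context or [])
    lines.append(f"Question: {question}")
    lines.extend(f"{letter}. {text}" for letter, text in options.items())
    if mode == "zero-shot-cot":
        lines.append("Let's think step by step, then give the answer letter.")
    else:
        lines.append("Answer with a single letter.")
    return "\n".join(lines)


def accuracy(predicted, gold):
    """Fraction of questions whose predicted letter matches the answer key."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Hypothetical passages and question, written for this example only.
PASSAGES = [
    "Timolol is a beta blocker that lowers intraocular pressure by reducing aqueous humor production.",
    "Latanoprost is a prostaglandin analog that increases uveoscleral outflow.",
    "The cornea provides most of the refractive power of the eye.",
]
QUESTION = "Which drug lowers intraocular pressure by reducing aqueous humor production?"
OPTIONS = {"A": "Timolol", "B": "Latanoprost", "C": "Atropine", "D": "Fluorescein"}

context = rerank(QUESTION, retrieve(QUESTION, PASSAGES, k=2), top_n=1)
print(build_prompt(QUESTION, OPTIONS, "rag", context))
```

In a real pipeline the two toy scoring functions would be replaced by calls to the vector store and the reranking service, and the prompt would be sent to the LLM under test; only the overall shape (retrieve broadly, rerank narrowly, prepend the surviving context) is meant to carry over.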

RESULTS

Using RAG, GPT-4-turbo's accuracy increased by 11.54% on BCSC and by 10.96% on OphthoQuestions. Importantly, the RAG pipeline improved the overall performance of Llama-3 by 23.85%, Gemma-2 by 17.11%, and Mixtral-8x7B by 22.11%. Zero-shot-CoT did not significantly improve the models' performance overall. 4-bit quantization was shown to be as effective as 8-bit quantization while requiring half the resources.
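One plausible reading of "half the resources" is weight storage: halving the bits per weight halves the memory needed to hold the model. The arithmetic can be sketched as follows, assuming the nominal parameter counts implied by the model names and counting weights only. The function name is hypothetical, and the estimate ignores the KV cache, activations, and the per-block scale metadata that real quantization formats add.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Weights-only memory in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Nominal parameter counts taken from the model names (an assumption).
NOMINAL_PARAMS = {"Llama-3-70B": 70e9, "Gemma-2-27B": 27e9}

for name, n_params in NOMINAL_PARAMS.items():
    gb8 = weight_memory_gb(n_params, 8)
    gb4 = weight_memory_gb(n_params, 4)
    print(f"{name}: 8-bit ~{gb8:.1f} GB, 4-bit ~{gb4:.1f} GB")
```

Under these assumptions, a 70B-parameter model needs roughly 70 GB of weight storage at 8 bits and roughly 35 GB at 4 bits.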

CONCLUSIONS

Our work demonstrates that integrating RAG significantly enhances LLM accuracy, especially for smaller LLMs.

TRANSLATIONAL RELEVANCE

Using our RAG, smaller privacy-preserving open-source LLMs can be run in sensitive and resource-constrained environments, such as within hospitals, offering a viable alternative to cloud-based LLMs like GPT-4-turbo.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e383/12439504/2a4924e34f6b/tvst-14-9-18-f001.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e383/12439504/c97fe97c4750/tvst-14-9-18-f002.jpg
Figure 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e383/12439504/24c18194f60e/tvst-14-9-18-f003.jpg
Figure 4: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e383/12439504/f9290936a649/tvst-14-9-18-f004.jpg
