

Advancing Question-Answering in Ophthalmology With Retrieval-Augmented Generation: Benchmarking Open-Source and Proprietary Large Language Models.

Authors

Nguyen Quang, Nguyen Duy-Anh, Dang Khang, Liu Siyin, Wang Sophia Y, Woof William A, Thomas Peter B M, Patel Praveen J, Balaskas Konstantinos, Thygesen Johan H, Wu Honghan, Pontikos Nikolas

Affiliations

UCL Institute of Ophthalmology, London, UK.

UCL Institute of Health Informatics, London, UK.

Publication

Transl Vis Sci Technol. 2025 Sep 2;14(9):18. doi: 10.1167/tvst.14.9.18.

Abstract

PURPOSE

The purpose of this study was to evaluate the application of combining information retrieval with text generation using Retrieval-Augmented Generation (RAG) to benchmark the performance of open-source and proprietary generative large language models (LLMs) in question-answering in ophthalmology.

METHODS

Our dataset comprised 260 multiple-choice questions sourced from two question banks designed to assess ophthalmic knowledge: the American Academy of Ophthalmology's (AAO) Basic and Clinical Science Course (BCSC) Self-Assessment Program and OphthoQuestions. Our RAG pipeline retrieves documents from the BCSC companion textbook using ChromaDB, followed by reranking with Cohere to refine the context provided to the LLMs. Generative Pretrained Transformer (GPT)-4-turbo and three open-source models (Llama-3-70B, Gemma-2-27B, and Mixtral-8×7B) were benchmarked using zero-shot, zero-shot with Chain-of-Thought (zero-shot-CoT), and RAG prompting. Model performance was evaluated by accuracy on the two datasets. Quantization was applied to improve the efficiency of the open-source models, and the effect of quantization level was also measured.
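The retrieve-then-rerank pattern described above can be sketched with a stdlib-only toy. This is an illustration of the two-stage idea only, not the paper's implementation: the actual pipeline uses ChromaDB embedding search and Cohere's cross-encoder reranker, whereas the scoring functions, corpus sentences, and `k` values below are made-up stand-ins.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase word tokens, punctuation stripped."""
    return re.findall(r"[a-z]+", text.lower())


def bow_score(query, doc):
    """Stage-1 retrieval score: bag-of-words cosine similarity
    (stands in for ChromaDB's embedding similarity search)."""
    q, d = Counter(tokens(query)), Counter(tokens(doc))
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0


def rerank_score(query, doc):
    """Stage-2 score: fraction of query terms covered by the document
    (stands in for Cohere's cross-encoder reranker)."""
    q_terms, d_terms = set(tokens(query)), set(tokens(doc))
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0


def retrieve_then_rerank(query, corpus, k_retrieve=3, k_final=1):
    # Stage 1: cheap similarity search shortlists candidates from the corpus.
    candidates = sorted(corpus, key=lambda d: bow_score(query, d), reverse=True)[:k_retrieve]
    # Stage 2: a more precise reranker orders the shortlist; the top documents
    # become the context prepended to the LLM prompt.
    return sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)[:k_final]


corpus = [
    "Primary open-angle glaucoma is treated by lowering intraocular pressure.",
    "Diabetic retinopathy screening uses fundus photography.",
    "Intraocular pressure is lowered with prostaglandin analogues in glaucoma.",
]
context = retrieve_then_rerank("how is intraocular pressure lowered in glaucoma", corpus)
print(context[0])
```

The design point the paper relies on is that the first stage is fast but coarse, so a second, slower scorer over a small shortlist recovers precision before the context reaches the model.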

RESULTS

Using RAG, GPT-4-turbo's accuracy increased by 11.54% on BCSC and by 10.96% on OphthoQuestions. Importantly, the RAG pipeline greatly enhanced the overall performance of Llama-3 (by 23.85%), Gemma-2 (by 17.11%), and Mixtral-8×7B (by 22.11%). Zero-shot-CoT yielded no significant overall improvement in model performance. Four-bit quantization was shown to be as effective as 8-bit quantization while requiring half the resources.
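The memory trade-off behind the 4-bit vs. 8-bit result can be illustrated with a stdlib-only sketch of symmetric round-to-nearest quantization. This is a simplification: the paper does not specify its quantization scheme here, and the weight values below are invented for illustration — the point is only that storage per weight scales with the bit width, so 4-bit halves the footprint of 8-bit.

```python
def quantize(weights, bits):
    """Symmetric quantization of floats to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit, 127 for 8-bit
    scale = max(abs(w) for w in weights) / qmax     # map the largest weight to qmax
    return [round(w / scale) for w in weights], scale


def dequantize(q, scale):
    """Recover approximate float weights from integers and the scale."""
    return [v * scale for v in q]


weights = [0.12, -0.53, 0.97, -0.08, 0.41]

for bits in (8, 4):
    q, scale = quantize(weights, bits)
    approx = dequantize(q, scale)
    err = max(abs(a - w) for a, w in zip(approx, weights))
    # Storage drops from 16- or 32-bit floats to `bits` bits per weight,
    # so a 4-bit tensor needs exactly half the memory of an 8-bit one.
    print(f"{bits}-bit: max abs error = {err:.4f}, bits per weight = {bits}")
```

The worst-case rounding error per weight is half the scale step, which grows as the bit width shrinks; the paper's finding is that for these models the extra 4-bit error did not measurably hurt accuracy.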

CONCLUSIONS

Our work demonstrates that integrating RAG significantly enhances LLM accuracy, especially for smaller LLMs.

TRANSLATIONAL RELEVANCE

Using our RAG pipeline, smaller, privacy-preserving open-source LLMs can be run in sensitive and resource-constrained environments, such as within hospitals, offering a viable alternative to cloud-based LLMs like GPT-4-turbo.


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e383/12439504/2a4924e34f6b/tvst-14-9-18-f001.jpg
