Nguyen Quang, Nguyen Duy-Anh, Dang Khang, Liu Siyin, Wang Sophia Y, Woof William A, Thomas Peter B M, Patel Praveen J, Balaskas Konstantinos, Thygesen Johan H, Wu Honghan, Pontikos Nikolas
UCL Institute of Ophthalmology, London, UK.
UCL Institute of Health Informatics, London, UK.
Transl Vis Sci Technol. 2025 Sep 2;14(9):18. doi: 10.1167/tvst.14.9.18.
The purpose of this study was to evaluate Retrieval-Augmented Generation (RAG), which combines information retrieval with text generation, by benchmarking the performance of open-source and proprietary generative large language models (LLMs) on question-answering in ophthalmology.
Our dataset comprised 260 multiple-choice questions sourced from two question banks designed to assess ophthalmic knowledge: the American Academy of Ophthalmology's (AAO) Basic and Clinical Science Course (BCSC) Self-Assessment Program and OphthoQuestions. Our RAG pipeline retrieves documents from the BCSC companion textbook using ChromaDB and then reranks them with Cohere to refine the context provided to the LLMs. Generative Pretrained Transformer (GPT)-4-turbo and three open-source models (Llama-3-70B, Gemma-2-27B, and Mixtral-8x7B) were benchmarked using zero-shot, zero-shot with chain-of-thought (zero-shot-CoT), and RAG prompting. Model performance was evaluated using accuracy on the two datasets. Quantization was applied to improve the efficiency of the open-source models, and the effect of quantization level was also measured.
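To make the retrieve-and-rerank step concrete, the sketch below shows one minimal way such a pipeline could be wired up with the ChromaDB and Cohere Python clients. The collection name, rerank model, and top-k values are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a retrieve-and-rerank RAG step (illustrative; the
# collection name, rerank model, and top-k values are assumptions).
import chromadb
import cohere

chroma = chromadb.PersistentClient(path="./bcsc_index")
collection = chroma.get_or_create_collection(name="bcsc_textbook")  # hypothetical name
co = cohere.Client("YOUR_COHERE_API_KEY")

def retrieve_context(question: str, n_retrieve: int = 20, n_rerank: int = 5) -> str:
    # Stage 1: vector retrieval of candidate passages from the BCSC index.
    hits = collection.query(query_texts=[question], n_results=n_retrieve)
    passages = hits["documents"][0]

    # Stage 2: rerank the candidates so only the most relevant passages
    # are passed to the LLM as context.
    reranked = co.rerank(
        model="rerank-english-v3.0",  # assumed model; the paper does not name one
        query=question,
        documents=passages,
        top_n=n_rerank,
    )
    return "\n\n".join(passages[r.index] for r in reranked.results)

context = retrieve_context("Which retinal layer contains the photoreceptor nuclei?")
```

The two-stage design matters here: the vector search casts a wide net over the textbook, and the reranker filters that net down so the LLM's limited context window is spent only on the most relevant passages.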
Using RAG, GPT-4-turbo's accuracy increased by 11.54% on BCSC and by 10.96% on OphthoQuestions. Importantly, the RAG pipeline greatly improved the overall performance of Llama-3 by 23.85%, Gemma-2 by 17.11%, and Mixtral-8x7B by 22.11%. Zero-shot-CoT yielded no significant overall improvement in model performance. Four-bit quantization was shown to be as effective as 8-bit quantization while requiring half the resources.
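As a rough illustration of how 4-bit inference can be set up, the following sketch loads one of the benchmarked open models with 4-bit weights via Hugging Face transformers and bitsandbytes; the library choice and configuration are assumptions, since the abstract does not state which quantization tooling was used.

```python
# Sketch of 4-bit model loading with transformers + bitsandbytes (assumed
# tooling; the abstract does not specify the quantization implementation).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-70B-Instruct"  # one of the benchmarked models

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit weights: roughly half the memory of 8-bit
    bnb_4bit_quant_type="nf4",              # a common 4-bit quantization scheme
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for numerical stability
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # shard layers across available GPUs
)

inputs = tokenizer("A 65-year-old presents with sudden painless vision loss...",
                   return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```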
Our work demonstrates that integrating RAG significantly enhances LLM accuracy, especially for smaller LLMs.
Using our RAG pipeline, smaller, privacy-preserving open-source LLMs can be run in sensitive and resource-constrained environments, such as within hospitals, offering a viable alternative to cloud-based LLMs like GPT-4-turbo.