
Evaluating Retrieval Augmented Generation-enhanced Large Language Models for Question Answering On German Neurovascular Guidelines.

Author Information

Vach Marius, Gliem Michael, Weiss Daniel, Ivan Vivien Lorena, Hauke Frederik, Boschenriedter Christian, Rubbert Christian, Caspers Julian

Affiliations

Department of Diagnostic and Interventional Radiology, Medical Faculty and University Hospital Düsseldorf, Heinrich-Heine-University Düsseldorf, Moorenstraße 5, 40225, Düsseldorf, Germany.

Department of Neurology, Medical Faculty and University Hospital Düsseldorf, Heinrich-Heine-University Düsseldorf, Düsseldorf, Germany.

Publication Information

Clin Neuroradiol. 2025 Sep 2. doi: 10.1007/s00062-025-01562-z.

Abstract

PURPOSE

To investigate the feasibility of Retrieval-augmented Generation (RAG)-enhanced Large Language Models (LLMs) in answering questions about two German neurovascular guidelines.

METHODS

Four LLMs (GPT-4o-mini, Llama 3.1 405B Instruct Turbo, Mixtral 8×22B Instruct, and Claude 3.5 Sonnet) with RAG as well as GPT-4o-mini without RAG were evaluated for generating answers about two German neurovascular guidelines ("S3 Guideline for Diagnosis, Treatment, and Follow-up of Extracranial Carotid Stenosis" and "S2e Guideline for Acute Therapy of Ischemic Stroke"). The answers were classified as "correct", "inaccurate", or "incorrect" by two neurovascular experts in consensus. Additionally, retrieval performance of five retrieval strategies was analyzed on a synthetic dataset of 384 questions.
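The RAG pattern evaluated here can be sketched in a few lines: split the guideline text into chunks, retrieve the chunks most relevant to a question, and prepend them to the LLM prompt. This is a minimal illustration, not the study's pipeline; all function names are hypothetical, and the token-overlap scorer is a toy stand-in for the retrievers actually compared in the study (BM25 and embedding models).

```python
def chunk_text(text, size=50):
    """Split a guideline document into fixed-size word chunks."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve(question, chunks, k=3):
    """Rank chunks by shared tokens with the question (toy retriever)."""
    q = set(question.lower().split())
    ranked = sorted(chunks,
                    key=lambda c: len(q & set(c.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_prompt(question, context_chunks):
    """Stuff the retrieved guideline excerpts into the LLM prompt."""
    context = "\n---\n".join(context_chunks)
    return (f"Answer using only the guideline excerpts below.\n"
            f"{context}\n\nQuestion: {question}")
```

In a real deployment the retriever would be swapped for BM25 or a dense embedding model, and the prompt would be sent to one of the evaluated LLMs.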

RESULTS

Claude 3.5 Sonnet achieved the highest answer correctness (70.6% correct, 10.6% incorrect), followed by Llama 3.1 (64.7% correct, 15.3% incorrect), GPT-4o-mini with RAG (57.6% correct, 15.3% incorrect), and Mixtral (56.6% correct, 17.6% incorrect). GPT-4o-mini without RAG performed significantly worse (20.0% correct, 32.9% incorrect). Retrieval errors were the primary cause of incorrect answers (80%). For retrieval, BM25 achieved the highest accuracy (82.0%), outperforming vector-based methods such as "BAAI/bge-m3" (78.4%).
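BM25, the best-performing retriever above, is a classic lexical ranking function rather than a learned embedding model. A self-contained sketch of the standard Okapi BM25 scoring formula, written from the textbook definition (the study's actual implementation and parameters are not given here; k1=1.5 and b=0.75 are common defaults):

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    N = len(corpus_tokens)
    avgdl = sum(len(doc) for doc in corpus_tokens) / N
    # document frequency: in how many documents each term occurs
    df = Counter()
    for doc in corpus_tokens:
        for term in set(doc):
            df[term] += 1
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

For example, scoring the query "ischemic stroke therapy" against chunks from the two guidelines would rank the stroke-guideline chunk first, since BM25 rewards exact term overlap; dense retrievers like "BAAI/bge-m3" instead compare embedding vectors and can match paraphrases at the cost of exact-term precision.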

CONCLUSION

RAG significantly improves LLM accuracy for medical guideline question answering compared to the inherent knowledge of pretrained LLMs alone, although error rates remain substantial. Improved accuracy and confidence metrics are needed for safer implementation in clinical routine. Additionally, our results demonstrate the strong performance of general-purpose LLMs in medical question answering for non-English languages such as German, even without language-specific training.

