
RadioRAG: Online Retrieval-augmented Generation for Radiology Question Answering.

Authors

Tayebi Arasteh Soroosh, Lotfinia Mahshad, Bressem Keno, Siepmann Robert, Adams Lisa, Ferber Dyke, Kuhl Christiane, Kather Jakob Nikolas, Nebelung Sven, Truhn Daniel

Affiliations

Department of Diagnostic and Interventional Radiology, University Hospital RWTH Aachen, Pauwelsstr 30, 52074 Aachen, Germany.

Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany.

Publication

Radiol Artif Intell. 2025 Jun 18:e240476. doi: 10.1148/ryai.240476.

Abstract

Purpose
To evaluate the diagnostic accuracy of various large language models (LLMs) when answering radiology-specific questions with and without access to additional online, up-to-date information via retrieval-augmented generation (RAG).

Materials and Methods
The authors developed Radiology RAG (RadioRAG), an end-to-end framework that retrieves data from authoritative radiologic online sources in real time. RAG incorporates information retrieval from external sources to supplement the initial prompt, grounding the model's response in relevant information. Using 80 questions from the RSNA Case Collection across radiologic subspecialties and 24 additional expert-curated questions with reference-standard answers, LLMs (GPT-3.5-turbo, GPT-4, Mistral-7B, Mixtral-8×7B, and Llama3 [8B and 70B]) were prompted with and without RadioRAG in a zero-shot inference scenario (temperature ≤ 0.1, top-p = 1). RadioRAG retrieved context-specific information from www.radiopaedia.org. The accuracy of LLMs with and without RadioRAG in answering questions from each dataset was assessed. Statistical analyses were performed using bootstrapping while preserving pairing. Additional assessments included comparison of model with human performance and comparison of the time required for conventional versus RadioRAG-powered question answering.

Results
RadioRAG improved accuracy for some LLMs, including GPT-3.5-turbo [74% (59/80) versus 66% (53/80), FDR = 0.03] and Mixtral-8×7B [76% (61/80) versus 65% (52/80), FDR = 0.02] on the RSNA-RadioQA dataset, with similar trends in the ExtendedQA dataset. Accuracy exceeded (FDR ≤ 0.007) that of a human expert [63% (50/80)] for these LLMs, but not for Mistral-7B-instruct-v0.2, Llama3-8B, or Llama3-70B (FDR ≥ 0.21). RadioRAG reduced hallucinations for all LLMs (rates of 6%-25%). RadioRAG increased estimated response time fourfold.

Conclusion
RadioRAG shows potential to improve LLM accuracy and factuality in radiology question answering by integrating real-time, domain-specific data. ©RSNA, 2025.
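The retrieval-augmented prompting pattern described above can be sketched minimally: fetch passages relevant to the question, then prepend them to the prompt so the model's answer is grounded in retrieved text. The toy corpus, keyword-overlap scoring, and prompt template below are illustrative assumptions, not the authors' implementation (which retrieves from www.radiopaedia.org in real time):

```python
# Sketch of RAG-style prompt augmentation. Corpus, scoring, and
# template are toy assumptions for illustration only.
import re


def tokenize(text: str) -> set[str]:
    """Lowercased word set, ignoring punctuation."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(question: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank passages by naive keyword overlap with the question."""
    q = tokenize(question)
    return sorted(corpus, key=lambda p: -len(q & tokenize(p)))[:k]


def build_rag_prompt(question: str, passages: list[str]) -> str:
    """Compose the augmented prompt that would be sent to the LLM."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer the radiology question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\nAnswer:"
    )


corpus = [
    "Pneumothorax shows absent lung markings beyond a visceral pleural line.",
    "Meniscal tears are best evaluated with MRI of the knee.",
]
question = "Which imaging sign suggests pneumothorax?"
prompt = build_rag_prompt(question, retrieve(question, corpus, k=1))
```

The LLM call itself is omitted; per the abstract, answers were generated zero-shot at temperature ≤ 0.1, which keeps the grounded responses near-deterministic.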
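The "bootstrapping while preserving pairing" mentioned in Materials and Methods can be illustrated as follows: resample question indices so that each replicate keeps a model's with-RAG and without-RAG results for the same question together. The outcome vectors (loosely mirroring the reported 74% versus 66% GPT-3.5-turbo comparison), the replicate count, and the one-sided p-value construction are assumptions, not the authors' exact procedure:

```python
# Paired bootstrap sketch: resampling indices preserves the pairing of
# per-question outcomes. Data and test construction are assumptions.
import random


def paired_bootstrap_p(a: list[int], b: list[int],
                       n_boot: int = 2000, seed: int = 0) -> float:
    """One-sided bootstrap p-value that accuracy(a) > accuracy(b)."""
    rng = random.Random(seed)
    n = len(a)
    count_le_zero = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample questions
        diff = sum(a[i] - b[i] for i in idx) / n    # paired accuracy gap
        if diff <= 0:
            count_le_zero += 1
    return count_le_zero / n_boot


# Toy paired results for 80 questions (1 = correct answer).
with_rag = [1] * 59 + [0] * 21     # 74% accuracy
without_rag = [1] * 53 + [0] * 27  # 66% accuracy
p = paired_bootstrap_p(with_rag, without_rag)
```

Because each resample draws whole question indices, the correlation between a model's paired outcomes is retained, which is what makes this a paired rather than independent comparison.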
