Wada Akihiko, Tanaka Yuya, Nishizawa Mitsuo, Yamamoto Akira, Akashi Toshiaki, Hagiwara Akifumi, Hayakawa Yayoi, Kikuta Junko, Shimoji Keigo, Sano Katsuhiro, Kamagata Koji, Nakanishi Atsushi, Aoki Shigeki
Department of Radiology, Juntendo University Graduate School of Medicine, 2-1-1 Hongo, Bunkyo-ku, Tokyo, 113-8421, Japan.
Department of Radiology, The University of Tokyo School of Medicine, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-8655, Japan.
NPJ Digit Med. 2025 Jul 2;8(1):395. doi: 10.1038/s41746-025-01802-z.
Large language models (LLMs) demonstrate significant potential in healthcare applications, but clinical deployment is limited by privacy concerns and insufficient medical domain training. This study investigated whether retrieval-augmented generation (RAG) can improve a locally deployable LLM for radiology contrast media consultation. In 100 synthetic iodinated contrast media consultations, we compared Llama 3.2-11B (baseline and RAG-enhanced) with three cloud-based models: GPT-4o mini, Gemini 2.0 Flash, and Claude 3.5 Haiku. A blinded radiologist ranked the five replies per case, and three LLM-based judges scored accuracy, safety, structure, tone, applicability, and latency. Under controlled conditions, RAG eliminated hallucinations (0% vs 8%; Yates-corrected χ² = 6.38, p = 0.012) and improved mean rank by 1.3 (Z = -4.82, p < 0.001), though performance gaps with the cloud models persisted. The RAG-enhanced model remained faster than the cloud models (2.6 s vs 4.9-7.3 s), and the LLM-based judges preferred it over GPT-4o mini, whereas the radiologist ranked GPT-4o mini higher. RAG thus provides meaningful improvements for local clinical LLMs while maintaining the privacy benefits of on-premise deployment.
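The reported hallucination statistic can be checked with a short calculation. The sketch below is not the authors' analysis code. It assumes the percentages correspond to 0 of 100 RAG replies and 8 of 100 baseline replies containing a hallucination (an inference from "100 synthetic consultations" and "0% vs 8%"), and it applies the textbook Yates-corrected chi-square formula for a 2×2 table. The function name `yates_chi2` is illustrative.

```python
import math


def yates_chi2(a, b, c, d):
    """Yates-corrected chi-square test for the 2x2 table [[a, b], [c, d]].

    Returns (chi2, p) with one degree of freedom.
    """
    n = a + b + c + d
    # Continuity correction: shrink |ad - bc| by n/2, never below zero.
    corrected = max(abs(a * d - b * c) - n / 2, 0.0)
    chi2 = n * corrected ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, the chi-square survival function
    # reduces to erfc(sqrt(chi2 / 2)).
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p


# Assumed counts: rows are RAG and baseline,
# columns are hallucinated and not hallucinated.
chi2, p = yates_chi2(0, 100, 8, 92)
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")
```

Under these assumed counts the calculation agrees with the reported χ² = 6.38 and p = 0.012, which suggests the comparison treated the two Llama configurations as independent groups of 100 replies each.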