Application of NotebookLM, a large language model with retrieval-augmented generation, for lung cancer staging.
Author information
Tozuka Ryota, Johno Hisashi, Amakawa Akitomo, Sato Junichi, Muto Mizuki, Seki Shoichiro, Komaba Atsushi, Onishi Hiroshi
Affiliations
Department of Radiology, University of Yamanashi, 1110 Shimokato, Chuo, Yamanashi, 409-3898, Japan.
Department of Radiation Oncology, Tohoku University Graduate School of Medicine, 1-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi, 980-8574, Japan.
Publication information
Jpn J Radiol. 2025 Apr;43(4):706-712. doi: 10.1007/s11604-024-01705-1. Epub 2024 Nov 25.
PURPOSE
In radiology, large language models (LLMs), including ChatGPT, have recently gained attention, and their utility is being rapidly evaluated. However, concerns have emerged regarding their reliability in clinical applications due to limitations such as hallucinations and insufficient referencing. To address these issues, we focus on the latest technology, retrieval-augmented generation (RAG), which enables LLMs to reference reliable external knowledge (REK). Specifically, this study examines the utility and reliability of a recently released RAG-equipped LLM (RAG-LLM), NotebookLM, for staging lung cancer.
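As a rough illustration of the retrieval step that RAG adds in front of an LLM, the Python sketch below ranks passages of an external knowledge source against a question and builds a prompt that carries passage numbers the model can cite. It is not NotebookLM's implementation, which the abstract does not describe. The function names, the token-overlap scoring, and the placeholder passages are all assumptions made for illustration, and the passages are not actual staging criteria.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(query, passages, k=1):
    """Return up to k (index, passage) pairs ranked by token overlap with the query.

    Token overlap is a toy stand-in for a real retriever (assumption).
    Passages with no overlap are dropped.
    """
    query_counts = Counter(tokenize(query))
    scored = []
    for idx, passage in enumerate(passages):
        passage_counts = Counter(tokenize(passage))
        overlap = sum((query_counts & passage_counts).values())
        scored.append((overlap, idx))
    # Highest overlap first; ties broken by original passage order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(idx, passages[idx]) for overlap, idx in scored[:k] if overlap > 0]


def build_prompt(query, passages, k=2):
    """Assemble a prompt whose context lines carry citable passage numbers."""
    hits = retrieve(query, passages, k)
    context = "\n".join(f"[{idx}] {text}" for idx, text in hits)
    return (
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Cite the bracketed passage numbers you used."
    )


# Placeholder knowledge source (hypothetical text, not guideline content).
PASSAGES = [
    "Placeholder passage about primary tumor size categories.",
    "Placeholder passage about regional lymph node involvement.",
    "Placeholder passage about distant metastasis.",
]
QUERY = "Which section covers lymph node involvement?"
```

Because each retrieved passage keeps its index, an answer that cites `[1]` can be checked against the source text directly, which is the kind of reference-location check the abstract describes.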
MATERIALS AND METHODS
We summarized the current lung cancer staging guideline in Japan and provided this as REK to NotebookLM. We then tasked NotebookLM with staging 100 fictional lung cancer cases based on CT findings and evaluated its accuracy. For comparison, we performed the same task using a gold-standard LLM, GPT-4 Omni (GPT-4o), both with and without the REK. For GPT-4o, the REK was provided directly within the prompt rather than through RAG.
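The accuracy evaluation described above can be sketched as a comparison of each model answer against a reference label. The abstract does not say how answers were scored, so exact matching after whitespace and case normalization is an assumption here, as are the function names and the made-up labels in the usage note.

```python
from typing import Sequence


def normalize(label: str) -> str:
    """Remove whitespace and uppercase so 't1a n0 m0' matches 'T1aN0M0'."""
    return "".join(label.split()).upper()


def staging_accuracy(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Fraction of cases whose predicted label exactly matches the reference."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("at least one case is required")
    correct = sum(normalize(p) == normalize(r) for p, r in zip(predicted, reference))
    return correct / len(reference)
```

For instance, `staging_accuracy(["T1aN0M0", "T2aN1M0"], ["t1a n0 m0", "T2bN1M0"])` scores one of two hypothetical cases as correct. With 100 cases, as in the study, the returned fraction maps directly to a percentage.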
RESULTS
NotebookLM achieved 86% diagnostic accuracy in the lung cancer staging experiment, outperforming GPT-4o, which recorded 39% accuracy with the REK and 25% without it. Moreover, NotebookLM demonstrated 95% accuracy in searching reference locations within the REK.
CONCLUSION
NotebookLM, a RAG-LLM, successfully performed lung cancer staging by utilizing the REK, demonstrating superior performance compared to GPT-4o (without RAG). Additionally, it provided highly accurate reference locations within the REK, allowing radiologists to efficiently evaluate the reliability of NotebookLM's responses and detect possible hallucinations. Overall, this study highlights the potential of NotebookLM, a RAG-LLM, in image diagnosis.