Kim Su Hwan, Wihl Jonas, Schramm Severin, Berberich Cornelius, Rosenkranz Enrike, Schmitzer Lena, Serguen Kerem, Klenk Christopher, Lenhart Nicolas, Zimmer Claus, Wiestler Benedikt, Hedderich Dennis M
Department of Diagnostic and Interventional Neuroradiology, Klinikum rechts der Isar, School of Medicine and Health, Technical University Munich, Munich, Germany.
Department of Diagnostic and Interventional Radiology, Klinikum rechts der Isar, School of Medicine and Health, Technical University Munich, Munich, Germany.
Eur Radiol. 2025 Mar 7. doi: 10.1007/s00330-025-11484-6.
This study investigated the impact of collaboration between humans and large language models (LLMs) on the accuracy and efficiency of brain MRI differential diagnosis.
In this retrospective study, forty brain MRI cases with a challenging but definitive diagnosis were randomized into two groups of twenty cases each. Six radiology residents with an average of 6.3 months of experience in reading brain MRI exams evaluated one set of cases supported by conventional internet search (conventional) and the other set using an LLM-based search engine and hybrid chatbot (LLM-assisted). A cross-over design ensured that each case was examined with both workflows with equal frequency. For each case, readers were instructed to determine the three most likely differential diagnoses. LLM responses were analyzed by a panel of radiologists. Benefits and challenges in human-LLM interaction were derived from observations and participant feedback.
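The cross-over allocation described above can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how readers were assigned to case sets, so the alternating assignment, the function name `crossover_assignment`, the reader labels, and the seed are all hypothetical.

```python
import random


def crossover_assignment(case_ids, reader_ids, seed=0):
    """Split cases into two random sets and alternate the workflow per reader.

    Hypothetical scheme: readers at even positions use conventional search
    on set A and the LLM on set B; readers at odd positions do the reverse.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    cases = list(case_ids)
    rng.shuffle(cases)
    half = len(cases) // 2
    set_a, set_b = cases[:half], cases[half:]

    plan = {}
    for position, reader in enumerate(reader_ids):
        if position % 2 == 0:
            plan[reader] = {"conventional": set_a, "llm_assisted": set_b}
        else:
            plan[reader] = {"conventional": set_b, "llm_assisted": set_a}
    return plan


# Forty cases and six readers, as in the study; labels are placeholders.
plan = crossover_assignment(range(1, 41), ["R1", "R2", "R3", "R4", "R5", "R6"])
```

With an even number of readers, this scheme has every case read equally often under each workflow, and no reader sees the same case twice.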
LLM-assisted brain MRI differential diagnosis yielded higher accuracy than conventional internet search (70/114 [61.4%] vs 53/114 [46.5%] correct diagnoses; p = 0.033, chi-square test). No difference in interpretation time or level of confidence was observed. An analysis of LLM responses revealed that correct LLM suggestions translated into correct reader responses in 82.1% of cases (60/73). Inaccurate case descriptions by readers (9.2% of cases), LLM hallucinations (11.5% of cases), and insufficient contextualization of LLM responses were identified as challenges related to human-LLM interaction.
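The accuracy comparison can be checked with a chi-square test on the 2x2 table of correct and incorrect diagnoses. The sketch below uses only the Python standard library and assumes a Yates continuity correction. The abstract does not state whether a correction was applied, so that choice and the helper name `chi_square_2x2` are assumptions.

```python
import math


def chi_square_2x2(a, b, c, d, yates=True):
    """Chi-square test for the 2x2 table [[a, b], [c, d]], one degree of freedom.

    Returns (statistic, p_value). For one degree of freedom the upper tail
    of the chi-square distribution is erfc(sqrt(x / 2)).
    """
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        # Continuity correction: shrink the difference by n / 2, floored at zero.
        diff = max(0.0, diff - n / 2)
    statistic = n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Rows: LLM-assisted, conventional. Columns: correct, incorrect.
llm_correct, conv_correct, reads_per_arm = 70, 53, 114
corrected = chi_square_2x2(llm_correct, reads_per_arm - llm_correct,
                           conv_correct, reads_per_arm - conv_correct)
uncorrected = chi_square_2x2(llm_correct, reads_per_arm - llm_correct,
                             conv_correct, reads_per_arm - conv_correct,
                             yates=False)
print(f"corrected:   chi2 = {corrected[0]:.3f}, p = {corrected[1]:.3f}")
print(f"uncorrected: chi2 = {uncorrected[0]:.3f}, p = {uncorrected[1]:.3f}")
```

Under these assumptions the corrected test gives a p-value of roughly 0.03, in line with the reported figure, while the uncorrected test gives a somewhat smaller value.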
Human-LLM collaboration has the potential to improve brain MRI differential diagnosis. Yet, several challenges must be addressed to ensure effective adoption and user acceptance.
Question: While large language models (LLMs) have the potential to support radiological differential diagnosis, the role of human-LLM collaboration in this context remains underexplored.
Findings: LLM-assisted brain MRI differential diagnosis yielded superior accuracy over conventional internet search. Inaccurate case descriptions, LLM hallucinations, and insufficient contextualization were identified as potential challenges.
Clinical relevance: Our results highlight the potential of an LLM-assisted workflow to increase diagnostic accuracy but underline the necessity to study collaborative efforts between humans and LLMs over LLMs in isolation.