Schramm Severin, Preis Silas, Metz Marie-Christin, Jung Kirsten, Schmitz-Koep Benita, Zimmer Claus, Wiestler Benedikt, Hedderich Dennis M, Kim Su Hwan
From the Institute of Diagnostic and Interventional Neuroradiology, Klinikum rechts der Isar, School of Medicine, Technical University of Munich, Ismaninger Strasse 22, Munich 81675, Germany.
Radiology. 2025 Jan;314(1):e240689. doi: 10.1148/radiol.240689.
Background Studies have explored the application of multimodal large language models (LLMs) in radiologic differential diagnosis. Yet, how different multimodal input combinations affect diagnostic performance is not well understood.

Purpose To evaluate the impact of varying multimodal input elements on the accuracy of OpenAI's GPT-4 with vision (GPT-4V)-based brain MRI differential diagnosis.

Materials and Methods Sixty brain MRI cases with a challenging yet verified diagnosis were selected. Seven prompt groups with variations of four input elements (image without modifiers [I], annotation [A], medical history [H], and image description [D]) were defined. For each MRI case and prompt group, three identical queries were performed using an LLM-based search engine (Perplexity AI, powered by GPT-4V). The accuracy of LLM-generated differential diagnoses was rated using a binary and a numeric scoring system and analyzed using a χ² test and a Kruskal-Wallis test. Results were corrected for false-discovery rate with use of the Benjamini-Hochberg procedure. Regression analyses were performed to determine the contribution of each input element to diagnostic performance.

Results The prompt group containing I, A, H, and D as input exhibited the highest diagnostic accuracy (124 of 180 responses [69%]). Significant differences were observed between prompt groups that contained D among their inputs and those that did not. Unannotated (I) (four of 180 responses [2.2%]) or annotated radiologic images alone (I and A) (two of 180 responses [1.1%]) yielded very low diagnostic accuracy. Regression analyses confirmed a large positive effect of D on diagnostic accuracy (odds ratio [OR], 68.03; P < .001), as well as a moderate positive effect of H (OR, 4.18; P < .001).
Conclusion The textual description of radiologic image findings was identified as the strongest contributor to the performance of GPT-4V in brain MRI differential diagnosis, followed by the medical history; unannotated or annotated images alone yielded very low diagnostic performance. © RSNA, 2025
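The Materials and Methods section states that results were corrected for false-discovery rate using the Benjamini-Hochberg procedure. As a minimal sketch of that standard procedure (the p-values below are illustrative only, not taken from the study):

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg FDR control: return a parallel list of booleans,
    True where the corresponding null hypothesis is rejected."""
    m = len(p_values)
    # Rank p-values ascending, remembering their original positions.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k/m) * alpha ...
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # ... then reject every hypothesis at or below that rank.
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            reject[idx] = True
    return reject

# Illustrative p-values (not from the study):
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(benjamini_hochberg(p))
```

Unlike a Bonferroni correction, which controls the family-wise error rate, this procedure controls the expected proportion of false discoveries among rejected hypotheses, which is less conservative when many comparisons (here, across prompt groups) are tested.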