Le Guellec Bastien, Bruge Cyril, Chalhoub Najib, Chaton Victor, De Sousa Edouard, Gaillandre Yann, Hanafi Riyad, Masy Matthieu, Vannod-Michel Quentin, Hamroun Aghiles, Kuchcinski Grégory
Department of Neuroradiology, CHU Lille, Salengro Hospital, Lille 59000, France; Université Lille, INSERM, CHU Lille, Institut Pasteur de Lille, U1167-RID-AGE - Facteurs de Risque et Déterminants Moléculaires des Maladies Liées au Vieillissement, Lille 59000, France; INSERM, U1172-LilNCog-Lille Neuroscience & Cognition, Université de Lille, Lille 59000, France.
Department of Radiology, Lens Hospital, Lens 62300, France.
Diagn Interv Imaging. 2025 May 9. doi: 10.1016/j.diii.2025.04.006.
The purpose of this study was to compare the ability of two multimodal models (GPT-4o and Gemini 1.5 Pro) with that of radiologists to generate differential diagnoses for complex neuroradiology cases from textual context alone, key images alone, or a combination of both.
This retrospective study included neuroradiology cases from the "Diagnosis Please" series published in the journal Radiology between January 2008 and September 2024. The two multimodal models were asked to provide three differential diagnoses from the textual context alone, the key images alone, or the complete case. Six board-certified neuroradiologists solved the cases in the same setting and were randomly assigned to one of two groups: context alone first or images alone first. Three radiologists solved the cases without, and then with, the assistance of Gemini 1.5 Pro. An independent radiologist evaluated the quality of the image descriptions provided by GPT-4o and Gemini 1.5 Pro for each case. Differences in correct answers between multimodal models and radiologists were analyzed using the McNemar test.
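The paired design described above (each of the 53 cases answered correctly or incorrectly by both a model and a radiologist) is what the McNemar test is built for: it tests only the discordant cases, i.e. those where the two readers disagree. A minimal sketch of the exact (binomial) variant of the test, using made-up discordant counts rather than the study's data:

```python
from math import comb

# Illustrative counts (hypothetical, NOT taken from the study):
# b = cases the model answered correctly and the radiologist missed
# c = cases the radiologist answered correctly and the model missed
b, c = 8, 12

# Exact McNemar test: under H0 (no difference between readers), each
# discordant case is equally likely to favour either reader, so the
# smaller discordant count follows Binomial(b + c, 0.5). The two-sided
# p-value doubles the lower tail, capped at 1.
n = b + c
k = min(b, c)
p_value = min(1.0, 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n)
print(round(p_value, 4))  # → 0.5034
```

With only 8 vs. 12 discordant cases the difference is far from significant, which illustrates why the study's comparisons hinge on the distribution of disagreements rather than on raw accuracy alone.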
GPT-4o and Gemini 1.5 Pro outperformed radiologists using clinical context alone (mean accuracy, 34.0 % [18/53] and 44.7 % [23.7/53] vs. 16.4 % [8.7/53]; both P < 0.01). Radiologists outperformed GPT-4o and Gemini 1.5 Pro using images alone (mean accuracy, 42.1 % [22.3/53] vs. 3.8 % [2/53] and 7.5 % [4/53]; both P < 0.01) and the complete cases (48.0 % [25.6/53] vs. 34.0 % [18/53] and 38.7 % [20.3/53]; both P < 0.001). While radiologists improved their accuracy when combining multimodal information (from 42.1 % [22.3/53] for images alone to 50.3 % [26.7/53] for complete cases; P < 0.01), GPT-4o and Gemini 1.5 Pro did not benefit from the multimodal context (from 34.0 % [18/53] for text alone to 35.2 % [18.7/53] for complete cases for GPT-4o; P = 0.48, and from 44.7 % [23.7/53] to 42.8 % [22.7/53] for Gemini 1.5 Pro; P = 0.54). Radiologists benefited significantly from the suggestions of Gemini 1.5 Pro, increasing their accuracy from 47.2 % [25/53] to 56.0 % [27/53] (P < 0.01). GPT-4o and Gemini 1.5 Pro correctly identified the imaging modality in 53/53 (100 %) and 51/53 (96.2 %) cases, respectively, but frequently failed to identify the key imaging findings (incorrect identification in 43/53 cases [81.1 %] for GPT-4o and 50/53 [94.3 %] for Gemini 1.5 Pro).
Radiologists show a specific ability to benefit from the integration of textual and visual information, whereas multimodal models mostly rely on the clinical context to suggest diagnoses.