From the Department of Radiology and Research Institute of Radiology, University of Ulsan College of Medicine, Asan Medical Center, Olympic-ro 33, Seoul 05505, Republic of Korea (P.S.S., W.H.S., C.H.S., H.J.E., K.J.P., J.C., P.H.K., H.J.P., Y.A., H.Y.P.); Department of Radiology and Research Institute of Radiological Science and Center for Clinical Imaging Data Science, Yonsei University College of Medicine, Seoul, Republic of Korea (P.S.S.); Department of Medical Science, University of Ulsan College of Medicine, Asan Medical Institute of Convergence Science and Technology, Seoul, Republic of Korea (W.H.S., H.H., C.R.P.); Medical Research Institute, Gangneung Asan Hospital, University of Ulsan College of Medicine, Gangneung, Republic of Korea (Y.C.); Department of Internal Medicine, University of Ulsan College of Medicine, Asan Medical Center, Seoul, Republic of Korea (C.Y.W.); and Department of Pulmonary and Critical Care Medicine, Gumdan Top Hospital, Incheon, Republic of Korea (H.P.).
Radiology. 2024 Jul;312(1):e240273. doi: 10.1148/radiol.240273.
Background The diagnostic abilities of multimodal large language models (LLMs) using direct image inputs and the impact of the temperature parameter of LLMs remain unexplored.
Purpose To investigate the ability of GPT-4V and Gemini Pro Vision in generating differential diagnoses at different temperatures compared with radiologists using Diagnosis Please cases.
Materials and Methods This retrospective study included Diagnosis Please cases published from January 2008 to October 2023. Input images included original images and captures of the textual patient history and figure legends (without imaging findings) from PDF files of each case. The LLMs were tasked with providing three differential diagnoses, repeated five times at temperatures 0, 0.5, and 1. Eight subspecialty-trained radiologists solved cases. An experienced radiologist compared generated and final diagnoses, considering the result correct if the generated diagnoses included the final diagnosis after five repetitions. Accuracy was assessed across models, temperatures, and radiology subspecialties, with statistical significance set at P < .007 after Bonferroni correction for multiple comparisons across the LLMs at the three temperatures and with radiologists.
Results A total of 190 cases were included in neuroradiology (n = 53), multisystem (n = 27), gastrointestinal (n = 25), genitourinary (n = 23), musculoskeletal (n = 17), chest (n = 16), cardiovascular (n = 12), pediatric (n = 12), and breast (n = 5) subspecialties. Overall accuracy improved with increasing temperature settings (0, 0.5, 1) for both GPT-4V (41% [78 of 190 cases], 45% [86 of 190 cases], 49% [93 of 190 cases], respectively) and Gemini Pro Vision (29% [55 of 190 cases], 36% [69 of 190 cases], 39% [74 of 190 cases], respectively), although there was no evidence of a statistically significant difference after Bonferroni adjustment (GPT-4V, P = .12; Gemini Pro Vision, P = .04). The overall accuracy of radiologists (61% [115 of 190 cases]) was higher than that of Gemini Pro Vision at temperature 1 (T1) (P < .001), while no statistically significant difference was observed between radiologists and GPT-4V at T1 after Bonferroni adjustment (P = .02). Radiologists (range, 45%-88%) outperformed the LLMs at T1 (range, 24%-75%) in most subspecialties.
Conclusion Using direct radiologic image inputs, GPT-4V and Gemini Pro Vision showed improved diagnostic accuracy with increasing temperature settings. Although GPT-4V slightly underperformed compared with radiologists, it nonetheless demonstrated promising potential as a supportive tool in diagnostic decision-making.
© RSNA, 2024 See also the editorial by Nishino and Ballard in this issue.
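The scoring rule and the reported percentages can be checked with a small sketch. This is an illustration only, not the authors' analysis code. The function names are hypothetical, and the exact string match stands in for the radiologist's judgment that the study actually used to compare generated and final diagnoses. The number of comparisons (7) is an assumption, chosen because .05 / 7 rounds to the reported threshold of .007; the abstract does not state the count.

```python
from typing import Iterable, Sequence


def case_is_correct(final_diagnosis: str, runs: Iterable[Sequence[str]]) -> bool:
    """Return True if any repetition's differential list contains the final diagnosis.

    Assumption: normalized exact string matching replaces the human
    (radiologist) comparison described in the abstract.
    """
    target = final_diagnosis.strip().lower()
    return any(target == dx.strip().lower() for run in runs for dx in run)


def accuracy_percent(n_correct: int, n_total: int) -> int:
    """Accuracy as a whole-number percentage."""
    return round(100 * n_correct / n_total)


def bonferroni_alpha(alpha: float, n_comparisons: int) -> float:
    """Per-comparison significance threshold after Bonferroni correction."""
    return alpha / n_comparisons


# Correct-case counts reported in the abstract (out of 190 cases).
N_CASES = 190
reported_correct = {
    "GPT-4V T0": 78,
    "GPT-4V T0.5": 86,
    "GPT-4V T1": 93,
    "Gemini Pro Vision T0": 55,
    "Gemini Pro Vision T0.5": 69,
    "Gemini Pro Vision T1": 74,
    "Radiologists": 115,
}

percentages = {name: accuracy_percent(k, N_CASES) for name, k in reported_correct.items()}

# Assumed number of comparisons; not stated in the abstract.
threshold = bonferroni_alpha(0.05, 7)

if __name__ == "__main__":
    for name, pct in percentages.items():
        print(f"{name}: {pct}%")
    print(f"Bonferroni threshold: {threshold:.3f}")
```

Under this rule a case counts as correct when any of the five repetitions at a given temperature lists the final diagnosis, which may partly explain why accuracy rises with temperature: more varied outputs give more chances for one repetition to hit.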