
Comparing Large Language Model and Human Reader Accuracy with Image Challenge Case Image Inputs.

Author Information

Suh Pae Sun, Shim Woo Hyun, Suh Chong Hyun, Heo Hwon, Park Kye Jin, Kim Pyeong Hwa, Choi Se Jin, Ahn Yura, Park Sohee, Park Ho Young, Oh Na Eun, Han Min Woo, Cho Sung Tan, Woo Chang-Yun, Park Hyungjun

Affiliations

From the Department of Radiology, Research Institute of Radiological Science and Center for Clinical Imaging Data Science, Yonsei University College of Medicine, Seoul, Republic of Korea (P.S.S.); Department of Radiology and Research Institute of Radiology (W.H.S., C.H.S., K.J.P., P.H.K., S.J.C., Y.A., S.P., H.Y.P., N.E.O.), Department of Medical Science, Asan Medical Institute of Convergence Science and Technology (W.H.S., H.H.), and Department of Internal Medicine (C.Y.W.), Asan Medical Center, University of Ulsan College of Medicine, Olympic-ro 33, Songpa-gu, 05505 Seoul, Republic of Korea; University of Ulsan College of Medicine, Seoul, Republic of Korea (M.W.H.); Department of Orthopaedic Surgery, Seoul Seonam Hospital, Republic of Korea (S.T.C.); and Department of Pulmonary and Critical Care Medicine, Gumdan Top Hospital, Incheon, Republic of Korea (H.P.).

Publication Information

Radiology. 2024 Dec;313(3):e241668. doi: 10.1148/radiol.241668.

Abstract

Background: Application of multimodal large language models (LLMs) with both textual and visual capabilities has been steadily increasing, but their ability to interpret radiologic images remains in question.

Purpose: To evaluate the accuracy of LLMs, compare it with that of human readers with varying levels of experience, and assess the factors affecting LLM accuracy in answering Image Challenge cases.

Materials and Methods: Radiologic images of cases from October 13, 2005, to April 18, 2024, were retrospectively reviewed. Using text and image inputs, LLMs (OpenAI's GPT-4 Turbo with Vision [GPT-4V] and GPT-4 Omni [GPT-4o], Google DeepMind's Gemini 1.5 Pro, and Anthropic's Claude 3) provided answers. Human readers (seven junior faculty radiologists, two clinicians, one in-training radiologist, and one medical student), blinded to the published answers, also answered. In subgroup analyses, LLM accuracy was evaluated with versus without image inputs and with short (cases from 2005 to 2015) versus long (cases from 2016 to 2024) text inputs to determine the effect of these factors. Factors affecting LLM accuracy were assessed using multivariable logistic regression. Accuracy was compared with generalized estimating equations, with multiple comparisons adjusted by using Bonferroni correction.

Results: A total of 272 cases were included. GPT-4o achieved the highest overall accuracy among LLMs (59.6%; 162 of 272), outperforming a medical student (47.1%; 128 of 272; P < .001) but not junior faculty (80.9%; 220 of 272; P < .001) or the in-training radiologist (70.2%; 191 of 272; P = .003). GPT-4o exhibited similar accuracy regardless of image inputs (without images vs with images, 54.0% [147 of 272] vs 59.6% [162 of 272], respectively; P = .59). Human reader accuracy was unaffected by text length, whereas LLMs demonstrated higher accuracy with long text inputs (all P < .001). Text input length affected LLM accuracy (odds ratio range, 3.2 [95% CI: 1.9, 5.5] to 6.6 [95% CI: 3.7, 12.0]).

Conclusion: LLMs demonstrated substantial accuracy with text and image inputs, outperforming a medical student. However, their accuracy decreased with shorter text lengths, regardless of image input. © RSNA, 2024
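The study's formal comparisons used generalized estimating equations and multivariable logistic regression, which require case-level data. As a rough, illustrative check on the reported counts only (not the authors' analysis), an unadjusted odds ratio with a Wald 95% CI can be computed from a 2×2 table; the sketch below applies it to GPT-4o's accuracy with images (162 of 272) versus without images (147 of 272):

```python
import math

def odds_ratio_wald_ci(a, b, c, d, z=1.96):
    """Unadjusted odds ratio and Wald 95% CI from a 2x2 table.

    a, b: correct / incorrect answers under condition 1
    c, d: correct / incorrect answers under condition 2
    """
    or_ = (a * d) / (b * c)
    # Standard error of log(OR): sqrt of summed reciprocal cell counts
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi

# Illustrative: GPT-4o with images (162/272) vs without images (147/272)
or_, lo, hi = odds_ratio_wald_ci(162, 272 - 162, 147, 272 - 147)
print(f"OR = {or_:.2f} (95% CI: {lo:.2f}, {hi:.2f})")
```

Consistent with the reported P = .59, this unadjusted interval spans 1, whereas the 3.2 to 6.6 odds ratios for text length exclude it; the paired, correlated structure of the data is why the authors used GEE rather than a calculation like this.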

