

Toward Foundation Models in Radiology? Quantitative Assessment of GPT-4V's Multimodal and Multianatomic Region Capabilities.

Affiliations

From the Institute of Radiology (Q.D.S., L.S.K., G.N., A.K.M., S.M., I.E., J.R., C.W., C.S., O.W.H., A.S.) and Department of Cranio- and Maxillofacial Surgery (F.N.), University of Regensburg Medical Center, Franz-Josef-Strauss-Allee 11, 93053 Regensburg, Germany; Department of Radiology, Division of Neuroradiology, Massachusetts General Hospital, Harvard Medical School, Boston, Mass (Q.D.S.); Department of Radiology, Bayreuth Medical Center, Bayreuth, Germany (M.S.); Center of Neuroradiology, medbo District Hospital and University Medical Center Regensburg, Regensburg, Germany (I.W., C.W.); and Department of Radiology, Donaustauf Hospital, Donaustauf, Germany (O.W.H.).

Publication Information

Radiology. 2024 Nov;313(2):e240955. doi: 10.1148/radiol.240955.

Abstract

Background: Large language models have already demonstrated potential in medical text processing. GPT-4V, a large vision-language model from OpenAI, has shown potential for medical imaging, yet a quantitative analysis is lacking.

Purpose: To quantitatively assess the performance of GPT-4V in interpreting radiologic images using unseen data.

Materials and Methods: This retrospective study included single representative abnormal and healthy control images from neuroradiology, cardiothoracic radiology, and musculoskeletal radiology (CT, MRI, radiography) to generate reports using GPT-4V via the application programming interface from February to March 2024. The factual correctness of free-text reports and the performance in detecting abnormalities in binary classification tasks were assessed using accuracy, sensitivity, and specificity. The binary classification performance was compared with that of a first-year nonradiologist in training and four board-certified radiologists.

Results: A total of 515 images in 470 patients (median age, 61 years [IQR, 44-71 years]; 267 male) were included, of which 345 images were abnormal. GPT-4V correctly identified the imaging modality and anatomic region in 100% (515 of 515) and 99.2% (511 of 515) of images, respectively. Diagnostic accuracy in free-text reports was between 0% (0 of 33 images) for pneumothorax (CT and radiography) and 90% (45 of 50 images) for brain tumor (MRI). In binary classification tasks, GPT-4V showed sensitivities between 56% (14 of 25 images) for ischemic stroke and 100% (25 of 25 images) for brain hemorrhage and specificities between 8% (two of 25 images) for brain hemorrhage and 52% (13 of 25 images) for pneumothorax, compared with a pooled sensitivity of 97.2% (1103 of 1135 images) and pooled specificity of 97.2% (1084 of 1115 images) for the human readers across all tasks. The model exhibited a clear tendency to overdiagnose abnormalities, with 86.5% (147 of 170 images) and 67.7% (151 of 223 images) false-positive rates for the free-text and binary classification tasks, respectively.

Conclusion: GPT-4V, in its earliest version, recognized medical image content and reliably determined the modality and anatomic region from single images. However, GPT-4V failed to detect, classify, or rule out abnormalities in image interpretation.

© RSNA, 2024
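The Materials and Methods state that reports were generated by submitting single images to GPT-4V through the application programming interface. As a rough illustration only, the sketch below shows how such a request could look with the OpenAI Python client; the model identifier, prompt wording, and file name are assumptions and do not reflect the study's actual protocol.

```python
# Minimal sketch (assumptions: model name, prompt text, and file path are illustrative).
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def request_report(image_path: str) -> str:
    # Encode the image as base64 so it can be embedded in the chat request.
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # assumed model identifier
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": "Describe this radiologic image and report any abnormality."},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                ],
            }
        ],
        max_tokens=500,
    )
    return response.choices[0].message.content


# Hypothetical usage:
# print(request_report("brain_mri_case_001.png"))
```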
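The sensitivity and specificity values quoted in the Results are simple proportions of correctly classified abnormal and control images. A minimal sketch of that arithmetic, reproducing the pooled human-reader figures from the Results, follows; the helper functions are illustrative, not from the paper.

```python
def sensitivity(tp: int, fn: int) -> float:
    # Proportion of abnormal images correctly flagged as abnormal.
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    # Proportion of healthy control images correctly classified as normal.
    return tn / (tn + fp)


# Pooled human-reader performance across all tasks (counts from the Results):
print(f"pooled sensitivity = {sensitivity(tp=1103, fn=1135 - 1103):.1%}")  # 97.2%
print(f"pooled specificity = {specificity(tn=1084, fp=1115 - 1084):.1%}")  # 97.2%
```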

