From the Departments of Population Health Sciences (Y.Z., Y.P.) and Radiology (H.O., P.K., J.K., K.H., G.S.), Weill Cornell Medicine, 425 E 61st St, Ste 301, New York, NY 10065; Department of Thoracic Imaging, University of Texas MD Anderson Cancer Center, Houston, Tex (C.C.W.); and Department of Radiology, Thomas Jefferson University Hospital, Philadelphia, Pa (A.F.).
Radiology. 2024 May;311(2):e233270. doi: 10.1148/radiol.233270.
Background Generating radiologic findings from chest radiographs is pivotal in medical image analysis. The emergence of OpenAI's generative pretrained transformer, GPT-4 with vision (GPT-4V), has opened new perspectives on the potential for automated image-text pair generation. However, the application of GPT-4V to real-world chest radiography is yet to be thoroughly examined.
Purpose To investigate the capability of GPT-4V to generate radiologic findings from real-world chest radiographs.
Materials and Methods In this retrospective study, 100 chest radiographs with free-text radiology reports were annotated by a cohort of radiologists (two attending physicians and three residents) to establish a reference standard. Of the 100 chest radiographs, 50 were randomly selected from the National Institutes of Health (NIH) chest radiographic data set, and 50 were randomly selected from the Medical Imaging and Data Resource Center (MIDRC). The performance of GPT-4V at detecting imaging findings from each chest radiograph was assessed in the zero-shot setting (where it operates without prior examples) and the few-shot setting (where it operates with two examples). Its outcomes were compared with the reference standard with regard to clinical conditions and their corresponding codes in the International Classification of Diseases, Tenth Revision (ICD-10), including the anatomic location (hereafter, laterality).
Results In the zero-shot setting, in the task of detecting ICD-10 codes alone, GPT-4V attained an average positive predictive value (PPV) of 12.3%, average true-positive rate (TPR) of 5.8%, and average F1 score of 7.3% on the NIH data set, and an average PPV of 25.0%, average TPR of 16.8%, and average F1 score of 18.2% on the MIDRC data set. When both the ICD-10 codes and their corresponding laterality were considered, GPT-4V produced an average PPV of 7.8%, average TPR of 3.5%, and average F1 score of 4.5% on the NIH data set, and an average PPV of 10.9%, average TPR of 4.9%, and average F1 score of 6.4% on the MIDRC data set. With few-shot learning, GPT-4V showed improved performance on both data sets: average TPRs and F1 scores were higher than in the zero-shot setting, but the average PPV did not increase substantially.
Conclusion Although GPT-4V has shown promise in understanding natural images, it had limited effectiveness in interpreting real-world chest radiographs.
© RSNA, 2024
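The PPV, TPR, and F1 values reported in the Results can be illustrated with a minimal sketch. It assumes, since the abstract does not state the exact procedure, that each radiograph's predicted and reference findings are compared as sets and that the per-radiograph scores are then averaged. The function names, the example ICD-10 codes, and the laterality labels below are hypothetical and chosen only for illustration.

```python
from typing import Iterable, List, Set, Tuple

# A finding is an (ICD-10 code, laterality) pair. Hypothetical representation.
Finding = Tuple[str, str]


def ppv_tpr_f1(predicted: Set, reference: Set) -> Tuple[float, float, float]:
    """Return (PPV, TPR, F1) for one radiograph, comparing findings as sets."""
    true_positives = len(predicted & reference)
    # PPV: fraction of predicted findings that appear in the reference standard.
    ppv = true_positives / len(predicted) if predicted else 0.0
    # TPR: fraction of reference findings that were predicted.
    tpr = true_positives / len(reference) if reference else 0.0
    # F1: harmonic mean of PPV and TPR, defined as 0 when both are 0.
    f1 = 2 * ppv * tpr / (ppv + tpr) if (ppv + tpr) > 0 else 0.0
    return ppv, tpr, f1


def codes_only(findings: Iterable[Finding]) -> Set[str]:
    """Drop laterality so that only the ICD-10 codes are compared."""
    return {code for code, _ in findings}


def average_scores(pairs: List[Tuple[Set, Set]]) -> Tuple[float, float, float]:
    """Average (PPV, TPR, F1) over radiographs (assumed averaging scheme)."""
    if not pairs:
        return 0.0, 0.0, 0.0
    scores = [ppv_tpr_f1(pred, ref) for pred, ref in pairs]
    return tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))


# Hypothetical example: the model names the right code but the wrong side.
predicted = {("J90", "left"), ("J18.9", "right")}
reference = {("J90", "right"), ("I51.7", "not applicable")}

code_level = ppv_tpr_f1(codes_only(predicted), codes_only(reference))
code_and_side = ppv_tpr_f1(predicted, reference)
```

In this example the code-only comparison credits the shared code, whereas the comparison that also requires matching laterality credits nothing, which is consistent with the lower scores the abstract reports when laterality is considered.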