Kanzawa Jun, Kurokawa Ryo, Kaiume Masafumi, Nakamura Yuta, Kurokawa Mariko, Sonoda Yuki, Gonoi Wataru, Abe Osamu
Radiology, University of Tokyo, Tokyo, JPN.
Cureus. 2024 Dec 11;16(12):e75532. doi: 10.7759/cureus.75532. eCollection 2024 Dec.
Purpose: This study investigated the capability of generative pre-trained transformer 4 (GPT-4) and GPT-4o to identify chest radiography reports requiring further assessment.
Materials and methods: This retrospective study included 100 cases from the National Institutes of Health chest radiography dataset, comprising 50 abnormal and 50 normal cases. A radiologist blinded to the study's purpose interpreted each case and reported the radiological findings in English, then separately determined the necessity for further assessment based on predefined criteria; these determinations served as the reference standard. The radiology reports were then input into the GPT-4 and GPT-4o models, accompanied by a prompt to identify cases requiring further assessment. This procedure was repeated five times in separate sessions for each model. The overall accuracy, sensitivity, and specificity of the two models in judging the necessity for further assessment were compared using McNemar's test. Positive and negative predictive values were compared using Fisher's exact test and the chi-square test, respectively.
Results: A total of 100 cases were included (mean age of 49.4 years ± 15.4 [standard deviation]; 56 women). Among them, 44 were judged by the radiologist to require further assessment. Across the five sessions, 19.6% and 35.8% of the cases were judged to require further assessment by GPT-4 and GPT-4o, respectively. The sensitivity, accuracy, and negative predictive value of GPT-4o (74.5%, 85.8%, and 82.6%, respectively) were all significantly higher than those of GPT-4 (44.5%, 75.6%, and 69.7%, respectively) (p < 0.001). The specificity and positive predictive value of GPT-4 (100% and 100%, respectively) were significantly higher than those of GPT-4o (94.6% and 91.6%, respectively) (p < 0.001).
Conclusion: GPT-4o showed acceptable performance in detecting chest radiography reports requiring further assessment.
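The reported percentages can be checked with a short calculation. The sketch below assumes the metrics were computed on judgments pooled across the five sessions (100 cases × 5 sessions = 500 judgments per model, of which 44 × 5 = 220 are reference-positive). The abstract does not state the pooling method or the confusion-matrix counts; the counts used here are a reconstruction chosen to be consistent with the published percentages, and the names `Confusion` and `metrics` are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """2x2 confusion matrix against the radiologist's reference standard."""
    tp: int  # flagged by the model, needs further assessment
    fp: int  # flagged by the model, does not need further assessment
    fn: int  # not flagged, needs further assessment
    tn: int  # not flagged, does not need further assessment


def metrics(c: Confusion) -> dict:
    """Return the standard diagnostic metrics as percentages (1 decimal)."""
    total = c.tp + c.fp + c.fn + c.tn
    raw = {
        "flagged": (c.tp + c.fp) / total,
        "sensitivity": c.tp / (c.tp + c.fn),
        "specificity": c.tn / (c.tn + c.fp),
        "ppv": c.tp / (c.tp + c.fp),
        "npv": c.tn / (c.tn + c.fn),
        "accuracy": (c.tp + c.tn) / total,
    }
    return {name: round(100 * value, 1) for name, value in raw.items()}


# ASSUMPTION: pooled counts over 5 sessions x 100 cases, reconstructed from
# the reported percentages (220 reference-positive, 280 reference-negative).
# These counts are not given in the abstract.
gpt4 = Confusion(tp=98, fp=0, fn=122, tn=280)
gpt4o = Confusion(tp=164, fp=15, fn=56, tn=265)

if __name__ == "__main__":
    for name, cm in (("GPT-4", gpt4), ("GPT-4o", gpt4o)):
        print(name, metrics(cm))
```

Under this pooling assumption, the reconstructed counts reproduce every percentage in the Results, including the share of cases flagged (19.6% and 35.8%). The significance tests are not reproduced here because McNemar's test requires the paired per-case judgments, which the abstract does not provide.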