Dede Burak Tayyip, Çakar İsa, Oğuz Muhammed, Alyanak Bülent, Bağcıer Fatih
Department of Physical Medicine and Rehabilitation, Prof. Dr. Cemil Tascioglu City Hospital, Istanbul, Turkey.
Department of Radiology, Basaksehir Cam and Sakura City Hospital, Istanbul, Turkey.
J Imaging Inform Med. 2025 Jul 25. doi: 10.1007/s10278-025-01614-3.
This study evaluated the reliability of ChatGPT-4 in measuring the acromiohumeral distance (AHD), a widely used assessment in patients with shoulder pain. In this retrospective study, 71 registered shoulder magnetic resonance imaging (MRI) scans were included. AHD was measured on a coronal oblique T1 sequence with a clear view of the acromion and humerus. An experienced radiologist performed the measurements twice, with a 3-day interval, and ChatGPT-4 performed them twice, with a 3-day interval, in different sessions. The first, second, and mean AHD values measured by the physician were 7.6 ± 1.7, 7.5 ± 1.6, and 7.6 ± 1.7, respectively. The first, second, and mean values measured by ChatGPT-4 were 6.7 ± 0.8, 7.3 ± 1.1, and 7.1 ± 0.8, respectively. The physician's and ChatGPT-4's values differed significantly for the first and mean measurements (p < 0.0001 and p = 0.009, respectively), but not for the second measurements (p = 0.220). Intrarater reliability was excellent for the physician (intraclass correlation coefficient [ICC] = 0.99) and poor for ChatGPT-4 (ICC = 0.41). Interrater reliability between the physician and ChatGPT-4 was also poor (ICC = 0.45). In conclusion, the reliability of ChatGPT-4 in AHD measurement was inferior to that of an experienced radiologist. These findings may help guide the future contribution of large language models to medical science.
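The abstract reports intraclass correlation coefficients but does not state which ICC form was used. As a minimal sketch, assuming the two-way random-effects, absolute-agreement, single-measurement form (often written ICC(2,1)), the coefficient could be computed from a subjects-by-sessions table as shown here. The function name `icc_2_1` and the toy values are hypothetical illustrations, not the study's code or data.

```python
from statistics import mean


def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single measurement.

    `ratings` is a list of rows, one row per subject (scan) and one column
    per rater or measurement session. This ICC form is an assumption; the
    abstract does not say which form the authors used.
    """
    n = len(ratings)       # number of subjects
    k = len(ratings[0])    # number of raters / sessions
    if n < 2 or k < 2:
        raise ValueError("need at least 2 subjects and 2 raters")
    if any(len(row) != k for row in ratings):
        raise ValueError("every subject needs the same number of ratings")

    grand = mean(v for row in ratings for v in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]

    # Mean squares from a two-way ANOVA without replication.
    ms_rows = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_cols = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)
    ss_err = sum(
        (ratings[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n)
        for j in range(k)
    )
    ms_err = ss_err / ((n - 1) * (k - 1))

    # Undefined (division by zero) when all ratings are identical.
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


# Hypothetical toy data: three subjects, two sessions, with the second
# session reading exactly 1 unit higher than the first.
toy = [[1, 2], [2, 3], [3, 4]]
print(icc_2_1(toy))
```

In the toy table the two sessions rank the subjects identically, yet the constant offset pulls the absolute-agreement ICC down to 2/3. A consistency-type ICC would ignore that offset, which is why the choice of form matters when interpreting the reported values.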