Could a New Method of Acromiohumeral Distance Measurement Emerge? Artificial Intelligence vs. Physician.

Authors

Dede Burak Tayyip, Çakar İsa, Oğuz Muhammed, Alyanak Bülent, Bağcıer Fatih

Affiliations

Department of Physical Medicine and Rehabilitation, Prof. Dr. Cemil Tascioglu City Hospital, Istanbul, Turkey.

Department of Radiology, Basaksehir Cam and Sakura City Hospital, Istanbul, Turkey.

Publication information

J Imaging Inform Med. 2025 Jul 25. doi: 10.1007/s10278-025-01614-3.

Abstract

The aim of this study was to evaluate the reliability of ChatGPT-4 in measuring the acromiohumeral distance (AHD), a widely used assessment in patients with shoulder pain. In this retrospective study, 71 registered shoulder magnetic resonance imaging (MRI) scans were included. AHD was measured on a coronal oblique T1 sequence with a clear view of the acromion and humerus. An experienced radiologist performed the measurements twice, 3 days apart, and ChatGPT-4 performed them twice, 3 days apart, in different sessions. The first, second, and mean values of AHD measured by the physician were 7.6 ± 1.7, 7.5 ± 1.6, and 7.6 ± 1.7, respectively. The first, second, and mean values measured by ChatGPT-4 were 6.7 ± 0.8, 7.3 ± 1.1, and 7.1 ± 0.8, respectively. The physician and ChatGPT-4 differed significantly on the first and mean measurements (p < 0.0001 and p = 0.009, respectively), but not on the second measurements (p = 0.220). Intrarater reliability was excellent for the physician (ICC = 0.99) and poor for ChatGPT-4 (ICC = 0.41). Interrater reliability was poor (ICC = 0.45). In conclusion, this study demonstrated that the reliability of ChatGPT-4 in AHD measurements is inferior to that of an experienced radiologist. These findings may help inform the future contribution of large language models to medical science.
