Noda Mitsuaki, Takahara Shunsuke, Hayashi Shinya, Inui Atsuyuki, Oe Keisuke, Matsushita Takehiko
Orthopedics, Himeji Central Hospital, Himeji, JPN.
Orthopedics, Hyogo Prefectural Kakogawa Medical Center, Kakogawa, JPN.
Cureus. 2025 Jan 27;17(1):e78068. doi: 10.7759/cureus.78068. eCollection 2025 Jan.
Introduction Chat Generative Pre-trained Transformer (ChatGPT) has become widely recognized for its capability to generate text, synthesize complex information, and perform a variety of tasks without requiring human specialists for data collection. The latest iteration, ChatGPT-4, is a large multimodal model capable of integrating both text and image inputs, rendering it particularly promising for medical applications. However, its efficacy in analyzing radiographic images remains largely unexplored. Aim This study aims to (i) address the lack of data on the accuracy of ChatGPT in classifying radiographic fractures as stable or unstable under the revised Arbeitsgemeinschaft für Osteosynthesefragen/Orthopedic Trauma Association (AO/OTA) classification system, a task also performed by surgeons, and (ii) compare agreement between surgeon-based and ChatGPT-based classifications. The study hypothesizes that ChatGPT would achieve moderate agreement with orthopedic surgeons. Materials and methods Patients diagnosed with pertrochanteric fractures were retrospectively collected. Patients with both preoperative two-directional plain radiographs and three-dimensional CT (3D-CT) images were eligible for enrollment. Two orthopedic surgeons (observers 1 and 2) and one resident (observer 3) each dichotomized the fractures into A1 (stable) or A2 (unstable) according to the AO/OTA classification using two-directional plain radiographs. Prior to the ChatGPT analysis, all anteroposterior images, trimmed to the fractured side and labeled with file names including gender and age, were inputted into OpenAI ChatGPT-4. Radiological evaluation prompts were designed to initiate ChatGPT's classification analysis of the uploaded radiographic images. A single observer (MN) determined the classification patterns by examining 3D-CT images as well as plain radiographs.
This judgment of A1 (stable) versus A2 (unstable) served as the benchmark against which the results of the observers and ChatGPT, based on plain radiographs, were scored. Results After data exclusion, the cohort consisted of 29 males and 90 females, with a mean age of 87 years. The fractures were classified into A1 (stable) and A2 (unstable) groups based on CT imaging. The A1 group included 50 patients (13 males, 37 females; mean age: 86.2 ± 7.8 years), while the A2 group included 69 patients (16 males, 53 females; mean age: 87.0 ± 7.9 years). Kappa values for fracture classification on plain radiographs by the three observers and ChatGPT, compared against the CT-based gold standard, showed fair to moderate agreement: observer 1: 0.494 (95% CI: 0.337-0.650), observer 2: 0.390 (95% CI: 0.227-0.553), observer 3: 0.360 (95% CI: 0.198-0.521), and ChatGPT: 0.420 (95% CI: 0.255-0.585). ChatGPT demonstrated accuracy, sensitivity, specificity, and positive and negative predictive values comparable to those of the human observers, suggesting moderate reliability. Conclusion This study demonstrates that ChatGPT can classify pertrochanteric fractures into A1 (stable) and A2 (unstable) under the revised AO/OTA classification system. Its moderate agreement with CT-based assessments (κ = 0.420) is comparable to the performance of orthopedic surgeons. Moreover, ChatGPT is straightforward to integrate into clinical workflows, requiring minimal data collection for training.
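The agreement and diagnostic statistics reported above can be computed from a 2x2 table of rater decisions against the CT-based benchmark. The sketch below shows Cohen's kappa, (observed agreement minus chance agreement) / (1 minus chance agreement), alongside accuracy, sensitivity, specificity, PPV, and NPV; the counts used are illustrative only and are not the study's actual data.

```python
# Cohen's kappa and diagnostic metrics for a binary (A1 vs. A2) classification
# against a CT-based benchmark. A2 (unstable) is treated as the "positive" class.
# NOTE: the counts in the example are hypothetical, not taken from the study.

def cohens_kappa(tp, fp, fn, tn):
    """Cohen's kappa for a 2x2 table: (p_observed - p_expected) / (1 - p_expected)."""
    n = tp + fp + fn + tn
    p_obs = (tp + tn) / n                                        # observed agreement
    p_exp = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2  # chance agreement
    return (p_obs - p_exp) / (1 - p_exp)

def diagnostic_metrics(tp, fp, fn, tn):
    """Accuracy, sensitivity, specificity, PPV, and NPV for a 2x2 table."""
    return {
        "accuracy":    (tp + tn) / (tp + fp + fn + tn),
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "ppv":         tp / (tp + fp),   # positive predictive value
        "npv":         tn / (tn + fn),   # negative predictive value
    }

if __name__ == "__main__":
    # Hypothetical rater-vs-CT counts for a cohort of 119 fractures.
    tp, fp, fn, tn = 50, 15, 19, 35
    print(round(cohens_kappa(tp, fp, fn, tn), 3))
    print(diagnostic_metrics(tp, fp, fn, tn))
```

With these illustrative counts the kappa lands near the moderate range reported for ChatGPT; in practice one would build the 2x2 table per rater from the radiograph-based calls and the CT-based benchmark.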