Comparative evaluation of artificial intelligence models GPT-4 and GPT-3.5 in clinical decision-making in sports surgery and physiotherapy: a cross-sectional study.
Author information
Saglam Sönmez, Uludag Veysel, Karaduman Zekeriya Okan, Arıcan Mehmet, Yücel Mücahid Osman, Dalaslan Raşit Emin
Affiliations
Department of Orthopaedics and Traumatology, Faculty of Medicine, Duzce University, Duzce, Türkiye.
Department of Physiotherapy and Rehabilitation, Faculty of Health Sciences, Duzce University, Duzce, Türkiye.
Publication information
BMC Med Inform Decis Mak. 2025 Apr 14;25(1):163. doi: 10.1186/s12911-025-02996-8.
BACKGROUND
The integration of artificial intelligence (AI) in healthcare has rapidly expanded, particularly in clinical decision-making. Large language models (LLMs) such as GPT-4 and GPT-3.5 have shown potential in various medical applications, including diagnostics and treatment planning. However, their efficacy in specialized fields like sports surgery and physiotherapy remains underexplored. This study aims to compare the performance of GPT-4 and GPT-3.5 in clinical decision-making within these domains using a structured assessment approach.
METHODS
This cross-sectional study included 56 professionals specializing in sports surgery and physiotherapy. Participants evaluated 10 standardized clinical scenarios generated by GPT-4 and GPT-3.5 using a 5-point Likert scale. The scenarios encompassed common musculoskeletal conditions, and assessments focused on diagnostic accuracy, treatment appropriateness, surgical technique detailing, and rehabilitation plan suitability. Data were collected anonymously via Google Forms. Statistical analysis included paired t-tests for direct model comparisons, one-way ANOVA to assess performance across multiple criteria, and Cronbach's alpha to evaluate inter-rater reliability.
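As a rough illustration of two of the statistics named above, the sketch below computes a paired t statistic and Cronbach's alpha (a measure of internal consistency) using only the Python standard library. It is a minimal sketch, not the authors' analysis code: the function names `paired_t` and `cronbach_alpha` and all ratings are hypothetical toy values, not data from the study. With 56 paired observations the same calculation would give 55 degrees of freedom. The one-way ANOVA step is omitted for brevity.

```python
from math import sqrt
from statistics import mean, stdev, variance


def paired_t(a, b):
    """Return (t statistic, degrees of freedom) for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    # t = mean difference divided by its standard error
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


def cronbach_alpha(rows):
    """rows holds one list of item scores per rater; returns Cronbach's alpha."""
    k = len(rows[0])  # number of items
    item_vars = [variance(col) for col in zip(*rows)]  # per-item sample variance
    total_var = variance([sum(r) for r in rows])  # variance of summed scores
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical toy data (NOT the study's data): mean Likert ratings from 5 raters.
gpt4_means = [4.5, 4.0, 4.8, 4.2, 4.6]
gpt35_means = [3.5, 3.2, 3.9, 3.6, 3.4]
t_stat, df = paired_t(gpt4_means, gpt35_means)

# Hypothetical item-level ratings: 5 raters x 3 items on a 1-5 scale.
toy_items = [[4, 5, 4], [3, 4, 3], [5, 5, 4], [2, 3, 2], [4, 4, 5]]
alpha = cronbach_alpha(toy_items)

print(f"t({df}) = {t_stat:.2f}, alpha = {alpha:.3f}")
```

Sample variances (n - 1 denominator) are used throughout, which is the usual convention for both statistics; a p-value would additionally require the t distribution, for which a statistics package is normally used.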
RESULTS
GPT-4 significantly outperformed GPT-3.5 across all evaluated criteria. Paired t-test results (t(55) = 10.45, p < 0.001) demonstrated that GPT-4 provided more accurate diagnoses, superior treatment plans, and more detailed surgical recommendations. ANOVA results confirmed the higher suitability of GPT-4 in treatment planning (F(1, 55) = 35.22, p < 0.001) and rehabilitation protocols (F(1, 55) = 32.10, p < 0.001). Cronbach's alpha values indicated higher internal consistency for GPT-4 (α = 0.478) compared to GPT-3.5 (α = 0.234), reflecting more reliable performance.
CONCLUSIONS
GPT-4 demonstrates superior performance compared to GPT-3.5 in clinical decision-making for sports surgery and physiotherapy. These findings suggest that advanced AI models can aid in diagnostic accuracy, treatment planning, and rehabilitation strategies. However, AI should function as a decision-support tool rather than a substitute for expert clinical judgment. Future studies should explore the integration of AI into real-world clinical workflows, validate findings using larger datasets, and compare additional AI models beyond the GPT series.