Slawaska-Eng David, Bourgeault-Gagnon Yoan, Cohen Dan, Pauyo Thierry, Belzile Etienne L, Ayeni Olufemi R
Division of Orthopaedic Surgery, Department of Surgery, McMaster University, 1200 Main St West, Hamilton, Ontario, L8N 3Z5, Canada.
Division of Orthopaedic Surgery, McGill University, 845 Rue Sherbrooke O, Montréal, QC H3A 0G4, Canada.
J ISAKOS. 2025 Feb;10:100376. doi: 10.1016/j.jisako.2024.100376. Epub 2024 Dec 12.
This study aimed to evaluate the accuracy of ChatGPT in answering patient questions about femoroacetabular impingement (FAI) and arthroscopic hip surgery, comparing the performance of ChatGPT-3.5 (free) and ChatGPT-4 (paid).
Twelve frequently asked questions (FAQs) relating to FAI were selected and posed to ChatGPT-3.5 and ChatGPT-4. Three hip arthroscopy surgeons assessed the responses for accuracy using a four-tier grading system. Statistical analyses included Wilcoxon signed-rank tests to compare the two versions and Gwet's AC2 coefficient with quadratic weights to estimate chance-corrected interrater agreement.
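Gwet's AC2 is less familiar than the Wilcoxon test, so a minimal sketch of the weighted coefficient may help. The rating matrix, rater count, and category count below are illustrative assumptions only, not the study's data:

```python
# Hedged sketch: Gwet's AC2 agreement coefficient with quadratic weights
# for ordinal ratings from multiple raters. Illustrative data, not the
# study's ratings.

def gwets_ac2(ratings, q):
    """ratings: one list per subject, each holding the ratings (1..q)
    given by the raters (at least two per subject).
    Returns the quadratically weighted AC2 coefficient."""
    # Quadratic weights: full credit on the diagonal, partial credit
    # decaying with the squared distance between ordinal categories.
    w = [[1 - ((k - l) / (q - 1)) ** 2 for l in range(q)] for k in range(q)]
    n = len(ratings)
    # Observed weighted agreement p_a, averaged over subjects.
    pa = 0.0
    for subj in ratings:
        r = len(subj)
        counts = [subj.count(k + 1) for k in range(q)]
        for k in range(q):
            # Weighted count of ratings "close to" category k.
            r_star = sum(w[k][l] * counts[l] for l in range(q))
            pa += counts[k] * (r_star - 1) / (r * (r - 1))
    pa /= n
    # Chance agreement p_e from the average category propensities.
    pi = [sum(subj.count(k + 1) / len(subj) for subj in ratings) / n
          for k in range(q)]
    t_w = sum(sum(row) for row in w)
    pe = t_w / (q * (q - 1)) * sum(p * (1 - p) for p in pi)
    return (pa - pe) / (1 - pe)

# Example: 3 raters grading 12 responses on a 4-tier scale
# (1 = excellent ... 4 = unsatisfactory); values are made up.
demo = [[1, 1, 2], [2, 2, 2], [1, 2, 2], [3, 2, 3],
        [1, 1, 1], [2, 2, 3], [2, 2, 2], [1, 1, 2],
        [2, 3, 3], [1, 1, 1], [2, 2, 2], [3, 3, 2]]
print(round(gwets_ac2(demo, 4), 3))  # close to 1: strong agreement
```

Unlike Cohen's kappa, AC2's chance-agreement term depends on how far the average category propensities sit from the extremes, which makes it less sensitive to skewed rating distributions such as the mostly "excellent"/"satisfactory" grades reported here.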
The median ratings for responses ranged from "excellent not requiring clarification" to "satisfactory requiring moderate clarification"; no responses were rated "unsatisfactory requiring substantial clarification." On the four-tier scale, where lower scores indicate more accurate responses, the median accuracy scores were 2 (range 1-3) for ChatGPT-3.5 and 1.5 (range 1-3) for ChatGPT-4, with 25% of ChatGPT-3.5's responses and 50% of ChatGPT-4's responses rated "excellent." There was no statistically significant difference in performance between the two versions (p = 0.279), although ChatGPT-4 tended toward higher accuracy in some areas. Interrater agreement was substantial for ChatGPT-3.5 (Gwet's AC2 = 0.79 [95% confidence interval (CI) = 0.60-0.94]) and moderate to substantial for ChatGPT-4 (Gwet's AC2 = 0.65 [95% CI = 0.43-0.87]).
Both versions of ChatGPT provided mostly accurate responses to FAQs on FAI and arthroscopic surgery, with no significant difference between the versions. The findings suggest a potential role for ChatGPT in patient education, though cautious implementation and further evaluation are recommended given the variability in response accuracy and the study's low statistical power.
Level of evidence: IV.