
ChatGPT-3.5 and -4 provide mostly accurate information when answering patients' questions relating to femoroacetabular impingement syndrome and arthroscopic hip surgery.

Author Information

Slawaska-Eng David, Bourgeault-Gagnon Yoan, Cohen Dan, Pauyo Thierry, Belzile Etienne L, Ayeni Olufemi R

Affiliations

Division of Orthopaedic Surgery, Department of Surgery, McMaster University, 1200 Main St West, Hamilton, Ontario, L8N 3Z5, Canada.

Division of Orthopaedic Surgery, McGill University, 845 Rue Sherbrooke O, Montréal, QC H3A 0G4, Canada.

Publication Information

J ISAKOS. 2025 Feb;10:100376. doi: 10.1016/j.jisako.2024.100376. Epub 2024 Dec 12.

Abstract

OBJECTIVES

This study aimed to evaluate the accuracy of ChatGPT in answering patient questions about femoroacetabular impingement (FAI) and arthroscopic hip surgery, comparing the performance of versions ChatGPT-3.5 (free) and ChatGPT-4 (paid).

METHODS

Twelve frequently asked questions (FAQs) relating to FAI were selected and posed to ChatGPT-3.5 and ChatGPT-4. The responses were assessed for accuracy by three hip arthroscopy surgeons using a four-tier grading system. Statistical analyses included Wilcoxon signed-rank tests and Gwet's AC2 coefficient with quadratic weights for chance-corrected interrater agreement.
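Because the same twelve questions were posed to both models, the per-question accuracy ratings form paired ordinal data, which is why a paired non-parametric test was used. The sketch below illustrates this kind of comparison with SciPy's Wilcoxon signed-rank test and shows the quadratic weight matrix that underlies a weighted agreement coefficient such as Gwet's AC2. The rating vectors are hypothetical placeholders rather than the study's data, and Gwet's AC2 itself is normally computed with a dedicated agreement package rather than re-implemented by hand.

```python
# Minimal sketch (hypothetical data): comparing paired accuracy ratings from
# two models and building the quadratic weight matrix used by weighted
# agreement coefficients such as Gwet's AC2.
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical 1-4 ratings for 12 questions (1 = excellent ... 4 = unsatisfactory);
# the study's actual ratings are not reproduced here.
gpt35 = np.array([2, 1, 2, 3, 2, 1, 2, 2, 3, 1, 2, 2])
gpt4  = np.array([1, 1, 2, 1, 1, 1, 2, 1, 3, 1, 2, 2])

# Paired non-parametric comparison of the two rating vectors; questions rated
# identically by both models (zero differences) are dropped by the default
# 'wilcox' zero handling.
stat, p_value = wilcoxon(gpt35, gpt4)
print(f"Wilcoxon signed-rank: W = {stat:.1f}, p = {p_value:.3f}")

# Quadratic weights over q ordinal categories: w[k, l] = 1 - ((k - l)/(q - 1))**2.
# Near-misses receive partial credit; these weights feed a chance-corrected
# agreement coefficient such as Gwet's AC2 (only the weight matrix is shown here).
q = 4
k, l = np.meshgrid(np.arange(q), np.arange(q), indexing="ij")
quadratic_weights = 1.0 - ((k - l) / (q - 1)) ** 2
print(quadratic_weights)
```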

RESULTS

The median ratings for responses ranged from "excellent not requiring clarification" to "satisfactory requiring moderate clarification." No responses were rated as "unsatisfactory requiring substantial clarification." The median accuracy scores were 2 (range 1-3) for ChatGPT-3.5 and 1.5 (range 1-3) for ChatGPT-4, with 25% of ChatGPT-3.5's responses and 50% of ChatGPT-4's responses rated as "excellent." There was no statistically significant difference in performance between the two versions (p = 0.279), although ChatGPT-4 showed a tendency towards higher accuracy in some areas. Interrater agreement was substantial for ChatGPT-3.5 (Gwet's AC2 = 0.79 [95% confidence interval (CI) = 0.6-0.94]) and moderate to substantial for ChatGPT-4 (Gwet's AC2 = 0.65 [95% CI = 0.43-0.87]).

CONCLUSION

Both versions of ChatGPT provided mostly accurate responses to FAQs on FAI and arthroscopic surgery, with no significant difference between the versions. The findings suggest potential utility of ChatGPT in patient education, though cautious implementation and further evaluation are recommended given the variability in response accuracy and the study's low statistical power.

LEVEL OF EVIDENCE

IV.

