Ayık Gökhan, Ercan Niyazi, Demirtaş Yunus, Yıldırım Tuğrul, Çakmak Gökhan
Yüksek İhtisas Üniversitesi Ortopedi ve Travmatoloji Anabilim Dalı, 06530 Çankaya, Ankara, Türkiye.
Jt Dis Relat Surg. 2025 Jan 2;36(1):193-199. doi: 10.52312/jdrs.2025.1961. Epub 2024 Dec 18.
This study aimed to evaluate the responses provided by ChatGPT-4o to the most frequently asked questions by patients regarding hip arthroscopy.
In this cross-sectional survey study, a new Google account without a search history was created to identify the 20 most frequently asked questions about hip arthroscopy via Google. These questions were posed to a new ChatGPT-4o account on June 1, 2024, and the responses were recorded. Ten orthopedic surgeons specializing in sports surgery rated the responses on relevance, accuracy, clarity, and completeness using a rating scale from 1 to 5, with 1 being the worst and 5 being the best. Interrater reliability was assessed via the intraclass correlation coefficient (ICC).
The lowest score given by the surgeons for any response was 4/5 in every subcategory. The highest mean scores were in accuracy and clarity, followed by relevance, with completeness receiving the lowest scores. The overall mean score was 4.49±0.16. Interrater reliability showed insufficient overall agreement (ICC=0.004, p=0.383), with the highest agreement in clarity (ICC=0.039, p=0.131) and the lowest in accuracy (ICC=-0.019, p=0.688).
The study confirms our hypothesis that ChatGPT-4o provides above-average-quality responses to frequently asked questions about hip arthroscopy, as evidenced by the high scores in relevance, accuracy, clarity, and completeness. However, it remains advisable to consult orthopedic specialists, with ChatGPT's suggestions serving only as supplementary input in the final decision-making process.