Mabrouk Ahmed, Boutefnouchet Tarek, Malik Shahbaz, Sweed Tamer
Basingstoke and North Hampshire Hospital, Basingstoke, United Kingdom.
University Hospitals Birmingham NHS Foundation Trust, Birmingham, United Kingdom.
Eur J Orthop Surg Traumatol. 2025 Jun 16;35(1):247. doi: 10.1007/s00590-025-04373-7.
This study aimed to assess the accuracy of a custom ChatGPT in responding to questions specifically related to high tibial osteotomies (HTO), using an international expert osteotomy consensus statement as the source of information.
A custom ChatGPT was developed using the European Society of Sports Traumatology, Knee Surgery and Arthroscopy (ESSKA) osteotomy consensus for the painful degenerative varus knee as the primary training material. The custom ChatGPT was then tested for accuracy by generating responses to a series of 10 questions: five directly extracted from the consensus statement (Identical group) and five other common questions related to HTO (Random group). The generated responses were assessed by three knee surgeons using a bespoke scoring system that evaluated accuracy, relevance, clarity, completeness, and adherence to the consensus. Each item was scored on a four-point Likert scale from 0 to 3. Inter-rater reliability was calculated with an intra-class correlation coefficient (ICC).
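For readers wanting to reproduce a reliability analysis of this kind, a minimal sketch is given below. It computes an ICC from a long-format table of rater scores with the pingouin package's intraclass_corr function. The scores, rater labels, and the choice of ICC model (ICC2, two-way random effects, single rater) are illustrative assumptions, not the study's actual data or software.

```python
# Minimal sketch (not the authors' code): inter-rater reliability for
# Likert-scale ratings via pingouin's intraclass_corr.
# All score values below are made-up placeholders.
import pandas as pd
import pingouin as pg

# Long-format table: one row per (question, rater) pair with an accuracy score 0-3.
data = pd.DataFrame({
    "question": [f"Q{i}" for i in range(1, 11)] * 3,
    "rater":    ["R1"] * 10 + ["R2"] * 10 + ["R3"] * 10,
    "accuracy": [3, 2, 3, 3, 2, 2, 3, 1, 3, 2,    # rater 1 (illustrative)
                 3, 2, 3, 2, 2, 2, 3, 2, 3, 2,    # rater 2 (illustrative)
                 3, 3, 3, 3, 2, 1, 3, 2, 3, 2],   # rater 3 (illustrative)
})

icc = pg.intraclass_corr(data=data, targets="question",
                         raters="rater", ratings="accuracy")

# ICC2 (two-way random effects, absolute agreement, single rater) is a common
# choice when raters are treated as a random sample of possible assessors.
print(icc.set_index("Type").loc["ICC2", ["ICC", "pval", "CI95%"]])
```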
A total of 30 questions were posed to the custom ChatGPT by the three raters. The mean scores for accuracy, relevance, and clarity were 2.5 ± 0.8, 2.9 ± 0.3, and 2.9 ± 0.2, respectively, with good inter-rater reliability (ICC 0.7, p = 0.004). The mean score for completeness was 2.6 ± 0.5 with moderate inter-rater reliability (ICC 0.5, p = 0.1), whereas the mean score for adherence to the consensus statement document was 2.5 ± 1.1 with excellent inter-rater reliability (ICC 0.9, p < 0.001). There was no significant intergroup difference in accuracy, relevance, clarity, or completeness (all p > 0.05). Only adherence to the consensus document (PDF) was significantly lower in the Random group (1.9 ± 0.5) than in the Identical group (3.0) (p = 0.01).
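The abstract does not name the statistical test used for the intergroup comparison. The sketch below illustrates one reasonable option for ordinal Likert-scale data, a Mann-Whitney U test in SciPy, applied to hypothetical adherence scores pooled across the three raters; the score values are placeholders, not the study's data.

```python
# Minimal sketch (illustrative, not the authors' analysis): comparing the
# Identical and Random question groups on an ordinal 0-3 adherence score.
from scipy.stats import mannwhitneyu

# Hypothetical adherence-to-consensus scores, 5 questions x 3 raters per group.
identical_group = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
random_group    = [2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2]

# Rank-based test; no normality assumption, appropriate for Likert-type scores.
stat, p_value = mannwhitneyu(identical_group, random_group, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p_value:.4f}")
```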
A custom ChatGPT can be trained to accurately answer questions from an international expert osteotomy consensus statement, indicating effective training and customization. This can serve as a valuable tool to guide surgeons in their practice by providing evidence-based answers to key questions in a time-efficient manner and is potentially applicable to other consensus statements and published literature.