Li Juntan, Gao Xiang, Dou Tianxu, Gao Yuyang, Li Xu, Zhu Wannan
Jinzhou Medical University, Jinzhou, Liaoning, China.
The First Affiliated Hospital of China Medical University, Shenyang, Liaoning, China.
BMJ Open. 2024 Dec 30;14(12):e082344. doi: 10.1136/bmjopen-2023-082344.
To evaluate GPT-4's performance in interpreting osteoarthritis (OA) treatment guidelines from the USA and China, and to assess its ability to diagnose and manage orthopaedic cases.
The study was conducted using publicly available OA treatment guidelines and simulated orthopaedic case scenarios.
No human participants were involved. The evaluation focused on GPT-4's responses to guideline-based queries and case questions, which were assessed by two orthopaedic specialists.
Primary outcomes included the accuracy and completeness of GPT-4's responses to guideline-based queries and case scenarios. Metrics included the correct match rate, completeness score and stratification of case responses into predefined tiers of correctness.
In interpreting the American Academy of Orthopaedic Surgeons and Chinese OA guidelines, GPT-4 achieved a correct match rate of 46.4% and complete agreement with all score-2 recommendations. The accuracy score for guideline interpretation was 4.3±1.6 (95% CI 3.9 to 4.7), and the completeness score was 2.8±0.6 (95% CI 2.5 to 3.1). For case-based questions, GPT-4 demonstrated high performance, with over 88% of responses rated as comprehensive.
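The summary statistics reported above (a match rate in percent, and a mean ± SD with a 95% CI) can be reproduced from per-item ratings along the following lines. This is a minimal sketch, not the authors' analysis code: the function names `match_rate` and `mean_sd_ci` are hypothetical, the example ratings are invented for illustration and are not study data, and the confidence interval assumes a normal approximation (z = 1.96), which the abstract does not confirm.

```python
import math
from statistics import mean, stdev


def match_rate(matches):
    """Percentage of items where the model's answer matched the guideline.

    `matches` is a sequence of booleans, one per recommendation (hypothetical input).
    """
    return 100.0 * sum(matches) / len(matches)


def mean_sd_ci(scores, z=1.96):
    """Return (mean, sample SD, (ci_low, ci_high)) for a list of ratings.

    Assumes a normal-approximation CI: mean +/- z * SD / sqrt(n).
    """
    m = mean(scores)
    sd = stdev(scores)  # sample standard deviation (n - 1 denominator)
    half_width = z * sd / math.sqrt(len(scores))
    return m, sd, (m - half_width, m + half_width)


if __name__ == "__main__":
    # Illustrative ratings only; these are NOT values from the study.
    example_matches = [True, False, True, True]
    example_scores = [4, 5, 3, 4, 4]

    m, sd, (low, high) = mean_sd_ci(example_scores)
    print(f"match rate: {match_rate(example_matches):.1f}%")
    print(f"score: {m:.1f}±{sd:.1f} (95% CI {low:.1f} to {high:.1f})")
```

Because the interval half-width scales with SD divided by the square root of the number of rated items, the width of a reported CI depends on how many items were scored as well as on the spread of the ratings.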
GPT-4 demonstrates promising capabilities as an auxiliary tool in orthopaedic clinical practice and patient education, with high levels of accuracy and completeness in guideline interpretation and clinical case analysis. However, further validation is necessary to establish its utility in real-world clinical settings.