Kim Hong Jin, Yoon Pil Whan, Yoon Jae Youn, Kim Hyungtae, Choi Young Jin, Park Sangyoon, Moon Jun-Ki
Department of Orthopaedic Surgery, Kyung-in Regional Military Manpower Administration, Suwon 16440, Republic of Korea.
Department of Orthopedic Surgery, Inje University Sanggye Paik Hospital, Seoul 01757, Republic of Korea.
J Clin Med. 2024 Oct 8;13(19):5971. doi: 10.3390/jcm13195971.
Background: This study aimed to assess the reproducibility and reliability of Chat-Based GPT (ChatGPT)'s responses to 19 statements regarding the management of hip fractures in older adults, as adopted by the American Academy of Orthopaedic Surgeons' (AAOS) evidence-based clinical practice guidelines. Methods: Nineteen statements were obtained from the 2021 AAOS evidence-based clinical practice guidelines. After generating questions based on these 19 statements, we set a prompt for both the GPT-4o and GPT-4 models. We repeated this process three times at 24 h intervals for both models, producing outputs A, B, and C. ChatGPT's performance, the intra-ChatGPT reliability, and the accuracy rates were assessed to evaluate the reproducibility and reliability of its responses to the hip fracture guidelines. Regarding the strengths of recommendation compared with the 2021 AAOS guidelines, we observed accuracies of 0.684, 0.579, and 0.632 for outputs A, B, and C, respectively. Results: The precision was 0.740, 0.737, and 0.718 in outputs A, B, and C, respectively. For the reliability of the strengths of recommendation, the Fleiss kappa was 0.409, indicating a moderate level of agreement. No statistically significant differences in the strengths of recommendation were observed in outputs A, B, and C between the GPT-4o and GPT-4 versions. Conclusions: ChatGPT may be useful in providing guidelines for hip fractures but performs poorly in terms of accuracy and precision. Moreover, hallucinations remain an unresolved limitation of using ChatGPT to search for hip fracture guidelines. The effective utilization of ChatGPT as a patient education tool for the management of hip fractures should be addressed in the future.
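The reliability statistic reported above, Fleiss' kappa, measures chance-corrected agreement among a fixed number of raters (here, the three repeated outputs A, B, and C) assigning categorical labels (recommendation strengths) to the same items (the 19 statements). A minimal sketch of the standard computation is shown below; this is not the authors' code, and the small rating matrix is hypothetical illustrative data, not the study's ratings.

```python
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """Fleiss' kappa for a subjects-by-categories count matrix.

    counts[i, j] = number of raters who assigned subject i to category j.
    Every row must sum to the same number of raters k.
    """
    n, _ = counts.shape
    k = counts[0].sum()                       # raters per subject (e.g., 3 outputs)
    p_j = counts.sum(axis=0) / (n * k)        # overall proportion per category
    # Per-subject observed agreement across rater pairs
    P_i = (np.square(counts).sum(axis=1) - k) / (k * (k - 1))
    P_bar = P_i.mean()                        # mean observed agreement
    P_e = np.square(p_j).sum()                # expected agreement by chance
    return (P_bar - P_e) / (1 - P_e)

# Hypothetical example: 5 statements rated by 3 repeated outputs into
# 4 strength categories (illustrative data only).
ratings = np.array([
    [3, 0, 0, 0],
    [2, 1, 0, 0],
    [0, 3, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 2, 1],
])
print(round(fleiss_kappa(ratings), 3))  # prints 0.318
```

By the conventional Landis-Koch benchmarks, a kappa of 0.409, as reported in the study, falls in the 0.41-0.60 "moderate agreement" band (at its boundary), which is how the abstract characterizes it.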