Nian Patrick P, Umesh Amith, Simpson Shae K, Tracey Olivia C, Nichols Erikson, Logterman Stephanie, Doyle Shevaun M, Heyer Jessica H
Department of Pediatric Orthopaedic Surgery, Hospital for Special Surgery, New York, NY.
Department of Pediatric Orthopaedic Surgery, Orlando Health Arnold Palmer Hospital for Children Center for Orthopaedics, Orlando, FL.
J Pediatr Orthop. 2025 Apr 1;45(4):e338-e344. doi: 10.1097/BPO.0000000000002890. Epub 2025 Jan 14.
Artificial intelligence (AI) chatbots, including Chat Generative Pre-trained Transformer (ChatGPT) and Google Gemini, have significantly increased access to medical information. However, in pediatric orthopaedics, no study has evaluated the accuracy of AI chatbots against evidence-based recommendations, including the American Academy of Orthopaedic Surgeons clinical practice guidelines (AAOS CPGs). The aims of this study were to compare responses from ChatGPT-4.0, ChatGPT-3.5, and Google Gemini with AAOS CPG recommendations on pediatric supracondylar humerus and diaphyseal femur fractures with respect to accuracy, supplementary and incomplete response patterns, and readability.
ChatGPT-4.0, ChatGPT-3.5, and Google Gemini were prompted with questions created from 13 evidence-based recommendations (6 from the 2011 AAOS CPG on pediatric supracondylar humerus fractures; 7 from the 2020 AAOS CPG on pediatric diaphyseal femur fractures). Responses were anonymized and independently evaluated by 2 pediatric orthopaedic attending surgeons. Supplementary responses were additionally evaluated on whether no, some, or many modifications were necessary. Readability metrics (response length, Flesch-Kincaid Grade Level, Flesch Reading Ease, Gunning Fog Index) were compared. Cohen kappa interrater reliability (κ) was calculated. χ² analyses and single-factor analysis of variance were used to compare categorical and continuous variables, respectively. Statistical significance was set at P < 0.05.
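For readers who want to reproduce this type of analysis, the sketch below is a minimal, illustrative example (not the authors' code) of how the named readability metrics and statistical tests could be computed in Python, assuming the textstat, SciPy, and scikit-learn packages; the response texts, contingency table, and rater judgments shown are placeholders, not study data.

```python
# Illustrative sketch only: readability metrics and statistical tests of the
# kind described in the Methods, applied to placeholder chatbot responses.
import textstat
from scipy.stats import chi2_contingency, f_oneway
from sklearn.metrics import cohen_kappa_score

# Placeholder response texts per chatbot (the study used 13 per chatbot).
responses = {
    "ChatGPT-4.0": ["Closed reduction and pinning is recommended.",
                    "Immediate operative treatment is suggested for open fractures."],
    "ChatGPT-3.5": ["Casting may be appropriate in young children.",
                    "Flexible intramedullary nailing is an option."],
    "Google Gemini": ["Spica casting is commonly used under age 5.",
                      "Surgical fixation is considered for older children."],
}

def readability(text):
    """Word count plus the three readability indices compared in the study."""
    return {
        "words": len(text.split()),
        "fk_grade": textstat.flesch_kincaid_grade(text),
        "reading_ease": textstat.flesch_reading_ease(text),
        "gunning_fog": textstat.gunning_fog(text),
    }

# Single-factor ANOVA across chatbots for one continuous metric
# (here, Flesch-Kincaid Grade Level).
grades = [[readability(r)["fk_grade"] for r in texts] for texts in responses.values()]
f_stat, p_anova = f_oneway(*grades)

# Chi-square test comparing accurate vs. not-accurate counts across chatbots
# (hypothetical 2x3 contingency table).
table = [[11, 9, 11],   # accurate
         [2, 4, 2]]     # not accurate
chi2, p_chi2, dof, _ = chi2_contingency(table)

# Cohen kappa between two raters' binary accuracy judgments (placeholder values).
rater_a = [1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1]
rater_b = [1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1]
kappa = cohen_kappa_score(rater_a, rater_b)

print(f"ANOVA P = {p_anova:.3f}, chi-square P = {p_chi2:.3f}, kappa = {kappa:.2f}")
```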
ChatGPT-4.0, ChatGPT-3.5, and Google Gemini were accurate in 11/13, 9/13, and 11/13, supplementary in 13/13, 11/13, and 13/13, and incomplete in 3/13, 4/13, and 4/13 recommendations, respectively. Of 37 supplementary responses, 17 (45.9%), 19 (51.4%), and 1 (2.7%) required no, some, and many modifications, respectively. There were no significant differences in accuracy (P = 0.533), supplementary responses (P = 0.121), necessary modifications (P = 0.580), or incomplete responses (P = 0.881). Overall κ was moderate at 0.55. ChatGPT-3.5 provided shorter responses (P = 0.002), but Google Gemini was more readable in terms of Flesch-Kincaid Grade Level (P = 0.002), Flesch Reading Ease (P < 0.001), and Gunning Fog Index (P = 0.021).
While AI chatbots provided responses with reasonable accuracy, most supplementary information required modification, and readability remained complex. Improvements are necessary before AI chatbots can be reliably used for patient education.
Level IV.