Nian Patrick P, Umesh Amith, Jones Ruth H, Adhiyaman Akshitha, Williams Christopher J, Goodbody Christine M, Heyer Jessica H, Doyle Shevaun M
Hospital for Special Surgery, New York City, NY, USA.
Children's Hospital of Philadelphia, Philadelphia, PA, USA.
J Pediatr Soc North Am. 2024 Dec 9;10:100135. doi: 10.1016/j.jposna.2024.100135. eCollection 2025 Feb.
Large language models, including Chat Generative Pre-trained Transformer (ChatGPT) and Google Gemini, have accelerated public access to information, but their accuracy in answering medical questions remains unknown. In pediatric orthopaedics, no study has utilized board-certified expert opinion to evaluate the accuracy of artificial intelligence (AI) chatbots against evidence-based recommendations, including the American Academy of Orthopaedic Surgeons clinical practice guidelines (AAOS CPGs). The aims of this study were to compare responses by ChatGPT-4.0, ChatGPT-3.5, and Google Gemini with AAOS CPG recommendations on developmental dysplasia of the hip (DDH) regarding accuracy, supplementary and incomplete response patterns, and readability.
ChatGPT-4.0, ChatGPT-3.5, and Google Gemini were prompted with questions created from 9 evidence-based recommendations in the 2022 AAOS CPG on DDH. The answers to these questions were obtained on July 1, 2024. Responses were anonymized and independently evaluated by two pediatric orthopaedic attending surgeons. Supplementary responses were additionally evaluated on whether no, some, or many modifications were necessary. Readability metrics (response length, Flesch-Kincaid reading level, Flesch Reading Ease, Gunning Fog Index) were compared. Cohen's kappa inter-rater reliability (κ) was calculated. Chi-square analyses and single-factor analysis of variance were utilized to compare categorical and continuous variables, respectively. Statistical significance was set at P < 0.05.
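The inter-rater reliability and readability measures above follow standard closed-form definitions. As a minimal, stdlib-only sketch (not the authors' actual analysis code), Cohen's κ and the Flesch-Kincaid grade level can be computed as:

```python
from collections import Counter

def cohens_kappa(ratings1, ratings2):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    n = len(ratings1)
    # observed proportion of agreement
    p_o = sum(a == b for a, b in zip(ratings1, ratings2)) / n
    # expected agreement from each rater's marginal label frequencies
    c1, c2 = Counter(ratings1), Counter(ratings2)
    p_e = sum(c1[label] * c2[label] for label in c1) / (n * n)
    return (p_o - p_e) / (1 - p_e)

def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid grade level from raw text counts."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# Example: two raters labeling four chatbot responses as accurate (A) or not (N)
kappa = cohens_kappa(["A", "A", "N", "N"], ["A", "A", "N", "A"])  # 0.5
grade = flesch_kincaid_grade(words=100, sentences=5, syllables=150)
```

In practice, readability scores and chi-square/ANOVA statistics like those reported here are typically computed with established tools rather than hand-rolled formulas.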
ChatGPT-4.0, ChatGPT-3.5, and Google Gemini were accurate in 5/9, 6/9, and 6/9; supplementary in 8/9, 7/9, and 9/9; and incomplete in 7/9, 6/9, and 7/9 recommendations, respectively. Of 24 supplementary responses, 5 (20.8%), 16 (66.7%), and 3 (12.5%) required no, some, and many modifications, respectively. There were no significant differences in accuracy (P = 0.853), supplementary responses (P = 0.325), necessary modifications (P = 0.661), or incomplete responses (P = 0.825). κ was highest for accuracy at 0.17. Google Gemini was significantly more readable by Flesch-Kincaid reading level, Flesch Reading Ease, and Gunning Fog Index (all P < 0.05).
In the setting of DDH, AI chatbots demonstrated limited accuracy, high supplementary and incomplete response patterns, and complex readability. Pediatric orthopaedic surgeons can counsel patients and their families to set appropriate expectations on the utility of these novel tools.
(1) Responses by ChatGPT-4.0, ChatGPT-3.5, and Google Gemini were inadequately accurate, frequently provided supplementary information that required modifications, and frequently lacked essential details from the AAOS CPGs on DDH. (2) Accurate, supplementary, and incomplete response patterns were not significantly different among the three chatbots. (3) Google Gemini provided responses with the highest readability among the three chatbots. (4) Pediatric orthopaedic surgeons can play a role in counseling patients and their families on the limited utility of AI chatbots for patient education purposes.
IV.