Nian Patrick P, Umesh Amith, Jones Ruth H, Adhiyaman Akshitha, Williams Christopher J, Goodbody Christine M, Heyer Jessica H, Doyle Shevaun M
Hospital for Special Surgery, New York City, NY, USA.
Children's Hospital of Philadelphia, Philadelphia, PA, USA.
J Pediatr Soc North Am. 2024 Dec 9;10:100135. doi: 10.1016/j.jposna.2024.100135. eCollection 2025 Feb.
Large language models, including Chat Generative Pre-trained Transformer (ChatGPT) and Google Gemini, have accelerated public access to information, but their accuracy in answering medical questions remains unknown. In pediatric orthopaedics, no study has utilized board-certified expert opinion to evaluate the accuracy of artificial intelligence (AI) chatbots against evidence-based recommendations, including the American Academy of Orthopaedic Surgeons clinical practice guidelines (AAOS CPGs). The aims of this study were to compare responses by ChatGPT-4.0, ChatGPT-3.5, and Google Gemini with AAOS CPG recommendations on developmental dysplasia of the hip (DDH) regarding accuracy, supplementary and incomplete response patterns, and readability.
ChatGPT-4.0, ChatGPT-3.5, and Google Gemini were prompted with questions created from 9 evidence-based recommendations in the 2022 AAOS CPG on DDH. The answers to these questions were obtained on July 1, 2024. Responses were anonymized and independently evaluated by two pediatric orthopaedic attending surgeons. Supplementary responses were additionally evaluated on whether no, some, or many modifications were necessary. Readability metrics (response length, Flesch-Kincaid reading level, Flesch Reading Ease, Gunning Fog Index) were compared. Cohen's kappa inter-rater reliability (κ) was calculated. Chi-square analyses and single-factor analysis of variance were utilized to compare categorical and continuous variables, respectively. Statistical significance was set at P < 0.05.
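The inter-rater reliability and readability measures above follow standard closed-form definitions. As a minimal, stdlib-only sketch (not the authors' actual analysis code), Cohen's κ and the Flesch-Kincaid grade level can be computed as:

```python
from collections import Counter

def cohens_kappa(ratings1, ratings2):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    n = len(ratings1)
    # observed proportion of agreement
    p_o = sum(a == b for a, b in zip(ratings1, ratings2)) / n
    # expected agreement from each rater's marginal label frequencies
    c1, c2 = Counter(ratings1), Counter(ratings2)
    p_e = sum(c1[label] * c2[label] for label in c1) / (n * n)
    return (p_o - p_e) / (1 - p_e)

def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid grade level from raw text counts."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# Example: two raters labeling four chatbot responses as accurate (A) or not (N)
kappa = cohens_kappa(["A", "A", "N", "N"], ["A", "A", "N", "A"])  # 0.5
grade = flesch_kincaid_grade(words=100, sentences=5, syllables=150)
```

In practice, readability scores and chi-square/ANOVA statistics like those reported here are typically computed with established tools rather than hand-rolled formulas.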
ChatGPT-4.0, ChatGPT-3.5, and Google Gemini were accurate in 5/9, 6/9, and 6/9; supplementary in 8/9, 7/9, and 9/9; and incomplete in 7/9, 6/9, and 7/9 recommendations, respectively. Of 24 supplementary responses, 5 (20.8%), 16 (66.7%), and 3 (12.5%) required no, some, and many modifications, respectively. There were no significant differences in accuracy (P = 0.853), supplementary responses (P = 0.325), necessary modifications (P = 0.661), or incomplete responses (P = 0.825). κ was highest for accuracy at 0.17. Google Gemini was significantly more readable by Flesch-Kincaid reading level, Flesch Reading Ease, and Gunning Fog Index (all P < 0.05).
In the setting of DDH, AI chatbots demonstrated limited accuracy, high supplementary and incomplete response patterns, and complex readability. Pediatric orthopaedic surgeons can counsel patients and their families to set appropriate expectations on the utility of these novel tools.
(1) Responses by ChatGPT-4.0, ChatGPT-3.5, and Google Gemini were inadequately accurate, frequently provided supplementary information that required modifications, and frequently lacked essential details from the AAOS CPGs on DDH. (2) Accurate, supplementary, and incomplete response patterns were not significantly different among the three chatbots. (3) Google Gemini provided responses with the highest readability among the three chatbots. (4) Pediatric orthopaedic surgeons can play a role in counseling patients and their families on the limited utility of AI chatbots for patient education purposes.
IV.