
Do ChatGPT and Gemini's Recommendations Align With Established Guidelines for Hand and Upper Extremity Surgery?

Author Information

Zhang Yibin B, Fischer Fielding S, Abola Matthew V, Osei Daniel A, Wolfe Scott W, Amen Troy B

Affiliations

Harvard Medical School, Boston, MA, USA.

Hospital for Special Surgery, New York, NY, USA.

Publication Information

Hand (N Y). 2025 Sep 18:15589447251371089. doi: 10.1177/15589447251371089.

Abstract

BACKGROUND

The use of large language models (LLMs) such as ChatGPT and Gemini in clinical settings has surged, presenting potential benefits in reducing administrative workload and enhancing patient communication. However, concerns about the clinical accuracy of these tools persist. This study evaluated the concordance of ChatGPT and Gemini's recommendations with American Academy of Orthopedic Surgeons (AAOS) clinical practice guidelines (CPGs) for carpal tunnel syndrome, distal radius fractures, and glenohumeral joint osteoarthritis.

METHODS

ChatGPT (version 4o) and Gemini (version 1.5 Flash) were queried using structured text-based prompts aligned with AAOS CPGs. The LLMs' outputs were analyzed by blinded reviewers to determine concordance with the guidelines. Concordance rates were compared across models, topics, and guideline strength using descriptive statistics and McNemar's test. The transparency of responses, including source citation, was also assessed.
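The pairwise model comparison described above can be sketched with a minimal exact McNemar test, which considers only the discordant pairs (prompts where exactly one model matched the guideline). The counts used below are hypothetical illustrations, not the study's actual contingency data:

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Exact (binomial) two-sided McNemar p-value for paired binary outcomes.

    b: pairs where model A was concordant with the guideline but model B was not
    c: pairs where model B was concordant but model A was not
    Concordant-concordant and discordant-discordant pairs do not enter the test.
    """
    n = b + c
    k = min(b, c)
    # Under H0 (no difference between models), b ~ Binomial(n, 0.5);
    # double the smaller tail for a two-sided p-value, capped at 1.
    p = 2 * sum(comb(n, i) * 0.5**n for i in range(k + 1))
    return min(p, 1.0)

# Hypothetical discordant-pair counts for illustration only
print(round(mcnemar_exact(15, 7), 4))
```

With perfectly balanced discordant counts the p-value is 1.0, reflecting no evidence of a difference; the test's power depends entirely on the number of discordant pairs, not the total number of prompts.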

RESULTS

A total of 174 recommendations were generated, with an overall concordance rate of 62.1%. Concordance did not differ significantly between ChatGPT and Gemini (66.7% vs 57.5%, P = .131). Concordance varied by topic and guideline strength, with ChatGPT performing best for moderately supported guidelines. Both models demonstrated low citation transparency: Gemini provided sources for 39.1% of recommendations, significantly more than ChatGPT's 3.5% (P < .0001).

CONCLUSIONS

Despite modest concordance rates, both models exhibited significant limitations, including variability across topics and guideline strengths, as well as insufficient citation transparency. These findings highlight the challenges in integrating LLMs into clinical practice and emphasize the need for further refinement and evaluation before adoption in hand surgery.


