
The ability of artificial intelligence tools to formulate orthopaedic clinical decisions in comparison to human clinicians: An analysis of ChatGPT 3.5, ChatGPT 4, and Bard.

Authors

Agharia Suzen, Szatkowski Jan, Fraval Andrew, Stevens Jarrad, Zhou Yushy

Affiliations

Department of Orthopaedic Surgery, St. Vincent's Hospital, Melbourne, Victoria, Australia.

Department of Orthopaedic Surgery, Indiana University Health Methodist Hospital, Indianapolis, IN, USA.

Publication information

J Orthop. 2023 Dec 1;50:1-7. doi: 10.1016/j.jor.2023.11.063. eCollection 2024 Apr.

Abstract

BACKGROUND

Recent advancements in artificial intelligence (AI) have sparked interest in its integration into clinical medicine and education. This study evaluates the performance of three AI tools compared to human clinicians in addressing complex orthopaedic decisions in real-world clinical cases.

QUESTIONS/PURPOSES

To evaluate the ability of commonly used AI tools to formulate orthopaedic clinical decisions in comparison to human clinicians.

PATIENTS AND METHODS

The study used OrthoBullets Cases, a publicly available clinical case collaboration platform on which surgeons from around the world choose treatment options in peer-reviewed, standardised treatment polls. The clinical cases cover various orthopaedic categories. Three AI tools (ChatGPT 3.5, ChatGPT 4, and Bard) were evaluated. Uniform prompts were used to input case information, including questions relating to the case, and each AI tool's responses were analysed for whether they matched the most popular human response, or fell within 10% or within 20% of it.
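The alignment criteria above can be read as comparing the vote share of the option an AI tool picked against the vote share of the poll's most popular option. A minimal sketch of one plausible implementation follows; the function name, the percentage-point interpretation of the 10%/20% bands, and the example poll are illustrative assumptions, not taken from the study.

```python
def classify_alignment(ai_choice, poll):
    """Classify an AI tool's answer against a human treatment poll.

    poll: dict mapping answer option -> percentage of human votes.
    Returns one of: 'most popular', 'within 10%', 'within 20%', 'outside 20%'.
    NOTE: this is an illustrative reading of the abstract's criteria,
    not the authors' actual scoring code.
    """
    top_share = max(poll.values())
    share = poll.get(ai_choice, 0.0)  # unlisted options count as 0% support
    if share == top_share:
        return "most popular"
    if top_share - share <= 10:
        return "within 10%"
    if top_share - share <= 20:
        return "within 20%"
    return "outside 20%"


# Hypothetical poll: 55% chose ORIF, 30% non-operative, 15% arthroplasty.
poll = {"ORIF": 55.0, "non-operative": 30.0, "arthroplasty": 15.0}
print(classify_alignment("ORIF", poll))           # -> most popular
print(classify_alignment("non-operative", poll))  # 25-point gap -> outside 20%
```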

RESULTS

In total, 8 clinical categories comprising 97 questions were analysed. ChatGPT 4 demonstrated the highest proportion of most popular responses (ChatGPT 4 68.0%, ChatGPT 3.5 40.2%, Bard 45.4%, P < 0.001), outperforming the other AI tools. AI tools performed worse on questions considered controversial (those where human responses disagreed). Inter-tool agreement, evaluated using Cohen's kappa coefficient, ranged from 0.201 (ChatGPT 4 vs. Bard) to 0.634 (ChatGPT 3.5 vs. Bard). AI tool responses also varied widely, reflecting a need for consistency in real-world clinical applications.
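Cohen's kappa, used above to quantify inter-tool agreement, corrects observed agreement for the agreement two raters would reach by chance given their marginal answer frequencies. A self-contained sketch, with made-up answer sequences (the example data are not from the study):

```python
from collections import Counter


def cohens_kappa(a, b):
    """Cohen's kappa for two raters' categorical labels.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is the agreement expected by chance from each rater's marginals.
    """
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    ca, cb = Counter(a), Counter(b)
    # chance agreement if each rater labelled independently at its own rates
    p_e = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Hypothetical answer choices from two AI tools across six poll questions.
tool_a = ["A", "B", "A", "C", "B", "A"]
tool_b = ["A", "B", "C", "C", "A", "A"]
print(round(cohens_kappa(tool_a, tool_b), 3))  # -> 0.478
```

A kappa near 0.2 (as for ChatGPT 4 vs. Bard) indicates only slight agreement beyond chance, while 0.634 (ChatGPT 3.5 vs. Bard) is conventionally read as substantial agreement.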

CONCLUSIONS

While AI tools demonstrated potential use in educational contexts, their integration into clinical decision-making requires caution due to inconsistent responses and deviations from peer consensus. Future research should focus on specialised clinical AI tool development to maximise utility in clinical decision-making.

LEVEL OF EVIDENCE

IV.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2224/10749221/0926f607a676/gr1.jpg
