
The ability of artificial intelligence tools to formulate orthopaedic clinical decisions in comparison to human clinicians: An analysis of ChatGPT 3.5, ChatGPT 4, and Bard.

Authors

Agharia Suzen, Szatkowski Jan, Fraval Andrew, Stevens Jarrad, Zhou Yushy

Affiliations

Department of Orthopaedic Surgery, St. Vincent's Hospital, Melbourne, Victoria, Australia.

Department of Orthopaedic Surgery, Indiana University Health Methodist Hospital, Indianapolis, IN, USA.

Publication information

J Orthop. 2023 Dec 1;50:1-7. doi: 10.1016/j.jor.2023.11.063. eCollection 2024 Apr.

Abstract

BACKGROUND

Recent advancements in artificial intelligence (AI) have sparked interest in its integration into clinical medicine and education. This study evaluates the performance of three AI tools compared to human clinicians in addressing complex orthopaedic decisions in real-world clinical cases.

QUESTIONS/PURPOSES

To evaluate the ability of commonly used AI tools to formulate orthopaedic clinical decisions in comparison to human clinicians.

PATIENTS AND METHODS

The study used OrthoBullets Cases, a publicly available clinical case collaboration platform on which surgeons from around the world choose treatment options in peer-reviewed, standardised treatment polls. The clinical cases cover various orthopaedic categories. Three AI tools (ChatGPT 3.5, ChatGPT 4, and Bard) were evaluated. Uniform prompts were used to input case information, including questions relating to the case, and each AI tool's responses were analysed for whether they matched the most popular human response, or fell within 10% or within 20% of it.
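The alignment criteria above can be read as comparing the vote share of the option an AI tool picked against the vote share of the poll's most popular option. A minimal sketch of one plausible implementation follows; the function name, the percentage-point interpretation of the 10%/20% bands, and the example poll are illustrative assumptions, not taken from the study.

```python
def classify_alignment(ai_choice, poll):
    """Classify an AI tool's answer against a human treatment poll.

    poll: dict mapping answer option -> percentage of human votes.
    Returns one of: 'most popular', 'within 10%', 'within 20%', 'outside 20%'.
    NOTE: this is an illustrative reading of the abstract's criteria,
    not the authors' actual scoring code.
    """
    top_share = max(poll.values())
    share = poll.get(ai_choice, 0.0)  # unlisted options count as 0% support
    if share == top_share:
        return "most popular"
    if top_share - share <= 10:
        return "within 10%"
    if top_share - share <= 20:
        return "within 20%"
    return "outside 20%"


# Hypothetical poll: 55% chose ORIF, 30% non-operative, 15% arthroplasty.
poll = {"ORIF": 55.0, "non-operative": 30.0, "arthroplasty": 15.0}
print(classify_alignment("ORIF", poll))           # -> most popular
print(classify_alignment("non-operative", poll))  # 25-point gap -> outside 20%
```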

RESULTS

In total, 8 clinical categories comprising 97 questions were analysed. ChatGPT 4 demonstrated the highest proportion of most popular responses (ChatGPT 4 68.0%, ChatGPT 3.5 40.2%, Bard 45.4%, P < 0.001), outperforming the other AI tools. AI tools performed worse on questions considered controversial (those where human responses disagreed). Inter-tool agreement, evaluated using Cohen's kappa coefficient, ranged from 0.201 (ChatGPT 4 vs. Bard) to 0.634 (ChatGPT 3.5 vs. Bard). AI tool responses also varied widely, reflecting a need for consistency in real-world clinical applications.
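Cohen's kappa, used above to quantify inter-tool agreement, corrects observed agreement for the agreement two raters would reach by chance given their marginal answer frequencies. A self-contained sketch, with made-up answer sequences (the example data are not from the study):

```python
from collections import Counter


def cohens_kappa(a, b):
    """Cohen's kappa for two raters' categorical labels.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is the agreement expected by chance from each rater's marginals.
    """
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    ca, cb = Counter(a), Counter(b)
    # chance agreement if each rater labelled independently at its own rates
    p_e = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Hypothetical answer choices from two AI tools across six poll questions.
tool_a = ["A", "B", "A", "C", "B", "A"]
tool_b = ["A", "B", "C", "C", "A", "A"]
print(round(cohens_kappa(tool_a, tool_b), 3))  # -> 0.478
```

A kappa near 0.2 (as for ChatGPT 4 vs. Bard) indicates only slight agreement beyond chance, while 0.634 (ChatGPT 3.5 vs. Bard) is conventionally read as substantial agreement.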

CONCLUSIONS

While AI tools demonstrated potential use in educational contexts, their integration into clinical decision-making requires caution due to inconsistent responses and deviations from peer consensus. Future research should focus on specialised clinical AI tool development to maximise utility in clinical decision-making.

LEVEL OF EVIDENCE

IV.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2224/10749221/0926f607a676/gr1.jpg
