Agharia Suzen, Szatkowski Jan, Fraval Andrew, Stevens Jarrad, Zhou Yushy
Department of Orthopaedic Surgery, St. Vincent's Hospital, Melbourne, Victoria, Australia.
Department of Orthopaedic Surgery, Indiana University Health Methodist Hospital, Indianapolis, IN, USA.
J Orthop. 2023 Dec 1;50:1-7. doi: 10.1016/j.jor.2023.11.063. eCollection 2024 Apr.
Recent advancements in artificial intelligence (AI) have sparked interest in its integration into clinical medicine and education. This study evaluates the performance of three AI tools compared to human clinicians in addressing complex orthopaedic decisions in real-world clinical cases.
QUESTIONS/PURPOSES: To evaluate the ability of commonly used AI tools to formulate orthopaedic clinical decisions in comparison to human clinicians.
The study used OrthoBullets Cases, a publicly available clinical case collaboration platform where surgeons from around the world choose treatment options through peer-reviewed, standardised treatment polls. The clinical cases cover various orthopaedic categories. Three AI tools (ChatGPT 3.5, ChatGPT 4, and Bard) were evaluated. Uniform prompts were used to input case information, including questions relating to the case, and the AI tools' responses were analysed for alignment with the most popular human response, as well as for responses falling within 10% and within 20% of the most popular human response.
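The three-tier scoring described above (exact match with the most popular option, or within 10% or 20% of its vote share) can be sketched as follows. This is a hypothetical illustration of the scoring logic, not the authors' actual code; the `score_response` helper and the poll data are invented for the example.

```python
def score_response(poll: dict[str, float], ai_choice: str) -> dict[str, bool]:
    """Classify an AI tool's chosen option against a human treatment poll.

    `poll` maps each treatment option to its percentage of human votes.
    Returns whether the AI's choice is the most popular option, and whether
    its vote share falls within 10 or 20 percentage points of the top option.
    (Hypothetical helper; thresholds follow the study's description.)
    """
    top_share = max(poll.values())
    share = poll.get(ai_choice, 0.0)  # unlisted options score 0% support
    return {
        "most_popular": share == top_share,
        "within_10": top_share - share <= 10.0,
        "within_20": top_share - share <= 20.0,
    }

# Example with an invented poll: 62% ORIF, 30% nonoperative, 8% arthroplasty
poll = {"ORIF": 62.0, "Nonoperative": 30.0, "Arthroplasty": 8.0}
print(score_response(poll, "ORIF"))
print(score_response(poll, "Nonoperative"))
```

An AI tool choosing "ORIF" here matches the most popular response; one choosing "Nonoperative" (32 points behind) falls outside both the 10% and 20% windows.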
In total, 8 clinical categories comprising 97 questions were analysed. ChatGPT 4 demonstrated the highest proportion of most popular responses (ChatGPT 4 68.0%, ChatGPT 3.5 40.2%, Bard 45.4%; P < 0.001), outperforming the other AI tools. AI tools performed worse on questions considered controversial (i.e., where human responses showed disagreement). Inter-tool agreement, evaluated using Cohen's kappa coefficient, ranged from 0.201 (ChatGPT 4 vs. Bard) to 0.634 (ChatGPT 3.5 vs. Bard). Overall, AI tool responses varied widely, reflecting a need for consistency in real-world clinical applications.
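Cohen's kappa, used above to measure inter-tool agreement, corrects observed agreement for the agreement expected by chance from each rater's marginal distribution. A minimal sketch (not the study's code; the rating sequences below are invented):

```python
from collections import Counter

def cohens_kappa(ratings_a: list[str], ratings_b: list[str]) -> float:
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is chance agreement derived from each rater's marginal frequencies.
    """
    assert len(ratings_a) == len(ratings_b), "raters must label the same items"
    n = len(ratings_a)
    # Observed agreement: fraction of items both raters labelled identically
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: sum over categories of the product of marginal rates
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    p_e = sum(counts_a[k] * counts_b[k] for k in counts_a.keys() & counts_b.keys()) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Two hypothetical tools answering four multiple-choice questions
print(cohens_kappa(["A", "A", "B", "B"], ["A", "A", "B", "A"]))  # 0.5
```

Values near 0.201 (as between ChatGPT 4 and Bard) indicate agreement only slightly above chance, while 0.634 (ChatGPT 3.5 vs. Bard) indicates moderate-to-substantial agreement on common benchmark scales.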
While AI tools demonstrated potential use in educational contexts, their integration into clinical decision-making requires caution due to inconsistent responses and deviations from peer consensus. Future research should focus on specialised clinical AI tool development to maximise utility in clinical decision-making.
LEVEL OF EVIDENCE: IV.