
Performance of AI Models vs. Orthopedic Residents in Turkish Specialty Training Development Exams in Orthopedics.

Author Information

Ipek Enver, Sulek Yusuf, Balkanli Bahadir

Affiliations

Department of Orthopedics, University of Health Sciences Türkiye, Sisli Hamidiye Etfal Training and Research Hospital, Istanbul, Türkiye.

Publication Information

Sisli Etfal Hastan Tip Bul. 2025 Feb 7;59(2):151-155. doi: 10.14744/SEMB.2025.65289. eCollection 2025.

DOI: 10.14744/SEMB.2025.65289
PMID: 40756288
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12314458/
Abstract

OBJECTIVES

As artificial intelligence (AI) continues to advance, its integration into medical education and clinical decision making has attracted considerable attention. Large language models, such as ChatGPT-4o, Gemini, Bing AI, and DeepSeek, have demonstrated potential in supporting healthcare professionals, particularly in specialty training examinations. However, the extent to which these models can independently match or surpass human performance in specialized medical assessments remains uncertain. This study aimed to systematically compare the performance of these AI models with orthopedic residents in the Specialty Training Development Exams (UEGS) conducted between 2010 and 2021, focusing on their accuracy, depth of explanation, and clinical applicability.

METHODS

This retrospective comparative study involved presenting the UEGS questions to ChatGPT-4o, Gemini, Bing AI, and DeepSeek. Orthopedic residents who took the exams during 2010-2021 served as the control group. The responses were evaluated for accuracy, explanatory details, and clinical applicability. Statistical analysis was conducted using SPSS Version 27, with one-way ANOVA and post-hoc tests for performance comparison.
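
The group comparison described above can be illustrated with a small sketch. The Python example below is an illustration only: the study itself used SPSS Version 27, and the group means, sample sizes, and random seed here are invented for demonstration. It runs a one-way ANOVA across the five responder groups and a Tukey HSD post-hoc test to show which pairwise accuracy differences are significant.

```python
# Illustrative sketch only (not the authors' SPSS workflow): compare answer
# accuracy across responder groups with a one-way ANOVA and a Tukey HSD
# post-hoc test, using made-up per-exam accuracy scores.
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)

# Hypothetical per-exam accuracy (%) for each responder group; the real study
# scored the 2010-2021 UEGS exams for ChatGPT-4o, Gemini, Bing AI, DeepSeek,
# and orthopedic residents.
groups = {
    "ChatGPT-4o": rng.normal(55, 3, 12),
    "Gemini": rng.normal(76, 5, 12),
    "Bing AI": rng.normal(80, 7, 12),
    "DeepSeek": rng.normal(72, 5, 12),
    "Residents": rng.normal(49, 3, 12),
}

# One-way ANOVA: does mean accuracy differ across the five groups?
f_stat, p_value = stats.f_oneway(*groups.values())
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.4f}")

# Tukey HSD post-hoc test: which pairs of groups differ significantly?
scores = np.concatenate(list(groups.values()))
labels = np.repeat(list(groups.keys()), [len(v) for v in groups.values()])
print(pairwise_tukeyhsd(scores, labels).summary())
```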

RESULTS

All AI models outperformed orthopedic residents in terms of accuracy. Bing AI demonstrated the highest accuracy rates (64.0% to 93.0%), followed by Gemini (66.0% to 87.0%) and DeepSeek (63.5% to 81.0%). ChatGPT-4o showed the lowest accuracy among the AI models (51.0% to 59.5%). Orthopedic residents consistently had the lowest accuracy overall (43.95% to 53.45%). Bing AI, Gemini, and DeepSeek showed knowledge levels equivalent to over 5 years of medical experience, while ChatGPT-4o corresponded to 2-5 years.

CONCLUSION

This study showed that AI models, especially Bing AI and Gemini, perform at a high level in orthopedic specialty examinations and have potential as educational support tools. However, the lower accuracy of ChatGPT-4o reduced its suitability for assessment. Despite these limitations, AI shows promise in medical education. Future research should focus on improving reliability, incorporating visual data interpretation, and exploring clinical integration.

Figure: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e399/12314458/02ad184d4d4d/SEMB-59-151-g001.jpg

Similar Articles

1. Evaluating ChatGPT and DeepSeek in postdural puncture headache management: a comparative study with international consensus guidelines.
BMC Neurol. 2025 Jul 1;25(1):264. doi: 10.1186/s12883-025-04280-8.
2. Accuracy of ChatGPT-3.5, ChatGPT-4o, Copilot, Gemini, Claude, and Perplexity in advising on lumbosacral radicular pain against clinical practice guidelines: cross-sectional study.
Front Digit Health. 2025 Jun 27;7:1574287. doi: 10.3389/fdgth.2025.1574287. eCollection 2025.
3. Who Knows Anatomy Best? A Comparative Study of ChatGPT-4o, DeepSeek, Gemini, and Claude.
Clin Anat. 2025 Jul 24. doi: 10.1002/ca.70012.
4. How Well Do Different AI Language Models Inform Patients About Radiofrequency Ablation for Varicose Veins?
Cureus. 2025 Jun 22;17(6):e86537. doi: 10.7759/cureus.86537. eCollection 2025 Jun.
5. "Dr. AI Will See You Now": How Do ChatGPT-4 Treatment Recommendations Align With Orthopaedic Clinical Practice Guidelines?
Clin Orthop Relat Res. 2024 Dec 1;482(12):2098-2106. doi: 10.1097/CORR.0000000000003234. Epub 2024 Sep 6.
6. Artificial Intelligence in Peripheral Artery Disease Education: A Battle Between ChatGPT and Google Gemini.
Cureus. 2025 Jun 1;17(6):e85174. doi: 10.7759/cureus.85174. eCollection 2025 Jun.
7. Cognitive Domain Assessment of Artificial Intelligence Chatbots: A Comparative Study Between ChatGPT and Gemini's Understanding of Anatomy Education.
Med Sci Educ. 2025 Feb 15;35(3):1295-1304. doi: 10.1007/s40670-025-02303-0. eCollection 2025 Jun.
8. Evaluating the Reasoning Capabilities of Large Language Models for Medical Coding and Hospital Readmission Risk Stratification: Zero-Shot Prompting Approach.
J Med Internet Res. 2025 Jul 30;27:e74142. doi: 10.2196/74142.

Cited By

1. ChatGPT-4.0 or DeepSeek-V3? Comparative analysis of answers to the most frequently asked questions by total knee replacement candidate patients.
Medicine (Baltimore). 2025 Aug 22;104(34):e43951. doi: 10.1097/MD.0000000000043951.

References

1. Evaluation of ChatGPT-4 Performance in Answering Patients' Questions About the Management of Type 2 Diabetes.
Sisli Etfal Hastan Tip Bul. 2024 Dec 24;58(4):483-490. doi: 10.14744/SEMB.2024.23697. eCollection 2024.
2. ChatGPT, Bard, and Bing Chat Are Large Language Processing Models That Answered Orthopaedic In-Training Examination Questions With Similar Accuracy to First-Year Orthopaedic Surgery Residents.
Arthroscopy. 2025 Mar;41(3):557-562. doi: 10.1016/j.arthro.2024.08.023. Epub 2024 Aug 28.
3. Reference Hallucination Score for Medical Artificial Intelligence Chatbots: Development and Usability Study.
JMIR Med Inform. 2024 Jul 31;12:e54345. doi: 10.2196/54345.
4. Generative Artificial Intelligence Performs at a Second-Year Orthopedic Resident Level.
Cureus. 2024 Mar 13;16(3):e56104. doi: 10.7759/cureus.56104. eCollection 2024 Mar.
5. The Rapid Development of Artificial Intelligence: GPT-4's Performance on Orthopedic Surgery Board Questions.
Orthopedics. 2024 Mar-Apr;47(2):e85-e89. doi: 10.3928/01477447-20230922-05. Epub 2023 Sep 27.
6. Evaluating ChatGPT Performance on the Orthopaedic In-Training Examination.
JB JS Open Access. 2023 Sep 8;8(3). doi: 10.2106/JBJS.OA.23.00056. eCollection 2023 Jul-Sep.
7. Comparison of ChatGPT-3.5, ChatGPT-4, and Orthopaedic Resident Performance on Orthopaedic Assessment Examinations.
J Am Acad Orthop Surg. 2023 Dec 1;31(23):1173-1179. doi: 10.5435/JAAOS-D-23-00396. Epub 2023 Sep 4.
8. Can Artificial Intelligence Pass the American Board of Orthopaedic Surgery Examination? Orthopaedic Residents Versus ChatGPT.
Clin Orthop Relat Res. 2023 Aug 1;481(8):1623-1630. doi: 10.1097/CORR.0000000000002704. Epub 2023 May 23.