
Performance of AI Models vs. Orthopedic Residents in Turkish Specialty Training Development Exams in Orthopedics.

Author Information

Ipek Enver, Sulek Yusuf, Balkanli Bahadir

Affiliations

Department of Orthopedics, University of Health Sciences Türkiye, Sisli Hamidiye Etfal Training and Research Hospital, Istanbul, Türkiye.

Publication Information

Sisli Etfal Hastan Tip Bul. 2025 Feb 7;59(2):151-155. doi: 10.14744/SEMB.2025.65289. eCollection 2025.

DOI: 10.14744/SEMB.2025.65289
PMID: 40756288
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12314458/
Abstract

OBJECTIVES

As artificial intelligence (AI) continues to advance, its integration into medical education and clinical decision making has attracted considerable attention. Large language models, such as ChatGPT-4o, Gemini, Bing AI, and DeepSeek, have demonstrated potential in supporting healthcare professionals, particularly in specialty training examinations. However, the extent to which these models can independently match or surpass human performance in specialized medical assessments remains uncertain. This study aimed to systematically compare the performance of these AI models with orthopedic residents in the Specialty Training Development Exams (UEGS) conducted between 2010 and 2021, focusing on their accuracy, depth of explanation, and clinical applicability.

METHODS

This retrospective comparative study involved presenting the UEGS questions to ChatGPT-4o, Gemini, Bing AI, and DeepSeek. Orthopedic residents who took the exams during 2010-2021 served as the control group. The responses were evaluated for accuracy, explanatory details, and clinical applicability. Statistical analysis was conducted using SPSS Version 27, with one-way ANOVA and post-hoc tests for performance comparison.
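
The group comparison described above can be illustrated with a small sketch. The Python example below is an illustration only: the study itself used SPSS Version 27, and the group means, sample sizes, and random seed here are invented for demonstration. It runs a one-way ANOVA across the five responder groups and a Tukey HSD post-hoc test to show which pairwise accuracy differences are significant.

```python
# Illustrative sketch only (not the authors' SPSS workflow): compare answer
# accuracy across responder groups with a one-way ANOVA and a Tukey HSD
# post-hoc test, using made-up per-exam accuracy scores.
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)

# Hypothetical per-exam accuracy (%) for each responder group; the real study
# scored the 2010-2021 UEGS exams for ChatGPT-4o, Gemini, Bing AI, DeepSeek,
# and orthopedic residents.
groups = {
    "ChatGPT-4o": rng.normal(55, 3, 12),
    "Gemini": rng.normal(76, 5, 12),
    "Bing AI": rng.normal(80, 7, 12),
    "DeepSeek": rng.normal(72, 5, 12),
    "Residents": rng.normal(49, 3, 12),
}

# One-way ANOVA: does mean accuracy differ across the five groups?
f_stat, p_value = stats.f_oneway(*groups.values())
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.4f}")

# Tukey HSD post-hoc test: which pairs of groups differ significantly?
scores = np.concatenate(list(groups.values()))
labels = np.repeat(list(groups.keys()), [len(v) for v in groups.values()])
print(pairwise_tukeyhsd(scores, labels).summary())
```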

RESULTS

All AI models outperformed orthopedic residents in terms of accuracy. Bing AI demonstrated the highest accuracy rates (64.0% to 93.0%), followed by Gemini (66.0% to 87.0%) and DeepSeek (63.5% to 81.0%). ChatGPT-4o showed the lowest accuracy among the AI models (51.0% to 59.5%). Orthopedic residents consistently had the lowest accuracy overall (43.95% to 53.45%). Bing AI, Gemini, and DeepSeek showed knowledge levels equivalent to over 5 years of medical experience, while ChatGPT-4o corresponded to 2-5 years.

CONCLUSION

This study showed that AI models, especially Bing AI and Gemini, perform at a high level in orthopedic specialty examinations and have potential as educational support tools. However, the lower accuracy of ChatGPT-4o reduced its suitability for assessment. Despite these limitations, AI shows promise in medical education. Future research should focus on improving reliability, incorporating visual data interpretation, and exploring clinical integration.

Figure: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e399/12314458/02ad184d4d4d/SEMB-59-151-g001.jpg

Similar Articles

1. Evaluating ChatGPT and DeepSeek in postdural puncture headache management: a comparative study with international consensus guidelines.
BMC Neurol. 2025 Jul 1;25(1):264. doi: 10.1186/s12883-025-04280-8.
2. Accuracy of ChatGPT-3.5, ChatGPT-4o, Copilot, Gemini, Claude, and Perplexity in advising on lumbosacral radicular pain against clinical practice guidelines: cross-sectional study.
Front Digit Health. 2025 Jun 27;7:1574287. doi: 10.3389/fdgth.2025.1574287. eCollection 2025.
3. Who Knows Anatomy Best? A Comparative Study of ChatGPT-4o, DeepSeek, Gemini, and Claude.
Clin Anat. 2025 Jul 24. doi: 10.1002/ca.70012.
4. How Well Do Different AI Language Models Inform Patients About Radiofrequency Ablation for Varicose Veins?
Cureus. 2025 Jun 22;17(6):e86537. doi: 10.7759/cureus.86537. eCollection 2025 Jun.
5. "Dr. AI Will See You Now": How Do ChatGPT-4 Treatment Recommendations Align With Orthopaedic Clinical Practice Guidelines?
Clin Orthop Relat Res. 2024 Dec 1;482(12):2098-2106. doi: 10.1097/CORR.0000000000003234. Epub 2024 Sep 6.
6. Artificial Intelligence in Peripheral Artery Disease Education: A Battle Between ChatGPT and Google Gemini.
Cureus. 2025 Jun 1;17(6):e85174. doi: 10.7759/cureus.85174. eCollection 2025 Jun.
7. Cognitive Domain Assessment of Artificial Intelligence Chatbots: A Comparative Study Between ChatGPT and Gemini's Understanding of Anatomy Education.
Med Sci Educ. 2025 Feb 15;35(3):1295-1304. doi: 10.1007/s40670-025-02303-0. eCollection 2025 Jun.
8. Evaluating the Reasoning Capabilities of Large Language Models for Medical Coding and Hospital Readmission Risk Stratification: Zero-Shot Prompting Approach.
J Med Internet Res. 2025 Jul 30;27:e74142. doi: 10.2196/74142.

Cited By

1. ChatGPT-4.0 or DeepSeek-V3? Comparative analysis of answers to the most frequently asked questions by total knee replacement candidate patients.
Medicine (Baltimore). 2025 Aug 22;104(34):e43951. doi: 10.1097/MD.0000000000043951.

References

1. Evaluation of ChatGPT-4 Performance in Answering Patients' Questions About the Management of Type 2 Diabetes.
Sisli Etfal Hastan Tip Bul. 2024 Dec 24;58(4):483-490. doi: 10.14744/SEMB.2024.23697. eCollection 2024.
2. ChatGPT, Bard, and Bing Chat Are Large Language Processing Models That Answered Orthopaedic In-Training Examination Questions With Similar Accuracy to First-Year Orthopaedic Surgery Residents.
Arthroscopy. 2025 Mar;41(3):557-562. doi: 10.1016/j.arthro.2024.08.023. Epub 2024 Aug 28.
3. Reference Hallucination Score for Medical Artificial Intelligence Chatbots: Development and Usability Study.
JMIR Med Inform. 2024 Jul 31;12:e54345. doi: 10.2196/54345.
4. Generative Artificial Intelligence Performs at a Second-Year Orthopedic Resident Level.
Cureus. 2024 Mar 13;16(3):e56104. doi: 10.7759/cureus.56104. eCollection 2024 Mar.
5. The Rapid Development of Artificial Intelligence: GPT-4's Performance on Orthopedic Surgery Board Questions.
Orthopedics. 2024 Mar-Apr;47(2):e85-e89. doi: 10.3928/01477447-20230922-05. Epub 2023 Sep 27.
6. Evaluating ChatGPT Performance on the Orthopaedic In-Training Examination.
JB JS Open Access. 2023 Sep 8;8(3). doi: 10.2106/JBJS.OA.23.00056. eCollection 2023 Jul-Sep.
7. Comparison of ChatGPT-3.5, ChatGPT-4, and Orthopaedic Resident Performance on Orthopaedic Assessment Examinations.
J Am Acad Orthop Surg. 2023 Dec 1;31(23):1173-1179. doi: 10.5435/JAAOS-D-23-00396. Epub 2023 Sep 4.
8. Can Artificial Intelligence Pass the American Board of Orthopaedic Surgery Examination? Orthopaedic Residents Versus ChatGPT.
Clin Orthop Relat Res. 2023 Aug 1;481(8):1623-1630. doi: 10.1097/CORR.0000000000002704. Epub 2023 May 23.