
Performance of AI Models vs. Orthopedic Residents in Turkish Specialty Training Development Exams in Orthopedics.

Author Information

Ipek Enver, Sulek Yusuf, Balkanli Bahadir

Affiliation

Department of Orthopedics, University of Health Sciences Türkiye, Sisli Hamidiye Etfal Training and Research Hospital, Istanbul, Türkiye.

Publication Information

Sisli Etfal Hastan Tip Bul. 2025 Feb 7;59(2):151-155. doi: 10.14744/SEMB.2025.65289. eCollection 2025.

Abstract

OBJECTIVES

As artificial intelligence (AI) continues to advance, its integration into medical education and clinical decision making has attracted considerable attention. Large language models, such as ChatGPT-4o, Gemini, Bing AI, and DeepSeek, have demonstrated potential in supporting healthcare professionals, particularly in specialty training examinations. However, the extent to which these models can independently match or surpass human performance in specialized medical assessments remains uncertain. This study aimed to systematically compare the performance of these AI models with orthopedic residents in the Specialty Training Development Exams (UEGS) conducted between 2010 and 2021, focusing on their accuracy, depth of explanation, and clinical applicability.

METHODS

This retrospective comparative study involved presenting the UEGS questions to ChatGPT-4o, Gemini, Bing AI, and DeepSeek. Orthopedic residents who took the exams during 2010-2021 served as the control group. The responses were evaluated for accuracy, explanatory details, and clinical applicability. Statistical analysis was conducted using SPSS Version 27, with one-way ANOVA and post-hoc tests for performance comparison.
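The abstract does not include the analysis scripts; as a rough illustration of the comparison described above (one-way ANOVA followed by post-hoc testing), the Python sketch below uses scipy and statsmodels. The authors' actual analysis was run in SPSS Version 27, the abstract does not name the post-hoc test, and the per-group scores here are hypothetical placeholders, not data from the study.

```python
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical per-exam accuracy scores (%); placeholders, not study data.
scores = {
    "ChatGPT-4o": [51.0, 55.0, 59.5],
    "Gemini": [66.0, 75.0, 87.0],
    "Bing AI": [64.0, 80.0, 93.0],
    "DeepSeek": [63.5, 72.0, 81.0],
    "Residents": [43.95, 48.0, 53.45],
}

# One-way ANOVA: do mean accuracies differ across the five groups?
f_stat, p_value = f_oneway(*scores.values())
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.4f}")

# Post-hoc pairwise comparisons (Tukey HSD shown as one common choice;
# the abstract does not specify which post-hoc test was used).
values = np.concatenate(list(scores.values()))
groups = np.repeat(list(scores.keys()), [len(v) for v in scores.values()])
print(pairwise_tukeyhsd(values, groups))
```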

RESULTS

All AI models outperformed orthopedic residents in terms of accuracy. Bing AI demonstrated the highest accuracy rates (64.0% to 93.0%), followed by Gemini (66.0% to 87.0%) and DeepSeek (63.5% to 81.0%). ChatGPT-4o showed the lowest accuracy among the AI models (51.0% to 59.5%). Orthopedic residents consistently had the lowest accuracy overall (43.95% to 53.45%). Bing AI, Gemini, and DeepSeek showed knowledge levels equivalent to more than 5 years of medical experience, while ChatGPT-4o corresponded to 2-5 years.

CONCLUSION

This study showed that AI models, especially Bing AI and Gemini, perform at a high level in orthopedic specialty examinations and have potential as educational support tools. However, the lower accuracy of ChatGPT-4o limits its suitability for assessment. Despite these limitations, AI shows promise in medical education. Future research should focus on improving reliability, incorporating visual data interpretation, and exploring clinical integration.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e399/12314458/02ad184d4d4d/SEMB-59-151-g001.jpg
