Ayik Gokhan, Kolac Ulas Can, Aksoy Taha, Yilmaz Abdurrahman, Sili Mazlum Veysel, Tokgozoglu Mazhar, Huri Gazi
Department of Orthopedics and Traumatology, Yuksek Ihtisas University Faculty of Medicine, Ankara, Türkiye.
Department of Orthopedics and Traumatology, Hacettepe University Faculty of Medicine, Ankara, Türkiye.
Acta Orthop Traumatol Turc. 2025 Mar 17;59(1):18-26. doi: 10.5152/j.aott.2025.24090.
The aim of this study was to evaluate and compare the performance of the artificial intelligence (AI) models ChatGPT-3.5, ChatGPT-4, and Gemini on the Turkish Specialization Training and Development Examination (UEGS) to determine their utility in medical education and their potential to improve patient care.
This retrospective study analyzed the responses of ChatGPT-3.5, ChatGPT-4, and Gemini to 1000 true-or-false questions from the UEGS administered over 5 years (2018-2023). Questions, encompassing 9 orthopedic subspecialties, were categorized by 2 independent residents, with discrepancies resolved by a senior author. The AI models were restarted for each query so that no information was retained between questions. Performance was evaluated by calculating net scores and comparing them with orthopedic resident scores obtained from the Turkish Orthopedics and Traumatology Education Council (TOTEK) database. Statistical analyses included chi-squared tests, Bonferroni-adjusted Z tests, Cochran's Q test, and receiver operating characteristic (ROC) analysis to determine the optimal question length for AI accuracy.
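As a rough illustration of this analysis pipeline, the sketch below shows how the Cochran's Q, chi-squared, and ROC steps could be reproduced in Python; the variable names, the synthetic placeholder data, and the use of SciPy/statsmodels/scikit-learn are assumptions for illustration, not the authors' actual code.

```python
# Hypothetical sketch: compare per-question correctness of the three models with
# Cochran's Q and a chi-squared test, then relate question length to accuracy via ROC.
import numpy as np
from scipy.stats import chi2_contingency
from statsmodels.stats.contingency_tables import cochrans_q
from sklearn.metrics import roc_curve, roc_auc_score

# Placeholder data: correct[i, j] = 1 if model j answered question i correctly;
# columns = ChatGPT-3.5, ChatGPT-4, Gemini. letter_count[i] = length of question i.
rng = np.random.default_rng(0)
correct = rng.integers(0, 2, size=(1000, 3))
letter_count = rng.integers(80, 400, size=1000)

# Cochran's Q: do the three models differ in accuracy on the same question set?
q_res = cochrans_q(correct)
print(f"Cochran's Q = {q_res.statistic:.2f}, P = {q_res.pvalue:.4f}")

# Chi-squared test on pooled correct/incorrect counts per model.
counts = np.vstack([correct.sum(axis=0), (1 - correct).sum(axis=0)])
chi2, p, _, _ = chi2_contingency(counts)
print(f"chi2 = {chi2:.2f}, P = {p:.4f}")

# ROC analysis: question length as a predictor of ChatGPT-4 correctness.
# Shorter questions are hypothesized to be easier, so the length is negated.
fpr, tpr, thresholds = roc_curve(correct[:, 1], -letter_count)
youden_j = tpr - fpr
optimal_length = -thresholds[np.argmax(youden_j)]  # Youden's J gives the cutoff
print(f"AUC = {roc_auc_score(correct[:, 1], -letter_count):.3f}, "
      f"optimal question length ~ {optimal_length:.0f} letters")
```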
Significant differences in AI tool accuracy were observed across years and subspecialties (P < .001). ChatGPT-4 consistently outperformed the other models, achieving the highest overall accuracy (95% in specific subspecialties). Notably, ChatGPT-4 demonstrated superior performance in Basic and General Orthopedics and in Foot and Ankle Surgery, whereas Gemini and ChatGPT-3.5 showed variable accuracy across topics and years. Receiver operating characteristic analysis revealed a significant relationship between shorter letter counts and higher accuracy for ChatGPT-4 (P = .002). ChatGPT-4 showed a significant negative correlation between letter count and accuracy across all years (r = -0.099, P = .002) and, unlike the other AI models, outperformed residents in Basic and General Orthopedics (P = .015) and Trauma (P = .012).
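One way to obtain such a length-accuracy correlation is a point-biserial correlation between per-question letter count and binary correctness; whether the authors used exactly this statistic is an assumption, and the data below are placeholders, not the study's actual responses.

```python
# Hypothetical sketch of the letter count vs. accuracy correlation reported for ChatGPT-4.
# pointbiserialr correlates a binary variable (correct/incorrect) with a continuous one
# (question length in letters); the paper reports r = -0.099, P = .002 for ChatGPT-4.
import numpy as np
from scipy.stats import pointbiserialr

rng = np.random.default_rng(1)
letter_count = rng.integers(80, 400, size=1000)   # placeholder question lengths
correct = rng.integers(0, 2, size=1000)           # placeholder 0/1 correctness

r, p = pointbiserialr(correct, letter_count)
print(f"point-biserial r = {r:.3f}, P = {p:.4f}")
```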
The findings underscore the advancing role of AI in the medical field, with ChatGPT-4 demonstrating significant potential as a tool for medical education and clinical decision-making. Continuous evaluation and refinement of AI technologies are essential to enhance their educational and clinical impact.