Tassoker Melek
Department of Dentomaxillofacial Radiology, Faculty of Dentistry, Necmettin Erbakan University, Konya, Turkey.
Clin Anat. 2025 Jul 24. doi: 10.1002/ca.70012.
This study evaluates the performance of ChatGPT-4o (OpenAI), DeepSeek-v3 (DeepSeek), Gemini 2.0 (Google DeepMind), and Claude 3.7 Sonnet (Anthropic) in answering anatomy questions from the Turkish Dental Specialty Admission Exam (DUS), comparing their accuracy, response times, and answer lengths. A total of 74 text-based multiple-choice anatomy questions from DUS exams administered between 2012 and 2021 were analyzed. The questions varied in difficulty and included both basic anatomical identification and clinically oriented scenarios; the majority focused on head and neck anatomy, followed by the thorax, neuroanatomy, and musculoskeletal regions, all of which are particularly relevant to dental education. The accuracy of answers was evaluated against official sources, and response times and word counts were recorded. Statistical analyses, including the Kruskal-Wallis and Cochran's Q tests, were used to compare performance differences. ChatGPT-4o demonstrated the highest accuracy (98.6%), while the other three models each achieved 89.2%. Gemini produced the fastest responses (mean: 4.47 s), whereas DeepSeek generated the shortest answers and Gemini the longest (p < 0.001). The differences in accuracy, response times, and word count were statistically significant (p < 0.05). ChatGPT-4o outperformed the other models in accuracy on DUS anatomy questions, suggesting its superior potential as a tool for dental education. Future research should explore the integration of LLMs into structured learning programs.
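The accuracy comparison described above calls for Cochran's Q test, which compares paired binary outcomes (correct/incorrect per question) across several raters, here the four models. A minimal pure-Python sketch of the test statistic, using made-up scores rather than the study's data:

```python
# Cochran's Q test statistic for k paired binary raters.
# The score matrix below is hypothetical, NOT the study's data.

def cochrans_q(rows):
    """rows: one list per question, with a 0/1 outcome per model.
    Returns the Q statistic, which follows a chi-square
    distribution with k-1 degrees of freedom under H0."""
    k = len(rows[0])                                      # number of models
    col = [sum(r[j] for r in rows) for j in range(k)]     # per-model totals
    row = [sum(r) for r in rows]                          # per-question totals
    T = sum(row)                                          # grand total
    num = (k - 1) * (k * sum(c * c for c in col) - T * T)
    den = k * T - sum(r * r for r in row)
    return num / den

# Hypothetical outcomes: 6 questions x 4 models (1 = correct)
scores = [
    [1, 1, 1, 1],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [1, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
]
Q = cochrans_q(scores)   # compare against chi-square with k-1 = 3 df
```

Questions on which all models agree cancel out of the statistic, so Q is driven only by the questions where the models disagree. The Kruskal-Wallis comparison of response times and word counts would follow the same pattern with a rank-based statistic (e.g. `scipy.stats.kruskal`).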