Koçak Murat, Oğuz Ali Kemal, Akçalı Zafer
Department of Medical Informatics, Faculty of Medicine, Baskent University, Ankara, Turkey.
Department of Internal Medicine, Faculty of Medicine, Baskent University, Ankara, Turkey.
BMC Med Educ. 2025 Apr 25;25(1):609. doi: 10.1186/s12909-025-07148-0.
OBJECTIVE: To evaluate the performance of four advanced large language models (LLMs): OpenAI's ChatGPT 4, Google AI's Gemini 1.5 Pro, Cohere's Command R+, and Meta AI's Llama 3 70B, on questions from the Turkish Medical Specialty Training Entrance Exam (2021, 1st semester), and to analyze how interpretable their answers are for users in languages other than English.

METHODS: The study used questions from the Basic Medical Sciences and Clinical Medical Sciences sections of the Turkish Medical Specialty Training Entrance Exam held on March 21, 2021. The 240 questions were presented to the LLMs in Turkish, and their responses were scored against the official answer key published by the Student Selection and Placement Centre.

RESULTS: ChatGPT 4 was the best-performing model, with an overall accuracy of 88.75%. Llama 3 70B followed with 79.17%, Gemini 1.5 Pro achieved 78.13%, and Command R+ lagged behind at 50%. ChatGPT 4 was strong on both basic and clinical medical science questions. Performance varied with question difficulty, with ChatGPT 4 maintaining high accuracy even on the most challenging questions.

CONCLUSIONS: ChatGPT 4 and Llama 3 70B achieved satisfactory results on the Turkish Medical Specialty Training Entrance Exam, demonstrating their potential as safe sources of basic and clinical medical sciences knowledge in languages other than English. These LLMs could be valuable resources for medical education and clinical support in non-English-speaking regions. However, Gemini 1.5 Pro and Command R+ show potential but need significant improvement to compete with the best-performing models.
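The scoring described in METHODS reduces to comparing each model's multiple-choice answers against the official answer key and reporting the fraction correct. A minimal Python sketch of that computation follows; the model labels and the four illustrative answers are hypothetical placeholders, not the study's actual data, which covered all 240 questions.

```python
# Sketch of the scoring step from METHODS: compare each model's
# multiple-choice answers against the official key and report accuracy.
# All answers below are hypothetical; the real exam has 240 questions.
from typing import Dict, List

def accuracy(predictions: List[str], answer_key: List[str]) -> float:
    """Fraction of questions answered correctly."""
    assert len(predictions) == len(answer_key)
    correct = sum(p == a for p, a in zip(predictions, answer_key))
    return correct / len(answer_key)

# Official answer key (illustrative letters, not the published key).
answer_key = ["A", "C", "E", "B"]

model_answers: Dict[str, List[str]] = {
    "ChatGPT 4":   ["A", "C", "E", "B"],
    "Llama 3 70B": ["A", "C", "D", "B"],
}

for model, answers in model_answers.items():
    print(f"{model}: {accuracy(answers, answer_key):.2%}")
```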