Iskender Aksoy, Merve Kara Arslan
Iskender Aksoy Department of Emergency Medicine, Faculty of Medicine, Giresun University, 28100, Giresun, Turkey.
Merve Kara Arslan Emergency Clinic, Bulancak State Hospital, 28300, Bulancak, Giresun, Turkey.
Pak J Med Sci. 2025 Apr;41(4):968-972. doi: 10.12669/pjms.41.4.11178.
The use of artificial intelligence tools built on different software architectures for clinical and educational purposes in medicine has recently attracted considerable interest. In this study, we compared the answers given by three different artificial intelligence chatbots to an emergency medicine question pool drawn from the Turkish National Medical Specialization Exam. We investigated how the content, format, and wording of the questions affected the answers by classifying the questions and examining the question stems.
Emergency medicine questions from the Medical Specialization Exams held between 2015 and 2020 were collected. The questions were posed to three artificial intelligence models: ChatGPT-4, Gemini, and Copilot. The length of each question, the question type, and the topics of incorrectly answered questions were recorded.
In terms of total score, the most successful chatbot was Microsoft Copilot (7.8% error rate), while the least successful was Google Gemini (22.9% error rate) (p<0.001). Notably, all chatbots had their highest error rates on questions about trauma and surgical approaches, and also made mistakes on burns and pediatrics questions. The higher error rates on questions containing the word "probability" further indicated that question style affected the answers given.
Although chatbots show promising success in identifying the correct answer, we think that examinees should not treat them as a primary source for exam preparation, but rather as a useful auxiliary tool to support their learning.