Comparison of performance of artificial intelligence tools in answering emergency medicine question pool: ChatGPT 4.0, Google Gemini and Microsoft Copilot.

Author Information

Iskender Aksoy, Merve Kara Arslan

Affiliations

Iskender Aksoy: Department of Emergency Medicine, Faculty of Medicine, Giresun University, 28100, Giresun, Turkey.

Merve Kara Arslan: Department of Emergency Clinic, Bulancak State Hospital, 28300, Bulancak, Giresun, Turkey.

Publication Information

Pak J Med Sci. 2025 Apr;41(4):968-972. doi: 10.12669/pjms.41.4.11178.

Abstract

OBJECTIVE

The use of artificial intelligence tools built on different software architectures for clinical and educational purposes in medicine has attracted considerable interest in recent years. In this study, we compared the answers given by three artificial intelligence chatbots to an emergency medicine question pool drawn from the Turkish National Medical Specialization Exam, and investigated how the content, form, and wording of the questions affected the answers given.

METHODS

Emergency medicine questions from the Medical Specialization Exams held between 2015 and 2020 were collected and posed to three artificial intelligence models: ChatGPT-4, Gemini, and Copilot. The length of each question, the question type, and the topics of the incorrectly answered questions were recorded.
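The abstract does not describe the tooling used to administer the questions. As a purely hypothetical sketch of the data-collection step above, a script along these lines could log each model's answer together with the question's length and type; the ask_chatbot stand-in and the sample record are illustrative, not from the paper.

import csv

def ask_chatbot(model, question):
    """Placeholder: return the option the model picks (e.g. "A").
    How each chatbot was queried (web interface or API) is not stated
    in the abstract, so this stand-in must be filled in."""
    raise NotImplementedError

# (question text, question type, answer key) -- the real study used the
# emergency medicine questions from the 2015-2020 exams; this record is
# illustrative only.
questions = [
    ("A 30-year-old patient presents after blunt trauma ...", "clinical case", "C"),
]

with open("answers.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["model", "question_length", "question_type", "answer", "correct"])
    for model in ("ChatGPT-4", "Gemini", "Copilot"):
        for text, qtype, key in questions:
            answer = ask_chatbot(model, text)
            writer.writerow([model, len(text), qtype, answer, answer == key])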

RESULTS

The most successful chatbot in terms of total score was Microsoft Copilot (7.8% error rate), while the least successful was Google Gemini (22.9% error rate) (p<0.001). Notably, all chatbots had their highest error rates on questions about trauma and surgical approaches, and all made mistakes on burns and pediatrics questions. The increased error rate on questions containing the root word "probability" also showed that question style affected the answers given.
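The abstract reports only the error rates and the p-value, not the raw counts or the statistical test used. As a minimal illustration of how such a difference in error rates can be checked, the sketch below applies a chi-square test of independence; N is an assumed question count, and only the 7.8% and 22.9% rates come from the abstract.

from scipy.stats import chi2_contingency

N = 400  # assumed number of questions per chatbot; not reported in the abstract
errors = {"Copilot": round(0.078 * N), "Gemini": round(0.229 * N)}

# 2x2 contingency table: one row per chatbot, columns = [wrong, correct]
table = [[errors[m], N - errors[m]] for m in ("Copilot", "Gemini")]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, p = {p:.2g}")  # a small p means the error rates differ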

CONCLUSIONS

Although chatbots show promising success in identifying correct answers, we think that examinees should not treat them as a primary study source for the exam, but rather as a useful auxiliary tool to support their learning.
