Comparative evaluation of AI platforms "Google Gemini 2.5 Flash, Google Gemini 2.0 Flash, DeepSeek V3 and ChatGPT 4o" in solving multiple-choice questions from different subtopics of anatomy.

Author Information

Singal Anjali, Goyal Swati

Affiliations

Department of Anatomy, All India Institute of Medical Sciences, Bathinda, Punjab, 151001, India.

Department of Radiodiagnosis, Gandhi Medical College, Bhopal, Madhya Pradesh, India.

Publication Information

Surg Radiol Anat. 2025 Aug 30;47(1):193. doi: 10.1007/s00276-025-03707-8.

Abstract

PURPOSE

The rise of artificial intelligence (AI) based large language models (LLMs) has had a profound impact on medical education. Given the widespread use of multiple-choice questions (MCQs) in anatomy education, such questions are likely to be commonly posed to AI tools. The current study compared the accuracy of different AI platforms in solving MCQs from various subtopics of anatomy.

METHODS

A total of 55 MCQs from different subtopics of anatomy were posed to four AI platforms: Google Gemini 2.0 Flash, Google Gemini 2.5 Flash, DeepSeek V3, and ChatGPT 4o. The accuracy rate of each platform was calculated, and the chi-square test and Fisher's exact test were employed for statistical analysis.
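The pairwise comparison described above can be sketched with a stdlib-only implementation of Fisher's exact test for a 2×2 contingency table (correct vs. incorrect answers for two platforms). The counts in the usage example are hypothetical and not taken from the study.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns the p-value: the total hypergeometric probability of all
    tables (with the same margins) at most as likely as the observed one.
    """
    n = a + b + c + d
    row1 = a + b          # total trials for platform 1
    col1 = a + c          # total correct answers across both platforms
    denom = comb(n, col1)

    def p_table(x):
        # Probability of a table whose top-left cell is x, margins fixed.
        return comb(row1, x) * comb(n - row1, col1 - x) / denom

    p_obs = p_table(a)
    lo = max(0, col1 - (n - row1))   # smallest feasible top-left cell
    hi = min(row1, col1)             # largest feasible top-left cell
    # Sum probabilities of all tables no more likely than the observed one
    # (small relative tolerance guards against floating-point ties).
    return sum(p_table(x) for x in range(lo, hi + 1)
               if p_table(x) <= p_obs * (1 + 1e-9))


# Hypothetical example: platform A answers 55/55 correctly,
# platform B answers 51/55 correctly (counts are illustrative only).
p = fisher_exact_2x2(55, 0, 51, 4)
print(f"p = {p:.4f}")  # p > 0.05 would indicate no significant difference
```

For small samples such as 55 questions per platform, Fisher's exact test is preferred over the chi-square test when expected cell counts are low, which is presumably why the study employed both.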

RESULTS

An overall accuracy of 95.9% was observed across all platforms. Ranked by performance, Google Gemini 2.5 Flash performed best, followed by Google Gemini 2.0 Flash, ChatGPT 4o, and DeepSeek V3. No statistically significant performance difference among the AI models was observed. General Anatomy was identified as the most challenging area for all models.

CONCLUSIONS

All models performed exceptionally well on anatomy-focused MCQs, with the Google Gemini models showing the best overall performance. However, certain errors persist that cannot be overlooked, highlighting the continued need for human oversight and expert validation. To the best of the available literature, this is the first study to include MCQs from different anatomy subtopics and to compare the performance of DeepSeek and Google Gemini 2.5 Flash on this task.
