Performance of DeepSeek-R1 and ChatGPT-4o on the Chinese National Medical Licensing Examination: A Comparative Study.

Authors

Wu Jin, Wang Zhiheng, Qin Yifan

Affiliation

Department of Anesthesiology, Affiliated Hospital of Jiangsu University, Zhenjiang, 212001, China.

Publication

J Med Syst. 2025 Jun 3;49(1):74. doi: 10.1007/s10916-025-02213-z.

Abstract

Large language models (LLMs) have a significant impact on medical education due to their advanced natural language processing capabilities. ChatGPT-4o (Chat Generative Pre-trained Transformer), a mainstream Western LLM, demonstrates powerful multimodal abilities. DeepSeek-R1, a newly released free and open-source LLM from China, demonstrates capabilities on par with ChatGPT-4o across various domains. This study aims to evaluate the performance of DeepSeek-R1 and ChatGPT-4o on the Chinese National Medical Licensing Examination (CNMLE) and to explore performance differences between LLMs from distinct linguistic environments in Chinese medical education. We evaluated both LLMs using 600 multiple-choice questions from the written part of the 2024 CNMLE, covering four units. The questions were categorized into low- and high-difficulty groups. The primary outcome was the overall accuracy rate of each LLM. The secondary outcomes were accuracy within each of the four units and within each of the two difficulty groups. DeepSeek-R1 achieved an overall accuracy of 92.0%, significantly higher than ChatGPT-4o's 87.2% (P < 0.05). In the low-difficulty group, DeepSeek-R1 achieved an accuracy of 95.9%, significantly higher than ChatGPT-4o's 92.0% (P < 0.05). No statistically significant differences were observed between the models in any of the four units or in the high-difficulty group (P > 0.05). DeepSeek-R1 demonstrated a performance advantage on the CNMLE.
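The overall comparison can be checked with a short calculation. The abstract does not name the statistical test used, so the sketch below assumes an unpaired Pearson chi-square test on a 2x2 table, with correct-answer counts back-calculated from the rounded percentages (92.0% and 87.2% of 600 questions). Both the test choice and the reconstructed counts are assumptions, and the function name is hypothetical. Because both models answered the same questions, a paired test such as McNemar's would arguably fit better, but it needs per-question agreement counts that the abstract does not report.

```python
import math


def two_proportion_chi2(correct_a, correct_b, n):
    """Pearson chi-square test (1 df, no continuity correction).

    Compares two models that each answered n questions, treating the
    two sets of answers as independent samples (an assumption).
    """
    wrong_a = n - correct_a
    wrong_b = n - correct_b
    total = 2 * n
    # Standard shortcut formula for a 2x2 table:
    # chi2 = N * (ad - bc)^2 / (row1 * row2 * col1 * col2)
    numerator = total * (correct_a * wrong_b - wrong_a * correct_b) ** 2
    denominator = n * n * (correct_a + correct_b) * (wrong_a + wrong_b)
    chi2 = numerator / denominator
    # Survival function of the chi-square distribution with 1 df
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


N_QUESTIONS = 600
# Counts reconstructed from the rounded percentages in the abstract
deepseek_correct = round(0.920 * N_QUESTIONS)
chatgpt_correct = round(0.872 * N_QUESTIONS)

chi2, p = two_proportion_chi2(deepseek_correct, chatgpt_correct, N_QUESTIONS)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```

Under these assumptions the p-value falls below 0.05, in line with the significant overall difference the abstract reports.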

