

From algorithms to operating room: can large language models master China's attending anesthesiology exam? A cross-sectional evaluation.

Author Information

He Qiyu, Tan Zhimin, Niu Wang, Chen Dongxu, Zhang Xian, Qin Feng, Yuan Jiuhong

Affiliations

Department of Urology and Andrology Laboratory, West China Hospital, Sichuan University, Chengdu, Sichuan Province, China.

Department of Anesthesiology, West China Hospital, Sichuan University, Sichuan Province, China.

Publication Information

Int J Surg. 2025 Sep 4. doi: 10.1097/JS9.0000000000003406.

Abstract

OBJECTIVE

The performance of large language models (LLMs) on complex clinical reasoning tasks is not well established. This study compares ChatGPT (GPT-3.5, GPT-4) and DeepSeek (DeepSeek-V3, DeepSeek-R1) on the Chinese Anesthesiology Attending Physician Examination (CAAPE), aiming to establish AI benchmarks for medical assessments and to enhance AI-driven medical education.

METHODS

This cross-sectional study assessed four versions of two major LLMs on the 2025 CAAPE question bank (5,647 questions). Testing employed multiple querying strategies and languages, with subgroup analyses by subspecialty, knowledge type, and question format. The evaluation focused on LLM performance in clinical and logical reasoning tasks, measuring accuracy, error types, and response time.
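The paper does not publish its querying code, so the following Python sketch only illustrates the kind of protocol the methods describe: posing each multiple-choice item to a chat model with or without a system role, then recording accuracy and response time. The toy question, the SYSTEM_ROLE prompt, and the query_model() helper are assumptions for illustration, not the authors' implementation.

```python
import time

# Toy stand-in for a CAAPE item; the real bank has 5,647 questions.
QUESTIONS = [
    {
        "stem": "Which induction agent is preferred in a hypovolemic patient?",
        "options": {"A": "Propofol", "B": "Etomidate", "C": "Thiopental", "D": "Midazolam"},
        "answer": "B",
        "subspecialty": "cardiac",
        "qtype": "A1",
    },
]

SYSTEM_ROLE = (
    "You are an attending anesthesiologist sitting the Chinese "
    "Anesthesiology Attending Physician Examination. Answer with a "
    "single option letter."
)

def query_model(model: str, messages: list[dict]) -> str:
    """Hypothetical chat-completion wrapper; swap in a real API client.
    Here it always answers 'B' so the sketch runs end to end."""
    return "B"

def evaluate(model: str, use_system_role: bool) -> dict:
    """Score one model under one querying strategy."""
    correct, latencies = 0, []
    for q in QUESTIONS:
        prompt = q["stem"] + "\n" + "\n".join(
            f"{letter}. {text}" for letter, text in q["options"].items()
        )
        messages = [{"role": "system", "content": SYSTEM_ROLE}] if use_system_role else []
        messages.append({"role": "user", "content": prompt})
        start = time.perf_counter()
        reply = query_model(model, messages)
        latencies.append(time.perf_counter() - start)
        correct += reply.strip().upper().startswith(q["answer"])
    return {
        "accuracy": correct / len(QUESTIONS),
        "mean_latency_s": sum(latencies) / len(latencies),
    }

print(evaluate("deepseek-r1", use_system_role=True))
```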

RESULTS

DeepSeek-R1 (70.6%-73.4%) and GPT-4 (68.6%-70.3%) outperformed DeepSeek-V3 (53.1%-55.5%) and GPT-3.5 (52.2%-55.7%) across all querying strategies. A system role (SR) improved performance, while the joint-response strategy degraded it. DeepSeek-R1 outperformed GPT-4 in complex subspecialties, reaching peak accuracy (73.4%) when the SR was combined with the initial-response strategy. The GPT models performed better with English queries than with Chinese ones. All models excelled on basic-knowledge and Type A1 questions but struggled with clinical scenarios and advanced reasoning. Despite DeepSeek-R1's stronger performance, its response times were longer. Errors were primarily logical and informational (over 70%), and more than half were high-risk clinical errors.
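As a companion sketch, subgroup accuracies like those above can be tabulated from per-response records with pandas. The rows below are fabricated placeholders for illustration only; they are not the study's data.

```python
import pandas as pd

# Placeholder per-response records: one row per (model, strategy, question).
records = [
    {"model": "DeepSeek-R1", "strategy": "SR+IR", "subspecialty": "cardiac", "correct": 1},
    {"model": "DeepSeek-R1", "strategy": "joint", "subspecialty": "pain", "correct": 1},
    {"model": "GPT-4", "strategy": "SR+IR", "subspecialty": "cardiac", "correct": 1},
    {"model": "GPT-4", "strategy": "joint", "subspecialty": "pain", "correct": 0},
]
df = pd.DataFrame(records)

# Accuracy per model under each querying strategy
print(df.groupby(["model", "strategy"])["correct"].mean().unstack())

# Accuracy per model within each subspecialty
print(df.groupby(["model", "subspecialty"])["correct"].mean().unstack())
```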

CONCLUSION

LLMs show promise in complex clinical reasoning but risk critical errors in high-stakes settings. While useful for education and decision support, their error potential must be carefully assessed in such environments.

