The performance of ChatGPT and ERNIE Bot in surgical resident examinations.

Authors

Guo Siyin, Li Genpeng, Du Wei, Situ Fangzhi, Li Zhihui, Lei Jianyong

Affiliations

Division of Thyroid Surgery, Department of General Surgery, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China; The Laboratory of Thyroid and Parathyroid Disease, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu, China.

Beijing Medical Vision Times Technology Development Company Limited, Beijing, China.

Publication information

Int J Med Inform. 2025 Aug;200:105906. doi: 10.1016/j.ijmedinf.2025.105906. Epub 2025 Apr 4.

DOI: 10.1016/j.ijmedinf.2025.105906
PMID: 40220627
Abstract

STUDY PURPOSE

To assess the application of these two large language models (LLMs) for surgical resident examinations and to compare the performance of these LLMs with that of human residents.

STUDY DESIGN

In this study, 596 questions with a total of 183,556 responses were first included from the Medical Vision World, an authoritative medical education platform across China. Both Chinese prompted and non-prompted questions were input into ChatGPT-4.0 and ERNIE Bot-4.0 to compare their performance in a Chinese question database. Additionally, we screened another 210 surgical questions with detailed response results from 43 residents to compare the performance of residents and these two LLMs.

RESULTS

There were no significant differences in the correctness of the responses to the 596 questions with or without prompts between the two LLMs (ChatGPT-4.0: 68.96 % [without prompt], 71.14 % [with prompts], p = 0.411; ERNIE Bot-4.0: 78.36 % [without prompt], 78.86 % [with prompts], p = 0.832), but ERNIE Bot-4.0 displayed higher correctness than ChatGPT-4.0 did (with prompts: p = 0.002; without prompts: p < 0.001). For another 210 questions with prompts, the two LLMs, especially ERNIE Bot-4.0 (ranking in the top 95 % of the 43 residents' scores), significantly outperformed the residents.

CONCLUSIONS

The performance of ERNIE Bot-4.0 was superior to that of ChatGPT-4.0 and that of residents on surgical resident examinations in a Chinese question database.
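
The p-values in RESULTS can be roughly checked from the reported percentages alone. The sketch below is an illustration, not the authors' analysis: it assumes the comparisons were uncorrected Pearson chi-square tests on 2x2 tables (the abstract does not name the test), and it uses correct-answer counts back-calculated from the percentages of 596 questions (411, 424, 467, 470), which are likewise an assumption. The function name `chi_square_2x2` is hypothetical.

```python
import math


def chi_square_2x2(correct_a, correct_b, total):
    """Uncorrected Pearson chi-square test for two groups of equal size.

    Assumption: this is only one plausible test; the abstract does not
    say which test produced the reported p-values.
    """
    wrong_a = total - correct_a
    wrong_b = total - correct_b
    n = 2 * total
    col_correct = correct_a + correct_b
    col_wrong = wrong_a + wrong_b
    # Shortcut formula for a 2x2 table: n * (ad - bc)^2 / (product of margins).
    chi2 = n * (correct_a * wrong_b - wrong_a * correct_b) ** 2 / (
        total * total * col_correct * col_wrong
    )
    # With 1 degree of freedom, the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


TOTAL = 596  # number of questions reported in the abstract

# Correct-answer counts back-calculated from the reported percentages
# (e.g. 411 / 596 = 68.96 %). These counts are an assumption.
CORRECT = {
    "ChatGPT-4.0, no prompt": 411,
    "ChatGPT-4.0, prompt": 424,
    "ERNIE Bot-4.0, no prompt": 467,
    "ERNIE Bot-4.0, prompt": 470,
}

COMPARISONS = [
    ("ChatGPT-4.0, no prompt", "ChatGPT-4.0, prompt"),
    ("ERNIE Bot-4.0, no prompt", "ERNIE Bot-4.0, prompt"),
    ("ChatGPT-4.0, prompt", "ERNIE Bot-4.0, prompt"),
    ("ChatGPT-4.0, no prompt", "ERNIE Bot-4.0, no prompt"),
]

for first, second in COMPARISONS:
    chi2, p = chi_square_2x2(CORRECT[first], CORRECT[second], TOTAL)
    print(f"{first} vs {second}: chi2 = {chi2:.3f}, p = {p:.4f}")
```

Under these assumptions, the first three comparisons should land close to the reported p = 0.411, p = 0.832, and p = 0.002, and the last should fall below 0.001. A continuity-corrected or exact test would give somewhat different values.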

Similar articles

1. The performance of ChatGPT and ERNIE Bot in surgical resident examinations.
Int J Med Inform. 2025 Aug;200:105906. doi: 10.1016/j.ijmedinf.2025.105906. Epub 2025 Apr 4.

2. Physician Versus Large Language Model Chatbot Responses to Web-Based Questions From Autistic Patients in Chinese: Cross-Sectional Comparative Analysis.
J Med Internet Res. 2024 Apr 30;26:e54706. doi: 10.2196/54706.

3. Comparing the performance of ChatGPT and ERNIE Bot in answering questions regarding liver cancer interventional radiology in Chinese and English contexts: A comparative study.
Digit Health. 2025 Jan 23;11:20552076251315511. doi: 10.1177/20552076251315511. eCollection 2025 Jan-Dec.

4. Comparative performance analysis of global and Chinese-domain large language models for myopia.
Eye (Lond). 2025 Apr 13. doi: 10.1038/s41433-025-03775-5.

5. Application value of generative artificial intelligence in the field of stomatology.
Hua Xi Kou Qiang Yi Xue Za Zhi. 2024 Dec 1;42(6):810-815. doi: 10.7518/hxkq.2024.2024144.

6. Assessing the performance of large language models (LLMs) in answering medical questions regarding breast cancer in the Chinese context.
Digit Health. 2024 Oct 7;10:20552076241284771. doi: 10.1177/20552076241284771. eCollection 2024 Jan-Dec.

7. Performance of Artificial Intelligence Chatbots on Ultrasound Examinations: Cross-Sectional Comparative Analysis.
JMIR Med Inform. 2025 Jan 9;13:e63924. doi: 10.2196/63924.

8. Large Language Models in Summarizing Radiology Report Impressions for Lung Cancer in Chinese: Evaluation Study.
J Med Internet Res. 2025 Apr 3;27:e65547. doi: 10.2196/65547.

9. Evaluating the Effectiveness of Large Language Models in Providing Patient Education for Chinese Patients With Ocular Myasthenia Gravis: Mixed Methods Study.
J Med Internet Res. 2025 Apr 10;27:e67883. doi: 10.2196/67883.

10. Performance of ChatGPT on Nursing Licensure Examinations in the United States and China: Cross-Sectional Study.
JMIR Med Educ. 2024 Oct 3;10:e52746. doi: 10.2196/52746.