
A Comparative Analysis of GPT-4o and ERNIE Bot in a Chinese Radiation Oncology Exam.

Authors

Wang Weiping, Fu Jingxuan, Zhang Yiming, Hu Ke

Affiliations

Department of Radiation Oncology, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences & Peking Union Medical College, No. 1 Shuaifuyuan Wangfujing, Beijing, 100730, China.

Department of Clinical Laboratory, Xuanwu Hospital, Capital Medical University, Beijing, China.

Publication

J Cancer Educ. 2025 May 26. doi: 10.1007/s13187-025-02652-9.

PMID: 40418520
Abstract

Large language models (LLMs) are increasingly utilized in medical education and practice, yet their application in niche fields such as radiation oncology remains underexplored. This study evaluates and compares the performance of OpenAI's GPT-4o and Baidu's ERNIE Bot in a Chinese-language radiation oncology examination. We employed the Chinese National Health Professional Technical Qualification Examination (Intermediate Level) for Radiation Oncology, using a question bank of 1128 items across four sections: Basic Knowledge, Relevant Knowledge, Specialized Knowledge, and Practice Competence. A passing score required an accuracy rate of 60% or higher in all sections. The models' responses were assessed for accuracy against standard answers, with key metrics including overall accuracy, section-specific performance, case analysis performance, and accuracy consensus between the models. The overall accuracy rates were 79.3% for GPT-4o and 76.9% for ERNIE Bot (p = 0.154). Across the four sections, GPT-4o achieved accuracy rates of 82.1%, 84.6%, 78.6%, and 60.9%, respectively, while ERNIE Bot achieved 81.6%, 73.9%, 77.9%, and 69.0%. In the Relevant Knowledge section, GPT-4o achieved significantly higher accuracy (p = 0.002), while no significant differences were found in the other three sections. Across various question types (single-choice, multiple-answer, case analysis, non-case analysis, and different content areas of case analysis), both models exhibited satisfactory accuracy, and ERNIE Bot achieved accuracy rates comparable to GPT-4o. The accuracy consensus between the two models was 84.5%, significantly exceeding the individual accuracy rates of GPT-4o (p = 0.003) and ERNIE Bot (p < 0.001). Both GPT-4o and ERNIE Bot successfully passed this highly specialized Chinese-language medical examination in radiation oncology and demonstrated comparable performance. This study provides valuable insights into the application of LLMs in Chinese medical education. These findings support the integration of LLMs in medical education and training within specialized, non-English-speaking contexts.


Similar Articles

1. A Comparative Analysis of GPT-4o and ERNIE Bot in a Chinese Radiation Oncology Exam. J Cancer Educ. 2025 May 26. doi: 10.1007/s13187-025-02652-9.
2. Qwen-2.5 Outperforms Other Large Language Models in the Chinese National Nursing Licensing Examination: Retrospective Cross-Sectional Comparative Study. JMIR Med Inform. 2025 Jan 10;13:e63731. doi: 10.2196/63731.
3. Assessing the Accuracy and Reliability of Large Language Models in Psychiatry Using Standardized Multiple-Choice Questions: Cross-Sectional Study. J Med Internet Res. 2025 May 20;27:e69910. doi: 10.2196/69910.
4. Comparison of artificial intelligence-generated and physician-generated patient education materials on early diabetic kidney disease. Front Endocrinol (Lausanne). 2025 Apr 22;16:1559265. doi: 10.3389/fendo.2025.1559265.
5. Comparing the Accuracy of Two Generated Large Language Models in Identifying Health-Related Rumors or Misconceptions and the Applicability in Health Science Popularization: Proof-of-Concept Study. JMIR Form Res. 2024 Dec 2;8:e63188. doi: 10.2196/63188.
6. Comparing the performance of ChatGPT and ERNIE Bot in answering questions regarding liver cancer interventional radiology in Chinese and English contexts: A comparative study. Digit Health. 2025 Jan 23;11:20552076251315511. doi: 10.1177/20552076251315511.
7. Comparative performance analysis of global and Chinese-domain large language models for myopia. Eye (Lond). 2025 Apr 13. doi: 10.1038/s41433-025-03775-5.
8. The performance of ChatGPT and ERNIE Bot in surgical resident examinations. Int J Med Inform. 2025 Aug;200:105906. doi: 10.1016/j.ijmedinf.2025.105906.
9. Physician Versus Large Language Model Chatbot Responses to Web-Based Questions From Autistic Patients in Chinese: Cross-Sectional Comparative Analysis. J Med Internet Res. 2024 Apr 30;26:e54706. doi: 10.2196/54706.
10. Evaluating the Effectiveness of advanced large language models in medical Knowledge: A Comparative study using Japanese national medical examination. Int J Med Inform. 2025 Jan;193:105673. doi: 10.1016/j.ijmedinf.2024.105673.

References Cited in This Article

1. Comparing the Accuracy of Two Generated Large Language Models in Identifying Health-Related Rumors or Misconceptions and the Applicability in Health Science Popularization: Proof-of-Concept Study. JMIR Form Res. 2024 Dec 2;8:e63188. doi: 10.2196/63188.
2. Performance of ChatGPT and Bard on the medical licensing examinations varies across different cultures: a comparison study. BMC Med Educ. 2024 Nov 26;24(1):1372. doi: 10.1186/s12909-024-06309-x.
3. ChatGPT-4 Omni Performance in USMLE Disciplines and Clinical Skills: Comparative Analysis. JMIR Med Educ. 2024 Nov 6;10:e63430. doi: 10.2196/63430.
4. Assessing knowledge about medical physics in language-generative AI with large language model: using the medical physicist exam. Radiol Phys Technol. 2024 Dec;17(4):929-937. doi: 10.1007/s12194-024-00838-2.
5. Influence of Model Evolution and System Roles on ChatGPT's Performance in Chinese Medical Licensing Exams: Comparative Study. JMIR Med Educ. 2024 Aug 13;10:e52784. doi: 10.2196/52784.
6. Assessing Generative Pretrained Transformers (GPT) in Clinical Decision-Making: Comparative Analysis of GPT-3.5 and GPT-4. J Med Internet Res. 2024 Jun 27;26:e54571. doi: 10.2196/54571.
7. Assessing the role of advanced artificial intelligence as a tool in multidisciplinary tumor board decision-making for primary head and neck cancer cases. Front Oncol. 2024 May 24;14:1353031. doi: 10.3389/fonc.2024.1353031.
8. Physician Versus Large Language Model Chatbot Responses to Web-Based Questions From Autistic Patients in Chinese: Cross-Sectional Comparative Analysis. J Med Internet Res. 2024 Apr 30;26:e54706. doi: 10.2196/54706.
9. Large language models leverage external knowledge to extend clinical insight beyond language boundaries. J Am Med Inform Assoc. 2024 Sep 1;31(9):2054-2064. doi: 10.1093/jamia/ocae079.
10. Benchmarking ChatGPT-4 on a radiation oncology in-training exam and Red Journal Gray Zone cases: potentials and challenges for ai-assisted medical education and decision making in radiation oncology. Front Oncol. 2023 Sep 14;13:1265024. doi: 10.3389/fonc.2023.1265024.