Wang Weiping, Fu Jingxuan, Zhang Yiming, Hu Ke
Department of Radiation Oncology, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences & Peking Union Medical College, No. 1 Shuaifuyuan Wangfujing, Beijing, 100730, China.
Department of Clinical Laboratory, Xuanwu Hospital, Capital Medical University, Beijing, China.
J Cancer Educ. 2025 May 26. doi: 10.1007/s13187-025-02652-9.
Large language models (LLMs) are increasingly utilized in medical education and practice, yet their application in niche fields such as radiation oncology remains underexplored. This study evaluates and compares the performance of OpenAI's GPT-4o and Baidu's ERNIE Bot on a Chinese-language radiation oncology examination. We employed the Chinese National Health Professional Technical Qualification Examination (Intermediate Level) for Radiation Oncology, using a question bank of 1128 items across four sections: Basic Knowledge, Relevant Knowledge, Specialized Knowledge, and Practice Competence. A passing score required an accuracy rate of 60% or higher in all sections. The models' responses were assessed for accuracy against the standard answers, with key metrics including overall accuracy, section-specific performance, case analysis performance, and accuracy consensus between the models. The overall accuracy rates were 79.3% for GPT-4o and 76.9% for ERNIE Bot (p = 0.154). Across the four sections, GPT-4o achieved accuracy rates of 82.1%, 84.6%, 78.6%, and 60.9%, respectively, while ERNIE Bot achieved 81.6%, 73.9%, 77.9%, and 69.0%. In the Relevant Knowledge section, GPT-4o achieved significantly higher accuracy (p = 0.002), while no significant differences were found in the other three sections. Across various question types (single-choice, multiple-answer, case analysis, non-case analysis, and the different content areas within case analysis), both models exhibited satisfactory accuracy, and ERNIE Bot achieved accuracy rates comparable to those of GPT-4o. The accuracy consensus between the two models was 84.5%, significantly exceeding the individual accuracy rates of GPT-4o (p = 0.003) and ERNIE Bot (p < 0.001). Both GPT-4o and ERNIE Bot successfully passed this highly specialized Chinese-language medical examination in radiation oncology and demonstrated comparable performance.
This study provides valuable insights into the application of LLMs in Chinese medical education. These findings support the integration of LLMs in medical education and training within specialized, non-English-speaking contexts.
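The abstract does not state which statistical test produced the reported p-values; for two models graded on the same 1128-item bank, a paired test such as McNemar's would also be common. As an illustrative sketch only, the overall-accuracy comparison (79.3% vs. 76.9%, p = 0.154) can be reproduced with a standard Pearson chi-square test on a 2x2 correct/incorrect table:

```python
import math

N = 1128                       # total questions (from the abstract)
a_correct = round(0.793 * N)   # GPT-4o overall accuracy: 79.3%
b_correct = round(0.769 * N)   # ERNIE Bot overall accuracy: 76.9%

def chi2_2x2(a_yes, a_no, b_yes, b_no):
    """Pearson chi-square statistic and p-value (1 dof) for a 2x2 table."""
    n = a_yes + a_no + b_yes + b_no
    row1, row2 = a_yes + a_no, b_yes + b_no
    col1, col2 = a_yes + b_yes, a_no + b_no
    chi2 = 0.0
    for obs, r, c in [(a_yes, row1, col1), (a_no, row1, col2),
                      (b_yes, row2, col1), (b_no, row2, col2)]:
        expected = r * c / n
        chi2 += (obs - expected) ** 2 / expected
    # Survival function of chi-square with 1 dof: P(X > x) = erfc(sqrt(x/2))
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

chi2, p = chi2_2x2(a_correct, N - a_correct, b_correct, N - b_correct)
print(f"chi2 = {chi2:.3f}, p = {p:.3f}")
```

Run on these counts, the sketch yields a p-value close to the reported 0.154, i.e. no significant difference in overall accuracy at the 0.05 level; the exact figure depends on which test the authors actually applied.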