Guo Siyin, Li Genpeng, Du Wei, Situ Fangzhi, Li Zhihui, Lei Jianyong
Division of Thyroid Surgery, Department of General Surgery, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China; The Laboratory of Thyroid and Parathyroid Disease, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu, China.
Beijing Medical Vision Times Technology Development Company Limited, Beijing, China.
Int J Med Inform. 2025 Aug;200:105906. doi: 10.1016/j.ijmedinf.2025.105906. Epub 2025 Apr 4.
To assess the application of two large language models (LLMs), ChatGPT-4.0 and ERNIE Bot-4.0, in surgical resident examinations and to compare the performance of these LLMs with that of human residents.
In this study, 596 questions with a total of 183,556 recorded responses were first drawn from Medical Vision World, an authoritative medical education platform in China. Each question was input into ChatGPT-4.0 and ERNIE Bot-4.0 both with and without a Chinese prompt to compare their performance on a Chinese question database. Additionally, another 210 surgical questions with detailed response results from 43 residents were screened to compare the performance of the residents with that of the two LLMs.
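A minimal sketch, not the authors' code, of how an exam question might be submitted to ChatGPT-4.0 with and without a Chinese instruction prompt; the prompt wording, model identifier, and answer handling are assumptions, since the abstract does not describe the exact setup, and ERNIE Bot-4.0 would be queried analogously through Baidu's API.

```python
# Hypothetical sketch of the prompted vs. non-prompted query setup; the prompt text
# and model name are assumptions, not taken from the study.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = "你是外科住院医师考试的考生,请从选项中选出唯一正确答案,只回复选项字母。"  # hypothetical prompt

def ask(question: str, with_prompt: bool) -> str:
    """Return the model's raw answer to one multiple-choice exam question."""
    messages = []
    if with_prompt:
        messages.append({"role": "system", "content": PROMPT})
    messages.append({"role": "user", "content": question})
    response = client.chat.completions.create(model="gpt-4", messages=messages)
    return response.choices[0].message.content.strip()
```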
There were no significant differences in the correctness of either LLM's responses to the 596 questions with or without prompts (ChatGPT-4.0: 68.96% without prompts vs. 71.14% with prompts, p = 0.411; ERNIE Bot-4.0: 78.36% without prompts vs. 78.86% with prompts, p = 0.832), but ERNIE Bot-4.0 achieved higher correctness than ChatGPT-4.0 (with prompts: p = 0.002; without prompts: p < 0.001). On the additional 210 prompted questions, the two LLMs, especially ERNIE Bot-4.0 (ranking in the top 95% of the 43 residents' scores), significantly outperformed the residents.
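An illustrative check, not the authors' analysis script, of how the reported proportion comparisons could be reproduced; the abstract does not state which statistical test was used, so the chi-square test and the reconstruction of correct/incorrect counts from the percentages are assumptions.

```python
# Assumed reconstruction of the ChatGPT-4.0 vs. ERNIE Bot-4.0 comparison without
# prompts (68.96% vs. 78.36% of 596 questions) using a 2x2 chi-square test.
from scipy.stats import chi2_contingency

N = 596  # questions answered by both models

def correct_counts(rate_pct: float, n: int = N) -> tuple[int, int]:
    """Convert a reported correctness percentage into (correct, incorrect) counts."""
    correct = round(rate_pct / 100 * n)
    return correct, n - correct

table = [correct_counts(68.96), correct_counts(78.36)]
chi2, p, _, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")  # yields p < 0.001, consistent with the abstract
```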
ERNIE Bot-4.0 outperformed both ChatGPT-4.0 and the residents on surgical resident examination questions from a Chinese question database.