Xiong Yu-Tao, Zhan Zheng-Zhe, Zhong Cheng-Lan, Zeng Wei, Guo Ji-Xiang, Tang Wei, Liu Chang
State Key Laboratory of Oral Diseases & National Center for Stomatology & National Clinical Research Center for Oral Diseases & Department of Oral and Maxillofacial Surgery, West China Hospital of Stomatology, Sichuan University, Chengdu, China.
Machine Intelligence Laboratory, College of Computer Science, Sichuan University, Chengdu, China.
Eur J Dent Educ. 2025 May;29(2):332-340. doi: 10.1111/eje.13073. Epub 2025 Jan 31.
This study aimed to simulate diverse scenarios in which students employ large language models (LLMs) to prepare for the Chinese Dental Licensing Examination (CDLE), providing a detailed evaluation of their performance in medical education.
A stratified random sampling strategy was implemented to select, and subsequently revise, 200 questions from the CDLE. Seven LLMs recognised for their strong performance in the Chinese domain were selected as test subjects. Three distinct testing scenarios were constructed: answering questions, explaining questions and adversarial testing. The evaluation metrics were accuracy, agreement rate and teaching effectiveness score. Wald χ² tests and Kruskal-Wallis tests were employed to determine whether the differences among the LLMs across the scenarios, and before and after adversarial testing, were statistically significant.
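As a minimal sketch of how such an analysis could be run (the counts and Likert ratings below are illustrative placeholders, not the study's data, and the use of `scipy` is an assumption rather than anything stated by the authors), the two tests map onto standard `scipy.stats` routines:

```python
import numpy as np
from scipy.stats import chi2_contingency, kruskal

# Hypothetical (correct, incorrect) counts on 200 questions for three
# of the models; illustrative values only, not the study's data.
counts = {
    "Doubao-pro 32k": (162, 38),  # 81% accuracy
    "Qwen2-72b":      (162, 38),  # 81% accuracy
    "GPT-4":          (150, 50),  # 75% accuracy (made up for the sketch)
}

# Chi-squared test on the models x (correct/incorrect) contingency table,
# analogous to a Wald chi-squared comparison of accuracy rates.
table = np.array(list(counts.values()))
chi2, p_acc, dof, _ = chi2_contingency(table)
print(f"accuracy comparison: chi2={chi2:.2f}, dof={dof}, p={p_acc:.3f}")

# Kruskal-Wallis test on per-question Likert teaching-effectiveness
# scores (1-5), one sample of ratings per model; again illustrative.
rng = np.random.default_rng(0)
likert = [rng.integers(3, 6, size=200) for _ in counts]
h, p_teach = kruskal(*likert)
print(f"teaching effectiveness: H={h:.2f}, p={p_teach:.3f}")
```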
The majority of the tested LLMs met the passing threshold on the CDLE benchmark, with Doubao-pro 32k and Qwen2-72b achieving the highest accuracy (81% each). Doubao-pro 32k also demonstrated the highest agreement rate with the reference answers when providing explanations (98%). Although the Likert-scale teaching effectiveness scores differed to a statistically significant degree across the LLMs, all models demonstrated a commendable ability to deliver comprehensible and effective instructional content. In adversarial testing, GPT-4 exhibited the smallest decline in accuracy (2%, p = 0.623), while ChatGLM-4 showed the smallest reduction in agreement rate (14.6%, p = 0.001).
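The pre/post adversarial accuracy comparison can likewise be sketched as an unpooled Wald test for the difference of two proportions; this is one plausible reading of the paper's "Wald χ² tests" rather than a confirmed reproduction, and the counts below are hypothetical, chosen only to mirror a 2-point drop over 200 items:

```python
from math import sqrt
from scipy.stats import norm

def wald_two_proportion_test(x1: int, n1: int, x2: int, n2: int):
    """Unpooled Wald z-test for the difference of two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = (p1 - p2) / se
    return p1 - p2, 2 * norm.sf(abs(z))  # difference, two-sided p-value

# Hypothetical counts: 79% accuracy before adversarial testing,
# 77% after, on the same 200 questions.
diff, p = wald_two_proportion_test(158, 200, 154, 200)
print(f"accuracy drop = {diff:.1%}, p = {p:.3f}")
```

Squaring the z statistic gives the equivalent one-degree-of-freedom Wald χ² value, and with these illustrative counts the p-value lands near the non-significant level reported for GPT-4's small decline.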
LLMs trained on Chinese corpora, such as Doubao-pro 32k, outperformed GPT-4 in answering and explaining questions, although the difference was not statistically significant. During adversarial testing, however, all models exhibited diminished performance, with GPT-4 displaying comparatively greater robustness. Future research should further investigate the interpretability of LLM outputs and develop strategies to mitigate hallucinations in medical education applications.