

Evaluating the Performance of Large Language Models (LLMs) in Answering and Analysing the Chinese Dental Licensing Examination.

Author Information

Xiong Yu-Tao, Zhan Zheng-Zhe, Zhong Cheng-Lan, Zeng Wei, Guo Ji-Xiang, Tang Wei, Liu Chang

Affiliations

State Key Laboratory of Oral Diseases & National Center for Stomatology & National Clinical Research Center for Oral Diseases & Department of Oral and Maxillofacial Surgery, West China Hospital of Stomatology, Sichuan University, Chengdu, China.

Machine Intelligence Laboratory, College of Computer Science, Sichuan University, Chengdu, China.

Publication Information

Eur J Dent Educ. 2025 May;29(2):332-340. doi: 10.1111/eje.13073. Epub 2025 Jan 31.

DOI: 10.1111/eje.13073
PMID: 39889108
Abstract

BACKGROUND

This study aimed to simulate diverse scenarios of students employing LLMs to prepare for the Chinese Dental Licensing Examination (CDLE), providing a detailed evaluation of their performance in medical education.

METHODS

A stratified random sampling strategy was implemented to select and subsequently revise 200 questions from the CDLE. Seven LLMs, recognised for their exceptional performance in the Chinese domain, were selected as test subjects. Three distinct testing scenarios were constructed: answering questions, explaining questions and adversarial testing. The evaluation metrics included accuracy, agreement rate and teaching effectiveness score. Wald χ² tests and Kruskal-Wallis tests were employed to determine whether the differences among the LLMs across various scenarios and before and after adversarial testing were statistically significant.
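The stratified sampling step described above can be sketched as follows. This is an illustrative, stdlib-only sketch: the question bank, subject strata and proportional-allocation rule are assumptions for demonstration, not the study's actual protocol.

```python
# Sketch of stratified random sampling: draw a fixed total number of
# questions, allocating to each subject stratum in proportion to its size.
import random
from collections import defaultdict

def stratified_sample(questions, key, n_total, seed=0):
    """Sample n_total items, allocating to strata proportionally to size."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for q in questions:
        strata[key(q)].append(q)
    sample = []
    for items in strata.values():
        k = round(n_total * len(items) / len(questions))
        sample.extend(rng.sample(items, min(k, len(items))))
    return sample

# Toy bank: 300 questions spread evenly over three hypothetical subjects.
bank = [{"id": i, "subject": ["surgery", "periodontics", "pharmacology"][i % 3]}
        for i in range(300)]
picked = stratified_sample(bank, key=lambda q: q["subject"], n_total=200)
print(len(picked))  # → 201 (67 per stratum after rounding)
```

Proportional allocation can overshoot the target slightly because of per-stratum rounding, which is why 200 requested questions yield 201 here.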

RESULTS

The majority of the tested LLMs met the passing threshold on the CDLE benchmark, with Doubao-pro 32k and Qwen2-72b (81%) achieving the highest accuracy rates. Doubao-pro 32k demonstrated the highest 98% agreement rate with the reference answers when providing explanations. Although statistically significant differences existed among various LLMs in their teaching effectiveness scores based on the Likert scale, all these models demonstrated a commendable ability to deliver comprehensible and effective instructional content. In adversarial testing, GPT-4 exhibited the smallest decline in accuracy (2%, p = 0.623), while ChatGLM-4 demonstrated the least reduction in agreement rate (14.6%, p = 0.001).
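A Wald test on two proportions, of the kind used for the before/after adversarial comparison above, can be sketched as follows. The counts are hypothetical, chosen only to illustrate a 2% accuracy drop on 200 questions; they are not the study's raw data.

```python
# Wald z-test for the difference between two proportions (accuracy
# before vs. after adversarial rephrasing), using only the stdlib.
import math

def wald_two_proportion(correct1, n1, correct2, n2):
    """Return (difference in proportions, two-sided Wald p-value)."""
    p1, p2 = correct1 / n1, correct2 / n2
    pooled = (correct1 + correct2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal CDF via erf.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return p1 - p2, p_value

# e.g. 160/200 correct before adversarial testing, 156/200 after.
drop, p = wald_two_proportion(160, 200, 156, 200)
print(f"accuracy drop = {drop:.1%}, p = {p:.3f}")  # → accuracy drop = 2.0%, p = 0.623
```

With only 200 questions, a 2% drop is well within sampling noise, which is consistent with the non-significant p-value reported for GPT-4.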

CONCLUSIONS

LLMs trained on Chinese corpora, such as Doubao-pro 32k, demonstrated superior performance compared to GPT-4 in answering and explaining questions, with no statistically significant difference. However, during adversarial testing, all models exhibited diminished performance, with GPT-4 displaying comparatively greater robustness. Future research should further investigate the interpretability of LLM outputs and develop strategies to mitigate hallucinations generated in medical education.


Similar Articles

1
Evaluating the Performance of Large Language Models (LLMs) in Answering and Analysing the Chinese Dental Licensing Examination.
Eur J Dent Educ. 2025 May;29(2):332-340. doi: 10.1111/eje.13073. Epub 2025 Jan 31.
2
Benchmarking Vision Capabilities of Large Language Models in Surgical Examination Questions.
J Surg Educ. 2025 Apr;82(4):103442. doi: 10.1016/j.jsurg.2025.103442. Epub 2025 Feb 9.
3
Accuracy of latest large language models in answering multiple choice questions in dentistry: A comparative study.
PLoS One. 2025 Jan 29;20(1):e0317423. doi: 10.1371/journal.pone.0317423. eCollection 2025.
4
Performance of Large Language Models on the Korean Dental Licensing Examination: A Comparative Study.
Int Dent J. 2025 Feb;75(1):176-184. doi: 10.1016/j.identj.2024.09.002. Epub 2024 Oct 6.
5
Qwen-2.5 Outperforms Other Large Language Models in the Chinese National Nursing Licensing Examination: Retrospective Cross-Sectional Comparative Study.
JMIR Med Inform. 2025 Jan 10;13:e63731. doi: 10.2196/63731.
6
Evaluating Bard Gemini Pro and GPT-4 Vision Against Student Performance in Medical Visual Question Answering: Comparative Case Study.
JMIR Form Res. 2024 Dec 17;8:e57592. doi: 10.2196/57592.
7
Large Language Models in Dental Licensing Examinations: Systematic Review and Meta-Analysis.
Int Dent J. 2025 Feb;75(1):213-222. doi: 10.1016/j.identj.2024.10.014. Epub 2024 Nov 12.
8
Factors Associated With the Accuracy of Large Language Models in Basic Medical Science Examinations: Cross-Sectional Study.
JMIR Med Educ. 2025 Jan 13;11:e58898. doi: 10.2196/58898.
9
Performance of three artificial intelligence (AI)-based large language models in standardized testing; implications for AI-assisted dental education.
J Periodontal Res. 2025 Feb;60(2):121-133. doi: 10.1111/jre.13323. Epub 2024 Jul 18.
10
Evaluating the Effectiveness of advanced large language models in medical Knowledge: A Comparative study using Japanese national medical examination.
Int J Med Inform. 2025 Jan;193:105673. doi: 10.1016/j.ijmedinf.2024.105673. Epub 2024 Oct 28.