Kim Hak-Sun, Kim Gyu-Tae
Department of Oral and Maxillofacial Radiology, Kyung Hee University Dental Hospital, Seoul, Republic of Korea.
Department of Oral and Maxillofacial Radiology, College of Dentistry, Kyung Hee University, Seoul, Republic of Korea.
J Dent Sci. 2025 Apr;20(2):895-900. doi: 10.1016/j.jds.2024.08.020. Epub 2024 Sep 11.
BACKGROUND/PURPOSE: Numerous studies have shown that large language models (LLMs) can achieve passing scores on various board examinations. This study therefore evaluated national dental board-style examination questions created by an LLM against those created by human experts, using item analysis.
MATERIALS AND METHODS: This study was conducted in June 2024 and included senior dental students (n = 30) who participated voluntarily. An LLM, ChatGPT-4o, was used to generate 44 national dental board-style examination questions based on textbook content. Twenty questions were randomly selected for the LLM set after flawed questions were removed. Two experts created another set of 20 questions based on the same content and in the same style as the LLM set. Participating students answered all 40 questions, divided into the two sets, at the same sitting using Google Forms in the classroom. Responses were analyzed for difficulty index, discrimination index, and distractor efficiency. Statistical comparisons were performed using the Wilcoxon signed-rank test or the linear-by-linear association test, at a 95% confidence level.
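The three item-analysis metrics named above have standard textbook definitions: difficulty index = percentage of examinees answering correctly; discrimination index = proportion correct in the top-scoring group minus that in the bottom-scoring group (conventionally the 27% tails); distractor efficiency = percentage of distractors that are "functional", i.e., chosen by at least 5% of examinees. The sketch below uses these conventional formulas and cutoffs as assumptions; the abstract does not state the authors' exact implementation.

```python
def difficulty_index(correct):
    """Difficulty index P (%): share of examinees who answered the item correctly.
    `correct` is a list of 0/1 flags, one per examinee."""
    return 100.0 * sum(correct) / len(correct)

def discrimination_index(total_scores, correct, frac=0.27):
    """Discrimination index D: proportion correct among the top `frac` of
    examinees (ranked by total test score) minus that among the bottom `frac`.
    The 27% split is the common convention, assumed here."""
    n = len(total_scores)
    k = max(1, round(frac * n))
    order = sorted(range(n), key=lambda i: total_scores[i])  # ascending
    low, high = order[:k], order[-k:]

    def prop_correct(idx):
        return sum(correct[i] for i in idx) / len(idx)

    return prop_correct(high) - prop_correct(low)

def distractor_efficiency(choices, key, options="ABCDE", cutoff=0.05):
    """Distractor efficiency (%): share of distractors chosen by at least
    `cutoff` (5%, the usual threshold, assumed here) of examinees."""
    n = len(choices)
    distractors = [o for o in options if o != key]
    functional = sum(choices.count(d) / n >= cutoff for d in distractors)
    return 100.0 * functional / len(distractors)

# Hypothetical item answered by 10 examinees (illustrative data only).
correct = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
total_scores = [30, 25, 20, 35, 15, 28, 32, 10, 27, 33]
choices = list("AABACAABAA")  # key is "A"

print(difficulty_index(correct))                    # 70.0
print(discrimination_index(total_scores, correct))  # 1.0
print(distractor_efficiency(choices, "A"))          # 50.0 (B and C functional; D and E not)
```

With real data, the per-item values from the two 20-question sets would then be compared pairwise, e.g. with `scipy.stats.wilcoxon`, matching the Wilcoxon signed-rank test named in the methods.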
RESULTS: The response rate was 100%. The median difficulty indices of the LLM and human sets were 55.00% and 50.00%, respectively, both within the "excellent" range. The median discrimination indices were 0.29 for the LLM set and 0.14 for the human set. Both sets had a median distractor efficiency of 80.00%. None of the differences was statistically significant (P > 0.050).
CONCLUSION: The LLM can create national board-style examination questions of quality equivalent to those created by human experts.