
Can a large language model create acceptable dental board-style examination questions? A cross-sectional prospective study.

Author Information

Kim Hak-Sun, Kim Gyu-Tae

Affiliations

Department of Oral and Maxillofacial Radiology, Kyung Hee University Dental Hospital, Seoul, Republic of Korea.

Department of Oral and Maxillofacial Radiology, College of Dentistry, Kyung Hee University, Seoul, Republic of Korea.

Publication Information

J Dent Sci. 2025 Apr;20(2):895-900. doi: 10.1016/j.jds.2024.08.020. Epub 2024 Sep 11.

Abstract

BACKGROUND/PURPOSE

Numerous studies have shown that large language models (LLMs) can score above the passing grade on various board examinations. This study therefore aimed to use item analysis to compare national dental board-style examination questions created by an LLM with those created by human experts.

MATERIALS AND METHODS

This study was conducted in June 2024 and included senior dental students (n = 30) who participated voluntarily. An LLM, ChatGPT 4o, was used to generate 44 national dental board-style examination questions based on textbook content. After removing factually incorrect questions, 20 were randomly selected to form the LLM set. Two experts created another set of 20 questions based on the same content and in the same style as the LLM set. Participating students answered all 40 questions, divided into the two sets, at the same time using Google Forms in the classroom. The responses were analyzed to assess the difficulty index, discrimination index, and distractor efficiency of each item. Statistical comparisons were performed using the Wilcoxon signed-rank test or the linear-by-linear association test, with a confidence level of 95%.
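The abstract names the three item-analysis metrics and the Wilcoxon signed-rank comparison but does not spell out the formulas. Below is a minimal Python sketch assuming the standard classical-test-theory definitions (upper/lower 27% scoring groups for the discrimination index, a 5% response threshold for a "functional" distractor); the function names, thresholds, and synthetic data are illustrative assumptions, not taken from the paper.

```python
import numpy as np
from scipy.stats import wilcoxon

def difficulty_index(item_correct):
    """Difficulty index P: percentage of examinees answering the item correctly."""
    return 100.0 * np.mean(item_correct)

def discrimination_index(item_correct, total_scores, frac=0.27):
    """Discrimination index D: proportion correct in the upper scoring group
    minus the lower group (top/bottom `frac` of examinees by total score)."""
    k = max(1, int(round(frac * len(total_scores))))
    order = np.argsort(total_scores)
    return np.mean(item_correct[order[-k:]]) - np.mean(item_correct[order[:k]])

def distractor_efficiency(choices, key, n_options=5, threshold=0.05):
    """Distractor efficiency: percentage of distractors chosen by at least
    `threshold` of examinees ("functional" distractors)."""
    distractors = [o for o in range(n_options) if o != key]
    functional = sum(np.mean(choices == o) >= threshold for o in distractors)
    return 100.0 * functional / len(distractors)

# Synthetic stand-in for the real response data (30 students x 40 items).
rng = np.random.default_rng(42)
choices = rng.integers(0, 5, size=(30, 40))  # option picked per student/item
key = rng.integers(0, 5, size=40)            # answer key per item
scored = (choices == key).astype(int)        # 0/1 correctness matrix
totals = scored.sum(axis=1)                  # per-student total score

# Per-item indices; items 0-19 stand in for the LLM set, 20-39 for the human set.
diff = np.array([difficulty_index(scored[:, i]) for i in range(40)])
disc = np.array([discrimination_index(scored[:, i], totals) for i in range(40)])
de0 = distractor_efficiency(choices[:, 0], key[0])

# Paired comparison of the two 20-item sets, as in the abstract.
stat, p = wilcoxon(diff[:20], diff[20:])
print(f"Difficulty, LLM vs human: W = {stat:.1f}, p = {p:.3f}")
```

The linear-by-linear association test the authors also mention applies to ordinal contingency tables (e.g., items binned by difficulty category); statsmodels' `Table.test_ordinal_association` is one way to run it, though whether the authors used that tool is an assumption.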

RESULTS

The response rate was 100%. The median difficulty indices of the LLM and human sets were 55.00% and 50.00%, respectively, both within the "excellent" range. The median discrimination indices were 0.29 for the LLM set and 0.14 for the human set. Both sets had a median distractor efficiency of 80.00%. None of the differences were statistically significant (P > 0.050).

CONCLUSION

The LLM can create national board-style examination questions of equivalent quality to those created by human experts.


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/17eb/11993092/adeb4c57f64f/gr1.jpg
