Acad Med. 2024 May 1;99(5):508-512. doi: 10.1097/ACM.0000000000005626. Epub 2023 Dec 28.
PROBLEM: Creating medical exam questions is time-consuming, but well-written questions can be used for test-enhanced learning, which has been shown to have a positive effect on student learning. The automated generation of high-quality questions using large language models (LLMs), such as ChatGPT, would therefore be desirable. However, there are no current studies that compare students' performance on LLM-generated questions to questions developed by humans.
APPROACH: The authors compared student performance on questions generated by ChatGPT (LLM questions) with questions created by medical educators (human questions). Two sets of 25 multiple-choice questions (MCQs) were created, each with 5 answer options, 1 of which was correct. The first set of questions was written by an experienced medical educator, and the second set was created by ChatGPT 3.5 after the authors identified learning objectives and extracted some specifications from the human questions. Students answered all questions in random order in a formative paper-and-pencil test that was offered leading up to the final summative neurophysiology exam (summer 2023). For each question, students also indicated whether they thought it had been written by a human or ChatGPT.
OUTCOMES: The final data set consisted of 161 participants and 46 MCQs (25 human and 21 LLM questions). There was no statistically significant difference in item difficulty between the 2 question sets, but discriminatory power was statistically significantly higher in human than LLM questions (mean = .36, standard deviation [SD] = .09 vs mean = .24, SD = .14; P = .001). On average, students identified 57% of question sources (human or LLM) correctly.
NEXT STEPS: Future research should replicate the study procedure in other contexts (e.g., other medical subjects, semesters, countries, and languages). In addition, the question of whether LLMs are suitable for generating different question types, such as key feature questions, should be investigated.
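The abstract reports item difficulty and discriminatory power but does not state which indices were computed. A common classical-test-theory reading is difficulty as the proportion of students answering an item correctly and discriminatory power as the corrected point-biserial correlation between the item score and the rest-of-test score; the sketch below (Python/NumPy, with hypothetical variable and function names) illustrates that assumption, not the authors' actual analysis.

```python
import numpy as np

def item_statistics(responses: np.ndarray):
    """Classical test theory item statistics (illustrative sketch).

    responses: binary matrix of shape (n_students, n_items);
               1 = item answered correctly, 0 = incorrectly.
    Returns (difficulty, discrimination), one value per item.
    """
    n_students, n_items = responses.shape

    # Item difficulty: proportion of students answering the item correctly.
    difficulty = responses.mean(axis=0)

    # Discriminatory power (assumed here: corrected point-biserial) --
    # correlation of each item score with the total score on the other items.
    total = responses.sum(axis=1)
    discrimination = np.empty(n_items)
    for j in range(n_items):
        rest = total - responses[:, j]
        discrimination[j] = np.corrcoef(responses[:, j], rest)[0, 1]

    return difficulty, discrimination
```

The group comparison reported under OUTCOMES could then be run, for example, as an independent-samples t test on the per-item discrimination values of the human and LLM question sets.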