Coşkun Özlem, Kıyak Yavuz Selim, Budakoğlu Işıl İrem
Department of Medical Education and Informatics, Gazi University, Ankara, Turkey.
Med Teach. 2025 Feb;47(2):268-274. doi: 10.1080/0142159X.2024.2327477. Epub 2024 Mar 13.
This study aimed to evaluate the real-life performance of clinical vignettes and multiple-choice questions generated using ChatGPT.
This was a randomized controlled study in an evidence-based medicine training program. We randomly assigned 74 medical students to two groups. The ChatGPT group received ill-defined cases generated by ChatGPT, while the control group received human-written cases. At the end of the training, the students evaluated the cases by rating 10 statements on a Likert scale. They also answered 15 multiple-choice questions (MCQs) generated by ChatGPT. The case evaluations of the two groups were compared. Some psychometric characteristics of the test (item difficulty and point-biserial correlations) were also reported.
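As context for the psychometric indices reported below, here is a minimal sketch of how item difficulty and point-biserial correlation are conventionally computed from 0/1-scored MCQ responses. This is an illustration only, not the authors' analysis code; in particular, the corrected item-total variant (excluding the item from the total score) and all function and variable names are assumptions.

```python
import numpy as np

def item_statistics(responses: np.ndarray):
    """Classical item statistics for a 0/1-scored MCQ test.

    responses: (n_students, n_items) array of 0/1 item scores.
    Returns (difficulty, point_biserial), one value per item.
    """
    total = responses.sum(axis=1)            # each student's total score
    difficulty = responses.mean(axis=0)      # proportion correct per item
    pbis = np.empty(responses.shape[1])
    for j in range(responses.shape[1]):
        item = responses[:, j]
        rest = total - item                  # assumption: corrected item-total (item j excluded)
        # For a dichotomous item, the point-biserial equals the Pearson r
        # between the 0/1 item score and the criterion score.
        pbis[j] = np.corrcoef(item, rest)[0, 1]
    return difficulty, pbis

# Hypothetical usage: 74 students x 15 items of simulated scores
rng = np.random.default_rng(0)
scores = (rng.random((74, 15)) < 0.6).astype(int)
diff, pbis = item_statistics(scores)
acceptable = pbis > 0.30                     # threshold cited in the results
```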
None of the scores on the 10 statements regarding the cases showed a significant difference between the ChatGPT group and the control group (p > .05). In the test, only six MCQs had point-biserial correlations at acceptable levels (higher than 0.30), and five items could be considered acceptable in classroom settings.
The results showed that the quality of the vignettes is comparable to that of vignettes created by human authors, and that some multiple-choice questions have acceptable psychometric characteristics. ChatGPT has potential for generating clinical vignettes for teaching and MCQs for assessment in medical education.