Kıyak Yavuz Selim, Coşkun Ali Kağan, Kaymak Şahin, Coşkun Özlem, Budakoğlu Işıl İrem
Department of Medical Education and Informatics, Gazi University Faculty of Medicine, Ankara, Turkey.
Department of General Surgery, UHS Gulhane School of Medicine, Ankara, Turkey.
Proc (Bayl Univ Med Cent). 2024 Oct 22;38(1):48-52. doi: 10.1080/08998280.2024.2418752. eCollection 2025.
This study aimed to determine whether surgical multiple-choice questions generated by ChatGPT are comparable to those written by human experts (surgeons).
The study was conducted at a medical school and involved 112 fourth-year medical students. Based on five learning objectives in general surgery (colorectal, gastric, trauma, breast, thyroid), ChatGPT and surgeons each generated five multiple-choice questions. The ChatGPT-generated questions were used without modification. The statistical properties of these questions in a general surgery clerkship exam were reported, including correlations between the two groups of questions and correlations with total scores (item discrimination).
There was a significant positive correlation between the ChatGPT-generated and human-written questions for one learning objective (colorectal). More importantly, only one ChatGPT-generated question (colorectal) achieved an acceptable discrimination level, while the other four did not. In contrast, the human-written questions showed acceptable discrimination levels.
While ChatGPT has the potential to generate multiple-choice questions comparable to human-written ones in specific contexts, the variability across surgical topics points to the need for human oversight and review before such questions are used in exams. Integrating artificial intelligence tools such as ChatGPT with human expertise is important for enhancing both efficiency and quality.