Touma Naji J, Patel Ruchit, Skinner Thomas, Leveridge Michael
Department of Urology, Queen's University, Kingston, Ontario, Canada.
J Urol. 2025 Apr;213(4):504-511. doi: 10.1097/JU.0000000000004357. Epub 2024 Dec 9.
Assessments in medical education play a central role in evaluating trainees' progress and eventual competence. Generative artificial intelligence is finding an increasing role in clinical care and medical education. The objective of this study was to evaluate the ability of the large language model ChatGPT to generate examination questions that are discriminating in the evaluation of graduating urology residents.
Graduating urology residents representing all Canadian training programs gather yearly for a mock examination that simulates their upcoming board certification examination. The examination consists of a written multiple-choice question (MCQ) examination and an oral objective structured clinical examination. In 2023, ChatGPT Version 4 was used to generate 20 MCQs that were added to the written component. ChatGPT was asked to use Campbell-Walsh Urology, AUA, and Canadian Urological Association guidelines as resources. Psychometric analysis of the ChatGPT MCQs was conducted. The MCQs were also reviewed by 3 faculty for face validity and to ascertain whether they were drawn from a valid source.
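For illustration only, a request of this kind could be issued through the OpenAI Python client as sketched below; the prompt wording, model name, and parameters are assumptions, since the article does not report the authors' exact prompting procedure.

from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY environment variable is set

# Hypothetical prompt; the study's actual instructions to ChatGPT are not reported.
prompt = (
    "Write one single-best-answer multiple-choice question at the level of a "
    "graduating urology resident, drawing on Campbell-Walsh Urology and the "
    "AUA and Canadian Urological Association guidelines. Provide a clinical "
    "stem, options A through E, the correct answer, and a brief explanation "
    "with the supporting reference."
)

response = client.chat.completions.create(
    model="gpt-4",  # stand-in for "ChatGPT Version 4" as described in the abstract
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)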
The mean score of the 35 examination takers on the ChatGPT MCQs was 60.7% vs 61.1% for the overall examination. Twenty-five percent of ChatGPT MCQs showed a discrimination index > 0.3, the threshold for questions that properly discriminate between high and low examination performers. Twenty-five percent of ChatGPT MCQs showed a point biserial > 0.2, which is considered a high correlation with overall performance on the examination. The faculty assessment found that ChatGPT MCQs often provided incomplete information in the stem, offered multiple potentially correct answers, and were sometimes not rooted in the literature. Thirty-five percent of the MCQs generated by ChatGPT provided wrong answers to their stems.
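As a point of reference only (not part of the study's analysis), the two item statistics reported above are commonly computed as follows. This is a minimal Python sketch; the simulated scores, the 27% group fraction, and the function names are illustrative assumptions, not the authors' method.

import numpy as np

def discrimination_index(item, total, fraction=0.27):
    # Difference in item pass rate between the top- and bottom-scoring groups
    order = np.argsort(total)
    k = max(1, int(round(fraction * len(total))))
    low_group = item[order[:k]]    # lowest-scoring examinees
    high_group = item[order[-k:]]  # highest-scoring examinees
    return high_group.mean() - low_group.mean()

def point_biserial(item, total):
    # Correlation between a dichotomous (0/1) item score and the total score
    return np.corrcoef(item, total)[0, 1]

# Simulated example with 35 examinees, mirroring the cohort size in the study
rng = np.random.default_rng(0)
total = rng.normal(61, 8, size=35)                               # overall exam scores (%)
item = (total + rng.normal(0, 10, size=35) > 61).astype(float)   # 0/1 responses to one MCQ
print(discrimination_index(item, total))  # flagged as discriminating if > 0.3
print(point_biserial(item, total))        # flagged as correlated if > 0.2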
Despite apparently similar performance on the ChatGPT MCQs and the overall examination, ChatGPT MCQs tend not to be highly discriminating. Poorly phrased questions and artificial intelligence hallucinations remain an ever-present risk. Careful vetting of ChatGPT-generated questions for quality should be undertaken before their use on assessments in urology training examinations.