Jackson Frank I, Keller Nathan A, Kouba Insaf, Kouba Wassil, Bracero Luis A, Blitz Matthew J
Acad Med. 2025 Oct 1;100(10):1163-1166. doi: 10.1097/ACM.0000000000006137. Epub 2025 Jun 23.
Clinical vignette-based multiple-choice questions (MCQs) have been used to assess postgraduate medical trainees but require substantial time and effort to develop. Large language models, a type of artificial intelligence (AI), can potentially expedite this task. This report describes prompt engineering techniques used with ChatGPT-4 to generate clinical vignettes and MCQs for obstetrics-gynecology residents and evaluates whether residents and attending physicians can differentiate between human- and AI-generated content.
The authors generated MCQs using a structured prompt engineering approach that incorporated authoritative source documents and an iterative prompt-chaining technique to refine output quality. Fifty human-generated and 50 AI-generated MCQs were randomly arranged into 10 quizzes of 10 questions each. The AI-generated MCQs were developed in August 2024, and the surveys were conducted in September 2024. Obstetrics-gynecology residents and attending physician faculty members at Northwell Health or the Donald and Barbara Zucker School of Medicine at Hofstra/Northwell completed an online survey, answering each MCQ and indicating whether they believed it was written by a human or by AI, or that they were uncertain.
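The report does not publish the authors' prompts or tooling. As an illustration only, the following is a minimal sketch of iterative prompt chaining against the OpenAI chat completions API; the model name, prompt wording, and the source_excerpt parameter are assumptions, not the authors' actual pipeline.

```python
# Illustrative sketch only: the authors' actual prompts, model settings,
# and source documents are not published. Model name and prompt text
# below are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def chat(messages: list[dict]) -> str:
    """Single call to the chat completions endpoint."""
    resp = client.chat.completions.create(model="gpt-4", messages=messages)
    return resp.choices[0].message.content


def generate_mcq(source_excerpt: str) -> str:
    """Two-step prompt chain: draft a vignette-based MCQ grounded in an
    authoritative source excerpt, then refine the draft in a follow-up turn."""
    messages = [
        {"role": "system",
         "content": "You write board-style obstetrics-gynecology MCQs."},
        {"role": "user",
         "content": "Using only the excerpt below, write a clinical vignette "
                    "followed by one multiple-choice question with five "
                    "options and one correct answer.\n\n" + source_excerpt},
    ]
    draft = chat(messages)
    # Chain step: feed the draft back to the model for refinement.
    messages += [
        {"role": "assistant", "content": draft},
        {"role": "user",
         "content": "Revise the question: remove clues in the stem, make all "
                    "distractors plausible, and label the correct answer."},
    ]
    return chat(messages)
```

The two-turn structure mirrors the iterative chaining the report describes, a generation step grounded in the source document followed by a refinement step; in practice the chain may involve more turns and manual checks against the source.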
Thirty-three participants (16 residents, 17 attendings) completed the survey (80.5% response rate). Respondents correctly identified MCQ authorship a median (interquartile range [IQR]) of 39.1% (30.0%-50.0%) of the time, indicating difficulty in distinguishing human- from AI-generated questions. The median (IQR) correct answer selection rate was 62.3% (50.0%-75.0%) for human-generated MCQs and 64.4% (50.0%-83.3%) for AI-generated MCQs (P = .74). The difficulty (0.69 vs 0.66, P = .83) and discrimination (0.42 vs 0.38, P = .90) indices showed no significant differences, supporting the feasibility of large language model-generated MCQs in medical education.
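The abstract does not state which formulas were used for these item statistics. Under the common classical test theory definitions (an assumption), item difficulty is the proportion of respondents answering correctly, and discrimination is the difference in that proportion between high- and low-scoring groups. A minimal sketch:

```python
# Classical test theory item statistics. The specific formulas the authors
# used are not stated in the abstract, so these standard definitions are an
# assumption. `responses` is a list of (total_score, item_correct) pairs.
def difficulty_index(item_correct: list[bool]) -> float:
    """Proportion of respondents who answered the item correctly."""
    return sum(item_correct) / len(item_correct)


def discrimination_index(responses: list[tuple[float, bool]],
                         frac: float = 0.27) -> float:
    """Upper-lower group method: item difficulty among the top `frac` of
    scorers minus item difficulty among the bottom `frac`."""
    ranked = sorted(responses, key=lambda r: r[0])  # ascending by total score
    k = max(1, int(len(ranked) * frac))
    lower = [correct for _, correct in ranked[:k]]
    upper = [correct for _, correct in ranked[-k:]]
    return difficulty_index(upper) - difficulty_index(lower)
```

Under this convention, a difficulty index of 0.69 means about 69% of respondents answered the item correctly, and a discrimination index near 0.4 indicates items that separate high from low scorers well.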
Future studies should explore the optimal balance between AI-generated content and expert review, identifying strategies to maximize efficiency without compromising accuracy. The authors will develop practice exams and assess their predictive validity by comparing scores with standardized exam results.