Güvel Muhammed Cihan, Kıyak Yavuz Selim, Varan Hacer Doğan, Sezenöz Burak, Coşkun Özlem, Uluoğlu Canan
Department of Medical Pharmacology, Gazi University Faculty of Medicine, Ankara, Turkey.
Department of Medical Education and Informatics, Gazi University Faculty of Medicine, Ankara, Turkey.
Eur J Clin Pharmacol. 2025 Jun;81(6):875-883. doi: 10.1007/s00228-025-03838-2. Epub 2025 Apr 9.
This study evaluated the performance of three generative AI models (ChatGPT-4o, Gemini 1.5 Advanced Pro, and Claude 3.5 Sonnet) in producing case-based rational pharmacology questions, compared to expert educators.
Using one-shot prompting, 60 questions (20 per model) addressing essential hypertension and type 2 diabetes topics were generated. A multidisciplinary panel categorized the questions by usability (no revisions needed, minor or major revisions required, or unusable). Subsequently, 24 AI-generated and 8 expert-created questions were administered to 103 medical students in a real-world exam setting. Performance metrics, including correct response rate, discrimination index, and identification of nonfunctional distractors, were analyzed.
No statistically significant differences were found between AI-generated and expert-created questions, with mean correct response rates surpassing 50% and discrimination indices consistently equal to or above 0.20. Claude produced the highest proportion of error-free items (12/20), whereas ChatGPT exhibited the fewest unusable items (5/20). Expert revisions required approximately one minute per AI-generated question, representing a substantial efficiency gain over manual question preparation. Nonetheless, 19 of the 60 AI-generated questions were deemed unusable, highlighting the necessity of expert oversight.
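The item-analysis metrics reported above can be illustrated with a short sketch. This is not the authors' analysis code; the function names, the upper-lower 27% split for the discrimination index, and the 5% threshold for flagging nonfunctional distractors are common conventions assumed here for illustration, and all data in the usage comments are invented.

```python
# Hypothetical sketch of standard item-analysis metrics (not the study's code).
# Assumed conventions: upper/lower 27% groups for discrimination,
# <5% selection rate marks a nonfunctional distractor.

def correct_response_rate(responses, key):
    """Fraction of examinees choosing the keyed answer."""
    return sum(r == key for r in responses) / len(responses)

def discrimination_index(total_scores, item_correct, frac=0.27):
    """p(correct) in the top-scoring group minus p(correct) in the bottom group.

    total_scores: overall exam score per student.
    item_correct: 1/0 per student for this item, same order.
    """
    n = len(total_scores)
    order = sorted(range(n), key=lambda i: total_scores[i], reverse=True)
    k = max(1, round(n * frac))
    p_upper = sum(item_correct[i] for i in order[:k]) / k
    p_lower = sum(item_correct[i] for i in order[-k:]) / k
    return p_upper - p_lower

def nonfunctional_distractors(choice_counts, key, threshold=0.05):
    """Distractors chosen by fewer than `threshold` of examinees."""
    total = sum(choice_counts.values())
    return [opt for opt, c in choice_counts.items()
            if opt != key and c / total < threshold]
```

For example, an item chosen correctly by 2 of 4 students has a correct response rate of 0.5, and a distractor picked by 2 of 100 examinees would be flagged as nonfunctional under the 5% convention.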
Large language models can profoundly accelerate the development of high-quality assessment questions in medical education. However, expert review remains critical to address lapses in reliability and validity. A hybrid model, integrating AI-driven efficiencies with rigorous expert validation, may offer an optimal approach for enhancing educational outcomes.