Assessing LLM-generated vs. expert-created clinical anatomy MCQs: a student perception-based comparative study in medical education.

Authors

Elzayyat Maram, Mohammad Janatul Naeim, Zaqout Sami

Affiliations

Department of Basic Medical Sciences, College of Medicine, QU Health, Qatar University, Doha, Qatar.

Publication information

Med Educ Online. 2025 Dec;30(1):2554678. doi: 10.1080/10872981.2025.2554678. Epub 2025 Aug 30.

Abstract

Large language models (LLMs) such as ChatGPT and Gemini are increasingly used to generate educational content in medical education, including multiple-choice questions (MCQs), but their effectiveness compared with expert-written questions remains underexplored, particularly in anatomy. We conducted a cross-sectional, mixed-methods study involving Year 2-4 medical students at Qatar University, in which participants completed and evaluated three anonymized MCQ sets, authored by ChatGPT, Google Gemini, and a clinical anatomist, across 17 quality criteria. Descriptive and chi-square analyses were performed, and optional feedback was reviewed thematically. Among 48 participants, most rated the three MCQ sources as equally effective, although ChatGPT was more often preferred for helping students identify and confront their knowledge gaps through challenging distractors and diagnostic insight, while expert-written questions were rated highest for deeper analytical thinking. A significant variation in preferences was observed across sources (χ²(64) = 688.79, p < .001). Qualitative feedback emphasized the need for better difficulty calibration and clearer distractors in some AI-generated items. Overall, LLM-generated anatomy MCQs can closely match expert-authored ones in learner-perceived value and may support deeper engagement, but expert review remains critical to ensure clarity and alignment with curricular goals. A hybrid AI-human workflow may provide a promising path toward scalable, high-quality assessment design in medical education.
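For readers who want to see how the reported test statistic is typically obtained, the sketch below runs a chi-square test of independence on a small contingency table of rating counts using Python and SciPy. This is a minimal illustration only: the table shape, counts, and category labels are made-up assumptions for demonstration, not the study's actual data.

```python
# Minimal sketch of a chi-square test of independence (Python/SciPy).
# The counts below are illustrative placeholders, not the study's data.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: MCQ source (ChatGPT, Gemini, expert); columns: hypothetical rating categories.
observed = np.array([
    [12, 18, 10, 8],
    [10, 20, 11, 7],
    [ 9, 15, 14, 10],
])

# chi2_contingency returns the statistic, p-value, degrees of freedom,
# and the expected frequencies under the null hypothesis of independence.
chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2({dof}) = {chi2:.2f}, p = {p:.3g}")
```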


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8e14/12404065/77992987db00/ZMEO_A_2554678_F0001_OC.jpg
