
Chatbots' Role in Generating Single Best Answer Questions for Undergraduate Medical Student Assessment: Comparative Analysis.

Author Information

Abouzeid Enjy, Wassef Rita, Jawwad Ayesha, Harris Patricia

Affiliations

School of Medicine, University of Ulster, Northland Road, Derry-Londonderry, BT48 7JL, United Kingdom, 44 7516989748.

Publication Information

JMIR Med Educ. 2025 May 30;11:e69521. doi: 10.2196/69521.

Abstract

BACKGROUND

Programmatic assessment supports flexible learning and individual progression but challenges educators to develop frequent assessments reflecting different competencies. Continuously creating large volumes of assessment items, in a consistent format and within comparatively restricted time, is laborious. Technological innovations, including artificial intelligence (AI), have been applied to address this challenge. A major concern is the validity of the information produced by AI tools; if it is not properly verified, it can lead to inaccurate and therefore inappropriate assessments.

OBJECTIVE

This study was designed to examine the content validity and consistency of different AI chatbots in creating single best answer (SBA) questions, a refined format of multiple choice questions better suited to assess higher levels of knowledge, for undergraduate medical students.

METHODS

This study followed 3 steps. First, 3 researchers used a unified prompt script to generate 10 SBA questions across 4 chatbot platforms. Second, assessors evaluated the chatbot outputs for consistency by identifying similarities and differences between users and across chatbots. With 3 assessors and 10 learning objectives, the maximum possible score for any individual chatbot was 30. Third, 7 assessors internally moderated the questions using a rating scale developed by the research team to evaluate scientific accuracy and educational quality.
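
To make the consistency scoring concrete: each assessor makes a binary consistency judgment per learning objective, so 3 assessors × 10 objectives gives a maximum of 30 per chatbot. The sketch below illustrates that tally; the data and variable names are hypothetical, as the study does not publish its scoring code.

    import numpy as np

    # Hypothetical tally: 3 assessors x 10 learning objectives, one chatbot.
    # A cell is 1 if the assessor judged the output for that objective consistent
    # (similar across users and platforms), 0 otherwise. Placeholder data only.
    rng = np.random.default_rng(0)
    ratings = rng.integers(0, 2, size=(3, 10))

    per_objective = ratings.sum(axis=0)   # agreement count per objective, 0-3
    chatbot_score = int(ratings.sum())    # total score out of 3 x 10 = 30
    print(per_objective, chatbot_score)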

RESULTS

In response to the prompts, all chatbots generated 10 questions each, except Bing, which failed to respond to 1 prompt. ChatGPT-4 exhibited the highest variation in question generation but did not fully satisfy the "cover test." Gemini performed well across most evaluation criteria, except for item balance, and relied heavily on the vignette for answers but showed a preference for one answer option. Bing scored low in most evaluation areas but generated appropriately structured lead-in questions. SBA questions from GPT-3.5, Gemini, and ChatGPT-4 had similar Item Content Validity Index and Scale Level Content Validity Index values, while the Krippendorff alpha coefficient was low (0.016). Bing performed poorly in content clarity, overall validity, and item construction accuracy. A 2-way ANOVA without replication revealed statistically significant differences among chatbots and domains (P<.05). However, the Tukey-Kramer HSD (honestly significant difference) post hoc test showed no significant pairwise differences between individual chatbots, as all comparisons had P values >.05 and overlapping CIs.
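
The statistics reported above can be reproduced in outline with standard Python tooling. The sketch below is illustrative only and uses fabricated placeholder ratings: it assumes a 4-point relevance scale for the Item and Scale Level Content Validity Index, the krippendorff package for the alpha coefficient, and statsmodels for the two-way ANOVA without replication (chatbot and domain as additive factors) followed by a Tukey HSD pairwise comparison. It is not the authors' analysis code.

    import numpy as np
    import pandas as pd
    import krippendorff                                   # pip install krippendorff
    import statsmodels.api as sm
    import statsmodels.formula.api as smf
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    rng = np.random.default_rng(1)

    # Content validity: rows = 7 assessors, cols = 10 SBA items (placeholder ratings, 1-4)
    relevance = rng.integers(1, 5, size=(7, 10))
    i_cvi = (relevance >= 3).mean(axis=0)                 # proportion rating each item 3 or 4
    s_cvi_ave = i_cvi.mean()                              # scale-level CVI, averaging method

    # Inter-rater reliability: Krippendorff's alpha for ordinal ratings
    alpha = krippendorff.alpha(reliability_data=relevance,
                               level_of_measurement="ordinal")

    # Two-way ANOVA without replication: one mean score per chatbot x domain cell
    chatbots = ["GPT-3.5", "ChatGPT-4", "Gemini", "Bing"]
    domains = ["clarity", "validity", "construction", "balance"]
    long = pd.DataFrame(
        [(c, d, rng.normal(3.5, 0.5)) for c in chatbots for d in domains],
        columns=["chatbot", "domain", "score"],
    )
    model = smf.ols("score ~ C(chatbot) + C(domain)", data=long).fit()
    anova_table = sm.stats.anova_lm(model, typ=2)

    # Pairwise comparison between chatbots (Tukey HSD on the same cell scores)
    tukey = pairwise_tukeyhsd(endog=long["score"], groups=long["chatbot"])

    print(i_cvi, round(s_cvi_ave, 2), round(alpha, 3))
    print(anova_table)
    print(tukey.summary())

Because the design has one observation per chatbot-domain cell, the additive model's residual serves as the error term, which is why no interaction term appears in the formula.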

CONCLUSIONS

AI chatbots can aid the production of questions aligned with learning objectives, and individual chatbots have their own strengths and weaknesses. Nevertheless, all require expert evaluation to ensure their suitability for use. Using AI to generate SBA questions prompts us to reconsider Bloom's taxonomy of the cognitive domain, which traditionally positions creation as the highest level of cognition.


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/34e8/12143854/05f80cb6b7dd/mededu-v11-e69521-g001.jpg
