Evaluating the quality of medical multiple-choice items created with automated processes.

Institution information

Centre for Research in Applied Measurement and Evaluation, Faculty of Education, University of Alberta, Edmonton, Alberta, Canada.

Publication information

Med Educ. 2013 Jul;47(7):726-33. doi: 10.1111/medu.12202.

Abstract

OBJECTIVES

Computerised assessment raises formidable challenges because it requires large numbers of test items. Automatic item generation (AIG) can help address this test development problem because it yields large numbers of new items both quickly and efficiently. To date, however, the quality of the items produced using a generative approach has not been evaluated. The purpose of this study was to determine whether automatic processes yield items that meet standards of quality that are appropriate for medical testing. Quality was evaluated firstly by subjecting items created using both AIG and traditional processes to rating by a four-member expert medical panel using indicators of multiple-choice item quality, and secondly by asking the panellists to identify which items were developed using AIG in a blind review.

METHODS

Fifteen items from the domain of therapeutics were created in each of three different experimental test development conditions. The first 15 items were created by content specialists using traditional test development methods (Group 1 Traditional). The second 15 items were created by the same content specialists using AIG methods (Group 1 AIG). The third 15 items were created by a new group of content specialists using traditional methods (Group 2 Traditional). These 45 items were then evaluated for quality by a four-member panel of medical experts and were subsequently categorised as either Traditional or AIG items.

RESULTS

Three outcomes were reported: (i) the items produced using traditional and AIG processes were comparable on seven of eight indicators of multiple-choice item quality; (ii) AIG items could be differentiated from Traditional items by the quality of their distractors; and (iii) the overall predictive accuracy of the four expert medical panellists was 42%.

CONCLUSIONS

Items generated by AIG methods are, for the most part, equivalent to traditionally developed items from the perspective of expert medical reviewers. Although the AIG method produced fewer plausible distractors than the traditional method, medical experts could not consistently distinguish AIG items from traditionally developed items in a blind review.

