The University of Electro-Communications, Tokyo, Japan.
Behav Res Methods. 2021 Aug;53(4):1440-1454. doi: 10.3758/s13428-020-01498-x. Epub 2020 Nov 9.
Performance assessments, in which human raters assess examinee performance on practical tasks, have attracted much attention in assessment contexts that involve measuring higher-order abilities. A persistent difficulty, however, is that the accuracy of ability measurement depends strongly on rater and task characteristics such as rater severity and task difficulty. To resolve this problem, various item response theory (IRT) models incorporating rater and task parameters, including many-facet Rasch models (MFRMs), have been proposed. When such IRT models are applied to datasets comprising the results of multiple performance tests administered to different examinees, test linking is needed to place the model parameters estimated from the individual test results on a common scale. For test linking, test administrators generally need to design the tests so that raters and tasks partially overlap. The accuracy of linking under this design depends heavily on the numbers of common raters and tasks. However, the numbers of common raters and tasks required to ensure high linking accuracy remain unclear, which makes it difficult to determine appropriate test designs. We therefore empirically evaluate the accuracy of IRT-based performance-test linking under common rater and task designs. Concretely, we conduct simulation experiments that examine linking accuracy based on an MFRM while varying the numbers of common raters and tasks together with other factors that may affect linking accuracy.
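To make the setting concrete, the following is a minimal sketch of how rating data under a common-rater design might be simulated. It assumes the widely used rating-scale form of the MFRM, in which the log-odds of category k over k-1 is ability minus task difficulty minus rater severity minus a step parameter. The abstract does not state the exact parameterization or simulation settings used in the study, so the function names, the rater-numbering scheme, and all parameter values here are hypothetical.

```python
import math
import random


def mfrm_category_probs(theta, task_difficulty, rater_severity, steps):
    """Category probabilities under an assumed rating-scale MFRM.

    steps holds the step parameters d_1..d_{K-1}; category 0 is the
    reference category with a cumulative logit of 0.
    """
    logits = [0.0]
    for d in steps:
        logits.append(logits[-1] + theta - task_difficulty - rater_severity - d)
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]


def simulate_rating(rng, theta, task_difficulty, rater_severity, steps):
    """Draw one rating category from the model probabilities."""
    probs = mfrm_category_probs(theta, task_difficulty, rater_severity, steps)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def common_rater_design(n_raters_per_test, n_common):
    """Rater IDs for two tests that share exactly n_common raters (hypothetical layout)."""
    test1 = list(range(n_raters_per_test))
    start = n_raters_per_test - n_common
    test2 = list(range(start, start + n_raters_per_test))
    return test1, test2


if __name__ == "__main__":
    rng = random.Random(0)
    steps = [-1.0, 0.0, 1.0]  # illustrative step parameters, four categories
    test1, test2 = common_rater_design(n_raters_per_test=5, n_common=2)
    severity = {r: rng.gauss(0.0, 1.0) for r in set(test1) | set(test2)}
    # One examinee of ability 0.5 on a task of difficulty 0.0, rated by test 1's raters.
    ratings = {r: simulate_rating(rng, 0.5, 0.0, severity[r], steps) for r in test1}
    print(sorted(set(test1) & set(test2)), ratings)
```

In this sketch, the raters shared by both tests are what tie the two sets of parameter estimates to a common scale; increasing `n_common` corresponds to the kind of design factor the simulation experiments vary.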