Accuracy of performance-test linking based on a many-facet Rasch model.

Affiliation

The University of Electro-Communications, Tokyo, Japan.

Publication information

Behav Res Methods. 2021 Aug;53(4):1440-1454. doi: 10.3758/s13428-020-01498-x. Epub 2020 Nov 9.

DOI:10.3758/s13428-020-01498-x

PMID:33169286

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8367909/

Abstract

Performance assessments, in which human raters assess examinee performance in practical tasks, have attracted much attention in various assessment contexts involving measurement of higher-order abilities. However, a persistent difficulty is that ability measurement accuracy strongly depends on rater and task characteristics such as rater severity and task difficulty. To resolve this problem, various item response theory (IRT) models incorporating rater and task parameters, including many-facet Rasch models (MFRMs), have been proposed. When applying such IRT models to datasets comprising results of multiple performance tests administered to different examinees, test linking is needed to unify the scale for model parameters estimated from individual test results. In test linking, test administrators generally need to design multiple tests such that raters and tasks partially overlap. The accuracy of linking under this design depends heavily on the numbers of common raters and tasks. However, the numbers of common raters and tasks required to ensure high accuracy in test linking remain unclear, making it difficult to determine appropriate test designs. We therefore empirically evaluate the accuracy of IRT-based performance-test linking under common rater and task designs. Concretely, we conduct simulation experiments that examine linking accuracy based on an MFRM while varying the numbers of common raters and tasks, together with various other factors that may affect linking accuracy.
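
The design described in the abstract, two tests whose raters and tasks partially overlap, can be illustrated with a small simulation. The sketch below is not the authors' code. It assumes a common rating-scale form of the MFRM in which the logit of score category k is the sum over m = 1..k of (theta - beta_task - beta_rater - d_m), and all function names, sample sizes, and parameter distributions are hypothetical choices made for illustration. It only generates rating data; parameter estimation and the linking step itself are not shown.

```python
import numpy as np


def mfrm_probs(theta, beta_task, beta_rater, steps):
    """Category probabilities for one examinee-task-rater combination.

    Assumed rating-scale form: the logit of category k is the sum over
    m = 1..k of (theta - beta_task - beta_rater - steps[m-1]); category 0
    has logit 0.
    """
    eta = theta - beta_task - beta_rater
    logits = np.concatenate(([0.0], np.cumsum(eta - np.asarray(steps, dtype=float))))
    logits -= logits.max()  # subtract the maximum for numerical stability
    p = np.exp(logits)
    return p / p.sum()


def overlapping_ids(n_per_test, n_common):
    """IDs used by two tests that share their first n_common elements."""
    test1 = list(range(n_per_test))
    test2 = list(range(n_common)) + list(range(n_per_test, 2 * n_per_test - n_common))
    return test1, test2


def simulate_test(rng, thetas, task_ids, rater_ids, beta_task, beta_rater, steps):
    """Draw a fully crossed examinee x task x rater array of scores."""
    data = np.empty((len(thetas), len(task_ids), len(rater_ids)), dtype=int)
    for j, theta in enumerate(thetas):
        for a, i in enumerate(task_ids):
            for b, r in enumerate(rater_ids):
                p = mfrm_probs(theta, beta_task[i], beta_rater[r], steps)
                data[j, a, b] = rng.choice(len(p), p=p)
    return data


# Hypothetical settings: 5 raters and 4 tasks per test, 2 of each in common.
rng = np.random.default_rng(0)
n_raters, n_tasks, n_common = 5, 4, 2
raters1, raters2 = overlapping_ids(n_raters, n_common)
tasks1, tasks2 = overlapping_ids(n_tasks, n_common)

# One severity per distinct rater and one difficulty per distinct task.
beta_rater = rng.normal(0.0, 1.0, size=2 * n_raters - n_common)
beta_task = rng.normal(0.0, 1.0, size=2 * n_tasks - n_common)
steps = np.array([-1.0, 0.0, 1.0])  # four score categories, 0..3

# The two examinee groups differ in mean ability, which is why a shared
# scale is needed before their estimates can be compared.
test1 = simulate_test(rng, rng.normal(0.0, 1.0, size=20), tasks1, raters1,
                      beta_task, beta_rater, steps)
test2 = simulate_test(rng, rng.normal(0.5, 1.0, size=20), tasks2, raters2,
                      beta_task, beta_rater, steps)
print(test1.shape, test2.shape)
```

Varying `n_common` in a loop and estimating parameters from each generated pair of tests would mirror the kind of experiment the abstract describes, with the shared rater and task IDs serving as the anchors that tie the two scales together.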

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2dc6/8367909/3a5201eef722/13428_2020_1498_Fig1_HTML.jpg

Similar articles

Accuracy of performance-test linking based on a many-facet Rasch model.

Behav Res Methods. 2021 Aug;53(4):1440-1454. doi: 10.3758/s13428-020-01498-x. Epub 2020 Nov 9.

Linking essay-writing tests using many-facet models and neural automated essay scoring.

Behav Res Methods. 2024 Dec;56(8):8450-8479. doi: 10.3758/s13428-024-02485-2. Epub 2024 Aug 20.

Empirical comparison of item response theory models with rater's parameters.

Heliyon. 2018 May 8;4(5):e00622. doi: 10.1016/j.heliyon.2018.e00622. eCollection 2018 May.

Item response theory model highlighting rating scale of a rubric and rater-rubric interaction in objective structured clinical examination.

PLoS One. 2024 Sep 6;19(9):e0309887. doi: 10.1371/journal.pone.0309887. eCollection 2024.

A Bayesian many-facet Rasch model with Markov modeling for rater severity drift.

Behav Res Methods. 2023 Oct;55(7):3910-3928. doi: 10.3758/s13428-022-01997-z. Epub 2022 Oct 25.

Examining Rater Judgements in Music Performance Assessment using Many-Facets Rasch Rating Scale Measurement Model.

J Appl Meas. 2019;20(1):79-99.

A comparative analysis of the ratings in performance assessment using generalizability theory and the many-facet Rasch model.

J Appl Meas. 2009;10(4):408-23.

Item response theory: applications of modern test theory in medical education.

Med Educ. 2003 Aug;37(8):739-45. doi: 10.1046/j.1365-2923.2003.01587.x.

Using Generalizability Theory and Many-Facet Rasch Model to Evaluate In-Basket Tests for Managerial Positions.

Front Psychol. 2021 Jul 29;12:660553. doi: 10.3389/fpsyg.2021.660553. eCollection 2021.

Using Repeated Ratings to Improve Measurement Precision in Incomplete Rating Designs.

J Appl Meas. 2018;19(2):148-161.

Cited by

Linking essay-writing tests using many-facet models and neural automated essay scoring.

Behav Res Methods. 2024 Dec;56(8):8450-8479. doi: 10.3758/s13428-024-02485-2. Epub 2024 Aug 20.

Human ratings take time: A hierarchical facets model for the joint analysis of ratings and rating times.

Behav Res Methods. 2024 Apr;56(4):3535-3547. doi: 10.3758/s13428-023-02259-2. Epub 2023 Nov 2.

A Bayesian many-facet Rasch model with Markov modeling for rater severity drift.

Behav Res Methods. 2023 Oct;55(7):3910-3928. doi: 10.3758/s13428-022-01997-z. Epub 2022 Oct 25.

References

Using the Many-Facet Rasch Model to analyse and evaluate the quality of objective structured clinical examination: a non-experimental cross-sectional design.

BMJ Open. 2019 Sep 6;9(9):e029208. doi: 10.1136/bmjopen-2019-029208.

Exploring the Combined Effects of Rater Misfit and Differential Rater Functioning in Performance Assessments.

Educ Psychol Meas. 2019 Oct;79(5):962-987. doi: 10.1177/0013164419834613. Epub 2019 Apr 2.

Evaluating Different Equating Setups in the Continuous Item Pool Calibration for Computerized Adaptive Testing.

Front Psychol. 2019 Jun 6;10:1277. doi: 10.3389/fpsyg.2019.01277. eCollection 2019.

Evaluating Anchor-Item Designs for Concurrent Calibration With the GGUM.

Appl Psychol Meas. 2017 Mar;41(2):83-96. doi: 10.1177/0146621616673997. Epub 2016 Nov 4.

Empirical comparison of item response theory models with rater's parameters.

Heliyon. 2018 May 8;4(5):e00622. doi: 10.1016/j.heliyon.2018.e00622. eCollection 2018 May.

The computation of equating errors in international surveys in education.

J Appl Meas. 2007;8(3):323-35.

Detecting and measuring rater effects using many-facet Rasch measurement: part I.

J Appl Meas. 2003;4(4):386-422.

Detecting differential rater functioning over time (DRIFT) using a Rasch multi-faceted rating scale model.

J Appl Meas. 2001;2(3):256-80.

Constructing rater and task banks for performance assessments.

J Outcome Meas. 1997;1(1):19-33.
