Accuracy of performance-test linking based on a many-facet Rasch model.

Affiliation

The University of Electro-Communications, Tokyo, Japan.

Publication information

Behav Res Methods. 2021 Aug;53(4):1440-1454. doi: 10.3758/s13428-020-01498-x. Epub 2020 Nov 9.

DOI:10.3758/s13428-020-01498-x
PMID:33169286
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8367909/
Abstract

Performance assessments, in which human raters assess examinee performance in practical tasks, have attracted much attention in various assessment contexts involving measurement of higher-order abilities. However, difficulty persists in that ability measurement accuracy strongly depends on rater and task characteristics such as rater severity and task difficulty. To resolve this problem, various item response theory (IRT) models incorporating rater and task parameters, including many-facet Rasch models (MFRMs), have been proposed. When applying such IRT models to datasets comprising results of multiple performance tests administered to different examinees, test linking is needed to unify the scale for model parameters estimated from individual test results. In test linking, test administrators generally need to design multiple tests such that raters and tasks partially overlap. The accuracy of linking under this design is highly reliant on the numbers of common raters and tasks. However, the numbers of common raters and tasks required to ensure high accuracy in test linking remain unclear, making it difficult to determine appropriate test designs. We therefore empirically evaluate the accuracy of IRT-based performance-test linking under common rater and task designs. Concretely, we conduct evaluations through simulation experiments that examine linking accuracy based on a MFRM while changing numbers of common raters and tasks with various factors that possibly affect linking accuracy.
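
To make the model and the linking idea in the abstract concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the common rating-scale form of the MFRM, in which the log-odds of category k over k-1 equal examinee ability minus task difficulty minus rater severity minus a step parameter. The linking function is a simple mean shift over common raters, chosen here only for illustration; the abstract does not say which linking method the paper uses. All function and variable names are hypothetical.

```python
import numpy as np


def mfrm_category_probs(theta, task_difficulty, rater_severity, step_params):
    """Category probabilities P(X = k), k = 1..K, under a rating-scale MFRM.

    Assumed form: log(P_k / P_{k-1}) = theta - task_difficulty
                                       - rater_severity - step_params[k].
    step_params has length K; its first entry only adds a constant shared by
    all categories, so it cancels after normalization (it is usually set to 0).
    """
    steps = np.asarray(step_params, dtype=float)
    # Cumulative sum over m = 1..k of (theta - beta_task - beta_rater - d_m).
    logits = np.cumsum(theta - task_difficulty - rater_severity - steps)
    logits -= logits.max()  # subtract the maximum for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()


def mean_shift_linking_constant(severity_base, severity_new, common_raters):
    """Location shift that moves the new test's rater severities onto the
    base test's scale, estimated as the mean difference over common raters.

    Illustrative assumption: only the scale origin differs between the two
    separately estimated tests.
    """
    diffs = [severity_base[r] - severity_new[r] for r in common_raters]
    return float(np.mean(diffs))


if __name__ == "__main__":
    # One examinee, one task, one rater, four rating categories.
    probs = mfrm_category_probs(
        theta=1.0,
        task_difficulty=0.3,
        rater_severity=-0.2,
        step_params=[0.0, -1.0, 0.2, 0.8],
    )
    print("category probabilities:", probs)

    # Two tests that share raters r1 and r2 (made-up severity estimates).
    base = {"r1": 0.5, "r2": -0.2, "r3": 0.1}
    new = {"r1": 0.2, "r2": -0.5, "r4": 0.3}
    shift = mean_shift_linking_constant(base, new, ["r1", "r2"])
    linked = {rater: value + shift for rater, value in new.items()}
    print("linking constant:", shift)
    print("new-test severities on the base scale:", linked)
```

With few common raters, the estimated shift averages over only a handful of noisy differences, which is one intuitive reason the abstract treats the number of common raters and tasks as the central design question.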

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2dc6/8367909/3a5201eef722/13428_2020_1498_Fig1_HTML.jpg
