Linking essay-writing tests using many-facet models and neural automated essay scoring.

Affiliation

The University of Electro-Communications, Tokyo, Japan.

Publication information

Behav Res Methods. 2024 Dec;56(8):8450-8479. doi: 10.3758/s13428-024-02485-2. Epub 2024 Aug 20.

DOI:10.3758/s13428-024-02485-2
PMID:39164563
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11525454/

Abstract

For essay-writing tests, challenges arise when scores assigned to essays are influenced by the characteristics of raters, such as rater severity and consistency. Item response theory (IRT) models incorporating rater parameters have been developed to tackle this issue, exemplified by the many-facet Rasch models. These IRT models enable the estimation of examinees' abilities while accounting for the impact of rater characteristics, thereby enhancing the accuracy of ability measurement. However, difficulties can arise when different groups of examinees are evaluated by different sets of raters. In such cases, test linking is essential for unifying the scale of model parameters estimated for individual examinee-rater groups. Traditional test-linking methods typically require administrators to design groups in which either examinees or raters are partially shared. However, this is often impractical in real-world testing scenarios. To address this, we introduce a novel method for linking the parameters of IRT models with rater parameters that uses neural automated essay scoring technology. Our experimental results indicate that our method successfully accomplishes test linking with accuracy comparable to that of linear linking using few common examinees.
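
The two ideas the abstract relies on, a rater-aware IRT model and linear linking through common examinees, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the rating-scale form of the many-facet Rasch model, and the mean-sigma way of computing the linking constants are all assumptions chosen for illustration, and the numbers are made up.

```python
import math


def mfrm_category_probs(theta, severity, thresholds):
    """Score-category probabilities under a rating-scale many-facet Rasch model.

    theta      -- examinee ability
    severity   -- rater severity (higher means a harsher rater)
    thresholds -- step difficulties for moving from category k-1 to k
    Returns a list of probabilities for categories 0..len(thresholds).
    """
    # Cumulative logits: category 0 is the reference with logit 0.
    logits = [0.0]
    for step in thresholds:
        logits.append(logits[-1] + (theta - severity - step))
    # Softmax with the maximum subtracted for numerical stability.
    peak = max(logits)
    weights = [math.exp(v - peak) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


def mean_sigma_constants(base_abilities, new_abilities):
    """Linear linking constants (A, K) from common examinees (assumed mean-sigma form).

    Both lists hold ability estimates for the SAME examinees, once on the base
    scale and once on the new scale. A * new + K then matches the base scale
    in mean and standard deviation.
    """
    def mean(xs):
        return sum(xs) / len(xs)

    def sd(xs):
        mu = mean(xs)
        return math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))

    slope = sd(base_abilities) / sd(new_abilities)
    intercept = mean(base_abilities) - slope * mean(new_abilities)
    return slope, intercept


# Hypothetical data: three common examinees estimated in two separate groups.
base = [-1.0, 0.0, 1.0]
new = [0.0, 0.5, 1.0]
A, K = mean_sigma_constants(base, new)
linked = [A * x + K for x in new]  # new-group abilities placed on the base scale

# A harsher rater shifts probability mass toward lower score categories.
lenient = mfrm_category_probs(theta=0.5, severity=-1.0, thresholds=[-0.5, 0.5])
harsh = mfrm_category_probs(theta=0.5, severity=1.0, thresholds=[-0.5, 0.5])
```

In this sketch the linking constants come from examinees who appear in both groups. The method described in the abstract is aimed at the case where no such shared examinees or raters exist, using neural automated essay scoring to supply the link instead.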

