
Rater characteristics, response content, and scoring contexts: Decomposing the determinates of scoring accuracy.

Author information

Palermo Corey

Affiliation

Measurement Incorporated, Durham, NC, United States.

Publication information

Front Psychol. 2022 Aug 10;13:937097. doi: 10.3389/fpsyg.2022.937097. eCollection 2022.

DOI: 10.3389/fpsyg.2022.937097
PMID: 36033049
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9399925/
Abstract

Raters may introduce construct-irrelevant variance when evaluating written responses to performance assessments, threatening the validity of students' scores. Numerous factors in the rating process, including the content of students' responses, the characteristics of raters, and the context in which the scoring occurs, are thought to influence the quality of raters' scores. Despite considerable study of rater effects, little research has examined the relative impacts of the factors that influence rater accuracy. In practice, such integrated examinations are needed to afford evidence-based decisions of rater selection, training, and feedback. This study provides the first naturalistic, integrated examination of rater accuracy in a large-scale assessment program. Leveraging rater monitoring data from an English language arts (ELA) summative assessment program, I specified cross-classified, multilevel models via Bayesian (i.e., Markov chain Monte Carlo) estimation to decompose the impact of response content, rater characteristics, and scoring contexts on rater accuracy. Results showed relatively little variation in accuracy attributable to teams, items, and raters. Raters did not collectively exhibit differential accuracy over time, though there was significant variation in individual raters' scoring accuracy from response to response and day to day. I found considerable variation in accuracy across responses, which was in part explained by text features and other measures of response content that influenced scoring difficulty. Some text features differentially influenced the difficulty of scoring research and writing content. Multiple measures of raters' qualification performance predicted their scoring accuracy, but general rater background characteristics including experience and education did not. Site-based and remote raters demonstrated comparable accuracy, while evening-shift raters were slightly less accurate, on average, than day-shift raters. This naturalistic, integrated examination of rater accuracy extends previous research and provides implications for rater recruitment, training, monitoring, and feedback to improve human evaluation of written responses.
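
To make the idea of "decomposing" accuracy concrete, the following is a minimal sketch of how variance components from a cross-classified multilevel model can be turned into shares of total variance. It assumes a logistic model for a binary accuracy outcome and uses the common latent-response convention in which the level-1 residual variance is fixed at π²/3; the abstract does not state that the study used this formulation. The function name `variance_partition` and all numeric values are hypothetical illustrations, not estimates from the article.

```python
import math

# Residual variance of the standard logistic distribution, used as the
# level-1 variance under the latent-response formulation (an assumption here).
LOGISTIC_RESIDUAL_VARIANCE = math.pi ** 2 / 3


def variance_partition(components, residual=LOGISTIC_RESIDUAL_VARIANCE):
    """Return each component's share of total variance, plus the residual share.

    `components` maps a grouping factor (e.g. "rater") to its estimated
    random-effect variance on the logit scale.
    """
    if residual < 0 or any(v < 0 for v in components.values()):
        raise ValueError("variances must be non-negative")
    total = sum(components.values()) + residual
    if total == 0:
        raise ValueError("total variance must be positive")
    shares = {name: var / total for name, var in components.items()}
    shares["residual"] = residual / total
    return shares


# Hypothetical variance components, chosen only for illustration.
example = {"response": 1.2, "rater": 0.2, "item": 0.1, "team": 0.05}
shares = variance_partition(example)

# Print the factors from largest to smallest share.
for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:>8}: {share:.1%}")
```

In a real analysis the variance components would come from the posterior draws of the fitted model, and the shares would be computed per draw so that they carry credible intervals rather than single point values.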


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5a96/9399925/9cb13a87f60f/fpsyg-13-937097-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5a96/9399925/688f1d95482e/fpsyg-13-937097-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5a96/9399925/b8aaf8b747d5/fpsyg-13-937097-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5a96/9399925/5845eae2ec99/fpsyg-13-937097-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5a96/9399925/8dcbad9c8a48/fpsyg-13-937097-g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5a96/9399925/a70b1e9e830f/fpsyg-13-937097-g006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5a96/9399925/0b80d8c42ae4/fpsyg-13-937097-g007.jpg
