Bartman Ilona, Smee Sydney, Roy Marguerite
Evaluation Bureau, Medical Council of Canada, Ottawa, Ontario K1G 5A2, Canada.
Clin Teach. 2013 Feb;10(1):27-31. doi: 10.1111/j.1743-498X.2012.00607.x.
Performance assessments rely on human judgment, and are vulnerable to rater effects (e.g. leniency or harshness). Making valid inferences from performance ratings for high-stakes decisions requires the management of rater effects. A simple method for detecting extreme raters that does not require sophisticated statistical knowledge or software has been developed as part of the quality assurance process for objective structured clinical examinations (OSCEs). We believe it is applicable to a range of examinations that rely on human raters.
The method has three steps. First, extreme raters are identified by comparing individual rater means with the mean of all raters. A rater is deemed extreme if their mean is three standard deviations below (hawks) or above (doves) the overall mean. This criterion is adjustable. Second, the distribution of an extreme rater's scores is compared with the overall distribution for the station. This step mitigates a station effect. Third, the cohort of candidates seen by the rater is examined to rule out a cohort effect.
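The first two steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the data layout (a dictionary mapping each rater to the scores they awarded on one station), and the choice to measure the standard deviation across rater means are all assumptions made for illustration, as the abstract does not specify them. A rater whose mean lies at or beyond the threshold is flagged here.

```python
from collections import Counter
from statistics import mean, stdev


def flag_extreme_raters(scores_by_rater, threshold=3.0):
    """Step 1 (sketch): flag raters whose mean score is extreme.

    scores_by_rater: dict mapping a rater id to a list of scores (assumed layout).
    threshold: number of standard deviations; the abstract uses three and
    notes that the criterion is adjustable.
    """
    rater_means = {r: mean(s) for r, s in scores_by_rater.items()}
    values = list(rater_means.values())
    overall_mean = mean(values)
    # Assumption: the spread is the sample SD of the rater means.
    spread = stdev(values)  # needs at least two raters
    flagged = {}
    if spread == 0:
        return flagged  # all raters have identical means; nobody is extreme
    for rater, rater_mean in rater_means.items():
        z = (rater_mean - overall_mean) / spread
        if z <= -threshold:
            flagged[rater] = "hawk"  # harsh: far below the overall mean
        elif z >= threshold:
            flagged[rater] = "dove"  # lenient: far above the overall mean
    return flagged


def score_distribution(scores):
    """Step 2 (sketch): relative frequency of each score value, so a flagged
    rater's distribution can be set beside the station's overall distribution."""
    counts = Counter(scores)
    total = len(scores)
    return {value: counts[value] / total for value in sorted(counts)}
```

With a sample standard deviation, a single rater can only reach three standard deviations from the mean of rater means when there are at least 11 raters, so a sketch like this is only meaningful for reasonably large rater pools. The third step, checking the candidate cohort, depends on candidate-level data that the abstract does not describe and is left out.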
Of more than 3000 raters, fewer than 0.3% have been identified as extreme using the proposed criteria. Rater performance is being monitored on a regular basis, and the impact of these raters on candidate results will be considered before results are finalised. Extreme raters are contacted by the organisation to review their rating style. If this intervention fails to modify the rater's scoring pattern, the rater is not invited back. As more data are collected, the organisation will assess them to inform the development of approaches to improve extreme rater performance.