
The Effect of the Raters' Marginal Distributions on Their Matched Agreement: A Rescaling Framework for Interpreting Kappa.

Author Information

Karelitz Tzur M, Budescu David V

Affiliations

a National Institute for Testing and Evaluation, Jerusalem, Israel.

b Department of Psychology, Fordham University.

Publication Information

Multivariate Behav Res. 2013 Nov;48(6):923-52. doi: 10.1080/00273171.2013.830064.

DOI: 10.1080/00273171.2013.830064
PMID: 26745599
Abstract

Cohen's κ measures the improvement in classification above chance level and it is the most popular measure of interjudge agreement. Yet, there is considerable confusion about its interpretation. Specifically, researchers often ignore the fact that the observed level of matched agreement is bounded from above and below and the bounds are a function of the particular marginal distributions of the table. We propose that these bounds should be used to rescale the components of κ (observed and expected agreement). Rescaling κ in this manner results in κ', a measure that was originally proposed by Cohen (1960) and was largely ignored in both research and practice. This measure provides a common scale for agreement measures of tables with different marginal distributions. It reaches the maximal value of 1 when the judges show the highest level of agreement possible, given their marginal disagreements. We conclude that κ' should be used to measure the level of matched agreement contingent on a particular set of marginal distributions. The article provides a framework and a set of guidelines that facilitate comparisons between various types of agreement tables. We illustrate our points with simulations and real data from two studies-one involving judges' ratings of baseball players and one involving ratings of essays in high-stakes tests.
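The rescaling described in the abstract can be sketched numerically. The snippet below is a minimal illustration (not the authors' code), assuming the κ' in question is the κ/κ_max rescaling from Cohen (1960), where the upper bound on matched agreement given the marginals is p_max = Σᵢ min(rowᵢ, colᵢ):

```python
import numpy as np

def kappa_and_kappa_prime(table):
    """Compute Cohen's kappa and the rescaled kappa' = kappa / kappa_max.

    kappa' divides kappa by its maximum attainable value given the
    observed marginal distributions, so it reaches 1 when the raters
    agree as much as their marginal disagreements allow.
    """
    p = np.asarray(table, dtype=float)
    p /= p.sum()                        # joint proportions
    row = p.sum(axis=1)                 # rater 1 marginals
    col = p.sum(axis=0)                 # rater 2 marginals
    po = np.trace(p)                    # observed agreement
    pe = row @ col                      # chance-expected agreement
    pmax = np.minimum(row, col).sum()   # upper bound on matched agreement
    kappa = (po - pe) / (1 - pe)
    kappa_max = (pmax - pe) / (1 - pe)
    return kappa, kappa / kappa_max

# Example: raters with unequal marginals (50/50 vs. 45/55)
k, k_prime = kappa_and_kappa_prime([[40, 10], [5, 45]])
# kappa = 0.70; kappa' = 0.70 / 0.90 ≈ 0.778
```

Because p_max < 1 whenever the two marginal distributions differ, κ can never reach 1 for such a table, whereas κ' puts tables with different marginals on a common 0-to-1 scale.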


Similar Articles

1
The Effect of the Raters' Marginal Distributions on Their Matched Agreement: A Rescaling Framework for Interpreting Kappa.
Multivariate Behav Res. 2013 Nov;48(6):923-52. doi: 10.1080/00273171.2013.830064.
2
Summary measures of agreement and association between many raters' ordinal classifications.
Ann Epidemiol. 2017 Oct;27(10):677-685.e4. doi: 10.1016/j.annepidem.2017.09.001. Epub 2017 Sep 22.
3
Clinicians are right not to like Cohen's κ.
BMJ. 2013 Apr 12;346:f2125. doi: 10.1136/bmj.f2125.
4
Measuring agreement for ordered ratings in 3 x 3 tables.
Methods Inf Med. 2006;45(5):541-7.
5
[Quality criteria of assessment scales--Cohen's kappa as measure of interrator reliability (1)].
Pflege. 2004 Feb;17(1):36-46. doi: 10.1024/1012-5302.17.1.36.
6
Delta: a new measure of agreement between two raters.
Br J Math Stat Psychol. 2004 May;57(Pt 1):1-19. doi: 10.1348/000711004849268.
7
Pitfalls in the use of kappa when interpreting agreement between multiple raters in reliability studies.
Physiotherapy. 2014 Mar;100(1):27-35. doi: 10.1016/j.physio.2013.08.002. Epub 2013 Nov 18.
8
Random marginal agreement coefficients: rethinking the adjustment for chance when measuring agreement.
Biostatistics. 2005 Jan;6(1):171-80. doi: 10.1093/biostatistics/kxh027.
9
Evaluating the effects of rater and subject factors on measures of association.
Biom J. 2018 May;60(3):639-656. doi: 10.1002/bimj.201700078. Epub 2018 Jan 19.
10
A sequential test for assessing observed agreement between raters.
Biom J. 2018 Jan;60(1):128-145. doi: 10.1002/bimj.201600239. Epub 2017 Sep 12.

Cited By

1
Detecting Clusters/Communities in Social Networks.
Multivariate Behav Res. 2018 Jan-Feb;53(1):57-73. doi: 10.1080/00273171.2017.1391682. Epub 2017 Dec 8.
2
Analyzing Test-Taking Behavior: Decision Theory Meets Psychometric Theory.
Psychometrika. 2015 Dec;80(4):1105-22. doi: 10.1007/s11336-014-9425-x. Epub 2014 Aug 21.