Department of Data Science and Analytics, BI Norwegian Business School, Oslo, Norway.
Psychometrika. 2024 Jun;89(2):517-541. doi: 10.1007/s11336-023-09945-2. Epub 2024 Jan 8.
Most measures of agreement are chance-corrected. They differ along three dimensions: their definition of chance agreement, their choice of disagreement function, and how they handle multiple raters. Chance agreement is usually defined in a pairwise manner, following either Cohen's kappa or Fleiss's kappa. The disagreement function is usually a nominal, quadratic, or absolute value function. How to handle multiple raters is contentious, however, with the main contenders being Fleiss's kappa, Conger's kappa, and Hubert's kappa, the variant of Fleiss's kappa where agreement is said to occur only if every rater agrees. More generally, multi-rater agreement coefficients can be defined in a g-wise way, where the disagreement weighting function takes g raters as arguments instead of two. This paper contains two main contributions. (a) We propose using Fréchet variances to handle the case of multiple raters. The Fréchet variances are intuitive disagreement measures and turn out to generalize the nominal, quadratic, and absolute value functions to the case of more than two raters. (b) We derive the limit theory of g-wise weighted agreement coefficients, with chance agreement of the Cohen type or Fleiss type, for the case where every item is rated by the same number of raters. After comparing three confidence interval constructions, we recommend calculating confidence intervals using the arcsine transform or the Fisher transform.
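As a rough illustration of the two contributions, the following Python sketch computes a Fréchet variance for the ratings one item receives from g raters, and builds Fisher- and arcsine-transformed confidence intervals from a point estimate and its standard error. It rests on assumptions that are not taken from the paper: the Fréchet variance is taken to be the minimum over a center m of the average disagreement between the ratings and m (so the minimizer is the mode, median, or mean), the intervals use the standard delta method, and the function names `frechet_variance`, `fisher_ci`, and `arcsine_ci` are hypothetical. The paper's own normalization and interval constructions may differ.

```python
from collections import Counter
from math import asin, atanh, pi, sin, sqrt, tanh
from statistics import NormalDist, mean, median


def frechet_variance(ratings, kind="quadratic"):
    """Minimum average disagreement between the ratings and a single center.

    Assumed definition (not taken from the paper): min over m of
    (1/g) * sum_i d(x_i, m), with d nominal, absolute, or quadratic.
    """
    g = len(ratings)
    if kind == "nominal":
        # The minimizing center is the mode; the cost is the share of non-modal ratings.
        return 1 - Counter(ratings).most_common(1)[0][1] / g
    if kind == "absolute":
        # The minimizing center is the median.
        m = median(ratings)
        return sum(abs(x - m) for x in ratings) / g
    if kind == "quadratic":
        # The minimizing center is the mean, giving the population variance.
        m = mean(ratings)
        return sum((x - m) ** 2 for x in ratings) / g
    raise ValueError(f"unknown disagreement kind: {kind!r}")


def fisher_ci(estimate, se, level=0.95):
    """Delta-method interval on the atanh scale, mapped back with tanh."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    center = atanh(estimate)
    half_width = z * se / (1 - estimate**2)
    return tanh(center - half_width), tanh(center + half_width)


def arcsine_ci(estimate, se, level=0.95):
    """Delta-method interval on the asin scale, mapped back with sin."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    center = asin(estimate)
    half_width = z * se / sqrt(1 - estimate**2)
    # Clip to [-pi/2, pi/2] so that sin stays monotone on the interval.
    lower = max(center - half_width, -pi / 2)
    upper = min(center + half_width, pi / 2)
    return sin(lower), sin(upper)


# With two raters, each variant is proportional to the usual pairwise function.
print(frechet_variance([1, 3], "quadratic"))    # 1.0, i.e. (1 - 3)**2 / 4
print(frechet_variance([1, 3], "absolute"))     # 1.0, i.e. |1 - 3| / 2
print(frechet_variance(["a", "b"], "nominal"))  # 0.5, i.e. 1[a != b] / 2

# Hypothetical estimate and standard error, for illustration only.
print(fisher_ci(0.5, 0.1))
print(arcsine_ci(0.5, 0.1))
```

Under these assumptions, both transformed intervals stay inside [-1, 1] by construction, which an untransformed interval of the form estimate ± z·se does not guarantee when the estimate is close to 1.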