Westover Alek M, Westover Tara M, Westover M Brandon
Massachusetts Institute of Technology, Boston, MA, USA.
Harvard Medical School, Beth Israel Deaconess Medical Center, Boston, MA, USA.
Open J Stat. 2024 Oct;14(5):481-491. doi: 10.4236/ojs.2024.145021. Epub 2024 Oct 28.
Interrater reliability (IRR) statistics, such as Cohen's kappa, measure agreement between raters beyond what is expected by chance when classifying items into categories. While Cohen's kappa has been widely used, it has several limitations, prompting the development of Gwet's agreement statistic, an alternative "kappa" statistic that models chance agreement via an "occasional guessing" model. However, we show that Gwet's formula for estimating the proportion of agreement due to chance, despite overcoming the limitations of Cohen's kappa at high and low agreement levels, is itself biased at intermediate levels of agreement. We derive a maximum likelihood estimator for the occasional guessing model that yields an unbiased estimator of the IRR, which we call the maximum likelihood kappa. The key result is that the chance agreement probability under the occasional guessing model equals the observed rate of disagreement between raters. The maximum likelihood kappa provides a theoretically principled approach to quantifying IRR that addresses the limitations of previous coefficients. Given the widespread use of IRR measures, an unbiased estimator is important for reliable inference across domains where rater judgments are analyzed.
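The key result stated above pins down the statistic completely: if the chance agreement probability equals the observed disagreement rate, then substituting p_c = 1 - p_o into the standard kappa form (p_o - p_c) / (1 - p_c) gives a closed-form estimator. A minimal sketch of that substitution (the function name `kappa_ml` and the use of the standard kappa form are assumptions for illustration; they are not taken from the paper's notation):

```python
def kappa_ml(p_o: float) -> float:
    """Maximum likelihood kappa under the occasional guessing model.

    Per the abstract's key result, chance agreement p_c is the observed
    disagreement rate (1 - p_o). Substituting into the standard kappa
    form (p_o - p_c) / (1 - p_c) simplifies to (2 * p_o - 1) / p_o.
    """
    p_c = 1.0 - p_o  # chance agreement = observed disagreement rate
    return (p_o - p_c) / (1.0 - p_c)


# Example: two raters who agree on 80% of items.
print(kappa_ml(0.8))
```

Note that the estimator depends only on the observed agreement rate p_o, unlike Cohen's kappa, which also requires the raters' marginal category frequencies.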