Department of General Psychology, University of Padua.
Department of Developmental and Social Psychology, University of Padua.
Psychol Methods. 2021 Oct;26(5):622-634. doi: 10.1037/met0000416.
The evaluation of agreement among experts in a classification task is crucial in many situations (e.g., medical and psychological diagnosis, legal reports). Traditional indexes used to estimate interrater agreement (such as Cohen's κ) simply count the number of observed agreements and correct this count for chance agreement. In this article, we introduce a new theoretical framework for the evaluation of interrater agreement based on the possibility of adjusting the observed classifications produced by the raters. This framework rests on the introduction and formalization of two concepts involved in the classification task: (a) the belonging measure of an object to a category and (b) the rater's belonging threshold, that is, the minimally sufficient value of the belonging measure at which the rater classifies an object into a category. These factors are ignored by traditional indexes of interrater agreement, although their role may be decisive. Two Bayesian models are tested through a Monte Carlo simulation study to evaluate the accuracy of the new methodology in estimating raters' thresholds and the actual degree of agreement between two independent raters. Results show that computing traditional interrater agreement indexes on the adjusted classifications yields a more accurate estimate of the experts' actual agreement. The improvement is greater when a large difference between the raters' belonging thresholds is observed; when the difference is small, the proposed method provides results similar to those obtained from the observed classifications. Finally, an empirical application to the field of psychological assessment is presented to show how the method can be used in practice.
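
To make the role of the belonging threshold concrete, the following minimal Python sketch (an illustration, not the paper's Bayesian estimation procedure) assumes hypothetical belonging measures drawn from a Beta distribution, hypothetical belonging thresholds of .40 and .65 for the two raters, and a small amount of rater-specific perception noise. It shows how Cohen's κ computed on the raw observed classifications is deflated by the threshold gap, whereas κ computed after both raters' classifications are adjusted to a common threshold comes closer to their actual agreement.

```python
import numpy as np

def cohens_kappa(a, b, n_categories=2):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    a, b = np.asarray(a), np.asarray(b)
    p_o = np.mean(a == b)                          # observed agreement
    p_e = sum(np.mean(a == c) * np.mean(b == c)    # chance agreement from the marginals
              for c in range(n_categories))
    return (p_o - p_e) / (1 - p_e)

rng = np.random.default_rng(2021)

# Hypothetical latent belonging measures of 500 objects to the target category (in [0, 1]).
belonging = rng.beta(2, 2, size=500)

# Each rater perceives the belonging measure with some idiosyncratic noise and classifies
# an object into the category when the perceived value reaches his or her own belonging
# threshold (thresholds are fixed by assumption here; the paper estimates them).
perceived_1 = np.clip(belonging + rng.normal(0, 0.05, size=belonging.size), 0, 1)
perceived_2 = np.clip(belonging + rng.normal(0, 0.05, size=belonging.size), 0, 1)
threshold_1, threshold_2 = 0.40, 0.65              # hypothetical belonging thresholds
observed_1 = (perceived_1 >= threshold_1).astype(int)
observed_2 = (perceived_2 >= threshold_2).astype(int)

# Kappa on the raw classifications is deflated by the difference between the thresholds.
print("kappa, observed classifications:", round(cohens_kappa(observed_1, observed_2), 3))

# Adjusted classifications: once the thresholds are (assumed) known, both raters'
# perceived measures are re-dichotomized at a common threshold before computing kappa.
common_threshold = (threshold_1 + threshold_2) / 2
adjusted_1 = (perceived_1 >= common_threshold).astype(int)
adjusted_2 = (perceived_2 >= common_threshold).astype(int)
print("kappa, adjusted classifications:", round(cohens_kappa(adjusted_1, adjusted_2), 3))
```

Under these assumptions the adjusted κ is typically markedly higher than the κ on the raw classifications, mirroring the abstract's claim that the gain is largest when the raters' belonging thresholds differ substantially and negligible when they are close.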