Laboratory for Bionic Integration, Department of Biomedical Engineering, ND20, Cleveland Clinic, 9500 Euclid Avenue, Cleveland, OH, 44195, USA.
BMC Med Res Methodol. 2018 Nov 19;18(1):141. doi: 10.1186/s12874-018-0606-7.
Indices of inter-evaluator reliability are used in many fields such as computational linguistics, psychology, and medical science; however, the interpretation of resulting values and determination of appropriate thresholds lack context and are often guided only by arbitrary "rules of thumb" or simply not addressed at all. Our goal for this work was to develop a method for determining the relationship between inter-evaluator agreement and error to facilitate meaningful interpretation of values, thresholds, and reliability.
Three expert human evaluators completed a video analysis task, and their results were averaged to create a reference dataset of 300 time measurements. We superimposed unique combinations of systematic error and random error onto the reference dataset to generate 4900 new hypothetical evaluators (each with 300 time measurements). The systematic and random errors made by the hypothetical evaluator population were approximated as the mean and variance, respectively, of a normally distributed error signal. Calculating the error (using percent error) and inter-evaluator agreement (using Krippendorff's alpha) between each hypothetical evaluator and the reference dataset allowed us to establish a mathematical model and a value envelope of the worst possible percent error for any given level of agreement.
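The simulation step can be sketched as follows. This is a minimal illustration, not the authors' code: the placeholder reference values, the grid of error means and standard deviations, the use of mean absolute percent error, and the two-coder interval-level implementation of Krippendorff's alpha are all assumptions introduced here for concreteness (time data might instead call for the ratio-level difference metric).

```python
import numpy as np

def krippendorff_alpha_interval(coder_a, coder_b):
    """Krippendorff's alpha for two coders, interval-level data, no missing values."""
    a = np.asarray(coder_a, dtype=float)
    b = np.asarray(coder_b, dtype=float)
    # Observed disagreement: mean squared difference within each unit.
    d_o = np.mean((a - b) ** 2)
    # Expected disagreement: mean squared difference over all pairs of pooled values.
    pooled = np.concatenate([a, b])
    n = pooled.size
    d_e = np.sum((pooled[:, None] - pooled[None, :]) ** 2) / (n * (n - 1))
    return 1.0 - d_o / d_e

rng = np.random.default_rng(0)
reference = rng.uniform(1.0, 10.0, size=300)    # placeholder for the 300 reference times (s)

# 70 x 70 grid of systematic (mean) and random (std. dev.) error levels -> 4900 evaluators.
systematic_levels = np.linspace(-1.0, 1.0, 70)  # assumed range, in seconds
random_levels = np.linspace(0.0, 1.0, 70)       # assumed range, in seconds

records = []
for mu in systematic_levels:
    for sigma in random_levels:
        # Hypothetical evaluator = reference + normally distributed error signal.
        simulated = reference + rng.normal(loc=mu, scale=sigma, size=reference.size)
        # "Percent error" here is mean absolute percent error relative to the reference.
        pct_err = 100.0 * np.mean(np.abs(simulated - reference) / reference)
        alpha = krippendorff_alpha_interval(reference, simulated)
        records.append((mu, sigma, pct_err, alpha))
```

Each tuple in `records` pairs one hypothetical evaluator's error level with its agreement against the reference, which is the raw material for the agreement-versus-error envelope described above.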
We used the relationship between inter-evaluator agreement and error to make an informed judgment of an acceptable threshold for Krippendorff's alpha within the context of our specific test. To demonstrate the utility of our modeling approach, we calculated the percent error and Krippendorff's alpha between the reference dataset and a new cohort of trained human evaluators and used our contextually derived Krippendorff's alpha threshold as a gauge of evaluator quality. Although all evaluators had relatively high agreement (> 0.9) compared with the rule-of-thumb cutoff (0.8), our agreement threshold accepted evaluators with low error while rejecting one evaluator with relatively high error.
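A contextual alpha threshold can then be derived from the worst-case envelope: the smallest agreement value at which even the worst possible percent error remains acceptable. The sketch below continues from the `records` list in the previous example; the 5% error tolerance is an arbitrary illustrative value, not a figure from the study.

```python
import numpy as np

# `records` comes from the previous sketch: (mean, sd, pct_err, alpha) per hypothetical evaluator.
results = np.array(records)
pct_err, alpha = results[:, 2], results[:, 3]
tolerance_pct = 5.0                                 # assumed acceptable error, for illustration

# Sort by agreement, then find the worst-case error among all evaluators whose
# agreement is at least as high as each candidate threshold (the envelope).
order = np.argsort(alpha)
alpha_sorted = alpha[order]
worst_err_at_or_above = np.maximum.accumulate(pct_err[order][::-1])[::-1]

# Contextual threshold: lowest alpha whose worst-case error stays within tolerance.
acceptable = np.where(worst_err_at_or_above <= tolerance_pct)[0]
alpha_threshold = alpha_sorted[acceptable[0]] if acceptable.size else 1.0
print(f"Contextual Krippendorff's alpha threshold: {alpha_threshold:.3f}")

# A new evaluator could then be screened against this threshold, e.g.
# accept if krippendorff_alpha_interval(reference, new_evaluator) >= alpha_threshold.
```

Because the threshold is tied to the worst-case error at each agreement level, it reflects the tolerance chosen for the specific task rather than a generic cutoff.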
We found that our approach established threshold values of reliability, within the context of our evaluation criteria, that were far less permissive than the typically accepted "rule of thumb" cutoff for Krippendorff's alpha. This procedure provides a less arbitrary method for determining a reliability threshold and can be tailored to work within the context of any reliability index.