Reliability in evaluator-based tests: using simulation-constructed models to determine contextually relevant agreement thresholds.

Affiliation

Laboratory for Bionic Integration, Department of Biomedical Engineering, ND20, Cleveland Clinic, 9500 Euclid Avenue, Cleveland, OH, 44195, USA.

Publication information

BMC Med Res Methodol. 2018 Nov 19;18(1):141. doi: 10.1186/s12874-018-0606-7.

Abstract

BACKGROUND

Indices of inter-evaluator reliability are used in many fields such as computational linguistics, psychology, and medical science; however, the interpretation of resulting values and determination of appropriate thresholds lack context and are often guided only by arbitrary "rules of thumb" or simply not addressed at all. Our goal for this work was to develop a method for determining the relationship between inter-evaluator agreement and error to facilitate meaningful interpretation of values, thresholds, and reliability.

METHODS

Three expert human evaluators completed a video analysis task, and their results were averaged to create a reference dataset of 300 time measurements. We simulated unique combinations of systematic error and random error onto the reference dataset to generate 4900 new hypothetical evaluators (each with 300 time measurements). The systematic and random errors made by the hypothetical evaluator population were approximated as the mean and variance of a normally distributed error signal. Calculating the error (using percent error) and inter-evaluator agreement (using Krippendorff's alpha) between each hypothetical evaluator and the reference dataset allowed us to establish a mathematical model and a value envelope of the worst possible percent error for any given amount of agreement.
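The simulation step described above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' code: the reference values, the error-grid ranges, the 70 x 70 factorization of the 4900 combinations, and the use of the third-party krippendorff package are all assumptions made for illustration.

```python
import numpy as np
import krippendorff  # third-party package: pip install krippendorff

rng = np.random.default_rng(0)

# Stand-in for the paper's reference dataset of 300 averaged expert time
# measurements (hypothetical values, kept away from zero so percent error
# is well-defined).
reference = rng.uniform(0.5, 5.0, size=300)

# Grid of systematic error (mean) and random error (standard deviation).
# The paper reports 4900 combinations, assumed here to be a 70 x 70 grid;
# the ranges are illustrative.
systematic_levels = np.linspace(0.0, 0.7, 70)
random_levels = np.linspace(0.0, 0.7, 70)

records = []
for mu in systematic_levels:
    for sigma in random_levels:
        # Hypothetical evaluator = reference + normally distributed error
        # signal: the mean encodes systematic error, the variance random error.
        evaluator = reference + rng.normal(loc=mu, scale=sigma, size=reference.size)

        # Mean absolute percent error of this evaluator against the reference.
        pct_error = np.mean(np.abs(evaluator - reference) / reference) * 100

        # Inter-evaluator agreement with the reference, treated as interval data.
        alpha = krippendorff.alpha(
            reliability_data=np.vstack([reference, evaluator]),
            level_of_measurement="interval",
        )
        records.append((mu, sigma, pct_error, alpha))
```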

RESULTS

We used the relationship between inter-evaluator agreement and error to make an informed judgment of an acceptable threshold for Krippendorff's alpha within the context of our specific test. To demonstrate the utility of our modeling approach, we calculated the percent error and Krippendorff's alpha between the reference dataset and a new cohort of trained human evaluators, and used our contextually derived Krippendorff's alpha threshold as a gauge of evaluator quality. Although all evaluators had relatively high agreement (> 0.9) compared to the rule of thumb (0.8), our agreement threshold accepted evaluators with low error while rejecting one evaluator with relatively high error.
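Deriving a contextual threshold from the simulated grid can likewise be sketched: bin the agreement values, take the worst-case percent error in each bin (the value envelope), and accept the smallest alpha above which every bin stays within a tolerable error. The sketch below continues from the hypothetical records list above; the 5% tolerance and the bin count are assumptions, not values from the paper.

```python
import numpy as np

# `records` comes from the simulation sketch above: (mu, sigma, pct_error, alpha).
errs = np.array([r[2] for r in records])
alphas = np.array([r[3] for r in records])

# Worst-case envelope: maximum percent error observed within each agreement bin.
bins = np.linspace(alphas.min(), 1.0, 50)
idx = np.digitize(alphas, bins)
envelope = {i: errs[idx == i].max() for i in np.unique(idx)}

# Contextual threshold: scan bins from high agreement downward and stop at the
# first bin whose worst-case error exceeds the tolerance the test can accept
# (the 5% figure is an assumption for illustration).
TOLERANCE = 5.0
alpha_threshold = None
for i in sorted(envelope, reverse=True):
    if envelope[i] > TOLERANCE:
        break
    alpha_threshold = bins[i - 1]  # left edge of the lowest acceptable bin

if alpha_threshold is not None:
    print(f"Accept evaluators with Krippendorff's alpha >= {alpha_threshold:.3f}")
```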

CONCLUSIONS

We found that our approach established threshold values of reliability, within the context of our evaluation criteria, that were far less permissive than the typically accepted "rule of thumb" cutoff for Krippendorff's alpha. This procedure provides a less arbitrary method for determining a reliability threshold and can be tailored to work within the context of any reliability index.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d813/6245899/5feb8c435b1d/12874_2018_606_Fig1_HTML.jpg

Similar articles

Interevaluator reliability of a mock paramedic practical examination.
Prehosp Emerg Care. 2012 Apr-Jun;16(2):277-83. doi: 10.3109/10903127.2011.640413. Epub 2012 Jan 9.

Evaluating the evaluators: interrater reliability on EMT licensing examinations.
Prehosp Emerg Care. 1998 Jan-Mar;2(1):37-46. doi: 10.1080/10903129808958838.

Professional judgment and the interpretation of viable mold air sampling data.
J Occup Environ Hyg. 2008 Oct;5(10):656-63. doi: 10.1080/15459620802310796.

Cited by

Hypothesis testing for detecting outlier evaluators.
Int J Biostat. 2024 Nov 4;20(2):419-431. doi: 10.1515/ijb-2023-0004. eCollection 2024 Nov 1.
