
An Empirical Study Into Annotator Agreement, Ground Truth Estimation, and Algorithm Evaluation.

Publication

IEEE Trans Image Process. 2016 Jun;25(6):2557-2572. doi: 10.1109/TIP.2016.2544703. Epub 2016 Mar 21.

Abstract

Although agreement between the annotators who mark feature locations within images has been studied in the past from a statistical viewpoint, little work has attempted to quantify the extent to which this phenomenon affects the evaluation of foreground-background segmentation algorithms. Many researchers utilize ground truth (GT) in experimentation, and more often than not this GT is derived from one annotator's opinion. How does this difference in opinion affect an algorithm's evaluation? A methodology is applied to four image-processing problems to quantify the interannotator variance and to offer insight into the mechanisms behind agreement and the use of GT. It is found that when detecting linear structures, annotator agreement is very low. The agreement in a structure's position can be partially explained through basic image properties. Automatic segmentation algorithms are compared with annotator agreement, and a clear relation between the two is found. Several GT estimation methods are used to infer the performance of a number of algorithms. It is found that the rank of a detector is highly dependent upon the method used to form the GT, and that although STAPLE and LSML appear to represent the mean of the performance measured using individual annotations, when there are few annotations, or there is a large variance in them, these estimates tend to degrade. Furthermore, one of the most commonly adopted combination methods, consensus voting, accentuates more obvious features, resulting in an overestimation of performance. It is concluded that in some data sets, it is not possible to confidently infer an algorithm ranking when evaluating on one GT.
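The consensus-voting combination method discussed above can be sketched in a few lines: each pixel is marked foreground in the estimated GT only when a majority of annotators agree on it. This is a minimal illustration of the general technique on flattened binary masks, not the authors' implementation; the function name and the default threshold are assumptions for the example.

```python
def consensus_vote(masks, threshold=0.5):
    """Combine binary annotation masks by majority vote.

    masks: list of equal-length lists of 0/1 pixel labels,
           one list per annotator.
    A pixel becomes foreground in the consensus GT when the
    fraction of annotators marking it exceeds the threshold.
    """
    n = len(masks)
    return [1 if sum(pixel_votes) / n > threshold else 0
            for pixel_votes in zip(*masks)]

# Three hypothetical annotators, five pixels.
a = [1, 1, 0, 0, 1]
b = [1, 0, 0, 1, 1]
c = [1, 1, 0, 0, 0]
print(consensus_vote([a, b, c]))  # [1, 1, 0, 0, 1]
```

Note how pixels marked by only one annotator (the fourth pixel here) are dropped entirely: the consensus retains only the features most annotators see, which is one way to picture the abstract's finding that consensus voting accentuates obvious features and can overestimate performance.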

