An Empirical Study Into Annotator Agreement, Ground Truth Estimation, and Algorithm Evaluation.

Publication information

IEEE Trans Image Process. 2016 Jun;25(6):2557-2572. doi: 10.1109/TIP.2016.2544703. Epub 2016 Mar 21.

DOI:10.1109/TIP.2016.2544703

PMID:27019487

Abstract

Although agreement between the annotators who mark feature locations within images has been studied in the past from a statistical viewpoint, little work has attempted to quantify the extent to which this phenomenon affects the evaluation of foreground-background segmentation algorithms. Many researchers utilize ground truth (GT) in experimentation, and more often than not this GT is derived from one annotator's opinion. How does the difference in opinion affect an algorithm's evaluation? A methodology is applied to four image-processing problems to quantify the interannotator variance and to offer insight into the mechanisms behind agreement and the use of GT. It is found that when detecting linear structures, annotator agreement is very low. The agreement in a structure's position can be partially explained through basic image properties. Automatic segmentation algorithms are compared with annotator agreement and it is found that there is a clear relation between the two. Several GT estimation methods are used to infer a number of algorithm performances. It is found that the rank of a detector is highly dependent upon the method used to form the GT, and that although STAPLE and LSML appear to represent the mean of the performance measured using individual annotations, when there are few annotations, or there is a large variance in them, these estimates tend to degrade. Furthermore, one of the most commonly adopted combination methods, consensus voting, accentuates more obvious features, resulting in an overestimation of performance. It is concluded that in some data sets, it is not possible to confidently infer an algorithm ranking when evaluating upon one GT.
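The dependence of a detector's rank on the choice of GT can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the F1 score as the performance measure, a strict-majority rule as the consensus vote, the helper names, and the ten-pixel toy annotations. STAPLE and LSML are not reproduced here.

```python
from itertools import combinations


def f1_score(pred, ref):
    """F1 of a binary mask against a reference mask (flat lists of 0/1)."""
    tp = sum(1 for p, r in zip(pred, ref) if p and r)
    fp = sum(1 for p, r in zip(pred, ref) if p and not r)
    fn = sum(1 for p, r in zip(pred, ref) if not p and r)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


def consensus_vote(annotations, min_votes=None):
    """Mark a pixel as foreground when at least min_votes annotators did.

    Defaults to a strict majority (an assumed rule, chosen for illustration).
    """
    if min_votes is None:
        min_votes = len(annotations) // 2 + 1
    return [1 if sum(votes) >= min_votes else 0 for votes in zip(*annotations)]


def mean_pairwise_agreement(annotations):
    """Average F1 over all annotator pairs, a simple agreement summary."""
    scores = [f1_score(a, b) for a, b in combinations(annotations, 2)]
    return sum(scores) / len(scores)


# Hypothetical 1-D "image": an obvious feature at pixels 0-2 that every
# annotator marks, and a subtle feature around pixels 6-8 that they mark
# inconsistently.
annotators = [
    [1, 1, 1, 0, 0, 0, 1, 1, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 1, 1, 0],
    [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
]
det_x = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # detects only the obvious feature
det_y = [1, 1, 1, 0, 0, 0, 1, 1, 1, 0]  # also detects the subtle feature

consensus = consensus_vote(annotators)
print("pairwise agreement:", round(mean_pairwise_agreement(annotators), 3))
for name, gt in [("annotator 1", annotators[0]),
                 ("annotator 3", annotators[2]),
                 ("consensus", consensus)]:
    fx, fy = f1_score(det_x, gt), f1_score(det_y, gt)
    print(f"{name}: X={fx:.3f} Y={fy:.3f} best={'X' if fx > fy else 'Y'}")
```

In this toy case the better-ranked detector changes with the annotator chosen as GT, and the majority vote keeps the obvious feature while discarding most of the disputed one, which is the kind of effect the abstract describes for consensus voting.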

Similar articles

An Empirical Study Into Annotator Agreement, Ground Truth Estimation, and Algorithm Evaluation.

IEEE Trans Image Process. 2016 Jun;25(6):2557-2572. doi: 10.1109/TIP.2016.2544703. Epub 2016 Mar 21.

Assessing Inter-Annotator Agreement for Medical Image Segmentation.

IEEE Access. 2023;11:21300-21312. doi: 10.1109/access.2023.3249759. Epub 2023 Feb 27.

Ranking biomedical annotations with annotator's semantic relevancy.

Comput Math Methods Med. 2014;2014:258929. doi: 10.1155/2014/258929. Epub 2014 May 11.

Evaluation of uterine cervix segmentations using ground truth from multiple experts.

Comput Med Imaging Graph. 2009 Apr;33(3):205-16. doi: 10.1016/j.compmedimag.2008.12.002. Epub 2009 Feb 13.

Community annotation experiment for ground truth generation for the i2b2 medication challenge.

J Am Med Inform Assoc. 2010 Sep-Oct;17(5):519-23. doi: 10.1136/jamia.2010.004200.

Modeling annotator preference and stochastic annotation error for medical image segmentation.

Med Image Anal. 2024 Feb;92:103028. doi: 10.1016/j.media.2023.103028. Epub 2023 Nov 17.

Modeling multiple time series annotations as noisy distortions of the ground truth: An Expectation-Maximization approach.

IEEE Trans Affect Comput. 2018 Jan-Mar;9(1):76-89. doi: 10.1109/TAFFC.2016.2592918. Epub 2016 Jul 19.

Learning from multiple annotators for medical image segmentation.

Pattern Recognit. 2023 Jun;138:None. doi: 10.1016/j.patcog.2023.109400.

Performance-based classifier combination in atlas-based image segmentation using expectation-maximization parameter estimation.

IEEE Trans Med Imaging. 2004 Aug;23(8):983-94. doi: 10.1109/TMI.2004.830803.

An analysis of early studies released by the Lung Imaging Database Consortium (LIDC).

Acad Radiol. 2007 Nov;14(11):1382-8. doi: 10.1016/j.acra.2007.08.004.

Cited by

The Venus score for the assessment of the quality and trustworthiness of biomedical datasets.

BioData Min. 2025 Jan 9;18(1):1. doi: 10.1186/s13040-024-00412-x.

Lessons Learned in Building Expertly Annotated Multi-Institution Datasets and Hosting the RSNA AI Challenges.

Radiol Artif Intell. 2024 May;6(3):e230227. doi: 10.1148/ryai.230227.

Non-inferiority of deep learning ischemic stroke segmentation on non-contrast CT within 16-hours compared to expert neuroradiologists.

Sci Rep. 2023 Sep 26;13(1):16153. doi: 10.1038/s41598-023-42961-x.

Assessing Inter-Annotator Agreement for Medical Image Segmentation.

IEEE Access. 2023;11:21300-21312. doi: 10.1109/access.2023.3249759. Epub 2023 Feb 27.

[Research on classification of benign and malignant lung nodules based on three-dimensional multi-view squeeze-and-excitation convolutional neural network].

Sheng Wu Yi Xue Gong Cheng Xue Za Zhi. 2022 Jun 25;39(3):452-461. doi: 10.7507/1001-5515.202110059.

MeQryEP: A Texture Based Descriptor for Biomedical Image Retrieval.

J Healthc Eng. 2022 Apr 11;2022:9505229. doi: 10.1155/2022/9505229. eCollection 2022.

Training Strategies for Radiology Deep Learning Models in Data-limited Scenarios.

Radiol Artif Intell. 2021 Oct 6;3(6):e210014. doi: 10.1148/ryai.2021210014. eCollection 2021 Nov.

On Clinical Agreement on the Visibility and Extent of Anatomical Layers in Digital Gonio Photographs.

Transl Vis Sci Technol. 2021 Sep 1;10(11):1. doi: 10.1167/tvst.10.11.1.

Quantifying Parkinson's disease motor severity under uncertainty using MDS-UPDRS videos.

Med Image Anal. 2021 Oct;73:102179. doi: 10.1016/j.media.2021.102179. Epub 2021 Jul 21.

Enhanced Field-Based Detection of Potato Blight in Complex Backgrounds Using Deep Learning.

Plant Phenomics. 2021 May 16;2021:9835724. doi: 10.34133/2021/9835724. eCollection 2021.
