Bryan Frederick W, Xu Zhoubing, Asman Andrew J, Allen Wade M, Reich Daniel S, Landman Bennett A
Electrical Engineering, Vanderbilt University, Nashville, Tennessee 37235.
Translational Neuroradiology Unit, National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, Maryland 20892.
Med Phys. 2014 Mar;41(3):031903. doi: 10.1118/1.4864236.
Purpose: Expert manual labeling is the gold standard for image segmentation, but the process is difficult, time-consuming, and prone to inter-rater variability. While fully automated methods have successfully targeted many anatomies, automated methods have not yet been developed for numerous essential structures (e.g., the internal structure of the spinal cord as seen on magnetic resonance imaging). Collaborative labeling is a new paradigm that offers a robust alternative, one that may realize both the throughput of automation and the guidance of experts. Yet distributing manual labeling expertise across individuals and sites introduces potential human factors concerns (e.g., training, software usability) and statistical considerations (e.g., fusion of information, assessment of confidence, bias) that must be further explored. During the labeling process, it is simple to ask raters to self-assess the confidence of their labels, but this is rarely done and had not previously been studied quantitatively. Herein, the authors explore the utility of self-assessment in relation to automated assessment of rater performance in the context of statistical fusion.
Methods: The authors conducted a study of 66 volumes manually labeled by 75 minimally trained human raters recruited from the university undergraduate population. Raters received 15 min of training, during which they were shown examples of correct segmentations and given a demonstration of the online segmentation tool. The volumes were labeled slice-by-slice in 2D, with slices presented unordered. Raters produced a self-assessed quality metric for each slice by marking a confidence bar superimposed on the slice. Volumes produced by both voting and statistical fusion algorithms were compared against a set of expert segmentations of the same volumes.
Results: Labels for 8825 distinct slices were obtained. Simple majority voting resulted in statistically poorer performance than voting weighted by self-assessed confidence. Statistical fusion resulted in performance statistically indistinguishable from that of self-assessment-weighted voting. The authors developed a new theoretical basis for using self-assessed performance within the statistical fusion framework and demonstrated that combining the two sources of information (statistical assessment and self-assessment) yielded a statistically significant improvement over either method considered separately.
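To make the comparison above concrete, the following sketch contrasts simple majority voting with voting weighted by each rater's self-assessed confidence for a single slice. It is an illustration of the general technique, not the authors' implementation; the array shapes, the [0, 1] confidence scale, and all function names are assumptions.

```python
# Illustrative sketch (not the published implementation): fusing binary
# slice labels from multiple raters by simple majority voting versus
# voting weighted by each rater's self-assessed confidence.
import numpy as np

def majority_vote(labels):
    """labels: (n_raters, H, W) binary masks for one slice."""
    return (labels.mean(axis=0) > 0.5).astype(np.uint8)

def confidence_weighted_vote(labels, confidences):
    """confidences: (n_raters,) self-assessed quality in [0, 1],
    e.g., read off the confidence bar each rater marked on the slice."""
    w = np.asarray(confidences, dtype=float)
    w = w / w.sum()  # normalize so the weights sum to one
    weighted = np.tensordot(w, labels.astype(float), axes=1)  # (H, W)
    return (weighted > 0.5).astype(np.uint8)

# Example: three raters labeling one 4x4 slice
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=(3, 4, 4))
fused_mv = majority_vote(labels)
fused_wv = confidence_weighted_vote(labels, [0.9, 0.5, 0.4])
```

In this toy form, a rater who marked high confidence simply contributes more to each voxel's vote; with equal confidences, the weighted vote reduces to the majority vote.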
Conclusions: The authors present the first systematic characterization of self-assessed performance in manual labeling. They demonstrate that self-assessment and statistical fusion yield similar, but complementary, benefits for label fusion. Finally, they present a new theoretical basis for combining self-assessments with statistical label fusion.
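As a rough illustration of how self-assessments can enter a statistical fusion framework, the sketch below seeds a STAPLE-style expectation-maximization loop with self-assessed confidences as the initial rater performance estimates. This is a generic binary-label EM fusion written under stated assumptions, not the authors' published estimator; the initialization scheme, parameter names, and iteration count are all assumptions.

```python
# Hedged sketch: STAPLE-style EM fusion of binary labels, initialized
# with each rater's self-assessed confidence (an assumption, used here
# as the starting estimate of that rater's sensitivity and specificity).
import numpy as np

def staple_binary(labels, self_conf, n_iter=20, prior=0.5):
    """labels: (n_raters, n_voxels) binary decisions.
    self_conf: (n_raters,) self-assessed confidence in [0, 1]."""
    d = labels.astype(float)
    p = np.clip(np.asarray(self_conf, float), 0.01, 0.99).copy()  # sensitivity
    q = p.copy()                                                  # specificity
    for _ in range(n_iter):
        # E-step: posterior probability that each voxel's true label is 1
        a = prior * np.prod(np.where(d == 1, p[:, None], 1 - p[:, None]), axis=0)
        b = (1 - prior) * np.prod(np.where(d == 0, q[:, None], 1 - q[:, None]), axis=0)
        w = a / (a + b + 1e-12)
        # M-step: re-estimate each rater's performance from the posterior
        p = (d * w).sum(axis=1) / (w.sum() + 1e-12)
        q = ((1 - d) * (1 - w)).sum(axis=1) / ((1 - w).sum() + 1e-12)
    return (w > 0.5).astype(np.uint8), p, q

# Example: three raters, 100 voxels, confidences seeding the EM loop
rng = np.random.default_rng(1)
d = rng.integers(0, 2, size=(3, 100))
fused, sens, spec = staple_binary(d, [0.9, 0.5, 0.4])
```

The design choice illustrated here is only that self-assessment supplies the starting point for the performance parameters; the EM iterations then refine them statistically, so the two sources of information are combined rather than used separately.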