Warfield Simon K, Zou Kelly H, Wells William M
Harvard Medical School and the Department of Radiology of Brigham and Women's Hospital, 75 Francis St, Boston, MA 02115, USA.
IEEE Trans Med Imaging. 2004 Jul;23(7):903-21. doi: 10.1109/TMI.2004.828354.
Characterizing the performance of image segmentation approaches has been a persistent challenge. Performance analysis is important since segmentation algorithms often have limited accuracy and precision. Interactive drawing of the desired segmentation by human raters has often been the only acceptable approach, and yet suffers from intra-rater and inter-rater variability. Automated algorithms have been sought in order to remove the variability introduced by raters, but such algorithms must be assessed to ensure they are suitable for the task. The performance of raters (human or algorithmic) generating segmentations of medical images has been difficult to quantify because of the difficulty of obtaining or estimating a known true segmentation for clinical data. Although physical and digital phantoms can be constructed for which ground truth is known or readily estimated, such phantoms do not fully reflect clinical images due to the difficulty of constructing phantoms which reproduce the full range of imaging characteristics and normal and pathological anatomical variability observed in clinical data. Comparison to a collection of segmentations by raters is an attractive alternative since it can be carried out directly on the relevant clinical imaging data. However, the most appropriate measure or set of measures with which to compare such segmentations has not been clarified and several measures are used in practice. We present here an expectation-maximization algorithm for simultaneous truth and performance level estimation (STAPLE). The algorithm considers a collection of segmentations and computes a probabilistic estimate of the true segmentation and a measure of the performance level represented by each segmentation. The source of each segmentation in the collection may be an appropriately trained human rater or raters, or may be an automated segmentation algorithm. 
The probabilistic estimate of the true segmentation is formed by estimating an optimal combination of the segmentations, weighting each segmentation according to its estimated performance level, and incorporating a prior model for the spatial distribution of the structures being segmented as well as spatial homogeneity constraints. STAPLE is straightforward to apply to clinical imaging data. It readily enables assessment of the performance of an automated image segmentation algorithm, and it enables direct comparison of human rater and algorithm performance.
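The estimation loop described above can be illustrated with a minimal sketch for the binary (foreground/background) case. The abstract does not give the update equations, so the following rests on stated assumptions: each rater's performance level is summarized by a sensitivity `p` and a specificity `q`, the prior is a single global foreground probability (by default the mean of all rater decisions), and the spatial prior and spatial homogeneity constraints are omitted entirely. The function name `staple_binary` and all of its parameters are hypothetical and chosen only for illustration.

```python
import numpy as np


def staple_binary(decisions, prior=None, p_init=0.9, q_init=0.9,
                  max_iter=100, tol=1e-6):
    """Illustrative binary EM sketch (not the authors' implementation).

    decisions: array of shape (n_voxels, n_raters) with 0/1 entries.
    Returns (W, p, q): per-voxel foreground probability, and per-rater
    sensitivity and specificity estimates.
    """
    D = np.asarray(decisions, dtype=float)
    n_voxels, n_raters = D.shape
    # Assumption: one global prior probability of foreground for every voxel.
    gamma = D.mean() if prior is None else float(prior)
    p = np.full(n_raters, p_init)  # sensitivity per rater
    q = np.full(n_raters, q_init)  # specificity per rater
    eps = 1e-12                    # guards against log(0) and division by 0
    W = np.full(n_voxels, gamma)

    for _ in range(max_iter):
        # E-step: posterior probability that each voxel is truly foreground,
        # computed in log space for numerical stability.
        log_a = (np.log(gamma + eps)
                 + D @ np.log(p + eps)
                 + (1 - D) @ np.log(1 - p + eps))
        log_b = (np.log(1 - gamma + eps)
                 + (1 - D) @ np.log(q + eps)
                 + D @ np.log(1 - q + eps))
        W = 1.0 / (1.0 + np.exp(np.clip(log_b - log_a, -500, 500)))

        # M-step: re-estimate each rater's performance, weighting every
        # voxel by the current estimate of the true segmentation.
        p_new = (W @ D) / (W.sum() + eps)
        q_new = ((1 - W) @ (1 - D)) / ((1 - W).sum() + eps)

        change = max(np.abs(p_new - p).max(), np.abs(q_new - q).max())
        p, q = p_new, q_new
        if change < tol:
            break

    return W, p, q
```

Flattening each rater's binary mask into one column of `decisions` and calling `W, p, q = staple_binary(decisions)` yields a soft estimate `W` of the true segmentation, which can be thresholded at 0.5 for a hard consensus, together with one sensitivity and one specificity per rater. Because this sketch leaves out the spatial terms, it should be read as an illustration of the expectation-maximization structure rather than a substitute for the full method.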