Computational Radiology Laboratory, Department of Radiology, Children's Hospital, Boston, MA 02115, USA.
IEEE Trans Med Imaging. 2010 Mar;29(3):771-80. doi: 10.1109/TMI.2009.2036011.
The evaluation of the quality of segmentations of an image, and the assessment of intra- and inter-expert variability in segmentation performance, have long been recognized as difficult tasks. For a segmentation validation task, it may be effective to compare the results of an automatic segmentation algorithm to multiple expert segmentations. Recently, an expectation-maximization (EM) algorithm for simultaneous truth and performance level estimation (STAPLE) was developed to this end, to compute both an estimate of the reference standard segmentation and performance parameters from a set of segmentations of an image. The performance is characterized by the rate of detection of each segmentation label by each expert in comparison to the estimated reference standard. This previous work provides estimates of performance parameters, but does not provide any information regarding the uncertainty of the estimated values. An estimate of this inferential uncertainty, if available, would allow the estimation of confidence intervals for the values of the parameters. This would facilitate the interpretation of the performance of segmentation generators and help determine whether the data size and the number of segmentations obtained are sufficient to characterize the performance parameters precisely. We present a new algorithm to estimate the inferential uncertainty of the performance parameters for binary and multi-category segmentations. It is derived for the special case of the STAPLE algorithm, based on established theory for general-purpose covariance matrix estimation for EM algorithms. The bounds on the performance parameters are estimated by computing the observed information matrix. We use this algorithm to study the bounds on performance parameter estimates from simulated images with specified performance parameters, and from interactive segmentations of neonatal brain MRIs.
We demonstrate that confidence intervals for expert segmentation performance parameters can be estimated with our algorithm. We investigate the influence of the number of experts and of the segmented data size on these bounds, showing that it is possible to determine the number of image segmentations and the size of images necessary to achieve a chosen level of accuracy in segmentation performance assessment.
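The idea of bounding performance parameters through the observed information matrix can be illustrated with a minimal sketch. It assumes binary labels, a foreground prior held fixed at the mean of the rater decisions, and a finite-difference Hessian of the observed-data log-likelihood in place of the analytic observed information matrix derived in the paper. All function names, the simulated rater parameters, and the Wald-type 95% intervals are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def simulate_raters(truth, p, q, rng):
    """Draw binary rater decisions from a binary truth (hypothetical helper)."""
    p, q = np.asarray(p), np.asarray(q)
    # A foreground voxel is marked 1 with prob. p_j, a background voxel with prob. 1 - q_j.
    prob_one = np.where(truth[:, None] == 1, p, 1 - q)
    return (rng.random(prob_one.shape) < prob_one).astype(float)

def class_likelihoods(theta, D, gamma):
    """Per-voxel joint likelihood of the decisions with truth = 1 (a) and truth = 0 (b)."""
    J = D.shape[1]
    p, q = theta[:J], theta[J:]          # sensitivities, specificities
    a = gamma * np.prod(np.where(D == 1, p, 1 - p), axis=1)
    b = (1 - gamma) * np.prod(np.where(D == 1, 1 - q, q), axis=1)
    return a, b

def staple_binary(D, gamma=None, max_iter=500, tol=1e-10):
    """Binary STAPLE-style EM; returns theta = [p_1..p_J, q_1..q_J] and the fixed prior."""
    J = D.shape[1]
    gamma = D.mean() if gamma is None else gamma   # assumption: fixed global prior
    theta = np.full(2 * J, 0.9)
    for _ in range(max_iter):
        a, b = class_likelihoods(theta, D, gamma)
        W = a / (a + b)                            # E-step: P(truth = 1 | decisions)
        p = W @ D / W.sum()                        # M-step: weighted detection rates
        q = (1 - W) @ (1 - D) / (1 - W).sum()
        new = np.concatenate([p, q])
        converged = np.abs(new - theta).max() < tol
        theta = new
        if converged:
            break
    return theta, gamma

def log_likelihood(theta, D, gamma):
    """Observed-data log-likelihood (the hidden truth is marginalized out)."""
    a, b = class_likelihoods(theta, D, gamma)
    return np.log(a + b).sum()

def observed_information(theta, D, gamma, h=1e-4):
    """Negative Hessian of the log-likelihood by central finite differences."""
    k = theta.size
    E = h * np.eye(k)
    f = lambda t: log_likelihood(t, D, gamma)
    H = np.empty((k, k))
    for i in range(k):
        for j in range(i, k):
            H[i, j] = H[j, i] = (
                f(theta + E[i] + E[j]) - f(theta + E[i] - E[j])
                - f(theta - E[i] + E[j]) + f(theta - E[i] - E[j])
            ) / (4 * h * h)
    return -H

# Simulated example: 4 raters, 20000 voxels, parameter values chosen for illustration.
rng = np.random.default_rng(0)
true_p = np.array([0.95, 0.90, 0.85, 0.80])
true_q = np.array([0.95, 0.90, 0.90, 0.85])
truth = (rng.random(20000) < 0.3).astype(int)
D = simulate_raters(truth, true_p, true_q, rng)

theta_hat, gamma = staple_binary(D)
cov = np.linalg.inv(observed_information(theta_hat, D, gamma))
se = np.sqrt(np.diag(cov))
ci_low, ci_high = theta_hat - 1.96 * se, theta_hat + 1.96 * se  # Wald-type 95% intervals

labels = [f"p{j + 1}" for j in range(4)] + [f"q{j + 1}" for j in range(4)]
for name, est, lo, hi in zip(labels, theta_hat, ci_low, ci_high):
    print(f"{name}: {est:.4f}  [{lo:.4f}, {hi:.4f}]")
```

Rerunning the sketch with fewer voxels or fewer raters widens the printed intervals, which is the kind of dependence on data size and number of segmentations that the abstract describes.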