Fitzpatrick J M, Hill D L, Shyr Y, West J, Studholme C, Maurer C R
Department of Computer Science, Vanderbilt University, Nashville, TN 37235, USA.
IEEE Trans Med Imaging. 1998 Aug;17(4):571-85. doi: 10.1109/42.730402.
In a previous study we demonstrated that automatic retrospective registration algorithms can frequently register magnetic resonance (MR) and computed tomography (CT) images of the brain with an accuracy of better than 2 mm, but in that same study we found that such algorithms sometimes fail, leading to errors of 6 mm or more. Before these algorithms can be used routinely in the clinic, methods must be provided for distinguishing between registration solutions that are clinically satisfactory and those that are not. One approach is to rely on a human observer to inspect the registration results and reject images that have been registered with insufficient accuracy. In this paper, we present a methodology for evaluating the efficacy of the visual assessment of registration accuracy. Since the clinical requirements for level of registration accuracy are likely to be application dependent, we have evaluated the accuracy of the observer's estimate relative to six thresholds: 1-6 mm. The performance of the observers was evaluated relative to the registration solution obtained using external fiducial markers that are screwed into the patient's skull and that are visible in both MR and CT images. This fiducial marker system provides the gold standard for our study. Its accuracy is shown to be approximately 0.5 mm. Two experienced, blinded observers viewed five pairs of clinical MR and CT brain images, each of which had been misregistered with respect to the gold standard solution. Fourteen misregistrations, with errors distributed approximately uniformly between 0 and 10 mm, were assessed for each image pair. For each misregistered image pair, each observer estimated the registration error (in millimeters) at each of five locations distributed around the head using each of three assessment methods.
These estimated errors were compared with the errors as measured by the gold standard to determine agreement relative to each of the six thresholds, where agreement means that the two errors lie on the same side of the threshold. The effect of error in the gold standard itself is taken into account in the analysis of the assessment methods. The results were analyzed by means of the Kappa statistic, the agreement rate, and the area under receiver-operating-characteristic (ROC) curves. No assessment performed well at 1 mm, but all methods performed well at 2 mm and higher. For these five thresholds, two methods agreed with the standard at least 80% of the time and exhibited mean ROC areas greater than 0.84. One of these same methods exhibited Kappa statistics that indicated good agreement relative to chance (Kappa > 0.6) between the pooled observers and the standard for these same five thresholds. Further analysis demonstrates that the results depend strongly on the choice of the distribution of misregistration errors presented to the observers.
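To make the agreement criterion concrete: an observer's estimated error and the gold-standard error are each dichotomized at a threshold (e.g., 2 mm), and the two agree when they fall on the same side of it. A minimal sketch of this computation follows, including the agreement rate and Cohen's kappa statistic; the error values below are hypothetical and are not data from the study.

```python
# Illustrative sketch (not the authors' code): dichotomize estimated and
# gold-standard registration errors at a threshold, then compute the
# agreement rate and Cohen's kappa. All numeric values are hypothetical.

def dichotomize(errors_mm, threshold_mm):
    """Label each case 1 if its error exceeds the threshold, else 0."""
    return [1 if e > threshold_mm else 0 for e in errors_mm]

def agreement_rate(a, b):
    """Fraction of cases where both labels lie on the same side."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(a)
    p_o = agreement_rate(a, b)          # observed agreement
    p_a1 = sum(a) / n                   # fraction labeled 1 by rater a
    p_b1 = sum(b) / n                   # fraction labeled 1 by rater b
    p_e = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)  # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Hypothetical observer estimates vs. gold-standard errors (mm)
observer = [0.8, 1.5, 3.2, 4.0, 6.5, 2.1, 0.5, 5.5]
gold = [0.6, 2.4, 2.9, 4.4, 7.0, 1.8, 0.9, 5.1]

threshold = 2.0  # mm, one of the study's six thresholds
obs_labels = dichotomize(observer, threshold)
gold_labels = dichotomize(gold, threshold)
print(agreement_rate(obs_labels, gold_labels))  # prints 0.75
print(cohens_kappa(obs_labels, gold_labels))
```

By the study's convention, a kappa above 0.6 would indicate good agreement relative to chance; sweeping the threshold from 1 to 6 mm repeats this computation at each cutoff.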