Kallergi M, Carney G M, Gaviria J
Department of Radiology, University of South Florida, and H. Lee Moffitt Cancer Center & Research Institute, Tampa 33612, USA.
Med Phys. 1999 Feb;26(2):267-75. doi: 10.1118/1.598514.
The initial and relative evaluation of computer methodologies developed for assisting diagnosis in mammography is usually done by comparing the computer output to ground truth data provided by experts and/or biopsy. Reported studies, however, give little information on how the performance indices of computer assisted diagnosis (CAD) algorithms are determined in this initial stage of evaluation. Several strategies exist in the estimation of the true positive (TP) and false positive (FP) rates with respect to ground truth. Adopting one strategy over another yields different performance rates that can be over- or underestimates of the true performance. Furthermore, the estimation of pairs of TP and FP rates gives a partial picture of the performance of an algorithm. It is shown in this work that new performance indices are needed to fully describe the degree of detection (part or whole) and the type of detection (single calcification, cluster of calcifications, mass, or artifact). Several evaluation strategies were tested. The one that yielded the most realistic performances included the following criteria: The detected area should be at least 50% of the true area and no more than four times the true area in order to be considered TP. At least three true calcifications should be detected to within 1 cm² with nearest neighbor distances of less than √2 cm for a cluster to be considered TP. Separate detection measures should be established and used for artifacts and naturally occurring structures to maximize the benefits of the evaluation. Finally, it is critical that CAD investigators provide information on the tested image set as well as the criteria used for the evaluation of the algorithms to allow comparisons and better understanding of their methodologies.
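As an illustration only, the two quantitative TP criteria stated in the abstract might be sketched as below. This is not the authors' implementation. The function names, the comparison of areas as plain scalars (with no overlap check), the axis-aligned 1 cm × 1 cm search window used to represent "within 1 cm²", and the convention that calcification centers are given as (x, y) coordinates in centimeters are all assumptions made for this sketch; the abstract does not specify them.

```python
import math
from itertools import product


def area_is_tp(detected_area, true_area, min_frac=0.5, max_ratio=4.0):
    """Area criterion: detected area is at least 50% of the true area
    and no more than four times the true area.
    Assumption: both areas are scalars in the same units."""
    if true_area <= 0:
        raise ValueError("true_area must be positive")
    return min_frac * true_area <= detected_area <= max_ratio * true_area


def cluster_is_tp(points_cm, min_count=3, window_cm=1.0,
                  max_nn_cm=math.sqrt(2)):
    """Cluster criterion: at least three detected true calcifications
    fall within 1 cm^2, each with a nearest neighbor closer than sqrt(2) cm.
    Assumption: the 1 cm^2 region is an axis-aligned square window."""
    pts = list(points_cm)
    if len(pts) < min_count:
        return False
    xs = {p[0] for p in pts}
    ys = {p[1] for p in pts}
    # Any axis-aligned window can be slid until its left and bottom
    # edges touch a point, so only those corner positions are tried.
    for x0, y0 in product(xs, ys):
        inside = [p for p in pts
                  if x0 <= p[0] <= x0 + window_cm
                  and y0 <= p[1] <= y0 + window_cm]
        if len(inside) < min_count:
            continue
        # Nearest-neighbor distance of each point within the window.
        nn_ok = all(
            min(math.dist(p, q) for j, q in enumerate(inside) if j != i)
            < max_nn_cm
            for i, p in enumerate(inside)
        )
        if nn_ok:
            return True
    return False
```

For example, under these assumptions `area_is_tp(0.6, 1.0)` accepts a detection covering 60% of the true area, while `cluster_is_tp([(0, 0), (0.3, 0.2), (2.0, 2.0)])` rejects a candidate in which only two of the three calcifications fit inside a single 1 cm × 1 cm window.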