Eelbode Tom, Bertels Jeroen, Berman Maxim, Vandermeulen Dirk, Maes Frederik, Bisschops Raf, Blaschko Matthew B
IEEE Trans Med Imaging. 2020 Nov;39(11):3679-3690. doi: 10.1109/TMI.2020.3002417. Epub 2020 Oct 28.
In many medical imaging and classical computer vision tasks, the Dice score and Jaccard index are used to evaluate segmentation performance. Despite the existence and great empirical success of metric-sensitive losses, i.e., relaxations of these metrics such as soft Dice, soft Jaccard and Lovász-Softmax, many researchers still use per-pixel losses, such as (weighted) cross-entropy, to train CNNs for segmentation. As a result, the target metric is in many cases not directly optimized. From a theoretical perspective, we investigate the relations within the group of metric-sensitive loss functions and question whether an optimal weighting scheme exists for weighted cross-entropy to optimize the Dice score and Jaccard index at test time. We find that the Dice score and Jaccard index approximate each other both relatively and absolutely, but we find no such approximation for a weighted Hamming similarity. For the Tversky loss, the approximation gets monotonically worse as the weights deviate from the trivial setting in which soft Tversky equals soft Dice. We verify these results empirically in an extensive validation on six medical segmentation tasks and confirm that metric-sensitive losses are superior to cross-entropy-based loss functions when performance is evaluated with the Dice score or Jaccard index. This also holds in a multi-class setting and across different object sizes and foreground/background ratios. These results encourage wider adoption of metric-sensitive loss functions for medical segmentation tasks where the performance measure of interest is the Dice score or Jaccard index.
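The relations among the metric-sensitive losses named in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It assumes the common parametrization of the soft Tversky index, in which alpha weights soft false positives and beta weights soft false negatives, so that alpha = beta = 0.5 gives soft Dice and alpha = beta = 1 gives soft Jaccard. The function names, the `eps` smoothing term and the toy inputs are hypothetical choices made for illustration. The corresponding loss would be one minus each index.

```python
def soft_tversky(pred, target, alpha, beta, eps=1e-7):
    """Soft Tversky index for flat sequences of probabilities and {0, 1} labels.

    alpha weights soft false positives, beta weights soft false negatives.
    eps is an assumed smoothing term that avoids division by zero.
    """
    tp = sum(p * t for p, t in zip(pred, target))        # soft true positives
    fp = sum(p * (1 - t) for p, t in zip(pred, target))  # soft false positives
    fn = sum((1 - p) * t for p, t in zip(pred, target))  # soft false negatives
    return (tp + eps) / (tp + alpha * fp + beta * fn + eps)


def soft_dice(pred, target, eps=1e-7):
    # Tversky with alpha = beta = 0.5 reduces to 2*TP / (2*TP + FP + FN).
    return soft_tversky(pred, target, 0.5, 0.5, eps)


def soft_jaccard(pred, target, eps=1e-7):
    # Tversky with alpha = beta = 1 reduces to TP / (TP + FP + FN).
    return soft_tversky(pred, target, 1.0, 1.0, eps)


def jaccard_from_dice(d):
    # Exact identity between the two indices: J = D / (2 - D).
    return d / (2.0 - d)


# Toy example (hypothetical values): four pixels, two of them foreground.
pred = [0.9, 0.6, 0.3, 0.1]
target = [1, 1, 0, 0]

d = soft_dice(pred, target)
j = soft_jaccard(pred, target)
print("soft Dice:", d)
print("soft Jaccard:", j)
print("Jaccard recovered from Dice:", jaccard_from_dice(d))
print("Tversky (alpha=0.3, beta=0.7):", soft_tversky(pred, target, 0.3, 0.7))
```

Because the Jaccard index is a monotone function of the Dice score, the two indices rank any pair of segmentations of a single image in the same order. Moving alpha and beta away from 0.5 shifts the Tversky value away from the Dice value.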