Zenk Maximilian, Zimmerer David, Isensee Fabian, Traub Jeremias, Norajitra Tobias, Jäger Paul F, Maier-Hein Klaus
German Cancer Research Center (DKFZ) Heidelberg, Division of Medical Image Computing, Germany; Medical Faculty Heidelberg, Heidelberg University, Heidelberg, Germany.
German Cancer Research Center (DKFZ) Heidelberg, Division of Medical Image Computing, Germany.
Med Image Anal. 2025 Apr;101:103392. doi: 10.1016/j.media.2024.103392. Epub 2024 Nov 30.
Semantic segmentation is an essential component of medical image analysis research, with recent deep learning algorithms offering out-of-the-box applicability across diverse datasets. Despite these advancements, segmentation failures remain a significant concern for real-world clinical applications, necessitating reliable detection mechanisms. This paper introduces a comprehensive benchmarking framework aimed at evaluating failure detection methodologies within medical image segmentation. Through our analysis, we identify the strengths and limitations of current failure detection metrics, advocating for the risk-coverage analysis as a holistic evaluation approach. Utilizing a collective dataset comprising five public 3D medical image collections, we assess the efficacy of various failure detection strategies under realistic test-time distribution shifts. Our findings highlight the importance of pixel confidence aggregation and we observe superior performance of the pairwise Dice score (Roy et al., 2019) between ensemble predictions, positioning it as a simple and robust baseline for failure detection in medical image segmentation. To promote ongoing research, we make the benchmarking framework available to the community.
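The two ideas the abstract highlights, pairwise Dice between ensemble predictions as a confidence score and risk-coverage analysis as the evaluation, can be illustrated with a minimal sketch. This is not the authors' benchmarking framework: the function names are hypothetical, the masks are assumed to be binary (the benchmark itself covers 3D and typically multi-class data), the risk is assumed to be one minus the Dice score against a reference mask, and the area under the risk-coverage curve is approximated as the plain mean of the selective risks.

```python
import itertools

import numpy as np


def dice(a, b):
    """Dice overlap between two binary masks; defined as 1.0 when both are empty."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0
    return float(2.0 * np.logical_and(a, b).sum() / denom)


def pairwise_dice_confidence(member_masks):
    """Image-level confidence: mean Dice over all unordered pairs of ensemble members."""
    member_masks = list(member_masks)
    if len(member_masks) < 2:
        raise ValueError("need at least two ensemble members")
    pairs = itertools.combinations(member_masks, 2)
    return float(np.mean([dice(a, b) for a, b in pairs]))


def risk_coverage_curve(confidences, risks):
    """Selective risk at each coverage level, keeping the most confident cases first."""
    order = np.argsort(-np.asarray(confidences, dtype=float), kind="stable")
    sorted_risks = np.asarray(risks, dtype=float)[order]
    kept = np.arange(1, len(sorted_risks) + 1)
    coverage = kept / len(sorted_risks)
    selective_risk = np.cumsum(sorted_risks) / kept
    return coverage, selective_risk


def aurc(confidences, risks):
    """Area under the risk-coverage curve, approximated as the mean selective risk."""
    _, selective_risk = risk_coverage_curve(confidences, risks)
    return float(selective_risk.mean())
```

In use, each test image would get one confidence from `pairwise_dice_confidence` over its ensemble members' masks and one risk such as `1 - dice(prediction, reference)`; a lower `aurc` then indicates that low-confidence cases coincide more closely with poor segmentations.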