He Da, Udupa Jayaram K, Tong Yubing, Torigian Drew A
Medical Image Processing Group, 602 Goddard building, 3710 Hamilton Walk, Department of Radiology, University of Pennsylvania, Philadelphia, PA 19104, United States.
University of Michigan-Shanghai Jiao Tong University Joint Institute, Shanghai Jiao Tong University, Shanghai 200240, China.
medRxiv. 2024 Jun 13:2024.06.12.24308779. doi: 10.1101/2024.06.12.24308779.
Auto-segmentation is one of the critical and foundational steps in medical image analysis. The quality of auto-segmentation techniques influences the efficiency of precision radiology and radiation oncology, since high-quality auto-segmentations usually require only limited manual correction. Segmentation metrics are needed to evaluate auto-segmentation results and to guide the development of auto-segmentation techniques. Widely used segmentation metrics typically compare the auto-segmentation with the ground truth in terms of the overlapping area (e.g., Dice Coefficient (DC)) or the distance between boundaries (e.g., Hausdorff Distance (HD)). However, these metrics may not reliably indicate the manual mending effort required when auto-segmentation results are reviewed in clinical practice. In this article, we study different segmentation metrics to explore an appropriate way of evaluating auto-segmentations in line with clinical demands. The time experts take to correct auto-segmentations (the mending time) is recorded as an indicator of the required mending effort. Five well-defined metrics are discussed in the correlation analysis and regression experiments: the overlapping area-based metric DC, the segmentation boundary distance-based metric HD, the segmentation boundary length-based metrics surface DC (surDC) and added path length (APL), and a newly proposed hybrid metric, the Mendability Index (MI). In addition to these explicitly defined metrics, we also preliminarily explore the feasibility of using deep learning models, which take segmentation masks and the original images as input, to predict the mending effort. Experiments are conducted using datasets of 7 objects from three different institutions, which contain the original computed tomography (CT) images, the ground truth segmentations, the auto-segmentations, the corrected segmentations, and the recorded mending time.
According to the correlation analysis and regression experiments for the five well-defined metrics, a variant of MI performs best at indicating the mending effort for sparse objects, while a variant of HD works best for assessing the mending effort for non-sparse objects. Moreover, the deep learning models could predict the effort required to mend auto-segmentations well, even without ground truth segmentations, demonstrating the potential of a novel and easy way to evaluate and improve auto-segmentation techniques.
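As an illustration of the two conventional metrics named above, the following minimal sketch computes DC and HD for two toy 2D masks represented as sets of pixel coordinates. It is not the authors' implementation: the function names and the toy masks are hypothetical, and HD is computed over all foreground pixels rather than over extracted boundaries, which is a simplifying assumption.

```python
import math


def dice_coefficient(auto, truth):
    """Dice coefficient between two sets of (row, col) pixel coordinates."""
    if not auto and not truth:
        # Convention assumed here: two empty masks agree perfectly.
        return 1.0
    return 2 * len(auto & truth) / (len(auto) + len(truth))


def hausdorff_distance(auto, truth):
    """Symmetric Hausdorff distance between two non-empty point sets.

    Simplification: uses every foreground pixel, not only boundary pixels.
    """
    def directed(a, b):
        # Largest distance from a point in a to its nearest point in b.
        return max(min(math.dist(p, q) for q in b) for p in a)

    return max(directed(auto, truth), directed(truth, auto))


# Hypothetical toy masks: a 4x4 square and the same square shifted one column.
truth = {(r, c) for r in range(2, 6) for c in range(2, 6)}
auto = {(r, c) for r in range(2, 6) for c in range(3, 7)}

dc = dice_coefficient(auto, truth)
hd = hausdorff_distance(auto, truth)
print(f"DC = {dc:.2f}, HD = {hd:.2f} pixels")
```

Neither value says how long an expert would need to repair the shifted mask, which is the gap between conventional metrics and mending effort that the study examines.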