Danish Centre for Particle Therapy, Aarhus University Hospital, Palle Juul-Jensens Boulevard 25, 8200 Aarhus N, Denmark.
Department of Oncology, Aarhus University Hospital, Palle Juul-Jensens Boulevard 25, 8200 Aarhus N, Denmark.
Phys Med Biol. 2024 Aug 5;69(16). doi: 10.1088/1361-6560/ad682d.
Deep learning shows promise in autosegmentation of head and neck cancer (HNC) primary tumours (GTV-T) and nodal metastases (GTV-N). However, errors such as including non-tumour regions or missing nodal metastases still occur. Conventional methods often make overconfident predictions, compromising reliability. Incorporating uncertainty estimation, which provides calibrated confidence intervals, can address this issue. Our aim was to investigate the efficacy of various uncertainty estimation methods in improving segmentation reliability. We evaluated their confidence levels in voxel predictions and their ability to reveal potential segmentation errors.
We retrospectively collected data from 567 HNC patients with diverse cancer sites and multi-modality images (CT, PET, T1-, and T2-weighted MRI) along with their clinical GTV-T/N delineations. Using the nnUNet 3D segmentation pipeline, we compared seven uncertainty estimation methods, evaluating them based on segmentation accuracy (Dice similarity coefficient, DSC), confidence calibration (Expected Calibration Error, ECE), and their ability to reveal segmentation errors (Uncertainty-Error overlap using DSC, UE-DSC).
Evaluated on the hold-out test dataset (n = 97), the median DSC scores for GTV-T and GTV-N segmentation across all uncertainty estimation methods had a narrow range, from 0.73 to 0.76 and 0.78 to 0.80, respectively. In contrast, the median ECE exhibited a wider range, from 0.30 to 0.12 for GTV-T and 0.25 to 0.09 for GTV-N. Similarly, the median UE-DSC also ranged broadly, from 0.21 to 0.38 for GTV-T and 0.22 to 0.36 for GTV-N. PhiSeg, a probabilistic network method, consistently demonstrated the best performance in terms of ECE and UE-DSC.
Our study highlights the importance of uncertainty estimation in enhancing the reliability of deep learning for autosegmentation of HNC GTV. The results show that while segmentation accuracy can be similar across methods, their reliability, measured by calibration error and uncertainty-error overlap, varies significantly. Used with visualisation maps, these methods may effectively pinpoint uncertainties and potential errors at the voxel level.
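The three evaluation metrics named above (DSC, ECE and UE-DSC) can be illustrated with a minimal, dependency-free Python sketch on flattened voxel lists. This is not the authors' implementation. The abstract does not state the number of calibration bins, the uncertainty measure or the uncertainty threshold, so `n_bins=10`, the uncertainty definition `1 - max(p, 1 - p)` and `threshold=0.2` are illustrative assumptions, as are all function names.

```python
def dice(a, b):
    """Dice similarity coefficient between two equal-length binary masks."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    total = sum(1 for x in a if x) + sum(1 for y in b if y)
    # Convention (assumption): two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * inter / total


def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE for per-voxel foreground probabilities (binary case)."""
    # Each bin holds: voxel count, summed confidence, number of correct voxels.
    bins = [[0, 0.0, 0] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        pred = 1 if p >= 0.5 else 0
        conf = p if pred == 1 else 1.0 - p  # confidence in the predicted class
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx][0] += 1
        bins[idx][1] += conf
        bins[idx][2] += int(pred == y)
    n = len(probs)
    # Weighted mean gap between accuracy and mean confidence per bin.
    return sum(c / n * abs(k / c - s / c) for c, s, k in bins if c)


def uncertainty_error_dice(probs, labels, threshold=0.2):
    """Dice overlap between thresholded uncertainty voxels and error voxels."""
    errors = [int((p >= 0.5) != bool(y)) for p, y in zip(probs, labels)]
    # Assumed uncertainty measure: 1 - max(p, 1 - p), which lies in [0, 0.5].
    uncertain = [int(1.0 - max(p, 1.0 - p) > threshold) for p in probs]
    return dice(uncertain, errors)


# Toy example: six voxels with predicted foreground probabilities and labels.
probs = [0.95, 0.85, 0.65, 0.35, 0.05, 0.25]
labels = [1, 1, 0, 1, 0, 0]
preds = [int(p >= 0.5) for p in probs]

dsc = dice(preds, labels)
ece = expected_calibration_error(probs, labels)
ue_dsc = uncertainty_error_dice(probs, labels)
print(f"DSC={dsc:.3f}  ECE={ece:.3f}  UE-DSC={ue_dsc:.3f}")
```

In this toy case the two misclassified voxels are also the least confident ones, so the uncertainty map overlaps well with the errors even though the DSC is moderate. This mirrors the abstract's point that accuracy and reliability are separate properties. For real 3D volumes the same logic would normally be vectorised with an array library.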