Strijbis Victor I J, Gurney-Champion O J, Slotman Berend J, Verbakel Wilko F A R
Amsterdam UMC location Vrije Universiteit Amsterdam, Department of Radiation Oncology, De Boelelaan 1117, Amsterdam, the Netherlands.
Cancer Center Amsterdam, Cancer Treatment and Quality of Life, Amsterdam, the Netherlands.
Phys Imaging Radiat Oncol. 2024 Dec 4;32:100684. doi: 10.1016/j.phro.2024.100684. eCollection 2024 Oct.
Imperfections (noise) in radiotherapy organ-at-risk segmentations arise naturally from differences in specialist experience and image quality. Training on clinical contours can therefore lead to sub-optimal convolutional neural network (CNN) training and performance, but manual curation is costly. We investigate the impact of simulated and clinical segmentation noise on CNN parotid gland (PG) segmentation performance and provide proof-of-concept for an easily implemented auto-curation countermeasure.
The impact of segmentation imperfections was investigated by simulating noise in clean, high-quality segmentations. Curation efficacy was tested by removing the cases with the lowest Dice similarity coefficient (DSC) early during CNN training, in both simulated (5-fold) and clinical (10-fold) settings, using our full radiotherapy clinical cohort (RTCC; N = 1750 individual PGs). Statistical significance was assessed using Bonferroni-corrected Wilcoxon signed-rank tests. Curation efficacy was evaluated using DSC and mean surface distance (MSD) on in-distribution and out-of-distribution data, and by visual inspection.
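The curation step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `dice` and `curate` are hypothetical, the abstract does not specify how early in training the scores are computed, and treating two empty masks as perfect agreement is a convention chosen here.

```python
import numpy as np


def dice(pred, ref):
    """Dice similarity coefficient between two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    ref = np.asarray(ref, dtype=bool)
    denom = pred.sum() + ref.sum()
    if denom == 0:
        # Assumption: two empty masks count as perfect agreement.
        return 1.0
    return 2.0 * np.logical_and(pred, ref).sum() / denom


def curate(case_ids, dsc_scores, fraction):
    """Return the case ids kept after dropping the lowest-scoring fraction.

    dsc_scores[i] is assumed to be the DSC between the partially trained
    model's prediction and the clinical contour for case_ids[i].
    """
    n_drop = int(np.floor(fraction * len(case_ids)))
    order = np.argsort(dsc_scores, kind="stable")  # ascending DSC
    dropped = set(order[:n_drop].tolist())
    return [cid for i, cid in enumerate(case_ids) if i not in dropped]


# Hypothetical scores for five training cases after a few early epochs.
ids = ["a", "b", "c", "d", "e"]
scores = [0.90, 0.20, 0.80, 0.85, 0.70]
kept = curate(ids, scores, fraction=0.2)  # drops the single worst case, "b"
```

Training would then continue on the kept cases only. The fraction is the main design choice; the abstract reports results for curation fractions up to 30 %.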
The curation step correctly removed a median (range) of 98 % (90-100 %) of corrupted segmentations and restored the majority (1.2 %/1.3 %) of the DSC lost from training with 30 % corrupted segmentations. This effect was masked when using typical (non-curated) validation data. In the RTCC, 20 % curation improved model generalizability, with significantly better out-of-distribution DSC and MSD (p < 1.0e-12 and p < 1.0e-6, respectively). Improved consistency was observed particularly in the medial and anterior lobes.
Up to 30 % case removal, the curation benefit outweighed the training variance lost through curation. Considering the notable ease of implementation, the high sensitivity in simulations, and the performance gains already seen at lower curation fractions, we recommend 15 % curation of training cases as a conservative middle ground when training CNNs on clinical PG contours.