Impact of annotation imperfections and auto-curation for deep learning-based organ-at-risk segmentation.

Authors

Strijbis Victor I J, Gurney-Champion O J, Slotman Berend J, Verbakel Wilko F A R

Affiliations

Amsterdam UMC location Vrije Universiteit Amsterdam, Department of Radiation Oncology, De Boelelaan 1117, Amsterdam, the Netherlands.

Cancer Center Amsterdam, Cancer Treatment and Quality of Life, Amsterdam, the Netherlands.

Publication information

Phys Imaging Radiat Oncol. 2024 Dec 4;32:100684. doi: 10.1016/j.phro.2024.100684. eCollection 2024 Oct.

DOI:10.1016/j.phro.2024.100684

PMID:39720784

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11667007/

Abstract

BACKGROUND AND PURPOSE

Segmentation imperfections (noise) in radiotherapy organ-at-risk segmentation arise naturally from differences in specialist experience and image quality. Training on clinical contours can therefore lead to sub-optimal convolutional neural network (CNN) training and performance, but manual curation is costly. We assess the impact of simulated and clinical segmentation noise on CNN parotid gland (PG) segmentation performance and provide a proof of concept for an easily implemented auto-curation countermeasure.

METHODS AND MATERIALS

The impact of segmentation imperfections was investigated by simulating noise in clean, high-quality segmentations. Curation efficacy was tested by removing the cases with the lowest Dice similarity coefficient (DSC) early during CNN training, in both simulated (5-fold) and clinical (10-fold) settings, using our full radiotherapy clinical cohort (RTCC; N = 1750 individual PGs). Statistical significance was assessed using Bonferroni-corrected Wilcoxon signed-rank tests. Curation efficacy was evaluated using DSC and mean surface distance (MSD) on in-distribution and out-of-distribution data, and by visual inspection.
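The curation step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each training case is scored by the DSC between an early-epoch model prediction and that case's training contour (the abstract does not spell out the comparison), and the names `dice`, `curate`, and the case IDs are hypothetical. Masks are represented as sets of voxel indices so the example needs only the Python standard library.

```python
from typing import Dict, List, Set, Tuple

Voxel = Tuple[int, int, int]


def dice(pred: Set[Voxel], ref: Set[Voxel]) -> float:
    """Dice similarity coefficient of two voxel sets: 2|A∩B| / (|A| + |B|)."""
    total = len(pred) + len(ref)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * len(pred & ref) / total


def curate(case_dsc: Dict[str, float], fraction: float) -> List[str]:
    """Return the case IDs kept after dropping the lowest-scoring `fraction`."""
    if not 0.0 <= fraction < 1.0:
        raise ValueError("fraction must be in [0, 1)")
    n_drop = int(len(case_dsc) * fraction)       # number of cases removed, rounded down
    ranked = sorted(case_dsc, key=case_dsc.get)  # ascending DSC, worst first
    return sorted(ranked[n_drop:])               # keep the rest, in ID order


# Made-up per-case scores, standing in for DSC values measured early in training.
scores = {"pg_001": 0.86, "pg_002": 0.41, "pg_003": 0.83, "pg_004": 0.88, "pg_005": 0.79}
kept = curate(scores, fraction=0.2)  # drops the single lowest-scoring case
```

Training would then continue on the `kept` cases only. The curation fraction is the one free parameter in this sketch.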

RESULTS

The curation step correctly removed a median (range) of 98% (90-100%) of corrupted segmentations and restored the majority (1.2%/1.3%) of the DSC lost by training with 30% corrupted segmentations. This effect was masked when typical (non-curated) validation data were used. In RTCC, 20% curation improved model generalizability, with significantly better out-of-distribution DSC and MSD (p < 1.0e-12 and p < 1.0e-6, respectively). Improved consistency was observed particularly in the medial and anterior lobes.
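The p-values above come from Bonferroni-corrected Wilcoxon signed-rank tests. The sketch below shows how such a test works in principle, under stated simplifications: it uses the normal approximation, drops zero differences, and applies no tie or continuity correction. The function names and toy numbers are hypothetical, and the pairing of per-case metrics between two models is an assumption. In practice a library routine such as `scipy.stats.wilcoxon` would normally be used instead.

```python
import math
from typing import List, Sequence


def wilcoxon_signed_rank_p(x: Sequence[float], y: Sequence[float]) -> float:
    """Two-sided Wilcoxon signed-rank p-value via the normal approximation."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences are dropped
    n = len(diffs)
    if n == 0:
        return 1.0
    # Rank the absolute differences; tied values share their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)  # sum of positive ranks
    mean = n * (n + 1) / 4.0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w_plus - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))


def bonferroni(p_values: Sequence[float]) -> List[float]:
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Toy paired scores (made up): the same ten test cases under two models.
curated = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
baseline = [0] * 10
p_raw = wilcoxon_signed_rank_p(curated, baseline)
p_adj = bonferroni([p_raw, 0.04, 0.5])
```

The Bonferroni step is deliberately conservative: with m comparisons, each raw p-value must fall below the significance level divided by m to remain significant.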

CONCLUSIONS

For up to 30% case removal, the benefit of curation outweighed the training variance lost by removing cases. Given the ease of implementation, the high sensitivity in simulations, and the performance gains already seen at lower curation fractions, we recommend curating 15% of training cases, as a conservative middle ground, when training CNNs on clinical PG contours.
