Medical Artificial Intelligence and Automation Laboratory, Department of Radiation Oncology, University of Texas Southwestern Medical Center, Dallas, TX 75390, United States of America.
Phys Med Biol. 2020 Aug 27;65(17):175007. doi: 10.1088/1361-6560/ab99e5.
Partly owing to the use of exhaustively annotated data, deep networks have achieved impressive performance in medical image segmentation. Medical imaging data paired with noisy annotations are, however, ubiquitous, yet little is known about the effect of noisy annotations on deep-learning-based medical image segmentation. We studied this effect in the context of mandible segmentation from CT images. First, 202 images of head and neck cancer patients were collected from our clinical database, in which the organs-at-risk had been annotated by one of twelve planning dosimetrists. The mandibles were roughly annotated as planning avoidance structures. Then, the mandible labels were checked and corrected by a head and neck specialist to obtain the reference standard. Finally, deep networks were trained and tested for mandible segmentation while varying the ratio of noisy labels in the training set. The trained models were further tested on two other public datasets. Experimental results indicated that networks trained with noisy labels produced worse segmentations than those trained with the reference standard, and, in general, fewer noisy labels led to better performance. When 20% or fewer noisy cases were used for training, no significant difference was found between the segmentation results of models trained with noisy annotations and those trained with reference annotations. Cross-dataset validation confirmed that models trained with noisy data achieved performance competitive with that of models trained with the reference standard. This study suggests that the network involved is, to some extent, robust to noisy annotations in mandible segmentation from CT images. It also highlights the importance of labeling quality in deep learning. In future work, extra attention should be paid to how a small number of reference-standard samples can be used to improve the performance of deep learning with noisy annotations.