Zhang Le, Tanno Ryutaro, Xu Moucheng, Huang Yawen, Bronik Kevin, Jin Chen, Jacob Joseph, Zheng Yefeng, Shao Ling, Ciccarelli Olga, Barkhof Frederik, Alexander Daniel C
Queen Square Institute of Neurology, Faculty of Brain Sciences, University College London, London, WC1B 5EH, United Kingdom.
Centre for Medical Image Computing, Department of Computer Science, University College London, London, WC1E 6BT, United Kingdom.
Pattern Recognit. 2023 Jun;138. doi: 10.1016/j.patcog.2023.109400.
Supervised machine learning methods have been widely developed for segmentation tasks in recent years. However, label quality has a strong impact on the predictive performance of these algorithms. This issue is particularly acute in the medical imaging domain, where both the cost of annotation and the inter-observer variability are high. In a typical label acquisition process, different human experts contribute estimates of the "actual" segmentation labels, each influenced by personal biases and competency levels. The performance of automatic segmentation algorithms is limited when these noisy labels are treated as the expert consensus label. In this work, we use two coupled CNNs to jointly learn, from noisy observations alone, the reliability of individual annotators and the expert consensus label distributions. The two are separated by maximally describing each annotator's "unreliable behavior" (we call it "maximally unreliable") while achieving high fidelity with the noisy training data. We first create a toy segmentation dataset using MNIST and investigate the properties of the proposed algorithm. We then demonstrate our method's efficacy on three public medical imaging segmentation datasets, using both simulated (where necessary) and real-world annotations: 1) ISBI2015 (multiple sclerosis lesions); 2) BraTS (brain tumors); 3) LIDC-IDRI (lung abnormalities). Finally, we create a real-world multiple sclerosis lesion dataset (QSMSC at UCL: Queen Square Multiple Sclerosis Center at UCL, UK) with manual segmentations from 4 different annotators (3 radiologists with different skill levels and 1 expert who generates the expert consensus label). On all datasets, our method consistently outperforms competing methods and relevant baselines, especially when the number of annotations is small and the amount of disagreement is large. The experiments also show that the method can capture the complicated spatial characteristics of annotators' mistakes.
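The abstract does not spell out the training objective, so the following is only a minimal sketch of how such a coupling could work at a single pixel. It assumes that the annotator network outputs one confusion matrix per annotator, that the segmentation network outputs a class distribution, and that "maximally unreliable" is encouraged by penalizing the trace of each confusion matrix. The function names, the `lam` weight, and the trace penalty itself are illustrative assumptions, not details taken from the text above.

```python
import math

def noisy_distribution(cm, p_true):
    """Map the estimated true-label distribution through one annotator's
    confusion matrix, where cm[i][j] = P(annotator says i | true class j).
    Each column of cm is assumed to sum to 1."""
    n = len(p_true)
    return [sum(cm[i][j] * p_true[j] for j in range(n)) for i in range(len(cm))]

def pixel_loss(cms, p_true, labels, lam=0.1, eps=1e-12):
    """Cross-entropy to every annotator's observed label plus a trace penalty.

    cms:    one confusion matrix per annotator (hypothetical network output)
    p_true: class probabilities for this pixel (hypothetical network output)
    labels: class index observed from each annotator
    lam:    weight of the trace term (assumed value, not from the source)
    """
    cross_entropy = 0.0
    trace = 0.0
    for cm, y in zip(cms, labels):
        p_noisy = noisy_distribution(cm, p_true)
        cross_entropy -= math.log(p_noisy[y] + eps)   # fidelity to noisy data
        trace += sum(cm[k][k] for k in range(len(cm)))  # small trace = unreliable
    return cross_entropy + lam * trace

# Two classes (background, lesion) and two annotators at one pixel.
p_true = [0.2, 0.8]
reliable = [[1.0, 0.0], [0.0, 1.0]]
over_segmenter = [[0.9, 0.4], [0.1, 0.6]]
print(noisy_distribution(over_segmenter, p_true))
print(pixel_loss([reliable, over_segmenter], p_true, labels=[1, 1]))
```

In this sketch the cross-entropy term keeps the predicted noisy labels faithful to what each annotator actually drew, while the trace term pushes as much of the disagreement as possible into the confusion matrices, leaving the segmentation output to play the role of the consensus estimate.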