Asare Sarpong Kwadwo, You Fei, Nartey Obed Tettey
School of Electronic Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China.
School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China.
Comput Intell Neurosci. 2020 Dec 3;2020:8826568. doi: 10.1155/2020/8826568. eCollection 2020.
The scarcity of large, well-labeled datasets poses a significant challenge in many medical imaging tasks. Even when sufficient data are available, accurately labeling them is arduous and time-consuming and requires expert knowledge. Class imbalance further compounds these problems and presents a considerable challenge for many machine learning algorithms. In light of this, algorithms that can exploit large amounts of unlabeled data together with a small amount of labeled data, while remaining robust to data imbalance, offer promising prospects for building highly efficient classifiers. This work proposes a semisupervised learning method that integrates self-training and self-paced learning to generate and select pseudolabeled samples for classifying breast cancer histopathological images. A novel pseudolabel generation and selection algorithm is introduced in the learning scheme to generate and select highly confident pseudolabeled samples from both well-represented and less-represented classes. This approach improves performance by jointly learning a model and optimizing the generation of pseudolabels on unlabeled target data, which augment the training data, and then retraining the model with the generated labels. A class balancing framework that normalizes the class-wise confidence scores is also proposed to prevent the model from ignoring samples from less-represented classes (hard-to-learn samples), thereby handling data imbalance effectively. Extensive experimental evaluation on the BreakHis dataset demonstrates the effectiveness of the proposed method.
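The abstract does not spell out the selection rule, so the following is only a minimal sketch of one plausible reading of class-balanced pseudolabel selection, not the authors' implementation. It assumes the model outputs softmax probabilities for each unlabeled image, that each class gets its own confidence cutoff (the lowest score among the top `portion` of samples predicted as that class), and that a sample is kept when its confidence divided by that cutoff is at least 1. The names `class_thresholds`, `select_pseudolabels`, and `portion` are hypothetical.

```python
from collections import defaultdict
import math


def predicted_label(p):
    """Index of the largest probability in one softmax vector."""
    return max(range(len(p)), key=p.__getitem__)


def class_thresholds(probs, portion):
    """Per-class cutoff: lowest confidence within the top `portion` of each predicted class.

    Assumption: this cutoff plays the role of the class-wise normalizer.
    """
    by_class = defaultdict(list)
    for p in probs:
        label = predicted_label(p)
        by_class[label].append(p[label])
    thresholds = {}
    for label, scores in by_class.items():
        scores.sort(reverse=True)
        keep = max(1, math.ceil(portion * len(scores)))  # keep at least one sample per class
        thresholds[label] = scores[keep - 1]
    return thresholds


def select_pseudolabels(probs, portion):
    """Return (sample index, pseudolabel) pairs whose normalized confidence is >= 1."""
    thresholds = class_thresholds(probs, portion)
    selected = []
    for idx, p in enumerate(probs):
        label = predicted_label(p)
        if p[label] / thresholds[label] >= 1.0:  # class-wise normalized confidence
            selected.append((idx, label))
    return selected


# Toy predictions: class 0 is well represented and confident, class 1 is not.
probs = [
    [0.95, 0.05],
    [0.90, 0.10],
    [0.80, 0.20],
    [0.70, 0.30],
    [0.45, 0.55],
    [0.40, 0.60],
]
print(select_pseudolabels(probs, 0.5))  # [(0, 0), (1, 0), (5, 1)]
```

In this toy case a single global cutoff of 0.9 would select no sample from class 1, whereas the per-class cutoff keeps its most confident sample. In a self-paced schedule, `portion` would be raised from round to round so that harder samples are admitted gradually; the schedule itself is also an assumption here.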