Medical Biophysics, University of Toronto, Toronto, Canada.
Laboratory Medicine and Pathobiology, University of Toronto, Toronto, Canada.
Sci Rep. 2018 May 8;8(1):7193. doi: 10.1038/s41598-018-24876-0.
Fully labeled pathology datasets are often difficult and time-consuming to obtain. Semi-supervised learning (SSL) methods can learn from fewer labeled data points with the help of a large number of unlabeled data points. In this paper, we investigated whether clustering analysis can identify the underlying structure of the data space for SSL. We proposed a cluster-then-label method that identifies high-density regions in the data space, which are then used to help a supervised SVM find the decision boundary. We compared our method with state-of-the-art supervised and semi-supervised techniques on two different classification tasks applied to breast pathology datasets. Compared with these techniques, our SSL method improved classification performance when only a limited number of labeled data instances were available. We also showed that it is important to examine the underlying distribution of the data space before applying SSL techniques, to ensure that the data do not violate semi-supervised learning assumptions.
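The general cluster-then-label idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes plain k-means as the clustering step and a majority vote of the labeled points in each cluster as the labeling rule, and the function names (`kmeans`, `cluster_then_label`), the toy 2-D points, and the class names are all hypothetical. The final supervised step (an SVM in the paper) is left out to keep the example dependency-free.

```python
import random
from collections import Counter


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means on a list of equal-length tuples.

    Returns one cluster index per point. Assumed clustering step;
    the paper's own clustering procedure may differ.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialise centers from the data
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        assign = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign


def cluster_then_label(points, labels, k):
    """Propagate labels inside clusters.

    labels[i] is a class label, or None for an unlabeled point.
    Unlabeled points receive the majority label of the labeled points
    in their cluster, or stay None if the cluster has no labeled member.
    """
    assign = kmeans(points, k)
    votes = {}
    for a, y in zip(assign, labels):
        if y is not None:
            votes.setdefault(a, Counter())[y] += 1
    cluster_label = {c: cnt.most_common(1)[0][0] for c, cnt in votes.items()}
    return [y if y is not None else cluster_label.get(a)
            for a, y in zip(assign, labels)]


# Toy data (hypothetical): two well-separated groups, one labeled point each.
points = [(0.0, 0.0), (0.5, 0.2), (0.1, 0.6), (0.4, 0.4),
          (10.0, 10.0), (10.3, 9.8), (9.7, 10.2), (10.1, 10.4)]
labels = ["benign", None, None, None, "malignant", None, None, None]
pseudo = cluster_then_label(points, labels, k=2)
print(pseudo)
```

In a full pipeline, the pseudo-labeled points would then be passed together with the original labeled points to a supervised classifier such as an SVM. The sketch only works when clusters line up with classes, which is why the abstract stresses checking the data distribution before applying SSL.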