Castro Gertrudes Jadson, Zimek Arthur, Sander Jörg, Campello Ricardo J G B
SCC/ICMC/USP, University of São Paulo, Avenue Trabalhador São-carlense, 400 - Center, São Carlos, SP 13566-590 Brazil.
IMADA, University of Southern Denmark, Campusvej 55, 5230 Odense M, Denmark.
Data Min Knowl Discov. 2019;33(6):1894-1952. doi: 10.1007/s10618-019-00651-1. Epub 2020 Jul 27.
Semi-supervised learning is drawing increasing attention in the era of big data, as the gap widens between the abundance of cheap, automatically collected unlabeled data and the scarcity of labeled data, which are laborious and expensive to obtain. In this paper, we first introduce a unified view of density-based clustering algorithms. We then build upon this view and bridge the areas of semi-supervised clustering and classification under a common umbrella of density-based techniques. We show that there are close relations between density-based clustering algorithms and the graph-based approach to transductive classification. These relations are then used as the basis for a new framework for semi-supervised classification built from the building blocks of density-based clustering. This framework is not only efficient and effective, but also statistically sound. In addition, we generalize the core algorithm in our framework, HDBSCAN*, so that it can also perform semi-supervised clustering by directly taking advantage of any fraction of labeled data that may be available. Experimental results on a large collection of datasets show the advantages of the proposed approach for both semi-supervised classification and semi-supervised clustering.
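The relation between density-based clustering and graph-based transductive classification mentioned in the abstract can be illustrated with a small sketch. It is not the authors' framework, only one plausible reading of the idea under stated assumptions: distances between points are replaced by the mutual reachability distance used in HDBSCAN*, and labels spread from the labeled points along the cheapest edges of the resulting graph, in the manner of a multi-source Prim expansion. The function names (`core_distances`, `mutual_reachability`, `propagate_labels`), the parameter `min_pts`, and the toy data are all hypothetical and chosen for illustration.

```python
import heapq
import math


def core_distances(points, min_pts):
    """Distance from each point to its min_pts-th nearest neighbor.

    The point itself counts as its own first neighbor (an assumption of this
    sketch). Requires len(points) >= min_pts.
    """
    cores = []
    for p in points:
        dists = sorted(math.dist(p, q) for q in points)
        cores.append(dists[min_pts - 1])
    return cores


def mutual_reachability(points, cores, i, j):
    """Largest of the two core distances and the Euclidean distance."""
    return max(cores[i], cores[j], math.dist(points[i], points[j]))


def propagate_labels(points, seeds, min_pts=3):
    """Spread labels from seeds (dict: index -> label) to every point.

    At each step the cheapest mutual-reachability edge leading from a labeled
    point to an unlabeled point is taken, and the unlabeled point inherits the
    label of the point it attaches to.
    """
    cores = core_distances(points, min_pts)
    labels = dict(seeds)
    heap = []

    def push_edges(i):
        # Offer all edges from labeled point i to currently unlabeled points.
        for j in range(len(points)):
            if j not in labels:
                weight = mutual_reachability(points, cores, i, j)
                heapq.heappush(heap, (weight, i, j))

    for i in seeds:
        push_edges(i)

    while heap and len(labels) < len(points):
        _, i, j = heapq.heappop(heap)
        if j in labels:
            continue  # already reached through a cheaper edge
        labels[j] = labels[i]
        push_edges(j)

    return [labels[i] for i in range(len(points))]


# Toy data: two well-separated groups, one labeled point in each.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
seeds = {0: "a", 4: "b"}
print(propagate_labels(points, seeds))
# → ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b']
```

Because edges inside a dense group are much cheaper than edges crossing the low-density gap, each group is fully labeled before any crossing edge is considered. The quadratic number of candidate edges keeps the sketch short; it is not meant to reflect the efficiency claims made in the abstract.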