A unified view of density-based methods for semi-supervised clustering and classification.

Author information

Jadson Castro Gertrudes, Arthur Zimek, Jörg Sander, Ricardo J. G. B. Campello

Affiliations

SCC/ICMC/USP, University of São Paulo, Avenue Trabalhador São-carlense, 400 - Center, São Carlos, SP 13566-590 Brazil.

IMADA, University of Southern Denmark, Campusvej 55, 5230 Odense M, Denmark.

Publication information

Data Min Knowl Discov. 2019;33(6):1894-1952. doi: 10.1007/s10618-019-00651-1. Epub 2020 Jul 27.

DOI:10.1007/s10618-019-00651-1

PMID:32831623

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC7410108/

Abstract

Semi-supervised learning is drawing increasing attention in the era of big data, as the gap between the abundance of cheap, automatically collected unlabeled data and the scarcity of labeled data that are laborious and expensive to obtain is dramatically increasing. In this paper, we first introduce a unified view of density-based clustering algorithms. We then build upon this view and bridge the areas of semi-supervised clustering and classification under a common umbrella of density-based techniques. We show that there are close relations between density-based clustering algorithms and the graph-based approach for transductive classification. These relations are then used as a basis for a new framework for semi-supervised classification based on building blocks from density-based clustering. This framework is not only efficient and effective, but also statistically sound. In addition, we generalize the core algorithm in our framework, HDBSCAN*, so that it can also perform semi-supervised clustering by directly taking advantage of any fraction of labeled data that may be available. Experimental results on a large collection of datasets show the advantages of the proposed approach for both semi-supervised classification and semi-supervised clustering.
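The abstract names a relation between density-based clustering and graph-based transductive classification but gives no algorithmic detail. The sketch below is one possible illustration of that idea, not the authors' framework: it uses the mutual reachability distance known from HDBSCAN* as a density-aware edge weight and grows a minimum spanning forest from the labeled points, so every unlabeled point inherits the label of the tree that reaches it. The function names, the `min_pts` default, and the toy data are assumptions made for this example.

```python
import heapq
import math


def core_distances(points, min_pts):
    """Distance from each point to its min_pts-th nearest neighbour (itself included)."""
    cores = []
    for p in points:
        dists = sorted(math.dist(p, q) for q in points)
        cores.append(dists[min_pts - 1])
    return cores


def mutual_reachability(points, cores, i, j):
    """Density-aware edge weight: max(core(i), core(j), d(i, j))."""
    return max(cores[i], cores[j], math.dist(points[i], points[j]))


def propagate_labels(points, labels, min_pts=3):
    """Grow a spanning forest from the labeled points, Prim-style.

    `labels` maps point index -> class label for the labeled subset.
    Each unlabeled point joins the tree that reaches it through the
    cheapest mutual-reachability edge and inherits that tree's label.
    """
    cores = core_distances(points, min_pts)
    assigned = dict(labels)
    heap = []
    # Seed the frontier with every edge leaving a labeled point.
    for i in labels:
        for j in range(len(points)):
            if j not in assigned:
                heapq.heappush(heap, (mutual_reachability(points, cores, i, j), i, j))
    while heap and len(assigned) < len(points):
        _, i, j = heapq.heappop(heap)
        if j in assigned:
            continue  # already claimed through a cheaper edge
        assigned[j] = assigned[i]
        for k in range(len(points)):
            if k not in assigned:
                heapq.heappush(heap, (mutual_reachability(points, cores, j, k), j, k))
    return [assigned[i] for i in range(len(points))]


# Toy data (hypothetical): two well-separated squares, one labeled point each.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
seeds = {0: "a", 4: "b"}
result = propagate_labels(points, seeds, min_pts=3)
print(result)  # ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b']
```

The quadratic neighbour search keeps the example short. A practical implementation would use a spatial index and would have to decide how to break ties between equally cheap edges, which this sketch leaves to heap ordering.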
