Toward Multidiversified Ensemble Clustering of High-Dimensional Data: From Subspaces to Metrics and Beyond.

Author information

Huang Dong, Wang Chang-Dong, Lai Jian-Huang, Kwoh Chee-Keong

Publication information

IEEE Trans Cybern. 2022 Nov;52(11):12231-12244. doi: 10.1109/TCYB.2021.3049633. Epub 2022 Oct 17.

Abstract

The rapid emergence of high-dimensional data in various areas has brought new challenges to current ensemble clustering research. To deal with the curse of dimensionality, considerable recent effort in ensemble clustering has gone into different subspace-based techniques. However, beyond the emphasis on subspaces, rather limited attention has been paid to the potential diversity in similarity/dissimilarity metrics. It remains a surprisingly open problem in ensemble clustering how to create and aggregate a large population of diversified metrics, and furthermore, how to jointly investigate the multilevel diversity in the large populations of metrics, subspaces, and clusters in a unified framework. To tackle this problem, this article proposes a novel multidiversified ensemble clustering approach. In particular, we create a large number of diversified metrics by randomizing a scaled exponential similarity kernel; these metrics are then coupled with random subspaces to form a large set of metric-subspace pairs. Based on the similarity matrices derived from these metric-subspace pairs, an ensemble of diversified base clusterings can be constructed. Furthermore, an entropy-based criterion is used to explore the cluster-wise diversity in ensembles, based on which three specific ensemble clustering algorithms are presented by incorporating three types of consensus functions. Extensive experiments are conducted on 30 high-dimensional datasets, including 18 cancer gene expression datasets and 12 image/speech datasets, which demonstrate the superiority of our algorithms over the state of the art. The source code is available at https://github.com/huangdonghere/MDEC.
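
The metric-subspace pairing described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (their code is at the repository linked above). It assumes the commonly cited form of the scaled exponential similarity kernel, W(i,j) = exp(-d(i,j)^2 / (mu * eps_ij)) with eps_ij averaging the two points' mean k-nearest-neighbor distances and d(i,j). The function names, the Euclidean base distance, and the sampling ranges for the subspace ratio, k, and mu are all hypothetical choices made for illustration.

```python
import math
import random


def random_subspace(n_features, ratio, rng):
    """Sample a random subset of feature indices without replacement."""
    size = max(1, int(round(ratio * n_features)))
    return sorted(rng.sample(range(n_features), size))


def pairwise_euclidean(X, features):
    """Euclidean distance matrix restricted to the chosen feature indices."""
    n = len(X)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.sqrt(sum((X[i][f] - X[j][f]) ** 2 for f in features))
            D[i][j] = D[j][i] = d
    return D


def scaled_exp_similarity(D, k, mu):
    """Scaled exponential similarity kernel on a distance matrix D.

    Assumed form: W[i][j] = exp(-D[i][j]^2 / (mu * eps_ij)), where
    eps_ij = (mean kNN distance of i + mean kNN distance of j + D[i][j]) / 3.
    """
    n = len(D)
    knn_mean = []
    for i in range(n):
        others = sorted(D[i][j] for j in range(n) if j != i)
        knn_mean.append(sum(others[:k]) / k)
    W = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            eps = (knn_mean[i] + knn_mean[j] + D[i][j]) / 3.0
            # eps == 0 only when the points coincide; treat them as fully similar.
            w = math.exp(-D[i][j] ** 2 / (mu * eps)) if eps > 0 else 1.0
            W[i][j] = W[j][i] = w
    return W


def metric_subspace_pairs(X, n_pairs, seed=0):
    """Build n_pairs similarity matrices, each from a randomized kernel
    (random k and mu) applied to a random feature subspace.
    The sampling ranges below are illustrative assumptions."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    pairs = []
    for _ in range(n_pairs):
        feats = random_subspace(d, rng.uniform(0.2, 0.8), rng)
        k = rng.randint(1, max(1, n // 2))
        mu = rng.uniform(0.3, 0.8)
        W = scaled_exp_similarity(pairwise_euclidean(X, feats), k, mu)
        pairs.append({"features": feats, "k": k, "mu": mu, "W": W})
    return pairs
```

Each resulting matrix W could then be passed to any similarity-based clusterer to produce one base clustering, so that the ensemble inherits diversity from both the kernel parameters and the feature subsets.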
