Centro de Informática, Universidade Federal de Pernambuco, 50740560, Recife, Brazil.
Neural Netw. 2020 Oct;130:253-268. doi: 10.1016/j.neunet.2020.06.022. Epub 2020 Jul 3.
A surge in the availability of data from multiple sources and modalities correlates with advances in how to obtain, compress, store, transfer, and process large amounts of complex high-dimensional data. Clustering becomes harder as data dimensionality grows, because high dimensionality reduces the discriminative power of distance metrics. Subspace clustering aims to group data drawn from a union of subspaces. A large number of state-of-the-art approaches exist, and we divide them into families according to the method used for clustering. We introduce a soft subspace clustering algorithm, a Self-organizing Map (SOM) with a time-varying structure, that clusters data without any prior knowledge of the number of categories or of the neural network topology, both of which are determined during training. The model also assigns relevances (weights) to the different dimensions, capturing from the learning process the influence of each dimension on uncovering clusters. We validate the model on a number of real-world datasets. The algorithm performs competitively in a diverse range of contexts, including data mining, gene expression, multi-view, computer vision, and text clustering problems, which involve high-dimensional data. Extensive experiments suggest that our method very often outperforms state-of-the-art approaches in all types of problems considered.
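The abstract does not state the update rules, so the following is only a minimal sketch of the general idea it describes: a map that grows during training and learns a per-dimension relevance for each node. It is not the authors' algorithm. The names `Node`, `train`, and `SAMPLES`, the activation function, the insertion threshold, and the relevance formula are all assumptions chosen for illustration.

```python
import math

class Node:
    """One map unit: a prototype plus per-dimension relevance weights."""
    def __init__(self, center):
        self.center = list(center)
        self.avg_dist = [0.0] * len(center)   # running mean of |x_i - c_i|
        self.relevance = [1.0] * len(center)  # every dimension starts fully relevant

def weighted_distance(x, node):
    # Euclidean distance in which each dimension is scaled by its relevance.
    return math.sqrt(sum(r * (a - c) ** 2
                         for a, c, r in zip(x, node.center, node.relevance)))

def activation(x, node):
    # Maps distance to (0, 1]; 1 means an exact match on the relevant dimensions.
    return 1.0 / (1.0 + weighted_distance(x, node))

def update(node, x, lr):
    # Move the prototype toward x and track how far x was on each dimension.
    for i, (a, c) in enumerate(zip(x, node.center)):
        node.avg_dist[i] += lr * (abs(a - c) - node.avg_dist[i])
        node.center[i] += lr * (a - c)
    # Hypothetical relevance rule: dimensions on which samples stay close
    # to the prototype get a relevance near 1, noisy ones drift toward 0.
    mean = sum(node.avg_dist) / len(node.avg_dist)
    if mean > 0:
        node.relevance = [1.0 / (1.0 + d / mean) for d in node.avg_dist]

def train(samples, threshold=0.8, lr=0.1, epochs=5):
    """Grow the map: insert a node whenever no existing node is active enough."""
    nodes = []
    for _ in range(epochs):
        for x in samples:
            winner = max(nodes, key=lambda n: activation(x, n), default=None)
            if winner is None or activation(x, winner) < threshold:
                nodes.append(Node(x))      # structure changes during training
            else:
                update(winner, x, lr)
    return nodes

# Toy data: two groups separated on dimensions 0 and 1; dimension 2 is noise.
SAMPLES = [
    (0.10, 0.10, 0.00), (0.90, 0.90, 0.20),
    (0.12, 0.10, 0.20), (0.88, 0.90, 0.00),
    (0.10, 0.12, 0.00), (0.90, 0.88, 0.20),
    (0.12, 0.12, 0.20), (0.88, 0.88, 0.00),
]
```

Calling `train(SAMPLES)` on this toy data yields one node per group without the number of groups being specified in advance, and each node ends up with a lower relevance on the noisy third dimension than on the two informative ones. A full method would also need a rule for removing nodes and for connecting neighbours, which this sketch leaves out.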