IEEE Trans Cybern. 2015 May;45(5):1069-80. doi: 10.1109/TCYB.2014.2344015. Epub 2014 Sep 18.
Clustering, one of the most classical research problems in pattern recognition and data mining, has been widely explored and applied in many applications. The rapid evolution of data on the Web poses new challenges for traditional clustering techniques: 1) correlations among related clustering tasks and/or within an individual task are not well captured; 2) the problem of clustering out-of-sample data is seldom considered; and 3) the discriminative property of the cluster label matrix is not well explored. In this paper, we propose a novel clustering model, namely multitask spectral clustering (MTSC), to cope with these challenges. Specifically, two types of correlations are considered: 1) intertask clustering correlation, which refers to the relations among different clustering tasks, and 2) intratask learning correlation, which enables the processes of learning cluster labels and learning the mapping function to reinforce each other. We incorporate a novel ℓ2,p-norm regularizer to control the coherence of all the tasks, based on the assumption that related tasks should share a common low-dimensional representation. Moreover, for each individual task, an explicit mapping function is simultaneously learned for predicting cluster labels by mapping features to the cluster label matrix. Meanwhile, we show that the learning process can naturally incorporate discriminative information to further improve clustering performance. We explore and discuss the relationships between our proposed model and several representative clustering techniques, including spectral clustering, k-means, and discriminative k-means. Extensive experiments on various real-world datasets illustrate the advantage of the proposed MTSC model over state-of-the-art clustering approaches.
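The abstract does not spell out the ℓ2,p-norm regularizer, so the following is only a minimal sketch under the commonly used row-wise definition, ||W||_{2,p} = (sum_i ||w_i||_2^p)^(1/p), where w_i is the i-th row of W. The function name `l2p_norm` and the toy matrices are hypothetical, and treating columns as per-task mapping weights is an assumption for illustration, not the paper's stated formulation. The sketch shows why such a penalty can encourage tasks to share a representation: for p = 1, a matrix whose nonzero weights sit in the same row scores lower than one with the same entries spread over different rows.

```python
import math


def l2p_norm(rows, p=1.0):
    """Row-wise l2,p (quasi-)norm: (sum_i ||rows[i]||_2 ** p) ** (1 / p).

    Assumed definition for illustration; the abstract does not state it.
    """
    if p <= 0:
        raise ValueError("p must be positive")
    # Euclidean norm of each row (one row per feature in this toy setup).
    row_norms = [math.sqrt(sum(x * x for x in row)) for row in rows]
    return sum(r ** p for r in row_norms) ** (1.0 / p)


# Hypothetical weights: rows are features, columns are two tasks.
# In W_shared both tasks rely on the same feature (a single nonzero row).
W_shared = [[3.0, 4.0], [0.0, 0.0], [0.0, 0.0]]
# In W_spread each task relies on a different feature.
W_spread = [[3.0, 0.0], [0.0, 4.0], [0.0, 0.0]]

shared_p1 = l2p_norm(W_shared, p=1.0)
spread_p1 = l2p_norm(W_spread, p=1.0)
shared_p2 = l2p_norm(W_shared, p=2.0)
spread_p2 = l2p_norm(W_spread, p=2.0)
```

With p = 1 the shared-row matrix has the smaller value, while with p = 2 the measure reduces to the Frobenius norm and cannot tell the two matrices apart, which is one reason a tunable p is attractive in this kind of regularizer.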