Law Martin H C, Figueiredo Mário A T, Jain Anil K
Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan 48824-1226, USA.
IEEE Trans Pattern Anal Mach Intell. 2004 Sep;26(9):1154-66. doi: 10.1109/TPAMI.2004.71.
Clustering is a common unsupervised learning technique used to discover group structure in a set of data. While there exist many algorithms for clustering, the important issue of feature selection, that is, what attributes of the data should be used by the clustering algorithms, is rarely touched upon. Feature selection for clustering is difficult because, unlike in supervised learning, there are no class labels for the data and, thus, no obvious criteria to guide the search. Another important problem in clustering is the determination of the number of clusters, which clearly impacts and is influenced by the feature selection issue. In this paper, we propose the concept of feature saliency and introduce an expectation-maximization (EM) algorithm to estimate it, in the context of mixture-based clustering. Due to the introduction of a minimum message length model selection criterion, the saliency of irrelevant features is driven toward zero, which corresponds to performing feature selection. The criterion and algorithm are then extended to simultaneously estimate the feature saliencies and the number of clusters.
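The model described in the abstract treats each feature of a data point as drawn either from a component-specific density, with probability equal to that feature's saliency, or from a common "background" density shared by all components; EM then estimates the saliencies alongside the mixture parameters. Below is a minimal, illustrative sketch of that idea for Gaussian components on synthetic data. It is plain maximum likelihood, without the paper's minimum-message-length penalty, so saliencies of irrelevant features are merely left ambiguous rather than driven to zero; all variable names and the two-feature dataset are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: feature 0 separates two clusters; feature 1 is pure noise.
X = np.vstack([rng.normal(-2.0, 0.5, (100, 1)),
               rng.normal(+2.0, 0.5, (100, 1))])
X = np.hstack([X, rng.normal(0.0, 1.0, (200, 1))])

def normal_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

K = 2                                         # number of components (assumed known here)
N, D = X.shape
alpha = np.full(K, 1.0 / K)                   # mixing weights
rho = np.full(D, 0.5)                         # feature saliencies, one per feature
mu = np.quantile(X, [0.25, 0.75], axis=0)     # K x D component means (deterministic init)
var = np.ones((K, D))                         # K x D component variances
bg_mean, bg_var = X.mean(0), X.var(0)         # common density for irrelevant features (held fixed)

for _ in range(50):
    # E-step: cluster responsibilities and per-feature "relevant" posteriors.
    comp = normal_pdf(X[:, None, :], mu[None], var[None])   # N x K x D
    bg = normal_pdf(X, bg_mean, bg_var)[:, None, :]         # N x 1 x D
    mix = rho * comp + (1.0 - rho) * bg                     # N x K x D, per-feature mixture
    joint = alpha * mix.prod(axis=2)                        # N x K
    w = joint / joint.sum(axis=1, keepdims=True)            # P(cluster k | x_n)
    u = w[:, :, None] * rho * comp / mix                    # P(cluster k, feature relevant | x_n)
    # M-step: update weights, saliencies, and component parameters.
    alpha = w.mean(axis=0)
    rho = u.sum(axis=(0, 1)) / N
    mu = (u * X[:, None, :]).sum(axis=0) / u.sum(axis=0)
    var = (u * (X[:, None, :] - mu) ** 2).sum(axis=0) / u.sum(axis=0) + 1e-6

print(rho)  # saliency per feature: high for feature 0, ambiguous for the noise feature
```

On this toy problem the saliency of the cluster-separating feature climbs toward one, while the noise feature's component densities converge to the background density, leaving its saliency near its initial value; the paper's MML criterion is what pushes such saliencies to exactly zero, performing explicit feature selection.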