Hino Hideitsu, Murata Noboru
Graduate School of Systems and Information Engineering, University of Tsukuba, Tsukuba, Ibaraki, Japan, 305-8573
Neural Comput. 2014 Sep;26(9):2074-101. doi: 10.1162/NECO_a_00628. Epub 2014 Jun 12.
Clustering is a representative task of unsupervised learning and one of the important approaches in exploratory data analysis. By its very nature, clustering without strong assumptions on the data distribution is desirable. Information-theoretic clustering is a class of clustering methods that optimize information-theoretic quantities such as entropy and mutual information. These quantities can be estimated in a nonparametric manner, and information-theoretic clustering algorithms are capable of capturing various intrinsic data structures. It is also possible to estimate information-theoretic quantities from a data set with a sampling weight for each datum. By assuming that each datum is sampled from a certain cluster and assigning different sampling weights depending on the cluster, cluster-conditional information-theoretic quantities can be estimated. In this letter, a simple iterative clustering algorithm is proposed based on a nonparametric estimator of the log likelihood for weighted data sets. The clustering algorithm is also derived from the principle of conditional entropy minimization with maximum entropy regularization. The proposed algorithm contains no tuning parameter. Experiments show that the algorithm is comparable to, or outperforms, conventional nonparametric clustering methods.
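To make the abstract's idea concrete, the following is a minimal sketch of an iterative clustering driven by a nonparametric (kernel density) estimate of the cluster-conditional log likelihood. The Gaussian kernel, the bandwidth `h`, the farthest-point seeding, and the leave-one-out reassignment rule are illustrative assumptions, not the authors' published algorithm (which is tuning-parameter-free).

```python
import numpy as np

def kde_loglik_clustering(X, k=2, h=1.0, iters=20):
    """Illustrative sketch: reassign each point to the cluster that
    maximizes a leave-one-out KDE log likelihood plus a log cluster
    proportion. Kernel choice, bandwidth, and seeding are assumptions."""
    n = len(X)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / (2.0 * h * h))            # Gaussian kernel matrix

    # Farthest-point seeding, then nearest-seed initial assignment.
    seeds = [0]
    for _ in range(k - 1):
        seeds.append(int(sq[seeds].min(axis=0).argmax()))
    labels = sq[seeds].argmin(axis=0)

    for _ in range(iters):
        new = labels.copy()
        for i in range(n):
            scores = []
            for c in range(k):
                mask = (labels == c)
                mask[i] = False                 # leave-one-out density
                m = mask.sum()
                if m == 0:
                    scores.append(-np.inf)
                    continue
                dens = K[i, mask].sum() / m     # KDE of cluster c at x_i
                # nonparametric log likelihood + log cluster proportion
                scores.append(np.log(dens + 1e-300) + np.log(m / n))
            new[i] = int(np.argmax(scores))
        if np.array_equal(new, labels):
            break
        labels = new
    return labels
```

On two well-separated groups of points, the reassignment step leaves a correct partition fixed, since the cluster-conditional density of each point is far larger under its own group than under the other.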