How many clusters? An information-theoretic perspective.
Susanne Still, William Bialek
Department of Physics and Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA.
Neural Comput. 2004 Dec;16(12):2483-506. doi: 10.1162/0899766042321751.
Clustering provides a common means of identifying structure in complex data, and there is renewed interest in clustering as a tool for the analysis of large data sets in many fields. A natural question is how many clusters are appropriate for the description of a given system. Traditional approaches to this problem are based either on a framework in which clusters of a particular shape are assumed as a model of the system or on a two-step procedure in which a clustering criterion determines the optimal assignments for a given number of clusters and a separate criterion measures the goodness of the classification to determine the number of clusters. In a statistical mechanics approach, clustering can be seen as a trade-off between energy- and entropy-like terms, with lower temperature driving the proliferation of clusters to provide a more detailed description of the data. For finite data sets, we expect that there is a limit to the meaningful structure that can be resolved and therefore a minimum temperature below which we merely fit sampling noise. This suggests that correcting the clustering criterion for the bias that arises from sampling errors will allow us to find a clustering solution at a temperature that is optimal in the sense that we capture maximal meaningful structure, without having to define an external criterion for the goodness or stability of the clustering. We show that in a general information-theoretic framework, the finite size of a data set determines an optimal temperature, and we introduce a method for finding the maximal number of clusters that can be resolved from the data in the hard clustering limit.
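The statistical-mechanics picture in the abstract can be made concrete with the standard rate-distortion / deterministic-annealing formulation of soft clustering. The functional below is a common form consistent with the abstract's description, not a verbatim reproduction of the paper's objective: one minimizes a free-energy-like trade-off between a distortion (energy-like) term and the assignment information (entropy-like) term, with the temperature T setting the exchange rate between them.

```latex
% A standard free-energy trade-off behind temperature-controlled clustering
% (rate-distortion form; the paper's exact functional may differ).
\[
  \mathcal{F}[p(c \mid i)]
  = \underbrace{\sum_{i,c} p(i)\, p(c \mid i)\, d(x_i, y_c)}_{\text{energy-like: mean distortion}}
  \; + \; T \, \underbrace{I(c;i)}_{\text{entropy-like: assignment information}} ,
\]
\[
  \frac{\delta \mathcal{F}}{\delta p(c \mid i)} = 0
  \quad\Longrightarrow\quad
  p(c \mid i) = \frac{p(c)\, e^{-d(x_i, y_c)/T}}{\sum_{c'} p(c')\, e^{-d(x_i, y_{c'})/T}} .
\]
```

A minimal, self-contained Python sketch of these self-consistent updates follows; all function names are illustrative, not from the paper. It includes a naive plug-in estimate of I(c;i), and the abstract's point is precisely that this estimate is biased upward for finite samples, so apparent structure keeps growing as T drops even on pure noise.

```python
import numpy as np

def soft_cluster(X, n_clusters, T, n_iter=200, seed=0):
    """Deterministic-annealing soft clustering at temperature T.

    Iterates the self-consistent updates
        p(c|i) ~ p(c) * exp(-||x_i - y_c||^2 / T)
    together with a weighted-centroid update for the y_c.
    A sketch of the standard scheme, not the paper's exact algorithm.
    """
    rng = np.random.default_rng(seed)
    N, dim = X.shape
    # Start all centers near the data mean; lowering T lets them split.
    Y = X.mean(axis=0) + 0.01 * rng.standard_normal((n_clusters, dim))
    p_c = np.full(n_clusters, 1.0 / n_clusters)
    for _ in range(n_iter):
        D = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)  # (N, K) distortions
        logits = np.log(p_c + 1e-12)[None, :] - D / T            # Boltzmann weights
        logits -= logits.max(axis=1, keepdims=True)              # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)                        # p(c|i)
        p_c = P.mean(axis=0)                                     # marginal p(c)
        Y = (P.T @ X) / (N * p_c[:, None] + 1e-12)               # weighted centroids
    return P, Y, p_c

def assignment_information(P, p_c):
    """Naive plug-in estimate of I(c;i) in nats (upward-biased for finite N)."""
    with np.errstate(divide="ignore", invalid="ignore"):
        contrib = P * np.log(P / (p_c[None, :] + 1e-300))
    return float(np.nansum(contrib) / P.shape[0])

if __name__ == "__main__":
    # Pure Gaussian noise: no real clusters, yet the apparent information
    # grows as T decreases; this is the finite-sample bias the paper corrects.
    X = np.random.default_rng(1).standard_normal((200, 2))
    for T in (4.0, 1.0, 0.25):
        P, Y, p_c = soft_cluster(X, n_clusters=8, T=T)
        print(f"T={T:4.2f}  apparent I(c;i) = {assignment_information(P, p_c):.3f} nats")
```

In this language, the paper's contribution is a finite-N correction to such an information estimate that determines the optimal temperature and hence the maximal resolvable number of clusters; the sketch above only exhibits the uncorrected bias.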