Douglas Steinley
Department of Psychological Sciences, University of Missouri-Columbia, Columbia, MO 65203, USA.
Psychol Methods. 2006 Jun;11(2):178-92. doi: 10.1037/1082-989X.11.2.178.
Using the cluster generation procedure proposed by D. Steinley and R. Henson (2005), the author investigated the performance of K-means clustering under the following scenarios: (a) different probabilities of cluster overlap; (b) different types of cluster overlap; (c) varying sample sizes, numbers of clusters, and numbers of dimensions; (d) different multivariate distributions of clusters; and (e) various multidimensional data structures. The results are evaluated in terms of the Hubert-Arabie adjusted Rand index, and several observations concerning the performance of K-means clustering are made. Finally, the article concludes with the proposal of a diagnostic technique indicating when the partitioning given by a K-means cluster analysis can be trusted. By combining the information from several observable characteristics of the data (number of clusters, number of variables, sample size, etc.) with the prevalence of unique local optima in several thousand implementations of the K-means algorithm, the author provides a method capable of guiding key data-analysis decisions.
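The two quantities the abstract relies on, the Hubert-Arabie adjusted Rand index and the number of unique local optima across repeated K-means runs, can be illustrated with a minimal sketch. This is not the article's diagnostic procedure; the function names (`adjusted_rand_index`, `canonical`, `count_distinct_partitions`) are hypothetical, and the sketch assumes each K-means run has already been reduced to a sequence of cluster labels, one per observation.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Hubert-Arabie adjusted Rand index between two partitions of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have equal length")
    n = len(labels_a)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: nothing to disagree about

    # Contingency counts: items in cluster i of A that are also in cluster j of B.
    joint = Counter(zip(labels_a, labels_b))
    sum_joint = sum(comb(c, 2) for c in joint.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = sum_a * sum_b / total_pairs  # agreement expected by chance
    max_index = 0.5 * (sum_a + sum_b)       # largest attainable agreement
    if max_index == expected:
        # Degenerate case: both partitions are one big cluster or all singletons.
        return 1.0
    return (sum_joint - expected) / (max_index - expected)


def canonical(labels):
    """Relabel clusters by order of first appearance so label permutations match."""
    mapping = {}
    return tuple(mapping.setdefault(lab, len(mapping)) for lab in labels)


def count_distinct_partitions(runs):
    """Count distinct partitions among label vectors from repeated runs."""
    return len({canonical(run) for run in runs})
```

For example, `adjusted_rand_index(true_labels, run_labels)` scores one run against a known cluster structure, and `count_distinct_partitions(all_runs)` tallies how many different solutions a set of random restarts produced. Treating every distinct partition as a separate local optimum is an assumption of this sketch.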