Pan Wei, Shen Xiaotong, Liu Binghui
Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455.
School of Statistics, University of Minnesota, Minneapolis, MN 55455.
J Mach Learn Res. 2013 Jul 1;14(7):1865.
Clustering analysis is widely used in many fields. Traditionally, clustering is regarded as unsupervised learning because it lacks a class label or a quantitative response variable, which, in contrast, is present in supervised learning such as classification and regression. Here we formulate clustering as penalized regression with grouping pursuit. In addition to the novel use of a non-convex group penalty and its associated unique operating characteristics in the proposed clustering method, a main advantage of this formulation is that it allows borrowing some well-established results from classification and regression, such as model selection criteria for choosing the number of clusters, a difficult problem in clustering analysis. In particular, we propose using generalized cross-validation (GCV) based on generalized degrees of freedom (GDF) to select the number of clusters. We use a few simple numerical examples to compare our proposed method with some existing approaches, demonstrating our method's promising performance.
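The abstract does not state the objective function, so the following is only a sketch of what a penalized-regression formulation of clustering with a non-convex group penalty could look like. Everything in it is an assumption made for illustration: the observations x_i, the per-observation centroids \mu_i, the tuning parameters \lambda and \tau, the truncated-Lasso form of the penalty h, and the specific form of the GCV criterion.

```latex
% Assumed notation: x_1, ..., x_n are p-dimensional observations and
% each x_i is given its own centroid \mu_i. Observations whose
% estimated centroids coincide are read as one cluster.
\begin{align}
  \hat{\mu} &= \arg\min_{\mu_1,\dots,\mu_n}\;
    \frac{1}{2}\sum_{i=1}^{n} \lVert x_i - \mu_i \rVert_2^2
    + \lambda \sum_{i<j} h\bigl(\lVert \mu_i - \mu_j \rVert_2\bigr), \\
  % One possible non-convex group penalty (an assumption): a
  % truncated Lasso penalty, which stops penalizing a pairwise
  % difference once it exceeds \tau.
  h(a) &= \min(a, \tau), \qquad a \ge 0,\; \tau > 0, \\
  % An assumed GCV-type criterion, with the residual sum of squares
  % in the numerator and GDF standing in for the degrees of freedom.
  \mathrm{GCV} &= \frac{\sum_{i=1}^{n} \lVert x_i - \hat{\mu}_i \rVert_2^2}
                       {(np - \mathrm{GDF})^2}.
\end{align}
```

Under these assumptions, the pairwise penalty pulls centroids together so that the number of distinct fitted centroids plays the role of the number of clusters, and the tuning parameters would be chosen to minimize the GCV-type criterion.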