Wu Chong, Kwon Sunghoon, Shen Xiaotong, Pan Wei
Division of Biostatistics, University of Minnesota, Minneapolis, MN 55455, USA.
Department of Applied Statistics, Konkuk University, Seoul, South Korea School of Statistics, University of Minnesota, Minneapolis, MN 55455, USA.
J Mach Learn Res. 2016;17.
Clustering is unsupervised and exploratory in nature. Yet, it can be performed through penalized regression with grouping pursuit, as demonstrated in Pan et al. (2013). In this paper, we develop a more efficient algorithm for scalable computation and a new theory of clustering consistency for the method. This algorithm, called DC-ADMM, combines difference of convex (DC) programming with the alternating direction method of multipliers (ADMM). This algorithm is shown to be more computationally efficient than the quadratic penalty based algorithm of Pan et al. (2013) because of the former's closed-form updating formulas. Numerically, we compare the DC-ADMM algorithm with the quadratic penalty algorithm to demonstrate its utility and scalability. Theoretically, we establish a finite-sample mis-clustering error bound for penalized regression based clustering with the constrained regularization in a general setting. On this ground, we provide conditions for clustering consistency of the penalized clustering method. As an end product, we put R package implementing PRclust with various loss and grouping penalty functions available on GitHub and CRAN.
聚类本质上是无监督且探索性的。然而,正如Pan等人(2013年)所展示的,它可以通过带分组追踪的惩罚回归来执行。在本文中,我们为可扩展计算开发了一种更高效的算法,并为该方法建立了一种新的聚类一致性理论。这种算法称为DC - ADMM,它将凸差(DC)规划与乘子交替方向法(ADMM)相结合。由于前者具有闭式更新公式,该算法在计算上比Pan等人(2013年)基于二次惩罚的算法更高效。在数值上,我们将DC - ADMM算法与二次惩罚算法进行比较,以证明其效用和可扩展性。在理论上,我们在一般情况下为基于惩罚回归的聚类建立了一个有限样本误聚类误差界,其中带有约束正则化。在此基础上,我们为惩罚聚类方法的聚类一致性提供了条件。作为最终成果,我们在GitHub和CRAN上发布了实现PRclust的R包,其中包含各种损失和分组惩罚函数可用。