Bunea Florentina, Giraud Christophe, Luo Xi, Royer Martin, Verzelen Nicolas
Department of Statistical Science, Cornell University.
Laboratoire de Mathématiques d'Orsay, CNRS, Université Paris-Sud, Université Paris-Saclay.
Ann Stat. 2020 Feb;48(1):111-137. doi: 10.1214/18-aos1794. Epub 2020 Feb 17.
The problem of variable clustering is that of estimating groups of similar components of a p-dimensional vector X = (X_1, …, X_p) from n independent copies of X. There exists a large number of algorithms that return data-dependent groups of variables, but their interpretation is limited to the algorithm that produced them. An alternative is model-based clustering, in which one begins by defining population-level clusters relative to a model that embeds notions of similarity. Algorithms tailored to such models yield estimated clusters with a clear statistical interpretation. We take this view here and introduce the class of G-block covariance models as a background model for variable clustering. In such models, two variables in a cluster are deemed similar if they have similar associations with all other variables. This can arise, for instance, when groups of variables are noise-corrupted versions of the same latent factor. We quantify the difficulty of clustering data generated from a G-block covariance model in terms of cluster proximity, measured with respect to two related, but different, cluster separation metrics. We derive minimax cluster separation thresholds, which are the metric values below which no algorithm can recover the model-defined clusters exactly, and show that they are different for the two metrics. We therefore develop two algorithms, COD and PECOK, tailored to G-block covariance models, and study their minimax-optimality with respect to each metric. Of independent interest is the fact that the analysis of the PECOK algorithm, which is based on a corrected convex relaxation of the popular K-means algorithm, provides the first statistical analysis of such algorithms for variable clustering. Additionally, we compare our methods with another popular clustering method, spectral clustering. Extensive simulation studies, as well as our data analyses, confirm the applicability of our approach.
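As a minimal illustration of the setup described above, the sketch below simulates data from a latent-factor block-covariance model (each variable is a noisy copy of its group's latent factor, so the population covariance has a block structure) and groups variables using a COD-style dissimilarity: two variables are similar when their sample covariances with every third variable agree. The threshold `tau` and the greedy merging loop are simplifications for illustration, not the calibrated procedure of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical block model: p variables split into G latent groups.
# X_j = Z_{g(j)} + noise, so Cov(X) has a G-block structure.
p, G, n = 12, 3, 2000
groups = np.repeat(np.arange(G), p // G)
Z = rng.normal(size=(n, G))                        # latent factors
X = Z[:, groups] + 0.3 * rng.normal(size=(n, p))   # noisy copies

S = np.cov(X, rowvar=False)                        # sample covariance

def cod(a, b, S):
    """COD-style dissimilarity: largest disagreement between the
    covariances of variables a and b with any third variable."""
    others = [c for c in range(S.shape[0]) if c not in (a, b)]
    return max(abs(S[a, c] - S[b, c]) for c in others)

# Greedy grouping: merge variables whose COD falls below a threshold.
tau = 0.15  # illustrative threshold; in practice it must be calibrated
labels = -np.ones(p, dtype=int)
g = 0
for j in range(p):
    if labels[j] >= 0:
        continue
    labels[j] = g
    for k in range(j + 1, p):
        if labels[k] < 0 and cod(j, k, S) < tau:
            labels[k] = g
    g += 1
```

Under this model, two variables in the same group have (population) COD equal to zero, while variables in different groups are separated by roughly the factor variance, which is the kind of cluster-separation gap the minimax thresholds in the paper quantify.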