Bunea Florentina, Giraud Christophe, Luo Xi, Royer Martin, Verzelen Nicolas
Department of Statistical Science, Cornell University.
Laboratoire de Mathématiques d'Orsay, CNRS, Université Paris-Sud, Université Paris-Saclay.
Ann Stat. 2020 Feb;48(1):111-137. doi: 10.1214/18-aos1794. Epub 2020 Feb 17.
The problem of variable clustering is that of estimating groups of similar components of a p-dimensional vector X = (X_1, …, X_p) from n independent copies of X. There exists a large number of algorithms that return data-dependent groups of variables, but their interpretation is limited to the algorithm that produced them. An alternative is model-based clustering, in which one begins by defining population-level clusters relative to a model that embeds notions of similarity. Algorithms tailored to such models yield estimated clusters with a clear statistical interpretation. We take this view here and introduce the class of G-block covariance models as a background model for variable clustering. In such models, two variables in a cluster are deemed similar if they have similar associations with all other variables. This can arise, for instance, when groups of variables are noise-corrupted versions of the same latent factor. We quantify the difficulty of clustering data generated from a G-block covariance model in terms of cluster proximity, measured with respect to two related, but different, cluster separation metrics. We derive minimax cluster separation thresholds, which are the metric values below which no algorithm can recover the model-defined clusters exactly, and show that they are different for the two metrics. We therefore develop two algorithms, COD and PECOK, tailored to G-block covariance models, and study their minimax-optimality with respect to each metric. Of independent interest is the fact that the analysis of the PECOK algorithm, which is based on a corrected convex relaxation of the popular K-means algorithm, provides the first statistical analysis of such algorithms for variable clustering. Additionally, we compare our methods with another popular clustering method, spectral clustering. Extensive simulation studies, as well as our data analyses, confirm the applicability of our approach.
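As a minimal illustration of the setup described above, the sketch below simulates data from a latent-factor block-covariance model (each variable is a noisy copy of its group's latent factor, so the population covariance has a block structure) and groups variables using a COD-style dissimilarity: two variables are similar when their sample covariances with every third variable agree. The threshold `tau` and the greedy merging loop are simplifications for illustration, not the calibrated procedure of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical block model: p variables split into G latent groups.
# X_j = Z_{g(j)} + noise, so Cov(X) has a G-block structure.
p, G, n = 12, 3, 2000
groups = np.repeat(np.arange(G), p // G)
Z = rng.normal(size=(n, G))                        # latent factors
X = Z[:, groups] + 0.3 * rng.normal(size=(n, p))   # noisy copies

S = np.cov(X, rowvar=False)                        # sample covariance

def cod(a, b, S):
    """COD-style dissimilarity: largest disagreement between the
    covariances of variables a and b with any third variable."""
    others = [c for c in range(S.shape[0]) if c not in (a, b)]
    return max(abs(S[a, c] - S[b, c]) for c in others)

# Greedy grouping: merge variables whose COD falls below a threshold.
tau = 0.15  # illustrative threshold; in practice it must be calibrated
labels = -np.ones(p, dtype=int)
g = 0
for j in range(p):
    if labels[j] >= 0:
        continue
    labels[j] = g
    for k in range(j + 1, p):
        if labels[k] < 0 and cod(j, k, S) < tau:
            labels[k] = g
    g += 1
```

Under this model, two variables in the same group have (population) COD equal to zero, while variables in different groups are separated by roughly the factor variance, which is the kind of cluster-separation gap the minimax thresholds in the paper quantify.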