Penalized model-based clustering with unconstrained covariance matrices.

Authors

Zhou Hui, Pan Wei, Shen Xiaotong

Affiliation

Division of Biostatistics, School of Public Health, University of Minnesota

Publication information

Electron J Stat. 2009 Jan 1;3:1473-1496. doi: 10.1214/09-EJS487.

DOI:10.1214/09-EJS487

PMID:20463857

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2867492/

Abstract

Clustering is one of the most useful tools for high-dimensional analysis, e.g., for microarray data. It becomes challenging in the presence of a large number of noise variables, which may mask underlying clustering structures. Therefore, noise removal through variable selection is necessary. One effective approach is regularization, which performs parameter estimation and variable selection simultaneously in model-based clustering. However, existing methods focus on regularizing the mean parameters that represent cluster centers and ignore dependencies among variables within clusters, which leads to incorrect orientations or shapes of the resulting clusters. In this article, we propose a regularized Gaussian mixture model that permits general covariance matrices and thus takes various dependencies into account. At the same time, this approach shrinks the means and covariance matrices, achieving better clustering and variable selection. To overcome one technical challenge in estimating possibly large covariance matrices, we derive an EM algorithm that uses the graphical lasso (Friedman et al., 2007) for parameter estimation. Numerical examples, including applications to microarray gene expression data, demonstrate the utility of the proposed method.
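
The abstract does not state the objective function. As a hedged sketch, one way to write a penalized Gaussian mixture log-likelihood of the kind described is shown below. All symbols are assumptions introduced here for illustration: K clusters, mixing proportions \pi_k, means \mu_k, covariances \Sigma_k with inverses W_k, penalty parameters \lambda_1 and \lambda_2, and posterior cluster probabilities \tau_{ik} from the E-step. The second display shows why the graphical lasso fits into the M-step: with L1 penalties on the entries of W_k, the update for each W_k takes the form of a graphical lasso problem on a weighted sample covariance.

```latex
% Penalized log-likelihood (assumed form): L1 penalties on means and on precision-matrix entries
\log L_P(\Theta) = \sum_{i=1}^{n} \log\Big[ \sum_{k=1}^{K} \pi_k \, \phi(x_i; \mu_k, \Sigma_k) \Big]
  - \lambda_1 \sum_{k=1}^{K} \sum_{j=1}^{p} |\mu_{kj}|
  - \lambda_2 \sum_{k=1}^{K} \sum_{j,l} |W_{k;jl}|,
  \qquad W_k = \Sigma_k^{-1}.

% M-step for W_k, given E-step weights \tau_{ik}:
%   n_k = \sum_i \tau_{ik}, \quad
%   S_k = \frac{1}{n_k} \sum_i \tau_{ik} (x_i - \mu_k)(x_i - \mu_k)^{\top}
% The W_k-dependent part of the expected complete-data objective is
%   \frac{n_k}{2}\big[\log\det W_k - \operatorname{tr}(S_k W_k)\big] - \lambda_2 \sum_{j,l} |W_{k;jl}|,
% so dividing by n_k / 2 gives a graphical lasso problem:
\hat{W}_k = \arg\max_{W \succ 0} \Big\{ \log\det W - \operatorname{tr}(S_k W)
  - \frac{2\lambda_2}{n_k} \sum_{j,l} |W_{jl}| \Big\}.
```

Under this assumed form, a mean component \mu_{kj} that is shrunk to zero in every cluster marks variable j as uninformative about cluster centers, which is how shrinkage can double as variable selection. Whether the diagonal of W_k is penalized is a further choice that the abstract does not settle.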


Similar articles

Penalized model-based clustering with cluster-specific diagonal covariance matrices and grouped variables.

Electron J Stat. 2008;2:168-212. doi: 10.1214/08-EJS194.

Penalized mixtures of factor analyzers with application to clustering high-dimensional microarray data.

Bioinformatics. 2010 Feb 15;26(4):501-8. doi: 10.1093/bioinformatics/btp707. Epub 2009 Dec 23.

Regularized Gaussian Mixture Model for High-Dimensional Clustering.

IEEE Trans Cybern. 2019 Oct;49(10):3677-3688. doi: 10.1109/TCYB.2018.2846404. Epub 2018 Jun 27.

Regularized parameter estimation in high-dimensional gaussian mixture models.

Neural Comput. 2011 Jun;23(6):1605-22. doi: 10.1162/NECO_a_00128. Epub 2011 Mar 11.

A Penalized Matrix Normal Mixture Model for Clustering Matrix Data.

Entropy (Basel). 2021 Sep 26;23(10):1249. doi: 10.3390/e23101249.

Regularized estimation of large-scale gene association networks using graphical Gaussian models.

BMC Bioinformatics. 2009 Nov 24;10:384. doi: 10.1186/1471-2105-10-384.

Estimation of multiple networks in Gaussian mixture models.

Electron J Stat. 2016;10:1133-1154. doi: 10.1214/16-EJS1135. Epub 2016 May 2.

Simultaneous Multiple Response Regression and Inverse Covariance Matrix Estimation via Penalized Gaussian Maximum Likelihood.

J Multivar Anal. 2012 Oct 1;111:241-255. doi: 10.1016/j.jmva.2012.03.013. Epub 2012 Apr 27.

Comparing large covariance matrices under weak conditions on the dependence structure and its application to gene clustering.

Biometrics. 2017 Mar;73(1):31-41. doi: 10.1111/biom.12552. Epub 2016 Jul 5.

Cited by

OUTCOME-GUIDED DISEASE SUBTYPING BY GENERATIVE MODEL AND WEIGHTED JOINT LIKELIHOOD IN TRANSCRIPTOMIC APPLICATIONS.

Ann Appl Stat. 2024 Sep;18(3):1947-1964. doi: 10.1214/23-aoas1865. Epub 2024 Aug 5.

Sparse kernel k-means clustering.

J Appl Stat. 2024 Jun 5;52(1):158-182. doi: 10.1080/02664763.2024.2362266. eCollection 2025.

Simultaneous clustering and estimation of networks in multiple graphical models.

Biostatistics. 2024 Dec 31;26(1). doi: 10.1093/biostatistics/kxae015.

Avoiding inferior clusterings with misspecified Gaussian mixture models.

Sci Rep. 2023 Nov 6;13(1):19164. doi: 10.1038/s41598-023-44608-3.

Simultaneous cluster structure learning and estimation of heterogeneous graphs for matrix-variate fMRI data.

Biometrics. 2023 Sep;79(3):2246-2259. doi: 10.1111/biom.13753. Epub 2022 Sep 13.

Integrative clustering methods for multi-omics data.

Wiley Interdiscip Rev Comput Stat. 2022 May-Jun;14(3). doi: 10.1002/wics.1553. Epub 2021 Feb 7.

A Penalized Matrix Normal Mixture Model for Clustering Matrix Data.

Entropy (Basel). 2021 Sep 26;23(10):1249. doi: 10.3390/e23101249.

A sparse negative binomial mixture model for clustering RNA-seq count data.

Biostatistics. 2022 Dec 12;24(1):68-84. doi: 10.1093/biostatistics/kxab025.

Penalized model-based clustering of fMRI data.

Biostatistics. 2022 Jul 18;23(3):825-843. doi: 10.1093/biostatistics/kxaa061.

Graph-based sparse linear discriminant analysis for high-dimensional classification.

J Multivar Anal. 2019 May;171:250-269. doi: 10.1016/j.jmva.2018.12.007. Epub 2018 Dec 17.

References

NETWORK EXPLORATION VIA THE ADAPTIVE LASSO AND SCAD PENALTIES.

Ann Appl Stat. 2009 Jun 1;3(2):521-541. doi: 10.1214/08-AOAS215SUPP.

Penalized mixtures of factor analyzers with application to clustering high-dimensional microarray data.

Bioinformatics. 2010 Feb 15;26(4):501-8. doi: 10.1093/bioinformatics/btp707. Epub 2009 Dec 23.

Penalized model-based clustering with cluster-specific diagonal covariance matrices and grouped variables.

Electron J Stat. 2008;2:168-212. doi: 10.1214/08-EJS194.

Pairwise variable selection for high-dimensional model-based clustering.

Biometrics. 2010 Sep;66(3):793-804. doi: 10.1111/j.1541-0420.2009.01341.x.

Variable selection in penalized model-based clustering via regularization on grouped parameters.

Biometrics. 2008 Sep;64(3):921-930. doi: 10.1111/j.1541-0420.2007.00955.x. Epub 2007 Dec 20.

Sparse inverse covariance estimation with the graphical lasso.

Biostatistics. 2008 Jul;9(3):432-41. doi: 10.1093/biostatistics/kxm045. Epub 2007 Dec 12.

Variable selection for model-based high-dimensional clustering and its application to microarray data.

Biometrics. 2008 Jun;64(2):440-8. doi: 10.1111/j.1541-0420.2007.00922.x. Epub 2007 Oct 26.

A practical question based on cross-platform microarray data normalization: are BOEC more like large vessel or microvascular endothelial cells or neither of them?

J Bioinform Comput Biol. 2007 Aug;5(4):875-93. doi: 10.1142/s0219720007002989.

Penalized and weighted K-means for clustering with scattered objects and prior information in high-throughput biological data.

Bioinformatics. 2007 Sep 1;23(17):2247-55. doi: 10.1093/bioinformatics/btm320. Epub 2007 Jun 27.

Logistic regression for disease classification using microarray data: model selection in a large p and small n case.

Bioinformatics. 2007 Aug 1;23(15):1945-51. doi: 10.1093/bioinformatics/btm287. Epub 2007 May 31.
