Zhou Hui, Pan Wei, Shen Xiaotong
Division of Biostatistics, School of Public Health, University of Minnesota
Electron J Stat. 2009 Jan 1;3:1473-1496. doi: 10.1214/09-EJS487.
Clustering is one of the most useful tools for high-dimensional analysis, e.g., for microarray data. It becomes challenging in the presence of a large number of noise variables, which may mask underlying clustering structures. Noise removal through variable selection is therefore necessary. One effective approach is regularization, which performs parameter estimation and variable selection simultaneously in model-based clustering. However, existing methods focus on regularizing the mean parameters that represent cluster centers and ignore dependencies among variables within clusters, which leads to incorrect orientations or shapes of the resulting clusters. In this article, we propose a regularized Gaussian mixture model that permits general covariance matrices and thus takes various dependencies into account. At the same time, the approach shrinks both the means and the covariance matrices, achieving better clustering and variable selection. To overcome a technical challenge in estimating possibly large covariance matrices, we derive an EM algorithm that uses the graphical lasso (Friedman et al., 2007) for parameter estimation. Numerical examples, including applications to microarray gene expression data, demonstrate the utility of the proposed method.
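The general idea of an EM algorithm that shrinks cluster means and estimates sparse cluster-specific covariance structure can be illustrated with a minimal sketch. This is not the authors' algorithm: the function name `penalized_gmm_em`, the parameters `lam_mean` and `lam_prec`, the coordinate-wise soft-thresholding of the means (which ignores off-diagonal coupling), and the direct use of `lam_prec` as the graphical lasso penalty are all simplifying assumptions made for illustration. The data are assumed to be centered, so that a variable whose cluster means are all shrunk to zero can be read as uninformative.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.covariance import graphical_lasso


def soft_threshold(x, t):
    """Shrink x toward zero by t, setting small entries exactly to zero."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)


def penalized_gmm_em(X, n_clusters, lam_mean, lam_prec, n_iter=30, seed=0):
    """Illustrative penalized EM for a Gaussian mixture (hypothetical helper).

    Assumptions (not taken from the article): X is centered; means are
    updated by a simplified soft-thresholding rule; each covariance is
    re-estimated by the graphical lasso with penalty lam_prec.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Initialize means at random data points and share one covariance.
    means = X[rng.choice(n, n_clusters, replace=False)].copy()
    base_cov = np.cov(X, rowvar=False) + 1e-3 * np.eye(p)
    covs = np.array([base_cov.copy() for _ in range(n_clusters)])
    weights = np.full(n_clusters, 1.0 / n_clusters)

    for _ in range(n_iter):
        # E-step: posterior cluster probabilities, computed in log space.
        log_r = np.column_stack([
            np.log(weights[k])
            + multivariate_normal.logpdf(X, mean=means[k], cov=covs[k])
            for k in range(n_clusters)
        ])
        log_r -= log_r.max(axis=1, keepdims=True)
        resp = np.exp(log_r)
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: mixing proportions, shrunken means, sparse precisions.
        nk = resp.sum(axis=0) + 1e-10
        weights = nk / nk.sum()
        for k in range(n_clusters):
            weighted_mean = resp[:, k] @ X / nk[k]
            # Simplified rule: threshold scaled by the current variances.
            threshold = lam_mean * np.diag(covs[k]) / nk[k]
            means[k] = soft_threshold(weighted_mean, threshold)
            diff = X - means[k]
            S = (resp[:, k, None] * diff).T @ diff / nk[k]
            # The graphical lasso returns (covariance, precision).
            covs[k], _ = graphical_lasso(S, alpha=lam_prec)

    return weights, means, covs, resp
```

In this sketch, larger values of `lam_mean` push more cluster means to exactly zero, and larger values of `lam_prec` produce sparser precision matrices; in practice both would need to be tuned, for example by a model selection criterion.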