Tan Kean Ming, Witten Daniela, Shojaie Ali
Department of Biostatistics, University of Washington, Seattle, WA 98195-7232, USA.
Comput Stat Data Anal. 2015 May;85:23-36. doi: 10.1016/j.csda.2014.11.015.
The task of estimating a Gaussian graphical model in the high-dimensional setting is considered. The graphical lasso, which involves maximizing the Gaussian log likelihood subject to a lasso penalty, is a well-studied approach for this task. A surprising connection between the graphical lasso and hierarchical clustering is introduced: the graphical lasso in effect performs a two-step procedure, in which (1) single linkage hierarchical clustering is performed on the variables in order to identify connected components, and then (2) a penalized log likelihood is maximized on the subset of variables within each connected component. Thus, the graphical lasso determines the connected components of the estimated network via single linkage clustering. Single linkage clustering is known to perform poorly in certain finite-sample settings. Therefore, the cluster graphical lasso is proposed, which involves clustering the features using an alternative to single linkage clustering and then performing the graphical lasso on the subset of variables within each cluster. Model selection consistency for this technique is established, and its improved performance relative to the graphical lasso is demonstrated in a simulation study, as well as in applications to a university webpage data set and a gene expression data set.
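Step (1) of the two-step view can be illustrated with a minimal sketch. It assumes, following the standard screening result for the graphical lasso, that two variables fall in the same connected component exactly when they are linked by a chain of off-diagonal entries of the empirical covariance matrix S whose absolute values exceed the tuning parameter. The function names `threshold_components` and `single_linkage_cut`, the toy matrix `S`, and the value `lam` are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from itertools import combinations


def threshold_components(S, lam):
    """Connected components of the graph with an edge i-j whenever |S[i][j]| > lam."""
    p = len(S)
    parent = list(range(p))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(p), 2):
        if abs(S[i][j]) > lam:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(p):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def single_linkage_cut(S, lam):
    """Naive single linkage on similarity |S[i][j]|, stopped once no pair of clusters exceeds lam."""
    clusters = [[i] for i in range(len(S))]

    def similarity(a, b):
        # Single linkage: similarity of two clusters is that of their closest pair.
        return max(abs(S[i][j]) for i in a for j in b)

    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2), key=lambda pair: similarity(*pair))
        if similarity(a, b) <= lam:
            break
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(sorted(a + b))
    return sorted(clusters)


# Toy empirical covariance matrix (illustrative values only).
S = [
    [1.0, 0.6, 0.1, 0.0],
    [0.6, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.5],
    [0.0, 0.1, 0.5, 1.0],
]
lam = 0.3

blocks = threshold_components(S, lam)
print(blocks)                       # [[0, 1], [2, 3]]
print(single_linkage_cut(S, lam))   # [[0, 1], [2, 3]]
```

Both routines return the same partition, which is the sense in which the graphical lasso finds its connected components by single linkage clustering. Step (2) would then fit a graphical lasso separately on the submatrix of S for each block, for instance with `sklearn.covariance.graphical_lasso`. A cluster graphical lasso in the spirit of the abstract would swap `single_linkage_cut` for a different clustering rule, such as average linkage, before that per-block fit.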