Sun Hokeun, Li Hongzhe
Department of Biostatistics and Epidemiology, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA 19104, USA.
Biometrics. 2012 Dec;68(4):1197-206. doi: 10.1111/j.1541-0420.2012.01785.x. Epub 2012 Sep 28.
Gaussian graphical models are widely used to study the conditional independence structure among genes and to construct genetic networks. However, gene expression data typically have heavier tails or more outlying observations than the standard Gaussian distribution allows, and such outliers can lead to incorrect inference about the dependency structure among the genes. We propose an ℓ1-penalized estimation procedure for sparse Gaussian graphical models that is robust to possible outliers. The likelihood function is weighted according to how far each observation deviates, where the deviation of an observation is measured by its own likelihood. An efficient computational algorithm based on the coordinate gradient descent method is developed to obtain the minimizer of the negative penalized robustified likelihood, where the nonzero elements of the concentration matrix represent the graphical links among the genes. After the graphical structure is obtained, we re-estimate the positive definite concentration matrix using an iterative proportional fitting algorithm. Through simulations, we demonstrate that, when outliers are present, the proposed robust method performs much better than the graphical Lasso for Gaussian graphical models in terms of both graph structure selection and estimation. We apply the robust estimation procedure to an analysis of yeast gene expression data and show that the resulting graph has a better biological interpretation than that obtained from the graphical Lasso.
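The weighting idea in the abstract, where each observation is weighted by its own likelihood, can be illustrated with a minimal sketch. This is not the authors' exact objective or algorithm: it assumes zero-mean data, a given concentration matrix `omega`, and weights proportional to the Gaussian density raised to a tuning power `beta`. The names `likelihood_weights`, `weighted_covariance`, and `beta` are hypothetical and chosen only for illustration.

```python
import math

def likelihood_weights(X, omega, beta=0.1):
    """Weight each row of X by exp(beta * Gaussian log-likelihood) under a
    zero-mean model with concentration matrix omega (assumed form, not the
    paper's exact weighting). Weights are normalised to sum to 1, so terms
    shared by all observations (log-determinant, constants) cancel."""
    p = len(omega)
    # Quadratic form x' * omega * x for every observation.
    quad = [
        sum(x[j] * omega[j][k] * x[k] for j in range(p) for k in range(p))
        for x in X
    ]
    q_min = min(quad)  # shift the exponent for numerical stability
    raw = [math.exp(-0.5 * beta * (q - q_min)) for q in quad]
    total = sum(raw)
    return [r / total for r in raw]

def weighted_covariance(X, w):
    """Weighted second-moment matrix sum_i w_i * x_i x_i' (zero mean assumed)."""
    p = len(X[0])
    return [
        [sum(wi * x[j] * x[k] for wi, x in zip(w, X)) for k in range(p)]
        for j in range(p)
    ]

# Toy data: three typical observations and one gross outlier.
X = [[0.1, -0.2], [0.3, 0.1], [-0.2, 0.2], [8.0, -9.0]]
omega = [[1.0, 0.0], [0.0, 1.0]]  # working concentration matrix (identity)

w = likelihood_weights(X, omega, beta=0.5)
S_w = weighted_covariance(X, w)
```

In this toy example the outlying fourth observation receives a negligible weight, so the weighted second-moment matrix is driven by the three typical observations. A full procedure would alternate between updating the weights and updating a sparse estimate of the concentration matrix, which this sketch does not attempt.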