Witten Daniela M, Tibshirani Robert, Hastie Trevor
Department of Statistics, Stanford University, Stanford, CA 94305, USA.
Biostatistics. 2009 Jul;10(3):515-34. doi: 10.1093/biostatistics/kxp008. Epub 2009 Apr 17.
We present a penalized matrix decomposition (PMD), a new framework for computing a rank-$K$ approximation of a matrix. We approximate the matrix $X$ as $\hat{X} = \sum_{k=1}^{K} d_k u_k v_k^T$, where $d_k$, $u_k$, and $v_k$ minimize the squared Frobenius norm of $X - \hat{X}$, subject to penalties on $u_k$ and $v_k$. This results in a regularized version of the singular value decomposition. Of particular interest is the use of $L_1$ penalties on $u_k$ and $v_k$, which yields a decomposition of $X$ using sparse vectors. We show that when the PMD is applied using an $L_1$ penalty on $v_k$ but not on $u_k$, a method for sparse principal components results. In fact, this yields an efficient algorithm for the "SCoTLASS" proposal (Jolliffe and others 2003) for obtaining sparse principal components. This method is demonstrated on a publicly available gene expression data set. We also establish connections between the SCoTLASS method for sparse principal component analysis and the method of Zou and others (2006). In addition, we show that when the PMD is applied to a cross-products matrix, it results in a method for penalized canonical correlation analysis (CCA). We apply this penalized CCA method to simulated data and to a genomic data set consisting of gene expression and DNA copy number measurements on the same set of samples.
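The abstract does not spell out an algorithm, so the following is only an illustrative sketch of how a decomposition of this kind could be computed with $L_1$ penalties on both vectors. It assumes a constrained form of the problem (unit $L_2$ norm plus $L_1$ bounds `c1` and `c2` on each vector), alternating soft-thresholding updates, a bisection search for the threshold, and deflation to obtain successive factors. The function names, the bounds, and the iteration settings are hypothetical choices made for this example and are not taken from the text above.

```python
import numpy as np


def soft_threshold(a, delta):
    """Elementwise soft-thresholding: sign(a) * max(|a| - delta, 0)."""
    return np.sign(a) * np.maximum(np.abs(a) - delta, 0.0)


def l1_constrained_unit_vector(a, c, n_steps=60):
    """Return soft_threshold(a, delta) scaled to unit L2 norm, with
    delta >= 0 found by bisection so that the L1 norm is at most c.
    Assumes c >= 1 (a unit vector always has L1 norm >= 1)."""
    norm = np.linalg.norm(a)
    if norm == 0.0:
        return a
    unit = a / norm
    if np.abs(unit).sum() <= c:
        return unit  # already feasible with delta = 0
    lo, hi = 0.0, np.abs(a).max()
    best = None
    for _ in range(n_steps):
        mid = 0.5 * (lo + hi)
        s = soft_threshold(a, mid)
        s_norm = np.linalg.norm(s)
        if s_norm == 0.0:
            hi = mid  # threshold removed everything; shrink it
        elif np.abs(s).sum() / s_norm <= c:
            best, hi = s / s_norm, mid  # feasible; try a smaller threshold
        else:
            lo = mid  # not sparse enough; raise the threshold
    if best is None:
        # No feasible threshold found (e.g. ties at the maximum entry):
        # fall back to the strongest thresholding tried that keeps a
        # nonzero vector. This may slightly violate the L1 bound.
        s = soft_threshold(a, lo)
        best = s / np.linalg.norm(s)
    return best


def pmd_rank1(X, c1, c2, n_iter=100, tol=1e-8):
    """One sparse factor (d, u, v) of X via alternating updates."""
    # Assumption: start from the leading right singular vector of X.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    v = vt[0]
    for _ in range(n_iter):
        u = l1_constrained_unit_vector(X @ v, c1)
        v_new = l1_constrained_unit_vector(X.T @ u, c2)
        converged = np.linalg.norm(v_new - v) < tol
        v = v_new
        if converged:
            break
    d = float(u @ X @ v)
    return d, u, v


def pmd(X, K, c1, c2):
    """Rank-K approximation: extract K factors, deflating after each."""
    R = np.array(X, dtype=float)
    factors = []
    for _ in range(K):
        d, u, v = pmd_rank1(R, c1, c2)
        factors.append((d, u, v))
        R = R - d * np.outer(u, v)  # remove the fitted factor
    return factors
```

In this sketch, setting `c1 = sqrt(n)` and `c2 = sqrt(p)` for an `n x p` matrix makes the $L_1$ bounds inactive, so the first factor reduces to the leading singular triplet of the ordinary SVD; smaller bounds give sparser `u` and `v`. Leaving `c1` at `sqrt(n)` while shrinking `c2` corresponds to penalizing only $v_k$, the setting the abstract associates with sparse principal components.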