Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden.
BMC Bioinformatics. 2010 Jun 2;11:296. doi: 10.1186/1471-2105-11-296.
Principal component analysis (PCA) has gained popularity as a method for the analysis of high-dimensional genomic data. However, the results are often difficult to interpret because the principal components are linear combinations of all variables, and the coefficients (loadings) are typically nonzero. These nonzero values also reflect poor estimation of the true loading vectors; for example, for gene expression data, biologically we expect only a portion of the genes to be expressed in any tissue, and an even smaller fraction to be involved in a particular process. Sparse PCA methods have recently been introduced to reduce the number of nonzero coefficients, but these existing methods are not satisfactory for high-dimensional data applications because they still give too many nonzero coefficients.
Here we propose a new PCA method that uses two innovations to produce an extremely sparse loading vector: (i) a random-effect model on the loadings that leads to an unbounded penalty at the origin and (ii) shrinkage of the singular values obtained from the singular value decomposition of the data matrix. We develop a stable computing algorithm by modifying the nonlinear iterative partial least squares (NIPALS) algorithm, and illustrate the method with an analysis of the NCI cancer dataset that contains 21,225 genes.
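To make the NIPALS idea concrete, the following is a minimal sketch of the alternating rank-one updates that such an algorithm builds on. It is not the method proposed here: the random-effect penalty and the singular-value shrinkage are not specified in this abstract, so the sketch substitutes a plain soft-threshold (an ordinary L1-type penalty) to show where a sparsity step would enter. The names `soft_threshold`, `sparse_rank_one`, and the tuning parameter `lam` are hypothetical, and the code uses only the Python standard library so that it stays self-contained.

```python
import math


def soft_threshold(x, lam):
    """Shrink x toward zero by lam; any |x| <= lam becomes zero."""
    return math.copysign(max(abs(x) - lam, 0.0), x)


def sparse_rank_one(X, lam=0.0, n_iter=200, tol=1e-10):
    """NIPALS-style alternating updates for one (optionally sparse) component.

    X   : list of rows (n samples x p variables); columns assumed centered.
    lam : soft-threshold level (hypothetical tuning parameter, NOT the
          random-effect penalty of the paper). lam = 0 gives ordinary
          NIPALS / power iteration for the leading component.
    Returns (u, d, v): unit score vector, scale, unit loading vector.
    """
    n, p = len(X), len(X[0])

    # Start from the column with the largest sum of squares.
    col_ss = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    j0 = max(range(p), key=col_ss.__getitem__)
    norm0 = math.sqrt(col_ss[j0])
    if norm0 == 0.0:
        raise ValueError("X contains only zeros")
    u = [X[i][j0] / norm0 for i in range(n)]

    v = [0.0] * p
    for _ in range(n_iter):
        # Loading update: regress columns of X on u, then threshold.
        v = [soft_threshold(sum(X[i][j] * u[i] for i in range(n)), lam)
             for j in range(p)]
        # Score update: project rows of X on v and renormalize.
        Xv = [sum(X[i][j] * v[j] for j in range(p)) for i in range(n)]
        norm = math.sqrt(sum(t * t for t in Xv))
        if norm == 0.0:  # the threshold removed every loading
            return u, 0.0, v
        u_new = [t / norm for t in Xv]
        change = max(abs(a - b) for a, b in zip(u, u_new))
        u = u_new
        if change < tol:
            break

    d = math.sqrt(sum(t * t for t in v))
    return u, d, [t / d for t in v]


# Toy rank-one data: one strong variable, one weak variable, one null variable.
a = [1.0, -1.0, 2.0, -2.0]
b = [3.0, 0.1, 0.0]
X = [[ai * bj for bj in b] for ai in a]

_, d_dense, v_dense = sparse_rank_one(X)            # ordinary NIPALS
_, d_sparse, v_sparse = sparse_rank_one(X, lam=0.5)  # thresholded loadings
```

With `lam=0` the weak second variable keeps a small nonzero loading, which is the behavior of ordinary PCA that the abstract criticizes; with a positive `lam` that loading is set to zero. A penalty that is unbounded at the origin, as described above, would be expected to zero out small loadings more aggressively than this soft-threshold stand-in.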
The new method has better performance than several existing methods, particularly in the estimation of the loading vectors.