Wang Ya-Xuan, Gao Ying-Lian, Liu Jin-Xing, Kong Xiang-Zhen, Li Hai-Jun
IEEE Trans Nanobioscience. 2017 Sep;16(6):447-454. doi: 10.1109/TNB.2017.2723439. Epub 2017 Jul 4.
Identifying differentially expressed genes among thousands of candidate genes is a challenging task. Robust principal component analysis (RPCA) is an efficient method for this purpose. RPCA uses the nuclear norm to approximate the rank function. However, theoretical studies have shown that minimizing the nuclear norm shrinks all singular values simultaneously, so it may not be the best approximation of the rank function. The truncated nuclear norm is defined as the sum of the smaller singular values only, and may therefore approximate the rank function better than the nuclear norm. In this paper, a novel method is proposed that replaces the nuclear norm in RPCA with the truncated nuclear norm; it is named robust principal component analysis regularized by truncated nuclear norm (TRPCA). The method decomposes the observation matrix of genomic data into a low-rank matrix and a sparse matrix. Because significant genes can be considered sparse signals, the differentially expressed genes are treated as sparse perturbation signals and can thus be identified from the sparse matrix. Experimental results on data from The Cancer Genome Atlas show that TRPCA outperforms other state-of-the-art methods in identifying differentially expressed genes.
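To make the two norms in the abstract concrete, the following minimal Python sketch computes the nuclear norm and a truncated nuclear norm from the singular values of a matrix, and scores genes from a sparse matrix. It is an illustration under stated assumptions, not the authors' implementation: the truncation parameter `r`, the genes-as-rows layout, the function names, and the scoring rule (sum of absolute entries per row) are all assumptions, and the TRPCA optimization itself is not shown.

```python
import numpy as np


def nuclear_norm(X):
    """Sum of all singular values of X."""
    return float(np.linalg.svd(X, compute_uv=False).sum())


def truncated_nuclear_norm(X, r):
    """Sum of the singular values of X beyond the r largest ones.

    The parameter r is an assumption for illustration; the abstract only
    says that "some smaller singular values" are summed.
    """
    s = np.linalg.svd(X, compute_uv=False)  # returned in descending order
    return float(s[r:].sum())


def rank_genes(S, top_k):
    """Return indices of the top_k rows of S by total absolute magnitude.

    Assumes rows are genes and columns are samples; this scoring rule is a
    hypothetical choice, not taken from the paper.
    """
    scores = np.abs(S).sum(axis=1)
    return np.argsort(scores)[::-1][:top_k]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy low-rank matrix of rank 2 (50 genes x 8 samples).
    low_rank = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 8))
    print("nuclear norm:", nuclear_norm(low_rank))
    print("truncated nuclear norm (r=2):", truncated_nuclear_norm(low_rank, 2))
```

For a matrix whose rank is at most `r`, the truncated nuclear norm is zero up to floating-point error while the nuclear norm is not, which is the sense in which the truncated version tracks the rank function more closely.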