Wu Ming-Juan, Gao Ying-Lian, Liu Jin-Xing, Zhu Rong, Wang Juan
School of Information Science and Engineering, Qufu Normal University, Rizhao, China.
Library of Qufu Normal University, Qufu Normal University, Rizhao, China.
Hum Hered. 2019;84(1):47-58. doi: 10.1159/000501653. Epub 2019 Aug 29.
Principal component analysis (PCA) is a widely used method for obtaining low-dimensional representations of data. Several variants of PCA have been proposed to improve the interpretation of the principal components (PCs). One of the most common is sparse PCA, which aims to find a sparse basis that is easier to interpret than the dense basis of PCA. However, the performance of these improved methods is still far from satisfactory because the data still contain redundant PCs. In this paper, a novel method called PCA based on graph Laplacian and double sparse constraints (GDSPCA) is proposed to improve the interpretation of the PCs and to account for the internal geometry of the data. In detail, GDSPCA uses L2,1-norm and L1-norm regularization terms simultaneously to force the matrix to be sparse by filtering out redundant and irrelevant PCs: the L2,1-norm regularization term produces row sparsity, while the L1-norm regularization term enforces element sparsity. In this way, the new PCs can be interpreted more easily in the low-dimensional subspace. Meanwhile, GDSPCA integrates the graph Laplacian into PCA to explore the geometric structure hidden in the data. A simple and effective optimization solution is provided. Extensive experiments on multi-view biological data demonstrate the feasibility and effectiveness of the proposed approach.
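As a rough illustration of the three ingredients named in the abstract, the following sketch computes an L1 norm, an L2,1 norm, and a graph Laplacian term with NumPy. It is not the GDSPCA algorithm itself: the abstract does not state the objective function, so the matrix names `Q` and `W`, the toy values, and the use of the unnormalized Laplacian with the trace form tr(QᵀLQ) are assumptions chosen only for illustration.

```python
import numpy as np


def l1_norm(Q):
    """Element-wise L1 norm: sum of absolute values of all entries."""
    return float(np.abs(Q).sum())


def l21_norm(Q):
    """L2,1 norm: sum of the Euclidean norms of the rows of Q."""
    return float(np.linalg.norm(Q, axis=1).sum())


def graph_laplacian(W):
    """Unnormalized Laplacian L = D - W for a symmetric affinity matrix W."""
    D = np.diag(W.sum(axis=1))
    return D - W


# Hypothetical 3 x 2 matrix (rows could be samples, columns components).
Q = np.array([[3.0, 4.0],
              [0.0, 0.0],   # an all-zero row: the pattern the L2,1 term favors
              [0.0, 2.0]])  # a single zero entry: the pattern the L1 term favors

# Hypothetical affinity graph over the three rows (a simple chain 0-1-2).
W = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])

L = graph_laplacian(W)

# Assumed graph regularizer: small when connected rows of Q are close.
smoothness = float(np.trace(Q.T @ L @ Q))

print("L1 norm:", l1_norm(Q))
print("L2,1 norm:", l21_norm(Q))
print("tr(Q^T L Q):", smoothness)
```

The L2,1 norm penalizes whole rows, so minimizing it tends to zero out entire rows, whereas the L1 norm penalizes individual entries. Under the assumption above, the trace term equals the sum of squared distances between rows joined by an edge, which is how a graph Laplacian can encode local geometric structure.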