Sparse Exponential Family Principal Component Analysis

Authors

Lu Meng, Huang Jianhua Z, Qian Xiaoning

Affiliations

Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX, US, 77840.

Department of Statistics, Texas A&M University, College Station, TX, US, 77840.

Publication information

Pattern Recognit. 2016 Dec;60:681-691. doi: 10.1016/j.patcog.2016.05.024. Epub 2016 May 21.

DOI:10.1016/j.patcog.2016.05.024

PMID:28066030

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5210214/

Abstract

We propose a Sparse exponential family Principal Component Analysis (SePCA) method suitable for any type of data following exponential family distributions, to achieve simultaneous dimension reduction and variable selection for better interpretation of the results. Because of the generality of exponential family distributions, the method can be applied to a wide range of problems, in particular the analysis of high-dimensional next-generation sequencing data and genetic mutation data in genomics. The use of a sparsity-inducing penalty helps produce sparse principal component loading vectors such that the principal components can focus on informative variables. By using an equivalent dual form of the formulated optimization problem for SePCA, we derive optimal solutions with efficient iterative closed-form updating rules. The results from both simulation experiments and real-world applications demonstrate the superiority of SePCA in reconstruction accuracy and computational efficiency over traditional exponential family PCA (ePCA) and the existing Sparse PCA (SPCA) and Sparse Logistic PCA (SLPCA) algorithms.
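The abstract does not say which sparsity-inducing penalty or which update rules SePCA uses, so the following is only a minimal sketch of the general idea, assuming an L1-type penalty. It shows soft-thresholding, a common closed-form operator associated with L1 penalties, applied to a hypothetical loading vector. The function name `soft_threshold`, the threshold `lam`, and the example values are illustrative assumptions, not the authors' algorithm.

```python
def soft_threshold(values, lam):
    """Shrink each entry toward zero by lam; entries with |v| <= lam become 0.

    Assumption: an L1-type penalty. The abstract does not specify the
    penalty actually used by SePCA.
    """
    out = []
    for v in values:
        magnitude = abs(v) - lam
        if magnitude <= 0:
            out.append(0.0)  # small loading is dropped entirely
        else:
            out.append(magnitude if v > 0 else -magnitude)  # keep sign, shrink size
    return out


# Hypothetical dense loading vector over five variables.
loadings = [0.9, -0.05, 0.02, -0.6, 0.1]
sparse_loadings = soft_threshold(loadings, lam=0.15)

# Indices of variables that keep a nonzero loading (the "selected" variables).
selected = [j for j, w in enumerate(sparse_loadings) if w != 0.0]
print(sparse_loadings)
print(selected)
```

In this toy case only the two variables with large loadings survive, which illustrates how a sparse loading vector lets a principal component focus on a few informative variables.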
