Li Ziyi, Safo Sandra E, Long Qi
Department of Biostatistics and Bioinformatics, Emory University, 1518 Clifton Road, Atlanta, 30322, GA, USA.
Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, 423 Guardian Drive, Philadelphia, 19104, PA, USA.
BMC Bioinformatics. 2017 Jul 11;18(1):332. doi: 10.1186/s12859-017-1740-7.
Sparse principal component analysis (PCA) is a popular tool for dimensionality reduction, pattern recognition, and visualization of high-dimensional data. It has been recognized that complex biological mechanisms occur through concerted relationships of multiple genes working in networks that are often represented by graphs. Recent work has shown that incorporating such biological information improves feature selection and prediction performance in regression analysis, but there has been limited work on extending this approach to PCA. In this article, we propose two new sparse PCA methods, Fused sparse PCA and Grouped sparse PCA, that enable incorporation of prior biological information in variable selection.
Our simulation studies suggest that, compared to existing sparse PCA methods, the proposed methods achieve higher sensitivity and specificity when the graph structure is correctly specified, and are fairly robust to misspecified graph structures. Application to a glioblastoma gene expression dataset identified pathways that the literature suggests are related to glioblastoma.
The proposed sparse PCA methods, Fused and Grouped sparse PCA, can effectively incorporate prior biological information in variable selection. This leads to improved feature selection and more interpretable principal component loadings, and may provide insights into the molecular underpinnings of complex diseases.
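The sensitivity and specificity reported for the simulation studies refer to how well a method recovers the truly nonzero entries of a loading vector. The sketch below shows one common way to compute them. It is an illustration only, not the authors' evaluation code: the function name `selection_metrics`, the tolerance `tol`, and the example vectors are assumptions introduced here.

```python
import numpy as np


def selection_metrics(true_loading, est_loading, tol=1e-8):
    """Sensitivity and specificity of variable selection for one loading vector.

    A variable counts as "selected" when its absolute loading exceeds `tol`
    (hypothetical threshold chosen for this sketch).
    """
    true_support = np.abs(np.asarray(true_loading, dtype=float)) > tol
    est_support = np.abs(np.asarray(est_loading, dtype=float)) > tol

    tp = np.sum(true_support & est_support)    # truly nonzero, selected
    fn = np.sum(true_support & ~est_support)   # truly nonzero, missed
    tn = np.sum(~true_support & ~est_support)  # truly zero, not selected
    fp = np.sum(~true_support & est_support)   # truly zero, selected

    sensitivity = tp / (tp + fn) if (tp + fn) > 0 else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) > 0 else float("nan")
    return float(sensitivity), float(specificity)


# Made-up example: 3 signal variables and 5 noise variables.
true_v = np.array([0.6, 0.6, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0])
est_v = np.array([0.7, 0.5, 0.0, 0.1, 0.0, 0.0, 0.0, 0.0])
sens, spec = selection_metrics(true_v, est_v)
print(f"sensitivity={sens:.3f}, specificity={spec:.3f}")
```

In this made-up example the estimate recovers two of the three signal variables and wrongly selects one of the five noise variables.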