Super-sparse principal component analyses for high-throughput genomic data.

Affiliation

Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden.

Publication information

BMC Bioinformatics. 2010 Jun 2;11:296. doi: 10.1186/1471-2105-11-296.

DOI:10.1186/1471-2105-11-296

PMID:20525176

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2902448/

Abstract

BACKGROUND

Principal component analysis (PCA) has gained popularity as a method for the analysis of high-dimensional genomic data. However, it is often difficult to interpret the results because the principal components are linear combinations of all variables, and the coefficients (loadings) are typically nonzero. These nonzero values also reflect poor estimation of the true vector loadings; for example, for gene expression data, biologically we expect only a portion of the genes to be expressed in any tissue, and an even smaller fraction to be involved in a particular process. Sparse PCA methods have recently been introduced for reducing the number of nonzero coefficients, but these existing methods are not satisfactory for high-dimensional data applications because they still give too many nonzero coefficients.
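The interpretability problem described above can be seen on simulated data. The following is a minimal sketch, not taken from the paper: the sample size, gene count, signal strength, and all variable names are assumptions chosen for illustration. Only 20 of 1,000 simulated genes carry signal, yet the first loading vector from ordinary PCA (computed via the SVD) has no exact zeros.

```python
import numpy as np

# Hypothetical simulation settings (not from the paper).
rng = np.random.default_rng(0)
n_samples, n_genes, n_active = 50, 1000, 20

# Assumed ground truth: only the first 20 genes have nonzero loadings.
true_loading = np.zeros(n_genes)
true_loading[:n_active] = 1.0 / np.sqrt(n_active)

# Rank-one signal plus unit-variance Gaussian noise, then column centering.
scores = rng.normal(scale=8.0, size=n_samples)
X = np.outer(scores, true_loading) + rng.normal(size=(n_samples, n_genes))
X -= X.mean(axis=0)

# Ordinary PCA: the rows of vt are the loading vectors.
_, _, vt = np.linalg.svd(X, full_matrices=False)
pc1_loading = vt[0]

n_nonzero = int(np.count_nonzero(pc1_loading))
print("truly active genes:", n_active)
print("nonzero PC1 loadings:", n_nonzero, "of", n_genes)
```

Every gene receives some weight in the first component, so the loading vector alone does not indicate which genes matter.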

RESULTS

Here we propose a new PCA method that uses two innovations to produce an extremely sparse loading vector: (i) a random-effect model on the loadings that leads to an unbounded penalty at the origin and (ii) shrinkage of the singular values obtained from the singular value decomposition of the data matrix. We develop a stable computing algorithm by modifying the nonlinear iterative partial least squares (NIPALS) algorithm, and illustrate the method with an analysis of the NCI cancer dataset, which contains 21,225 genes.
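The abstract does not give the algorithm's details, so the following is only a rough sketch of the general idea of a NIPALS-style alternating update with a sparsity step on the loadings. It deliberately swaps in plain soft-thresholding (an L1-type penalty, as used by earlier sparse PCA methods) in place of the paper's unbounded random-effect penalty, and it does not implement the paper's singular-value shrinkage. The function names, the threshold `lam`, and the simulated data are all hypothetical.

```python
import numpy as np


def soft_threshold(z, lam):
    """Shrink each entry toward zero by lam; entries with |z| <= lam become exactly 0."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)


def sparse_rank_one(X, lam, n_iter=200, tol=1e-10):
    """Alternate score and loading updates (NIPALS-style), thresholding the loadings.

    Illustrative stand-in only: soft-thresholding is NOT the paper's penalty.
    """
    # Start from the ordinary leading left singular vector (unit norm).
    u = np.linalg.svd(X, full_matrices=False)[0][:, 0]
    v = np.zeros(X.shape[1])
    for _ in range(n_iter):
        # Loading update: project the columns onto the scores, then threshold.
        v_new = soft_threshold(X.T @ u, lam)
        v_norm = np.linalg.norm(v_new)
        if v_norm == 0.0:
            raise ValueError("lam is too large: every loading was set to zero")
        v_new /= v_norm
        # Score update: project the rows onto the sparse loadings.
        u = X @ v_new
        u /= np.linalg.norm(u)
        converged = np.linalg.norm(v_new - v) < tol
        v = v_new
        if converged:
            break
    d = float(u @ X @ v)  # scale of the sparse component
    return u, v, d


# Hypothetical simulated data: 20 of 1,000 genes carry signal.
rng = np.random.default_rng(0)
n_samples, n_genes, n_active = 50, 1000, 20
true_loading = np.zeros(n_genes)
true_loading[:n_active] = 1.0 / np.sqrt(n_active)
scores = rng.normal(scale=8.0, size=n_samples)
X = np.outer(scores, true_loading) + rng.normal(size=(n_samples, n_genes))
X -= X.mean(axis=0)

u, v, d = sparse_rank_one(X, lam=5.0)
d_dense = np.linalg.svd(X, compute_uv=False)[0]
print("nonzero loadings:", np.count_nonzero(v), "of", n_genes)
print("sparse scale:", round(d, 2), "dense leading singular value:", round(d_dense, 2))
```

In this sketch the scale `d` can never exceed the dense leading singular value, because the leading singular value is the maximum of `u @ X @ v` over all unit vectors. The threshold `lam=5.0` was picked by hand for this simulation; a real analysis would need a principled way to choose it.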

CONCLUSIONS

The new method has better performance than several existing methods, particularly in the estimation of the loading vectors.

Similar articles

Incorporating biological information in sparse principal component analysis with application to genomic data.

BMC Bioinformatics. 2017 Jul 11;18(1):332. doi: 10.1186/s12859-017-1740-7.

A critical assessment of sparse PCA (research): why (one should acknowledge that) weights are not loadings.

Behav Res Methods. 2024 Mar;56(3):1413-1432. doi: 10.3758/s13428-023-02099-0. Epub 2023 Aug 1.

A Class-Information-Based Sparse Component Analysis Method to Identify Differentially Expressed Genes on RNA-Seq Data.

IEEE/ACM Trans Comput Biol Bioinform. 2016 Mar-Apr;13(2):392-8. doi: 10.1109/TCBB.2015.2440265.

Biclustering via sparse singular value decomposition.

Biometrics. 2010 Dec;66(4):1087-95. doi: 10.1111/j.1541-0420.2010.01392.x.

Comparing the performance of linear and nonlinear principal components in the context of high-dimensional genomic data integration.

Stat Appl Genet Mol Biol. 2017 Jul 26;16(3):199-216. doi: 10.1515/sagmb-2016-0066.

A Guide for Sparse PCA: Model Comparison and Applications.

Psychometrika. 2021 Dec;86(4):893-919. doi: 10.1007/s11336-021-09773-2. Epub 2021 Jun 29.

Principal component analysis based methods in bioinformatics studies.

Brief Bioinform. 2011 Nov;12(6):714-22. doi: 10.1093/bib/bbq090. Epub 2011 Jan 17.

Sparse Exponential Family Principal Component Analysis.

Pattern Recognit. 2016 Dec;60:681-691. doi: 10.1016/j.patcog.2016.05.024. Epub 2016 May 21.

A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis.

Biostatistics. 2009 Jul;10(3):515-34. doi: 10.1093/biostatistics/kxp008. Epub 2009 Apr 17.

Cited by

Cancer-associated fibroblast-secreted FGF7 as an ovarian cancer progression promoter.

J Transl Med. 2024 Mar 15;22(1):280. doi: 10.1186/s12967-024-05085-y.

PCA-Plus: Enhanced principal component analysis with illustrative applications to batch effects and their quantitation.

bioRxiv. 2024 Jan 3:2024.01.02.573793. doi: 10.1101/2024.01.02.573793.

HarmonizR enables data harmonization across independent proteomic datasets with appropriate handling of missing values.

Nat Commun. 2022 Jun 20;13(1):3523. doi: 10.1038/s41467-022-31007-x.

Gene Feature Extraction Based on Nonnegative Dual Graph Regularized Latent Low-Rank Representation.

Biomed Res Int. 2017;2017:1096028. doi: 10.1155/2017/1096028. Epub 2017 Mar 30.

The Spike-and-Slab Lasso Generalized Linear Models for Prediction and Associated Genes Detection.

Genetics. 2017 Jan;205(1):77-88. doi: 10.1534/genetics.116.192195. Epub 2016 Oct 31.

Principal component analysis: a review and recent developments.

Philos Trans A Math Phys Eng Sci. 2016 Apr 13;374(2065):20150202. doi: 10.1098/rsta.2015.0202.

Testing for associations between systolic blood pressure and single-nucleotide polymorphism profiles obtained from sparse principal component analysis.

BMC Proc. 2014 Jun 17;8(Suppl 1):S95. doi: 10.1186/1753-6561-8-S1-S95. eCollection 2014.

Variable selection in subdistribution hazard frailty models with competing risks data.

Stat Med. 2014 Nov 20;33(26):4590-604. doi: 10.1002/sim.6257. Epub 2014 Jul 10.

A better statistical method of predicting postsurgery soft tissue response in Class II patients.

Angle Orthod. 2014 Mar;84(2):322-8. doi: 10.2319/050313-338.1. Epub 2013 Aug 5.

Robust PCA based method for discovering differentially expressed genes.

BMC Bioinformatics. 2013;14 Suppl 8(Suppl 8):S3. doi: 10.1186/1471-2105-14-S8-S3. Epub 2013 May 9.

References

On Consistency and Sparsity for Principal Components Analysis in High Dimensions.

J Am Stat Assoc. 2009 Jun 1;104(486):682-693. doi: 10.1198/jasa.2009.0121.

A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis.

Biostatistics. 2009 Jul;10(3):515-34. doi: 10.1093/biostatistics/kxp008. Epub 2009 Apr 17.

Sparse canonical correlation analysis with application to genomic data integration.

Stat Appl Genet Mol Biol. 2009;8:Article 1. doi: 10.2202/1544-6115.1406. Epub 2009 Jan 6.

Discovering gene expression patterns in time course microarray experiments by ANOVA-SCA.

Bioinformatics. 2007 Jul 15;23(14):1792-800. doi: 10.1093/bioinformatics/btm251. Epub 2007 May 22.

Visualization and analysis of molecular data.

Methods Mol Biol. 2007;358:87-104. doi: 10.1007/978-1-59745-244-1_6.

PLS dimension reduction for classification with microarray data.

Stat Appl Genet Mol Biol. 2004;3:Article33. doi: 10.2202/1544-6115.1075. Epub 2004 Nov 23.

A web-based tool for principal component and significance analysis of microarray data.

Bioinformatics. 2005 May 15;21(10):2548-9. doi: 10.1093/bioinformatics/bti343. Epub 2005 Feb 25.

Vector algebra in the analysis of genome-wide expression data.

Genome Biol. 2002;3(3):RESEARCH0011. doi: 10.1186/gb-2002-3-3-research0011. Epub 2002 Feb 13.

Nonlinear dimensionality reduction by locally linear embedding.

Science. 2000 Dec 22;290(5500):2323-6. doi: 10.1126/science.290.5500.2323.

Singular value decomposition for genome-wide expression data processing and modeling.

Proc Natl Acad Sci U S A. 2000 Aug 29;97(18):10101-6. doi: 10.1073/pnas.97.18.10101.
