School of Mathematical Sciences, University College Dublin, Ireland.
BMC Bioinformatics. 2010 Nov 23;11:571. doi: 10.1186/1471-2105-11-571.
Data from metabolomic studies are typically complex and high-dimensional. Principal component analysis (PCA) is currently the most widely used statistical technique for analyzing metabolomic data. However, PCA is limited by the fact that it is not based on a statistical model.
Here, probabilistic principal component analysis (PPCA), which addresses some of the limitations of PCA, is reviewed and extended. A novel extension of PPCA, called probabilistic principal component and covariates analysis (PPCCA), is introduced; it provides a flexible approach to jointly modeling metabolomic data and additional covariate information. The use of a mixture of PPCA models for discovering the number of inherent groups in metabolomic data is demonstrated. The jackknife technique is employed throughout to construct confidence intervals for estimated model parameters. The optimal number of principal components is determined using the Bayesian Information Criterion (BIC), which is modified to address the high dimensionality of the data.
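As an illustration of how a PPCA model can be fitted and its dimension chosen, the following is a minimal Python sketch, not the MetabolAnalyze implementation. It assumes the standard closed-form maximum-likelihood solution for PPCA (loadings built from the leading eigenvectors of the sample covariance) and the standard, unmodified BIC; the high-dimensional modification described above is not reproduced. The function names `ppca_fit` and `select_q_by_bic` and the parameter count are illustrative assumptions.

```python
import numpy as np


def ppca_fit(X, q):
    """Closed-form maximum-likelihood PPCA fit with q components (requires q < p)."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / n                          # ML estimate of the covariance (p x p)
    evals, evecs = np.linalg.eigh(S)           # eigenvalues in ascending order
    evals = np.clip(evals[::-1], 0.0, None)    # reorder to descending, guard round-off
    evecs = evecs[:, ::-1]
    sigma2 = evals[q:].mean()                  # noise variance: mean of discarded eigenvalues
    W = evecs[:, :q] * np.sqrt(np.maximum(evals[:q] - sigma2, 0.0))  # loadings (p x q)
    C = W @ W.T + sigma2 * np.eye(p)           # model covariance
    _, logdet = np.linalg.slogdet(C)
    loglik = -0.5 * n * (p * np.log(2 * np.pi) + logdet
                         + np.trace(np.linalg.solve(C, S)))
    return W, sigma2, loglik


def select_q_by_bic(X, q_values):
    """Pick q by the standard BIC (smaller is better); assumption: no high-dimensional correction."""
    n, p = X.shape
    bics = {}
    for q in q_values:
        _, _, loglik = ppca_fit(X, q)
        # Assumed free-parameter count: loadings up to rotation, noise variance, mean.
        k = p * q - q * (q - 1) // 2 + 1 + p
        bics[q] = -2.0 * loglik + k * np.log(n)
    best_q = min(bics, key=bics.get)
    return best_q, bics
```

For a data matrix `X` with samples in rows and spectral bins in columns, `select_q_by_bic(X, range(1, 5))` returns the selected dimension together with the BIC value for each candidate. The sketch assumes more samples than variables; with the very high-dimensional spectra typical of metabolomics, the trailing eigenvalues are zero and a different treatment is needed.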
The methods presented are illustrated through an application to metabolomic data sets. Jointly modeling metabolomic data and covariates was successfully achieved and has the potential to provide deeper insight into the underlying data structure. Examination of confidence intervals for the model parameters, such as loadings, allows for principled and clear interpretation of the underlying data structure. A software package called MetabolAnalyze, freely available through the R statistical software, has been developed to facilitate implementation of the presented methods in the metabolomics field.
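To make the idea of confidence intervals for loadings concrete, the following is a minimal Python sketch of leave-one-out jackknife intervals, not the MetabolAnalyze implementation. It assumes closed-form maximum-likelihood PPCA loadings, sign alignment of each leave-one-out estimate to the full-data estimate (loadings are only identified up to sign), and a normal-approximation interval with multiplier `z = 1.96`. The function names `ppca_loadings` and `jackknife_loading_ci` are hypothetical.

```python
import numpy as np


def ppca_loadings(X, q):
    """Closed-form ML PPCA loadings (p x q); requires q < p."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    evals, evecs = np.linalg.eigh(Xc.T @ Xc / n)    # ascending order
    evals, evecs = evals[::-1], evecs[:, ::-1]      # reorder to descending
    sigma2 = evals[q:].mean()                       # ML noise variance
    return evecs[:, :q] * np.sqrt(np.maximum(evals[:q] - sigma2, 0.0))


def jackknife_loading_ci(X, q, z=1.96):
    """Leave-one-out jackknife standard errors and approximate CIs for the loadings."""
    n = X.shape[0]
    W_full = ppca_loadings(X, q)
    reps = np.empty((n,) + W_full.shape)
    for i in range(n):
        W_i = ppca_loadings(np.delete(X, i, axis=0), q)
        # Flip each column so it points the same way as the full-data column.
        signs = np.sign(np.sum(W_i * W_full, axis=0))
        signs[signs == 0] = 1.0
        reps[i] = W_i * signs
    # Jackknife standard error: sqrt((n - 1) / n * sum of squared deviations).
    se = np.sqrt((n - 1) / n * np.sum((reps - reps.mean(axis=0)) ** 2, axis=0))
    return W_full, W_full - z * se, W_full + z * se
```

A loading whose interval excludes zero can then be read as a variable contributing to that principal component. Sign alignment handles only reflections; if two components have nearly equal variance they may swap order between leave-one-out fits, which this sketch does not address.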