Human Metabolome Technologies, Inc, 246-2 Mizukami, Kakuganji, Tsuruoka, Yamagata 997-0052, Japan.
BMC Bioinformatics. 2014 Feb 21;15:51. doi: 10.1186/1471-2105-15-51.
Principal component analysis (PCA) is widely used to visualize high-dimensional metabolomic data in a two- or three-dimensional subspace. In metabolomics, metabolites are often selected subjectively from the PCA factor loadings (e.g., the top 10 metabolites), and biological inferences are then made for the selected metabolites. However, this approach may lead to biased biological inferences because the metabolites are not selected objectively with statistical criteria.
We propose a statistical procedure that selects metabolites by statistical hypothesis testing of the factor loading in PCA and makes biological inferences about the significant metabolites with a metabolite set enrichment analysis (MSEA). This procedure depends on the fact that the eigenvector in PCA for autoscaled data is proportional to the correlation coefficient between the PC score and each metabolite level. We applied this approach to two sets of metabolomic data from mouse liver samples: 136 of 282 metabolites in the first case study and 66 of 275 metabolites in the second case study were statistically significant. This result suggests that fixing the number of metabolites before the analysis is inappropriate, because the number of significant metabolites differs between studies when factor loading is used in PCA. Moreover, when an MSEA of these significant metabolites was performed, the significant metabolic pathways detected were consistent with previous biological knowledge.
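The relationship between the eigenvector and the correlation coefficient can be checked numerically. The following is a minimal sketch, not the authors' implementation: it uses simulated data, a plain power iteration in place of a full PCA routine, and assumes that the hypothesis test is the usual t-test for a zero correlation coefficient with n - 2 degrees of freedom. All function and variable names are hypothetical.

```python
import math
import random

def autoscale(columns):
    """Center each variable to mean 0 and scale it to unit sample variance."""
    scaled = []
    for col in columns:
        n = len(col)
        mean = sum(col) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
        scaled.append([(v - mean) / sd for v in col])
    return scaled

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def first_pc(scaled, iters=500):
    """Leading eigenvector and eigenvalue of the correlation matrix (power iteration)."""
    p, n = len(scaled), len(scaled[0])
    corr = [[sum(a * b for a, b in zip(scaled[i], scaled[j])) / (n - 1)
             for j in range(p)] for i in range(p)]
    w = [1.0] * p
    for _ in range(iters):
        w = [sum(corr[i][j] * w[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    eigval = sum(w[i] * sum(corr[i][j] * w[j] for j in range(p)) for i in range(p))
    return w, eigval

# Simulated data (an assumption for illustration): 20 samples, 4 "metabolites",
# three of which follow a shared latent factor and one of which is pure noise.
random.seed(0)
n_samples = 20
latent = [random.gauss(0, 1) for _ in range(n_samples)]
raw = [[f + random.gauss(0, 0.5) for f in latent] for _ in range(3)]
raw.append([random.gauss(0, 1) for _ in range(n_samples)])

scaled = autoscale(raw)
w, eigval = first_pc(scaled)

# PC1 score of each sample: projection of the autoscaled data onto the eigenvector.
scores = [sum(w[j] * scaled[j][i] for j in range(len(scaled)))
          for i in range(n_samples)]

# Factor loading = sqrt(eigenvalue) * eigenvector component; for autoscaled data
# this equals the correlation between the PC score and the metabolite level.
loadings = [math.sqrt(eigval) * wj for wj in w]
correlations = [pearson(scores, col) for col in scaled]

# t-statistic for testing a zero correlation; compare against a t distribution
# with n - 2 degrees of freedom to obtain a p-value.
t_stats = [r * math.sqrt(n_samples - 2) / math.sqrt(1 - r * r) for r in correlations]

for j, (l, r, t) in enumerate(zip(loadings, correlations, t_stats)):
    print(f"metabolite {j}: loading={l:.4f} correlation={r:.4f} t={t:.3f}")
```

Because each loading is a correlation coefficient, the number of metabolites that pass the test depends on the data rather than on a cutoff chosen in advance, which is the point made by the two case studies.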
It is essential to select metabolites statistically to make unbiased biological inferences from metabolomic data when using factor loading in PCA. We propose a statistical procedure to select metabolites with statistical hypothesis testing of the factor loading in PCA, and to draw biological inferences about these significant metabolites with MSEA. We have developed an R package "mseapca" to facilitate this approach. The "mseapca" package is publicly available at the CRAN website.
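As an illustration of the enrichment step, one common form of MSEA is an over-representation analysis based on the hypergeometric distribution. The sketch below assumes that form; whether it matches the exact test used in the "mseapca" package is not stated here. The function name `ora_p_value` and the pathway size and overlap in the example are hypothetical; only the totals of 282 metabolites and 136 significant metabolites come from the first case study.

```python
from math import comb

def ora_p_value(n_total, n_significant, n_in_set, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    n_total       : number of metabolites measured
    n_significant : number of metabolites with a significant factor loading
    n_in_set      : number of measured metabolites belonging to the pathway
    n_overlap     : number of significant metabolites that belong to the pathway
    """
    upper = min(n_significant, n_in_set)
    denom = comb(n_total, n_significant)
    numer = sum(
        comb(n_in_set, k) * comb(n_total - n_in_set, n_significant - k)
        for k in range(n_overlap, upper + 1)
    )
    return numer / denom

# Hypothetical pathway: 10 measured members, 9 of them among the 136
# significant metabolites out of 282 measured.
p = ora_p_value(n_total=282, n_significant=136, n_in_set=10, n_overlap=9)
print(f"enrichment p-value: {p:.4g}")
```

In practice the test is repeated for every metabolite set, so the resulting p-values would normally be adjusted for multiple testing.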