Department of Mathematical Sciences, University of Arkansas, Fayetteville, AR 72701, USA.
BMC Bioinformatics. 2020 Dec 3;21(Suppl 9):205. doi: 10.1186/s12859-020-3530-x.
Compositional data are data that lie on a simplex; they are common in many scientific domains, such as genomics, geology, and economics. Because the components of a composition must sum to one, traditional tests designed for unconstrained data are inappropriate, and new statistical methods are needed to analyze this type of data.
In this paper, we consider the general problem of testing for compositional differences among K populations. Motivated by microbiome and metagenomics studies, where the data are often over-dispersed and high-dimensional, we formulate a well-posed hypothesis from a Bayesian point of view and propose a nonparametric test based on inter-point distances to evaluate statistical significance. Unlike most existing tests for compositional data, our method does not rely on any data transformation, sparsity assumption, or regularity conditions on the covariance matrix; instead, it analyzes the compositions directly. Simulated data and two real data sets on the human microbiome are used to illustrate the promise of our method.
Our simulation studies and real data applications demonstrate that the proposed test is more sensitive to compositional differences than the mean-based method, especially when the data are over-dispersed or zero-inflated. The proposed test is easy to implement and computationally efficient, which facilitates its application to large-scale datasets.
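To make the idea of a nonparametric test based on inter-point distances concrete, the following is a minimal sketch of a generic permutation test for K groups of compositions. It is an illustration under stated assumptions, not the statistic used in the paper: the abstract does not specify the test statistic, so the sketch assumes a simple "mean between-group distance minus mean within-group distance" statistic with Euclidean distance, and the function names (`pairwise_distances`, `between_within_stat`, `permutation_test`) are hypothetical. In keeping with the abstract, the compositions are used directly, without any transformation.

```python
import numpy as np


def pairwise_distances(x):
    """Euclidean distance between every pair of rows of x (n x p)."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def between_within_stat(dist, labels):
    """Assumed statistic: mean between-group minus mean within-group distance."""
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)  # exclude self-distances
    within = dist[same & off_diag].mean()
    between = dist[~same].mean()
    return between - within


def permutation_test(x, labels, n_perm=999, seed=0):
    """Permutation p-value for the assumed statistic across K groups."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    # Distances are computed once; permuting labels only regroups them.
    dist = pairwise_distances(np.asarray(x, dtype=float))
    observed = between_within_stat(dist, labels)
    exceed = 0
    for _ in range(n_perm):
        permuted = rng.permutation(labels)
        if between_within_stat(dist, permuted) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (exceed + 1) / (n_perm + 1)


# Illustrative data: two groups of Dirichlet-distributed compositions
# (simulated for this sketch only; not data from the paper).
demo_rng = np.random.default_rng(1)
group_a = demo_rng.dirichlet([5, 1, 1, 1], size=30)
group_b = demo_rng.dirichlet([1, 1, 1, 5], size=30)
compositions = np.vstack([group_a, group_b])
group_labels = np.repeat([0, 1], 30)
stat, p_value = permutation_test(compositions, group_labels)
print(f"statistic = {stat:.3f}, permutation p-value = {p_value:.3f}")
```

Because the distance matrix is computed once and only the labels are permuted, each permutation costs a single pass over the matrix. That is one plausible reason a distance-based test of this kind can remain practical on larger datasets.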