A distance based multisample test for high-dimensional compositional data with applications to the human microbiome.

Affiliation

Department of Mathematical Sciences, University of Arkansas, Fayetteville, AR 72701, USA.

Publication information

BMC Bioinformatics. 2020 Dec 3;21(Suppl 9):205. doi: 10.1186/s12859-020-3530-x.

DOI:10.1186/s12859-020-3530-x

PMID:33272203

Original article link: https://pmc.ncbi.nlm.nih.gov/articles/PMC7713147/

Abstract

BACKGROUND

Compositional data refer to the data that lie on a simplex, which are common in many scientific domains such as genomics, geology and economics. As the components in a composition must sum to one, traditional tests based on unconstrained data become inappropriate, and new statistical methods are needed to analyze this special type of data.
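The sum-to-one constraint can be illustrated numerically. The sketch below is not taken from the paper: the helper name `closure`, the Poisson counts and the four-component setup are assumptions chosen for illustration. It converts counts to compositions and shows that each row of the sample covariance matrix sums to zero. The covariance matrix is therefore singular, which is one reason tests designed for unconstrained data do not carry over directly.

```python
import numpy as np

def closure(counts):
    """Scale each row of a count table so that it sums to one (hypothetical helper)."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=1, keepdims=True)

# Simulated counts for 200 samples and 4 components (illustrative values only).
rng = np.random.default_rng(0)
counts = rng.poisson(lam=[50, 30, 15, 5], size=(200, 4))

# Each row of `comp` is a point on the simplex.
comp = closure(counts)

# Because the components sum to a constant, cov(x_i, sum_j x_j) = 0 for every i,
# so every row of the covariance matrix sums to zero.
cov = np.cov(comp, rowvar=False)
row_sums = cov.sum(axis=1)
print(np.round(row_sums, 12))
```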

RESULTS

In this paper, we consider a general problem of testing for the compositional difference between K populations. Motivated by microbiome and metagenomics studies, where the data are often over-dispersed and high-dimensional, we formulate a well-posed hypothesis from a Bayesian point of view and suggest a nonparametric test based on inter-point distance to evaluate statistical significance. Unlike most existing tests for compositional data, our method does not rely on any data transformation, sparsity assumption or regularity conditions on the covariance matrix, but directly analyzes the compositions. Simulated data and two real data sets on the human microbiome are used to illustrate the promise of our method.
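To make the idea of a distance-based K-sample test concrete, here is a minimal sketch of a permutation test built on inter-point distances. The abstract does not give the authors' test statistic, so this sketch swaps in a generic PERMANOVA-style pseudo-F statistic on Euclidean distances between untransformed compositions. The function names, the Dirichlet parameters and the number of permutations are all assumptions, and the code should not be read as the method proposed in the paper.

```python
import numpy as np

def pairwise_sq_dist(x):
    """Squared Euclidean distances between all rows of x."""
    gram = x @ x.T
    sq = np.diag(gram)
    d2 = sq[:, None] + sq[None, :] - 2.0 * gram
    return np.maximum(d2, 0.0)  # clip tiny negatives caused by rounding

def pseudo_f(d2, labels):
    """PERMANOVA-style pseudo-F computed from squared distances (stand-in statistic)."""
    n = len(labels)
    groups = np.unique(labels)
    k = len(groups)
    ss_total = d2.sum() / (2.0 * n)  # each pair is counted twice in d2.sum()
    ss_within = 0.0
    for g in groups:
        idx = np.where(labels == g)[0]
        ss_within += d2[np.ix_(idx, idx)].sum() / (2.0 * len(idx))
    ss_between = ss_total - ss_within
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def permutation_test(x, labels, n_perm=999, seed=0):
    """Return the observed statistic and a permutation p-value."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    d2 = pairwise_sq_dist(np.asarray(x, dtype=float))
    observed = pseudo_f(d2, labels)
    exceed = 0
    for _ in range(n_perm):
        # Shuffling labels mimics the null hypothesis of no group difference.
        if pseudo_f(d2, rng.permutation(labels)) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)

# Illustrative data: three groups of Dirichlet compositions, the third shifted.
rng = np.random.default_rng(1)
x = np.vstack([
    rng.dirichlet([5, 3, 2], size=30),
    rng.dirichlet([5, 3, 2], size=30),
    rng.dirichlet([2, 3, 5], size=30),
])
labels = np.repeat([0, 1, 2], 30)

stat, p_value = permutation_test(x, labels)
print(f"pseudo-F = {stat:.2f}, permutation p-value = {p_value:.3f}")
```

The distance matrix is computed once and only the labels are permuted, so the cost per permutation is small. This matches the general appeal of distance-based tests in high dimensions, since the number of components enters only through the distance computation.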

CONCLUSIONS

Our simulation studies and real data applications demonstrate that the proposed test is more sensitive to the compositional difference than the mean-based method, especially when the data are over-dispersed or zero-inflated. The proposed test is easy to implement and computationally efficient, facilitating its application to large-scale datasets.
