Evans Steven N, Matsen Frederick A
University of California at Berkeley, USA.
J R Stat Soc Series B Stat Methodol. 2012 Jun 1;74(3):569-592. doi: 10.1111/j.1467-9868.2011.01018.x. Epub 2012 Feb 15.
It is now common to survey microbial communities by sequencing nucleic acid material extracted in bulk from a given environment. Comparative methods are needed that indicate the extent to which two communities differ given data sets of this type. UniFrac, which gives a somewhat ad hoc phylogenetics-based distance between two communities, is one of the most commonly used tools for these analyses. We provide a foundation for such methods by establishing that, if we equate a metagenomic sample with its empirical distribution on a reference phylogenetic tree, then the weighted UniFrac distance between two samples is just the classical Kantorovich-Rubinstein, or earth mover's, distance between the corresponding empirical distributions. We demonstrate that this Kantorovich-Rubinstein distance and extensions incorporating uncertainty in the sample locations can be written as a readily computable integral over the tree, we develop L^p Zolotarev-type generalizations of the metric, and we show how the p-value of the resulting natural permutation test of the null hypothesis 'no difference between two communities' can be approximated by using a Gaussian process functional. We relate the L^2 case to an analysis-of-variance type of decomposition, finding that the distribution of its associated Gaussian functional is that of a computable linear combination of independent χ² random variables, each with one degree of freedom.
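As a rough illustration of the "readily computable integral over the tree", the sketch below treats each read as a point mass placed exactly on a node of a rooted reference tree. Under that assumption the integral reduces to a sum over edges of branch length times the absolute difference in the fraction of each sample lying below that edge. The tree encoding (a `parent` dictionary), the function names, the unnormalized form of the distance, and the exponent `p` variant (assumed here for p >= 1) are illustrative choices, not the authors' implementation. The permutation test shown is a direct Monte Carlo version, not the Gaussian process approximation described in the abstract.

```python
import random
from collections import defaultdict


def edge_masses(parent, placements):
    """Fraction of a sample's mass lying below each edge.

    `parent` maps child node -> (parent node, branch length); the root has
    no entry. Each edge is identified by its child node. (Hypothetical
    encoding chosen for this sketch.)
    """
    mass = defaultdict(float)
    weight = 1.0 / len(placements)
    for node in placements:
        # Walk from the placement up to the root, crediting every edge passed.
        while node in parent:
            mass[node] += weight
            node = parent[node][0]
    return mass


def kr_distance(parent, sample_a, sample_b, p=1.0):
    """Edge-sum form of the tree integral; p=1 is the earth mover's distance."""
    mass_a = edge_masses(parent, sample_a)
    mass_b = edge_masses(parent, sample_b)
    total = sum(
        length * abs(mass_a[child] - mass_b[child]) ** p
        for child, (_, length) in parent.items()
    )
    return total ** (1.0 / p)


def permutation_p_value(parent, sample_a, sample_b, p=1.0, n_perm=999, seed=0):
    """Monte Carlo permutation test of 'no difference between two communities'."""
    rng = random.Random(seed)
    observed = kr_distance(parent, sample_a, sample_b, p)
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)
    hits = 0
    for _ in range(n_perm):
        # Reassign sample labels at random, keeping the two sample sizes fixed.
        rng.shuffle(pooled)
        if kr_distance(parent, pooled[:n_a], pooled[n_a:], p) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Toy tree: root -> x (length 1), x -> a (1), x -> b (2), root -> c (3).
toy_tree = {
    "x": ("root", 1.0),
    "a": ("x", 1.0),
    "b": ("x", 2.0),
    "c": ("root", 3.0),
}
```

On the toy tree, moving all mass from `a` to `b` requires travelling a path of length 1 + 2, so `kr_distance(toy_tree, ["a", "a"], ["b", "b"])` evaluates to 3, which matches the earth mover's interpretation.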