McMurdie Paul J, Holmes Susan
Statistics Department, Stanford University, Stanford, California, United States of America.
PLoS Comput Biol. 2014 Apr 3;10(4):e1003531. doi: 10.1371/journal.pcbi.1003531. eCollection 2014 Apr.
Current practice in the normalization of microbiome count data is inefficient in the statistical sense. For apparently historical reasons, the common approach is either to use simple proportions (which does not address heteroscedasticity) or to use rarefying of counts, even though both of these approaches are inappropriate for detection of differentially abundant species. Well-established statistical theory is available that simultaneously accounts for library size differences and biological variability using an appropriate mixture model. Moreover, specific implementations for DNA sequencing read count data (based on a Negative Binomial model for instance) are already available in RNA-Seq focused R packages such as edgeR and DESeq. Here we summarize the supporting statistical theory and use simulations and empirical data to demonstrate substantial improvements provided by a relevant mixture model framework over simple proportions or rarefying. We show how both proportions and rarefied counts result in a high rate of false positives in tests for species that are differentially abundant across sample classes. Regarding microbiome sample-wise clustering, we also show that the rarefying procedure often discards samples that can be accurately clustered by alternative methods. We further compare different Negative Binomial methods with a recently-described zero-inflated Gaussian mixture, implemented in a package called metagenomeSeq. We find that metagenomeSeq performs well when there is an adequate number of biological replicates, but it nevertheless tends toward a higher false positive rate. Based on these results and well-established statistical theory, we advocate that investigators avoid rarefying altogether. We have provided microbiome-specific extensions to these tools in the R package, phyloseq.
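The contrast drawn in the abstract can be illustrated with a minimal sketch. It is not the authors' code, nor the API of edgeR, DESeq, metagenomeSeq, or phyloseq. It assumes rarefying means subsampling reads without replacement down to a common depth, and it represents the Negative Binomial as a Gamma-Poisson mixture with variance mean + dispersion * mean^2. The function names `rarefy` and `gamma_poisson`, the example count vectors, and the parameter values are all hypothetical.

```python
import numpy as np


def rarefy(counts, depth, rng):
    """Subsample `depth` reads without replacement from one sample's counts.

    Assumption: this is the usual meaning of "rarefying" to an even depth.
    """
    counts = np.asarray(counts, dtype=np.int64)
    if depth > counts.sum():
        raise ValueError("depth exceeds the sample's library size")
    # Drawing reads without replacement is a multivariate hypergeometric draw.
    return rng.multivariate_hypergeometric(counts, depth)


def gamma_poisson(mean, dispersion, size, rng):
    """Draw Negative Binomial counts as a Gamma-Poisson mixture.

    The Gamma rates have mean `mean` and variance `dispersion * mean**2`,
    so the resulting counts have variance `mean + dispersion * mean**2`.
    """
    shape = 1.0 / dispersion
    rates = rng.gamma(shape, scale=mean * dispersion, size=size)
    return rng.poisson(rates)


rng = np.random.default_rng(0)

# Two hypothetical samples with identical proportions but unequal library sizes.
small = np.array([50, 30, 15, 5])           # 100 reads
large = np.array([5000, 3000, 1500, 500])   # 10000 reads

# Rarefying forces both samples down to the smallest library size.
depth = int(min(small.sum(), large.sum()))
large_rarefied = rarefy(large, depth, rng)
discarded = int(large.sum() - large_rarefied.sum())
print("reads discarded from the larger sample:", discarded)  # 9900

# Overdispersed counts: the variance far exceeds the Poisson value (the mean).
draws = gamma_poisson(mean=100.0, dispersion=0.5, size=100_000, rng=rng)
print("empirical mean:", draws.mean())
print("empirical variance:", draws.var())
print("theoretical variance:", 100.0 + 0.5 * 100.0**2)
```

In this toy case rarefying throws away 99% of the larger library, while the mixture model keeps every read and describes the extra variability through the dispersion parameter. Library size differences are handled separately in such models, typically through a per-sample scaling factor that this sketch omits.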