Department of Chemical and Biological Engineering, University of Colorado at Boulder, Boulder, CO, 80309, USA.
Departments of Pediatrics, University of California San Diego, 9500 Gilman Drive, MC 0763, La Jolla, CA, 92093, USA.
Microbiome. 2017 Mar 3;5(1):27. doi: 10.1186/s40168-017-0237-y.
Data from 16S ribosomal RNA (rRNA) amplicon sequencing present challenges to ecological and statistical interpretation. In particular, library sizes often vary over several orders of magnitude, and the data contain many zeros. First, although we are typically interested in comparing the relative abundance of taxa in the ecosystems of two or more groups, we can only measure taxon relative abundance in specimens obtained from those ecosystems. Because comparing taxon relative abundance in the specimens is not equivalent to comparing taxon relative abundance in the ecosystems, this presents a special challenge. Second, because the relative abundances of taxa in the specimen (as well as in the ecosystem) sum to 1, these are compositional data. Because compositional data are constrained to the simplex (they sum to 1) rather than free to vary in Euclidean space, many standard methods of analysis are not applicable. Here, we evaluate how these challenges impact the performance of existing normalization methods and differential abundance analyses.
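The compositional constraint described above can be illustrated with a minimal sketch. The taxa, counts, and the helper name `closure` are hypothetical and chosen only for illustration; they are not taken from the study. The sketch shows that when a single taxon changes in absolute abundance, the relative abundances of all other taxa shift as well, while the ratio between two unchanged taxa is preserved.

```python
def closure(counts):
    """Convert absolute counts to relative abundances that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]

# Hypothetical absolute abundances of taxa A, B, C in two ecosystems.
# Only taxon A truly differs (10x higher in ecosystem 2).
ecosystem_1 = [100, 100, 100]
ecosystem_2 = [1000, 100, 100]

rel_1 = closure(ecosystem_1)
rel_2 = closure(ecosystem_2)

for name, r1, r2 in zip("ABC", rel_1, rel_2):
    print(f"taxon {name}: {r1:.3f} -> {r2:.3f}")

# The ratio between the two unchanged taxa is unaffected by the closure.
ratio_1 = rel_1[1] / rel_1[2]
ratio_2 = rel_2[1] / rel_2[2]
print(f"B/C ratio: {ratio_1:.3f} -> {ratio_2:.3f}")
```

Taxa B and C appear to decrease in relative abundance even though their absolute abundance is identical in both ecosystems. A test applied naively to relative abundances could therefore flag them as differentially abundant, which is one reason ratio-based approaches are attractive for this kind of data.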
Effects on normalization: Most normalization methods enable successful clustering of samples according to biological origin when the groups differ substantially in their overall microbial composition. For ordination metrics based on presence or absence, rarefying clusters samples according to biological origin more clearly than other normalization techniques do. Alternative normalization methods are potentially vulnerable to artifacts due to library size.
Effects on differential abundance testing: We build on previous work to evaluate seven proposed statistical methods using rarefied as well as raw data. Our simulation studies suggest that rarefying itself does not increase the false discovery rates of many differential abundance-testing methods, although rarefying does reduce sensitivity because it discards a portion of the available data. For groups with large (~10×) differences in average library size, rarefying lowers the false discovery rate. DESeq2, without addition of a constant, increases sensitivity on smaller datasets (<20 samples per group) but tends towards a higher false discovery rate with more samples, very uneven (~10×) library sizes, and/or compositional effects. For drawing inferences regarding taxon abundance in the ecosystem, analysis of composition of microbiomes (ANCOM) is not only very sensitive (for >20 samples per group) but also, critically, the only method tested that has good control of the false discovery rate.
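Rarefying, as discussed above, can be sketched as random subsampling of each library without replacement to a common depth, discarding libraries that fall below that depth. The following is a minimal illustration using only the Python standard library; the function name `rarefy`, the example counts, and the chosen depth are assumptions for illustration and do not reproduce the pipeline used in the study.

```python
import random

def rarefy(counts, depth, seed=0):
    """Subsample a per-taxon count vector without replacement to `depth` reads.

    Returns None when the library has fewer than `depth` reads, mirroring the
    usual practice of dropping samples that are too shallow.
    """
    if sum(counts) < depth:
        return None
    # Expand counts into one entry per read, labelled by taxon index.
    reads = [taxon for taxon, n in enumerate(counts) for _ in range(n)]
    kept = random.Random(seed).sample(reads, depth)
    rarefied = [0] * len(counts)
    for taxon in kept:
        rarefied[taxon] += 1
    return rarefied

# Hypothetical library of 1000 reads across four taxa (the last is absent).
sample_counts = [500, 300, 200, 0]
rarefied = rarefy(sample_counts, depth=100)
print(rarefied, sum(rarefied))

# A library shallower than the target depth is discarded.
shallow = rarefy([5, 3, 2, 0], depth=100)
print(shallow)
```

Because every retained sample ends up with the same number of reads, presence/absence comparisons are no longer driven by sequencing depth. The cost is visible in the sketch: 900 of the 1000 reads are thrown away and the shallow library is lost entirely, which is the source of the reduced sensitivity noted above.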
These findings guide which normalization and differential abundance techniques to use based on the data characteristics of a given study.