Center for Microbial Ecology, Michigan State University, East Lansing, MI 48824, USA.
Proc Natl Acad Sci U S A. 2011 Aug 30;108(35):14637-42. doi: 10.1073/pnas.1111435108. Epub 2011 Aug 22.
High-throughput sequencing of 16S rRNA genes has increased our understanding of microbial community structure, but even higher-throughput methods at the Illumina scale now allow the creation of much larger datasets, with more samples and orders of magnitude more sequences, that swamp current analytic methods. We developed a method capable of handling these larger datasets by assigning sequences to an existing taxonomy using a supervised learning approach (taxonomy-supervised analysis). We compared this method with a commonly used clustering approach based on sequence similarity (taxonomy-unsupervised analysis). We sampled 211 different bacterial communities from various habitats and obtained ∼1.3 million 16S rRNA sequences spanning the V4 hypervariable region by pyrosequencing. Both methodologies gave similar ecological conclusions: β-diversity measures calculated from the two types of matrices were significantly correlated with each other, as were the ordination configurations and hierarchical clustering dendrograms. In addition, our taxonomy-supervised analyses were highly correlated with phylogenetic methods, such as UniFrac. The taxonomy-supervised analysis has three advantages: it is not limited by the exhaustive computation required for the alignment and clustering that the taxonomy-unsupervised analysis needs, it is more tolerant of sequencing errors, and it allows comparisons when sequences come from different regions of the 16S rRNA gene. With the tremendous expansion in 16S rRNA data acquisition underway, the taxonomy-supervised approach offers the potential to provide more rapid and extensive community comparisons across habitats and samples.
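The comparison described in the abstract, two sample-by-feature matrices yielding correlated β-diversity measures, can be illustrated with a small sketch. Everything specific here is an assumption for illustration and is not taken from the paper: the toy counts, the fixed `otu_to_genus` lookup table (which stands in for a real supervised classifier), the choice of Bray-Curtis dissimilarity, and the use of a Mantel-style permutation test to correlate the two distance matrices.

```python
import numpy as np


def bray_curtis(counts):
    """Pairwise Bray-Curtis dissimilarity between rows (samples)."""
    rel = counts / counts.sum(axis=1, keepdims=True)  # relative abundances
    n = rel.shape[0]
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            num = np.abs(rel[i] - rel[j]).sum()
            den = (rel[i] + rel[j]).sum()
            d[i, j] = d[j, i] = num / den
    return d


def mantel(d1, d2, permutations=999, seed=0):
    """Pearson correlation of two distance matrices with a one-sided
    permutation p-value (rows and columns of d1 are shuffled together)."""
    rng = np.random.default_rng(seed)
    iu = np.triu_indices_from(d1, k=1)  # upper triangle, no diagonal
    observed = np.corrcoef(d1[iu], d2[iu])[0, 1]
    hits = 0
    for _ in range(permutations):
        p = rng.permutation(d1.shape[0])
        permuted = d1[np.ix_(p, p)]
        if np.corrcoef(permuted[iu], d2[iu])[0, 1] >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Hypothetical toy data: 12 samples from two habitats, 8 OTUs.
rng = np.random.default_rng(42)
habitat_a = np.array([40, 30, 20, 10, 2, 2, 1, 1])
habitat_b = np.array([1, 1, 2, 2, 10, 20, 30, 40])
means = np.vstack([habitat_a] * 6 + [habitat_b] * 6)
otu_counts = rng.poisson(means)  # "taxonomy-unsupervised" matrix

# Assumed lookup table standing in for a supervised classifier:
# each OTU column is assigned to one of four genus-level bins.
otu_to_genus = np.array([0, 0, 1, 1, 2, 2, 3, 3])
genus_counts = np.stack(
    [otu_counts[:, otu_to_genus == g].sum(axis=1) for g in range(4)],
    axis=1,
)  # "taxonomy-supervised" matrix

d_otu = bray_curtis(otu_counts)
d_genus = bray_curtis(genus_counts)
r, p = mantel(d_otu, d_genus)
print(f"Mantel r = {r:.3f}, permutation p = {p:.3f}")
```

In this toy setting the genus-level matrix is a coarsened version of the OTU matrix, so a high correlation is expected by construction; the sketch only shows the mechanics of comparing the two kinds of matrices, not the paper's result.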