Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA.
Appl Environ Microbiol. 2013 Nov;79(21):6593-603. doi: 10.1128/AEM.00342-13. Epub 2013 Aug 23.
16S rRNA sequencing, commonly used to survey microbial communities, begins by grouping individual reads into operational taxonomic units (OTUs). There are two major challenges in calling OTUs: identifying bacterial population boundaries and differentiating true diversity from sequencing errors. Current approaches to identifying taxonomic groups or eliminating sequencing errors rely on sequence data alone, but both of these activities could be informed by the distribution of sequences across samples. Here, we show that using the distribution of sequences across samples can help identify population boundaries even in noisy sequence data. The logic underlying our approach is that bacteria in different populations will often be highly correlated in their abundance across different samples. Conversely, 16S rRNA sequences derived from the same population, whether slightly different copies in the same organism, variation of the 16S rRNA gene within a population, or sequences generated randomly in error, will have the same underlying distribution across sampled environments. We present a simple OTU-calling algorithm (distribution-based clustering) that uses both genetic distance and the distribution of sequences across samples and demonstrate that it is more accurate than other methods at grouping reads into OTUs in a mock community. Distribution-based clustering also performs well on environmental samples: it is sensitive enough to differentiate between OTUs that differ by a single base pair yet predicts fewer overall OTUs than most other methods. The program can decrease the total number of OTUs with redundant information and improve the power of many downstream analyses to describe biologically relevant trends.
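The idea of combining genetic distance with the distribution of sequences across samples can be illustrated with a minimal sketch. This is not the published distribution-based clustering program. The function names (`hamming`, `js_divergence`, `call_otus`), the thresholds `max_dist` and `max_jsd`, the use of Jensen-Shannon divergence as the measure of distributional similarity, and the toy read counts are all assumptions made for illustration. A sequence joins an existing OTU only if it is genetically close to that OTU's representative and has a similar distribution across samples; otherwise it starts its own OTU.

```python
from math import log2


def hamming(a, b):
    """Count mismatched positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def js_divergence(p_counts, q_counts):
    """Jensen-Shannon divergence (base 2, range 0..1) between two count vectors.

    Assumption: used here as a simple stand-in for a statistical test of
    whether two sequences share the same distribution across samples.
    """
    p_total, q_total = sum(p_counts), sum(q_counts)
    if p_total == 0 or q_total == 0:
        raise ValueError("each sequence needs at least one observed read")
    p = [c / p_total for c in p_counts]
    q = [c / q_total for c in q_counts]
    m = [(x + y) / 2 for x, y in zip(p, q)]

    def kl(a, b):
        # Terms with a == 0 contribute nothing; b > 0 whenever a > 0.
        return sum(x * log2(x / y) for x, y in zip(a, b) if x > 0)

    return (kl(p, m) + kl(q, m)) / 2


def call_otus(reads, max_dist=1, max_jsd=0.1):
    """Group sequences into OTUs (hypothetical thresholds, illustration only).

    reads: dict mapping sequence -> list of read counts, one per sample.
    Returns a list of OTUs; each OTU is a list whose first item is the
    representative (most abundant) sequence.
    """
    # Process sequences from most to least abundant, so that abundant
    # sequences become OTU representatives.
    order = sorted(reads, key=lambda s: sum(reads[s]), reverse=True)
    otus = []
    for seq in order:
        for otu in otus:
            rep = otu[0]
            close = hamming(seq, rep) <= max_dist
            same_distribution = js_divergence(reads[seq], reads[rep]) <= max_jsd
            if close and same_distribution:
                otu.append(seq)  # likely an error or within-population variant
                break
        else:
            otus.append([seq])  # distinct distribution or too distant: new OTU
    return otus


# Toy data (invented): counts of each sequence in four samples.
reads = {
    "ACGTACGT": [100, 5, 90, 2],
    "ACGTACGA": [10, 1, 9, 0],    # one base away, same distribution
    "ACGTACGC": [3, 80, 4, 95],   # one base away, different distribution
}
otus = call_otus(reads)
print(otus)
```

In this toy example, two sequences that each differ from the most abundant sequence by a single base are treated differently: the one that tracks its abundance across samples is merged, while the one with a distinct distribution is kept as a separate OTU.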