Westcott Sarah L, Schloss Patrick D
Department of Microbiology and Immunology, University of Michigan, Ann Arbor, Michigan, USA.
mSphere. 2017 Mar 8;2(2). doi: 10.1128/mSphereDirect.00073-17. eCollection 2017 Mar-Apr.
Assignment of 16S rRNA gene sequences to operational taxonomic units (OTUs) is a computational bottleneck in the analysis of microbial communities. Although this has been an active area of research, it has been difficult to reduce the time and memory demands while also improving the quality of the OTU assignments. Here, we developed a new OTU assignment algorithm that iteratively reassigns sequences to new OTUs to optimize the Matthews correlation coefficient (MCC), a measure of the quality of OTU assignments. To assess the new algorithm, OptiClust, we compared it to 10 other algorithms using 16S rRNA gene sequences from two simulated and four natural communities. With OptiClust, the MCC values averaged 15.2 and 16.5% higher than those of the OTUs generated with the average neighbor algorithm and with distance-based greedy clustering in VSEARCH, respectively. Furthermore, on average, OptiClust was 94.6 times faster than the average neighbor algorithm and just as fast as distance-based greedy clustering with VSEARCH. An empirical analysis of the efficiency of the algorithms showed that the time and memory required to perform the algorithm scaled quadratically with the number of unique sequences in the data set. The improvement in the quality of the OTU assignments over previously existing methods will enhance downstream analysis by limiting both the splitting of similar sequences into separate OTUs and the merging of dissimilar sequences into the same OTU. The development of the OptiClust algorithm represents a significant advance that is likely to have numerous other applications.
The analysis of microbial communities from diverse environments using 16S rRNA gene sequencing has expanded our knowledge of the biogeography of microorganisms. An important step in this analysis is the assignment of sequences into taxonomic groups based either on their similarity to sequences in a database or on their similarity to each other, irrespective of a database. In this study, we present a new algorithm for the latter approach. The algorithm, OptiClust, seeks to optimize a metric of assignment quality by shuffling sequences between taxonomic groups. We found that OptiClust produces more robust assignments and does so in a rapid and memory-efficient manner. This advance will allow for a more robust analysis of microbial communities and the factors that shape them.
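The idea of shuffling sequences between OTUs to optimize the MCC can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that a pair of sequences counts as a true positive when the two are within the distance threshold and share an OTU, a false positive when they share an OTU but are not within the threshold, a false negative when they are within the threshold but in different OTUs, and a true negative otherwise. The names `mcc`, `optimize`, `close`, and `max_rounds` are hypothetical, and the MCC is recomputed from scratch at every step for clarity rather than speed.

```python
from itertools import combinations
from math import sqrt


def mcc(otus, close, n):
    """MCC of an OTU assignment (pairwise definition assumed, see lead-in).

    otus  -- list mapping sequence index -> OTU label
    close -- set of frozenset({i, j}) pairs within the distance threshold
    n     -- number of sequences
    """
    tp = fp = fn = tn = 0
    for i, j in combinations(range(n), 2):
        same = otus[i] == otus[j]
        similar = frozenset((i, j)) in close
        if same and similar:
            tp += 1      # similar pair kept together
        elif same:
            fp += 1      # dissimilar pair merged
        elif similar:
            fn += 1      # similar pair split
        else:
            tn += 1      # dissimilar pair kept apart
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def optimize(close, n, max_rounds=100):
    """Greedy reassignment sketch: move a sequence only if the MCC rises."""
    otus = list(range(n))            # every sequence starts as a singleton
    best = mcc(otus, close, n)
    for _ in range(max_rounds):
        improved = False
        for i in range(n):
            current = otus[i]
            fresh = max(otus) + 1    # label for a brand-new OTU
            neighbours = {otus[j] for j in range(n)
                          if frozenset((i, j)) in close}
            for label in sorted(neighbours | {fresh}):
                if label == current:
                    continue
                otus[i] = label      # tentatively move sequence i
                score = mcc(otus, close, n)
                if score > best:
                    best, current, improved = score, label, True
            otus[i] = current        # keep the best option found
        if not improved:             # a full pass changed nothing: stop
            break
    return otus, best


# Toy data: sequences 0-2 are mutually similar, as are sequences 3 and 4.
close = {frozenset(p) for p in [(0, 1), (0, 2), (1, 2), (3, 4)]}
otus, score = optimize(close, n=5)
print(otus, round(score, 3))
```

On this toy input the sketch groups sequences 0, 1, and 2 into one OTU and sequences 3 and 4 into another, so no similar pair is split and no dissimilar pair is merged. Recomputing the MCC over all pairs for every candidate move is far costlier than a practical implementation would need to be; the sketch only shows the shape of the optimization.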