Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, USA.
Department of Microbiology and Immunology, University of Michigan, Ann Arbor, Michigan, USA.
mSphere. 2022 Feb 23;7(1):e0091621. doi: 10.1128/msphere.00916-21. Epub 2022 Feb 2.
Assigning amplicon sequences to operational taxonomic units (OTUs) is an important step in characterizing microbial communities across large data sets. A notable difference between de novo clustering and database-dependent reference clustering methods is that OTU assignments from de novo methods may change when new sequences are added. However, one may wish to incorporate new samples into previously clustered data sets without clustering all sequences again, such as when comparing across data sets or deploying machine learning models. Existing reference-based methods produce consistent OTUs but consider only the similarity of each query sequence to a single reference sequence in an OTU, resulting in lower-quality assignments than those generated by de novo methods. To provide an efficient method for fitting sequences to existing OTUs, we developed the OptiFit algorithm. Inspired by the de novo OptiClust algorithm, OptiFit considers the similarity of all pairs of reference and query sequences to produce OTUs of the best possible quality. We tested OptiFit using four data sets with two strategies: (i) clustering to a reference database and (ii) splitting the data set into a reference set and a query set, clustering the references using OptiClust, and then clustering the queries to the references. The result is an improved implementation of reference-based clustering. With the split data set strategy, OptiFit produces OTUs of a quality similar to that of OptiClust and does so faster. OptiFit provides a suitable option for users requiring consistent OTU assignments at the same quality as afforded by de novo clustering methods.
Advancements in DNA sequencing technology have allowed researchers to affordably generate millions of sequence reads from microorganisms in diverse environments. Efficient and robust software tools are needed to assign microbial sequences into taxonomic groups for characterization and comparison of communities. The OptiClust algorithm produces high-quality groups by comparing sequences to each other, but the assignments can change when new sequences are added to a data set, making it difficult to compare different studies. Other approaches assign sequences to groups by comparing them to sequences in a reference database, which produces consistent assignments, but the quality of the resulting groups is lower than with OptiClust. We developed OptiFit, a new reference-based algorithm that produces consistent assignments of a quality similar to that of OptiClust. OptiFit allows researchers to compare microbial communities across different studies or add new data to existing studies without sacrificing the quality of the group assignments.
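The idea of fitting query sequences to existing OTUs while scoring all pairs of reference and query sequences can be illustrated with a small sketch. This is not the published OptiFit implementation. It assumes that OTU quality is measured with the Matthews correlation coefficient (MCC) over sequence pairs, a metric this abstract does not name, and it uses a simple greedy loop in which reference assignments stay fixed and only queries move. The names `mcc`, `fit_queries`, and `close_pairs` are hypothetical.

```python
from itertools import combinations
from math import sqrt


def mcc(assign, close_pairs, names):
    """MCC over all sequence pairs (assumed quality metric).

    A pair is a true positive when both sequences share an OTU and are
    within the distance threshold (i.e. the pair is in close_pairs).
    """
    tp = tn = fp = fn = 0
    for a, b in combinations(names, 2):
        same_otu = assign[a] == assign[b]
        close = frozenset((a, b)) in close_pairs
        if same_otu and close:
            tp += 1
        elif same_otu:
            fp += 1
        elif close:
            fn += 1
        else:
            tn += 1
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def fit_queries(ref_assign, queries, close_pairs, max_iter=100):
    """Greedily fit queries to OTUs; reference assignments never change."""
    assign = dict(ref_assign)
    for i, q in enumerate(queries):
        assign[q] = f"new_{i}"  # every query starts as its own singleton
    names = sorted(assign)
    best = mcc(assign, close_pairs, names)
    for _ in range(max_iter):
        moved = False
        for q in queries:
            current = assign[q]
            best_otu = current
            # Try every other OTU and keep the move that raises MCC most.
            for otu in sorted(set(assign.values())):
                if otu == current:
                    continue
                assign[q] = otu
                score = mcc(assign, close_pairs, names)
                if score > best:
                    best, best_otu = score, otu
            assign[q] = best_otu
            moved = moved or best_otu != current
        if not moved:  # converged: no query changed OTU in this pass
            break
    return assign, best
```

In this toy version each candidate move rescans every pair, which is only practical for tiny inputs. A query that improves the score in no existing OTU stays a singleton, which loosely corresponds to open-reference behaviour; discarding such queries instead would correspond to closed-reference behaviour.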