Fares Mohamed, Tharwat Engy K, Cleenwerck Ilse, Monsieurs Pieter, Van Houdt Rob, Vandamme Peter, El-Hadidi Mohamed, Mysara Mohamed
Bioinformatics Group, Center for Informatics Science, School of Information Technology and Computer Science, Nile University, Giza, Egypt.
Veterinary Research Institute, National Research Centre, Giza, Egypt.
Environ Microbiome. 2025 May 13;20(1):51. doi: 10.1186/s40793-025-00705-6.
Although 16S rRNA gene amplicon sequencing has become an indispensable method for microbiome studies, it remains prone to several biases and errors. Numerous algorithms have been developed to eliminate these errors and consolidate the output into distance-based Operational Taxonomic Units (OTUs) or denoising-based Amplicon Sequence Variants (ASVs). An objective comparison between them has been obscured by differing experimental setups and parameters. In the present study, we conducted a comprehensive benchmarking analysis of error rates, microbial composition, over-merging/over-splitting of reference sequences, and diversity analyses, using the most complex mock community, which comprises 227 bacterial strains, together with the Mockrobiota database. Using unified preprocessing steps, we were able to compare DADA2, Deblur, MED, UNOISE3, UPARSE, DGC (Distance-based Greedy Clustering), AN (Average Neighborhood), and Opticlust objectively.
ASV algorithms, led by DADA2, produced consistent output but suffered from over-splitting, while OTU algorithms, led by UPARSE, achieved clusters with lower errors but with more over-merging. Notably, UPARSE and DADA2 most closely resembled the intended microbial community, especially in measures of alpha and beta diversity.
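To make the over-splitting and over-merging terminology concrete, the following is a minimal sketch, not the authors' pipeline. It assumes each sequence has already been assigned both to an output cluster (OTU or ASV) and to a reference strain of the mock community. The function name `split_merge_counts`, the input format, and the toy identifiers are all hypothetical. Under these assumptions, a reference strain is over-split when its sequences fall into more than one cluster, and a cluster is over-merged when it contains sequences from more than one reference strain.

```python
from collections import defaultdict


def split_merge_counts(assignments):
    """Return (over_split_references, over_merged_clusters).

    `assignments` is an iterable of (cluster_id, reference_id) pairs,
    one pair per sequence matched to a reference strain. This input
    format is an assumption made for illustration only.
    """
    refs_per_cluster = defaultdict(set)
    clusters_per_ref = defaultdict(set)
    for cluster_id, reference_id in assignments:
        refs_per_cluster[cluster_id].add(reference_id)
        clusters_per_ref[reference_id].add(cluster_id)

    # A reference spread over several clusters counts as over-split.
    over_split = sorted(r for r, c in clusters_per_ref.items() if len(c) > 1)
    # A cluster holding several references counts as over-merged.
    over_merged = sorted(c for c, r in refs_per_cluster.items() if len(r) > 1)
    return over_split, over_merged


# Toy data (hypothetical identifiers, not taken from the study).
example = [
    ("asv1", "strainA"), ("asv2", "strainA"),  # strainA split across two ASVs
    ("otu1", "strainB"), ("otu1", "strainC"),  # otu1 merges two strains
    ("otu2", "strainD"),                       # clean one-to-one match
]
over_split, over_merged = split_merge_counts(example)
print(over_split)   # ['strainA']
print(over_merged)  # ['otu1']
```

Counting both directions from the same assignment table keeps the two error types comparable across OTU and ASV methods; how sequences are matched to references in the first place is outside the scope of this sketch.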
Our unbiased comparative evaluation examined the performance of eight algorithms for the analysis of 16S rRNA amplicon sequences across a wide range of mock datasets. Our analysis sheds light on the pros and cons of each algorithm and the accuracy of the OTUs or ASVs they produce. The use of the most complex mock community and the benchmarking comparison presented here offer a framework for comparing OTU/ASV algorithms and an objective method for assessing new tools and algorithms.
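As an illustration of how resemblance to the intended community might be scored when assessing a new tool, the sketch below computes one alpha-diversity measure (the Shannon index) and one beta-diversity measure (Bray-Curtis dissimilarity) between an expected and an observed abundance profile. The abstract does not state which diversity metrics were used, so these two are assumptions, and the function names and toy counts are hypothetical.

```python
import math


def shannon(counts):
    """Shannon index (natural log) of a list of non-negative counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two equal-length count vectors."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same taxa in the same order")
    return sum(abs(x - y) for x, y in zip(a, b)) / (sum(a) + sum(b))


# Toy profiles over the same four taxa (hypothetical numbers).
expected = [25, 25, 25, 25]   # even mock community
observed = [40, 30, 20, 10]   # output of some algorithm

alpha_gap = abs(shannon(expected) - shannon(observed))
beta = bray_curtis(expected, observed)
print(f"alpha gap = {alpha_gap:.3f}, Bray-Curtis = {beta:.3f}")
```

A smaller alpha gap and a Bray-Curtis value closer to 0 both indicate an output profile nearer to the mock community's design, under the assumptions stated above.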