Department of Biological Sciences, University of Massachusetts Lowell, Lowell, MA, USA.
BMC Bioinformatics. 2021 Oct 12;22(1):493. doi: 10.1186/s12859-021-04410-2.
Taxonomic classification of genetic markers for microbiome analysis is affected by the numerous choices made from sample preparation to bioinformatics analysis. Paired-end read merging is routinely used to capture the entire amplicon sequence when the read ends overlap. However, excluding unmerged reads from further analysis can lead to underestimating the diversity of the sequenced microbial community, and the extent of this exclusion is influenced by bioinformatic processes such as read trimming and the choice of reference database. A potential solution is to concatenate (join) reads that do not overlap and keep them for taxonomic classification. The use of concatenated reads can outperform taxonomic recovery from single-end reads, but it remains unclear how their performance compares to that of merged reads. Using various sequenced mock communities that differed in amplicon, read length, read depth, taxonomic composition, and sequence quality, we tested how merging and concatenating reads performed for genus recall and precision in bioinformatic pipelines that combined different read-trimming parameters with taxonomic classification against different reference databases.
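The difference between merging and concatenating a read pair can be illustrated with a minimal Python sketch. It is not the pipeline used in the study: the function names, the exact-match overlap rule, the `min_overlap` value, and the `N` spacer are all assumptions made for illustration. Real mergers typically tolerate mismatches and weigh base qualities.

```python
# Hypothetical illustration of merging vs. concatenating a read pair.
# Assumptions (not from the source): exact-match overlap detection,
# a minimum overlap of 10 bases, and a spacer of ten N characters.

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_or_concatenate(fwd: str, rev: str, min_overlap: int = 10,
                         spacer: str = "N" * 10) -> tuple[str, str]:
    """Merge a read pair if the ends overlap, otherwise concatenate it.

    Returns (sequence, kind), where kind is "merged" or "concatenated".
    """
    # Bring the reverse read onto the same strand as the forward read.
    rev_rc = reverse_complement(rev)

    # Try the longest possible overlap first, down to min_overlap.
    max_len = min(len(fwd), len(rev_rc))
    for k in range(max_len, min_overlap - 1, -1):
        if fwd[-k:] == rev_rc[:k]:
            # Overlap found: keep the shared region only once.
            return fwd + rev_rc[k:], "merged"

    # No overlap: join the reads with a spacer instead of discarding them.
    return fwd + spacer + rev_rc, "concatenated"


if __name__ == "__main__":
    amplicon = "ACGTTGCAAGGCTTAACCGGTTAGCATCGATCCGTA"

    # Reads that overlap by 10 bases are merged into the full amplicon.
    print(merge_or_concatenate(amplicon[:20], reverse_complement(amplicon[10:])))

    # Reads that leave a gap in the middle of the amplicon are concatenated.
    print(merge_or_concatenate(amplicon[:12], reverse_complement(amplicon[24:])))
```

In this sketch a pair that would otherwise be dropped as unmerged is retained as a single joined sequence, which is the idea the abstract evaluates.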
The addition of concatenated reads to merged reads always increased pipeline performance. The top two performing pipelines both included read concatenation, with relative strengths that varied by mock community. The pipeline that combined quality-trimmed merged and concatenated reads performed best for mock communities with larger amplicons and higher average sequence quality. The pipeline that used length-trimmed concatenated reads outperformed quality-trimming pipelines in mock communities with lower-quality sequences, but lost a significant number of input sequences for taxonomic classification during processing. Genus-level classification was more accurate with the SILVA reference database than with Greengenes.
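Pipeline performance here is reported as genus recall and precision against the known composition of each mock community. A minimal Python sketch of those two metrics follows, assuming the usual presence/absence definitions (recall is the share of expected genera that were detected, precision is the share of detected genera that were expected). The function name and the example genera are hypothetical, and the study's exact computation may differ.

```python
# Hypothetical sketch of genus-level recall and precision for a mock community.
# Assumption (not from the source): presence/absence scoring of genus names.

def genus_recall_precision(expected, observed):
    """Return (recall, precision) for observed genera vs. the expected set."""
    expected, observed = set(expected), set(observed)
    true_positives = expected & observed

    # Recall: fraction of the mock community's genera that were recovered.
    recall = len(true_positives) / len(expected) if expected else 0.0
    # Precision: fraction of reported genera that belong to the mock community.
    precision = len(true_positives) / len(observed) if observed else 0.0
    return recall, precision


if __name__ == "__main__":
    # Illustrative genus lists only; not data from the study.
    mock = ["Bacillus", "Escherichia", "Listeria", "Pseudomonas"]
    found = ["Bacillus", "Escherichia", "Listeria", "Shigella", "Salmonella"]
    recall, precision = genus_recall_precision(mock, found)
    print(f"recall={recall:.2f} precision={precision:.2f}")
```

Comparing these two numbers across pipelines shows whether added reads recover more true genera without introducing false ones.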
Supplementing merged sequences with concatenated sequences from read pairs that could not be merged increased the performance of taxonomic classification. This was especially beneficial in mock communities with larger amplicons. Through an in-depth comparison of pipelines containing merged versus concatenated reads, combined with different trimming parameters and reference databases, we have shown for the first time the potential advantages of concatenating sequences for improving resolution in microbiome investigations.