Müller Robert, Nebel Markus
Faculty of Technology, Bielefeld University, Bielefeld, Germany.
PeerJ. 2021 Aug 16;9:e11717. doi: 10.7717/peerj.11717. eCollection 2021.
High-throughput sequencing has become an essential technology in life science research. Despite continuous improvements in technology, the produced sequences are still not entirely accurate. Consequently, the sequences are usually equipped with error probabilities. This quality information is already employed to find better solutions to a number of bioinformatics problems (e.g. read mapping). Data processing pipelines benefit in particular, especially when the quality information is incorporated early, since enhanced outcomes of one step can improve all subsequent ones. Preprocessing steps thus quite regularly consider the sequence quality to fix errors or discard low-quality data. Other steps, however, are typically performed without making use of the available quality information. One example is the clustering of sequences into operational taxonomic units (OTUs), a common task in the analysis of microbial communities.
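To make the notion of per-base error probabilities concrete, the following minimal sketch decodes a quality string into probabilities. It assumes the widely used Phred+33 (FASTQ) encoding, which the abstract does not specify; the function name `phred_to_error_prob` is hypothetical and chosen for illustration only.

```python
def phred_to_error_prob(quality_string, offset=33):
    """Convert an ASCII-encoded quality string into per-base error probabilities.

    Assumption: Phred+33 encoding, i.e. Q = ord(char) - 33 and p = 10^(-Q/10).
    """
    return [10 ** (-(ord(char) - offset) / 10) for char in quality_string]


# 'I' encodes Q40, '5' encodes Q20, '+' encodes Q10 and '!' encodes Q0
probs = phred_to_error_prob("I5+!")
for char, prob in zip("I5+!", probs):
    print(f"{char!r}: Q{ord(char) - 33}, error probability {prob:.4f}")
```

A higher quality score corresponds to a lower probability that the base call is wrong, so a base with Q0 carries no usable information.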
In this paper, we present quality-aware clustering methods inspired by quality-weighted alignments and model-based denoising, and explore their applicability to OTU clustering. We implemented the quality-aware methods in a revised version of our clustering tool GeFaST and evaluated their clustering quality and performance on mock-community data sets. Quality-weighted alignments were able to improve the clustering quality of GeFaST by up to 10%. The examination of the model-supported methods provided a more diverse picture, hinting at a narrower applicability, but they were able to attain similar improvements. Considering the quality information increased both runtime and memory consumption, although the increase in runtime depended heavily on the applied method and clustering threshold.
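The general idea behind quality weighting can be illustrated with a small sketch. It is not the scoring scheme implemented in GeFaST, which the abstract does not describe. As an assumption for illustration, each mismatch between two equal-length, gap-free sequences is weighted by the probability that both base calls are correct, with qualities assumed to be Phred+33 encoded; the name `weighted_mismatch_distance` is hypothetical.

```python
def weighted_mismatch_distance(seq_a, qual_a, seq_b, qual_b, offset=33):
    """Quality-weighted mismatch count for two equal-length sequences.

    Illustrative assumption: a mismatch contributes (1 - p_a) * (1 - p_b),
    the probability that both differing bases were called correctly, so
    mismatches at low-quality positions are discounted.
    """
    if not (len(seq_a) == len(qual_a) == len(seq_b) == len(qual_b)):
        raise ValueError("sequences and quality strings must have equal length")

    distance = 0.0
    for base_a, q_a, base_b, q_b in zip(seq_a, qual_a, seq_b, qual_b):
        if base_a != base_b:
            # Decode Phred+33 characters into error probabilities
            p_a = 10 ** (-(ord(q_a) - offset) / 10)
            p_b = 10 ** (-(ord(q_b) - offset) / 10)
            distance += (1 - p_a) * (1 - p_b)
    return distance


# The same mismatch counts almost fully at high quality
# and not at all when one base has quality 0.
high = weighted_mismatch_distance("ACGT", "IIII", "ACGA", "IIII")
low = weighted_mismatch_distance("ACGT", "IIII", "ACGA", "III!")
print(high, low)
```

Under such a scheme, two reads that differ only at unreliable positions appear closer than an unweighted comparison would suggest, which is how quality information can change cluster membership.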
The quality-aware methods extend the iterative clustering approach with new clustering and cluster refinement methods. Our results indicate that OTU clustering constitutes yet another analysis step benefiting from the integration of quality information. Beyond the shown potential, the quality-aware methods offer a range of opportunities for fine-tuning and further extensions.