Muralidharan Harihara Subrahmaniam, Shah Nidhi, Meisel Jacquelyn S, Pop Mihai
Pop Lab, Department of Computer Science, Center for Bioinformatics and Computational Biology, UMIACS, University of Maryland, College Park, MD, United States.
Front Microbiol. 2021 Feb 24;12:638561. doi: 10.3389/fmicb.2021.638561. eCollection 2021.
High-throughput sequencing has revolutionized the field of microbiology; however, reconstructing complete genomes of organisms from whole metagenomic shotgun sequencing data remains a challenge. Recovered genomes are often highly fragmented due to uneven abundances of organisms, repeats within and across genomes, sequencing errors, and strain-level variation. To address the fragmented nature of metagenomic assemblies, scientists rely on a process called binning, which clusters together contigs inferred to originate from the same organism. Existing binning algorithms use oligonucleotide frequencies and contig abundance (coverage) within and across samples to group together contigs from the same organism. However, these algorithms often miss short contigs and contigs from regions with unusual coverage or DNA composition characteristics, such as mobile elements. Here, we propose that information from assembly graphs can assist current strategies for metagenomic binning. We use MetaCarvel, a metagenomic scaffolding tool, to construct assembly graphs in which contigs are nodes and edges are inferred from paired-end reads. We developed a tool, Binnacle, that extracts information from the assembly graphs and clusters scaffolds into comprehensive bins. Binnacle also provides wrapper scripts to integrate with existing binning methods. The Binnacle pipeline can be found on GitHub (https://github.com/marbl/binnacle). We show that binning graph-based scaffolds, rather than contigs, improves the contiguity and quality of the resulting bins and captures a broader set of the genes of the organisms being reconstructed.
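As a rough illustration of the idea in the abstract, the sketch below treats contigs as graph nodes, links them with edges standing in for paired-end evidence, and computes the two feature types the abstract names (coverage and oligonucleotide frequencies) per linked group rather than per contig. This is not Binnacle's or MetaCarvel's actual algorithm: grouping by connected components is a deliberately crude stand-in for scaffolding, and every name here (`connected_components`, `tetranucleotide_freqs`, `scaffold_features`, the toy contigs `c1` to `c3` and their coverages) is hypothetical.

```python
from collections import Counter, defaultdict
from itertools import product

# All 256 tetranucleotides over the DNA alphabet, in a fixed order.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def connected_components(nodes, edges):
    """Group contigs joined by edges, using a small union-find.

    Assumption: a connected component is used as a crude proxy for a
    scaffold; real scaffolders also orient and order contigs.
    """
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(list)
    for n in nodes:
        groups[find(n)].append(n)
    return sorted(sorted(g) for g in groups.values())


def tetranucleotide_freqs(seq):
    """Return normalized 4-mer frequencies; k-mers containing N are ignored."""
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[k] for k in KMERS)
    return [counts[k] / total if total else 0.0 for k in KMERS]


def scaffold_features(group, seqs, cov):
    """Length-weighted mean coverage and pooled 4-mer profile for one group."""
    total_len = sum(len(seqs[c]) for c in group)
    mean_cov = sum(cov[c] * len(seqs[c]) for c in group) / total_len
    # Joining with "N" keeps k-mers from spanning contig boundaries.
    freqs = tetranucleotide_freqs("N".join(seqs[c] for c in group))
    return mean_cov, freqs


# Toy input (hypothetical): c2 is a short contig with unusual coverage.
seqs = {"c1": "ACGTACGTAC", "c2": "ACGT", "c3": "GGGGCCCC"}
cov = {"c1": 10.0, "c2": 24.0, "c3": 3.0}
edges = [("c1", "c2")]  # stands in for a paired-end link between c1 and c2

groups = connected_components(seqs, edges)
features = {tuple(g): scaffold_features(g, seqs, cov) for g in groups}
```

In this toy input, `c2` is too short to yield a stable composition profile on its own, but because it is linked to `c1` it contributes to a single pooled feature vector for the group. A downstream clustering step would then see one point for the linked group instead of two separate contigs.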