Department of Computer Science and Engineering and Department of Botany and Plant Sciences, University of California, Riverside, CA 92521, USA.
Bioinformatics. 2014 Jun 15;30(12):i319-i328. doi: 10.1093/bioinformatics/btu291.
De novo genome assembly remains one of the most challenging applications of next-generation sequencing. The resulting assemblies are usually incomplete and fragmented into hundreds of contigs, mainly because of repeats in the genomes and sequencing errors. With the rapidly growing number of sequenced genomes, it is now feasible to improve assemblies by guiding them with genomes from related species.
Here we introduce AlignGraph, an algorithm for extending and joining de novo-assembled contigs or scaffolds guided by closely related reference genomes. It aligns paired-end (PE) reads and preassembled contigs or scaffolds to a close reference. From the obtained alignments, it builds a novel data structure, called the PE multipositional de Bruijn graph. The incorporated positional information from the alignments and PE reads allows us to extend the initial assemblies, while avoiding incorrect extensions and early terminations. In our performance tests, AlignGraph was able to substantially improve the contigs and scaffolds from several assemblers. For instance, 28.7-62.3% of the contigs of Arabidopsis thaliana and human could be extended, resulting in improvements of common assembly metrics, such as an increase of the N50 of the extendable contigs by 89.9-94.5% and 80.3-165.8%, respectively. In another test, AlignGraph was able to improve the assembly of a published genome (Arabidopsis strain Landsberg) by increasing the N50 of its extendable scaffolds by 86.6%. These results demonstrate AlignGraph's efficiency in improving genome assemblies by taking advantage of closely related references.
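To make the role of positional information and the N50 metric concrete, here is a minimal Python sketch. It is not AlignGraph's implementation. It shows a simplified positional de Bruijn graph in which each node is a (k-mer, reference position) pair, so that identical k-mers from different repeat copies stay separate. It assumes gap-free alignments and leaves out the paired-end constraints and any position tolerance that the real data structure may use. The function names `build_positional_graph` and `n50` and the input format are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_positional_graph(aligned_reads, k):
    """Build a simplified positional de Bruijn graph.

    aligned_reads: iterable of (sequence, ref_start) pairs, where ref_start
    is the 0-based reference coordinate of the first base of the read.
    Assumption: alignments are gap-free, so base i of a read sits at
    reference position ref_start + i.

    Returns a dict mapping each node (kmer, position) to the set of its
    successor nodes.
    """
    graph = defaultdict(set)
    for seq, ref_start in aligned_reads:
        # Consecutive k-mers of a read overlap by k - 1 bases.
        for i in range(len(seq) - k):
            src = (seq[i:i + k], ref_start + i)
            dst = (seq[i + 1:i + 1 + k], ref_start + i + 1)
            graph[src].add(dst)
    return graph


def n50(lengths):
    """Return the N50 of a list of contig lengths.

    N50 is the largest length L such that contigs of length >= L
    together cover at least half of the total assembly length.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input
```

With the reads `("ACGTAC", 0)`, `("GTACGT", 2)` and `("ACGTTT", 100)` and `k = 3`, the k-mer `CGT` occurs at reference positions 1 and 101. A plain de Bruijn graph would merge both occurrences into one branching node, whereas the positional graph keeps them apart, so each has a single unambiguous successor.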
The AlignGraph software is freely available at https://github.com/baoe/AlignGraph.