Group of Interdisciplinary Information Sciences, School of Software Engineering, Beijing Jiaotong University, China.
College of Information and Computer Engineering, Northeast Forestry University, China.
Brief Bioinform. 2021 Sep 2;22(5). doi: 10.1093/bib/bbab022.
Contigs assembled from third-generation long sequencing reads are usually more complete than those assembled from second-generation short reads. However, current algorithms still have difficulty assembling long reads into the ideal complete and accurate genome, that is, the theoretical best result [1]. As more and more fully sequenced genomes become available, it may still be possible to improve long-read contigs with the similar genome-assisted reassembly method [2], which was originally proposed for short reads and makes use of a genome (the similar genome) closely related to the genome being sequenced (the target genome). The method aligns the contigs and reads to the similar genome, and then extends and refines the aligned contigs with the aligned reads. Here, we introduce AlignGraph2, a similar genome-assisted reassembly pipeline for PacBio long reads. AlignGraph2 is the second version of our AlignGraph algorithm but has been completely redesigned. It accepts either error-prone or HiFi long reads and contains four novel algorithms: a similarity-aware alignment algorithm and an alignment filtration algorithm for aligning the long reads and preassembled contigs to the similar genome, and a reassembly algorithm and a weight-adjusted consensus algorithm for extending and refining the preassembled contigs. In our performance tests on both error-prone and HiFi long reads, AlignGraph2 aligns 5.7-27.2% more long reads and 7.3-56.0% more bases than some current alignment algorithms, and its efficiency is better than or comparable to that of the others. For contigs assembled with various de novo algorithms and aligned to similar genomes (aligned contigs), AlignGraph2 can extend 8.7-94.7% of them (extendable contigs), yielding contigs (extended contigs) with a 7.0-249.6% larger N50 value and 5.2-87.7% fewer indels per 100 kbp. AlignGraph2 also performs relatively stably as the similarity between the similar genome and the target genome decreases.
The AlignGraph2 software can be downloaded for free from this site: https://github.com/huangs001/AlignGraph2.
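The abstract reports results as N50 values and as indels per 100 kbp. As a minimal sketch of how these two metrics are conventionally defined (this is not code from AlignGraph2, and the function names `n50` and `indels_per_100kbp` are hypothetical), they could be computed as follows:

```python
def n50(contig_lengths):
    """Return the N50 of a set of contigs.

    N50 is the length of the shortest contig in the smallest set of
    longest contigs that together cover at least half of the total
    assembly length.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


def indels_per_100kbp(indel_count, aligned_bases):
    """Normalize an indel count to a rate per 100,000 aligned bases."""
    if aligned_bases <= 0:
        raise ValueError("aligned_bases must be positive")
    return indel_count * 100_000 / aligned_bases


if __name__ == "__main__":
    # Illustrative lengths only, not data from the paper.
    print(n50([80, 70, 50, 40, 30, 20]))
    print(indels_per_100kbp(25, 500_000))
```

A larger N50 indicates a more contiguous assembly, and a lower indel rate indicates a more accurate one, which is why the abstract reports an increase in the former and a decrease in the latter.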