Department of Information Technology, Ghent University-imec, IDLab, Ghent, B-9052, Belgium.
Bioinformatics Institute Ghent, Ghent, B-9052, Belgium.
BMC Bioinformatics. 2018 Sep 4;19(1):311. doi: 10.1186/s12859-018-2319-7.
Aligning short reads to a reference genome is an important task in many genome analysis pipelines. This task is computationally more complex when the reference genome is provided in the form of a de Bruijn graph instead of a linear sequence string.
We present a branch and bound alignment algorithm that uses the seed-and-extend paradigm to accurately align short Illumina reads to a graph. Given a seed, the algorithm greedily explores all branches of the tree until the optimal alignment path is found. To reduce the search space we compute upper bounds to the alignment score for each branch and discard the branch if it cannot improve the best solution found so far. Additionally, by using a two-pass alignment strategy and a higher-order Markov model, paths in the de Bruijn graph that do not represent a subsequence in the original reference genome are discarded from the search procedure.
BrownieAligner is applied to both synthetic and real datasets. It generally outperforms other state-of-the-art tools in terms of accuracy, while having similar runtime and memory requirements. Our results show that using the higher-order Markov model in BrownieAligner improves the accuracy, while the branch and bound algorithm reduces runtime. BrownieAligner is written in standard C++11 and released under GPL license. BrownieAligner relies on multithreading to take advantage of multi-core/multi-CPU systems. The source code is available at: https://github.com/biointec/browniealigner.
将短读序列比对到参考基因组是许多基因组分析流程中的重要任务。当参考基因组以 De Bruijn 图而不是线性序列字符串的形式提供时,该任务在计算上更加复杂。
我们提出了一种分支和界限对齐算法,该算法使用种子和扩展范例来准确地将短 Illumina 读取序列比对到图上。给定一个种子,该算法贪婪地探索树的所有分支,直到找到最佳的对齐路径。为了减少搜索空间,我们为每个分支计算对齐得分的上限,并丢弃不能改善迄今为止找到的最佳解决方案的分支。此外,通过使用两阶段对齐策略和高阶马尔可夫模型,从搜索过程中丢弃在 De Bruijn 图中不表示原始参考基因组中的子序列的路径。
BrownieAligner 应用于合成和真实数据集。它在准确性方面通常优于其他最先进的工具,同时具有相似的运行时和内存要求。我们的结果表明,在 BrownieAligner 中使用高阶马尔可夫模型可以提高准确性,而分支和界限算法可以减少运行时间。BrownieAligner 是用标准的 C++11 编写的,并根据 GPL 许可证发布。BrownieAligner 依赖于多线程来利用多核/多 CPU 系统。源代码可在以下网址获得:https://github.com/biointec/browniealigner。