University of California Los Angeles, Los Angeles, CA, United States.
School of Biological Sciences, Artificial Intelligence Institute, Institute of Molecular Biology and Genetics, Seoul National University, Seoul, South Korea.
Bioinformatics. 2023 Aug 1;39(8). doi: 10.1093/bioinformatics/btad487.
Efficiently aligning sequences is a fundamental problem in bioinformatics. Many recent algorithms for computing alignments through Smith-Waterman-Gotoh dynamic programming (DP) exploit Single Instruction Multiple Data (SIMD) operations on modern CPUs for speed. However, these advances have largely ignored difficulties associated with efficiently handling complex scoring matrices or large gaps (insertions or deletions).
We propose a new SIMD-accelerated algorithm called Block Aligner for aligning nucleotide and protein sequences against other sequences or position-specific scoring matrices. We introduce a new paradigm that uses blocks in the DP matrix that greedily shift, grow, and shrink. This approach allows regions of the DP matrix to be adaptively computed. Our algorithm reaches over 5-10 times faster than some previous methods while incurring an error rate of less than 3% on protein and long read datasets, despite large gaps and low sequence identities.
Our algorithm is implemented for global, local, and X-drop alignments. It is available as a Rust library (with C bindings) at https://github.com/Daniel-Liu-c0deb0t/block-aligner.
有效地对齐序列是生物信息学中的一个基本问题。许多最近的通过 Smith-Waterman-Gotoh 动态规划 (DP) 计算比对的算法利用现代 CPU 上的单指令多数据 (SIMD) 操作来提高速度。然而,这些进展在很大程度上忽略了有效处理复杂评分矩阵或大间隙(插入或缺失)的困难。
我们提出了一种新的 SIMD 加速算法,称为块对齐器,用于对齐核苷酸和蛋白质序列与其他序列或位置特定评分矩阵。我们引入了一种新的范例,该范例使用 DP 矩阵中的块进行贪婪地移动、增长和收缩。这种方法允许自适应地计算 DP 矩阵的区域。我们的算法的速度比一些以前的方法快 5-10 倍,而在蛋白质和长读数据集上的错误率不到 3%,尽管存在大间隙和低序列同一性。
我们的算法实现了全局、局部和 X 下降对齐。它作为一个 Rust 库(带有 C 绑定)在 https://github.com/Daniel-Liu-c0deb0t/block-aligner 上可用。