Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China.
Genomics Proteomics Bioinformatics. 2024 Jul 3;22(2). doi: 10.1093/gpbjnl/qzae025.
Increasing the accuracy of the nucleotide sequence alignment is an essential issue in genomics research. Although classic dynamic programming (DP) algorithms (e.g., Smith-Waterman and Needleman-Wunsch) guarantee to produce the optimal result, their time complexity hinders the application of large-scale sequence alignment. Many optimization efforts that aim to accelerate the alignment process generally come from three perspectives: redesigning data structures [e.g., diagonal or striped Single Instruction Multiple Data (SIMD) implementations], increasing the number of parallelisms in SIMD operations (e.g., difference recurrence relation), or reducing search space (e.g., banded DP). However, no methods combine all these three aspects to build an ultra-fast algorithm. In this study, we developed a Banded Striped Aligner (BSAlign) library that delivers accurate alignment results at an ultra-fast speed by knitting a series of novel methods together to take advantage of all of the aforementioned three perspectives with highlights such as active F-loop in striped vectorization and striped move in banded DP. We applied our new acceleration design on both regular and edit distance pairwise alignment. BSAlign achieved 2-fold speed-up than other SIMD-based implementations for regular pairwise alignment, and 1.5-fold to 4-fold speed-up in edit distance-based implementations for long reads. BSAlign is implemented in C programing language and is available at https://github.com/ruanjue/bsalign.
提高核苷酸序列比对的准确性是基因组学研究中的一个重要问题。虽然经典的动态规划(DP)算法(如 Smith-Waterman 和 Needleman-Wunsch)保证能得到最优结果,但它们的时间复杂度限制了大规模序列比对的应用。许多旨在加速比对过程的优化努力通常来自三个方面:重新设计数据结构[例如,对角线或带状单指令多数据(SIMD)实现],增加 SIMD 操作中的并行度(例如,差分递归关系),或减少搜索空间(例如,带状 DP)。然而,没有方法结合这三个方面来构建超快速算法。在这项研究中,我们开发了一个带状条纹对齐器(BSAlign)库,通过将一系列新方法编织在一起,利用上述三个方面的优势,实现了超快速、准确的对齐,其亮点包括条纹向量中的活动 F 环和带状 DP 中的条纹移动。我们将新的加速设计应用于常规和编辑距离比对。对于常规的两两比对,BSAlign 比其他基于 SIMD 的实现快 2 倍,而对于长读长的编辑距离比对,BSAlign 比其他实现快 1.5 到 4 倍。BSAlign 是用 C 编程语言实现的,可以在 https://github.com/ruanjue/bsalign 上获得。