Xia Zeyu, Yang Canqun, Peng Chenchen, Guo Yifei, Guo Yufei, Tang Tao, Cui Yingbo
College of Computer Science and Technology, National University of Defense Technology, 410073, Changsha, China.
National Supercomputer Center in Tianjin, 300457, Tianjin, China.
BMC Bioinformatics. 2025 May 2;26(1):118. doi: 10.1186/s12859-025-06129-w.
The advent of Single Molecule Real-Time (SMRT) sequencing has overcome many limitations of second-generation sequencing, such as limited read lengths and PCR amplification biases. However, longer reads increase data volume exponentially, and high error rates make many existing alignment tools inapplicable. Additionally, the performance bottleneck of a single CPU restricts the effectiveness of alignment algorithms for SMRT sequencing.
To address these challenges, we introduce ParaHAT, a parallel alignment algorithm for noisy long reads. ParaHAT exploits vector-level, thread-level, process-level, and heterogeneous parallelism. We redesign the layout of the dynamic programming matrices to eliminate data dependencies in the base-level alignment, enabling effective vectorization. We further increase computational speed through heterogeneous parallel computing and implement the algorithm for multi-node execution using MPI, overcoming the computational limits of a single node.
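The abstract does not specify which matrix layout ParaHAT uses. As a minimal sketch of the general idea, assuming a plain edit-distance recurrence rather than ParaHAT's actual scoring scheme, the following code walks the dynamic programming matrix by anti-diagonals. Every cell on one anti-diagonal depends only on the two preceding anti-diagonals, so the inner loop carries no dependency between its iterations and could in principle be mapped to vector lanes. The function names are hypothetical.

```python
def edit_distance_rowwise(a, b):
    """Reference row-by-row DP: cur[j] depends on cur[j-1], so the inner loop is serial."""
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
        prev = cur
    return prev[len(b)]


def edit_distance_antidiagonal(a, b):
    """Same recurrence, traversed by anti-diagonals d = i + j.

    Cells on diagonal d read only diagonals d-1 and d-2, never each other,
    so the loop over i has independent iterations (illustrative assumption:
    this is one common layout, not necessarily the one used by ParaHAT).
    """
    n, m = len(a), len(b)
    prev2 = {}  # diagonal d-2, keyed by row index i
    prev1 = {}  # diagonal d-1, keyed by row index i
    for d in range(n + m + 1):
        cur = {}
        for i in range(max(0, d - m), min(n, d) + 1):
            j = d - i
            if i == 0:
                cur[i] = j          # first row: j insertions
            elif j == 0:
                cur[i] = i          # first column: i deletions
            else:
                cost = 0 if a[i - 1] == b[j - 1] else 1
                cur[i] = min(
                    prev1[i - 1] + 1,     # cell (i-1, j) on diagonal d-1
                    prev1[i] + 1,         # cell (i, j-1) on diagonal d-1
                    prev2[i - 1] + cost,  # cell (i-1, j-1) on diagonal d-2
                )
        prev2, prev1 = prev1, cur
    return prev1[n]
```

Both functions compute the same distance; only the traversal order differs, which is what makes the second form amenable to vectorization.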
Performance evaluations show that ParaHAT achieves a 10.03x speedup in base-level alignment, and on 128 nodes it reaches a parallel speedup of 94.61 and a weak-scaling efficiency of 98.98%.
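The abstract does not define these metrics. Under the usual conventions, assumed here, parallel efficiency is the speedup divided by the number of nodes, and weak-scaling efficiency is the single-node runtime divided by the N-node runtime when the workload per node is held constant. The helper names below are hypothetical, and the single-node baseline is an assumption.

```python
def parallel_efficiency(speedup, nodes):
    """Fraction of ideal linear speedup achieved (assumes a single-node baseline)."""
    return speedup / nodes


def weak_scaling_efficiency(t_one_node, t_n_nodes):
    """Ratio of runtimes when the problem size grows in proportion to the node count."""
    return t_one_node / t_n_nodes


# Reported speedup of 94.61 on 128 nodes, expressed as an efficiency.
eff = parallel_efficiency(94.61, 128)
print(f"{eff:.1%}")  # prints 73.9%
```

Read this way, the reported speedup of 94.61 on 128 nodes corresponds to roughly 74% of ideal linear scaling.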