Pham Minh, Tu Yicheng, Lv Xiaoyi
University of South Florida, Tampa, FL, USA.
Xinjiang University, Ürümqi, China.
ICS. 2023 Jun;2023:155-166. doi: 10.1145/3577193.3593703. Epub 2023 Jun 21.
Advances in next-generation sequencing (NGS) have significantly reduced the cost of generating DNA sequence data and increased the speed of data production. This high-throughput data production has, in turn, increased the need for efficient data analysis programs. One of the most computationally demanding steps in analyzing sequencing data is mapping the short reads produced by NGS to a reference DNA sequence, such as a human genome. The mapping program BWA-MEM and its newer version BWA-MEM2, optimized for CPUs, are among the most popular choices for this task. In this study, we discuss the implementation of BWA-MEM on GPUs. This is a challenging task because many algorithms and data structures in BWA-MEM do not execute efficiently on the GPU architecture. This paper identifies the major challenges in developing efficient GPU code for all major stages of the BWA-MEM program, including seeding, seed chaining, Smith-Waterman alignment, memory management, and I/O handling. We conduct comparison experiments against BWA-MEM and BWA-MEM2 running on a 64-thread CPU. The results show that our implementation achieves up to 3.2x speedup over BWA-MEM2 and up to 5.8x over BWA-MEM on an NVIDIA A40. On an NVIDIA A6000 and an NVIDIA A100, it achieves wall-time speedups of up to 3.4x and 3.8x over BWA-MEM2 and up to 6.1x and 6.8x over BWA-MEM, respectively. In a stage-wise comparison, the A40, A6000, and A100 GPUs achieve up to 3.7x/3.8x/4x, 2x/2.3x/2.5x, and 3.1x/5x/7.9x speedup, respectively, on the three major stages of BWA-MEM: seeding and seed chaining, Smith-Waterman alignment, and SAM output generation. To the best of our knowledge, this is the first study that attempts to implement the entire BWA-MEM program on GPUs.
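For readers unfamiliar with the Smith-Waterman stage named in the abstract, the following is a minimal CPU reference sketch of local alignment scoring. It is not the authors' GPU kernel. The function name `smith_waterman_score` and the scoring values are illustrative assumptions, and it uses a linear gap penalty as a simplification. Each cell depends on its left, upper, and diagonal neighbours, and this data dependency is one reason the recurrence does not map trivially onto massively parallel hardware.

```python
def smith_waterman_score(read, ref, match=1, mismatch=-4, gap=-6):
    """Return the best local alignment score between read and ref.

    Illustrative assumptions: linear gap penalty and hypothetical
    scoring values; this is a plain CPU reference, not a GPU kernel.
    """
    prev = [0] * (len(ref) + 1)  # previous row of the DP matrix
    best = 0
    for i in range(1, len(read) + 1):
        curr = [0] * (len(ref) + 1)
        for j in range(1, len(ref) + 1):
            sub = match if read[i - 1] == ref[j - 1] else mismatch
            curr[j] = max(
                0,                  # start a new local alignment
                prev[j - 1] + sub,  # diagonal: match or mismatch
                prev[j] + gap,      # upper cell: gap in the reference
                curr[j - 1] + gap,  # left cell: gap in the read
            )
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    # A read that occurs exactly inside the reference scores one point per base.
    print(smith_waterman_score("ACGT", "TTACGTTT"))
```

Only two rows of the matrix are kept because the score alone is needed; recovering the alignment itself would require storing the full matrix or traceback pointers.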