cuBLASTP：蛋白质序列搜索在CPU+GPU上的细粒度并行化

cuBLASTP: Fine-Grained Parallelization of Protein Sequence Search on CPU+GPU.

作者信息

Zhang Jing, Wang Hao, Feng Wu-Chun

出版信息

IEEE/ACM Trans Comput Biol Bioinform. 2017 Jul-Aug;14(4):830-843. doi: 10.1109/TCBB.2015.2489662. Epub 2015 Oct 12.

DOI:10.1109/TCBB.2015.2489662

Abstract

BLAST, short for Basic Local Alignment Search Tool, is a ubiquitous tool used in the life sciences for pairwise sequence search. However, with the advent of next-generation sequencing (NGS), whether at the outset or downstream from NGS, the exponential growth of sequence databases is outstripping our ability to analyze the data. While recent studies have utilized the graphics processing unit (GPU) to speedup the BLAST algorithm for searching protein sequences (i.e., BLASTP), these studies use coarse-grained parallelism, where one sequence alignment is mapped to only one thread. Such an approach does not efficiently utilize the capabilities of a GPU, particularly due to the irregularity of BLASTP in both execution paths and memory-access patterns. To address the above shortcomings, we present a fine-grained approach to parallelize BLASTP, where each individual phase of sequence search is mapped to many threads on a GPU. This approach, which we refer to as cuBLASTP, reorders data-access patterns and reduces divergent branches of the most time-consuming phases (i.e., hit detection and ungapped extension). In addition, cuBLASTP optimizes the remaining phases (i.e., gapped extension and alignment with trace back) on a multicore CPU and overlaps their execution with the phases running on the GPU.

摘要

BLAST是基本局部比对搜索工具（Basic Local Alignment Search Tool）的缩写，是生命科学中用于成对序列搜索的常用工具。然而，随着下一代测序（NGS）的出现，无论是在NGS的初始阶段还是下游阶段，序列数据库的指数级增长都超过了我们分析数据的能力。虽然最近的研究利用图形处理单元（GPU）来加速用于搜索蛋白质序列的BLAST算法（即BLASTP），但这些研究使用的是粗粒度并行，其中一个序列比对仅映射到一个线程。这种方法没有有效利用GPU的能力，特别是由于BLASTP在执行路径和内存访问模式上的不规则性。为了解决上述缺点，我们提出了一种细粒度方法来并行化BLASTP，其中序列搜索的每个单独阶段都映射到GPU上的多个线程。这种方法我们称为cuBLASTP，它重新排序数据访问模式并减少最耗时阶段（即命中检测和无间隙扩展）的发散分支。此外，cuBLASTP在多核CPU上优化其余阶段（即有间隙扩展和带回溯比对），并使其执行与在GPU上运行的阶段重叠。