Zhang Jing, Misra Sanchit, Wang Hao, Feng Wu-Chun
Department of Computer Science, Virginia Tech, 225 Stanger Street, Blacksburg, 24060, VA, USA.
Parallel Computing Lab, Intel Corporation, Bengaluru, Karnataka, 560102, India.
BMC Bioinformatics. 2016 Nov 4;17(1):443. doi: 10.1186/s12859-016-1302-4.
The Basic Local Alignment Search Tool (BLAST) is a fundamental program in the life sciences that searches databases for the sequences most similar to a query sequence. Currently, the BLAST algorithm uses a query-indexed approach. Although many approaches suggest that sequence search with a database index can achieve much higher throughput (e.g., BLAT, SSAHA, and CAFE), they either cannot deliver the same level of sensitivity as query-indexed BLAST, i.e., NCBI BLAST, or support only nucleotide sequence search, e.g., MegaBLAST. Because query indexing and database indexing pose different challenges and have different characteristics, the existing techniques for query-indexed search cannot be applied directly to database-indexed search.
muBLASTP, a novel database-indexed BLAST for protein sequence search, returns hits identical to those of NCBI BLAST. On Intel Haswell multicore CPUs, for a single query, single-threaded muBLASTP achieves up to a 4.41-fold speedup for the alignment stages and up to a 1.75-fold end-to-end speedup over single-threaded NCBI BLAST. For a batch of queries, multithreaded muBLASTP achieves up to a 5.7-fold speedup for the alignment stages and up to a 4.56-fold end-to-end speedup over multithreaded NCBI BLAST.
With a newly designed index structure for the protein database and associated optimizations in the BLASTP algorithm, we refactored the BLASTP algorithm for modern multicore processors, achieving much higher throughput with an acceptable memory footprint for the database index.
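The distinction between query indexing and database indexing can be illustrated with a toy sketch. This is not the muBLASTP index structure or the NCBI BLAST implementation: it matches exact words only (real BLASTP also considers neighboring words above a score threshold), and the function names, the word length `W`, and the example sequences are all hypothetical choices made for illustration.

```python
from collections import defaultdict

W = 3  # illustrative word length (an assumption, not taken from the paper)

def build_index(seqs, w=W):
    """Map each length-w word to the (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for sid, seq in enumerate(seqs):
        for off in range(len(seq) - w + 1):
            index[seq[off:off + w]].append((sid, off))
    return index

def hits_query_indexed(query, database, w=W):
    """Index the query, then scan every database sequence against that index."""
    qindex = build_index([query], w)
    hits = []
    for sid, subject in enumerate(database):
        for s_off in range(len(subject) - w + 1):
            for _, q_off in qindex.get(subject[s_off:s_off + w], ()):
                hits.append((sid, q_off, s_off))
    return sorted(hits)

def hits_database_indexed(query, db_index, w=W):
    """Scan the query once against an index built ahead of time over the database."""
    hits = []
    for q_off in range(len(query) - w + 1):
        for sid, s_off in db_index.get(query[q_off:q_off + w], ()):
            hits.append((sid, q_off, s_off))
    return sorted(hits)

# Hypothetical toy data: three short protein-like strings and one query.
database = ["MKVLAAGIV", "GIVKVLMKA", "AAAAA"]
query = "KVLAAG"

db_index = build_index(database)  # built once, reusable across many queries
query_indexed_hits = hits_query_indexed(query, database)
database_indexed_hits = hits_database_indexed(query, db_index)
```

Both functions find the same exact-word hits as (sequence id, query offset, subject offset) triples. The query-indexed version walks the entire database for every query, whereas the database-indexed version walks only the query and pays for the index in memory, which is the trade-off the abstract refers to when it mentions the memory footprint of the database index.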